Questions tagged [pandas]

pandas is a Python library for data manipulation and analysis. It handles tabular data (DataFrames), multidimensional time series, and cross-sectional datasets commonly encountered in statistics, experimental science, econometrics, and finance. It is one of the main data science libraries in Python.

A guide on replacing values in a pandas dataframe using interpolation

My dataset, df, resembles this: print(df) x outlier_flag 10 1 NaN 1 30 1 543 -1 50 1 I want to replace values flagged with outlier_flag==-1 by interpolating between row['A'][i-1] and row['A'][i+1]. In other words, I need to correct the erron ...
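
A minimal sketch of one possible approach, assuming the column is named `x` as in the printed frame (the excerpt also refers to `'A'`) and that linear interpolation between the neighbouring rows is what is wanted. The name `x_fixed` is hypothetical. Flagged values are masked to NaN, interpolated, and written back only where the flag is -1, so the pre-existing NaN in the second row is left untouched.

```python
import numpy as np
import pandas as pd

# Data reconstructed from the printed frame in the excerpt
df = pd.DataFrame({"x": [10, np.nan, 30, 543, 50],
                   "outlier_flag": [1, 1, 1, -1, 1]})

is_outlier = df["outlier_flag"] == -1

# Hide the flagged values, then interpolate linearly across the gaps
interpolated = df["x"].mask(is_outlier).interpolate()

# Keep original values everywhere except the flagged rows
df["x_fixed"] = df["x"].where(~is_outlier, interpolated)
```

If the unflagged NaN should be filled as well, `interpolated` can be used directly.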

Exploring K Nearest Neighbors Algorithm for Big Data

In my quest to discover the nearest neighbors for a dataset A containing 25,000 rows, I have ventured into fitting dataset B into a KNN model consisting of 13 million rows. The ultimate objective is to identify 25,000 rows within dataset B that closely res ...

Finding the smallest value within the data from the past N days

In my dataset, I have the following information: ID Date X 123_Var 456_Var 789_Var A 16-07-19 3 777 250 810 A 17-07-19 9 637 121 529 A 20-07-19 2 295 272 490 A 21-07-19 3 778 600 ...

Decipher intricate JSON data

After converting a JSON object from a YAML file, I attempted to serialize it but encountered errors. {'info': {'city': 'Southampton', 'dates': [datetime.date(2005, 6, 13)], 'gender': 'male', ...

Decrease the index's level

After running a pivot table, I have the result below indicating the customer grades that visited my stores. Using the 'droplevel' method, I managed to flatten the column header into one layer. Now, I am looking to do the same for the index - removing 'Grad ...

Unexpectedly large dataset for the Test and Training Sets

Currently, I am in the process of developing a predictive model using linear regression on a dataset containing 157673 records. The data is stored in a CSV file and follows this format: Timestamp,Signal_1,Signal_2,Signal_3,Signal_4,Signal_5 2021-04-13 ...

Find similarities between two string columns in a Python pandas dataframe and store the common strings in a new column

I am working with two pandas dataframes, df1 and df2: df1: df2: item_name item_cleaned abc xyz Def xuy DEF Ghi s GHI lsoe Abc p ABc ois To solve my problem, I need to create a function that can compare the valu ...

Eliminate repeated datetime index values by including small increments of timedelta

Here is the provided data: n = 8 np.random.seed(42) df = pd.DataFrame(index=[dt.datetime(2020,3,31,9,25) + dt.timedelta(seconds=x) for x in np.random.randint(0,10000,size=n).tolist()], data=np.random.randint(0,10 ...

Developing a data frame using a list in Pandas

Struggling to convert the data I have into a Pandas DataFrame. It should be an easy task but I can't seem to crack it. I have the headers and the web data, but transforming it into a list for the DataFrame function is where I'm stuck. from selenium impor ...

Setting values for a specific group of rows within a Pandas dataframe

Looking to apply conditions based on index values in a Pandas DataFrame. class test(): def __init__(self): self.l = 1396633637830123000 self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], inde ...

Tips for combining cells partially in a vertical direction within the pandas library

Here is the dataframe I am working with: index Flag Data 0 1 aaaa 1 0 bbbb 2 0 cccc 3 0 dddd 4 1 eeee 5 0 ffff 6 1 gggg 7 1 hhhh 8 1 iiii I want to obtain a merged vertical data where it's divided by Flag 1. index Flag Dat ...

Converting boolean values to string format with Pandas

I have an Excel spreadsheet with a column formatted as "General" containing values like 10cm, 0, 1, TRUE, and FALSE. I am using Pandas to parse the data into a dataframe and then save it in an SQLite table. However, I want the values in this column to reta ...

pandas locate_similar_values function returns no results

Currently, I am working on a project that involves dealing with 2 CSV files - CSV1.csv Short Description Category Device is DOWN! Server Down ...

Transforming individual row values into columns and utilizing them for reference data retrieval

How can I convert row values into columns and use them to search for values in a pandas dataframe? I've tried using iterrows and .loc indexing, but without success import pandas as pd import sys if sys.version_info[0] < 3: from StringIO imp ...

Encoding a list of categories as strings for creating Pandas dummies

I am working with a dataframe structured like this: id amenities ... 1 "TV,Internet,Shower,..." ... 2 "TV,Hot tub,Internet,..." ... 3 "Internet,Heating,Shower..." ... ... My goal is to split the string by comm ...
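
A hedged sketch of one way to get indicator columns from a comma-separated string column, using `Series.str.get_dummies`. The sample rows are shortened stand-ins for the truncated data in the excerpt.

```python
import pandas as pd

# Shortened, illustrative rows; the real strings are truncated in the excerpt
df = pd.DataFrame({
    "id": [1, 2, 3],
    "amenities": ["TV,Internet,Shower",
                  "TV,Hot tub,Internet",
                  "Internet,Heating,Shower"],
})

# Split each string on commas and build one 0/1 column per amenity
dummies = df["amenities"].str.get_dummies(sep=",")
result = df[["id"]].join(dummies)
```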

Struggling to find the average of values across multiple rows sharing a common identifier, without using a column or slicing method

I am working with a dataframe that contains orders from a restaurant. Each row in the dataframe represents a product within an order, including the order ID and price of the item. I would like to calculate the average price of all orders, but each order ma ...
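
One possible reading of this question, sketched under assumptions: the column names `order_id` and `price` and the sample values are hypothetical, and "average price of all orders" is taken to mean the mean of the per-order totals.

```python
import pandas as pd

# Hypothetical data: one row per product within an order
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "price": [5.0, 7.0, 10.0, 2.0, 3.0, 4.0],
})

# Total per order first, then the average of those totals
order_totals = orders.groupby("order_id")["price"].sum()
average_order_price = order_totals.mean()
```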

Encountering a TypeError while attempting to utilize Django and Pandas for displaying data in an HTML format

import pandas as pd from django.shortcuts import render # Define the home view function def home(): # Read data from a CSV file and select only the 'name' column data = pd.read_csv("pandadjangoproject/nmdata.csv", nrows=11) only_city = data[[' ...

Filtering pandas dataframe to only show rows from certain months

I am dealing with a pandas dataframe that includes a date column spanning from 2015 to 2021. print(data) date time wind_speed wind_direction 0 2015-01-01 00:00 00:00 28.0 25.0 1 2015-01-01 01:00 01:00 ...
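
A minimal sketch of month-based filtering, assuming the `date` column holds (or can be converted to) datetimes. The sample rows and the choice of months in `wanted_months` are hypothetical.

```python
import pandas as pd

# Illustrative rows; only the column names come from the excerpt
data = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01 00:00", "2015-06-15 12:00",
                            "2016-07-01 01:00", "2021-12-31 23:00"]),
    "wind_speed": [28.0, 12.5, 9.0, 30.0],
})

wanted_months = [6, 7]  # hypothetical: keep June and July of every year
subset = data[data["date"].dt.month.isin(wanted_months)]
```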

Utilizing Pandas to Transform Unique Row Values into Columns, Similar to a Pivot Table

dataset: top 50 bestselling novels each year Genre options: fiction , nonfiction (only two unique categories) Could someone guide me on how to summarize the data to create a table showing the author's name, and the count of fiction and nonfiction books th ...

Populating a DataFrame cell with a list based on two conditions for removing elements within the list

In my data frame, I have two columns. The first column contains a list of numbers in each cell, while the second column contains a list of letters in each cell. Now, I am looking to create two additional columns based on certain conditions: If a value in ...

I'm having trouble understanding the Python pipeline syntax. Can anyone provide an explanation?

I'm having some trouble understanding the function of each step in this particular pipeline. Could someone provide a detailed explanation of how this pipeline is functioning? I have a general idea, but more clarity would be greatly appreciated. Wha ...

The Sklearn KNN Imputer has gaps in its data

After attempting to fill NaN values in a column using the KNN imputer from scikit-learn, I noticed that some of the NaNs were still present in the imputed column. What could be causing this issue? I have already compared the count of NaNs before and after the ...

What are some effective ways to filter out specific string patterns while using Pandas?

dataframe df.columns=['ipo_date','l2y_gg_date','l1k_kk_date'] Purpose extract dataframe with columns titled _date excluding ipo_date. Solution df.filter(regex='_date&^ipo_date') ...

Is there a way to monitor the number of rows being inserted into a table while the insertion process is still in progress?

I am facing a challenge with a dataframe containing 4 million rows and 53 columns. My goal is to write this dataframe to an oracle table using Python. Below is a snippet of the code I have been working on: import pandas as pd import cx_Oracle conn = (---- ...

Is there a way to extract a specific substring from a pandas dataframe using a provided list for filtering?

While I know this question has been asked before, I'm struggling with list comprehensions and my code has a small twist to it. In my dataframe, I have keywords that I want to filter based on whether they contain any of the keywords from a specific list. ...

What is the best way to sum values from a specific column only if there is a matching string in another

Looking to sum numbers from a specific column only when a certain criterion is met in another column, such as adding integers in col2 when col1 is 'A'. import pandas as pd d = {'col1': ['A', 'B', 'A', ' ...
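
A short sketch of a conditional sum with boolean indexing. The dictionary in the excerpt is truncated, so the values below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical values; the original dictionary is cut off in the excerpt
d = {"col1": ["A", "B", "A", "C"], "col2": [1, 2, 3, 4]}
df = pd.DataFrame(d)

# Select col2 only on rows where col1 equals 'A', then sum
total_a = df.loc[df["col1"] == "A", "col2"].sum()
```

For totals of every category at once, `df.groupby("col1")["col2"].sum()` gives one sum per value of `col1`.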

Choose a particular tier within the MultiIndex

The code snippet below demonstrates how to extract a specific list from multiple levels: idx1 = sys_bal.index idx2 = user_bal.index idx3 = idx1.intersection(idx2) The following output is generated by the above code: MultiIndex(levels=[[3, 29193], [' ...

Error thrown by pandas read_csv function: ValueError

I am trying to read text data that is separated by commas and tabs using the following code: io_df = pd.read_csv('input_output.txt', sep='D| ', engine='python') However, this code is throwing an error as shown below: ---------------------------------- ...

The pandas DataFrame is incrementing one column based on the value in another column and keeping track of the counts

Looking at the dataframe provided https://i.stack.imgur.com/1bAf8.png We are tasked with counting the occurrences of 'end' in the reason column that follow certain keywords present in another column. For example, the given keywords are 'hire' and 'career ...

Changing grouped information by converting categories of groupings into fields (utilizing GraphLab or a pandas DataFrame)

Below are the records organized by user_id and action columns: user_id | action | count 1 | read | 15 1 | write | 5 1 | delete | 7 2 | write | 2 3 | read | 9 3 | write | 1 3 | delete | 2 I am looking to tr ...
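
The desired layout is cut off in the excerpt. Assuming the goal is one row per `user_id` with one column per `action`, a pandas sketch (not GraphLab) could look like this, with missing combinations filled with 0 as an assumption.

```python
import pandas as pd

# Records as listed in the excerpt
records = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3, 3],
    "action": ["read", "write", "delete", "write", "read", "write", "delete"],
    "count": [15, 5, 7, 2, 9, 1, 2],
})

# One row per user, one column per action; absent pairs become 0
wide = (records.pivot(index="user_id", columns="action", values="count")
               .fillna(0)
               .astype(int))
```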

Generating a stratified K-Fold split for training, testing, and validation datasets

I am attempting to utilize StratifiedKFold in order to create train/test/val splits for a non-sklearn machine learning workflow. The goal is to split the DataFrame and maintain that division. My approach involves using .values as I am working with pandas ...

Encountered an issue with tallying the frequency of values in a dataFrame using specific columns for grouping

I have a pandas dataframe that contains columns id, colA, colB, and colC: id colA colB colC 194 1 0 1 194 1 1 0 194 2 1 3 195 1 1 2 195 0 1 0 197 1 1 2 The task is to calculate the occurre ...

Generating complex JSON structures from CSV files containing incomplete rows

I have merged various excel/csv files using pandas to create a database. While I found some examples on creating nested Jsons from csvs, they didn't fully meet my needs. My data is structured in a stepwise manner as shown here. Each subject has multiple v ...

Other options instead of employing an iterator for naming variables

I am relatively new to Python, transitioning from a background in Stata, and am encountering some challenges with fundamental Python concepts. Currently, I am developing a small program that utilizes the US Census Bureau API to geocode addresses. My initia ...

Execute a function on every pair of rows from a dataframe and columns from another dataframe

Imagine a scenario where you need to multiply a row and column vector to create a matrix, and then aggregate the rows of that resulting matrix. In this case, each element in the row vector consists of two values A and B, while each element in the column v ...

Choosing which columns to copy from a Pandas DataFrame

I want to duplicate my current df to another pandas dataframe. If I specify columns to copy, I can do so like this: df_copy = df[['col_A', 'col_B', 'col_C']].copy() Is there a way to copy all columns except for the ones specified using this method? I att ...
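
One way to copy everything except a few named columns is to drop them first. This is a sketch with a hypothetical four-column frame; `col_D` is an invented name.

```python
import pandas as pd

# Hypothetical frame; col_D is an invented extra column
df = pd.DataFrame({"col_A": [1, 2], "col_B": [3, 4],
                   "col_C": [5, 6], "col_D": [7, 8]})

# drop() returns a new frame without the listed columns; df is unchanged
df_copy = df.drop(columns=["col_A", "col_B"]).copy()
```

An equivalent selection is `df.loc[:, ~df.columns.isin(["col_A", "col_B"])].copy()`.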

Generate summary columns using a provided list of headers

In my dataset, there is survey data along with columns containing demographic information such as age and department, as well as ratings. I want to enhance the dataset by adding new columns based on calculations from the existing rating columns. The goal ...

PANDAS: Transforming arrays into individual numbers in a list

My list is printing out as [array([7], dtype=int64), array([11], dtype=int64), array([15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67], dtype=int64)] However, I want it to look like [7, 11, 15, 19, 23, ...] The issue arises when using pandas ...
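
A minimal sketch of flattening such a list, assuming every element is a one-dimensional NumPy array as in the printed output.

```python
import numpy as np

# The list from the excerpt, rebuilt by hand
arrays = [np.array([7]), np.array([11]),
          np.array([15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67])]

# Join the arrays end to end, then convert to plain Python ints
flat = np.concatenate(arrays).tolist()
```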

Issue encountered: Jupyter Notebook Import Error - unable to import attribute 'np_version_under1p17' from 'pandas.compat.numpy'

import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import matplotlib.dates as md import datetime as dt import time from zipfile import ZipFile from matplotlib.pyplot import xticks %matplotlib inline ------------ ...

Transforming JSON data into a pandas DataFrame using Python with examples from yahoo_financials

Can someone assist me with this JSON format: (updated dataframe) JSON: {'PSG.MC': [{'date': 1547452800,'formatted_date': '2019-01-14', 'amount': 0.032025}, {'date': 1554361200, 'formatted_date': '2019-04-04', 'amount': 0.032025}, {'date': 1562310000, 'f ...

Traversing through different levels of a DataFrame to apply filters

I am working with a DataFrame that contains the following data in a csv format: NAME,VENUE_CITY_NAME,EVENT_LANGUAGE,EVENT_GENRE satya,Pune,Hindi,|COMEDY|DRAMA| Amit,National Capital Region,English,|ACTION|ADVENTURE|SCI-FI| satya,Mumbai,Hindi,|COMEDY|DRAMA ...

Calculate the mean of a subset of rows within various pandas columns

Looking to smooth out geographical data using a dataset by finding the nearest neighbors within a specific radius for each row, then calculating the mean and adding it as a new column. The code snippet below achieves this: import pandas as pd import numpy ...

Checking the authenticity of user access records using pandas data structures

Even though I've been utilizing pandas for various projects over the past year, I still don't consider myself very proficient in it. I'm starting to feel like I'm missing some basic terminology as my searches haven't yielded much h ...

The dtype attribute of a Pandas DataFrame

Whenever you access the dtypes attribute on a pandas data frame, the final line of the result typically displays dtype: object. For instance: In [1]: import pandas as pd In [2]: df = pd.DataFrame({'numbers':100,'floats': 5.75,'name':'Jill'},index=['a']) In [3]: ...
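
A short illustration of where that trailing line comes from: `df.dtypes` is itself a Series whose values are dtype objects, so the Series reports its own dtype as `object`. The frame is the one from the excerpt; `column_types` is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame({"numbers": 100, "floats": 5.75, "name": "Jill"}, index=["a"])

# A Series indexed by column name, holding one dtype object per column
column_types = df.dtypes

# The final "dtype: object" line describes this Series, not any column
series_own_dtype = column_types.dtype
```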

What is the best method to assign np.nan values to a series based on multiple conditions?

If I have a data set like this: A B C D E F 0 x R i R nan h 1 z g j x a nan 2 z h nan y nan nan 3 x g nan nan nan nan 4 x x h x s f I am looking to update specific cells in the data by following th ...

Converting JSON data to a pandas DataFrame requires the list indices to be integers

I have a large JSON dataset that I want to convert to CSV for analysis purposes. However, when using json_normalize to build the table, I encounter the following error: Traceback (most recent call last): File "/Users/Home/Downloads/JSONtoCSV/easybill.py" ...

Pandas read_csv function encounters a MemoryError issue

I am dealing with a 5GB file named 1.csv and using a pandas script to remove duplicates. However, every time I try to run the script, I encounter a memory error. I attempted chunking the large file, but this approach only allows me to process parts of the ...

Utilize df columns for interactive interaction and apply a filtering operation using a string matching statement

Picture a dataset with several columns, all beginning with carr carr carrer banana One Two Three Is there a way to filter out only the column names that start with "carr"? Desired outcome carr carrer One Two Example dataframe: import pan ...
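
A minimal sketch of selecting columns by prefix, using the column names and values shown in the excerpt.

```python
import pandas as pd

df = pd.DataFrame({"carr": ["One"], "carrer": ["Two"], "banana": ["Three"]})

# Boolean mask over the column labels, applied on the column axis
carr_only = df.loc[:, df.columns.str.startswith("carr")]
```

`df.filter(regex="^carr")` selects the same columns with a regular expression.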

Tips on deleting rows in a pandas dataframe based on the value in the first column

I need to filter out rows from a pandas df. My goal is to retain everything the first string in Col A ends with. In the provided df, the string concludes with Mon, so I aim to remove any rows that do not end with this value. import pandas as pd df = pd.D ...

Extracting integer values from strings within a DataFrame column containing colons

I have a DataFrame with the following data: C1 C2 A 2:3:1:7 B 2:1:4:3 C 2:1:1:1 My task is to sort the integers in column C2, while keeping the colons intact. The desired output should be as follows: C1 C2 A 1:2:3:7 B 1:2:3:4 C 1:1: ...
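
One possible sketch: split each string on the colon, sort the pieces numerically, and join them back. The helper name `sort_colon_separated` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"C1": ["A", "B", "C"],
                   "C2": ["2:3:1:7", "2:1:4:3", "2:1:1:1"]})

def sort_colon_separated(value):
    """Sort the integers in a colon-separated string, keeping the colons."""
    return ":".join(sorted(value.split(":"), key=int))

df["C2"] = df["C2"].apply(sort_colon_separated)
```

Sorting with `key=int` matters once values have more than one digit, since plain string sorting would place "10" before "2".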

Python code to transform a dictionary into binary format

I have a unique system where customer IDs are linked with movie IDs in a dictionary. Even if the customer watches the same movie multiple times, I want to simplify it as a single entry. In order to achieve this, I need to convert my dictionary data into bi ...

What's the best way to combine these date entries into monthly groups?

I am currently working with multiple CSV files containing dataframes for COVID cases. An example of the data looks like this: Region active Date 2020-03-20 Tabuk 1 2020-03-21 Tabuk 1 2020-03-22 Tabuk 1 2020-03-23 Tabuk 1 2020-03-24 ...

Using Python Pandas: Utilizing .apply() to pass multiple arguments or column values to a custom function

https://i.stack.imgur.com/n4O7x.png I'm attempting to create a custom function that takes three inputs and checks if the length of the full name (combination of first and last) is greater than the length of the occupation. I've tried two differe ...

Generate additional smaller DataFrames by using a groupby function on the original DataFrame

I've got a dataset structured like this: Index Amount Currency 01.01.2018 25.0 EUR 01.01.2018 43.5 GBP 01.01.2018 463.0 PLN 02.01.2018 32.0 EUR 02.01.2018 12.5 GBP 02.01.2018 123.0 PLN 03.01.2018 10 ...

Struggling to open a csv file using pandas

Embarking on my journey into the world of data mining, I am faced with the task of calculating the correlation between 16 variables within a dataset consisting of around 500 rows. Utilizing pandas for this operation has proven to be challenging as I encoun ...

Python code to verify if a specific condition is satisfied within a certain time period

Seeking to determine if certain conditions are met over a period of time. The data is structured as follows: Datetime Valve1 Valve2 01/01/2020 11:00:01 1 0 The condition being evaluated is: (Valve1=1 for 1h) and (Valve2=0 for 1h) Utilizing rolling ...

Problem with a personalized query involving an agent utilizing LangChain and GPT-4

I've been working on a project that involves utilizing LangChain to develop an agent capable of answering questions based on pandas DataFrames. To achieve this, I'm leveraging a GPT-4 model. However, I've hit a roadblock while attempting to ...

Calculation of rolling median using pandas over a 3-month period

I'm currently working on calculating a rolling median for the past 3 months. This is what I have so far: df['ODPLYW'].rolling(min_periods=90, window=90).median() However, I specifically need the window to be exactly 3 months. The rolling function only acc ...

Converting a Pandas dataframe to JSON format: {name of column1: [corresponding values], name of column2: [corresponding values], name

I'm a beginner in Python and pandas. I've been attempting to convert a pandas dataframe into JSON with the following format: {column1 name : [Values], column2 name: [values], Column3 name... } I have tried using this code: df.to_json(orient='columns') Ho ...
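
A hedged sketch of one way to get the `{column: [values]}` shape: build a dict of lists with `DataFrame.to_dict(orient="list")` and serialize it with the standard `json` module. The frame below is hypothetical.

```python
import json

import pandas as pd

# Hypothetical frame with simple int and string columns
df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})

# orient="list" maps each column name to a plain list of its values
as_json = json.dumps(df.to_dict(orient="list"))
```

`df.to_json(orient="columns")` instead nests each column as `{index: value}`, which is why it does not produce this layout. Columns holding datetimes would need converting to strings before `json.dumps` can handle them.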

Exceeded server capacity while attempting to extract pricing information from a search bar

Thanks to some guidance from the Stack Overflow community, I successfully developed a scraper that retrieves a list of part numbers along with their respective prices. Here is an example format of the data retrieved: part1 price1 part2 price2 ... .. ...

Even after removing rows with a certain value, Pandas' `value_counts` function continues to display that dropped value with a count

Here is a data frame with name and species columns: name_col species_col 0 alice cat 1 bob cat 2 darwin dog 3 frank ferret We created a new dataframe excluding ferrets: In: df_minus_ferrets = df.drop ...
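
The excerpt does not state the column dtype. One common cause of this behaviour, shown here purely as an assumption, is a categorical column: removed values stay registered as categories and are counted as zero. The sketch filters with a boolean mask rather than the truncated `drop` call from the excerpt.

```python
import pandas as pd

# Assumption: species_col is categorical
df = pd.DataFrame({
    "name_col": ["alice", "bob", "darwin", "frank"],
    "species_col": pd.Categorical(["cat", "cat", "dog", "ferret"]),
})

df_minus_ferrets = df[df["species_col"] != "ferret"].copy()

# 'ferret' is still a known category, so it appears with a count of 0
counts_before = df_minus_ferrets["species_col"].value_counts()

# Forget categories that no longer occur in the data
df_minus_ferrets["species_col"] = (
    df_minus_ferrets["species_col"].cat.remove_unused_categories()
)
counts_after = df_minus_ferrets["species_col"].value_counts()
```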

Place the string into a dataframe

Upon loading a dataframe from a CSV file, I noticed some unexpected infinite values present in the data. Rather than altering the original CSV file, which serves as inputs to my program, I am seeking a way to address this issue within the loaded dataframe ...

Pandas does not have the capability to interpret Excel data as plain text

Currently, I am attempting to iterate through a series of xls files in a loop and combine them into one master dataframe. While each file contains the same columns, some have a column as a string while others have it as an integer. To avoid any potential ...

Pandas exhibits inconsistent behavior when rounding to the nearest hour

When using pandas to round a datetime to the nearest hour, there seems to be inconsistent rounding from the halfway point. Odd hours like 5 have their half-hour (5:30) rounded up, while even hours have their half-hour rounded down. For example, both 5:30 a ...
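
The behaviour described matches round-half-to-even, which pandas applies when a timestamp sits exactly halfway between two multiples of the frequency. A small sketch follows, with a hypothetical helper `round_half_up_to_hour` for cases where halves should always go up.

```python
import pandas as pd

odd_half = pd.Timestamp("2020-01-01 05:30")
even_half = pd.Timestamp("2020-01-01 06:30")

# Exact halves go to the even hour: both of these land on 06:00
rounded_odd = odd_half.round("h")
rounded_even = even_half.round("h")

def round_half_up_to_hour(ts):
    """Round to the nearest hour, sending exact half-hours upward."""
    return (ts + pd.Timedelta(minutes=30)).floor("h")
```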

Is it possible to change the structure of a Pandas DataFrame outside of its designated class?

Is it possible to enhance a class externally? I often find myself needing to insert a new column into my data frames and I am seeking a more streamlined syntax. All of my dataframes are prepared for this action. Essentially, what I want to do is: DF['Pe ...

Is it possible to utilize a yyyy-mm-w format when plotting datetime data?

I am currently working with a dataset that is recorded on a weekly basis and I want to create a matplotlib plot using this information. The date format I am using is yyyymmw, where w represents the week of the month (ranging from 1 to 5). Each week starts ...

Divide the rows of a pandas dataframe into separate columns

I have a CSV file that I needed to split on line breaks because of its file type. After splitting this data frame into two separate data frames, I am left with rows that are structured like the following: 27 Block "Column" "Row" " ...

Changing dictionary rows into individual columns in pandas dataframes

I am working with a dataframe that has two columns. One of these columns contains dictionaries with multiple keys and values. My goal is to expand these dictionary keys into separate columns using pandas. Is there a way to achieve this? In [1]:print df Ou ...
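
A minimal sketch of expanding a column of dictionaries into separate columns. The frame and the column names `id` and `attrs` are hypothetical, since the printed output is cut off in the excerpt.

```python
import pandas as pd

# Hypothetical frame: 'attrs' holds one dictionary per row
df = pd.DataFrame({
    "id": [1, 2],
    "attrs": [{"a": 1, "b": 2}, {"a": 3, "c": 4}],
})

# Build one column per dictionary key, aligned to the original index
expanded = pd.DataFrame(df["attrs"].tolist(), index=df.index)

# Replace the dictionary column with the expanded columns
result = pd.concat([df.drop(columns="attrs"), expanded], axis=1)
```

Keys missing from a given row come out as NaN. For nested dictionaries, `pd.json_normalize` flattens deeper levels as well.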

Transform JSON data into a table format using Python's nested import mechanism

Attempting to import a JSON file, specifically from a FB profile export, which contains multiple nested levels. In Excel, I can easily create a query that transforms all the data into a table within a minute by expanding the nested levels into new columns ...

Retrieve characteristics from the initial dataset utilized in establishing a TensorFlow dataset

Consider the dataframe df: revenue 2016-11-05 -0.352631 2016-11-06 -0.438142 2016-11-07 -0.470228 2016-11-08 -0.487672 2016-11-09 -0.491773 ... ... 2020-11-16 -0.624413 2020-11-17 -0.640770 2020-11-18 -0.606660 2020-11-19 -0.596660 20 ...

Is it possible to locate a specific column name within an Excel spreadsheet using Pandas?

When using pandas to read an excel sheet and gather data from it in order to create a new excel document, there is a challenge. The current code only works if the user selects a sheet with the exact column name specified. It is necessary to verify that the ...

Using the pandas library, you can save and manage numerous data sets within a single h5 file by utilizing the pd

If I have two different dataframes, import pandas as pd df1 = pd.DataFrame({'col1':[0,2,3,2],'col2':[1,0,0,1]}) df2 = pd.DataFrame({'col12':[0,1,2,1],'col22':[1,1,1,1]}) After successfully storing df1 with the comm ...

How can we group data by minute, hour, day, month, and year?

I've been trying to find a resolution for my current issue, but I seem to be stuck. I'm really hoping that you can assist me. The Issue: My goal is to determine the number of tweets per minute. Data Set: time sentiment 0 201 ...

Transform the information into a matrix data structure

I'm working with a dataframe structured like this: df = pd.DataFrame({'isin': ['a', 'a', 'c', 'd','c', 'e', 'd','f','s','d','c',' ...

Using Tweepy to pull tweets from Twitter

Upon successfully adding tweets to my csv file, I noticed that the tweets were truncated and had a new text in place where they were cut off. For example, an original tweet might look something like this: Career in Risk Management Some of the programs an ...