Python: Identifying the highest value across various columns in a Pandas Dataframe

Question

I'm new to Python and have a pandas DataFrame whose columns each represent a month. For each review period of x months, I want to flag any row that has ever had a value of 2 or more within that period.

Here is the code snippet I used to generate my sample dataframe:

import numpy as np
import pandas as pd

# 100 rows x 26 monthly columns of random integers in [0, 4]
arr_random = np.random.randint(low=0, high=5, size=(100, 26))
col_names = ['mth_' + str(i) for i in range(26)]  # mth_0 .. mth_25
rand_df = pd.DataFrame(arr_random, columns=col_names)

I want to flag rows as follows: 1 if any value is 2 or more, 0 if every value is below 2, and -1 for missing data (I treat NaN values as -1). This is the code I'm using to try to achieve that:

review_months = [12, 18, 24]
for x in review_months:
    rand_df['TWOPLUS_'+str(x)+'M'] = -1
    for i in range(x):
        rand_df['TWOPLUS_'+str(x)+'M'] = rand_df[['TWOPLUS_'+str(x)+'M', 'mth_'+str(i+1)]].max(axis = 1)
        conditions  = [ rand_df['TWOPLUS_'+str(x)+'M'] >= 2, rand_df['TWOPLUS_'+str(x)+'M'] < 2, rand_df['mth_'+str(i)] == -1 ]
        choices     = [ 1 , 0, -1 ]
        rand_df['TWOPLUS_'+str(x)+'M'] = np.select(conditions, choices, default=np.nan)

The issue I am facing is that I only get the current status of whether a row has 2 or more in a specific column within the time frame, rather than capturing if it has EVER occurred at some point over that time period.

python pandas numpy

Answer 1

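To check whether a row has ever contained a value of 2 or more within each window, compare the whole window at once and reduce across each row with any:

for month in (12, 18, 24):
    # First month + 1 columns, i.e. mth_0 .. mth_{month}
    window = rand_df.loc[:, rand_df.columns[:month + 1]]
    flag = (window >= 2).any(axis=1).astype(int)
    # Assumption: "missing" means every month in the window is NaN
    flag[window.isna().all(axis=1)] = -1
    rand_df[f'TWOPLUS_{month}M'] = flag

rand_df

This code selects the first month + 1 columns (mth_0 through mth_{month}) and tests whether each value is at least 2. any(axis=1) returns True for a row if any value in that row is True, and astype(int) converts the result to 1 for True and 0 for False. A comparison against NaN evaluates to False, so missing data never appears as a null in the result; rows whose window is entirely NaN are therefore set to -1 explicitly.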

For more information, see the pandas documentation for DataFrame.any and DataFrame.loc.
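The three-way flag (1, 0, -1) can be sketched on a small frame that actually contains missing values, which the random sample data does not. The frame below and the column name TWOPLUS_3M are hypothetical, and the sketch assumes that "missing" means a row has no data anywhere in the window; a different rule (for example, any NaN in the window) would need a different mask.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame: three months, some values missing.
df = pd.DataFrame(
    {
        "mth_0": [0, 3, np.nan, 1],
        "mth_1": [1, 0, np.nan, np.nan],
        "mth_2": [1, 1, np.nan, 0],
    }
)

window = df[["mth_0", "mth_1", "mth_2"]]
ever_two_plus = (window >= 2).any(axis=1)  # NaN >= 2 evaluates to False
all_missing = window.isna().all(axis=1)    # assumption: "missing" = no data at all

# np.select takes the first matching condition per row; the default covers
# rows that have data but never reached 2.
df["TWOPLUS_3M"] = np.select([ever_two_plus, all_missing], [1, -1], default=0)
print(df["TWOPLUS_3M"].tolist())
```

The conditions are evaluated on the whole window at once rather than month by month, so a single value of 2 or more anywhere in the window is enough to set the flag to 1.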

Output table:

	mth_0	mth_1	mth_2	mth_3	mth_4	mth_5	mth_6	mth_7	mth_8	mth_9	...	mth_19	mth_20	mth_21	mth_22	mth_23	mth_24	mth_25	TWOPLUS_12M	TWOPLUS_18M	TWOPLUS_24M
0	1	0	3	1	4	3	4	0	0	2	...	4	4	0	1	0	1	2	1	1	1
1	0	0	3	4	4	1	0	1	4	2	...	0	2	2	0	3	3	1	1	1	1
...(remaining rows)...
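Since the title asks about the highest value across columns, the same check can also be sketched with the row-wise maximum: a row has ever reached 2 exactly when its maximum over the window is at least 2. The helper name flag_two_plus and its threshold parameter are hypothetical, and the sketch again assumes that -1 should mark rows with no data at all in the window.

```python
import numpy as np
import pandas as pd


def flag_two_plus(df, months, threshold=2):
    """Return 1 / 0 / -1 flags over columns mth_0 .. mth_{months} (hypothetical helper)."""
    cols = [f"mth_{i}" for i in range(months + 1)]
    row_max = df[cols].max(axis=1)  # skips NaN; an all-NaN row yields NaN
    flags = (row_max >= threshold).astype(int)
    flags[row_max.isna()] = -1      # assumption: -1 only when the whole window is missing
    return flags


# Sample data shaped like the frame in the question (seeded for repeatability).
rng = np.random.default_rng(0)
rand_df = pd.DataFrame(
    rng.integers(0, 5, size=(100, 26)),
    columns=[f"mth_{i}" for i in range(26)],
)

for months in (12, 18, 24):
    rand_df[f"TWOPLUS_{months}M"] = flag_two_plus(rand_df, months)
```

Selecting the columns by name rather than by position keeps the newly added TWOPLUS_* columns out of later windows regardless of where they sit in the frame.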
