Group and apply an operation to all remaining keys

If I have a pandas dataframe called df, I can find the average reading ability for each age by using the code:

df.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())

But what if I want to find the average reading ability for all ages except one, say, age=k?

One way to do this is:

mu_other_ages = {}
for age in df['Age'].unique():
    mu_other_ages[age] = df[df['Age'] != age]['ReadingAbility'].mean()

This approach seems like the opposite of using groupby + apply. Is there a more efficient shortcut to achieve the same result?

Consider the example below:

In [52]: d = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])

In [53]: d
Out[53]:
   Age  ReadingAbility
0    1              10
1    2               4
2    1               9
3    2               3

In [54]: d.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
Out[54]:
Age
1    9.5
2    3.5
dtype: float64

In the case where there are only 2 different age values, the results should be inverted as follows: 2=9.5 and 1=3.5. For more classes, the value for age=k would be calculated as:

df[df['Age'] != k]['ReadingAbility'].mean()

To clarify, the expected result for this example is: 2=9.5 and 1=3.5

Answer №1

You need:

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
1    3.5
2    9.5
dtype: float64

Alternatively, a faster solution aggregates the sum and size of each group, subtracts each group's values from the column totals (the code below uses rsub, so the result is totals minus group), and finally divides the remaining sum by the remaining size:

import numpy as np
import pandas as pd

np.random.seed(45)
d = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['Age', 'ReadingAbility'])
print (d)
   Age  ReadingAbility
0    3               0
1    5               3
2    4               9
3    8               1
4    5               9
5    6               8
6    7               8
7    5               2
8    8               1
9    6               4

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000

c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
print (c)
     size  sum
Age           
3       1    0
4       1    9
5       3   14
6       2   12
7       1    8
8       2    2

e = c.rsub(c.sum())
e = e['sum'] / e['size']
print (e)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000
dtype: float64
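
The sum-and-size idea above can be wrapped in a small helper. This is only a sketch: the function name mean_excluding_each_group and its parameters are made up for illustration and are not part of pandas. It is run on the 4-row frame from the question and compared with the explicit loop from the question.

```python
import pandas as pd


def mean_excluding_each_group(df, key, value):
    """For each distinct `key`, return the mean of `value` over rows in all other groups.

    Hypothetical helper name; it only combines groupby/agg/rsub as in the answer above.
    """
    stats = df.groupby(key)[value].agg(['size', 'sum'])
    # Column totals minus each group's own size and sum = what is left in the other groups.
    rest = stats.rsub(stats.sum())
    # If there is only one group, this is 0 / 0 and yields NaN.
    return rest['sum'] / rest['size']


d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]], columns=['Age', 'ReadingAbility'])
result = mean_excluding_each_group(d, 'Age', 'ReadingAbility')
print(result)

# Cross-check against the explicit loop from the question.
expected = {age: d.loc[d['Age'] != age, 'ReadingAbility'].mean()
            for age in d['Age'].unique()}
print(expected)
```

For the question's example this should give 3.5 for Age 1 and 9.5 for Age 2, matching the expected result stated in the question.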

Performance Analysis:

np.random.seed(45)
N = 100000
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility']) 
#print (d)


In [30]: %timeit (d.groupby('Age').apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
1 loop, best of 3: 1.27 s per loop


In [31]: %%timeit
    ...: c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
    ...: #print (c)
    ...: e = c.sub(c.sum())
    ...: e = e['sum'] / e['size']
    ...: 
100 loops, best of 3: 6.28 ms per loop
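
To repeat a comparison like this outside IPython, a rough sketch with time.perf_counter is shown below. Note the assumptions: the slow side here is the explicit loop from the question, used as a stand-in for the groupby + apply version timed above, and the function names loop_version and agg_version are invented for this example. Absolute timings depend on the machine and the pandas version, so none are quoted.

```python
import time

import numpy as np
import pandas as pd

np.random.seed(45)
N = 100000
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility'])


def loop_version(df):
    # One boolean mask over the whole frame per distinct age (slow for many ages).
    return pd.Series({age: df.loc[df['Age'] != age, 'ReadingAbility'].mean()
                      for age in df['Age'].unique()}).sort_index()


def agg_version(df):
    # One pass to get size and sum per group, then arithmetic on the small result.
    c = df.groupby('Age')['ReadingAbility'].agg(['size', 'sum'])
    e = c.rsub(c.sum())
    return e['sum'] / e['size']


start = time.perf_counter()
slow_result = loop_version(d)
t_slow = time.perf_counter() - start

start = time.perf_counter()
fast_result = agg_version(d)
t_fast = time.perf_counter() - start

print('loop: %.4f s, agg: %.4f s' % (t_slow, t_fast))
# Both approaches should agree up to floating-point rounding.
print(np.allclose(slow_result.values, fast_result.values))
```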

Answer №2

When you use d.groupby("Age")['ReadingAbility'].mean(), you will obtain the mean for each group based on Age.

To filter out a specific age group, such as Age = 1, you can add a query like

d.groupby("Age")['ReadingAbility'].mean().reset_index().query("Age != 1")

or

d.groupby("Age")['ReadingAbility'].mean().loc[lambda s: s.index != 1]

Another approach is to follow Merkle Daamgard's suggestion and filter out the unwanted values first, then apply groupby and mean:

d.query("Age != 1").groupby("Age")['ReadingAbility'].mean()
d.loc[d.Age != 1].groupby("Age")['ReadingAbility'].mean()
d.where(d.Age != 1).groupby("Age")['ReadingAbility'].mean()

For further details, refer to GroupBy.mean.

Answer №3

Perhaps this could work for your situation.

df = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])
output = df.loc[df['Age'] != 1].groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
print(output)

The result would be:

Age
2    3.5
dtype: float64
