group and apply operations to all remaining keys

Question

If I have a pandas DataFrame called df, I can find the average reading ability for each age with:

df.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())

But what if I want to find the average reading ability for all ages except one, say, age=k?

One way to do this is:

mu_other_ages = {}
for age in df['Age'].unique():
    mu_other_ages[age] = df[df['Age'] != age]['ReadingAbility'].mean()

This approach seems like the opposite of using groupby + apply. Is there a more efficient shortcut to achieve the same result?

Consider the example below:

In [52]: d = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])

In [53]: d
Out[53]:
   Age  ReadingAbility
0    1              10
1    2               4
2    1               9
3    2               3

In [54]: d.groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
Out[54]:
Age
1    9.5
2    3.5
dtype: float64

With only two distinct ages, the results are simply swapped: 2=9.5 and 1=3.5. With more classes, the value for age=k would be calculated as:

df[df['Age'] != k]['ReadingAbility'].mean()

To clarify, the expected result for this example is: 2=9.5 and 1=3.5

python pandas group-by

Answer №1

You need:

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
1    3.5
2    9.5
dtype: float64

Alternatively, a faster solution is to aggregate sum and size for each group, subtract each group's values from the column totals with rsub, and finally divide the remaining sum by the remaining size:

np.random.seed(45)
d = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['Age', 'ReadingAbility']) 
print (d)
   Age  ReadingAbility
0    3               0
1    5               3
2    4               9
3    8               1
4    5               9
5    6               8
6    7               8
7    5               2
8    8               1
9    6               4

a = (d.groupby('Age')
      .apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))

print (a)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000
dtype: float64

c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
print (c)
     size  sum
Age           
3       1    0
4       1    9
5       3   14
6       2   12
7       1    8
8       2    2

e = c.rsub(c.sum())
e = e['sum'] / e['size']
print (e)
Age
3    5.000000
4    4.000000
5    4.428571
6    4.125000
7    4.111111
8    5.375000
dtype: float64
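
The sum/size steps above can be wrapped into a small helper. This is only a sketch: the function name mean_excluding_each_group and its parameters are made up for illustration, and it assumes the value column is numeric and that no single key covers every row.

```python
import pandas as pd

def mean_excluding_each_group(frame, key, value):
    """For each distinct key, return the mean of `value` over rows with a different key."""
    # Per-group row count and sum of the value column.
    stats = frame.groupby(key)[value].agg(['size', 'sum'])
    # Column totals minus each group's own contribution = what the other groups hold.
    rest = stats.rsub(stats.sum())
    # Remaining sum divided by remaining row count gives the mean of the other groups.
    return rest['sum'] / rest['size']

# The four-row example from the question.
d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]], columns=['Age', 'ReadingAbility'])
result = mean_excluding_each_group(d, 'Age', 'ReadingAbility')
print(result)
```

On the question's data this should give 3.5 for age 1 and 9.5 for age 2, matching the expected result.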

Performance Analysis:

np.random.seed(45)
N = 100000
d = pd.DataFrame(np.random.randint(1000, size=(N, 2)), columns=['Age', 'ReadingAbility']) 
#print (d)


In [30]: %timeit (d.groupby('Age').apply(lambda x: d.loc[d['Age']!=x['Age'].iat[0], 'ReadingAbility'].mean()))
1 loop, best of 3: 1.27 s per loop


In [31]: %%timeit
    ...: c = d.groupby('Age')['ReadingAbility'].agg(['size','sum'])
    ...: #print (c)
    ...: e = c.sub(c.sum())
    ...: e = e['sum'] / e['size']
    ...: 
100 loops, best of 3: 6.28 ms per loop
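
Note that the timed snippet calls c.sub while the earlier example calls c.rsub. As far as I can tell both give the same final ratio, because sub flips the sign of the numerator and the denominator alike. A quick check on the ten-row sample from above (typed in by hand here rather than regenerated from the seed):

```python
import numpy as np
import pandas as pd

# Ten-row sample data copied from the printed frame above.
d = pd.DataFrame({
    'Age':            [3, 5, 4, 8, 5, 6, 7, 5, 8, 6],
    'ReadingAbility': [0, 3, 9, 1, 9, 8, 8, 2, 1, 4],
})
c = d.groupby('Age')['ReadingAbility'].agg(['size', 'sum'])

via_rsub = c.rsub(c.sum())  # totals - group: both columns are non-negative
via_sub = c.sub(c.sum())    # group - totals: both columns are non-positive

ratio_rsub = via_rsub['sum'] / via_rsub['size']
ratio_sub = via_sub['sum'] / via_sub['size']

# The two sign flips cancel in the division, so the ratios agree.
same = np.allclose(ratio_rsub, ratio_sub)
print(same)
```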

Answer №2

When you use d.groupby("Age")['ReadingAbility'].mean(), you will obtain the mean for each group based on Age.

To filter out a specific age group, such as Age = 1, you can add a query like

d.groupby("Age")['ReadingAbility'].mean().reset_index().query("Age != 1")

or

d.groupby("Age")['ReadingAbility'].mean().loc[lambda s: s.index != 1]

Another approach is to follow Merkle Daamgard's suggestion and filter out the unwanted rows first, before calling groupby and mean:

d.query("Age != 1").groupby("Age")['ReadingAbility'].mean()
d.loc[d.Age != 1].groupby("Age")['ReadingAbility'].mean()
d.where(d.Age != 1).groupby("Age")['ReadingAbility'].mean()

For further details, refer to GroupBy.mean.
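
As a rough check that filtering after or before the aggregation leaves the same group means, here is a small sketch on the question's data. The use of Series.drop is my own substitution for the filter step, not something taken from the answer above.

```python
import pandas as pd

d = pd.DataFrame([[1, 10], [2, 4], [1, 9], [2, 3]], columns=['Age', 'ReadingAbility'])

# Filter after aggregating: compute every group mean, then drop the Age == 1 entry.
after = d.groupby('Age')['ReadingAbility'].mean().drop(1)

# Filter before aggregating: remove the Age == 1 rows, then group what is left.
before = d.loc[d['Age'] != 1].groupby('Age')['ReadingAbility'].mean()

print(after)
print(before)
```

Both variants keep the per-age means of the remaining groups; neither pools the other ages into a single mean the way df[df['Age'] != k]['ReadingAbility'].mean() does.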

Answer №3

Perhaps this could work for your situation:

import pandas as pd

df = pd.DataFrame([[1,10], [2,4],[1, 9], [2,3]], columns=['Age', 'ReadingAbility'])
# Drop the rows for Age == 1, then average the remaining groups.
output = df.loc[df['Age'] != 1].groupby('Age').apply(lambda x: x['ReadingAbility'].mean())
print(output)

The result would be:

Age
2    3.5
dtype: float64
