Split-Apply-Combine in Python

How to process data in Python?

Posted by Fan Gong on January 12, 2018

Split-Apply-Combine in Python

Last time we talked about how to do data wrangling by R, and we found that the package dplyr is very functional. Then I am wondering, does python also have a similar module or package?

The answer is Yes! Pandas has the similar groupby operation. Let us then see how does python work on split-Apply-Combine problems.

First let us load some libraries and load the classic iris data as our example dataset.

# First we load the library
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()

#Transform the sklearn Dataframe into pandas Dataframe
iris = pd.DataFrame(np.hstack((iris.data, iris.target.reshape(-1,1))),
                   columns = iris.feature_names + ['target'])
iris.target = iris.target.astype('category')
iris.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0

Then let us see more specific functions comparing with dplyr in R.

1. select

There is no specific select function in pandas but we can use its normal extraction or indexing operators. Also, if we want to rename the column, we can use rename() function.
Here is an example:

# Select from column 'sepal length(cm)' to 'target'
iris.loc[:, 'sepal length (cm)':'target'].head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0
# Select and at the same time rename the columns
iris.rename(columns={'sepal length (cm)':'sepal_len',
                     'sepal width (cm)':'sepal_w',
                     'petal length (cm)':'petal_len',
                     'petal width (cm)':'petal_w'}, inplace = True)
iris.head()
sepal_len sepal_w petal_len petal_w target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0

2. filter

This time pandas has a matching function called query. The difference is that we nned to write the expression as a string, and evaluate the string in the context of the DataFrame.

pd.DataFrame.query(expr, inplace=False, **kwargs)

Here is an example:

# Filter to only have the first species whose sepal_len longer or equal to 5 
iris.query('sepal_len >= 5 & target in [0.0, 1.0]').head()
sepal_len sepal_w petal_len petal_w target
0 5.1 3.5 1.4 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0
5 5.4 3.9 1.7 0.4 0.0
7 5.0 3.4 1.5 0.2 0.0
10 5.4 3.7 1.5 0.2 0.0

3. arrange


Here, the matching function is called sort_values, here is the detailed information:

pd.DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

where by is a string or list of string contains names which refer to the axis items; axis is the axis to sort. 0 represents index and 1 represents columns. for na_position we could choose put them at the beginning or at the end.

Here is an example:

#Descending sort by sepal length and width
iris.sort_values(['sepal_len', 'sepal_w'], ascending = False).head()
sepal_len sepal_w petal_len petal_w target
131 7.9 3.8 6.4 2.0 2.0
117 7.7 3.8 6.7 2.2 2.0
135 7.7 3.0 6.1 2.3 2.0
122 7.7 2.8 6.7 2.0 2.0
118 7.7 2.6 6.9 2.3 2.0

Actually, there is another sorting method called sort_index, which as its literally meaning, sorts the dataframe by index.

4. mutate

Here, the matching function is called assign. The difference between arrange and assign here is that python cannot calculate the arguments in order since here the args are considered in a dictionary which is unordered. In this case, we need to split up the assigns and pass in a callable to the second assign (We may use lambda function as our callable).

Here is an example:

# Calculate the ratio of sepal
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w).head()
sepal_len sepal_w petal_len petal_w target sepal_ratio
0 5.1 3.5 1.4 0.2 0.0 1.457143
1 4.9 3.0 1.4 0.2 0.0 1.633333
2 4.7 3.2 1.3 0.2 0.0 1.468750
3 4.6 3.1 1.5 0.2 0.0 1.483871
4 5.0 3.6 1.4 0.2 0.0 1.388889
# Calculate the ratio of sepal and petal and then calculate the difference between them
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w,
           petal_ratio = iris.petal_len/iris.petal_w).assign(diff = lambda df: df.sepal_ratio - df.petal_ratio).head()
sepal_len sepal_w petal_len petal_w target sepal_ratio petal_ratio diff
0 5.1 3.5 1.4 0.2 0.0 1.457143 7.0 -5.542857
1 4.9 3.0 1.4 0.2 0.0 1.633333 7.0 -5.366667
2 4.7 3.2 1.3 0.2 0.0 1.468750 6.5 -5.031250
3 4.6 3.1 1.5 0.2 0.0 1.483871 7.5 -6.016129
4 5.0 3.6 1.4 0.2 0.0 1.388889 7.0 -5.611111

5. summarise


In pandas these summarised functions are spread across pd.DataFrame.mean or pd.DataFrame.sum. There is also a pd.DataFrame.agg function. Here is the detailed information:

DataFrame.agg(func, axis=0, *args, **kwargs) where func is the function to use for aggregating the data. Accepted Combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)
Here is an example:

# Use simple mean function
iris.sepal_len.mean()
5.843333333333335
# Use agg
print(iris.agg(['mean','sum']))

# Specify which column
print(iris.agg({'sepal_len': ['count','std']}))
       sepal_len  sepal_w   petal_len     petal_w
mean    5.843333    3.054    3.758667    1.198667
sum   876.500000  458.100  563.800000  179.800000
        sepal_len
count  150.000000
std      0.828066

6. group_by

pandas has create an specific function for groupby. After this operation, the DataFrame will become to another type of object called pd.DataFrame.groupby objects. Here is the detailed information:

pd.DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)[source] Normally, we will pass a string or a list of column names to by.

Here is an example:

iris.groupby('target').agg(['mean','sum','count'])
sepal_len sepal_w petal_len petal_w
mean sum count mean sum count mean sum count mean sum count
target
0.0 5.006 250.3 50 3.418 170.9 50 1.464 73.2 50 0.244 12.2 50
1.0 5.936 296.8 50 2.770 138.5 50 4.260 213.0 50 1.326 66.3 50
2.0 6.588 329.4 50 2.974 148.7 50 5.552 277.6 50 2.026 101.3 50

We could see the results is a MultiIndex in the columns which can be useful when dealing with some complicated datasets. If you want to still use the normal index, remember to add as_index = False within groupby function.

iris.groupby('target').mean()
sepal_len sepal_w petal_len petal_w
target
0.0 5.006 3.418 1.464 0.244
1.0 5.936 2.770 4.260 1.326
2.0 6.588 2.974 5.552 2.026

7. join

Here we use join function in pandas. Here is the detailed information:

pd.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) where on is the column name of a list of column names; how can choose from {‘left’, ‘right’, ‘outer’, ‘inner’}; lsuffix and rsuffix is the left suffix and right suffix.

By the way, we can also use pd.concat. Here is an example:

# Both dfs are 3 x 4 DataFrame
df_mean = iris.groupby('target').mean()
df_count = iris.groupby('target').count()
#Make a outer join
df_count.join(df_mean, how='outer', lsuffix='_mean', rsuffix='_n')
sepal_len_mean sepal_w_mean petal_len_mean petal_w_mean sepal_len_n sepal_w_n petal_len_n petal_w_n
target
0.0 50 50 50 50 5.006 3.418 1.464 0.244
1.0 50 50 50 50 5.936 2.770 4.260 1.326
2.0 50 50 50 50 6.588 2.974 5.552 2.026

8. Make them all together!

Now we have known all the matching operations in pandas, let us now put them all together and see how it works.

Our target is: Select only sepal information for these three species, find the samples that has sepal length longer than 4 and sepal width is 3; then generate mean and max summary information for each species

iris.head()
sepal_len sepal_w petal_len petal_w target
0 5.1 3.5 1.4 0.2 0.0
1 4.9 3.0 1.4 0.2 0.0
2 4.7 3.2 1.3 0.2 0.0
3 4.6 3.1 1.5 0.2 0.0
4 5.0 3.6 1.4 0.2 0.0
# Get the groupby objects
iris_group = iris.iloc[:,[0,1,4]].query('sepal_len > 4 & sepal_w == 3').groupby('target')
# use agg
iris_group.agg(['mean','sum'])
sepal_len sepal_w
mean sum mean sum
target
0.0 4.700000 28.2 3.0 18.0
1.0 5.950000 47.6 3.0 24.0
2.0 6.716667 80.6 3.0 36.0
# Use join
iris_group.mean().join(iris_group.sum(), how = 'outer',
                       lsuffix='_mean', rsuffix='_sum' )
sepal_len_mean sepal_w_mean sepal_len_sum sepal_w_sum
target
0.0 4.700000 3.0 28.2 18.0
1.0 5.950000 3.0 47.6 24.0
2.0 6.716667 3.0 80.6 36.0


Reference:

https://pandas.pydata.org/pandas-docs/stable/groupby.html

http://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0