Split-Apply-Combine in Python
Last time we talked about how to do data wrangling by R, and we found that the package dplyr
is very functional. Then I am wondering, does python also have a similar module or package?
The answer is Yes! Pandas
has the similar groupby
operation. Let us then see how does python work on split-Apply-Combine problems.
First let us load some libraries and load the classic iris data as our example dataset.
# First we load the library
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
#Transform the sklearn Dataframe into pandas Dataframe
iris = pd.DataFrame(np.hstack((iris.data, iris.target.reshape(-1,1))),
columns = iris.feature_names + ['target'])
iris.target = iris.target.astype('category')
iris.head()
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
Then let us see more specific functions comparing with dplyr
in R.
1. select
There is no specific select function in pandas but we can use its normal extraction or indexing operators. Also, if we want to rename the column, we can use rename()
function.
Here is an example:
# Select from column 'sepal length(cm)' to 'target'
iris.loc[:, 'sepal length (cm)':'target'].head()
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
# Select and at the same time rename the columns
iris.rename(columns={'sepal length (cm)':'sepal_len',
'sepal width (cm)':'sepal_w',
'petal length (cm)':'petal_len',
'petal width (cm)':'petal_w'}, inplace = True)
iris.head()
sepal_len | sepal_w | petal_len | petal_w | target | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
2. filter
This time pandas has a matching function called query
. The difference is that we nned to write the expression as a string, and evaluate the string in the context of the DataFrame.
pd.DataFrame.query(expr, inplace=False, **kwargs)
Here is an example:
# Filter to only have the first species whose sepal_len longer or equal to 5
iris.query('sepal_len >= 5 & target in [0.0, 1.0]').head()
sepal_len | sepal_w | petal_len | petal_w | target | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
5 | 5.4 | 3.9 | 1.7 | 0.4 | 0.0 |
7 | 5.0 | 3.4 | 1.5 | 0.2 | 0.0 |
10 | 5.4 | 3.7 | 1.5 | 0.2 | 0.0 |
3. arrange
Here, the matching function is called
sort_values
, here is the detailed information:
pd.DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
where by
is a string or list of string contains names which refer to the axis items; axis
is the axis to sort. 0 represents index and 1 represents columns. for na_position
we could choose put them at the beginning or at the end.
Here is an example:
#Descending sort by sepal length and width
iris.sort_values(['sepal_len', 'sepal_w'], ascending = False).head()
sepal_len | sepal_w | petal_len | petal_w | target | |
---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | 2.0 |
117 | 7.7 | 3.8 | 6.7 | 2.2 | 2.0 |
135 | 7.7 | 3.0 | 6.1 | 2.3 | 2.0 |
122 | 7.7 | 2.8 | 6.7 | 2.0 | 2.0 |
118 | 7.7 | 2.6 | 6.9 | 2.3 | 2.0 |
Actually, there is another sorting method called sort_index
, which as its literally meaning, sorts the dataframe by index.
4. mutate
Here, the matching function is called assign
. The difference between arrange
and assign
here is that python cannot calculate the arguments in order since here the args are considered in a dictionary which is unordered. In this case, we need to split up the assigns and pass in a callable to the second assign (We may use lambda
function as our callable).
Here is an example:
# Calculate the ratio of sepal
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w).head()
sepal_len | sepal_w | petal_len | petal_w | target | sepal_ratio | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 1.457143 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 | 1.633333 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 | 1.468750 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 | 1.483871 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 | 1.388889 |
# Calculate the ratio of sepal and petal and then calculate the difference between them
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w,
petal_ratio = iris.petal_len/iris.petal_w).assign(diff = lambda df: df.sepal_ratio - df.petal_ratio).head()
sepal_len | sepal_w | petal_len | petal_w | target | sepal_ratio | petal_ratio | diff | |
---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 1.457143 | 7.0 | -5.542857 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 | 1.633333 | 7.0 | -5.366667 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 | 1.468750 | 6.5 | -5.031250 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 | 1.483871 | 7.5 | -6.016129 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 | 1.388889 | 7.0 | -5.611111 |
5. summarise
In
pandas
these summarised functions are spread across pd.DataFrame.mean
or pd.DataFrame.sum
. There is also a pd.DataFrame.agg
function. Here is the detailed information:
DataFrame.agg(func, axis=0, *args, **kwargs)
where func
is the function to use for aggregating the data. Accepted Combinations are:
- string function name
- function
- list of functions
- dict of column names -> functions (or list of functions)
# Use simple mean function
iris.sepal_len.mean()
5.843333333333335
# Use agg
print(iris.agg(['mean','sum']))
# Specify which column
print(iris.agg({'sepal_len': ['count','std']}))
sepal_len sepal_w petal_len petal_w
mean 5.843333 3.054 3.758667 1.198667
sum 876.500000 458.100 563.800000 179.800000
sepal_len
count 150.000000
std 0.828066
6. group_by
pandas
has create an specific function for groupby. After this operation, the DataFrame will become to another type of object called pd.DataFrame.groupby
objects. Here is the detailed information:
pd.DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)[source]
Normally, we will pass a string or a list of column names to by
.
Here is an example:
iris.groupby('target').agg(['mean','sum','count'])
sepal_len | sepal_w | petal_len | petal_w | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | sum | count | mean | sum | count | mean | sum | count | mean | sum | count | |
target | ||||||||||||
0.0 | 5.006 | 250.3 | 50 | 3.418 | 170.9 | 50 | 1.464 | 73.2 | 50 | 0.244 | 12.2 | 50 |
1.0 | 5.936 | 296.8 | 50 | 2.770 | 138.5 | 50 | 4.260 | 213.0 | 50 | 1.326 | 66.3 | 50 |
2.0 | 6.588 | 329.4 | 50 | 2.974 | 148.7 | 50 | 5.552 | 277.6 | 50 | 2.026 | 101.3 | 50 |
We could see the results is a MultiIndex
in the columns which can be useful when dealing with some complicated datasets. If you want to still use the normal index, remember to add as_index = False
within groupby
function.
iris.groupby('target').mean()
sepal_len | sepal_w | petal_len | petal_w | |
---|---|---|---|---|
target | ||||
0.0 | 5.006 | 3.418 | 1.464 | 0.244 |
1.0 | 5.936 | 2.770 | 4.260 | 1.326 |
2.0 | 6.588 | 2.974 | 5.552 | 2.026 |
7. join
Here we use join
function in pandas
. Here is the detailed information:
pd.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
where on
is the column name of a list of column names; how
can choose from {‘left’, ‘right’, ‘outer’, ‘inner’}
; lsuffix
and rsuffix
is the left suffix and right suffix.
By the way, we can also use pd.concat
. Here is an example:
# Both dfs are 3 x 4 DataFrame
df_mean = iris.groupby('target').mean()
df_count = iris.groupby('target').count()
#Make a outer join
df_count.join(df_mean, how='outer', lsuffix='_mean', rsuffix='_n')
sepal_len_mean | sepal_w_mean | petal_len_mean | petal_w_mean | sepal_len_n | sepal_w_n | petal_len_n | petal_w_n | |
---|---|---|---|---|---|---|---|---|
target | ||||||||
0.0 | 50 | 50 | 50 | 50 | 5.006 | 3.418 | 1.464 | 0.244 |
1.0 | 50 | 50 | 50 | 50 | 5.936 | 2.770 | 4.260 | 1.326 |
2.0 | 50 | 50 | 50 | 50 | 6.588 | 2.974 | 5.552 | 2.026 |
8. Make them all together!
Now we have known all the matching operations in pandas
, let us now put them all together and see how it works.
Our target is: Select only sepal information for these three species, find the samples that has sepal length longer than 4 and sepal width is 3; then generate mean and max summary information for each species
iris.head()
sepal_len | sepal_w | petal_len | petal_w | target | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
# Get the groupby objects
iris_group = iris.iloc[:,[0,1,4]].query('sepal_len > 4 & sepal_w == 3').groupby('target')
# use agg
iris_group.agg(['mean','sum'])
sepal_len | sepal_w | |||
---|---|---|---|---|
mean | sum | mean | sum | |
target | ||||
0.0 | 4.700000 | 28.2 | 3.0 | 18.0 |
1.0 | 5.950000 | 47.6 | 3.0 | 24.0 |
2.0 | 6.716667 | 80.6 | 3.0 | 36.0 |
# Use join
iris_group.mean().join(iris_group.sum(), how = 'outer',
lsuffix='_mean', rsuffix='_sum' )
sepal_len_mean | sepal_w_mean | sepal_len_sum | sepal_w_sum | |
---|---|---|---|---|
target | ||||
0.0 | 4.700000 | 3.0 | 28.2 | 18.0 |
1.0 | 5.950000 | 3.0 | 47.6 | 24.0 |
2.0 | 6.716667 | 3.0 | 80.6 | 36.0 |
Reference:
https://pandas.pydata.org/pandas-docs/stable/groupby.html
http://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0