Split-Apply-Combine in Python

Last time we talked about how to do data wrangling by R, and we found that the package dplyr is very functional. Then I am wondering, does python also have a similar module or package?

The answer is Yes! Pandas has the similar groupby operation. Let us then see how does python work on split-Apply-Combine problems.

First let us load some libraries and load the classic iris data as our example dataset.

# First we load the library
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()

#Transform the sklearn Dataframe into pandas Dataframe
iris = pd.DataFrame(np.hstack((iris.data, iris.target.reshape(-1,1))),
                   columns = iris.feature_names + ['target'])
iris.target = iris.target.astype('category')

iris.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

Then let us see more specific functions comparing with dplyr in R.

1. select

There is no specific select function in pandas but we can use its normal extraction or indexing operators. Also, if we want to rename the column, we can use rename() function.
Here is an example:

# Select from column 'sepal length(cm)' to 'target'
iris.loc[:, 'sepal length (cm)':'target'].head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

# Select and at the same time rename the columns
iris.rename(columns={'sepal length (cm)':'sepal_len',
                     'sepal width (cm)':'sepal_w',
                     'petal length (cm)':'petal_len',
                     'petal width (cm)':'petal_w'}, inplace = True)
iris.head()

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

2. filter

This time pandas has a matching function called query. The difference is that we nned to write the expression as a string, and evaluate the string in the context of the DataFrame.

pd.DataFrame.query(expr, inplace=False, **kwargs)

Here is an example:

# Filter to only have the first species whose sepal_len longer or equal to 5 
iris.query('sepal_len >= 5 & target in [0.0, 1.0]').head()

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4
7	5.0	3.4	1.5	0.2
10	5.4	3.7	1.5	0.2

3. arrange

Here, the matching function is called sort_values, here is the detailed information:

pd.DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

where by is a string or list of string contains names which refer to the axis items; axis is the axis to sort. 0 represents index and 1 represents columns. for na_position we could choose put them at the beginning or at the end.

Here is an example:

#Descending sort by sepal length and width
iris.sort_values(['sepal_len', 'sepal_w'], ascending = False).head()

	sepal_len	sepal_w	petal_len	petal_w	target
131	7.9	3.8	6.4	2.0	2.0
117	7.7	3.8	6.7	2.2	2.0
135	7.7	3.0	6.1	2.3	2.0
122	7.7	2.8	6.7	2.0	2.0
118	7.7	2.6	6.9	2.3	2.0

Actually, there is another sorting method called sort_index, which as its literally meaning, sorts the dataframe by index.

4. mutate

Here, the matching function is called assign. The difference between arrange and assign here is that python cannot calculate the arguments in order since here the args are considered in a dictionary which is unordered. In this case, we need to split up the assigns and pass in a callable to the second assign (We may use lambda function as our callable).

Here is an example:

# Calculate the ratio of sepal
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w).head()

	sepal_len	sepal_w	petal_len	petal_w	sepal_ratio
0	5.1	3.5	1.4	0.2	1.457143
1	4.9	3.0	1.4	0.2	1.633333
2	4.7	3.2	1.3	0.2	1.468750
3	4.6	3.1	1.5	0.2	1.483871
4	5.0	3.6	1.4	0.2	1.388889

# Calculate the ratio of sepal and petal and then calculate the difference between them
iris.assign(sepal_ratio = iris.sepal_len/iris.sepal_w,
           petal_ratio = iris.petal_len/iris.petal_w).assign(diff = lambda df: df.sepal_ratio - df.petal_ratio).head()

	sepal_len	sepal_w	petal_len	petal_w	sepal_ratio	petal_ratio	diff
0	5.1	3.5	1.4	0.2	1.457143	7.0	-5.542857
1	4.9	3.0	1.4	0.2	1.633333	7.0	-5.366667
2	4.7	3.2	1.3	0.2	1.468750	6.5	-5.031250
3	4.6	3.1	1.5	0.2	1.483871	7.5	-6.016129
4	5.0	3.6	1.4	0.2	1.388889	7.0	-5.611111

5. summarise

In pandas these summarised functions are spread across pd.DataFrame.mean or pd.DataFrame.sum. There is also a pd.DataFrame.agg function. Here is the detailed information:

DataFrame.agg(func, axis=0, *args, **kwargs) where func is the function to use for aggregating the data. Accepted Combinations are:

string function name
function
list of functions
dict of column names -> functions (or list of functions)

Here is an example:

# Use simple mean function
iris.sepal_len.mean()

5.843333333333335

# Use agg
print(iris.agg(['mean','sum']))

# Specify which column
print(iris.agg({'sepal_len': ['count','std']}))

       sepal_len  sepal_w   petal_len     petal_w
mean    5.843333    3.054    3.758667    1.198667
sum   876.500000  458.100  563.800000  179.800000
        sepal_len
count  150.000000
std      0.828066

6. group_by

pandas has create an specific function for groupby. After this operation, the DataFrame will become to another type of object called pd.DataFrame.groupby objects. Here is the detailed information:

pd.DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)[source] Normally, we will pass a string or a list of column names to by.

Here is an example:

iris.groupby('target').agg(['mean','sum','count'])

	sepal_len			sepal_w			petal_len			petal_w
	mean	sum	count	mean	sum	count	mean	sum	count	mean	sum	count
target
0.0	5.006	250.3	50	3.418	170.9	50	1.464	73.2	50	0.244	12.2	50
1.0	5.936	296.8	50	2.770	138.5	50	4.260	213.0	50	1.326	66.3	50
2.0	6.588	329.4	50	2.974	148.7	50	5.552	277.6	50	2.026	101.3	50

We could see the results is a MultiIndex in the columns which can be useful when dealing with some complicated datasets. If you want to still use the normal index, remember to add as_index = False within groupby function.

iris.groupby('target').mean()

	sepal_len	sepal_w	petal_len	petal_w
target
0.0	5.006	3.418	1.464	0.244
1.0	5.936	2.770	4.260	1.326
2.0	6.588	2.974	5.552	2.026

7. join

Here we use join function in pandas. Here is the detailed information:

pd.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) where on is the column name of a list of column names; how can choose from {‘left’, ‘right’, ‘outer’, ‘inner’}; lsuffix and rsuffix is the left suffix and right suffix.

By the way, we can also use pd.concat. Here is an example:

# Both dfs are 3 x 4 DataFrame
df_mean = iris.groupby('target').mean()
df_count = iris.groupby('target').count()

#Make a outer join
df_count.join(df_mean, how='outer', lsuffix='_mean', rsuffix='_n')

	sepal_len_mean	sepal_w_mean	petal_len_mean	petal_w_mean	sepal_len_n	sepal_w_n	petal_len_n	petal_w_n
target
0.0	50	50	50	50	5.006	3.418	1.464	0.244
1.0	50	50	50	50	5.936	2.770	4.260	1.326
2.0	50	50	50	50	6.588	2.974	5.552	2.026

8. Make them all together!

Now we have known all the matching operations in pandas, let us now put them all together and see how it works.

Our target is: Select only sepal information for these three species, find the samples that has sepal length longer than 4 and sepal width is 3; then generate mean and max summary information for each species

iris.head()

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

# Get the groupby objects
iris_group = iris.iloc[:,[0,1,4]].query('sepal_len > 4 & sepal_w == 3').groupby('target')
# use agg
iris_group.agg(['mean','sum'])

	sepal_len		sepal_w
	mean	sum	mean	sum
target
0.0	4.700000	28.2	3.0	18.0
1.0	5.950000	47.6	3.0	24.0
2.0	6.716667	80.6	3.0	36.0

# Use join
iris_group.mean().join(iris_group.sum(), how = 'outer',
                       lsuffix='_mean', rsuffix='_sum' )

	sepal_len_mean	sepal_w_mean	sepal_len_sum	sepal_w_sum
target
0.0	4.700000	3.0	28.2	18.0
1.0	5.950000	3.0	47.6	24.0
2.0	6.716667	3.0	80.6	36.0

Reference:

https://pandas.pydata.org/pandas-docs/stable/groupby.html

http://nbviewer.jupyter.org/gist/TomAugspurger/6e052140eaa5fdb6e8c0

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4
7	5.0	3.4	1.5	0.2
10	5.4	3.7	1.5	0.2

	sepal_len	sepal_w	petal_len	petal_w	target
131	7.9	3.8	6.4	2.0	2.0
117	7.7	3.8	6.7	2.2	2.0
135	7.7	3.0	6.1	2.3	2.0
122	7.7	2.8	6.7	2.0	2.0
118	7.7	2.6	6.9	2.3	2.0

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4
7	5.0	3.4	1.5	0.2
10	5.4	3.7	1.5	0.2

	sepal_len	sepal_w	petal_len	petal_w	target
131	7.9	3.8	6.4	2.0	2.0
117	7.7	3.8	6.7	2.2	2.0
135	7.7	3.0	6.1	2.3	2.0
122	7.7	2.8	6.7	2.0	2.0
118	7.7	2.6	6.9	2.3	2.0

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4
7	5.0	3.4	1.5	0.2
10	5.4	3.7	1.5	0.2

	sepal_len	sepal_w	petal_len	petal_w	target
131	7.9	3.8	6.4	2.0	2.0
117	7.7	3.8	6.7	2.2	2.0
135	7.7	3.0	6.1	2.3	2.0
122	7.7	2.8	6.7	2.0	2.0
118	7.7	2.6	6.9	2.3	2.0

	sepal_len	sepal_w	petal_len	petal_w
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2