Recently I just finished my research as a research assistant with Professor Kathryn Rudie Harrigan in Columbia Business School. During this research, I have spent a lot of time on data wrangling. So I decide to make a summary about the most important part of data wrangling -- split-apply-combine. I will divide this post into two parts: One for the usage of R and another for the usage of Python.
All right, let us see how to use R to perform data wrangling.
1. Build-in Function
Basically, there are two groups of base function for split-apply-combine in R. One is aggregate
function and another is the apply
family.
Since most of R users are very familiar with these functions, I will only have a very brief introduction.
aggregate(x, by, Fun, ...)
, noticeby
should be a list of the grouping elements. For example:aggregate(iris[,-5], by = list(iris$Species), FUN = mean)
will give the mean length/width value for each species.apply(x, MARGIN, FUN)
applies a function to each row or column of a matrix., notice forMARGIN
,1
indicates rows and2
indicates columns.lapply(x, FUN)
andsapply(x, FUN)
allow you to apply the same function to every element in a list or a vector. butlapply
always returns a list, andsapply
tries to simplify the return value whenever possible. Notice we always use'['
to get the subset of the content in each list element; and we use'[['
to extract the item in each list element. For example:lapply(a, '[[', 1,2)
will extract item[1,2]
from each list element.tapply(x, Index, FUN)
: to apply FUN to subsets of entries in x that shares a common level in Index. which means we split our data by Index first and then use our function to every part. (seldomly use)
2. plyr Package
plyr
is a data manipulating package which is especially for split-apply-combine.
plyr
builds on the build-in apply
functions by giving you control over the input and output formats and keeping the syntax consistent across all variations.
The basic format is two letters followed by ply()
. The first letter refers to the format in and the second to the format out.
-
d
= data frame -
a
= array(include matrics) -
l
= list
There are also some less common format options, say, _
, which means it throw away the output.
2.1 ddply
ddply
input a dataframe and also output a dataframe
library(plyr)
df = data.frame(year = rep(2000:2002, each = 3), count = round(runif(9, 0, 20)))
head(df)
## year count
## 1 2000 13
## 2 2000 16
## 3 2000 4
## 4 2001 14
## 5 2001 20
## 6 2001 18
#use ddply to calculate CV for every year
ddply(df, .(year), function(x){
mean = mean(x$count)
sd = sd(x$count)
cv = sd/mean
return(data.frame(cv.count = cv))
})
## year cv.count
## 1 2000 0.5677271
## 2 2001 0.1762529
## 3 2002 0.2359323
2.2 ldply
ldply
input a list and output a dataframe
#use ldply to calculate CV in every piece
df2 = data.frame(year = rep(2003:2005, each = 3), count = round(runif(9, 32, 50)))
df3 = data.frame(year = rep(2006:2008, each = 3), count = round(runif(9, 78, 90)))
l = list(period1 = df, period2 = df2, period3 = df3)
a = ldply(l, function(x){
mean = mean(x$count)
sd = sd(x$count)
cv = sd/mean
return(data.frame(cv.count = cv))
})
#also review the function of sapply
sapply(l, function(x){
mean = mean(x$count)
sd = sd(x$count)
cv = sd/mean
return(data.frame(cv.count = cv))
})
## $period1.cv.count
## [1] 0.3788022
##
## $period2.cv.count
## [1] 0.1382445
##
## $period3.cv.count
## [1] 0.05590516
3. dplyr package
dplyr is one of my favorite package in R. It is very intuitive and powerful. For most of the split-apply-combine problems, I will always use this package first. Here are the functions in this package:
3.1 select
select columns: select(.data, condition)
select a range of columns
select(iris, Sepal.Length:Petal.Length)
select all columns that start with the character string 'Se', use the function starts_with(). Also we have ends_with() = select columns that ends with a string; contains() = ...contains a character string; matches() = ...matches a regular expression; one_of() = ... from a group of names (a character vector)
select(iris, starts_with('Se')))
3.2 filter
select rows: filter(.data, condition)
. We can use the boolean operators (>,<,>=,<=,!=,%in%) to create logical condition.
filter(iris, Sepal.Length >= 2, Species %in% c('setosa', 'versicolor'))
3.3 arrange
arrange(.data, var)
is used to re-order rows by a particular column such as the taxonomic order, list the name of the column you want to arrange the rows by.
arrange(iris, Sepal.Length)
3.4 summarise
summarise
create summary statistics for a given column in the data frame such as finding the mean. There are many other summary statistics you could consider such as sd(), min(), max(), median(), sum(),n()
(return the length of the vector), n_distinct()
(number of distinct values in vector)
iris %>%
summarize(avg = mean(Sepal.Length),
min = min(Sepal.Length),
max = max(Sepal.Length),
total = n())
3.5 mutate
mutate(.data)
add new columns to the data frame. Always it is used to create the attributes that summarise
doesn't cover.
iris %>% mutate(ratio = Sepal.Length/Sepal.Width) %>% head()
3.6 group_by
group_by
split the data frame by some variable, apply a function to the individual data frames and then combine the output.
iris %>%
group_by(Species) %>%
summarize(avg = mean(Sepal.Length),
min = min(Sepal.Length),
max = max(Sepal.Length),
total = n())
3.7 join
inner_join(x, y, by = )
returns rows when there is a match in both tables. left_join(x, y, by = )
reuturns all rows from the left table, even if there are no matches in the right table. There are also right_join/full_join/semi_join/anti_join
.
3.8 Other things in dplyr
Pipe Operator: %>%
This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions(reading from inside to the outside), the idea of piping is to read the function from left to right.
Deprecated SE versions of main verbs
dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics. Sometimes it is useful especially you would like to write a function about some variable. In this case, there are two matters needing attention:
- Use the underscore version of functions. For example, use
select_()
andfilter_()
- Use
lazyeval::interp
to do format transformation. For example:# here variable is a string variable mean_max = df %>% select_("Year", variable) %>% group_by_("Year") %>% summarise_(mean = interp(~mean(variable), variable = as.name(variable)))
That is the basic data wrangling methods in R. Then I am wondering, does python also have some functions that are similar to dplyr
?
Let us see in the next post!
Reference:
http://seananderson.ca/2013/12/01/plyr.html
https://stackoverflow.com/questions/19260951/using-square-bracket-as-a-function-for-lapply-in-r