Fan Gong's Blog

Recently I just finished my research as a research assistant with Professor Kathryn Rudie Harrigan in Columbia Business School. During this research, I have spent a lot of time on data wrangling. So I decide to make a summary about the most important part of data wrangling -- split-apply-combine. I will divide this post into two parts: One for the usage of R and another for the usage of Python.

All right, let us see how to use R to perform data wrangling.

1. Build-in Function

Basically, there are two groups of base function for split-apply-combine in R. One is aggregate function and another is the apply family.

Since most of R users are very familiar with these functions, I will only have a very brief introduction.

aggregate(x, by, Fun, ...), notice by should be a list of the grouping elements. For example: aggregate(iris[,-5], by = list(iris$Species), FUN = mean) will give the mean length/width value for each species.

apply(x, MARGIN, FUN) applies a function to each row or column of a matrix., notice for MARGIN, 1 indicates rows and 2 indicates columns.

lapply(x, FUN) and sapply(x, FUN) allow you to apply the same function to every element in a list or a vector. but lapply always returns a list, and sapply tries to simplify the return value whenever possible. Notice we always use '[' to get the subset of the content in each list element; and we use '[[' to extract the item in each list element. For example: lapply(a, '[[', 1,2) will extract item [1,2] from each list element.

tapply(x, Index, FUN): to apply FUN to subsets of entries in x that shares a common level in Index. which means we split our data by Index first and then use our function to every part. (seldomly use)

2. plyr Package

plyr is a data manipulating package which is especially for split-apply-combine.

plyr builds on the build-in apply functions by giving you control over the input and output formats and keeping the syntax consistent across all variations.

The basic format is two letters followed by ply(). The first letter refers to the format in and the second to the format out.

The three main letters are:

d = data frame
a = array(include matrics)
l = list

There are also some less common format options, say, _, which means it throw away the output.

2.1 ddply

ddply input a dataframe and also output a dataframe

library(plyr)
df = data.frame(year = rep(2000:2002, each = 3), count = round(runif(9, 0, 20)))
head(df)

    ##   year count
    ## 1 2000    13
    ## 2 2000    16
    ## 3 2000     4
    ## 4 2001    14
    ## 5 2001    20
    ## 6 2001    18

#use ddply to calculate CV for every year
ddply(df, .(year), function(x){
  mean = mean(x$count)
  sd = sd(x$count)
  cv = sd/mean
  return(data.frame(cv.count = cv))
})

    ##   year  cv.count
    ## 1 2000 0.5677271
    ## 2 2001 0.1762529
    ## 3 2002 0.2359323

2.2 ldply

ldply input a list and output a dataframe

#use ldply to calculate CV in every piece
df2 = data.frame(year = rep(2003:2005, each = 3), count = round(runif(9, 32, 50)))
df3 = data.frame(year = rep(2006:2008, each = 3), count = round(runif(9, 78, 90)))
l = list(period1 = df, period2 = df2, period3 = df3)

a = ldply(l, function(x){
  mean = mean(x$count)
  sd = sd(x$count)
  cv = sd/mean
  return(data.frame(cv.count = cv))
})

#also review the function of sapply
sapply(l, function(x){
  mean = mean(x$count)
  sd = sd(x$count)
  cv = sd/mean
  return(data.frame(cv.count = cv))
})

    ## $period1.cv.count
    ## [1] 0.3788022
    ##
    ## $period2.cv.count
    ## [1] 0.1382445
    ##
    ## $period3.cv.count
    ## [1] 0.05590516

3. dplyr package

dplyr is one of my favorite package in R. It is very intuitive and powerful. For most of the split-apply-combine problems, I will always use this package first. Here are the functions in this package:

3.1 select

select columns: select(.data, condition)

select a range of columns

select(iris, Sepal.Length:Petal.Length)

select all columns that start with the character string 'Se', use the function starts_with(). Also we have ends_with() = select columns that ends with a string; contains() = ...contains a character string; matches() = ...matches a regular expression; one_of() = ... from a group of names (a character vector)

select(iris, starts_with('Se')))

3.2 filter

select rows: filter(.data, condition). We can use the boolean operators (>,<,>=,<=,!=,%in%) to create logical condition.

filter(iris, Sepal.Length >= 2, Species %in% c('setosa', 'versicolor'))

3.3 arrange

arrange(.data, var) is used to re-order rows by a particular column such as the taxonomic order, list the name of the column you want to arrange the rows by.

arrange(iris, Sepal.Length)

3.4 summarise

summarise create summary statistics for a given column in the data frame such as finding the mean. There are many other summary statistics you could consider such as sd(), min(), max(), median(), sum(),n()(return the length of the vector), n_distinct() (number of distinct values in vector)

iris %>%
  summarize(avg = mean(Sepal.Length),
            min = min(Sepal.Length),
            max = max(Sepal.Length),
            total = n())

3.5 mutate

mutate(.data) add new columns to the data frame. Always it is used to create the attributes that summarise doesn't cover.

iris %>% mutate(ratio = Sepal.Length/Sepal.Width) %>% head()

3.6 group_by

group_by split the data frame by some variable, apply a function to the individual data frames and then combine the output.

iris %>%
  group_by(Species) %>%
  summarize(avg = mean(Sepal.Length),
            min = min(Sepal.Length),
            max = max(Sepal.Length),
            total = n())

3.7 join

inner_join(x, y, by = ) returns rows when there is a match in both tables. left_join(x, y, by = ) reuturns all rows from the left table, even if there are no matches in the right table. There are also right_join/full_join/semi_join/anti_join.

3.8 Other things in dplyr

Pipe Operator: %>%
This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions(reading from inside to the outside), the idea of piping is to read the function from left to right.

Deprecated SE versions of main verbs
dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics. Sometimes it is useful especially you would like to write a function about some variable. In this case, there are two matters needing attention:

Use the underscore version of functions. For example, use select_() and filter_()

Use lazyeval::interp to do format transformation. For example:

# here variable is a string variable
mean_max = df %>%
    select_("Year", variable) %>%
    group_by_("Year") %>%
    summarise_(mean = interp(~mean(variable), variable = as.name(variable)))

That is the basic data wrangling methods in R. Then I am wondering, does python also have some functions that are similar to dplyr?

Let us see in the next post!

Reference:

http://seananderson.ca/2013/12/01/plyr.html

https://stackoverflow.com/questions/19260951/using-square-bracket-as-a-function-for-lapply-in-r

Split-Apply-Combine in R

How to process data in R