Manish Barnwal

...just another human

Tutorial on dplyr- a package for data manipulation in R

R is the most used tool in data science. It has no dearth of packages for specific use cases. There are three packages that I feel can get your most of the work done - ggplot2, dplyr, data.table.

  • ggplot2- Used for visualization. Also known as grammar of graphics. This package is used to plot graphs. The syntax is intuitive and easy to learn.

  • dplyr- Used for data manipulation. Also known as grammar of data manipulation. Most of the data munging processes and methods gets done easily using this package.

  • data.table- Used for large files. You can read huge files within seconds. The data manipulation library for larger datasets.

    In this post we will focus on learning dplyr package. Dplyr is a fast tool for data manipulation for data frame like object both in memory and out of memory. Let us get started. We will use hflights dataset to demonstrate the functions and syntax of dplyr.

    We need to first install dplyr and hflights if you don't have it already in your R environment. You can install these using:

install.packages("dplyr", dependencies = T)
install.packages("hflights", dependencies = T)

dplyr is a package which is referred to as a grammar of data manipulation. hflights is a package which has the dataframe hflights which consists of details about all the flight that fly in and out of Houston area in the year 2011

Load the installed packages and data using:

library(dplyr)
library(hflights)
data("hflights")

Get an idea of the data we will be working on by looking at its structure and printing the first few rows of the data.

str(hflights)
head(hflights)

Let me introduce to tbl_df. Look at the code below.

flights = tbl_df(hflights)
flights # prints only 10 rows and only those many columns that fits the screen for easy view

print(flights, n=20) # you can also specify how many rows you want to display

tbl_df creates a local data frame. It creates a wrapper around the original data frame that prints nicely. tbl_df introduces a new data frame like structure called tbl.

A tbl is of class data frame just that data manipulation is easy with this. This means that any function that we use with a data frame can be used by tbl type objects.

We will not use tbl objects. I rarely use it. I am comfortable with data frame class and we will focus on data frames for the rest of the tutorial as well. Below are the list of some of the important functions of dplyr.

data.frame(head(flights)) # convert back to data.frame to see all the columns

filter: Keep rows matching criteria

Example: View all flights on January 1

Base approach

flights[flights$Month==1 & flights$DayofMonth==1, ]

dplyr approach

filter(flights, Month==1 & DayofMonth==1) # Return rows with matching conditions.
unique(flights$UniqueCarrier)
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="AS") # use pipe for OR condition

select: Pick columns by name

Example: Select data for only UniqueCarrier, Distance, AirTime columns

Base approach

flights[, c("UniqueCarrier", "Distance", "AirTime")]

dplyr approach

select(flights, UniqueCarrier, Distance, AirTime) # select() keeps only the variables you mention

To get contiguous columns, use starts_with(), ends_with, contains

select(flights, Year:DayOfWeek, starts_with("Taxi"), ends_with("Time"))

Chaining or Pipelining

%>% operator helps you write multiple operations in a chain. The output of first operation becomes an input to the next command.

Let us try to understand this operator using an example. Say we want to select UniqueCarrier and DepDelay columns and filter only rows having delays over 60 minutes.

Normal approach

filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes

Chain operator approach

flights %>%
        select(UniqueCarrier, DepDelay) %>%
        filter(DepDelay > 60)

arrange: Reorder rows

Example: Select UniqueCarrier and DepDelay columns and sort by DepDelay

Base approach

flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")] # default is increasing

dplyr approach

flights %>%
        select(UniqueCarrier, DepDelay) %>%
        arrange(DepDelay)  # default is increasing/ascending


# use `desc` for descending
flights %>%
        select(UniqueCarrier, DepDelay) %>%
        arrange(desc(DepDelay))

mutate: Add new variables

  • Mutate adds new variables and preserves existing

  • Create new variables that are functions of existing variables

    Example: Add new variable named Speed

Base approach

flights$Speed = flights$Distance/flights$AirTime * 60
flights[, c("Distance", "AirTime", "Speed")]

dplyr approach

flights %>%
        select(Distance, AirTime) %>%
        mutate(Speed = Distance/AirTime * 60)

The above code doesn't add the Speed variable to flights data frame. To explicitly add variable to the table you need to assign this to a flights table. The below code does this.

# Adding 'Speed' variable to the table
flights = flights %>% mutate(Speed = Distance/AirTime * 60)

with and within

  • Perform R expressions using the items (variables) contained in a list or data frame

  • The within function will even keep track of changes made, including adding or deleting elements, and return a new object with these revised contents.

with is a generic function that evaluates expr in a local environment constructed from data.

with(flights, mean(ArrDelay, na.rm = T))
with(flights, plot(AirTime, Year))
flightsWith = with(flights, rm(Year))

# note that expr in with takes place only in the environment constructed

# Using 'with' you cannot assign the output value to any variable

names(flightsWith) # flights will still have 'Year' column

within is similar, except that it examines the environment after the evaluation of expr and makes the corresponding modifications to a copy of data (this may fail in the data frame case if objects are created which cannot be stored in a data frame) and returns it.

flightsTemp = flights
flightsTemp1 = within(flightsTemp, rm(Year))
names(flightsTemp1)

summarise: Reduce variables to values

Example: Create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay

dplyr approach

flights %>%
        group_by(Dest) %>%
        summarise(avg_delay = mean(ArrDelay, na.rm = T))

Another example: For each carrier, calculate the percentage of flights cancelled or diverted

flights %>%
        group_by(UniqueCarrier) %>%
        summarise_each(funs(mean), Cancelled, Diverted)

One more example: For each carrier, calculate the minimum and maximum arrival and departure delays

flights %>%
        group_by(UniqueCarrier) %>%
        summarise_each(funs(min(., na.rm=T), max(., na.rm=T)), ArrDelay, DepDelay)
  • Helper function n() counts the number of rows in a group

  • Helper function n_distinct(vector) counts the number of unique items in that vector

    Example: For each day of the year, count the total number of flights and sort in descending order

flights %>%
        group_by(Month, DayofMonth) %>%
        summarise(flight_count = n()) %>%
        arrange(desc(flight_count))

SQL Joins

  • inner_join(x, y)

Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

  • left_join(x, y)

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  • right_join(x, y)

Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

  • semi_join(x, y)

Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

  • anti_join(x, y)

Return all rows from x where there are not matching values in y, keeping just columns from x.

  • full_join(x, y)

Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

There are many other functions in dplyr. For now, I have listed down a few of these. I will update this post soon.

Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading this. Let's talk more of data-science.

Advertiser Disclosure: This post contains affiliate links, which means I receive a commission if you make a purchase using this link. Your purchase helps support my work.

Comments