R is one of the most widely used tools in data science, and there is no dearth of packages for specific use cases. Three packages, I feel, can get most of your work done: ggplot2, dplyr, and data.table.
- ggplot2: Used for visualization. It implements the grammar of graphics, and its syntax is intuitive and easy to learn.
- dplyr: Used for data manipulation. Often described as a grammar of data manipulation; most data munging tasks get done easily with this package.
- data.table: Used for large files. It can read huge files within seconds and is the data manipulation library of choice for larger datasets.
In this post we will focus on learning the dplyr package. dplyr is a fast tool for manipulating data-frame-like objects, both in memory and out of memory. Let us get started. We will use the hflights dataset to demonstrate the functions and syntax of dplyr.
We first need to install dplyr and hflights if you don't already have them in your R environment. You can install them using:
install.packages("dplyr", dependencies = T)
install.packages("hflights", dependencies = T)
dplyr is a package often referred to as a grammar of data manipulation. hflights is a package containing the data frame hflights, which holds details of all flights departing from the Houston area in 2011.
Load the installed packages and data using:
library(dplyr)
library(hflights)
data("hflights")
Get an idea of the data we will be working on by looking at its structure and printing the first few rows of the data.
str(hflights)
head(hflights)
Let me introduce you to tbl_df. Look at the code below.
flights = tbl_df(hflights)
flights # prints only 10 rows and only as many columns as fit the screen, for easy viewing
print(flights, n=20) # you can also specify how many rows you want to display
tbl_df creates a local data frame: a wrapper around the original data frame that prints nicely. tbl_df introduces a data-frame-like structure called a tbl.
A tbl still has class data frame; it just makes data manipulation easier. This means that any function we can use with a data frame can also be used with tbl objects.
We will not rely on tbl-specific features. I rarely use them myself; I am comfortable with the data frame class, and we will focus on data frames for the rest of the tutorial as well. Below is a list of some of the important functions in dplyr.
data.frame(head(flights)) # convert back to data.frame to see all the columns
filter: Keep rows matching criteria
Example: View all flights on January 1
Base approach
flights[flights$Month==1 & flights$DayofMonth==1, ]
dplyr approach
filter(flights, Month==1 & DayofMonth==1) # Return rows with matching conditions.
unique(flights$UniqueCarrier)
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="AS") # | is the OR operator
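As a side note, the same filter can be written with the %in% operator, which is handy once the list of values grows. A minimal sketch on a small made-up data frame (df is illustrative, not part of hflights) so it runs on its own:

```r
library(dplyr)

# A tiny stand-in data frame so the snippet runs without hflights
df = data.frame(UniqueCarrier = c("AA", "AS", "WN"),
                DepDelay = c(10, -2, 25),
                stringsAsFactors = FALSE)

# %in% is equivalent to chaining == comparisons with |
filter(df, UniqueCarrier %in% c("AA", "AS"))
```

On the real data, the equivalent call would be filter(flights, UniqueCarrier %in% c("AA", "AS")).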
select: Pick columns by name
Example: Select data for only UniqueCarrier, Distance, AirTime columns
Base approach
flights[, c("UniqueCarrier", "Distance", "AirTime")]
dplyr approach
select(flights, UniqueCarrier, Distance, AirTime) # select() keeps only the variables you mention
To select a range of contiguous columns, use the : operator; to match columns by name, use helpers such as starts_with(), ends_with(), and contains().
select(flights, Year:DayOfWeek, starts_with("Taxi"), ends_with("Time"))
Chaining or Pipelining
The %>% operator lets you write multiple operations as a chain: the output of one operation becomes the input to the next.
Let us try to understand this operator using an example. Say we want to select the UniqueCarrier and DepDelay columns and keep only rows with departure delays over 60 minutes.
Nesting approach
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
Chain operator approach
flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)
arrange: Reorder rows
Example: Select UniqueCarrier and DepDelay columns and sort by DepDelay
Base approach
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")] # default is increasing
dplyr approach
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay) # default is increasing/ascending
# use `desc` for descending
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
mutate: Add new variables
- mutate adds new variables while preserving existing ones
- Use it to create new variables that are functions of existing variables
Example: Add new variable named Speed
Base approach
flights$Speed = flights$Distance/flights$AirTime * 60
flights[, c("Distance", "AirTime", "Speed")]
dplyr approach
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime * 60)
The above code doesn't add the Speed variable to the flights data frame. To add the variable permanently, you need to assign the result back to flights. The code below does this.
# Adding 'Speed' variable to the table
flights = flights %>% mutate(Speed = Distance/AirTime * 60)
with and within
- Both let you evaluate R expressions using the variables contained in a list or data frame
- within additionally keeps track of changes made, including adding or deleting elements, and returns a new object with the revised contents
with is a generic function that evaluates an expression in a local environment constructed from the data.
with(flights, mean(ArrDelay, na.rm = T))
with(flights, plot(AirTime, Year))
flightsWith = with(flights, rm(Year))
# the expression is evaluated only in the temporary environment constructed from the data;
# with() cannot modify the data itself, so you cannot use it to change flights
names(flights) # flights still has the 'Year' column
within is similar, except that it examines the environment after evaluating the expression, applies the corresponding modifications to a copy of the data, and returns that copy. (This may fail in the data frame case if the expression creates objects that cannot be stored in a data frame.)
flightsTemp = flights
flightsTemp1 = within(flightsTemp, rm(Year))
names(flightsTemp1)
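The claim above that within can also add elements deserves its own example. A minimal sketch on a small made-up data frame (df, Distance, and AirTime here are illustrative stand-ins, mirroring the Speed calculation from the mutate section):

```r
# within() can add or modify columns; the original data frame is untouched
df = data.frame(Distance = c(400, 1000), AirTime = c(60, 120))

# assignments inside within() become columns of the returned copy
df2 = within(df, Speed <- Distance / AirTime * 60)

names(df2) # df2 has the new Speed column
names(df)  # df is unchanged
```

This is the base-R counterpart of mutate: both add derived columns, but mutate fits naturally into %>% chains.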
summarise: Reduce variables to values
Example: Create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
dplyr approach
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm = T))
Another example: For each carrier, calculate the proportion of flights that were cancelled and the proportion that were diverted (the mean of a 0/1 indicator is a proportion)
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean), Cancelled, Diverted)
One more example: For each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(min(., na.rm=T), max(., na.rm=T)), ArrDelay, DepDelay)
- The helper function n() counts the number of rows in a group
- The helper function n_distinct(vector) counts the number of unique items in that vector
Example: For each day of the year, count the total number of flights and sort in descending order
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
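To see n() and n_distinct() side by side, here is a minimal sketch on a small made-up data frame (so it runs without hflights; the column names mirror the real dataset):

```r
library(dplyr)

# A tiny stand-in sample of flights
df = data.frame(Month = c(1, 1, 1, 2, 2),
                UniqueCarrier = c("AA", "AA", "AS", "AA", "WN"),
                stringsAsFactors = FALSE)

df %>%
  group_by(Month) %>%
  summarise(flight_count = n(),                        # rows per group
            carrier_count = n_distinct(UniqueCarrier)) # unique carriers per group
```

On the real data, the same pattern would count flights and distinct carriers per month.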
SQL Joins
- inner_join(x, y)
Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
- left_join(x, y)
Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- right_join(x, y)
Return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- semi_join(x, y)
Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
- anti_join(x, y)
Return all rows from x where there are no matching values in y, keeping just columns from x.
- full_join(x, y)
Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the side that is missing.
There are many other functions in dplyr. For now, I have listed a few of them; I will update this post soon.
Did you find the article useful? If you did, share your thoughts in the comments, and share this post with people you think would enjoy reading it. Let's talk more data science.