Become a Data Preparation Expert Using ‘dplyr’ in R
Enhance Your Data Manipulation Workflow With This Comprehensive Guide
Data preparation is the cornerstone of successful data analysis and machine learning. It involves the intricate tasks of cleaning, transforming, and organizing raw data into a format ready for analysis. In the R ecosystem, one tool that is guaranteed to make your life easier is dplyr
.
The Elegance of dplyr
dplyr
is a data manipulation package in R. What sets it apart are its elegant and efficient functions designed to streamline data preparation tasks. While it offers a wide range of capabilities, let's delve into some of the standout features that make using dplyr
a no-brainer:
install.packages('dplyr')
library('dplyr')
1. Groupby: Powerful Aggregation
Data often contains intrinsic categories or groups. For instance, sales data can be grouped by region or product, students’ academic performances can be grouped by districts, schools or teachers, and so on…
In grouped data, we are commonly interested in dedicated summary statistics for these specific groups:
- Which of my products has the highest margin?
- What is the average grade of students taught by a specific teacher?
- Which U.S. state has the lowest levels of air pollution?
- […]
These insights require aggregation, that is, grouping data by shared attributes. The dplyr
library enables easy aggregation with its group_by()
function, allowing you to apply operations to each group independently.
# Group data by 'Category' and calculate average values
aggregated_data <- data %>%
group_by(Category) %>%
summarise(avg_value = mean(Value))
In the above code snippet, we first group our data by a category and then display the mean of another value for each category. Using the more tangible example of air pollution in U.S. states:
# Create sample air pollution data with cities from multiple states
data <- data.frame(
City = c("Columbus", "Cleveland", "Cincinnati",
"Toledo", "Dayton", "Pittsburgh", "Erie",
"Detroit", "Indianapolis"),
State = c("Ohio", "Ohio", "Ohio", "Ohio",
"Ohio", "Pennsylvania", "Pennsylvania",
"Michigan", "Indiana"),
AirPollution = c(15, 18, 10, 12, 9, 20, 17, 22, 14)
)
# Group by 'State' and calculate mean of air pollution
aggregated_data <- data %>%
group_by(State) %>%
summarise(avg_air_pollution = mean(AirPollution))
2. Piping with %>%
: Streamlining Workflows
At the heart of dplyr
is the pipe operator %>%
. This operator allows you to create smooth, step-by-step data transformation workflows. Instead of nesting functions, you can chain them together, making your code more readable and intuitive.
Let’s look at a sample DataFrame containing customer ratings of tech products on Amazon and the customer’s ages and gender:
# Setting a seed for reproducibility
set.seed(42)
# Generate random age ranges
age_ranges <- sample(c("18-24", "25-34", "35-44", "45-54", "55+"),
100, replace = TRUE)
# Generate random ratings on a 5-star scale
ratings <- sample(1:5, 100, replace = TRUE)
# Generate random gender
gender <- sample(c("male", "female", "other"), 100, replace = TRUE)
# Create a data frame
customer_ratings <- data.frame(AgeRange = age_ranges,
Rating = ratings,
Gender = gender)
Let’s say you’re interested in the average rating based on the age of the customer. Your theory is that younger people are more tech-savvy and thus, on average, leave better reviews for tech products on Amazon. However, you want to exclude 5-star ratings as they are often fake.
Your code might look something like this:
# Select columns 'AgeRange' and 'Rating'
subset <- customer_ratings[, c("AgeRange", "Rating")]
# Filter the data to remove 5-star ratings
filtered_subset <- subset[subset$Rating < 5, ]
# Calculate the mean rating for each age group
mean_ratings <- aggregate(Rating ~ AgeRange,
data = filtered_subset,
FUN = mean)
# Create a new data frame with the result
result <- data.frame(mean_ratings)
And it works as intended:
Theres a better way though.
# using dplyr with %>%
result <- customer_ratings %>%
select(AgeRange, Rating) %>%
filter(Rating < 5) %>%
group_by(AgeRange) %>%
summarise(avg_rating = mean(Rating))
Clean, isn’t it? Using the pipe operator %>%
from the dplyr
library, we can chain all operations together. The pipe operator simplifies your code, making it easier to follow your logic; it enhances readability and the code is more efficient.
3. Data Joining: Merging Datasets
Data preparation often involves combining different datasets, for instance when you measure user behavior with multiple instruments (e.g., a step-counter and a heart rate monitor). Data from multiple data sources can be joined by shared attributes, such as a user ID. dplyr
provides functions to join multiple sources of data, using different types of joins.
- Inner Join: This type of join returns only the rows that have matching values in both datasets based on the specified key.
# Inner join two datasets by 'ID'
merged_data <- inner_join(dataset1, dataset2, by = "ID")
- Left Join: A left join returns all the rows from the left dataset and the matching rows from the right dataset. If there are no matches in the right dataset, it includes NA values for those rows.
# Left join two datasets by 'ID'
merged_data <- left_join(dataset1, dataset2, by = "ID")
- Right Join: A right join is similar to a left join, but it returns all the rows from the right dataset and the matching rows from the left dataset.
# Right join two datasets by 'ID'
merged_data <- right_join(dataset1, dataset2, by = "ID")
- Full Join: A full join returns all the rows from both datasets, including NA values where there are no matches.
# Full join two datasets by 'ID'
merged_data <- full_join(dataset1, dataset2, by = "ID")
4. Mutate: Create New Variables Easily
One of the powerful functions in the dplyr toolkit is mutate
. Just like other dplyr functions, mutate
fits seamlessly into your data transformation workflows, thanks to the pipe operator %>%
. mutate
allows you to create new variables or modify existing ones within your dataset.
Let’s say you have a dataset containing the ages of individuals, and you want to calculate their ages in months:
# using mutate to create new variable 'age_in_months'
data <- data %>%
mutate(age_in_months = age * 12)
You can also use mutate
to modify existing variables in your dataset. For instance, if you have a dataset of product prices, and you want to apply a 10% discount to all prices:
# using mutate to modify the price variable
data <- data %>%
mutate(discounted_price = price * 0.9)
If you are interested in a
dplyr
counterpart in python, check out this story on the python library pandas: