R Dplyr Cheat Sheet



data.table and dplyr cheat-sheet

This tidyverse cheat sheet will guide you through the basics of the tidyverse and two of its core packages: dplyr and ggplot2. The tidyverse is a collection of R packages for data science, designed to help you transform and visualize data. All packages in the collection share an underlying philosophy and common APIs.

  • If you are using R to do data analysis inside a company, most of the data you need probably already lives in a database (it’s just a matter of figuring out which one!). However, you will learn how to load data into a local database in order to demonstrate dplyr’s database tools. At the end, I’ll also give you a few pointers if you do.
  • dplyr functions work with pipes and expect tidy data. In tidy data, each variable is in its own column and each observation is in its own row. With pipes, x %>% f(y) becomes f(x, y).
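The pipe rewrite rule can be checked directly. The following is a minimal sketch that assumes the dplyr package is installed and uses the built-in iris data set; the variable names piped and direct are made up for illustration.

```r
library(dplyr)  # dplyr re-exports the %>% pipe

# x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument
piped  <- iris %>% head(3)
direct <- head(iris, 3)

# Both calls return the same three rows
identical(piped, direct)
```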

This is a cheat sheet on data manipulation using the data.table and dplyr packages (sqldf will be included soon…). dplyr is an excellent and intuitive tool for data manipulation in R. Because its data-processing steps are intuitive and its concepts are somewhat similar to SQL, dplyr is becoming increasingly popular. Another reason is that it integrates seamlessly with SparkR, so mastering dplyr is a must if you want to get started with SparkR.


I found this cheat-sheet very useful when working with dplyr, and my post is inspired by it. Here I show data manipulation with data.table / data.frame and with dplyr side by side. This is especially useful for those who want to convert their data manipulation style from data.table to dplyr. Six kinds of data investigation and manipulation are covered:

  1. Summary of data
  2. Subset rows
  3. Subset columns
  4. Summarize data
  5. Group data
  6. Create new data

Select rows that meet logical criteria:

dplyr

data.frame / data.table
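A minimal sketch of the three styles, assuming dplyr and data.table are installed. The threshold of 7 and the object names (long_dplyr, long_base, iris_dt, long_dt) are illustrative choices, not taken from the original post.

```r
library(dplyr)
library(data.table)

# dplyr: keep rows where the condition is TRUE
long_dplyr <- filter(iris, Sepal.Length > 7)

# data.frame: logical index in the row position
long_base <- iris[iris$Sepal.Length > 7, ]

# data.table: the condition goes first, columns are referenced by bare name
iris_dt <- as.data.table(iris)
long_dt <- iris_dt[Sepal.Length > 7]
```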

Remove duplicate rows:

dplyr

data.table
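As a sketch under the same assumptions (dplyr and data.table installed, hypothetical object names), duplicates can be dropped like this:

```r
library(dplyr)
library(data.table)

# dplyr: distinct() keeps one copy of each unique row
uniq_dplyr <- distinct(iris)

# data.table: unique() compares all columns by default
iris_dt <- as.data.table(iris)
uniq_dt <- unique(iris_dt)
```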

Randomly select fraction of rows

dplyr

Randomly select n rows

dplyr

data.table / data.frame
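A possible way to draw random rows, sketched with the older sample_frac() and sample_n() verbs (recent dplyr versions also offer slice_sample()). The seed, the 10% fraction, and the sample size of 10 are arbitrary assumptions.

```r
library(dplyr)
library(data.table)

set.seed(42)  # arbitrary seed so the draw is reproducible

# dplyr: a fraction of the rows, then a fixed number of rows
frac_dplyr <- sample_frac(iris, 0.1)
n_dplyr    <- sample_n(iris, 10)

# data.frame: sample row indices
n_base <- iris[sample(nrow(iris), 10), ]

# data.table: .N is the number of rows
iris_dt <- as.data.table(iris)
n_dt <- iris_dt[sample(.N, 10)]
```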

Select rows by position

dplyr

data.table / data.frame
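Selecting by position might look like the following sketch; the row range 10:15 is an arbitrary example.

```r
library(dplyr)

# dplyr: slice() takes row positions
rows_dplyr <- slice(iris, 10:15)

# data.frame (data.table uses the same i syntax): integer index in the row position
rows_base <- iris[10:15, ]
```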

Select and order the top n entries (by group if the data are grouped)

dplyr

data.table
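One way to do this, sketched with slice_max(), which assumes dplyr 1.0.0 or later (older code often uses top_n()). The choice of Sepal.Length, n = 3, and with_ties = FALSE is illustrative only.

```r
library(dplyr)
library(data.table)

# dplyr: the three largest Sepal.Length values, sorted descending, ties dropped
top_dplyr <- iris %>% slice_max(Sepal.Length, n = 3, with_ties = FALSE)

# dplyr on grouped data: the largest entry within each Species
top_by_group <- iris %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 1, with_ties = FALSE)

# data.table: sort descending, then take the first three rows
iris_dt <- as.data.table(iris)
top_dt <- head(iris_dt[order(-Sepal.Length)], 3)
```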

dplyr

data.frame

> iris[c('Sepal.Width', 'Petal.Length', 'Species')]

data.table
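For comparison with the data.frame indexing, a sketch of the same three-column selection in dplyr and data.table (object names are hypothetical):

```r
library(dplyr)
library(data.table)

# dplyr: bare column names
cols_dplyr <- select(iris, Sepal.Width, Petal.Length, Species)

# data.table: .() is an alias for list() inside j
iris_dt <- as.data.table(iris)
cols_dt <- iris_dt[, .(Sepal.Width, Petal.Length, Species)]
```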

Select columns whose name contains a character string

Select columns whose name ends with a character string

Select every column

dplyr

data.frame
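The three selections above could be written as follows. This is a sketch; the search strings "." and "Length" are example values chosen for the iris columns.

```r
library(dplyr)

# dplyr selection helpers
with_dot <- select(iris, contains("."))        # name contains a literal "."
ends_len <- select(iris, ends_with("Length"))  # name ends with "Length"
all_cols <- select(iris, everything())         # every column

# data.frame: build a logical vector over the column names
base_dot <- iris[, grepl(".", names(iris), fixed = TRUE)]
base_len <- iris[, grepl("Length$", names(iris))]
```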

Select columns whose name matches a regular expression

Select columns named x1, x2, x3, x4, x5

select(iris, num_range('x', 1:5))

Select columns whose names are in a group of names

Select columns whose name starts with a character string

Select all columns between Sepal.Length and Petal.Width (inclusive)

Select all columns except Species.

dplyr

data.frame
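A sketch of the remaining selection patterns on iris. The regular expression, the name vector wanted, and the prefix "Sepal" are assumptions chosen for illustration; all_of() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Regular expression on the column name
by_regex <- select(iris, matches("Width$"))

# Names held in a character vector
wanted <- c("Species", "Sepal.Length")
by_names <- select(iris, all_of(wanted))

# Prefix, inclusive range, and exclusion
by_prefix <- select(iris, starts_with("Sepal"))
by_range  <- select(iris, Sepal.Length:Petal.Width)
no_spec   <- select(iris, -Species)

# data.frame: exclusion via a logical vector over the names
no_spec_base <- iris[, names(iris) != "Species"]
```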

With dplyr you can easily compute first, last, nth, n, n_distinct, min, max, mean, median, var, and sd of a vector as a summary of the table.


Summarize data into single row of values

dplyr
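A minimal sketch of a one-row summary; the output column names (avg, spread, rows, species) are made up for this example.

```r
library(dplyr)

# Collapse the whole table into one row of summary values
one_row <- summarise(
  iris,
  avg     = mean(Sepal.Length),
  spread  = sd(Sepal.Length),
  rows    = n(),
  species = n_distinct(Species)
)
```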

Apply summary function to each column

Note: mean() cannot be applied to a factor column.
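Because of the factor column, one option is to restrict the summary to numeric columns. The sketch below uses across() with where(), which assumes dplyr 1.0.0 or later; older code typically used summarise_each() or summarise_if().

```r
library(dplyr)

# Apply mean() to every numeric column; the factor column Species is skipped
col_means <- summarise(iris, across(where(is.numeric), mean))
```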

Count number of rows with each unique value of variable (with or without weights)

dplyr

data.table:


aggregate {stats}
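The three counting approaches might look like this sketch. Using Sepal.Length as the weight and as the column that aggregate() counts is an arbitrary choice for illustration.

```r
library(dplyr)
library(data.table)

# dplyr: one row per Species with the count in column n
counts_dplyr <- count(iris, Species)

# dplyr with weights: n becomes the sum of the weight column per Species
weighted <- count(iris, Species, wt = Sepal.Length)

# data.table: .N is the number of rows in each group
iris_dt <- as.data.table(iris)
counts_dt <- iris_dt[, .N, by = Species]

# stats::aggregate: count rows by taking the length of any column
counts_base <- aggregate(Sepal.Length ~ Species, data = iris, FUN = length)
```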

Group data into rows with the same value of Species

dplyr

data.table: this is usually performed with some aggregation computation

Remove grouping information from data frame

dplyr
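Grouping and ungrouping can be sketched as follows; grouped and flat are hypothetical names. In data.table there is no separate grouped object, since by = is supplied per call.

```r
library(dplyr)

# group_by() attaches grouping metadata without changing the rows
grouped <- group_by(iris, Species)
group_vars(grouped)     # "Species"

# ungroup() removes that metadata again
flat <- ungroup(grouped)
is_grouped_df(flat)     # FALSE
```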


Compute separate summary row for each group

dplyr

data.frame

data.table
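A per-group mean in all three styles, as a sketch; the summary column name avg is an assumption.

```r
library(dplyr)
library(data.table)

# dplyr: group, then summarise
by_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Sepal.Length))

# data.frame: formula interface of stats::aggregate
by_base <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# data.table: compute in j, group with by
by_dt <- as.data.table(iris)[, .(avg = mean(Sepal.Length)), by = Species]
```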


mutate() uses window functions, that is, functions that take a vector of values and return another vector of values, such as:

Compute and append one or more new columns
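For instance, a few window functions from dplyr and base R applied to a small made-up vector x; each returns a vector of the same length as its input.

```r
library(dplyr)

x <- c(10, 20, 20, 40)  # illustrative input

shifted  <- lag(x)         # NA 10 20 20
running  <- cumsum(x)      # 10 30 50 90
rank_min <- min_rank(x)    # 1 2 2 4
rank_dns <- dense_rank(x)  # 1 2 2 3
```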

data.frame / data.table

dplyr
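A sketch of adding a derived column in each style. The column sepal_area is a hypothetical example; note that data.table's := modifies iris_dt in place, while the other two approaches return or modify a copy.

```r
library(dplyr)
library(data.table)

# dplyr: mutate() returns a new data frame with the extra column
with_area <- mutate(iris, sepal_area = Sepal.Length * Sepal.Width)

# data.frame: assign into a copy with $
base_copy <- iris
base_copy$sepal_area <- base_copy$Sepal.Length * base_copy$Sepal.Width

# data.table: := adds the column by reference
iris_dt <- as.data.table(iris)
iris_dt[, sepal_area := Sepal.Length * Sepal.Width]
```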

Apply window function to each column


dplyr

base

data.table
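As a sketch, here is cumsum() applied to every numeric column. The choice of cumsum() and the object names are assumptions, and across() with where() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(data.table)

# dplyr: overwrite each numeric column with its running total
cum_dplyr <- mutate(iris, across(where(is.numeric), cumsum))

# base: lapply over the numeric columns of a copy
cum_base <- iris
num <- sapply(iris, is.numeric)
cum_base[num] <- lapply(iris[num], cumsum)

# data.table: .SD holds the columns listed in .SDcols
iris_dt <- as.data.table(iris)
num_cols <- names(iris_dt)[sapply(iris_dt, is.numeric)]
cum_dt <- iris_dt[, lapply(.SD, cumsum), .SDcols = num_cols]
```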


Compute one or more new columns. Drop original columns
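In dplyr this is what transmute() does. A minimal sketch, reusing a hypothetical sepal_area column:

```r
library(dplyr)

# transmute() keeps only the columns it creates
areas <- transmute(iris, sepal_area = Sepal.Length * Sepal.Width)
```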

Compute new variable by group.

dplyr

iris %>% group_by(Species) %>% mutate(ave = mean(Sepal.Length))

data.table


as.data.table(iris)[, ave := mean(Sepal.Length), by = Species]


data.frame
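With a plain data.frame, one option is base R's ave(), which returns a group-wise statistic aligned with the original rows. This is a sketch; iris_base is a hypothetical copy so that the built-in iris stays unchanged.

```r
# Work on a copy of the built-in data set
iris_base <- iris

# ave() computes the mean of Sepal.Length within each Species
# and repeats it for every row of that group
iris_base$ave <- ave(iris_base$Sepal.Length, iris_base$Species, FUN = mean)
```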


You can verify the result df1, df2 using:




