Lab 04a: Exploring data with `dplyr`

Overview

This tutorial is aimed at giving you an introduction to the popular dplyr package. dplyr provides a few simple verbs that allow you to quickly and easily select and manipulate your data, and create an interactive environment for data exploration. This tutorial is based on Chapter 5 in “R for Data Science” by Hadley Wickham & Garrett Grolemund which is vailable for free online. Material was also adapted from the dplyr introduction vignette which you can find by running the following command:

vignette("dplyr")

dplyr is part of the suite of packages known as tidyverse. The packages within the tidyverse were designed with a similar data and programming philosophy, and work together fluidly. Hadley Wickham is the lead developer of the tidyverse, and his book above takes a tidyverse-centric view, and is a fantastic resource if you want to learn more. We will be using the tidyverse packages throughout this course.

Preliminaries

Before we begin, make sure you have the tidyverse installed.

If you need to install it, run the following lines of code, as needed. Also note that the package name is inside of “” in the function call:

install.packages("tidyverse")

As a reminder, you only have to run the install.packages(...) line once per machine. Running this line downloads the package onto the machine that you are currently using. However, you do need to load the packages at the beginning of every session. The following commands “loads” the packages and makes the commands available to you. Note that once the package is installed, you do not need “” around the package name.

library(tidyverse)

Data: `starwars`

In order to demonstrate the basic data manipulation verbs of dplyr, we will be using the starwars database, which comes with dplyr (part of the tidyverse.

We will load the starwars data by running the following command:

data("starwars")

Whenever you start working with a new dataset, you should also run some quick commands to get an idea of what the data looks like, and how it is set up. I generally run the following commands (note that I have suppressed the display in this document for room, but you should run each of these on your computer):

dim(starwars)
names(starwars)
head(starwars)
tail(starwars)
str(starwars)

The dim() list the number of rows and columns, in that order. The starwars dataset has 87 rows and 87 columns.

You may notice that the head(starwars) output is different from many other data frames that you may have worked with before. It only shows the rows and columns that easily fit into your window. That’s because the starwars object is a tibble:

class(starwars)
#> [1] "tbl_df"     "tbl"        "data.frame"

Tibbles "tbl_df" act just as a data frame would, but they print only the information that can fit into your window. To see the full data frame in the Rstudio viewer use:

View(starwars)

Or as a normal data frame use:

head(as.data.frame(starwars))

To convert a data frame to a tibble, use as_tibble().

As a general rule, I almost always use the names() function when I’m looking at a new data set. This helps me get an idea of the data structure, and if the variable names are meaningful, can help me understand what types of questions I can ask, and how I may be able to answer them.

names(starwars)
#>  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
#>  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
#> [11] "species"    "films"      "vehicles"   "starships"

These variables should be relatively self-explanatory, but you can run ?starwars for more information, including units.

`dplyr`

Data manipulation is one of the primary tasks that scientists undertake. the dplyr package makes it easy to select the data you want, organize it in a useful way, calculate useful new variables, and summarize data by groups.

There is an extremely helpful Rstudio data wrangling cheat sheet which you can download directly through Rstudio by clicking Help > Cheatsheets > Data transformation with dplyr.

Practice good project organization When you download these cheat sheets, put them all in an appropriate folder so you can find them again easily. Recall that I recommend the following sub folders in your ENVS_475 project directory:

ENVS_475/
  |─── other_resources/
        |─── cheatsheets/
              |─── data-transformation.pdf

I almost always have this cheat sheet open while I’m conducting any analyses. It takes a little effort to learn how to “read” the cheat sheet, but it’s well worth the effort.

Single table verbs

dplyr aims to provide a function for each basic verb of data manipulation:

Pick observations based on their values: filter()
- many rows to fewer rows
Reorder the rows: arrange()
Pick variables based on their names: select()
- many columns to fewer columns
Add new variables (columns) based on existing variables: mutate()
Reduce many observations to a single summary: summarise()
- many rows to one
Combine summarise() with group_by() to get one row per group

All verbs work similarly and have similar syntaxes:

The first argument is a data frame.
Following arguments specify what to do with the data frame using un-quoted variable names.
Result is new data frame.

There are other verbs in the dplyr package, along with many useful “helper” functions. Browse the vignettes for dplyr: browseVignettes(package = "dplyr"), or check the Rstudio data wrangling cheat sheet which you downloaded at the beginning of this tutorial.

Filter rows with `filter()`

Many rows to fewer rows

filter() allows you to select a subset of rows in a data frame. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

Filter rows based on numeric data

For example, we can select all star wars characters who are greater than or equal to 180 cm tall with:

filter(starwars, height >= 180)
#> # A tibble: 44 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 2 Biggs Da…    183    84 black      light      brown           24   male  mascu…
#> 3 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male  mascu…
#> 4 Anakin S…    188    84 blond      fair       blue            41.9 male  mascu…
#> # ℹ 40 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Note that R tells us that the functioned returned “A tibble: rows x columns”.
This makes sense, because we are only asking for star wars characters taller than a number.
It’s always a good idea to double check that the size of the returned data makes sense.

The use of `==`

if we wanted to know which characters were exactly 180 cm, we use the following:

filter(starwars, height == 180)
#> # A tibble: 5 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Wilhuff …    180    NA auburn, g… fair       blue              64 male  mascu…
#> 2 Han Solo     180    80 brown      fair       brown             29 male  mascu…
#> 3 Jek Tono…    180   110 brown      fair       blue              NA NA    NA    
#> 4 Ackbar       180    83 none       brown mot… orange            41 male  mascu…
#> # ℹ 1 more row
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

In the filter() command, we use == instead of =. The == is a Boolean operator. It is used to specify that we want to know if the object on the left is exactly the same as the object on the right. For example, run the following code:

5 == 5
#> [1] TRUE
5 == 9
#> [1] FALSE
"A"=="a"
#> [1] FALSE
"a" == "a"
#> [1] TRUE

The single = is another way of assigning a value to an object in R. For example a = 5 stores the value of 5 inside of the object called a.

Filtering rows based on character data

In the example above, we were filtering observations based on numeric data (i.e., height >= 150). Filter also works with character data types. For example, let’s extract all of the star wars characters that are from Naboo, which is stored in the homeworld column

filter(starwars, homeworld == "Naboo")
#> # A tibble: 11 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 R2-D2         96    32 NA         white, bl… red               33 none  mascu…
#> 2 Palpatine    170    75 grey       pale       yellow            82 male  mascu…
#> 3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
#> 4 Jar Jar …    196    66 none       orange     orange            52 male  mascu…
#> # ℹ 7 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Note the use of "" around UA in the filter() function. Remove the quotes and run the code again. What happens?

You should have received an error code in the R console. Error codes often look like hieroglyphics the first few times you see them, but the important information in this error is the “object ‘UA’ not found”. When we enter unquoted arguments into the dplyr functions, R looks for objects (i.e., data objects, column names within a data object, etc.) with that name.

Other Boolean operators

Other Boolean operators include: * | which means “or”

& which means “and”
!= which means “not equal to” PAY ATTENTION HERE!
>=which means “greater than or equal to”
<=which means “less than or equal to”
>which means “greater than”
<which means “less than”

You can use boolean operators, singly and in combination:

# starwars from Tatooine *OR* Alderaan
filter(starwars, homeworld == "Tatooine" | homeworld == "Alderaan")

# starwars from Tatooine OR Alderaan AND that are > 100 OR < 150 cm
filter(starwars, homeworld == "Tatooine" | homeworld == "Alderaan" & height > 100 | height < 150)

Sometimes you want to see all of the data that do not meet a criteria. For example, maybe you don’t want to know what starwars characters are not from Tatooine. You can use the logical negation command !=. This command can be read as “not equal to”.

# Not Tatooine
filter(starwars, homeworld != "Tatooine")
#> # A tibble: 67 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 R2-D2         96    32 NA         white, bl… red               33 none  mascu…
#> 2 Leia Org…    150    49 brown      light      brown             19 fema… femin…
#> 3 Obi-Wan …    182    77 auburn, w… fair       blue-gray         57 male  mascu…
#> 4 Wilhuff …    180    NA auburn, g… fair       blue              64 male  mascu…
#> # ℹ 63 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Removing `NA`’s with `filter()`

A common task in data analysis requires the removal of NA values in our data. For example, in the starwars data, droids don’t have hair, so hair_color is NA.

Based on what we learned before it’s natural to think that we do this by using the condition hair_color != NA * To remove only rows with NA in specific columns use filter

filter(starwars, hair_color != NA)

Why didn’t that work?
Null values like NA are special
We don’t want to accidentally say that two “missing” things are the same
- We don’t know if they are
So use special commands
is.na() checks if the value is NA
Combine this with ! for “not”

filter(starwars, !is.na(hair_color))
#> # A tibble: 82 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 3 Leia Org…    150    49 brown      light      brown           19   fema… femin…
#> 4 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
#> # ℹ 78 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Using `unique()`

It is also often useful to see all of the values that occur within a variable column. For example, what are all of the homeworld origins in this data set?

Here, I will use the base R function unique() to see all of the values in the variable homeworld. To select just that variable, I will use the $ subsetting function.

unique(starwars$homeworld)
#>  [1] "Tatooine"       "Naboo"          "Alderaan"       "Stewjon"       
#>  [5] "Eriadu"         "Kashyyyk"       "Corellia"       "Rodia"         
#>  [9] "Nal Hutta"      "Bestine IV"     NA               "Kamino"        
#> [13] "Trandosha"      "Socorro"        "Bespin"         "Mon Cala"      
#> [17] "Chandrila"      "Endor"          "Sullust"        "Cato Neimoidia"
#> [21] "Coruscant"      "Toydaria"       "Malastare"      "Dathomir"      
#> [25] "Ryloth"         "Aleen Minor"    "Vulpter"        "Troiken"       
#> [29] "Tund"           "Haruun Kal"     "Cerea"          "Glee Anselm"   
#> [33] "Iridonia"       "Iktotch"        "Quermia"        "Dorin"         
#> [37] "Champala"       "Geonosis"       "Mirial"         "Serenno"       
#> [41] "Concord Dawn"   "Zolan"          "Ojom"           "Skako"         
#> [45] "Muunilinst"     "Shili"          "Kalee"          "Umbara"        
#> [49] "Utapau"

Arrange rows with `arrange()`

Change order of rows based on values in a column

arrange() has a similar syntax as filter() but instead of selecting rows, it reorders them. It takes a data frame as the first argument, and then column name(s) to order by as subsequent arguments. The data will be arranged by the first column name provided, with ties being broken by subsequent columns:

# arrange by homeworld
arrange(starwars, homeworld)
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
#> 2 Bail Pre…    191    NA black      tan        brown             67 male  mascu…
#> 3 Raymus A…    188    79 brown      light      brown             NA male  mascu…
#> 4 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# arrange by homeworld, then height
arrange(starwars, homeworld, height)
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
#> 2 Raymus A…    188    79 brown      light      brown             NA male  mascu…
#> 3 Bail Pre…    191    NA black      tan        brown             67 male  mascu…
#> 4 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

The default order is from smallest to largest, numeric to character, and a to z. To reverse this order, use desc() to order a column in descending order:

arrange(starwars, desc(homeworld))
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Zam Wese…    168    55 blonde     fair, gre… yellow            NA fema… femin…
#> 2 Dud Bolt      94    45 none       blue, grey yellow            NA male  mascu…
#> 3 Tion Med…    206    80 none       grey       black             NA male  mascu…
#> 4 Sly Moore    178    48 none       pale       white             NA NA    NA    
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Select columns with `select()`

Many columns to fewer columns

Data sets generally contain numerous columns, but oftentimes you are only interested in a few for a given analysis. select() allows you to focus on a useful subset of your data while dropping un-needed columns. You can indicate which columns you want by naming them (without quotes) or select which columns you don’t want by putting a - in front of the name. I have suppressed the output to save space in this document, but be sure to run the following commands in your session and see what happens. Be sure to note the number of rows and columns that each command returns.

# Select columns by name
select(starwars, height, mass, homeworld, hair_color)

# Select all columns between name and eye_color (inclusive)
select(starwars, name:eyecolor)

# select all columns except for height
select(starwars, -height)

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.

The order you write the arguments in select() are retained in the output. This can be helpful if you want to quickly compare the values in two columns which do not normally appear next to each other.

select(starwars, name, homeworld, eye_color)
#> # A tibble: 87 × 3
#>   name           homeworld eye_color
#>   <chr>          <chr>     <chr>    
#> 1 Luke Skywalker Tatooine  blue     
#> 2 C-3PO          Tatooine  yellow   
#> 3 R2-D2          Naboo     red      
#> 4 Darth Vader    Tatooine  yellow   
#> # ℹ 83 more rows

Add new columns with `mutate()`

Many columns to more columns

Data analysis often requires the creation of new variable columns based on values within your data. The mutate() function allows you to do this. For example, let’s calculate a new variable called bmi, which looks at how height and mass in one variable.

# make new column called "bmi"
d2 <- mutate(starwars,
  bmi = mass / height^2)

# select new columns in d2, and columns used to calculate them
select(d2, mass, height, bmi)
#> # A tibble: 87 × 3
#>    mass height     bmi
#>   <dbl>  <int>   <dbl>
#> 1    77    172 0.00260
#> 2    75    167 0.00269
#> 3    32     96 0.00347
#> 4   136    202 0.00333
#> # ℹ 83 more rows

Summarise values with `summarise()`

Many rows to a single row

The last verb is summarise(). This collapses a dataframe to a single value, based on a function:

# mean height
summarise(starwars,
  height_average = mean(height, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   height_average
#>            <dbl>
#> 1           175.
# smallest mass
summarise(starwars, min_mass = min(mass, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   min_mass
#>      <dbl>
#> 1       15

Notice the argument na.rm = TRUE within the summarize function. If you have missing or NA values within your data, it will cause the summary functions to return NA. Recall that we discussed this in the “NULL’s in Vectors” section in our introduction to R activity.

Additionally, summarize is optimized to work with functions that return a single value. For example, range() returns the minimum and maximum value of a set of numbers:

range(c(1, 2, 3, NA, 5, 6, 7), na.rm = TRUE)
#> [1] 1 7

When used in combination with summarise(), two values are returned, but they are not labeled, and it includes a warning message:

summarise(starwars,
          height_range = range(height, na.rm = TRUE))
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#>   always returns an ungrouped data frame and adjust accordingly.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> # A tibble: 2 × 1
#>   height_range
#>          <int>
#> 1           66
#> 2          264

If this is the only thing you are calling, it may be easy to tell which is which, but if we have multiple summary arguments, it can be less obvious:

summarise(starwars,
          height_range = range(height, na.rm = TRUE),
          height_mean = mean(height, na.rm = TRUE))
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#>   always returns an ungrouped data frame and adjust accordingly.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> # A tibble: 2 × 2
#>   height_range height_mean
#>          <int>       <dbl>
#> 1           66        175.
#> 2          264        175.

However, we can get around this by calling the min() and max() functions separately within summarise:

summarise(starwars, 
          min_height = min(height, na.rm = TRUE),
          max_height = max(height, na.rm = TRUE),
          mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 1 × 3
#>   min_height max_height mean_height
#>        <int>      <int>       <dbl>
#> 1         66        264        175.

Grouped operations

Many rows to one-per-group

These verbs are useful on their own, but when used in conjuction with the group_by() function, the awesomeness of dplyr starts to shine through. It organizes a dataset into specified groups of rows. Verbs are then applied group-by-group within the dataset. Conveniently, this is accomplished using the exact same syntax as above.

Here, we will group the data by homeworld and then summarize the mean height:

# make a new object which contains a "grouping variable"
home_group <- group_by(starwars, homeworld)
summarize(home_group,
          mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 49 × 2
#>   homeworld   mean_height
#>   <chr>             <dbl>
#> 1 Alderaan           176.
#> 2 Aleen Minor         79 
#> 3 Bespin             175 
#> 4 Bestine IV         180 
#> # ℹ 45 more rows

The output now contains 1 row per world

Notice that if we print out the home_group object that it has the same dimensions as the original, but there is now a new line at the top which specifies the grouping variable:

home_group
#> # A tibble: 87 × 14
#> # Groups:   homeworld [49]
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 NA         gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 NA         white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Using `n()` to get counts

Another common data summarization task is to get the number of observations within a data set. For example, let’s say we wanted to know how many starwars each carrier was scheduled to operate. For this, we can use n(), which is a special function inside of the summary() function.

summarize(home_group, count = n())
#> # A tibble: 49 × 2
#>   homeworld   count
#>   <chr>       <int>
#> 1 Alderaan        3
#> 2 Aleen Minor     1
#> 3 Bespin          1
#> 4 Bestine IV      1
#> # ℹ 45 more rows

Chaining

The dplyr function calls don’t have any side-effects (unlike some base functions), making it easy to explore your data in an interactive way. However, one disadvantage of this is it doesn’t lead to very sucinct code, particularly if you want to perform many operations at once. You can do it step-by-step, saving a new object each time:

# calculate starwars characters that have a BMI either >0.04 or < 0.01 
a1 <- group_by(starwars, homeworld)
a2 <- select(a1, homeworld, height, mass)
a3 <- mutate(a2,
  bmi = mass / height^2)
a4 <- filter(a3, bmi > 0.04 | bmi < 0.005)
a4

However this can lead to many problems. Giving objects appropriate names can be difficult (e.g. object names in ggplot example). When naming them chronologically (as above), it can be difficult to remember which object is which (was the summarized object a2 or a3?…). Especially if you want to do the same thing over and over on different data sets or subsets of observations.

The pipe operator: `%>%`, `|>`

The pipe operator is originally from the magrittr package, but is included automatically in the dplyr package. More recently, newer version of R include a “native pipe” |> as a part of the base install. Both pipes should work the same, and either is fine for this class. I learned using the %>% pipe, so most of my code will use this one.

The pipe takes the output of one command and passes it as an input to the next command.
The pipe operator allows you to write this function: f(x,y) as x %>% f(y).

I think that this is a bit confusing to think about, but when you see some examples the power and ease of use becomes obvious.

# a silly example
x <- seq(10)
max(x)
#> [1] 10
# is the same as
x %>% max()
#> [1] 10

mean(x)
#> [1] 5.5
x %>% mean()
#> [1] 5.5

You can see that both syntaxes give the same result. Now, let’s go back our example looking at arrival and departure delays > 30 minutes

# original method, saving new object at each step
a1 <- group_by(starwars, homeworld)
a2 <- select(a1, homeworld, height, mass)
a3 <- mutate(a2,
  bmi = mass / height^2)
a4 <- filter(a3, bmi > 0.04 | bmi < 0.005)
a4

# now using the %>% operator
a5 <- starwars %>%
  group_by(homeworld) |>
  select(homeworld, height, mass) |>
  mutate(bmi = mass / height^2) |>
  filter(bmi > 0.04 | bmi < 0.005)

identical(a4, a5)
#> TRUE

It is helpful to say “then” when you see a %>% operator. e.g. take object starwars, then group by year, month and day, then select variables arr_delay and dep_delay then summarize … etc.

The pipe operator in combination with dplyr allows you to quickly examine your data and explore interesting results. One of my favorite aspects of this is you can answer interesting questions almost immediately.

“I wonder what the smallest mass character is?”

starwars %>% 
  arrange(mass)
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Ratts Ty…     79    15 none       grey, blue unknown           NA male  mascu…
#> 2 Yoda          66    17 white      green      brown            896 male  mascu…
#> 3 Wicket S…     88    20 brown      brown      brown              8 male  mascu…
#> 4 R2-D2         96    32 NA         white, bl… red               33 none  mascu…
#> # ℹ 83 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

If you want to also know the homeworld:

starwars %>% 
  arrange(mass) %>%
  select(name, mass, homeworld)
#> # A tibble: 87 × 3
#>   name                   mass homeworld  
#>   <chr>                 <dbl> <chr>      
#> 1 Ratts Tyerel             15 Aleen Minor
#> 2 Yoda                     17 NA         
#> 3 Wicket Systri Warrick    20 Endor      
#> 4 R2-D2                    32 Naboo      
#> # ℹ 83 more rows

What homeworld has the largest average mass?

starwars %>% 
  group_by(homeworld) %>%
  summarise(mean_mass = mean(mass)) %>%
  arrange(desc(mean_mass))
#> # A tibble: 49 × 2
#>   homeworld mean_mass
#>   <chr>         <dbl>
#> 1 Nal Hutta      1358
#> 2 Kalee           159
#> 3 Kashyyyk        124
#> 4 Trandosha       113
#> # ℹ 45 more rows

Formatting with pipes

A couple of pointers.

always enter a new line after a pipe %>%
if you have many arguments within a function in a pipe, enter a new line after each comma
if you are making a new variable with mutate, or summarizing variables with summarize, give them a meaningful name
try and limit the number of pipes in a single call. 5-6 is OK, but 10 or more should be avoided. If you need that many, save an intermediate object and then pipe that.

Bad formatting

# don't do these:
# no new line after each pipe
starwars %>% group_by(year, month, day) %>% filter(carrier == "FL" | 
carrier == "AA"| carrier == "UA") %>% select(arr_delay, dep_delay) %>%
summarise(arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm
= TRUE)) %>% filter(arr > 30 | dep > 30)
# no new line after each summarize argument, and meaningless variable names
starwars %>% summarize(x1 = mean(dep_delay, na.rm = TRUE), x2 = min(dep_delay, na.rm = TRUE), x3 = max(dep_delay, na.rm = TRUE), x4 = n())

Better formatting

# new line after each pipe
starwars %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE), # new line here
            dep = mean(dep_delay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)

# new line after each summarize argument, and variable names
starwars %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE),
            min_delay = min(dep_delay, na.rm = TRUE),
            max_delay = max(dep_delay, na.rm = TRUE),
            n_samples = n())

Note that when you press enter when “inside” of a function, RStudio automatically adds a tab space. This is very helpful for keeping the code format clean and makes it much easier to read.

ENVS475: Experimental Analysis and Design

April, 2024

Overview

Preliminaries

Data: `starwars`

`dplyr`

Single table verbs

Filter rows with `filter()`

Many rows to fewer rows

Filter rows based on numeric data

The use of `==`

Filtering rows based on character data

Other Boolean operators

Removing `NA`’s with `filter()`

Using `unique()`

Arrange rows with `arrange()`

Change order of rows based on values in a column

Select columns with `select()`

Many columns to fewer columns

Add new columns with `mutate()`

Many columns to more columns

Summarise values with `summarise()`

Many rows to a single row

Grouped operations

Many rows to one-per-group

Using `n()` to get counts

Chaining

The pipe operator: `%>%`, `|>`

Formatting with pipes

Bad formatting

Better formatting

Lab 04a: Exploring data with dplyr

ENVS475: Experimental Analysis and Design

April, 2024

Overview

Preliminaries

Data: starwars

dplyr

Single table verbs

Filter rows with filter()

Many rows to fewer rows

Filter rows based on numeric data

The use of ==

Filtering rows based on character data

Other Boolean operators

Removing NA’s with filter()

Using unique()

Arrange rows with arrange()

Change order of rows based on values in a column

Select columns with select()

Many columns to fewer columns

Add new columns with mutate()

Many columns to more columns

Summarise values with summarise()

Many rows to a single row

Grouped operations

Many rows to one-per-group

Using n() to get counts

Chaining

The pipe operator: %>%, |>

Formatting with pipes

Bad formatting

Better formatting

Lab 04a: Exploring data with `dplyr`

Data: `starwars`

`dplyr`

Filter rows with `filter()`

The use of `==`

Removing `NA`’s with `filter()`

Using `unique()`

Arrange rows with `arrange()`

Select columns with `select()`

Add new columns with `mutate()`

Summarise values with `summarise()`

Using `n()` to get counts

The pipe operator: `%>%`, `|>`