dplyr Aggregation

Remember to

Load surveys.csv data into surveys

Basic aggregation

Aggregation combines rows into groups based on one or more columns.
It then calculates combined values for each group.
First step, group the data frame.
Let’s group it by year
group_by
Arguments: 1) table to work on; 2) columns to group by

group_by(surveys, year)

Returns a different-looking kind of data frame (a grouped tibble)
- Prints source, grouping, and data type information
Store the data frame in a variable to use in the next step

surveys_by_year <- group_by(surveys, year)
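To see what grouping actually changes, here is a minimal sketch on a small made-up table. `toy_surveys` and its values are hypothetical stand-ins for the real surveys data, used only so the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real surveys table
toy_surveys <- data.frame(
  year    = c(1977, 1977, 1978, 1978, 1978),
  plot_id = c(1, 2, 1, 1, 2)
)

toy_by_year <- group_by(toy_surveys, year)

# Grouping does not change the rows; it only adds grouping information
nrow(toy_by_year)           # same number of rows as toy_surveys
group_vars(toy_by_year)     # names of the grouping columns
n_groups(toy_by_year)       # number of distinct groups
is_grouped_df(toy_by_year)  # TRUE for a grouped table
```

The grouped table only does something visible once it is passed to a function that works group by group.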

After grouping a data frame use summarize() to calculate values for each group.
Count the number of rows for each group (individuals in each year).
summarize
Arguments
- Table to work on, which needs to be a grouped table
- One additional argument for each calculation we want to do for each group
  - New column name to store calculated value
  - =
  - Calculation that we want to perform for each group
  - We’ll use the function n(), a special function that counts the rows in each group

year_counts <- summarize(surveys_by_year, abundance = n())
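A small sketch of the same count on made-up data, so the result can be checked by eye. The `toy_` names and values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two rows in 1977, three rows in 1978
toy_surveys <- data.frame(year = c(1977, 1977, 1978, 1978, 1978))

toy_by_year <- group_by(toy_surveys, year)

# One output row per group; n() counts the rows in each group
toy_year_counts <- summarize(toy_by_year, abundance = n())
print(toy_year_counts)
# abundance is 2 for 1977 and 3 for 1978
```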

Can group by multiple columns
Count the number of individuals in each plot in each year

surveys_by_plot_year <- group_by(surveys, plot_id, year)
plot_year_counts <- summarize(surveys_by_plot_year, abundance = n())

Just like with other dplyr functions, we could write this using pipes instead

plot_year_counts <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())
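When grouping by more than one column, summarize() prints a message about the grouping of the result. The sketch below, on hypothetical toy data and assuming dplyr 1.0.0 or later (where the `.groups` argument is available), shows what is going on.

```r
library(dplyr)

# Hypothetical toy data (made-up values), not the real surveys table
toy_surveys <- data.frame(
  plot_id = c(1, 2, 1, 1, 2),
  year    = c(1977, 1977, 1978, 1978, 1978)
)

# Default: summarize() removes only the last grouping column (year),
# so the result is still grouped by plot_id
toy_counts <- toy_surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())
group_vars(toy_counts)

# Asking for all grouping to be dropped returns an ungrouped table
toy_counts_ungrouped <- toy_surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), .groups = "drop")
is_grouped_df(toy_counts_ungrouped)
```

Either form gives the same counts; the difference only matters if later steps should not operate group by group.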

Do Portal Data Aggregation 1-2.

We can also do multiple calculations using summarize
Use any function that returns a single value from a vector.
E.g., mean, max, min
We’ll calculate the number of individuals in each plot-year combination and their average weight

plot_year_count_weight <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))

Open table
Why did we get NA?
- mean(weight) returns NA when weight has missing values (NA)
Can fix using mean(weight, na.rm = TRUE)

plot_year_count_weight <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE))
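The effect of na.rm can be seen on a plain vector, outside of any data frame. The values below are made up for illustration.

```r
# Hypothetical weights with one missing value
weights <- c(40, NA, 48)

# Any NA in the input makes the mean NA
mean_with_na <- mean(weights)
is.na(mean_with_na)  # TRUE

# na.rm = TRUE drops missing values before averaging: (40 + 48) / 2
mean_without_na <- mean(weights, na.rm = TRUE)
mean_without_na      # 44
```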

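Still has NaN for plot-year combinations where no individuals were weighed
- With na.rm = TRUE there are no values left to average, so mean returns NaN
Can filter these out using !is.na (is.na is also TRUE for NaN)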

plot_year_count_weight <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE)) |>
  filter(!is.na(avg_weight))
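A sketch of where the NaN comes from and why the filter removes it, using hypothetical toy data in which plot 2 has no recorded weights.

```r
library(dplyr)

# Averaging only missing values leaves nothing to average, giving NaN
avg_all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
is.nan(avg_all_missing)  # TRUE
is.na(avg_all_missing)   # also TRUE, so !is.na() filters NaN as well

# Hypothetical toy data (made-up values): plot 2 was never weighed
toy_surveys <- data.frame(
  plot_id = c(1, 2, 1, 1, 2),
  year    = c(1977, 1977, 1978, 1978, 1978),
  weight  = c(40, NA, 48, 52, NA)
)

toy_count_weight <- toy_surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE)) |>
  filter(!is.na(avg_weight))

# Only the two plot 1 rows remain, with averages 40 and 50
print(toy_count_weight)
```

Note that the filter removes the whole row, so the abundance counts for those plot-year combinations are dropped too.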

Do Portal Data Aggregation 3.

Notes

Basic aggregation