class: center, middle, inverse, title-slide .title[ # LECTURE 09: tidy data ] .subtitle[ ## ENVS475: Exp. Design and Analysis ] --- class: inverse # outline <br/> #### 1) Data table formats <br/> -- #### 2) Tidy data defined <br/> -- #### 3) Why use tidy data? <br/> -- #### 4) Working with different data formats <br/> -- #### 5) Entering tidy data --- # Data table formats * Same data, different formats ``` ## # A tibble: 6 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ``` ``` ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` --- # Tidy data 1) Each variable is a column + Each column is a variable 2) Each Observation is a row + Each row is an observation 3) Each value is a cell + Each cell is a value --- # Tidy data - Visual <img src="fig/tidy-1.png" width="1280" style="display: block; margin: auto;" /> --- # Fix `table2` * What should the columns be? * What should the rows be? * What should the values be? ``` ## # A tibble: 6 × 4 ## country year type count ## <chr> <dbl> <chr> <dbl> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ``` * Make a tidy table of the data for Afghanistan in your notes --- # Fix `table3` * What should the columns be? * What should the rows be? * What should the values be? ``` ## # A tibble: 6 × 3 ## country year rate ## <chr> <dbl> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583 ``` * Make a tidy table of the data for Afghanistan in your notes --- # Tidy data for Afghanistan ``` ## [1] "country" "year" "cases" "population" ``` * In R, you can run `table1` to see what the tidy version looks like --- # Uniformity of tidy data * Both `table2` and `table3` get modified in different ways * End up in same, tidy format -- > “Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy (Russian Writer, 1828-1910) <br/> -- > “Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham (Lead author of the `tidyverse`) --- # Why use tidy data * There are two main advantages: 1. One consistent way of storing data. * Easier to learn the tools * Underlying uniformity 2. Specific advantage to placing variables in columns * Allows R’s vectorized nature to shine. * most built-in R functions work with vectors of values. * That makes transforming tidy data feel particularly natural. * `tidyverse` * Designed to work with tidy data. * Following shows a few small examples showing how you might work with `table1`. --- # Working with `table1`, tidy data * Compute the TB rate per 1,000 individuals ```r table1 |> mutate(rate = cases / population * 10000) ``` ``` ## # A tibble: 6 × 5 ## country year cases population rate ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan 1999 745 19987071 0.373 ## 2 Afghanistan 2000 2666 20595360 1.29 ## 3 Brazil 1999 37737 172006362 2.19 ## 4 Brazil 2000 80488 174504898 4.61 ## 5 China 1999 212258 1272915272 1.67 ## 6 China 2000 213766 1280428583 1.67 ``` --- # Working with `table1`, tidy data * Compute total cases per year ```r table1 |> group_by(year) |> summarize(total_cases = sum(cases)) ``` ``` ## # A tibble: 2 × 2 ## year total_cases ## <dbl> <dbl> ## 1 1999 250740 ## 2 2000 296920 ``` --- # Working with `table1`, tidy data * Plot changes through time * code: ```r ggplot(table1, aes(x = year, y = cases, color = country)) + geom_line() + geom_point(size = 4) + theme_bw() + scale_x_continuous(breaks = c(1999, 2000)) ``` * output on next slide --- # Working with `table1`, tidy data <img src="lecture_09_tidy_files/figure-html/unnamed-chunk-9-1.png" width="720" style="display: block; margin: auto;" /> --- # Practice problem: Working with different formats #### How would you calculate rate for `table2` and for `table3`? You would need to perform 4 operations: 1. Extract the number of TB cases per country per year. 2. Extract the matching population per country per year. 3. Divide cases by population, and multiply by 10000. 4. Store back in the appropriate place. Sketch out how to do this in your notes. --- # `tidyr` * We won't learn how to tidy data in R * That is what the `tidyr` package was made for. * See more at [R4DS-Tidy Data](https://r4ds.hadley.nz/data-tidy.html#introduction) --- # Entering tidy data * When possible, it is always best to enter data in a tidy format from the beginning * Sometimes not possible * Field sheets might not be easy to make a tidy format * Automatic sensor data may not be tidy * May need to re-enter data in a tidy format --- # Enter untidy field data * Two sheets of field data * make a tidy csv to store: * site and physicochemical data * stop at "water samples..." section --- # looking ahead <br/> **This week:** Lab and HW combined: Entering tidy data <br/> **Reading:** Hector Ch. 6, 9 <br/> **Take Home Exam**: Go over in class on Friday <br/>