Introduction to tabular data
- We will be working with data from the Portal Project.
- Long-term experimental study of small mammals in Arizona.
Setup local RStudio
- Download surveys.csv, species.csv, and plots.csv to a data folder.
- Need to know where the data is: Right click -> Save link as
- The dataset is composed of three tables
- Each table is stored in a .csv file
- csv stands for “comma separated values”
- This is a common way of storing data that can be used across programming and data management software
- Click on species.csv and View File
- If we look at one of these files we can see that
- It is plain text, so any program can read it
- The first row is the header row, with different column headers separated by commas
- All of the other rows are the data, again with different columns separated by commas
- And so each of the values is separated by commas, hence “comma separated values”
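To make the header row and data rows concrete, here is a minimal sketch that parses a few lines of comma separated text with read.csv(). The text is typed in by hand for illustration and is not read from the Portal files; the name toy_species is an assumption made for this example.

```r
# A tiny CSV document held in a string: one header row, then two data rows.
# These rows are typed in for illustration, not loaded from species.csv.
csv_text <- "species_id,genus,species
DM,Dipodomys,merriami
DO,Dipodomys,ordii"

# read.csv() can parse literal text through its `text` argument
toy_species <- read.csv(text = csv_text)

names(toy_species)  # column headers are taken from the first row
nrow(toy_species)   # every remaining row becomes one row of data
toy_species
```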
Loading and viewing the dataset
- Load these into R using read.csv()
surveys <- read.csv("data/surveys.csv")
species <- read.csv("data/species.csv")
plots <- read.csv("data/plots.csv")
- Display data by clicking on it in the Environment tab
- Three tables
- surveys: main table, one row for each rodent captured, data on date, location, species ID, sex, and size
- species: Latin species names for each species ID + general taxon
- plots: information on the experimental manipulations at the site
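Besides clicking on a table in RStudio, a data frame can be inspected from the console. The sketch below uses a small hand-made data frame (toy_surveys, with invented values) so that it runs without the downloaded files; the same functions should work on the tables loaded above.

```r
# A small stand-in for the surveys table; the values are invented
toy_surveys <- data.frame(
  year = c(1977, 1977, 1978),
  month = c(7, 7, 1),
  day = c(16, 16, 8),
  plot_id = c(2, 3, 2),
  species_id = c("NL", "DM", "DM"),
  weight = c(NA, 40, 48)
)

head(toy_surveys)   # first rows of the table
str(toy_surveys)    # column names, types, and a preview of the values
nrow(toy_surveys)   # number of records (rows)
names(toy_surveys)  # column names

# In RStudio, View(toy_surveys) opens the same spreadsheet-style viewer
# that appears when clicking on the table.
```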
- Good tabular data structure
- One table per type of data
- Tables can be linked together to combine information.
- Each row contains a single record.
- A single observation or data point
- Each column or field contains a single attribute.
- A single type of information
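As an illustration of how separate tables can be linked, the sketch below joins two hand-made tables on a shared species_id column using base R's merge(). The data frames, their values, and the record_id column are assumptions made for this example, and merge() is only one of several ways to combine tables.

```r
# One row per capture (invented values)
toy_surveys <- data.frame(
  record_id = 1:3,
  species_id = c("DM", "DO", "DM"),
  weight = c(40, 52, 44)
)

# One row per species
toy_species <- data.frame(
  species_id = c("DM", "DO"),
  genus = c("Dipodomys", "Dipodomys"),
  species = c("merriami", "ordii")
)

# Link the two tables through the column they share
combined <- merge(toy_surveys, toy_species, by = "species_id")
combined
```

Each capture row now carries the genus and species that were stored only once in the species table.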
Packages
- Packages are the main way that reusable code is shared in R
- Combination of code, data, and documentation
- R has a rich ecosystem of packages for data manipulation & analysis
- Download and install packages from the R console using install.packages()
- Even if we’ve installed a package it is not automatically available to do analysis with
- This is because different packages may have functions with the same names
- So we don’t want to have to worry about all of the packages we’ve installed every time we write a piece of code
- Using a package:
- Load all of the functions in the package with library()
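A minimal sketch of installing and then using a package, with dplyr as the example. The requireNamespace() guard is an addition for this sketch so that the install step is skipped when the package is already present; the toy data frame is invented.

```r
# Install dplyr only if it is missing (installation needs an internet connection)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

toy <- data.frame(year = c(1995, 1996), weight = c(40, 45))

# Use one function without loading the package: package::function
dplyr::select(toy, year)

# Load all of the functions in the package for the rest of the session
library(dplyr)
select(toy, year)
```

The package::function form also makes it explicit which package a function comes from when two loaded packages share a function name.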
Basic dplyr
- Modern data manipulation library for R
library(dplyr)
surveys <- read.csv("data/surveys.csv")
- Select a subset of columns using select().
select(surveys, year, month, day)
- They can occur in any order.
select(surveys, month, day, year)
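The sketch below repeats the two select() calls on a small invented data frame (toy_surveys) so that the effect on column order can be checked without the full dataset. It assumes dplyr is installed.

```r
library(dplyr)

# Invented stand-in for the surveys table
toy_surveys <- data.frame(
  year = c(1995, 1996),
  month = c(3, 11),
  day = c(14, 2),
  species_id = c("DS", "DM"),
  weight = c(120, 44)
)

dates <- select(toy_surveys, year, month, day)
reordered <- select(toy_surveys, month, day, year)

names(dates)        # only the requested columns are kept
names(reordered)    # columns appear in the order they were listed
names(toy_surveys)  # the original data frame still has all of its columns
```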
- Add new columns with calculated values using mutate()
mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
- If we look at surveys now will it contain the new column?
- Open surveys
- All of these commands produce new values, data frames in this case
- To store them for later use we need to assign them to a variable
surveys_plus <- mutate(surveys,
                       hindfoot_length_cm = hindfoot_length / 10)
- Or we could overwrite the existing variable if we don’t need it
surveys <- mutate(surveys,
                  hindfoot_length = hindfoot_length / 10)
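To check the point about assignment, the sketch below runs mutate() with and without storing the result. It uses an invented two-row data frame (toy_surveys) and assumes dplyr is installed.

```r
library(dplyr)

# Invented stand-in for the surveys table
toy_surveys <- data.frame(
  species_id = c("DM", "DO"),
  hindfoot_length = c(35, 37)
)

# Without assignment the result is printed and then discarded
mutate(toy_surveys, hindfoot_length_cm = hindfoot_length / 10)
"hindfoot_length_cm" %in% names(toy_surveys)  # FALSE: original is unchanged

# With assignment the new data frame is kept for later use
toy_plus <- mutate(toy_surveys, hindfoot_length_cm = hindfoot_length / 10)
"hindfoot_length_cm" %in% names(toy_plus)     # TRUE
```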
- We can sort the data in the table using arrange()
- To sort the surveys table by weight
arrange(surveys, weight)
- We can reverse the order of the sort by “wrapping” weight in another function, desc(), for “descending”
arrange(surveys, desc(weight))
- We can also sort by multiple columns, so if we wanted to sort first by plot_id and then by date
arrange(surveys, plot_id, year, month, day)
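The three kinds of sort can be tried on a small invented data frame (toy_surveys), assuming dplyr is installed. One detail worth noticing in the output is where the missing weight ends up.

```r
library(dplyr)

# Invented stand-in for the surveys table, including one missing weight
toy_surveys <- data.frame(
  plot_id = c(2, 1, 2, 1),
  year = c(1990, 1991, 1989, 1990),
  weight = c(40, NA, 52, 31)
)

by_weight <- arrange(toy_surveys, weight)             # smallest weight first
by_weight_desc <- arrange(toy_surveys, desc(weight))  # largest weight first
by_plot_year <- arrange(toy_surveys, plot_id, year)   # plot first, then year within plot

by_weight
by_weight_desc
by_plot_year
```

In both weight sorts the row with the missing weight is placed last, because arrange() sorts missing values to the end.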
- Use filter() to get only the rows that meet certain criteria.
- Combine the data frame to be filtered with a series of conditional statements.
- Column, condition, value
- To filter the data frame to only keep the data on species DS
- Type the name of the function, filter
- Parentheses
- The name of the data frame we want to filter, surveys
- The column we want to filter on, species_id
- The condition, which is == for “is equal to”
- And then the value, "DS"
- DS here is a string, not a variable or a column name, so we enclose it in quotation marks
filter(surveys, species_id == "DS")
- Like with vectors we can have a condition that is “not equal to” using “!=”
- So if we wanted the data for all species except “DS”
filter(surveys, species_id != "DS")
- We can also filter on multiple conditions at once
- In computing we combine conditions in two ways “and” & “or”
- “and” means that all of the conditions must be true
- Do this in filter() using additional comma separated arguments
- So, to get the data on species “DS” for years after 1995:
filter(surveys, species_id == "DS", year > 1995)
- Alternatively we can use the & symbol, which stands for “and”
filter(surveys, species_id == "DS" & year > 1995)
- This approach is mostly useful for building more complex conditions
- “or” means that one or more of the conditions must be true
- Do this using |
- To get data on all of the Dipodomys species
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")
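The conditions above can be compared side by side on a small invented data frame (toy_surveys), assuming dplyr is installed. Counting rows makes it easy to confirm that the comma form and the & form select the same records.

```r
library(dplyr)

# Invented stand-in for the surveys table
toy_surveys <- data.frame(
  species_id = c("DS", "DS", "DM", "DO", "NL"),
  year = c(1994, 1996, 1997, 1990, 1998),
  weight = c(120, 115, 44, 50, 160)
)

only_ds <- filter(toy_surveys, species_id == "DS")  # "is equal to"
not_ds <- filter(toy_surveys, species_id != "DS")   # "not equal to"

# "and": both forms keep rows where every condition is true
ds_recent_comma <- filter(toy_surveys, species_id == "DS", year > 1995)
ds_recent_amp <- filter(toy_surveys, species_id == "DS" & year > 1995)

# "or": keep rows where at least one condition is true
dipos <- filter(toy_surveys,
                species_id == "DS" | species_id == "DM" | species_id == "DO")

nrow(only_ds)
nrow(not_ds)
nrow(ds_recent_comma)
nrow(ds_recent_amp)
nrow(dipos)
```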
Filtering null values
- One of the common tasks we use filter() for is removing null values from data
- Based on what we learned before it’s natural to think that we do this by using the condition weight != NA
filter(surveys, weight != NA)
- Why didn’t that work?
- Null values like NA are special
- We don’t want to accidentally say that two “missing” things are the same
- We don’t know if they are
- So use special commands
- is.na() checks if the value is NA
- So if we wanted all of the data where the weight is NA we would use filter(surveys, is.na(weight))
We’ll learn more about why this works in the same way as the other conditional statements when we study conditionals in detail later in the course
To remove null values we combine this with ! for “not”
filter(surveys, !is.na(weight))
- So !is.na(weight) is conceptually the same as “weight != NA”
- It is common to combine a null filter with other conditions using “and”
- For example we might want all of the data on a species that contains weights
filter(surveys, species_id == "DS", !is.na(weight))
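The behavior of NA in comparisons can be checked directly. The sketch below uses an invented three-row data frame (toy_surveys) and assumes dplyr is installed; it shows why the != NA condition keeps nothing and how is.na() gives the intended result.

```r
library(dplyr)

# Invented stand-in for the surveys table with two missing weights
toy_surveys <- data.frame(
  species_id = c("DS", "DS", "DM"),
  weight = c(120, NA, NA)
)

# Comparing anything with NA gives NA, not TRUE or FALSE
NA == NA
toy_surveys$weight != NA

# filter() only keeps rows where the condition is TRUE, so nothing is kept
no_rows <- filter(toy_surveys, weight != NA)
nrow(no_rows)

# is.na() returns TRUE or FALSE for every value
is.na(toy_surveys$weight)

missing_weight <- filter(toy_surveys, is.na(weight))  # rows with no weight
has_weight <- filter(toy_surveys, !is.na(weight))     # rows with a weight
ds_with_weight <- filter(toy_surveys, species_id == "DS", !is.na(weight))
```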