- Download
portal_mammals.sqlite
.- Make sure the copy you are going to use in class does not have the
SpeciesCounts
table or view.
- We can work with data in databases directly from
R
- No need to export files from the database
- Lets the database do the heavy lifting
- Faster
- No RAM limits
- We can do this either directly using SQL and the
DBI
package - Or work with the data just like we’ve done before using
dplyr
Installation
DBI
for general database functionalityRSQLite
for translating between R and SQLitedbplyr
for integrating databases withdplyr
install.packages(c("DBI", "dbplyr", "RSQLite"))
Connect
library(DBI)
library(RSQLite)
library(dplyr)
portaldb <- dbConnect(SQLite(), "portal.sqlite")
- It’s most common for this code to be written as
portaldb <- dbConnect(RSQLite::SQLite(), "portal.sqlite")
portaldb
To avoid needing to load the RSQLite
package using library()
Check out database structure
- Once connected to a database we can list the tables
dbListTables(portaldb)
- We can also look at the details of individual tables
dbListFields(portaldb, "plots")
Connecting to tables
- We can also connect to individual tables
surveys <- tbl(portaldb, "surveys")
surveys
- The data is still in the database, not in
R
, so we can’t tell how many rows it has or view it in the same way we would view a data frame. - If we want to load all of the data from the table into an
R
data frame we use thecollect()
function
surveys_df <- collect(surveys)
Write a query
- We can interact with the data in the database in two ways
- First, we can write queries in SQL
- Write a query to extract counts for each
species_id
count_query <- "SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id"
dbGetQuery(portaldb, count_query)
-
This uses
DBI
to run the query and return it to the R as a data frame -
Alternatively we can use the
tbl()
function from dplyr to create a table based on the query
tbl(portaldb, sql(count_query))
- The table is still stored in the database
- Number of rows is unknown as shown by
??
- When we have the results we want we can use
collect()
to load them intoR
as a data frame
count_data <- tbl(portaldb, sql(count_query)) %>%
collect()
Using dplyr
pipelines with databases
- We can also use
dplyr
commands directly on databases - To obtain the same results as our previous query using dplyr we use
group_by()
andsummarize()
species_counts <- surveys %>%
group_by(species_id) %>%
summarize(count = n())
- All of the calculation still happens in the databases
- So outside of RAM calculations are possible
- We can then bring the resulting data into
R
usingcollect()
for further analysis
species_counts <- surveys %>%
group_by(species_id) %>%
summarize(count = n()) %>%
collect()
Write new information to database
- Can also move data we created in R into the database using
copy_to()
- We can see that the current version of the database only has the three original tables
dbListTables(portaldb)
- If we wanted to store our new
species_counts
table in the database
copy_to(portaldb, species_counts, temporary=FALSE,
name="SpeciesCounts")
dbListTables(portaldb)
- Do Copy to Database.