vignettes/articles/lab_04b_ggplot.Rmd
This lab will introduce you to the `ggplot2` package in R. This is a powerful and flexible package which is capable of producing publication-quality graphics.
Before we begin, make sure to download the data from D2L. The data we are using in this lab is a small simulated data set called `ggplot_lab_invert_survey.csv`. You will also need to download `galapago-finches.csv` for the homework assignment. Make sure to place both of these files in the `data/` folder in your R Project; otherwise you will need to modify the code below to have it run correctly.
`ggplot2` is a part of the `tidyverse` and was included when you previously ran `install.packages("tidyverse")`. Remember you only need to install packages on each machine once.
We will load the `ggplot_lab_invert_survey.csv` data set with the `read_csv()` function. `read_csv()` is a part of the `tidyverse` and is the main function that we will use to load data throughout this course.
There is also a `read.csv()` function that comes with the base installation of R. It behaves very similarly but is slightly less “smart” than `read_csv()`.
We create an object called `invert` and assign it the output from `read_csv()`. Because the file is in the `data/` folder, we add this to the beginning of the filename:
# load the tidyverse so that read_csv() and ggplot() are available
library(tidyverse)
invert <- read_csv("data/ggplot_lab_invert_survey.csv")
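If `read_csv()` reports that the file cannot be found, a quick check like the sketch below may help narrow down the cause. It uses only base R functions and assumes the folder and file names described above; `data_file` is a name chosen here for illustration.

```r
# Build the path the lab expects (assumes the data/ folder described above)
data_file <- file.path("data", "ggplot_lab_invert_survey.csv")

# Where is R currently looking? In an R Project this should be the project folder
getwd()

# Does the data/ folder exist relative to that location?
dir.exists("data")

# Is the csv file inside it?
file.exists(data_file)
```

If either of the last two lines returns `FALSE`, the file is not where the `read_csv()` call expects it to be.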
If this gives an error, check two things: Do you have a `data/` folder? Is `ggplot_lab_invert_survey.csv` in the `data/` folder?
Whenever you load a dataset into R, it is always a good idea to take a look at it to familiarize yourself with its structure. Run the following lines of code in RStudio one by one:
names(invert)
## [1] "month" "taxa" "body_length_mm" "body_mass_g"
head(invert)
## # A tibble: 6 × 4
##   month taxa   body_length_mm body_mass_g
##   <dbl> <chr>           <dbl>       <dbl>
## 1     3 beetle           26.9       17.4
## 2     7 beetle           32.1       11.3
## 3    11 beetle           21.4        4.40
## 4     3 beetle           25.0       26.2
## 5     7 beetle           17.7       12.6
## 6    11 beetle           31.9       15.2
tail(invert)
## # A tibble: 6 × 4
##   month taxa   body_length_mm body_mass_g
##   <dbl> <chr>           <dbl>       <dbl>
## 1     3 spider           23.2       15.2
## 2     7 spider           20.2        2.02
## 3    11 spider           25.3       16.0
## 4     3 spider           27.4        9.77
## 5     7 spider           25.4       12.6
## 6    11 spider           26.5       11.9
str(invert)
## spc_tbl_ [30 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ month : num [1:30] 3 7 11 3 7 11 3 7 11 3 ...
## $ taxa : chr [1:30] "beetle" "beetle" "beetle" "beetle" ...
## $ body_length_mm: num [1:30] 26.9 32.1 21.4 25 17.7 ...
## $ body_mass_g : num [1:30] 17.4 11.3 4.4 26.2 12.6 ...
## - attr(*, "spec")=
## .. cols(
## .. month = col_double(),
## .. taxa = col_character(),
## .. body_length_mm = col_double(),
## .. body_mass_g = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(invert)
##      month         taxa           body_length_mm   body_mass_g
##  Min.   : 3   Length:30          Min.   :11.01   Min.   : 2.020
##  1st Qu.: 3   Class :character   1st Qu.:20.89   1st Qu.: 6.397
##  Median : 7   Mode  :character   Median :26.02   Median :11.931
##  Mean   : 7                      Mean   :25.00   Mean   :11.671
##  3rd Qu.:11                      3rd Qu.:28.35   3rd Qu.:15.630
##  Max.   :11                      Max.   :33.83   Max.   :26.195
You can see the entire `invert` object by running the following command:
View(invert)
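From this we can see that it includes information on:
`month` - the month of the survey, as a number
`taxa` - the type of invertebrate (`beetle` or `spider`) as a character vector
`body_length_mm` - body length in millimeters, as a number
`body_mass_g` - body mass in grams, as a number
ggplot2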
Very popular plotting package
Makes publication quality plots quickly
Declarative - describe what you want, not how to build it
Contrasts with imperative - describe how to build it step by step
ggplot basic syntax
\[\underbrace{ggplot}_{initiate\;plot}(\underbrace{data = df}_{data\;frame},\; \underbrace{aes(x =\; , y = \;, color = \;)}_{plot\;attributes}) + \underbrace{geom\_line()}_{geometry}\]
To build a plot using ggplot we start with the `ggplot()` function
ggplot()
`ggplot()` creates a base ggplot object that we can then add things to
The two main arguments we give it are:
`data` - which is the name of the data frame we are working with, so `invert`
`mapping` - which describes which columns of the data are used for different aspects of the plot
We create a `mapping` by using the `aes()` function, which stands for “aesthetic”, and then linking columns to pieces of the plot
Here we link the `body_length_mm` of invertebrates (x position) to their `body_mass_g` (y position)
Note that we start a new line after each comma, and that RStudio automatically indents for us. This improves readability and makes it easier to look for typos.
We can add data to the plot using layers
We do this by adding a `+` after the `ggplot()` function and then adding something called a `geom`, which stands for geometry
To make a scatter plot we use `geom_point()`
ggplot(data = invert,
       mapping = aes(x = body_length_mm,
                     y = body_mass_g)) +
  geom_point()
It is standard to hit `Enter` after the plus so that each layer shows up on its own line
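To make the idea of a base object plus layers more concrete, here is a minimal sketch that builds the plot in two steps. It assumes `ggplot2` is installed and uses a small made-up data frame, `invert_demo`, as a stand-in for `invert` so that it runs without the csv file; the object names are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  body_length_mm = c(26.9, 32.1, 21.4, 23.2, 20.2, 25.3),
  body_mass_g    = c(17.4, 11.3, 4.4, 15.2, 2.0, 16.0)
)

# Step 1: the base object stores the data and the mapping, but has no layers yet
base_plot <- ggplot(data = invert_demo,
                    mapping = aes(x = body_length_mm,
                                  y = body_mass_g))

# Step 2: adding a geom with + returns a new plot that contains one layer
scatter_plot <- base_plot +
  geom_point()

length(base_plot$layers)     # number of layers before adding the geom
length(scatter_plot$layers)  # number of layers after adding the geom
```

Printing `base_plot` on its own draws empty axes, while printing `scatter_plot` draws the points, which is one way to see what each piece contributes.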
To change things about the layer we can pass additional arguments to the `geom`
We can do things like change
the `size` of the points, we’ll set it to `3`
the `color` of the points, we’ll set it to `"blue"`
the transparency of the points, which is called `alpha`, we’ll set it to `0.5`
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g)) +
geom_point(size = 3, color = "blue", alpha = 0.5)
To add axis labels and a title, we use the `labs()` function
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g)) +
geom_point(size = 3, color = "blue", alpha = 0.5) +
labs(x = "Body length [mm]",
y = "Body mass [g]",
title = "Simulated ground-dwelling invertebrate survey")
Throughout the semester, we will want to see the distribution of one continuous variable. The two main ways we will do that are:
Histograms
Boxplots
We can make a histogram of `body_mass_g` by modifying our earlier code.
Note that there is only one variable, so we remove the `y=...` part.
We also change `geom_point()` to `geom_histogram()`
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`geom_histogram()` groups observations into bins
by default it uses 30 bins no matter what the data look like (see the message output above), but we can change this using the `bins` argument
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_histogram(bins = 5)
We can set the size of the bins with the `binwidth` argument
for example, let’s say we want bins that are 2.5 g across
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_histogram(binwidth = 2.5)
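The difference between `bins` and `binwidth` can be checked by looking at the bins that ggplot2 actually computes. The sketch below is one way to do that with `layer_data()`; it assumes `ggplot2` is installed and uses a made-up data frame, `invert_demo`, rather than the lab data.

```r
library(ggplot2)

# Hypothetical body masses in grams (made-up values, not the lab data)
invert_demo <- data.frame(
  body_mass_g = c(2.0, 4.4, 9.8, 11.3, 11.9, 12.6, 15.2, 16.0, 17.4, 26.2)
)

# Same data, two ways of controlling the bins
hist_bins <- ggplot(data = invert_demo,
                    mapping = aes(x = body_mass_g)) +
  geom_histogram(bins = 5)

hist_width <- ggplot(data = invert_demo,
                     mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2.5)

# layer_data() returns one row per bin, with its edges (xmin, xmax) and count
bins_table  <- layer_data(hist_bins)
width_table <- layer_data(hist_width)

nrow(bins_table)                       # how many bins `bins = 5` produced
width_table$xmax - width_table$xmin    # the width of each bin for `binwidth = 2.5`
```

With `bins` you fix how many bars there are and let the width follow from the range of the data; with `binwidth` you fix the width and let the number of bars follow.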
We can change the attributes of the histogram output in similar ways as above
Common arguments are: `fill`, `color` (outline), `linewidth` (outline thickness; this was called `size` before ggplot2 3.4.0), and `alpha` (transparency).
You can also explore different “themes”. Below is `theme_bw()`, but you can try `theme_classic()`, `theme_dark()`, `theme_void()`, etc.
ggplot(data = invert,
       mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2.5,
                 fill = "dodgerblue",
                 color = "black",
                 linewidth = 2,
                 alpha = 0.5) +
  theme_bw()
Boxplots summarize the data distribution
AKA “Box and Whisker” Plots
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_boxplot()
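Whiskers: extend to the smallest (left side) and largest (right side) observations that lie within 1.5 times the width of the box (the interquartile range) from the box edges; any observations beyond that are drawn as individual points
Box: shows where the middle 50% of the data lies
Line in middle of box is the median
Useful to look for normal distributions: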
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_boxplot(fill = "red",
color = "black")
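The pieces of a boxplot can also be computed by hand, which may help when reading one. The sketch below uses only base R and a made-up vector of body masses (not the lab data); it follows the rule `geom_boxplot()` uses by default, where whiskers stop at the most extreme observations within 1.5 times the interquartile range of the box.

```r
# Hypothetical body masses in grams (made-up values, not the lab data)
body_mass_g <- c(2.0, 4.4, 9.8, 11.3, 11.9, 12.6, 15.2, 16.0, 17.4, 26.2)

# Box edges (25% and 75%) and the median line (50%)
box_edges <- quantile(body_mass_g, probs = c(0.25, 0.5, 0.75))

# Interquartile range: the width of the box
iqr <- box_edges[["75%"]] - box_edges[["25%"]]

# Whiskers may not extend past these limits
lower_limit <- box_edges[["25%"]] - 1.5 * iqr
upper_limit <- box_edges[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside       <- body_mass_g[body_mass_g >= lower_limit & body_mass_g <= upper_limit]
whisker_ends <- range(inside)

# Anything outside the limits would be drawn as an individual point
outliers <- body_mass_g[body_mass_g < lower_limit | body_mass_g > upper_limit]

box_edges
whisker_ends
outliers
```

In this made-up example the largest value falls outside the upper limit, so the right whisker stops at the next largest value and the largest one is reported as an outlier.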
Density plots show the full distribution
i.e., a histogram has discrete bins, but a density plot is continuous
ggplot(data = invert,
mapping = aes(x = body_mass_g)) +
geom_density(fill = "pink",
alpha = 0.75) +
theme_dark()
You should now be able to complete problem 01 in the homework assignment
We will often want to compare the distributions between two groups.
We can do this by modifying our boxplot code above by adding a grouping variable as our `y = ...` in the `aes()` function
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa)) +
geom_boxplot()
We can also give each group its own fill color by adding `fill = taxa` inside the `aes()` function
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot()
We can choose our own colors with `scale_fill_manual()`, giving one color for each group in the `taxa` column
See a list of colors available in ggplot
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot() +
scale_fill_manual(values = c("dodgerblue", "hotpink"))
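Note that a column can only be mapped inside `aes()`. The following does not work, because `taxa` is passed directly as a `geom_boxplot()` argument: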
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa)) +
geom_boxplot(fill = taxa)
If you set a fixed `fill` inside the `geom_boxplot()` function, it will overwrite what you mapped above
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot(fill = "black")
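The difference between mapping a column inside `aes()` and passing an argument straight to the geom can be seen without drawing anything. In the sketch below, which assumes `ggplot2` is installed and that no object called `taxa` exists in your workspace, `tryCatch()` is used only to capture the error message instead of stopping the script; `invert_demo` is a made-up stand-in for `invert`.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  taxa        = c("beetle", "beetle", "spider", "spider"),
  body_mass_g = c(17.4, 11.3, 15.2, 2.0)
)

# Outside aes(), R looks for an object named taxa in the workspace,
# not for a column of the data frame, so this call fails
attempt <- tryCatch(
  geom_boxplot(fill = taxa),
  error = function(e) conditionMessage(e)
)
attempt  # the captured error message

# Inside aes(), taxa is looked up as a column of the data
mapped_plot <- ggplot(data = invert_demo,
                      mapping = aes(x = body_mass_g,
                                    y = taxa,
                                    fill = taxa)) +
  geom_boxplot()
```

A rule of thumb that follows from this: column names go inside `aes()`, while fixed values such as `"black"` go directly in the geom.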
Our data has the results of three surveys across months
we may want to see if there are different distributions across months
ggplot(data = invert,
mapping = aes(x = body_mass_g,
fill = month)) +
geom_boxplot()
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
this didn’t work because `month` is a number, which R interprets as a continuous variable.
notice the warning discusses “group” structure of the data
we can get around this in a number of ways.
the easiest at this point is to force `month` to be a `factor` (what R calls a categorical variable).
ggplot(data = invert,
mapping = aes(x = body_mass_g,
fill = as.factor(month))) +
geom_boxplot()
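Another of the possible ways around this is to convert the column once, before plotting, so that the legend also gets readable labels. The sketch below assumes `ggplot2` is installed and uses a made-up data frame, `invert_demo`; the new column name `month_f` is hypothetical, and `month.abb` is R's built-in vector of month abbreviations.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  month       = c(3, 7, 11, 3, 7, 11),
  body_mass_g = c(17.4, 11.3, 4.4, 15.2, 2.0, 16.0)
)

# Store month as a factor, labelling each level with its month abbreviation
invert_demo$month_f <- factor(invert_demo$month,
                              levels = c(3, 7, 11),
                              labels = month.abb[c(3, 7, 11)])

levels(invert_demo$month_f)

# The factor column can now be mapped to fill directly
month_plot <- ggplot(data = invert_demo,
                     mapping = aes(x = body_mass_g,
                                   fill = month_f)) +
  geom_boxplot()
```

Using `as.factor(month)` inside `aes()` is quicker for a one-off plot; converting the column is convenient when several plots need the same grouping.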
We can also do a similar thing with histograms.
note that we need to remove the `y=...` argument, and we also need to add `position = "identity"` to the histogram call so that the bars for each group overlap instead of being stacked on top of each other.
ggplot(data = invert,
mapping = aes(x = body_mass_g,
fill = taxa)) +
geom_histogram(position = "identity",
alpha = 0.75) +
scale_fill_manual(values = c("grey30", "goldenrod")) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You should now be able to complete problem 02 in the homework assignment
Scatter plots show the relationship between two continuous variables, `x` and `y`
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g)) +
geom_point(size = 3)
Using a categorical variable
Gives discrete colors
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
color = taxa)) +
geom_point(size = 3)
Using a continuous variable
gives “shades” of one color, based on value
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
color = month)) +
geom_point(size = 3)
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
shape = taxa)) +
geom_point(size = 4)
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
color = taxa,
shape = as.factor(month))) +
geom_point(size = 4)
`geom_smooth()`
We will often want to estimate the relationship between our predictor (`x`) and our response (`y`).
We can do this with `geom_smooth()`
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g)) +
geom_point(size = 4) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
the default is a “squiggly line” (a loess smoother, as the message output says)
we will limit ourselves to linear relationships, so add `method = "lm"`
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g)) +
geom_point(size = 4) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
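The straight line drawn with `method = "lm"` is a linear model of `y` on `x`. As a rough sketch of what is being fit, the same kind of model can be estimated directly with base R's `lm()` function; the data frame `invert_demo` below is made up for illustration and is not the lab data.

```r
# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  body_length_mm = c(15, 20, 25, 30, 35),
  body_mass_g    = c(4.4, 9.8, 11.3, 15.2, 17.4)
)

# Fit body mass as a linear function of body length (formula y ~ x)
fit <- lm(body_mass_g ~ body_length_mm, data = invert_demo)

# The intercept and slope of the fitted line
coef(fit)
```

The slope tells you how much body mass changes, on average, for each additional millimeter of body length in this made-up example.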
`geom_smooth()` inherits the mappings set in `aes()`.
mapping `color = taxa` therefore adds a separate smooth for each of the two taxa:
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
color = taxa)) +
geom_point(size = 4) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
In any of the plots, it is also possible to add facets, which gives each group their own panel:
ggplot(data = invert,
mapping = aes(x = body_length_mm,
y = body_mass_g,
color = taxa)) +
geom_point(size = 4) +
geom_smooth(method = "lm") +
facet_wrap(~taxa)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot() +
scale_fill_manual(values = c("dodgerblue", "hotpink")) +
facet_wrap(~taxa)
we can also facet by a different variable
this helps to visualize when we have 2+ grouping variables
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot() +
scale_fill_manual(values = c("dodgerblue", "hotpink")) +
facet_wrap(~month)
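When there are two grouping variables, another option (not used elsewhere in this lab) is `facet_grid()`, which lays panels out in rows and columns. The sketch below assumes `ggplot2` is installed and uses a made-up data frame, `invert_demo`, with one grouping variable for the rows and one for the columns.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  month       = rep(c(3, 7, 11), times = 4),
  taxa        = rep(c("beetle", "spider"), each = 6),
  body_mass_g = c(17.4, 11.3, 4.4, 26.2, 12.6, 15.2,
                  15.2, 2.0, 16.0, 9.8, 12.6, 11.9)
)

# One row of panels per taxa, one column of panels per month
grid_plot <- ggplot(data = invert_demo,
                    mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 5) +
  facet_grid(taxa ~ month)

grid_plot
```

The formula reads as rows ~ columns, so swapping the two names transposes the layout.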
You should now be able to finish the homework. The information below is provided for your personal edification.
The “gg” in ggplot2 stands for the grammar of graphics (Leland Wilkinson)
The idea is to uniquely describe any plot based on a defined set of information:
Geometric object(s)
Data
Mapping
Statistical transformation
Position (allows you to shift objects, e.g., spread out overlapping data points)
Facets
Coordinates (coordinate systems other than cartesian, also allows zooming)
The `ggsave()` function saves the last plot you made.
let’s rerun this code:
ggplot(data = invert,
mapping = aes(x = body_mass_g,
y = taxa,
fill = taxa)) +
geom_boxplot() +
scale_fill_manual(values = c("dodgerblue", "hotpink")) +
facet_wrap(~month)
ggsave("invert_by_month_taxa.jpg")
if you want to save it in a specific folder, add it as `"folder_name/"` (with a forward slash) in front of the file name
For example, if you followed the recommended file structure for this course, you should have a `homework` folder.
to save the last `ggplot` figure you made as a `.png` file to your homework folder, enter the following:
ggsave("homework/invert_by_month_taxa.png")
Check your folder to make sure the file is there.
Open it to see how it looks
You may want to delete this from your homework folder so you don’t get confused later.
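Because `ggsave()` uses the last plot by default, it can be safer to store the plot in an object and pass it explicitly, along with the size you want. The sketch below assumes `ggplot2` is installed and that your R installation can write png files; it uses a made-up data frame and writes to a temporary folder, so the object and file names are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the lab's invert data (values are made up)
invert_demo <- data.frame(
  taxa        = c("beetle", "beetle", "beetle", "spider", "spider", "spider"),
  body_mass_g = c(17.4, 11.3, 4.4, 15.2, 2.0, 16.0)
)

# Store the plot in an object instead of relying on "the last plot"
mass_plot <- ggplot(data = invert_demo,
                    mapping = aes(x = body_mass_g,
                                  y = taxa,
                                  fill = taxa)) +
  geom_boxplot()

# Write to a temporary folder so this sketch does not clutter your project
out_file <- file.path(tempdir(), "invert_demo_boxplot.png")

# Name the plot to save and set the size of the image
ggsave(filename = out_file,
       plot = mass_plot,
       width = 6,
       height = 4,
       units = "in",
       dpi = 300)

file.exists(out_file)
```

For the homework you would replace `out_file` with a path such as `"homework/invert_by_month_taxa.png"`.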