homework/
folderfull_name_hw##.R
justin_pomeranz_hw13.R
Source with echo
# load libraries
library(dplyr)
library(ggplot2)
Download the trout_cu.csv
and
island_plants.csv
files from D2L and put it in your “data”
folder. Read it in to R and convert it to a tibble
with the
following command.
# read in data for problem 1
trout <- read.csv("data/trout_cu.csv") %>%
as_tibble()
# data for problem 2
island <- read.csv("data/island_plants.csv") %>%
as_tibble()
trout_cu
This is a simulated data set illustrating the impacts of increasing
copper concentration cu_ppb
measured in parts per billion
on the number of trout surviving after chronic exposure. All treatments
began with 100 trout, and the number surviving at the end of the
experiment are recorded in the trout
variable.
1.1 Print the names()
and head()
of your
data object here.
1.2 What class are the two variables in this data set? Is this appropriate for a linear regression? Why or why not?
1.3 Produce a scatter plot of the data with trout
as the
response variable and cu_ppb
as the predictor variable.
1.4 Does there appear to be a linear relationship between trout survival and increasing copper concentration? Why or why not?
1.5 Does the distribution of the x-variable ( cu_ppb
)
appear to be uniformly distributed? In other words, are there “clumps”
of observations across the x-axis?
1.6 Add a comment to your homework script stating the null and alternative hypotheses for a linear regression.
1.7 Check the assumption of the normality of residuals by making a QQ-plot of this data.
1.8 Add a comment to your script interpreting the QQ-plot you made in 1.7 above. Does it appear that the residuals are normally distributed? Why or why not?
1.9 Regardless of your answer to 1.8 above, we will continue with a
simple linear regression analysis of this data. Fit a model using
lm()
for the trout data, and save it as a new data
object.
1.10 Before we look at the model output, let’s make a fitted vs. residual plot to check our assumption of Equal variance of residuals.
1.11 Add a comment to Interpret the fitted vs residual plot made above. Does this model appear to have equal variance in the residuals? Why or why not?
1.12 Now, use the anova()
function on your model fit
object above to print out an ANOVA table for your linear regression.
1.13 Add a comment to your script interpreting the ANOVA table for our linear regression analysis. Which of the two hypotheses stated above does this table refer to? Do we reject or fail to reject that null hypotheses?
1.14 Print out the summary()
of the linear model object
you created.
1.15 Add a comment to your script describing what each coefficient means.
1.16 Add a comment which has the adjusted R-squared value from our model as well as an interpretation for what it means.
1.17 Add a comment to your script writing out the full equation for our linear model.
1.18 Make a scatter plot of the data which includes a line of best fit on the plot.
geom_smooth(method = "lm")
argument to your ggplot code above.1.19 Use the coefficient estimates from the lm()
model
to estimate the number of surviving trout when copper concentration is 5
ppb.
1.20 Estimate the copper concentration for an experiment which has 35 trout surviving at the conclusion of the experiment.
cu_ppb
on its own on the left side of the equation.island_plants
If you haven’t already done so, load the data for problem 2.
# data for problem 2
island <- read.csv("data/island_plants.csv") %>%
as_tibble()
The theory of island biogeography predicts that the number of species will increase with island area. This data set is from an observational study of the number of plants occuring on islands of different sizes.
2.1 You should always get to know your data by looking at
names()
, head()
.
2.2 Produce a scatter plot of the data with Nspecies
as
the response variable and Area
as the predictor
variable.
2.3 Add a comment to your script interpreting this scatter plot. Does it appear that there is a linear relationship between the number of species and island area?
2.4 Make another scatter plot but this time use
log10_Nspp
as the response variable and
log10_area
as the predictor variable.
2.5 Add a comment to your script interpreting this scatter plot. Does it appear that there is a linear relationship between the log10-transformed number of species and island area?
The transformation of data is a very common thing to do in an analysis. Technically speaking, our assumptions of normality, equal variance, etc. also apply to data which has been transformed. In other words, if our assumptions are met after transformation, this constitutes a legitimate statistical analysis. You will often see language in publications that say “we transformed the variable(s) to meet the assumptions of normality” or similar.
Unfortunately, I ran out of time in class to go over this in detail, but just know that this is OK, and that there are a number of other common transformations out there.
2.6 Perform a linear regression using the lm()
function
using the log10 variables and save this model in a new data object.
2.7 Make a fitted vs. residual plot to check our assumption of equal variance in the residuals.
2.8 Add a comment to your script interpreting the fitted vs. residual plot.
2.9 Print out an ANOVA table of the lm
object and
interpret it here.
2.10 Print out the summary()
of your lm object here.
2.11 For each coefficient, write out the estimate and standard error and interpret the results. Be sure to list all appropriate statistics, and relate it back to the null and alternative hypotheses for each.
2.12 what is the adjusted R-squared for this model, and what does it mean?
2.13 Write out the full equation for this linear model. Be sure to put “log10” in the appropriate places for these variables.
2.14 Make another scatter plot of the transformed data and include a line of best fit.