class: center, middle, inverse, title-slide

.title[
# LECTURE 05: Estimation
]
.subtitle[
## ENVS475: Exp. Design and Analysis
]
.author[
### Spring 2023
]

---
class: inverse

# Outline

<br/>

#### 1) Descriptive Statistics

<br/>

#### 2) Inferential Statistics

<br/>

#### 3) Confidence Intervals

<br/>

---
# Review: populations vs samples

#### Population

- A collection of subjects of interest
- Often, a biologically meaningful unit
- Sometimes a process of interest

#### Sample

- A finite subset of the population of interest, i.e., the data we collect
- Samples allow us to draw inferences about the population
- Good samples are:
  + Random
  + Representative
  + Sufficiently large

---
# Review: parameters vs statistics

### Parameters

- Attributes of the population
  + Mean ( `\(\mu\)` )
  + Variance ( `\(\sigma^2\)` )
  + Standard deviation ( `\(\sigma\)` )
- Parameters are the quantities of interest, and are usually unknown

--

### Statistics

- Attributes of the sample
  + Mean ( `\(\bar{y}\)` or `\(\hat{\mu}\)` )
  + Variance ( `\(s^2\)` or `\(\hat{\sigma}^2\)` )
  + Standard deviation ( `\(s\)` or `\(\hat{\sigma}\)` )
- Often treated as estimates of parameters

---
class: inverse, center, middle

# Descriptive Statistics

---
# Descriptive Statistics

### Measures of central tendency

- **Sample mean**

`$$\large \bar{y} = \frac{\sum_{i=1}^n y_i}{n}$$`

<br/>

--

- **Median**: the "middle" of the data
  + 50% of the data is below the median and 50% is above it
  + Useful for non-normal or skewed data
  + If the data is truly normal, mean = median

<br/>

--

- **Mode**: the most frequent observation in the data

---
### Central Tendency Example:

* Weight of Wolves (kg) from Yellowstone: 40, 35, 32, 37, 35, 37, 35, 35

<br/>

* Mean (rounded): 35.8
  * `mean()`

<br/>

* Median: 35
  * `median()`

<br/>

* Mode: 35
  * no built-in function in R
  * For small data sets, `sort()` the vector and count repeats
  * 32, 35, 35, 35, 35, 37, 37, 40
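
One way to check these numbers in R (a minimal sketch; `wolves` is just a placeholder name for the example vector, and `table()` is simply a convenient way to tally repeats):

```r
# Wolf weights (kg) from Yellowstone (values from this slide)
wolves <- c(40, 35, 32, 37, 35, 37, 35, 35)

mean(wolves)    # 35.75, rounds to 35.8
median(wolves)  # 35

# No built-in mode function in base R:
# sort the data and count repeats, or tally them with table()
sort(wolves)    # 32 35 35 35 35 37 37 40
table(wolves)   # 35 appears most often (4 times)
```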

---
### Measures of dispersion

- Sample variance
  - How far, on average, are the *observations* from the mean?
  - squared scale - different from the data

`$$\large s^2 = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1}$$`

<br/>

--

- Sample standard deviation
  - How far, on average, are the *observations* from the mean?
  - same scale as the data

`$$\large s = \sqrt{s^2}$$`

<br/>

--

- Range
  * min and max values of the data

---
### Dispersion Example:

* Weight of Wolves (kg) from Yellowstone: 40, 35, 32, 37, 35, 37, 35, 35

<br/>

* Variance: 5.4
  * `var()`
  * Remember, different scale!

<br/>

* Standard Deviation: 2.3
  * `sd()`
  * same scale as data
  * SD is mostly what we will use

<br/>

* Range: 32, 40
  * `range()`
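
A matching sketch for the dispersion measures, reusing the same example vector; the last line just re-derives `sd()` from the variance formula above:

```r
# Dispersion of the wolf weights (kg)
wolves <- c(40, 35, 32, 37, 35, 37, 35, 35)

var(wolves)    # 5.357..., rounds to 5.4 (squared units: kg^2)
sd(wolves)     # 2.314..., rounds to 2.3 (same units as the data: kg)
range(wolves)  # 32 40

# sd() is the square root of the sample variance
sqrt(sum((wolves - mean(wolves))^2) / (length(wolves) - 1))  # 2.314...
```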

---
# Describing the sample

* When describing the data, always include the mean and a measure of dispersion
  + be sure to label the measure of dispersion (var or SD?)

> The weight of Yellowstone wolves is 35.8 `\(\pm\)` 2.3 kg (mean `\(\pm\)` SD)

* Alternate:

> The average ( `\(\pm\)` SD) weight of wolves in Yellowstone is 35.8 `\(\pm\)` 2.3 kg

---
# Inferential Statistics

* So far, we've been discussing *descriptive statistics* for our observations
* Analyses are often interested in comparing means, or estimates of `\(\mu\)`
* This requires *inferential statistics*

---
# Inferential Statistics

* Weight of Wolves (kg) from Yellowstone: 40, 35, 32, 37, 35, 37, 35, 35
  * Mean: 35.8
  * SD: 2.3

<br/>

### How confident are we in this value?

* If we measured another 8 wolves, what would our estimates be?
  * 2nd Sample: 33, 28, 40, 43, 35, 31, 40, 42
  * 2nd Mean: 36.5
  * 2nd SD: 5.5

---
# Inferential Statistics

### Standard error of the mean: SEM

<br/>

* How far, on average, is the sample mean `\(\bar{y}\)` from the true mean `\(\mu\)`?

<br/>

* We could repeat the experiment over and over again to get a *sampling distribution*

<br/>

* Rarely done due to time, logistics, funding, etc.

<br/>

* Luckily, we can estimate the SEM from a single sample

---
# Inferential Statistics

### Standard error of the mean: SEM

`$$\Large SEM = \frac{s}{\sqrt{n}}$$`

<br/>

* Remember, the SEM tells us how far, on average, the sample mean `\(\bar{y}\)` is from the true mean `\(\mu\)`
* The SEM is like a standard deviation *of the mean*

---
### Helpful(?) mnemonic

<br/>

* SD: dispersion for the data (**D**)
  * How far is an individual observation from the sample mean?

<br/>

* SEM: dispersion for the mean (**M**)
  * How far is the sample mean from the true mean?

<br/>

* Note that SEM is sometimes just abbreviated as SE, so this mnemonic isn't always obvious.

---
### Point estimates

* `\(\bar{y}\)` is a point estimate of the true mean, `\(\mu\)`
* Point estimates on their own are of limited value
* Always include a measure of precision when you report a mean

`$$\Large \bar{y} \pm SEM$$`

--

#### Wolf sample 1:

* Mean: 35.8
* SEM: 0.8

--

`\(35.8 \pm 0.8\)` (mean `\(\pm\)` SEM)

--

* Note the similarity with the sample description shown previously

--

* Always label your dispersion estimate!

--

* You can also present the interval by completing the equation:
  * Interval = 35, 36.6

---
class: inverse, center, middle

# Confidence Intervals

---
# Confidence Intervals

- A confidence interval (CI) is a range of estimates for an unknown parameter.
- CIs are computed for different confidence levels: most often 95%
- Assuming a normal distribution, the 95% CI is approximately:

`$$\Large 95\%CI =\bar{y} \pm 2*SEM$$`

  + Recall the empirical rule: 95% of the data fall within `\(\pm 2~SD\)`
  + In future classes we will discuss the equation for a CI from a *t*-distribution, i.e., modify the value of 2 in the above equation

- If we repeated the sampling many times and calculated a 95% CI from each sample, about 95% of those intervals would contain the true population mean
  + **NOTE** it *does not mean* that we are 95% sure that the true mean is in the CI

---
### Repeated samples from Population

- 100 repeated samples of wolf weights
- Increased to 25 wolves per sample

<img src="lecture_05_estimation_files/figure-html/CI-plot-1.png" width="504" style="display: block; margin: auto;" />

* 97 / 100 CIs contain the true mean (~97%)

---
### Approximate 95% CI for Yellowstone Wolves

`$$\Large 95\%CI =\bar{y} \pm 2*SEM$$`

--

<br/>

`$$\large \bar{y} = 35.8$$`

--

<br/>

`$$\large SEM = 0.8$$`

--

<br/>

`$$\large 2 * SEM = 1.6$$`

--

<br/>

`$$\Large 95\%CI = 34.2, 37.4$$`
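
A minimal sketch of the SEM and approximate 95% CI calculations in R, using the first wolf sample; the small difference from 34.2 in the lower bound comes from rounding the mean and SEM on this slide:

```r
# SEM and approximate 95% CI for the first Yellowstone wolf sample
wolves <- c(40, 35, 32, 37, 35, 37, 35, 35)

n    <- length(wolves)        # 8
sem  <- sd(wolves) / sqrt(n)  # 0.818..., rounds to 0.8
ybar <- mean(wolves)          # 35.75, rounds to 35.8

# Approximate 95% CI: mean +/- 2 * SEM
ybar + c(-2, 2) * sem         # about 34.1 and 37.4 (34.2 above uses the rounded 35.8 and 0.8)
```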

---
### Using Confidence Intervals to test hypotheses

* Previous research has indicated that mean wolf weight in the Arctic is 33 kg.
* Does this differ from the wolves in Yellowstone?
* CI for Yellowstone wolf weight (1st sample):

`$$95\%CI =\bar{y} \pm 2*SEM$$`

<br/>

`$$35.8 \pm 2 * 0.8 = (34.2,~37.4)$$`

* The Arctic weight of 33 kg is outside (below) the 95% CI for Yellowstone wolves.
* **Interpretation**: The data support the conclusion, with 95% confidence, that the weight of Yellowstone wolves is between 34.2 and 37.4 kg, and this is higher than the mean estimated weight of 33 kg for wolves in the Arctic.

---
### Changing the Confidence Interval

* We can approximate a 99% CI by multiplying the SEM by 3

`$$99\%CI =\bar{y} \pm 3*SEM$$`

`$$35.8 \pm 3 * 0.8 = (33.4,~38.2)$$`

* How does the 99% CI compare with the 95% CI?
  + *i.e.*, which is wider and which is narrower?
  + Why do you think this is?

---
### Using Confidence Intervals to test hypotheses

* Data from British Columbia indicates that the mean weight of wolves there is 34.2 kg.

--

* Does the weight of wolves in BC differ from that of wolves in Yellowstone?

--

* 34.2 falls within (just at the lower edge of) the 95% CI (34.2, 37.4) for Yellowstone wolves.

--

* **Interpretation**: The data support the conclusion that the weight of Yellowstone wolves is between 34.2 and 37.4 kg (95% CI), and that this is not different from the mean estimated weight of 34.2 kg for wolves in British Columbia.

--

* Note that we did not use CIs for the Arctic or BC wolves.
* Using CIs is one form of (rough) hypothesis testing (see the short R sketch on the final slide)
* We can test these hypotheses more formally with t-tests and general linear models
* We will get to this later in the course.

---
# Looking ahead

### **Wednesday**: Point estimates and CI Lab

<br/>

### **Friday**: Hw 05 - For a grade

<br/>

### **Reading**: Hector Chapter 5
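
---
### Aside: checking the CI comparisons in R

A minimal sketch of the CI-based comparisons from the hypothesis-testing slides; the Arctic (33 kg) and British Columbia (34.2 kg) means are the values quoted there, and `arctic_mean` / `bc_mean` are just placeholder names:

```r
# Approximate 95% CI for the first Yellowstone sample,
# then check the Arctic and BC means against it
wolves <- c(40, 35, 32, 37, 35, 37, 35, 35)

sem <- sd(wolves) / sqrt(length(wolves))
ci  <- mean(wolves) + c(-2, 2) * sem   # roughly 34.1 to 37.4

arctic_mean <- 33    # value quoted on the Arctic slide
bc_mean     <- 34.2  # value quoted on the British Columbia slide

arctic_mean >= ci[1] & arctic_mean <= ci[2]  # FALSE: below the CI, so different
bc_mean     >= ci[1] & bc_mean     <= ci[2]  # TRUE: inside the CI, so not clearly different
```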