This section just ensures you have the correct libraries. Run the code block below: if no errors appear, continue; otherwise, follow the instructions in the block or ask for help.
#install.packages("pryr")
data("cars")
# If tidyverse works
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# If tidyverse doesn't work:
# 1. Comment out the "library(tidyverse)" line.
# 2. Uncomment the below lines and run the block
#install.packages("dplyr")
#library(dplyr)
Regression is a very important part of many statistical activities, ranging from simple prediction and displaying a trendline to being a component of sophisticated machine learning algorithms. Regression can also be used to select which features/variables in the data are useful, which matters when trying to reduce the size of a problem. At a high level, it is a mathematical technique used to predict the value of an outcome Y given input variables X.
To understand regression, it helps to know the basics of calculus and of linear algebra (vectors, matrices, and operations on matrices).
Regression can be done in n dimensions, but the math remains very similar to the one-dimensional case. When regression uses only one input to predict an output, it is called simple linear regression.
The simple linear regression equation is:
\[ y_i = w_0 + w_1x_i + err_i \]
Here \(y_i\) is the output for input \(i\), \(w_0\) is the intercept term, and \(w_1\) is the slope of the line. Simple linear regression is "simple" because you are just fitting a line: this is essentially the equation of a line you saw in calculus, plus a noise term. The slightly harder part is finding which line fits the set of points "best".
To decide what the best line is, we have to come up with a way of measuring how good one fit is compared to another. In many cases we can see which line fits best just by observation, but when two different lines look equally good we need a measure; here I will introduce a fairly standard one known as the residual sum of squares (RSS).
\[RSS = \sum_{i=1}^{n}(y_i - [w_0 + w_1x_i])^2\]
This measures the total distance from the observed values to the fitted line. In the original equation, the noise term \(err_i\) accounts for the difference between the line and the observed value. Each difference is squared so that negative and positive deviations are treated consistently, and so that points far from the line are penalized more heavily.
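To make this concrete, here is a minimal sketch that computes the RSS of a candidate line on the cars data loaded in the setup block (the helper name rss_line and the candidate weights are my own, chosen only for illustration):
# RSS of the line y = w0 + w1*x against observed points (illustrative helper).
rss_line <- function(w0, w1, x, y) {
  sum((y - (w0 + w1 * x))^2)
}
# Compare two candidate lines on the cars data: lower RSS means a better fit.
rss_line(-17, 4, cars$speed, cars$dist)
rss_line(0, 2, cars$speed, cars$dist)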
But how do we find the line with the minimum RSS?
You know from calculus that finding a minimum usually involves some sort of derivative. One line fits best (has the least error), so if we start from a candidate line and keep improving it, we will converge to that minimum RSS. (Think about why there can't be two distinct lines of minimum RSS.)
The algorithm used to find this best line is known as gradient descent. It gives an approximately correct answer and works by repeatedly stepping downhill on the error surface. Gradient descent is usually done in the back end of most packages, but I will give a short description of the algorithm here. Many resources online can explain a step or the theory if it is unclear.
The idea is that given a step size (n), a weight vector (w), and a tolerance: while not converged: \[w^{(t+1)} = w^{(t)} - n\nabla g(w^{(t)})\]
In the case of simple linear regression, we want: \[min_{w_0,w_1}\sum_{i=1}^{N}(y_i-[w_0+w_1 x_i])^2\]
So we need the gradient of the RSS expression with respect to \(w_0\) and \(w_1\), which (you can check this yourself) evaluates to:
\[ \nabla RSS(w_0, w_1) = \begin{bmatrix} -2\sum_{i=1}^{N}[y_i-(w_0+w_1x_i)] \\ -2\sum_{i=1}^{N}[y_i-(w_0+w_1x_i)]x_i \end{bmatrix} \]
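Here is a minimal sketch of that update rule applied to simple linear regression on the cars data (the starting weights, step size, iteration cap, and tolerance are arbitrary choices of mine for illustration, not values from any package); the result should land close to what lm() reports later in this notebook:
# Gradient descent for simple linear regression on the cars data (illustrative settings).
x <- cars$speed
y <- cars$dist
w <- c(0, 0)      # starting weights (w0, w1)
step <- 5e-5      # step size; must be small enough for the updates to converge
tol <- 1e-3       # stop once the gradient is essentially zero
for (t in 1:200000) {
  resid <- y - (w[1] + w[2] * x)
  grad <- c(-2 * sum(resid), -2 * sum(resid * x))   # the gradient derived above
  if (sqrt(sum(grad^2)) < tol) break
  w <- w - step * grad
}
w   # should be close to coef(lm(dist ~ speed, data = cars))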
This idea translates to more than two dimensions; in that case, rather than finding the minimum of a curve, it involves finding the minimum of a surface in a higher-dimensional space.
To demonstrate linear regression in R, we will use the built-in cars dataset, which has speed and stopping distance measurements for 50 cars.
cars
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
The following questions are some filtering tasks to build more familiarity with the data (a sketch of the general filtering pattern follows the questions):
What’s the average speed of cars with distance larger than 30?
What’s the average distance for cars with speed larger than 12?
What’s the first quartile for the speed of cars with speed greater than 12 and distance greater than 50?
For cars with distance greater than 80, is the mean or median of the speed higher?
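As a hint for the style of code these questions call for (without answering any of them), here is a sketch of the filter-then-summarise pattern; the cutoff of 40 below is made up and is not one of the questions:
# Example pattern only: mean speed of cars whose distance is larger than 40.
cars %>%
  filter(dist > 40) %>%
  summarise(mean_speed = mean(speed))
# The same idea in base R, in case dplyr is unavailable:
mean(cars$speed[cars$dist > 40])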
First we should plot the variables. Here I will explore a few more specific plots than the freestyle ones you used in Titanic.
We will look at scatter plots (which can reveal a trend), boxplots (which can identify outliers), and density plots (to check whether the response variable is close to normally distributed).
# Scatter Plots
scatter.smooth(cars$speed, cars$dist, main="speed vs dist")
# Box Plots
par(mfrow=c(1,2))
boxplot(cars$speed,main="speed")
boxplot(cars$dist,main="dist")
# Density Plots
plot(density(cars$speed), main="speed")
plot(density(cars$dist), main="dist")
Next we can check the correlation of the two variables to see whether they are in fact related. Correlation is a measure of how strongly linear the relationship between the variables is. The correlation coefficient is calculated as follows: \[r_{xy} = \frac{s_{xy}}{s_x s_y}\] Here \(s_{xy}\) is the covariance, while \(s_x\) and \(s_y\) are the standard deviations.
The (sample) standard deviation is calculated with: \[s_x=\sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n-1}}\] and the covariance with: \[s_{xy} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{n-1}\]
A correlation coefficient close to 0 implies a weak linear relation, while values close to +1 or -1 imply a strong positive or negative linear relationship, respectively. In more formal terms, a correlation coefficient close to 0 suggests that the variation in the response variable (Y) is largely unexplained by the predictor (X).
In R, you can calculate the correlation coefficient as follows:
cor(cars$speed, cars$dist)
## [1] 0.8068949
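We can reproduce this number by hand from the covariance and the standard deviations (a quick check; R's cov() and sd() both use the sample, n-1, versions, matching the formulas above):
# r = s_xy / (s_x * s_y); should reproduce the 0.8068949 printed above
cov(cars$speed, cars$dist) / (sd(cars$speed) * sd(cars$dist))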
This subsection covers building the linear model and analyzing it.
In R, you can define and save a linear model (or any model) the same way you save functions or variables, with the assignment operator "<-". Below, we create a linear model for distance as a function of speed:
linearModel <- lm(dist ~ speed, data=cars)
print(linearModel)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
We can see more information below:
summary(linearModel)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
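To tie the fitted model back to the \(y_i = w_0 + w_1x_i\) form, we can pull out the coefficients and compute a prediction by hand (the speed of 21 below is just an arbitrary example of mine):
w <- coef(linearModel)   # w[1] is the intercept, w[2] is the slope for speed
w[1] + w[2] * 21         # predicted stopping distance at a speed of 21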
Here is a method for using the linear model above to predict values for new points:
# first we set a seed
set.seed(100)
# we will separate some data to use for testing our model (in this case 10 percent)
testingRowIndex <- sample(1:nrow(cars), 0.1*nrow(cars))
testData <- cars[testingRowIndex, ]
# predict distance from speed inputs
distPred <- predict(linearModel, testData)
summary (linearModel) # model summary
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Now we can make a table that compares the predictions to the actual values, and calculate the residual sum of squares.
actuals_preds <- data.frame(cbind(actuals=testData$dist, predicteds=distPred)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds) # 82.7%
head(actuals_preds)
## actuals predicteds
## 16 26 33.542219
## 13 20 29.609810
## 27 32 45.339445
## 3 4 9.947766
## 22 60 37.474628
# Residual Sum Squares
sum((actuals_preds$predicteds - actuals_preds$actuals) ** 2)
## [1] 869.9426
Finally, we will plot our regression line on the points we have from the data using the following code:
plot(cars$speed, cars$dist)
abline(linearModel)
We can see that the data has quite a lot of variation, so the linear model only does an okay job of predicting.
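Since the tidyverse is loaded, the same picture can also be drawn with ggplot2, which fits and overlays the regression line in one step (a sketch, not required for the exercises):
# Scatter plot with the fitted regression line overlaid via geom_smooth.
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)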
NOTE: To try a different train/test split, set a new seed (0) and rerun the code above. It is a good idea to do this in a separate R script or R Markdown notebook.
QUESTION: Looking at the plots, what distance does the model you just built predict for a car travelling at speed 30?
TASK: What are the predictions you get for the following set of speeds? 2,3,5,15,26,30
HINT: Put the data in the correct form using "as.data.frame" and a vector of the required speeds.
2nd HINT: Use the "colnames" function to give the data frame column the right name, then predict using the method above.
Multiple Linear Regression (MLR) works almost the same mechanically as simple linear regression (SLR), except that it accounts for multiple inputs. At a high level, the formula changes from \(y_i = w_0 + w_1x_i + err_i\) to \(y_i = w_0 + w_1x_{i1} + w_2x_{i2} + \dots + err_i\)
Code for it is as follows:
swissModel <- lm(Infant.Mortality ~ Fertility + Agriculture + Catholic, data=swiss)
summary(swissModel)
##
## Call:
## lm(formula = Infant.Mortality ~ Fertility + Agriculture + Catholic,
## data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3672 -1.4970 0.0909 1.6089 5.3076
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.565009 2.362940 5.741 8.7e-07 ***
## Fertility 0.112074 0.036105 3.104 0.00337 **
## Agriculture -0.032335 0.019207 -1.683 0.09953 .
## Catholic 0.003754 0.011045 0.340 0.73560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.652 on 43 degrees of freedom
## Multiple R-squared: 0.225, Adjusted R-squared: 0.1709
## F-statistic: 4.161 on 3 and 43 DF, p-value: 0.01128
Now is a good time to understand what the summary command reports.
+ The first element ("Call") repeats the formula that was passed in.
+ The residuals section shows how the residuals are distributed. Residuals are the differences between the actual points and the model predictions; if the residuals are distributed roughly symmetrically around 0, that is a good sign. It is usually a good idea to plot the residuals and check their distribution, although that may not always be practical. Generally, if the residuals are spread evenly around 0 and show no pattern of dispersion, that is a good sign.
plot(swissModel$residuals, main="Residuals Plot")
abline(h=0)
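The pieces the summary prints can also be pulled out programmatically, which is handy when you want to use them in code rather than read them off the screen (a sketch using standard lm accessors):
s <- summary(swissModel)
s$coefficients                    # estimates, standard errors, t values, p-values
quantile(residuals(swissModel))   # the five-number summary of the residuals
s$r.squared                       # multiple R-squared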
One byproduct of all of this information is that we can now discard features that add little value. That way, when predicting we need fewer input fields and the model runs on fewer variables, both of which save runtime.
There are two easy ways you can choose to discard variables:
In summary, all of the above methods are used heavily in traditional experiments and form a large part of statistical theory. They are often glossed over today, but they are still good to know: if you get odd results in the future, they can point you toward the cause. Another very important part of these applications is the ability to discard useless variables.
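As one illustration (an example of my own, not necessarily one of the two ways referred to above), you could drop the predictor with the largest p-value in the summary above (Catholic) and check whether the fit suffers:
# Refit without the weakest predictor and compare the nested models with an F-test.
swissModelSmall <- update(swissModel, . ~ . - Catholic)
anova(swissModelSmall, swissModel)   # a large p-value suggests Catholic adds little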
Answer the questions from above (except for the proof in Q5, they are all numerical or TRUE/FALSE).
Fill in Answers for All Questions (above)
You will get a dataset and be required to create a model that can predict an output. The datasets will be split among the groups.
The task is to follow the form of the notebook above to create a similar analysis, and then, as a group, give a 2-3 minute presentation explaining what you have learned from the data.
You must:
- Use some filters to highlight features of the data
- Plot the data (use different plots for different things)
- Build a linear model
- Split the data and do some predictions
- Measure the accuracy
- Find some conclusion (i.e., which of the multiple inputs predicts the output best)
If you want to save a LOT of time, figure out how to do multiple regression and use all 3 inputs at once to predict the output.
Predict rating [output] given: complaints, privileges, learning [inputs]
data("attitude")
rating <- attitude[[1]]
complaints <- attitude[[2]]
privileges <- attitude[[3]]
learning <- attitude[[4]]
Predict rating [output] given: raises, critical, advance [inputs]
data("attitude")
rating <- attitude[[1]]
raises <- attitude[[5]]
critical <- attitude[[6]]
advance <- attitude[[7]]
attitude
## rating complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
## 7 58 67 42 56 66 68 35
## 8 71 75 50 55 70 66 41
## 9 72 82 72 67 71 83 31
## 10 67 61 45 47 62 80 41
## 11 64 53 53 58 58 67 34
## 12 67 60 47 39 59 74 41
## 13 69 62 57 42 55 63 25
## 14 68 83 83 45 59 77 35
## 15 77 77 54 72 79 77 46
## 16 81 90 50 72 60 54 36
## 17 74 85 64 69 79 79 63
## 18 65 60 65 75 55 80 60
## 19 65 70 46 57 75 85 46
## 20 50 58 68 54 64 78 52
## 21 50 40 33 34 43 64 33
## 22 64 61 52 62 66 80 41
## 23 53 66 52 50 63 80 37
## 24 40 37 42 58 50 57 49
## 25 63 54 42 48 66 75 33
## 26 66 77 66 63 88 76 72
## 27 78 75 58 74 80 78 49
## 28 48 57 44 45 51 83 38
## 29 85 85 71 71 77 74 55
## 30 82 82 39 59 64 78 39
Predict Fertility[output] given: Infant.Mortality, Education, Agriculture [inputs]
data("swiss")
#output(y)
Fertility <- swiss[[1]]
Infant.Mortality <- swiss[[6]]
Education <- swiss[[4]]
Agriculture <- swiss[[2]]
#swiss
Predict Infant Mortality [output] given: Fertility, Agriculture, Catholic [inputs]
data("swiss")
#output(y)
Infant.Mortality <- swiss[[6]]
#input(x)
Fertility <- swiss[[1]]
Agriculture <- swiss[[2]]
Catholic <- swiss[[5]]
swiss
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Veveyse 87.1 64.5 14 6 98.61
## Aigle 64.1 62.0 21 12 8.52
## Aubonne 66.9 67.5 14 7 2.27
## Avenches 68.9 60.7 19 12 4.43
## Cossonay 61.7 69.3 22 5 2.82
## Echallens 68.3 72.6 18 2 24.20
## Grandson 71.7 34.0 17 8 3.30
## Lausanne 55.7 19.4 26 28 12.11
## La Vallee 54.3 15.2 31 20 2.15
## Lavaux 65.1 73.0 19 9 2.84
## Morges 65.5 59.8 22 10 5.23
## Moudon 65.0 55.1 14 3 4.52
## Nyone 56.6 50.9 22 12 15.14
## Orbe 57.4 54.1 20 6 4.20
## Oron 72.5 71.2 12 1 2.40
## Payerne 74.2 58.1 14 8 5.23
## Paysd'enhaut 72.0 63.5 6 3 2.56
## Rolle 60.5 60.8 16 10 7.72
## Vevey 58.3 26.8 25 19 18.46
## Yverdon 65.4 49.5 15 8 6.10
## Conthey 75.5 85.9 3 2 99.71
## Entremont 69.3 84.9 7 6 99.68
## Herens 77.3 89.7 5 2 100.00
## Martigwy 70.5 78.2 12 6 98.96
## Monthey 79.4 64.9 7 3 98.22
## St Maurice 65.0 75.9 9 9 99.06
## Sierre 92.2 84.6 3 3 99.46
## Sion 79.3 63.1 13 13 96.83
## Boudry 70.4 38.4 26 12 5.62
## La Chauxdfnd 65.7 7.7 29 11 13.79
## Le Locle 72.7 16.7 22 13 11.22
## Neuchatel 64.4 17.6 35 32 16.92
## Val de Ruz 77.6 37.6 15 7 4.97
## ValdeTravers 67.6 18.7 25 7 8.65
## V. De Geneve 35.0 1.2 37 53 42.34
## Rive Droite 44.7 46.6 16 29 50.43
## Rive Gauche 42.8 27.7 22 29 58.33
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
## Veveyse 24.5
## Aigle 16.5
## Aubonne 19.1
## Avenches 22.7
## Cossonay 18.7
## Echallens 21.2
## Grandson 20.0
## Lausanne 20.2
## La Vallee 10.8
## Lavaux 20.0
## Morges 18.0
## Moudon 22.4
## Nyone 16.7
## Orbe 15.3
## Oron 21.0
## Payerne 23.8
## Paysd'enhaut 18.0
## Rolle 16.3
## Vevey 20.9
## Yverdon 22.5
## Conthey 15.1
## Entremont 19.8
## Herens 18.3
## Martigwy 19.4
## Monthey 20.2
## St Maurice 17.8
## Sierre 16.3
## Sion 18.1
## Boudry 20.3
## La Chauxdfnd 20.5
## Le Locle 18.9
## Neuchatel 23.0
## Val de Ruz 20.0
## ValdeTravers 19.5
## V. De Geneve 18.0
## Rive Droite 18.2
## Rive Gauche 19.3
It is possible you are wondering: why only linear regressions? Even in many dimensions, aren’t some curves better approximated with polynomials?
You are right! There is polynomial regression too, and in R it looks surprisingly similar to linear regression. I will show a quick example here, plotting functions directly in R. By the way, plotting functions in R can be useful for non-statistical reasons as well!
set.seed(20)
#values in sequence
q <- seq(from=0, to=20, by=0.1)
#values to predict
y <- 500 + .4*(q-10)^3
# add some noise so graph looks better
noise <- rnorm(length(q), mean=10, sd=80)
noisy.y <- y + noise
# plot points
plot(q, noisy.y, col='skyblue4', xlab='q')
lines(q,y,col='firebrick1', lwd=3)
# Cubic vs Quadratic vs Linear Model
modelCube <- lm(noisy.y ~ poly(q,3))
modelQuad <- lm(noisy.y ~ poly(q,2))
modelLinr <- lm(noisy.y ~ poly(q,1))
# Summarize
summary(modelCube)
##
## Call:
## lm(formula = noisy.y ~ poly(q, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.326 -51.186 4.276 61.485 165.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 513.615 5.602 91.69 <2e-16 ***
## poly(q, 3)1 2075.899 79.422 26.14 <2e-16 ***
## poly(q, 3)2 -108.004 79.422 -1.36 0.175
## poly(q, 3)3 864.025 79.422 10.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.42 on 197 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.8001
## F-statistic: 267.8 on 3 and 197 DF, p-value: < 2.2e-16
summary(modelQuad)
##
## Call:
## lm(formula = noisy.y ~ poly(q, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.977 -61.517 3.223 68.062 233.512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 513.62 7.07 72.649 <2e-16 ***
## poly(q, 2)1 2075.90 100.23 20.711 <2e-16 ***
## poly(q, 2)2 -108.00 100.23 -1.078 0.283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.2 on 198 degrees of freedom
## Multiple R-squared: 0.6848, Adjusted R-squared: 0.6816
## F-statistic: 215.1 on 2 and 198 DF, p-value: < 2.2e-16
summary(modelLinr)
##
## Call:
## lm(formula = noisy.y ~ poly(q, 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -329.81 -63.50 6.30 65.52 217.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 513.615 7.073 72.62 <2e-16 ***
## poly(q, 1) 2075.899 100.272 20.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.3 on 199 degrees of freedom
## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6813
## F-statistic: 428.6 on 1 and 199 DF, p-value: < 2.2e-16
# Predictions + Lines
df <- data.frame(x=q,y)
predicted.intervals.Cube <- predict(modelCube,df)
lines(q,predicted.intervals.Cube, col='tan3', lwd=3)
predicted.intervals.Quad <- predict(modelQuad,df)
lines(q,predicted.intervals.Quad, col='darkolivegreen2', lwd=3)
predicted.intervals.Lin <- predict(modelLinr,df)
lines(q,predicted.intervals.Lin, col='steelblue4', lwd=3)
# Residual sum of squares of each fit, measured against the noiseless y values
RSScube <- sum((predicted.intervals.Cube - y)**2)
RSSquad <- sum((predicted.intervals.Quad - y)**2)
RSSlin <- sum((predicted.intervals.Lin - y)**2)
# Print Results
print(paste("RSScube (Tan): ", RSScube))
## [1] "RSScube (Tan): 55675.9249100325"
print(paste("RSSquad (Green): ", RSSquad))
## [1] "RSSquad (Green): 812791.814722302"
print(paste("RSSlin (Blue): ", RSSlin))
## [1] "RSSlin (Blue): 801127.027085057"