This assessment involves applying the concepts you have learned so far. It is open book, but keep in mind that the datasets are different. You are expected to submit working code at the end as a knitted document in HTML format.
The assessment has two questions: one focuses on regression and the other on classification.
For both questions, first use filtering functions and some plots to get a feel for what the data is like. Then remove any features (columns) that are irrelevant or not needed.
Predict the maximum temperature from all of the other features (excluding mean temperature) of a dataset of WW2 weather observations. If you need to separate by location, go ahead. Perform a proper cross-validation. The code should appear in the code block below, with comments specifying what you did, and it should end with some sort of conclusion.
Make sure the conclusion explains which features are useful and which are not, based on hypothesis testing and the results in the model summary.
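Before fitting anything, the brief asks for a quick look at the data. A minimal exploration sketch, assuming the same summaryOfWeather.csv file and the column names used in the model below, might be:
# Hypothetical first look at the weather data before modelling
weathers <- read.csv("./Datasets/summaryOfWeather.csv", header=TRUE)
str(weathers)              # column types and a few example values
summary(weathers$MaxTemp)  # range and quartiles of the response
hist(weathers$MaxTemp, main="Distribution of MaxTemp", xlab="MaxTemp")
# Relationship between the response and the most obvious predictor
plot(as.numeric(weathers$MinTemp), weathers$MaxTemp, xlab="MinTemp", ylab="MaxTemp")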
filepath <- "./Datasets/summaryOfWeather.csv"
weathers <- read.csv(filepath, header=TRUE)
myModel <- lm(MaxTemp ~ as.numeric(MinTemp) + as.numeric(Precip) + as.numeric(Snowfall) + as.numeric(PoorWeather), data=weathers)
summary(myModel)
##
## Call:
## lm(formula = MaxTemp ~ as.numeric(MinTemp) + as.numeric(Precip) +
##     as.numeric(Snowfall) + as.numeric(PoorWeather), data = weathers)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -52.583  -2.647  -0.396   2.164  37.799
##
## Coefficients:
##                           Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept)             12.4202379   0.0408799   303.82    <2e-16 ***
## as.numeric(MinTemp)      0.9062617   0.0015038   602.63    <2e-16 ***
## as.numeric(Precip)      -0.0046482   0.0000592   -78.52    <2e-16 ***
## as.numeric(Snowfall)    -0.2186486   0.0060262   -36.28    <2e-16 ***
## as.numeric(PoorWeather) -0.1139633   0.0058799   -19.38    <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.016 on 119035 degrees of freedom
## Multiple R-squared: 0.7878, Adjusted R-squared: 0.7877
## F-statistic: 1.104e+05 on 4 and 119035 DF, p-value: < 2.2e-16
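The brief also asks for a proper cross-validation, which the block above does not show. A minimal 10-fold cross-validation sketch for the same model (the seed, fold count, and RMSE metric are illustrative choices, not part of the output above) could look like this:
# Hypothetical 10-fold cross-validation of the linear model above.
# Assumes `weathers` is already loaded as in the previous block.
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(weathers)))
cvRMSE <- numeric(k)
for (i in 1:k) {
  trainSet <- weathers[folds != i, ]
  testSet  <- weathers[folds == i, ]
  fit <- lm(MaxTemp ~ as.numeric(MinTemp) + as.numeric(Precip) +
              as.numeric(Snowfall) + as.numeric(PoorWeather), data=trainSet)
  preds <- predict(fit, newdata=testSet)
  cvRMSE[i] <- sqrt(mean((testSet$MaxTemp - preds)^2, na.rm=TRUE))
}
mean(cvRMSE)  # average held-out RMSE across the 10 folds
An average held-out RMSE close to the residual standard error reported above (about 4) would indicate that the model generalises and is not overfitting.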
Build a classification model to predict which medal (or none, recorded as NA) an athlete will receive, given the other features in this Olympic dataset. Again, provide some sort of conclusion.
filepath <- "./Datasets/athlete_events.csv"
events <- read.csv(filepath, header=TRUE)
events <- events[events$Sport %in% c("Basketball", "Rugby Sevens", "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating", "Cross Country Skiing", "Athletics", "Ice Hockey", "Swimming", "Badminton", "Sailing", "Biathlon"),]
events$Medal <- unlist(lapply(events$Medal, FUN=function(elem) ifelse(is.na(elem), 0, elem)))
events$Sport <- as.factor(as.vector(events$Sport))
subEvents <- events[1:10000,]
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# Fit a random forest classifier for Medal, dropping rows with missing predictor values
classifier <- randomForest(as.factor(Medal) ~ Sex + Age + Height + Weight + Sport, data=subEvents, na.action=na.exclude)
# Print the OOB error estimate and confusion matrix
classifier
##
## Call:
## randomForest(formula = as.factor(Medal) ~ Sex + Age + Height + Weight + Sport, data = subEvents, na.action = na.exclude)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.03%
## Confusion matrix:
##      0  1  2 3 class.error
## 0 7040 20 38 8 0.009287926
## 1  318  2  9 3 0.993975904
## 2  323  6 40 5 0.893048128
## 3  318  2 12 4 0.988095238
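The question also asks for a conclusion, and the confusion matrix above already points to the main finding: the classes are heavily imbalanced, so the forest predicts the no-medal class (0) well but misses most actual medallists (class errors near 0.99 for two of the medal classes). A short follow-up sketch that would support such a conclusion (the held-out row range and column selection here are illustrative, not part of the original submission) could be:
# Hypothetical follow-up: which predictors does the forest rely on,
# and how does it behave on rows that were not used for training?
importance(classifier)   # mean decrease in Gini per predictor
varImpPlot(classifier)   # the same information as a plot
holdout <- na.omit(events[10001:20000, c("Sex", "Age", "Height", "Weight", "Sport", "Medal")])
preds <- predict(classifier, newdata=holdout)
table(predicted=preds, actual=holdout$Medal)  # held-out confusion matrix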