Final Assessment

This assessment applies the concepts you have learned so far. It is open book, but note that the datasets differ from those used in class. You are expected to submit working code at the end, in a knitted document in HTML format.

The assessment has two questions: one focuses on regression and the other on classification.

For both questions, first use filtering functions and some plots to get a feel for the data. Then remove any features (columns) that are irrelevant to the task.
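A minimal exploration sketch of the kind asked for above. It uses a small synthetic stand-in data frame (with column names matching the weather dataset used later) so the snippet runs on its own; with the real data you would run the same calls on the loaded CSV.

``` r
# Synthetic stand-in for the weather data (real data comes from read.csv)
weathers <- data.frame(MaxTemp = c(25, 30, 28, NA, 22),
                       MinTemp = c(15, 20, 18, 12, 10))

# Inspect structure and summary statistics
str(weathers)
summary(weathers)

# Filter out rows with a missing target value
complete <- weathers[!is.na(weathers$MaxTemp), ]

# Quick scatter plot to eyeball the MinTemp/MaxTemp relationship
plot(complete$MinTemp, complete$MaxTemp,
     xlab = "MinTemp", ylab = "MaxTemp")
```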

First task: Regression

Using a dataset of WW2 weather records, predict the maximum temperature from all of the other features, excluding mean temperature. If you need to separate the data by location, go ahead. Perform a proper cross-validation. The code should appear in the code block below, with comments explaining what you did, and there should be some sort of conclusion at the end.

In the conclusion, make sure you explain which features are good predictors and which are not, based on the hypothesis tests reported in the model summary.

# Load the WW2 weather summary dataset
filepath <- "./Datasets/summaryOfWeather.csv"
weathers <- read.csv(filepath, header=TRUE)

# Linear model: predict MaxTemp from the remaining features (MeanTemp excluded).
# Note: as.numeric() on a factor column returns the level codes, not the
# underlying values; as.numeric(as.character(x)) is the safer coercion.
myModel <- lm(MaxTemp ~ as.numeric(MinTemp) + as.numeric(Precip) + as.numeric(Snowfall) + as.numeric(PoorWeather), data=weathers)
summary(myModel)
## 
## Call:
## lm(formula = MaxTemp ~ as.numeric(MinTemp) + as.numeric(Precip) + 
##     as.numeric(Snowfall) + as.numeric(PoorWeather), data = weathers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.583  -2.647  -0.396   2.164  37.799 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             12.4202379  0.0408799  303.82   <2e-16 ***
## as.numeric(MinTemp)      0.9062617  0.0015038  602.63   <2e-16 ***
## as.numeric(Precip)      -0.0046482  0.0000592  -78.52   <2e-16 ***
## as.numeric(Snowfall)    -0.2186486  0.0060262  -36.28   <2e-16 ***
## as.numeric(PoorWeather) -0.1139633  0.0058799  -19.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.016 on 119035 degrees of freedom
## Multiple R-squared:  0.7878, Adjusted R-squared:  0.7877 
## F-statistic: 1.104e+05 on 4 and 119035 DF,  p-value: < 2.2e-16
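The model above is fit once on the full dataset, while the brief asks for a proper cross-validation. A minimal sketch of manual k-fold cross-validation in base R follows; it uses a small synthetic stand-in for `weathers` (same column names, assumed relationship) so it runs standalone, and with the real data you would swap in the full model formula.

``` r
set.seed(42)

# Synthetic stand-in with the same column names as summaryOfWeather.csv
n <- 200
weathers <- data.frame(MinTemp = runif(n, 0, 25))
weathers$MaxTemp <- 12 + 0.9 * weathers$MinTemp + rnorm(n, sd = 2)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment

rmse <- sapply(1:k, function(i) {
  train <- weathers[folds != i, ]            # training folds
  test  <- weathers[folds == i, ]            # held-out fold
  fit   <- lm(MaxTemp ~ MinTemp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$MaxTemp - pred)^2))        # held-out RMSE for fold i
})

mean(rmse)  # average cross-validated RMSE
```

Averaging the held-out RMSE across folds gives an error estimate that is not inflated by fitting and evaluating on the same rows.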

Second task: Classification

Build a classification model to predict which medal (or none, recorded as NA) an athlete will receive, given the other features in this Olympic dataset. Again, provide some sort of conclusion.

# Load the Olympic athlete events dataset
filepath <- "./Datasets/athlete_events.csv"
events <- read.csv(filepath, header=TRUE)

# Keep a subset of sports to make the Sport factor manageable
events <- events[events$Sport %in% c("Basketball", "Rugby Sevens", "Judo", "Football", "Tug-Of-War", "Speed Skating", "Cross Country Skiing", "Athletics", "Ice Hockey", "Swimming", "Badminton", "Sailing", "Biathlon"),]

# Recode missing medals as class 0 (no medal); medal factor values are
# coerced to their integer level codes (1-3)
events$Medal <- unlist(lapply(events$Medal, FUN=function(elem) ifelse(is.na(elem), 0, elem)))

# Re-factor Sport so unused levels are dropped after subsetting
events$Sport <- as.factor(as.vector(events$Sport))

# Use the first 10,000 rows to keep training time reasonable
subEvents <- events[1:10000,]
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# Random forest on physical attributes and sport; rows with NAs are excluded
classifier <- randomForest(as.factor(Medal) ~ Sex + Age + Height + Weight + Sport, data=subEvents, na.action=na.exclude)
classifier
## 
## Call:
##  randomForest(formula = as.factor(Medal) ~ Sex + Age + Height +      Weight + Sport, data = subEvents, na.action = na.exclude) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.03%
## Confusion matrix:
##      0  1  2 3 class.error
## 0 7040 20 38 8 0.009287926
## 1  318  2  9 3 0.993975904
## 2  323  6 40 5 0.893048128
## 3  318  2 12 4 0.988095238
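The confusion matrix shows the forest almost always predicts class 0 (no medal): the per-class errors for the three medal classes are near 99%, and the low overall OOB error simply reflects the dominance of non-medalists. One common mitigation is to balance the classes before (or during) training, e.g. by downsampling the majority class, or by passing `strata` and `sampsize` to `randomForest`. A sketch of plain downsampling, shown on synthetic stand-in data with an imbalance similar to the Medal column:

``` r
set.seed(42)

# Synthetic stand-in: imbalanced labels like the Medal column (0 = no medal)
medal <- factor(c(rep(0, 900), rep(1, 40), rep(2, 30), rep(3, 30)))
df <- data.frame(Medal = medal, Height = rnorm(1000, 175, 10))

# Downsample every class to the size of the rarest one
m <- min(table(df$Medal))
balanced <- do.call(rbind, lapply(split(df, df$Medal),
                                  function(g) g[sample(nrow(g), m), ]))
table(balanced$Medal)  # each class now has m rows
```

A classifier trained on the balanced subset trades some overall accuracy on non-medalists for usable recall on the medal classes, which is usually the point of this task.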