This notebook explores methods for visualizing and analyzing data in R using tables and graphs. The main library used is tidyverse, the standard for producing simple visualizations (and some complicated ones) in R. When there is not much data, or the audience only wants overviews and summary information, graphs and tables are usually enough to get a point across. This notebook uses the Titanic dataset from Kaggle as a starting point, and then turns to a “fun” challenge with Gapminder data.
The sections of this notebook are:
- Reading and Cleaning Data
- Plotting Data
- Summary Functions
- Problems
The two most important packages inside tidyverse (which is a collection of packages) are “dplyr” and “ggplot2”.
#install.packages("tidyverse")
# install.packages takes a character vector of package names as its first argument
#install.packages(c("dplyr", "ggplot2", "magrittr"))
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The tidyverse book (filtering, plotting, and more): http://r4ds.had.co.nz/introduction.html
Data can come from many sources - a csv works for smaller datasets, while a serialized format with a schema (such as Avro) can work for larger amounts of data. R has many functions for reading external data and for writing data from R out to other software. It is usually best to look them up for specific cases, but read.csv() and write.csv() cover many of them. Use the help command to learn more about them.
Another very common problem is that data does not always come in a form that is easy to analyze, which is why it is important to reshape, filter, and clean it to meet the relevant requirements. R is not the best language for actually cleaning data, especially compared to languages that make it easy to open and modify file contents directly (C++, Python, etc). There are also specific programs for specific file types (e.g. Excel for xlsx and csv, or Tableau for twb).
There is no single scientific method for cleaning data, or for decoding it so it can be read into R smoothly, which unfortunately can make that task very frustrating. However, starting with a clean dataset and seeing some results will hopefully be motivation enough to spend that time in the future! (Most small in-class assignments will use clean data, but independent or class projects are likely to involve some mess.)
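As a taste of what light cleaning looks like in base R, here is a minimal sketch (the data frame and its column names are made up for illustration, not taken from the Titanic data): trim stray whitespace, standardize missing-value placeholders, and coerce a numeric column that arrived as text.

```r
# Toy messy data, standing in for a freshly imported file
raw <- data.frame(
  name = c(" Alice", "Bob ", "  Carol"),
  age  = c("34", "n/a", "29"),
  stringsAsFactors = FALSE
)

raw$name <- trimws(raw$name)        # strip leading/trailing spaces
raw$age[raw$age == "n/a"] <- NA     # mark placeholder values as NA
raw$age <- as.numeric(raw$age)      # coerce text to numeric
```

After these three lines, `raw$age` is a proper numeric column with an NA where the placeholder was, ready for summary functions.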
In this notebook the running example is Titanic data, which gives information about the passengers on the Titanic and whether or not each survived. Below is an example of using read.csv() to read the Titanic data into a dataframe.
The data can be downloaded here: https://www.kaggle.com/c/titanic/data
Be sure to change the file paths to match the location of the Titanic data files on the computer this code is run on.
# Change filepath to match where you downloaded
filePathTrain <- "./Datasets/trainTitanic.csv"
filePathTest <- "./Datasets/testTitanic.csv"
# Save into dataframes
dfTrain <- as.data.frame(read.csv(filePathTrain, header=TRUE))
#dfTest <- as.data.frame(read.csv(filePathTest, header=TRUE))
ggplot2 offers many different plotting functions for showing what the data looks like; which one to use depends on which part of the data needs to be emphasized. Below is an example with the Titanic data - to add more detail, look up the options for geom_point. There are also other plotting functions in R; the simplest is “plot” (which was used in the first notebook).
There are many more packages, and plenty of other software, for plotting as well. One challenge with graphs and plots is that they often carry a lot of information, and people do not always have the time to view and understand them. Sometimes viewers are also unsure whether to look for a trend or a distribution, and do not think of flipping the axes to see the reverse story. Nevertheless, as the saying goes, “a picture is worth a thousand words,” so a good plot can still be very effective.
Here’s an example of two plots made with ggplot.
library(ggplot2)
ggplot(data = dfTrain) +
geom_point(mapping = aes(x = Age, y = Fare, size=as.factor(Survived), color=Sex))
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 177 rows containing missing values (geom_point).
ggplot(data=dfTrain) +
geom_point(mapping=aes(x=Age, y=Fare, color=as.factor(Pclass), alpha=Survived, shape=Sex))
## Warning: Removed 177 rows containing missing values (geom_point).
There are many more interesting display variables that can be added with ggplot, and many more types of graphs using geom_smooth(), geom_bar(), and stat_summary(). Here is an example with geom_bar() counting how many people are in each class.
ggplot(data = dfTrain) +
geom_bar(mapping = aes(x = Pclass, fill = as.factor(Survived)))
Here is an example summarizing the ages of the people in each class.
ggplot(data = dfTrain) +
stat_summary(
mapping = aes(x = Pclass, y = Age),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
## Warning: Removed 177 rows containing non-finite values (stat_summary).
There are also ways to fill inside the bars and add more details! Those can be found using the associated book.
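One of those bar-filling options is the `position` argument of geom_bar. Below is a sketch using a small made-up data frame (so it runs without the Kaggle file; substitute dfTrain on the real data): `position = "fill"` rescales every bar to height 1 so the fill shows survival *proportions* per class, while `position = "dodge"` would place the survived/not-survived bars side by side instead.

```r
library(ggplot2)

# Toy stand-in for dfTrain so this example is self-contained
toy <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3, 3),
  Survived = c(1, 0, 1, 0, 0, 0, 1)
)

# Each bar is rescaled to 1, so the fill shows proportions within a class
p <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = Pclass, fill = as.factor(Survived)),
           position = "fill")
```

Printing `p` draws the plot; swapping `"fill"` for `"dodge"` changes proportions into side-by-side counts.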
Some summary functions come installed with R. These give accurate summaries but are less flexible when building complicated commands. Summary functions give results like the mean, median, quantiles, etc. R has a built-in “summary()” which gives the min, max, median, mean, and 1st/3rd quartiles of a set of numbers. A short example of the built-in functions is below, followed later by more detailed examples of the summary and transformation functions in dplyr.
# summarize fare
summary(dfTrain$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.91 14.45 32.20 31.00 512.33
# summarize fare of survivors
summary(dfTrain$Fare[dfTrain$Survived == 1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.47 26.00 48.40 57.00 512.33
Sometimes it’s more useful to summarize very specific slices of the data - for example, the Age of all female survivors, or all variables for passengers who are 18. The first step toward these summaries is to filter the data and pull only the necessary rows.
Such filters are done very effectively with the “dplyr” package inside tidyverse. Below are some examples of filters. The full filtered data is saved in a variable, and only the first few rows are displayed using the code below.
# filter all fields for females only
females <- filter(dfTrain, Sex == "female")
# filter all fields for only females that have survived
survivingFemales <- filter(dfTrain, Survived == 1, Sex == "female")
# filter all passengers who were 18 years old
age18 <- filter(dfTrain, Age == 18)
head(females)
## PassengerId Survived Pclass
## 1 2 1 1
## 2 3 1 3
## 3 4 1 1
## 4 9 1 3
## 5 10 1 2
## 6 11 1 3
## Name Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 2 Heikkinen, Miss. Laina female 26 0
## 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 4 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
## 5 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
## 6 Sandstrom, Miss. Marguerite Rut female 4 1
## Parch Ticket Fare Cabin Embarked
## 1 0 PC 17599 71.2833 C85 C
## 2 0 STON/O2. 3101282 7.9250 S
## 3 0 113803 53.1000 C123 S
## 4 2 347742 11.1333 S
## 5 0 237736 30.0708 C
## 6 1 PP 9549 16.7000 G6 S
head(survivingFemales)
## PassengerId Survived Pclass
## 1 2 1 1
## 2 3 1 3
## 3 4 1 1
## 4 9 1 3
## 5 10 1 2
## 6 11 1 3
## Name Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 2 Heikkinen, Miss. Laina female 26 0
## 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 4 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0
## 5 Nasser, Mrs. Nicholas (Adele Achem) female 14 1
## 6 Sandstrom, Miss. Marguerite Rut female 4 1
## Parch Ticket Fare Cabin Embarked
## 1 0 PC 17599 71.2833 C85 C
## 2 0 STON/O2. 3101282 7.9250 S
## 3 0 113803 53.1000 C123 S
## 4 2 347742 11.1333 S
## 5 0 237736 30.0708 C
## 6 1 PP 9549 16.7000 G6 S
head(age18)
## PassengerId Survived Pclass
## 1 39 0 3
## 2 50 0 3
## 3 145 0 2
## 4 176 0 3
## 5 205 1 3
## 6 229 0 2
## Name Sex Age SibSp Parch
## 1 Vander Planke, Miss. Augusta Maria female 18 2 0
## 2 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18 1 0
## 3 Andrew, Mr. Edgardo Samuel male 18 0 0
## 4 Klasen, Mr. Klas Albin male 18 1 1
## 5 Cohen, Mr. Gurshon "Gus" male 18 0 0
## 6 Fahlstrom, Mr. Arne Jonas male 18 0 0
## Ticket Fare Cabin Embarked
## 1 345764 18.0000 S
## 2 349237 17.8000 S
## 3 231945 11.5000 S
## 4 350404 7.8542 S
## 5 A/5 3540 8.0500 S
## 6 236171 13.0000 S
# summarise() with no summary expressions computes nothing, hence the empty result
summarise(survivingFemales)
## data frame with 0 columns and 0 rows
Filters and summaries are quite powerful tools to get initial ideas about what data looks like. They are also great for figuring out how much data is missing, and whether or not there are any anomalies.
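On the missing-data point, base R can count NAs per column in one line. The sketch below uses a small made-up data frame so it runs on its own; on the real data, replace `toy` with dfTrain and the 177 missing Age values seen in the plot warnings above will show up here too.

```r
# Toy frame with some missing values, standing in for dfTrain
toy <- data.frame(
  Age  = c(22, NA, 26, NA, 35),
  Fare = c(7.25, 71.28, NA, 8.05, 53.10)
)

# is.na() gives a logical matrix; colSums() totals the TRUEs per column
colSums(is.na(toy))
```

This prints a named vector of NA counts (here 2 for Age and 1 for Fare), a quick first check before filtering or summarizing.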
Arrange sorts rows. Here are some examples (showing just the first few rows).
sortAge <- arrange(dfTrain, desc(Age))
sortId <- arrange(dfTrain, desc(PassengerId))
head(sortAge)
## PassengerId Survived Pclass Name Sex
## 1 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
## 2 852 0 3 Svensson, Mr. Johan male
## 3 97 0 1 Goldschmidt, Mr. George B male
## 4 494 0 1 Artagaveytia, Mr. Ramon male
## 5 117 0 3 Connors, Mr. Patrick male
## 6 673 0 2 Mitchell, Mr. Henry Michael male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 1 80.0 0 0 27042 30.0000 A23 S
## 2 74.0 0 0 347060 7.7750 S
## 3 71.0 0 0 PC 17754 34.6542 A5 C
## 4 71.0 0 0 PC 17609 49.5042 C
## 5 70.5 0 0 370369 7.7500 Q
## 6 70.0 0 0 C.A. 24580 10.5000 S
head(sortId)
## PassengerId Survived Pclass Name
## 1 891 0 3 Dooley, Mr. Patrick
## 2 890 1 1 Behr, Mr. Karl Howell
## 3 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
## 4 888 1 1 Graham, Miss. Margaret Edith
## 5 887 0 2 Montvila, Rev. Juozas
## 6 886 0 3 Rice, Mrs. William (Margaret Norton)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 male 32 0 0 370376 7.750 Q
## 2 male 26 0 0 111369 30.000 C148 C
## 3 female NA 1 2 W./C. 6607 23.450 S
## 4 female 19 0 0 112053 30.000 B42 S
## 5 male 27 0 0 211536 13.000 S
## 6 female 39 0 5 382652 29.125 Q
Select is especially useful when there are many columns. Since this dataset does not have that many, select is less necessary here, but here is an example anyway (again only the first few rows are shown).
selectUsefulCols <- select(dfTrain, Age, Fare, Pclass, Survived)
head(selectUsefulCols)
## Age Fare Pclass Survived
## 1 22 7.2500 3 0
## 2 38 71.2833 1 1
## 3 26 7.9250 3 1
## 4 35 53.1000 1 1
## 5 35 8.0500 3 0
## 6 NA 8.4583 3 0
Mutate is especially useful for creating a new field out of existing fields (a field here being a column); it adds the new column at the end of the dataset. Here are a few examples (they are not necessarily useful in themselves, but they illustrate the concept).
funnyNewFields <- mutate(dfTrain, lettersInName = length(Name), farePerYear = Fare/Age)
head(funnyNewFields)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked lettersInName farePerYear
## 1 0 A/5 21171 7.2500 S 891 0.3295455
## 2 0 PC 17599 71.2833 C85 C 891 1.8758763
## 3 0 STON/O2. 3101282 7.9250 S 891 0.3048077
## 4 0 113803 53.1000 C123 S 891 1.5171429
## 5 0 373450 8.0500 S 891 0.2300000
## 6 0 330877 8.4583 Q 891 NA
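Notice that lettersInName came out as 891 (the number of rows) for every passenger. That is because length() measures the length of the whole vector, not of each string; nchar() is the element-wise character count. A small self-contained illustration (two names copied from the data above):

```r
passengerNames <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

length(passengerNames)   # 2 -- the number of elements in the vector
nchar(passengerNames)    # 23 22 -- characters in each name

# On the real data (Name may be a factor under older R defaults,
# so convert to character first):
# mutate(dfTrain, lettersInName = nchar(as.character(Name)))
```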
Another useful application of mutate is converting a numeric variable into a categorical one - for example, separating Age into greater/less than 50, or Fare into 3 categories (less than 20, between 20 and 40, greater than 40).
selectAndMutate <- select(
  mutate(dfTrain,
         ageBuckets = as.numeric(Age > 50 | is.na(Age)),
         fareBuckets = ifelse(Fare < 20, 2, ifelse(Fare < 40, 1, 0))),
  Name, Age, Fare, Pclass, ageBuckets, fareBuckets, Survived)
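Base R's cut() does the same bucketing more directly than nested ifelse() calls, turning a numeric vector into labelled intervals in one step. A sketch with three made-up fares (the "low"/"mid"/"high" labels are illustrative, not from the code above):

```r
fares <- c(7.25, 26.00, 53.10)

# breaks define the interval edges; by default intervals are right-closed,
# so these are (-Inf,20], (20,40], and (40,Inf)
fareBuckets <- cut(fares,
                   breaks = c(-Inf, 20, 40, Inf),
                   labels = c("low", "mid", "high"))
```

Here `fareBuckets` comes out as the factor low, mid, high, and could be dropped straight into a mutate() call.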
Finally, the summarise function in dplyr can give specific summary statistics, and running the built-in “summary” function on a filtered dplyr table also gives good results. Here is an example:
summary(females)
## PassengerId Survived Pclass
## Min. : 2.0 Min. :0.000 Min. :1.000
## 1st Qu.:231.8 1st Qu.:0.000 1st Qu.:1.000
## Median :414.5 Median :1.000 Median :2.000
## Mean :431.0 Mean :0.742 Mean :2.159
## 3rd Qu.:641.2 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :889.0 Max. :1.000 Max. :3.000
##
## Name Sex
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 female:314
## Abelson, Mrs. Samuel (Hannah Wizosky) : 1 male : 0
## Ahlin, Mrs. Johan (Johanna Persdotter Larsson): 1
## Aks, Mrs. Sam (Leah Rosen) : 1
## Allen, Miss. Elisabeth Walton : 1
## Allison, Miss. Helen Loraine : 1
## (Other) :308
## Age SibSp Parch Ticket
## Min. : 0.75 Min. :0.0000 Min. :0.0000 347082 : 5
## 1st Qu.:18.00 1st Qu.:0.0000 1st Qu.:0.0000 2666 : 4
## Median :27.00 Median :0.0000 Median :0.0000 110152 : 3
## Mean :27.92 Mean :0.6943 Mean :0.6497 113781 : 3
## 3rd Qu.:37.00 3rd Qu.:1.0000 3rd Qu.:1.0000 13502 : 3
## Max. :63.00 Max. :8.0000 Max. :6.0000 24160 : 3
## NA's :53 (Other):293
## Fare Cabin Embarked
## Min. : 6.75 :217 : 2
## 1st Qu.: 12.07 G6 : 4 C: 73
## Median : 23.00 E101 : 3 Q: 36
## Mean : 44.48 F33 : 3 S:203
## 3rd Qu.: 55.00 B18 : 2
## Max. :512.33 B28 : 2
## (Other): 83
summary(survivingFemales)
## PassengerId Survived Pclass
## Min. : 2.0 Min. :1 Min. :1.000
## 1st Qu.:238.0 1st Qu.:1 1st Qu.:1.000
## Median :400.0 Median :1 Median :2.000
## Mean :429.7 Mean :1 Mean :1.918
## 3rd Qu.:636.0 3rd Qu.:1 3rd Qu.:3.000
## Max. :888.0 Max. :1 Max. :3.000
##
## Name Sex
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 female:233
## Abelson, Mrs. Samuel (Hannah Wizosky) : 1 male : 0
## Aks, Mrs. Sam (Leah Rosen) : 1
## Allen, Miss. Elisabeth Walton : 1
## Andersen-Jensen, Miss. Carla Christine Nielsine: 1
## Andersson, Miss. Erna Alexandra : 1
## (Other) :227
## Age SibSp Parch Ticket
## Min. : 0.75 Min. :0.000 Min. :0.000 2666 : 4
## 1st Qu.:19.00 1st Qu.:0.000 1st Qu.:0.000 110152 : 3
## Median :28.00 Median :0.000 Median :0.000 13502 : 3
## Mean :28.85 Mean :0.515 Mean :0.515 24160 : 3
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:1.000 PC 17757: 3
## Max. :63.00 Max. :4.000 Max. :5.000 110413 : 2
## NA's :36 (Other) :215
## Fare Cabin Embarked
## Min. : 7.225 :142 : 2
## 1st Qu.: 13.000 E101 : 3 C: 64
## Median : 26.000 F33 : 3 Q: 27
## Mean : 51.939 B18 : 2 S:140
## 3rd Qu.: 76.292 B28 : 2
## Max. :512.329 B35 : 2
## (Other): 79
summarise(females, count = n(), age = mean(Age, na.rm=TRUE), fare = mean(Fare,na.rm=TRUE))
## count age fare
## 1 314 27.91571 44.47982
It is also possible to use group_by() to get summaries about specific subgroups.
# Using pipe commands to build interesting summaries
selectAndMutate %>% group_by(fareBuckets) %>% summarise(count=n(), ages=mean(Age,na.rm=TRUE))
## # A tibble: 3 x 3
## fareBuckets count ages
## <dbl> <int> <dbl>
## 1 0 176 33.9
## 2 1 200 29.8
## 3 2 515 28.0
selectAndMutate %>% group_by(Pclass) %>% summarise(count=n(), ages=mean(Fare,na.rm=TRUE))
## # A tibble: 3 x 3
## Pclass count ages
## <int> <int> <dbl>
## 1 1 216 84.2
## 2 2 184 20.7
## 3 3 491 13.7
Consult the book for more examples of putting fields together!
This is not a “dplyr” feature, but it is an important idea for working with vectors and lists. In R, it is usually best to avoid explicit loops over vectors and instead use the “apply” family of functions, which apply the same function to every element of a vector. This keeps the code concise and lets R process the elements efficiently (and leaves the door open to parallel back-ends in the future).
To use apply effectively, write a function that operates on a single element of a vector. Then use lapply to apply that function to every element and collect the results in a list; the unlist command converts the list back into a vector. An example is below: the first function adds 1 to every element of a numeric vector, and the second prepends the letter “a” to each string.
add1 <- function(x) {
return (x+1)
}
adda <- function(x) {
return (paste("a", x, sep=""))
}
unlist(lapply(c(1,2,3,4,5), add1))
## [1] 2 3 4 5 6
unlist(lapply(c("a","ab", "abc", "abcd"), adda))
## [1] "aa" "aab" "aabc" "aabcd"
There are other members of the apply family that work on different data structures; run help(lapply) to learn more!
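Two of those relatives are worth a quick sketch: sapply() simplifies its result to a vector automatically (so the unlist() step above is unnecessary), and vapply() does the same while checking that each result has the declared type and length.

```r
add1 <- function(x) {
  return(x + 1)
}

# sapply() simplifies the list of results straight to a vector
sapply(c(1, 2, 3), add1)             # 2 3 4

# vapply() additionally checks each result matches numeric(1)
vapply(c(1, 2, 3), add1, numeric(1)) # 2 3 4
```

vapply's extra template argument makes it the safer choice inside larger programs, since a function that unexpectedly returns the wrong type fails immediately instead of producing a malformed result.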
The extra “functional programming” topics in these notebooks are also great tools for programming smarter in R, and will be especially useful for assignments.
Using the training set, try to come up with an idea of which types of passengers survived based on the features given. To do this, use various summaries and graphs to get an idea of how variables are distributed for passengers of different types.
Also, generate at least one additional feature from the given features. For example, you can look at the “Name” field for females to determine whether they were married and create a new vector accordingly. Another example is to use some of the letters in the ticket type (or whether there is a ticket at all) to see if it correlates with survival chances. The feature you generate does not have to be the best indicator - the point is to get used to generating features from existing ones. You can do this feature generation with a function (e.g. if “mrs” is a substring of the name then output 1, else output 0) whose output is a vector. An example of an easy feature is below… it simply measures whether or not the letter “m” exists in the name. Be creative when finding a feature that works!
This task is open ended - but in the end you will use whatever indicators you have built with the training set to try to classify the people in the testing set. That means you must finally write a function that takes in certain variables and outputs the result (1 if survived, 0 otherwise).
Show all of your work! That includes what summary functions and plots you used and how you analyzed them to reach your results, as well as the final function and some sample inputs+outputs.
# this is a very simplistic example, meant as a template to build on
isLetterM <- function(x) {
if ("m" %in% unlist(strsplit(tolower(x), ""))) {
return(1)
} else {
return(0)
}
}
letterM <- unlist(lapply(dfTrain$Name, isLetterM))
# feature letterM is built
survivalTest <- function(Age, Name, Fare) {
  if (is.na(Age)) {
    # many passengers have missing Age, so handle NA explicitly
    return(0)
  } else if (Age > 20) {
    if (isLetterM(Name)) {
      if (Fare > 15) {
        return(1)
      } else {
        return(0)
      }
    } else if (Fare < 8) {
      return(1)
    } else {
      return(0)
    }
  } else {
    # ages of exactly 20 fall here too, so every input returns a value
    return(0)
  }
}
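Once a per-passenger rule like this is written, mapply() applies it across whole columns, walking several vectors in parallel (one element from each per call). A self-contained sketch with a simplified stand-in rule and made-up columns (not the survivalTest above):

```r
# Simplified stand-in rule: adults with an above-threshold fare "survive"
predictSurvival <- function(age, fare) {
  if (is.na(age)) return(0)   # treat unknown age as not survived
  if (age <= 20) return(0)
  if (fare > 15) return(1) else return(0)
}

ages  <- c(25, 18, NA, 40)
fares <- c(30.0, 8.0, 12.0, 7.9)

# mapply() takes the i-th element of each vector on the i-th call
mapply(predictSurvival, ages, fares)   # 1 0 0 0
```

On the real task, the same pattern would be `mapply(survivalTest, dfTest$Age, dfTest$Name, dfTest$Fare)` to produce a prediction vector for the whole testing set.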
Find some data on the Gapminder homepage that interests you. Using Microsoft Excel, try to clean the data up and save it as a “csv” file. If you are able to load your cleaned data into R, continue; otherwise, send the instructor the dataset you are considering to get help cleaning it.
Gapminder has a very interesting display, but sometimes, to get the bigger picture, focusing on a subset of the data and putting it into a graph or plot is enough. In this task you can make use of the “ggplot2” package, and the group that creates the most interesting and quirky graph wins! (For simplicity, try to pick only 2 indicators, e.g. income per person and life expectancy… if you are feeling very bold you can try to make a 3D plot and maybe you will get very interesting results.)
If you do not like the results with ggplot2, go ahead and use whatever other plotting software you know.
Both tasks will be done in teams - the Gapminder data is more of a challenge, while the Titanic work is important for understanding.