Introduction

This notebook explores methods for visualizing and analyzing data in R using tables and graphs. The main library is the tidyverse, a collection of packages (including ggplot2 for plotting and dplyr for data manipulation) that is the standard for simple visualizations, and some complicated ones, in R. When there is not much data, or the audience is only looking for overviews and summary information, graphs and tables are usually enough to get a point across. This notebook uses the Titanic dataset from Kaggle as a starting point and then poses a “fun” challenge with Gapminder data.

The sections of this notebook are:
- Reading and Cleaning Data
- Plotting Data
- Summary Functions
- Problems

Reading and Cleaning Data

Data can come from many sources: a CSV file works for smaller datasets, while a serialized format with a schema, such as Avro, may work better for larger ones. R has many functions for reading external data and for writing data from R to other software. It is usually best to look them up for specific cases, but read.csv() and write.csv() are the ones needed most often. Run help(read.csv) or help(write.csv) to learn more about them.
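
As a minimal sketch of the two functions working together, the round trip below writes a small made-up data frame to a temporary file and reads it back. The data frame `toy` and the path `tmpPath` are hypothetical names used only for illustration.

```r
# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  id = 1:3,
  name = c("Ann", "Bob", "Cy"),
  score = c(90.5, NA, 72)
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
tmpPath <- tempfile(fileext = ".csv")
write.csv(toy, tmpPath, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
toyBack <- read.csv(tmpPath, header = TRUE, stringsAsFactors = FALSE)

str(toyBack)   # column types as R inferred them
# help(read.csv) lists the remaining arguments (sep, na.strings, ...)
```

Missing values are written as NA and come back as NA, so the round trip should preserve them.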

Another very common problem is that data does not always come in a form that is easy to analyze, so it often has to be reshaped, filtered, and cleaned to meet the relevant requirements. R is not always the best tool for the cleaning itself, especially compared to languages that make it easy to open and modify file contents directly (C++, Python, etc.). There are also programs built for specific file types (e.g. Excel for xlsx and csv, or Tableau for twb).

There is no single systematic method for cleaning data or decoding it so that it reads into R smoothly, which can make the task very frustrating. However, starting with a clean dataset and seeing some results will hopefully be motivating enough to justify spending time on cleaning in the future! (Most small in-class assignments will use clean data, but independent or class projects are likely to involve some mess.)

In this notebook the example is the Titanic data, which describes passengers on the Titanic and whether or not each of them survived. Below is an example of using read.csv() to read the Titanic data into a data frame.

The data can be downloaded here: https://www.kaggle.com/c/titanic/data

Be sure to change the file paths (filePathTrain and filePathTest) to match where the Titanic data files are stored on the computer running this code.

# Change filepath to match where you downloaded
filePathTrain <- "./Datasets/trainTitanic.csv"
filePathTest <- "./Datasets/testTitanic.csv"

# Save into data frames (read.csv() already returns a data frame)
dfTrain <- read.csv(filePathTrain, header = TRUE)
#dfTest <- read.csv(filePathTest, header = TRUE)
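
Right after reading a file it usually helps to check its shape, column types, and missing values. The sketch below does this on a small made-up data frame (`df` is a hypothetical stand-in, not the Titanic file); the same calls should work on any data frame.

```r
# Hypothetical stand-in for a freshly read data frame
df <- data.frame(
  Age = c(22, NA, 35, NA),
  Fare = c(7.25, 71.28, 8.05, 8.46),
  Sex = c("male", "female", "male", "male")
)

dim(df)       # number of rows and columns
str(df)       # type of each column
head(df, 3)   # first three rows

# Count of missing values in each column
naCounts <- colSums(is.na(df))
naCounts
```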

Plotting Data

With ggplot2 there are many plotting functions for showing what the data looks like. Which one to use depends on which aspect of the data needs to be emphasized. Below is an example with the Titanic data; to add more to the plot, look up the other options for geom_point(). There are also other plotting functions in R; the simplest is plot() (which was used in the first notebook).

There are many more packages, and many other programs, for plotting as well. One challenge with graphs and plots is that they often carry a lot of information, and people do not have the time needed to view and understand them. Also, people are sometimes not sure whether they should look for a trend or a distribution, and do not think about flipping the axes to see the reverse story. Nevertheless, as the saying goes, “a picture is worth a thousand words,” so a good plot can still be effective.

Here are two example plots made with ggplot2.

library(ggplot2)

ggplot(data = dfTrain) + 
  geom_point(mapping = aes(x = Age, y = Fare, size=as.factor(Survived), color=Sex))
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 177 rows containing missing values (geom_point).

ggplot(data=dfTrain) + 
  geom_point(mapping=aes(x=Age, y=Fare, color=as.factor(Pclass), alpha=Survived, shape=Sex))
## Warning: Removed 177 rows containing missing values (geom_point).

There are many more interesting display variables that can be added with ggplot as well, and many more types of graphs using geom_smooth(), geom_bar(), stat_summary(). Here is an example with geom_bar() counting how many people are in each class.

ggplot(data = dfTrain) + 
  geom_bar(mapping = aes(x = Pclass, fill = as.factor(Survived)))

Here is an example summarizing the ages of the passengers in each class.

ggplot(data = dfTrain) + 
  stat_summary(
    mapping = aes(x = Pclass, y = Age),
    fun.min = min,     # named fun.ymin / fun.ymax / fun.y before ggplot2 3.3.0
    fun.max = max,
    fun = median
  )
## Warning: Removed 177 rows containing non-finite values (stat_summary).

There are also ways to fill inside the bars and add more details! Those are covered in the associated book.
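
As one possible illustration of filling bars, the sketch below colors each bar by survival and stretches the bars to full height so they show proportions. The data frame `toy` is made up for the example; the column names mirror the Titanic data but the values do not come from it.

```r
library(ggplot2)

# Hypothetical toy data: passenger class and survival flag
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0)
)

# Convert to factors so ggplot treats them as categories
p <- ggplot(data = toy) +
  geom_bar(
    mapping = aes(x = as.factor(Pclass), fill = as.factor(Survived)),
    position = "fill"       # stretch each bar to 100% to show proportions
  ) +
  labs(x = "Passenger class", y = "Share of passengers", fill = "Survived")

print(p)
```

Leaving out position = "fill" gives stacked counts instead of proportions.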

Summary, Arrange, Select, Filter, Mutate, Group By Functions

Some summary functions come built into R. They provide accurate summaries but are less convenient when building complicated commands. Summary functions give results such as the mean, median, and quantiles. The built-in summary() gives the min, max, median, mean, and 1st/3rd quartiles of a set of numbers. A short example of the built-in functions follows, and more detailed examples of the summary and transformation functions in dplyr come later.

# summarize fare
summary(dfTrain$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33
# summarize fare of survivors
summary(dfTrain$Fare[dfTrain$Survived == 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.47   26.00   48.40   57.00  512.33
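
When summary() gives more than is needed, the individual base R functions can be called directly. The sketch below uses made-up vectors `fare` and `survived` (hypothetical values, not the real dataset) and also shows tapply(), which computes a statistic within each group.

```r
# Hypothetical toy data
fare <- c(7.25, 71.28, 7.92, 53.10, 8.05, 30.00)
survived <- c(0, 1, 1, 1, 0, 0)

# Individual statistics
mean(fare)
median(fare)
quantile(fare, probs = c(0.1, 0.9))   # arbitrary quantiles

# Mean fare within each survival group
meanBySurvival <- tapply(fare, survived, mean)
meanBySurvival
```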

Filter

Sometimes it’s more useful to take a summary of very specific instances in the data, for example a summary of the ages of all female survivors, or a summary of all variables for passengers who are 18. The first step in finding these summaries is to filter the data and pull only the necessary rows.

Such filters are done very effectively with the dplyr library, which is part of the tidyverse. Below are some examples. Each full filtered result is saved in a variable, and only the first few rows are displayed.

library(dplyr)

# filter all fields for only females
females <- filter(dfTrain, Sex == "female")

# filter all fields for only females that have survived
survivingFemales <- filter(dfTrain, Survived == 1, Sex == "female")

# filter all passengers who were 18 years old
age18 <- filter(dfTrain, Age == 18)

head(females)
##   PassengerId Survived Pclass
## 1           2        1      1
## 2           3        1      3
## 3           4        1      1
## 4           9        1      3
## 5          10        1      2
## 6          11        1      3
##                                                  Name    Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 2                              Heikkinen, Miss. Laina female  26     0
## 3        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 4   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0
## 5                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1
## 6                     Sandstrom, Miss. Marguerite Rut female   4     1
##   Parch           Ticket    Fare Cabin Embarked
## 1     0         PC 17599 71.2833   C85        C
## 2     0 STON/O2. 3101282  7.9250              S
## 3     0           113803 53.1000  C123        S
## 4     2           347742 11.1333              S
## 5     0           237736 30.0708              C
## 6     1          PP 9549 16.7000    G6        S
head(survivingFemales)
##   PassengerId Survived Pclass
## 1           2        1      1
## 2           3        1      3
## 3           4        1      1
## 4           9        1      3
## 5          10        1      2
## 6          11        1      3
##                                                  Name    Sex Age SibSp
## 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 2                              Heikkinen, Miss. Laina female  26     0
## 3        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 4   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female  27     0
## 5                 Nasser, Mrs. Nicholas (Adele Achem) female  14     1
## 6                     Sandstrom, Miss. Marguerite Rut female   4     1
##   Parch           Ticket    Fare Cabin Embarked
## 1     0         PC 17599 71.2833   C85        C
## 2     0 STON/O2. 3101282  7.9250              S
## 3     0           113803 53.1000  C123        S
## 4     2           347742 11.1333              S
## 5     0           237736 30.0708              C
## 6     1          PP 9549 16.7000    G6        S
head(age18)
##   PassengerId Survived Pclass
## 1          39        0      3
## 2          50        0      3
## 3         145        0      2
## 4         176        0      3
## 5         205        1      3
## 6         229        0      2
##                                            Name    Sex Age SibSp Parch
## 1            Vander Planke, Miss. Augusta Maria female  18     2     0
## 2 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female  18     1     0
## 3                    Andrew, Mr. Edgardo Samuel   male  18     0     0
## 4                        Klasen, Mr. Klas Albin   male  18     1     1
## 5                      Cohen, Mr. Gurshon "Gus"   male  18     0     0
## 6                     Fahlstrom, Mr. Arne Jonas   male  18     0     0
##     Ticket    Fare Cabin Embarked
## 1   345764 18.0000              S
## 2   349237 17.8000              S
## 3   231945 11.5000              S
## 4   350404  7.8542              S
## 5 A/5 3540  8.0500              S
## 6   236171 13.0000              S
summarise(survivingFemales)
## data frame with 0 columns and 0 rows

Filters and summaries are quite powerful tools to get initial ideas about what data looks like. They are also great for figuring out how much data is missing, and whether or not there are any anomalies.
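
The sketch below shows a few more filter patterns on a made-up data frame (`toy` and its values are hypothetical): combining conditions, matching against a set of values, and pulling out rows with missing data.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Sex = c("female", "male", "female", "female", "male"),
  Age = c(30, 18, NA, 45, 60),
  Pclass = c(1, 3, 2, 3, 1)
)

# Conditions separated by commas are combined with AND
adultFemales <- filter(toy, Sex == "female", Age >= 18)

# | means OR; %in% tests membership in a set
firstOrSecond <- filter(toy, Pclass %in% c(1, 2) | Age > 50)

# Rows where a value is missing have to be requested explicitly
missingAge <- filter(toy, is.na(Age))

nrow(adultFemales)   # passenger C is dropped because NA >= 18 is NA
nrow(missingAge)
```

Note that filter() silently drops rows where the condition evaluates to NA, which is easy to overlook when a column has many missing values.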

Arrange

Arrange sorts rows. Here are some examples (showing just the first few rows).

sortAge <- arrange(dfTrain, desc(Age))
sortId <- arrange(dfTrain, desc(PassengerId))

head(sortAge)
##   PassengerId Survived Pclass                                 Name  Sex
## 1         631        1      1 Barkworth, Mr. Algernon Henry Wilson male
## 2         852        0      3                  Svensson, Mr. Johan male
## 3          97        0      1            Goldschmidt, Mr. George B male
## 4         494        0      1              Artagaveytia, Mr. Ramon male
## 5         117        0      3                 Connors, Mr. Patrick male
## 6         673        0      2          Mitchell, Mr. Henry Michael male
##    Age SibSp Parch     Ticket    Fare Cabin Embarked
## 1 80.0     0     0      27042 30.0000   A23        S
## 2 74.0     0     0     347060  7.7750              S
## 3 71.0     0     0   PC 17754 34.6542    A5        C
## 4 71.0     0     0   PC 17609 49.5042              C
## 5 70.5     0     0     370369  7.7500              Q
## 6 70.0     0     0 C.A. 24580 10.5000              S
head(sortId)
##   PassengerId Survived Pclass                                     Name
## 1         891        0      3                      Dooley, Mr. Patrick
## 2         890        1      1                    Behr, Mr. Karl Howell
## 3         889        0      3 Johnston, Miss. Catherine Helen "Carrie"
## 4         888        1      1             Graham, Miss. Margaret Edith
## 5         887        0      2                    Montvila, Rev. Juozas
## 6         886        0      3     Rice, Mrs. William (Margaret Norton)
##      Sex Age SibSp Parch     Ticket   Fare Cabin Embarked
## 1   male  32     0     0     370376  7.750              Q
## 2   male  26     0     0     111369 30.000  C148        C
## 3 female  NA     1     2 W./C. 6607 23.450              S
## 4 female  19     0     0     112053 30.000   B42        S
## 5   male  27     0     0     211536 13.000              S
## 6 female  39     0     5     382652 29.125              Q
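
arrange() also accepts several sort keys at once. A small sketch on made-up data (`toy` is hypothetical) that sorts by class and then by descending age within each class:

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Pclass = c(3, 1, 3, 1),
  Age = c(40, NA, 22, 35)
)

# Sort by class ascending, then by age descending within each class
sorted <- arrange(toy, Pclass, desc(Age))
sorted
```

Rows with a missing sort value are placed at the end of their group, even when desc() is used.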

Select

Select is especially useful when there are many columns. This dataset does not have that many, so select is less useful here, but here is an example anyway (again, only the first few rows are shown).

selectUsefulCols <- select(dfTrain, Age, Fare, Pclass, Survived)
head(selectUsefulCols)
##   Age    Fare Pclass Survived
## 1  22  7.2500      3        0
## 2  38 71.2833      1        1
## 3  26  7.9250      3        1
## 4  35 53.1000      1        1
## 5  35  8.0500      3        0
## 6  NA  8.4583      3        0
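
Besides listing columns by name, select() can drop columns, take a range, or match on a name pattern. A brief sketch on a made-up data frame (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with Titanic-style column names
toy <- data.frame(
  PassengerId = 1:2,
  Age = c(22, 38),
  Fare = c(7.25, 71.28),
  Pclass = c(3, 1),
  Parch = c(0, 0)
)

names(select(toy, -PassengerId))       # drop one column
names(select(toy, Age:Pclass))         # a range of adjacent columns
names(select(toy, starts_with("P")))   # columns by name prefix
```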

Mutate

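Mutate is especially useful for creating a new field out of the existing fields (a field in this case is a column). It adds the new column to the end of the dataset. Here are a few examples (they are meant to illustrate the concept rather than to be useful). Note that length(Name) returns the number of elements in the whole column (891 here), so lettersInName is the same in every row; nchar(as.character(Name)) would count the characters in each name instead.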

funnyNewFields <- mutate(dfTrain, lettersInName = length(Name), farePerYear = Fare/Age)
head(funnyNewFields)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked lettersInName farePerYear
## 1     0        A/5 21171  7.2500              S           891   0.3295455
## 2     0         PC 17599 71.2833   C85        C           891   1.8758763
## 3     0 STON/O2. 3101282  7.9250              S           891   0.3048077
## 4     0           113803 53.1000  C123        S           891   1.5171429
## 5     0           373450  8.0500              S           891   0.2300000
## 6     0           330877  8.4583              Q           891          NA
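
To make the difference between a per-row value and a whole-column value concrete, here is a small sketch on two rows copied from the output above. The column names `charsInName` and `rowsInTable` are made up for this illustration.

```r
library(dplyr)

# Two rows copied from the Titanic output shown above
toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Moran, Mr. James"),
  Fare = c(7.25, 8.4583),
  Age = c(22, NA),
  stringsAsFactors = FALSE
)

perRow <- mutate(
  toy,
  charsInName = nchar(Name),    # one value per row
  rowsInTable = length(Name),   # same value (the row count) in every row
  farePerYear = Fare / Age      # NA wherever Age is missing
)
perRow
```

If Name is stored as a factor, wrapping it as nchar(as.character(Name)) should give the same per-row counts.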

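Another useful application of mutate is converting a numeric variable into a categorical one. For example, Age can be split into over 50 versus 50 or under, and Fare into three categories (less than 20, at least 20 but less than 40, and 40 or more). In the code below, ageBuckets is 1 for passengers over 50 and also for passengers whose age is missing, and fareBuckets is 2, 1, or 0 for the low, middle, and high fare ranges respectively.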

selectAndMutate <- select(
  mutate(
    dfTrain,
    ageBuckets = as.numeric(Age > 50 | is.na(Age)),
    fareBuckets = ifelse(Fare < 20, 2, ifelse(Fare < 40, 1, 0))
  ),
  Name, Age, Fare, Pclass, ageBuckets, fareBuckets, Survived
)
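
An alternative way to build such buckets, sketched here on made-up values, is the base R function cut(), which assigns labelled intervals in one call. The names `toy`, `ageGroup`, and `fareGroup` are hypothetical, and unlike the code above this version leaves missing ages as NA rather than folding them into a bucket.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- data.frame(
  Age = c(10, 50, 63, NA),
  Fare = c(7.25, 20, 39.99, 512.33)
)

bucketed <- mutate(
  toy,
  # NA ages stay NA instead of being assigned to a bucket
  ageGroup = ifelse(Age > 50, "over 50", "50 or under"),
  # right = FALSE makes the intervals [0, 20), [20, 40), [40, Inf)
  fareGroup = cut(
    Fare,
    breaks = c(0, 20, 40, Inf),
    labels = c("low", "mid", "high"),
    right = FALSE
  )
)
bucketed
```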

Summarise and group_by

Finally, dplyr's summarise() function can give specific summary statistics, and running the built-in summary() on a filtered data frame also gives good results. Here is an example:

summary(females)
##   PassengerId       Survived         Pclass     
##  Min.   :  2.0   Min.   :0.000   Min.   :1.000  
##  1st Qu.:231.8   1st Qu.:0.000   1st Qu.:1.000  
##  Median :414.5   Median :1.000   Median :2.000  
##  Mean   :431.0   Mean   :0.742   Mean   :2.159  
##  3rd Qu.:641.2   3rd Qu.:1.000   3rd Qu.:3.000  
##  Max.   :889.0   Max.   :1.000   Max.   :3.000  
##                                                 
##                                              Name         Sex     
##  Abbott, Mrs. Stanton (Rosa Hunt)              :  1   female:314  
##  Abelson, Mrs. Samuel (Hannah Wizosky)         :  1   male  :  0  
##  Ahlin, Mrs. Johan (Johanna Persdotter Larsson):  1               
##  Aks, Mrs. Sam (Leah Rosen)                    :  1               
##  Allen, Miss. Elisabeth Walton                 :  1               
##  Allison, Miss. Helen Loraine                  :  1               
##  (Other)                                       :308               
##       Age            SibSp            Parch            Ticket   
##  Min.   : 0.75   Min.   :0.0000   Min.   :0.0000   347082 :  5  
##  1st Qu.:18.00   1st Qu.:0.0000   1st Qu.:0.0000   2666   :  4  
##  Median :27.00   Median :0.0000   Median :0.0000   110152 :  3  
##  Mean   :27.92   Mean   :0.6943   Mean   :0.6497   113781 :  3  
##  3rd Qu.:37.00   3rd Qu.:1.0000   3rd Qu.:1.0000   13502  :  3  
##  Max.   :63.00   Max.   :8.0000   Max.   :6.0000   24160  :  3  
##  NA's   :53                                        (Other):293  
##       Fare            Cabin     Embarked
##  Min.   :  6.75          :217    :  2   
##  1st Qu.: 12.07   G6     :  4   C: 73   
##  Median : 23.00   E101   :  3   Q: 36   
##  Mean   : 44.48   F33    :  3   S:203   
##  3rd Qu.: 55.00   B18    :  2           
##  Max.   :512.33   B28    :  2           
##                   (Other): 83
summary(survivingFemales)
##   PassengerId       Survived     Pclass     
##  Min.   :  2.0   Min.   :1   Min.   :1.000  
##  1st Qu.:238.0   1st Qu.:1   1st Qu.:1.000  
##  Median :400.0   Median :1   Median :2.000  
##  Mean   :429.7   Mean   :1   Mean   :1.918  
##  3rd Qu.:636.0   3rd Qu.:1   3rd Qu.:3.000  
##  Max.   :888.0   Max.   :1   Max.   :3.000  
##                                             
##                                               Name         Sex     
##  Abbott, Mrs. Stanton (Rosa Hunt)               :  1   female:233  
##  Abelson, Mrs. Samuel (Hannah Wizosky)          :  1   male  :  0  
##  Aks, Mrs. Sam (Leah Rosen)                     :  1               
##  Allen, Miss. Elisabeth Walton                  :  1               
##  Andersen-Jensen, Miss. Carla Christine Nielsine:  1               
##  Andersson, Miss. Erna Alexandra                :  1               
##  (Other)                                        :227               
##       Age            SibSp           Parch            Ticket   
##  Min.   : 0.75   Min.   :0.000   Min.   :0.000   2666    :  4  
##  1st Qu.:19.00   1st Qu.:0.000   1st Qu.:0.000   110152  :  3  
##  Median :28.00   Median :0.000   Median :0.000   13502   :  3  
##  Mean   :28.85   Mean   :0.515   Mean   :0.515   24160   :  3  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:1.000   PC 17757:  3  
##  Max.   :63.00   Max.   :4.000   Max.   :5.000   110413  :  2  
##  NA's   :36                                      (Other) :215  
##       Fare             Cabin     Embarked
##  Min.   :  7.225          :142    :  2   
##  1st Qu.: 13.000   E101   :  3   C: 64   
##  Median : 26.000   F33    :  3   Q: 27   
##  Mean   : 51.939   B18    :  2   S:140   
##  3rd Qu.: 76.292   B28    :  2           
##  Max.   :512.329   B35    :  2           
##                    (Other): 79
summarise(females, count = n(), age = mean(Age, na.rm=TRUE), fare = mean(Fare,na.rm=TRUE))
##   count      age     fare
## 1   314 27.91571 44.47982

It is also possible to use group_by() to get summaries about specific subgroups.

# Using pipe commands to build interesting summaries
selectAndMutate %>% group_by(fareBuckets) %>% summarise(count=n(), ages=mean(Age,na.rm=TRUE))
## # A tibble: 3 x 3
##   fareBuckets count  ages
##         <dbl> <int> <dbl>
## 1           0   176  33.9
## 2           1   200  29.8
## 3           2   515  28.0
selectAndMutate %>% group_by(Pclass) %>% summarise(count=n(), fares=mean(Fare,na.rm=TRUE))
## # A tibble: 3 x 3
##   Pclass count fares
##    <int> <int> <dbl>
## 1      1   216  84.2
## 2      2   184  20.7
## 3      3   491  13.7

Consult the book for more examples of putting fields together!
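
Grouping also works on more than one column at a time. The sketch below uses a made-up data frame (`toy` is hypothetical) to compute a count and a survival rate for every combination of sex and class.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Sex = c("female", "female", "male", "male", "male", "female"),
  Pclass = c(1, 3, 1, 3, 3, 1),
  Survived = c(1, 0, 0, 0, 1, 1)
)

bySexClass <- toy %>%
  group_by(Sex, Pclass) %>%
  summarise(
    count = n(),
    survivalRate = mean(Survived)   # mean of a 0/1 column is a proportion
  ) %>%
  ungroup()                         # remove any leftover grouping

bySexClass
```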

Applying Functions on Vectors and Lists in R

This is not a “dplyr” feature, but it is an important idea for working with vectors and lists. In R, it is usually best to avoid explicit loops over vectors and instead use the “apply” family of functions, which apply the same function to every element of a vector or list. lapply() itself does not run in parallel, but code written this way is concise and is easy to switch to a parallel version later (for example mclapply() in the parallel package), which can process the work much faster.

To use apply effectively, first write a function that works on a single element of a vector. Then use lapply() to apply that function to every element and store the results in a list. The unlist() command converts a list back into a vector. An example is below: the first function adds 1 to every element of a numeric vector, and the second adds the letter “a” to the start of each string.

add1 <- function(x) {
  return (x+1)
}

adda <- function(x) {
  return (paste("a", x, sep=""))
}

unlist(lapply(c(1,2,3,4,5), add1))
## [1] 2 3 4 5 6
unlist(lapply(c("a","ab", "abc", "abcd"), adda))
## [1] "aa"    "aab"   "aabc"  "aabcd"

There are other apply functions that work on different data structures; run help(lapply) to learn more!
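
As a rough sketch of those alternatives, the block below repeats the two examples with vectorized operations and then shows sapply(), vapply(), and apply(). The variable names are made up for the illustration.

```r
x <- c(1, 2, 3, 4, 5)
words <- c("a", "ab", "abc", "abcd")

# Arithmetic and paste0() are already vectorized, so no apply call is needed
x + 1
paste0("a", words)

# sapply() simplifies the result to a vector when it can
sapply(x, function(v) v + 1)

# vapply() does the same but checks the type and length of each result
wordLengths <- vapply(words, nchar, FUN.VALUE = integer(1), USE.NAMES = FALSE)
wordLengths

# apply() works over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
colTotals <- apply(m, 2, sum)
colTotals
```

For simple element-wise work the vectorized form is usually the shortest; the apply functions matter more when the per-element function is something custom.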

The extra topics in these notebooks called “functional programming” are also great tools to program smarter in R, and will be especially useful for assignments.

Task with Titanic Data

Using the training set, try to come up with an idea of which types of passengers survived based on the features given. To do this, use various summaries and graphs to get an idea of how variables are distributed for passengers of different types.

Also, generate at least one additional feature of your own from the given features. For example, you can look at the “Name” field for females to determine whether they were married and create a new vector accordingly. Another example is to use some letters in the ticket type (or whether or not there is a ticket) to see if it correlates with survival chances. The feature you generate does not have to be the best indicator; the point is to get used to generating features from existing ones. You can do this feature generation with a function (e.g. if “mrs” is a substring of the name then output 1, else output 0), and the output should be a vector. An example of an easy feature is below: it simply measures whether or not the letter “m” appears in the name. Be creative when finding a feature that works!

This task is open ended - but in the end you will use whatever indicators you have built with the training set and try to classify people in the testing set. This means in the end you must write a function that takes in certain variables and outputs the result (1 if survived, 0 otherwise).

Show all of your work! That includes what summary functions and plots you used and how you analyzed them to reach your results, as well as the final function and some sample inputs+outputs.

# This is a very simplistic example, meant as a template to build on

isLetterM <- function(x) {
  if ("m" %in% unlist(strsplit(tolower(x), ""))) {
    return(1)
  } else {
    return(0)
  }
}

letterM <- unlist(lapply(dfTrain$Name, isLetterM))

# feature letterM is built

survivalTest <- function(Age, Name, Fare) {
  if (Age > 20) {
    if (isLetterM(Name)) {
      if (Fare > 15) {
        return(1)
      } else {
        return (0)
      }
    } else if (Fare < 8) {
      return (1)
    } else {
      return(0)
    }
  } else {  # Age is 20 or below
    return (0)
  }
}
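
One way to check a rule like this against the training set is to apply it to every row and compare the predictions with the recorded outcomes. The sketch below does that on three made-up rows; `toy`, `toyRule`, and the thresholds are hypothetical and are not meant as a good classifier.

```r
# Hypothetical toy rows standing in for the training set
toy <- data.frame(
  Name = c("Cumings, Mrs. John", "Heikkinen, Miss. Laina", "Allen, Mr. William"),
  Age = c(38, 26, 35),
  Fare = c(71.28, 7.92, 8.05),
  Survived = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Feature: 1 if the title "Mrs." appears in the name, else 0.
# fixed = TRUE matches the literal text instead of treating "." as a wildcard.
toy$isMrs <- as.numeric(grepl("Mrs.", toy$Name, fixed = TRUE))

# A deliberately crude, made-up rule: predict survival for "Mrs." or high fares
toyRule <- function(isMrs, fare) {
  if (isMrs == 1 || fare > 50) 1 else 0
}

# mapply() steps through both vectors together, one row at a time
predicted <- mapply(toyRule, toy$isMrs, toy$Fare)

# Share of rows where the prediction matches the recorded outcome
accuracy <- mean(predicted == toy$Survived)
accuracy
```

grepl() is vectorized, so the feature is built for the whole column without lapply().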

Task with Gapminder Data

Find some data on the Gapminder homepage that interests you. Using Microsoft Excel, try to clean the data up and save it as a “csv” file. If you are able to load your cleaned data into R then you can continue; otherwise, send the instructor the dataset you have in mind to get help cleaning it.

Gapminder has a very interesting display, but sometimes focusing on a subset of the data and putting it into a graph or plot is enough to get the bigger picture of what is going on. In this task you can make use of the “ggplot2” package, and the group that creates the most interesting and quirky graph wins! (For simplicity, try to pick only 2 indicators, e.g. income per person and life expectancy. If you are feeling very bold you can try to make a 3D plot, and maybe you will get very interesting results.)
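
As a possible starting point, the sketch below plots two indicators against each other. The numbers are made up for illustration and are not real Gapminder figures, and the column names are assumptions about how a cleaned file might look.

```r
library(ggplot2)

# Made-up values for illustration only; not real Gapminder figures
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  incomePerPerson = c(800, 4000, 15000, 45000),
  lifeExpectancy = c(58, 67, 75, 82)
)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = incomePerPerson, y = lifeExpectancy)) +
  geom_text(
    mapping = aes(x = incomePerPerson, y = lifeExpectancy, label = country),
    vjust = -1                      # nudge labels above the points
  ) +
  scale_x_log10() +                 # incomes span orders of magnitude
  labs(x = "Income per person", y = "Life expectancy (years)")

print(p)
```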

If you do not like the results with ggplot2 then go ahead and use whatever other plotting software you know.

Both tasks will be done in teams - the Gapminder Data is more of a challenge while the Titanic data work is important for understanding.