Sunday, May 26, 2013

Reproducibility Made Easy

Often the biggest hurdle to analyzing data and investigating something interesting is simply getting your hands on the data. You can look at my last post to find some basketball data and the steps needed to use it. Part of my interest there was in how the season progresses, but after thinking about ranking things over time, movies seemed to be a better fit. So my next hurdle was finding some data related to movies, and particularly box office sales.

There are a few other issues I have wanted to explore recently as well. I have read a lot about the types of analysis people have done and what their results are. If the field is going to latch on to the term 'data science', I think one thing that needs to be addressed is verifiability and reproducibility, which are potentially the same thing, but for the purpose of completeness I mention both. What I mean by these is that anybody should be able to read about the data I am using, the hypothesis I am making, the analysis employed, and the results found, and be able to verify that I am not full of shit.

I hope to address the problem of making data available. Another thing I find hard is keeping data up to date; to really verify my findings you should be able to run the analysis on data that is relevant, and data can go stale fairly quickly if no method is put in place to automate the process of pulling new data. The last issue is how to allow other people to reproduce the work you did. I hope to tackle all of these issues here, though some more than others.

Collect Data

The first step, getting the data, is usually the hardest. It is usually the least documented as well. If you do get lucky and there is some data you can use, it often lacks any detail as to how it was collected, when, from where, and by whom, so any questions you have simply become assumptions you are forced to make. In every data mining/data science project I've worked on this has been hard. There are usually so many hurdles to jump through that just having data in hand feels like a success. Once you receive the data it is NOT and will NEVER be in the form that would be most useful for your purpose. I have found that collecting it myself can sometimes be the easiest way to get data; that way you know all of the assumptions behind it. This does require a lot of work though, and I hope the underlying processes around data will get better so that this is not the case.

Below you can see my initial attempt at finding some movie data. The data was collected from Box Office Mojo using the R XML package.

library(XML)
library(lubridate)
# This is useful for importing data.
options(stringsAsFactors = FALSE)

# Source where data comes from.  
site <- "http://www.boxofficemojo.com/daily/chart/?view=1day&sortdate=2013-03-15&p=.htm"

# Read data from that site.
siteData <- readHTMLTable(site)

# We need to investigate this to see what we really need. Looking at the site helps as well.
# We need the tenth element in this list.
mList <- siteData[[10]]
# We need to remove some of the leading and trailing rows as well.
mList <- mList[4:(nrow(mList) - 1), ]
# Give the fields names.
names(mList) <- c("td", "yd", "Name", "Stu", "Daily", "peru", 
                  "perd", "num", "Avg", "Gross", "Day")
# Show a subset of the data.
mList[14:18, ]
##    td yd        Name    Stu    Daily  peru perd num  Avg        Gross Day
## 17 14 12  Life of Pi    Fox $328,783  +80% -17% 646 $509 $120,448,917 115
## 18 15 13     Quartet  Wein. $265,313  +72% -25% 688 $386  $14,162,205  64
## 19 16 14  Dark Skies W/Dim. $188,460  +45% -52% 703 $268  $16,290,971  22
## 20 17 15 Warm Bodies   LG/S $185,718  +72% -38% 659 $282  $64,196,628  43
## 21 18 18     Emperor  RAtt. $183,452 +119% -45% 311 $590   $1,586,515   8

Some cleaning was already done here: pulling the table with the movie data out of the list, removing rows from that table that are just noise, and tidying up the field names a bit. Much more cleaning is required though. As with any data project, this follows the 80/20 ratio: 80% of your time is spent cleaning and transforming the data into a usable format and 20% is spent doing actual analysis. All the interesting algorithms and visualizations are not applicable until the right amount of cleaning has happened. I often think of this task as data wrangling, and of a person who is skilled in this craft as a data ninja. I don't think I have ever seen a job post for a data ninja, but the skill is crucial.
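
As an illustration of what that remaining cleanup might look like, here is a minimal sketch that strips the dollar signs, commas, and percent signs so the money and percentage columns can be treated as numbers. It assumes the mList data frame built above; which columns you convert will depend on what you plan to do with the data.

# A rough pass at coercing the scraped text columns to numeric, assuming the
# mList data frame built above.
cleanMoney <- function(x) as.numeric(gsub("[$,]", "", x))
cleanPct <- function(x) as.numeric(gsub("%", "", x)) / 100

mList$Daily <- cleanMoney(mList$Daily)
mList$Gross <- cleanMoney(mList$Gross)
mList$Avg <- cleanMoney(mList$Avg)
mList$peru <- cleanPct(mList$peru)  # e.g. "+80%" becomes 0.8
mList$perd <- cleanPct(mList$perd)  # e.g. "-17%" becomes -0.17
mList$num <- as.numeric(mList$num)
mList$Day <- as.numeric(mList$Day)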

Please don't use the above code in a loop to pull more data. It would put a lot of stress on the site's server, it would take a while, I don't want the site to go away, and I have made a far easier way for you to get all of the data if you keep reading. If you just want the data you can use the following in R.

require(devtools)
## Loading required package: devtools
install_github("chRonos", "darrkj", quiet = TRUE)
## Installing github repo(s) chRonos/master from darrkj
## Installing chRonos.zip from
## https://github.com/darrkj/chRonos/archive/master.zip

suppressPackageStartupMessages(library(chRonos))
data(mvData)

str(mvData)
## 'data.frame':    196252 obs. of  13 variables:
##  $ date    : Date, format: "2009-07-17" "2009-07-18" ...
##  $ td      : chr  "12" "14" "11" "11" ...
##  $ yd      : chr  "-" "12" "14" "11" ...
##  $ name    : chr  "(500) Days of Summer" "(500) Days of Summer" "(500) Days of Summer" "(500) Days of Summer" ...
##  $ studio  : chr  "FoxS" "FoxS" "FoxS" "FoxS" ...
##  $ daily   : num  263178 310203 261120 125900 141175 ...
##  $ peru    : num  NA 0.18 -0.16 -0.52 0.12 -0.04 0.01 2.65 0.26 -0.18 ...
##  $ perd    : num  NA NA NA NA NA NA NA 0.9 1.02 0.96 ...
##  $ Theaters: num  27 27 27 27 27 27 27 85 85 85 ...
##  $ Avg     : num  9747 11489 9671 4663 5229 ...
##  $ Gross   : num  263178 573381 834501 960401 1101576 ...
##  $ Day     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 6 7 1 2 3 4 5 6 7 1 ...

Update Data

The next issue is data falling out of relevance. If you run the above, I can guarantee that the data you have will be at least a week old. What does stale data mean? Think about opening your favorite weather resource and finding it is predicting rain for last Tuesday. That's useless, since it does not help you plan for your outdoor excursion tomorrow, and it's even worse if you know it did not rain last Tuesday; now your opinion of this source is tainted. You need to keep data up to date and relevant.

I could probably set things up so the data is already up to date whenever you pull it. I may do that eventually, but for now I think this is better. Included in the package is a function called mojoRef(), which takes care of refreshing the data for you. If you run it you can see the new dates streaming in as the data is pulled and cleaned to align with the current data frame, and the result can easily be appended to the rest of the data. If you wanted something autonomous, you could add your own line to the function to return nothing and instead overwrite the source data.

mvData <<- rbind(mvData, mojoRef())

mojoRef <- function() {
    library(chRonos)
    data(mvData)
    # Start from the day after the most recent day in the current data.
    start <- max(mvData$date) + 1
    # Pull everything from that date forward.
    return(mojo(start))
}

newData <- mojoRef()
## [1] "2013-05-16"
## [1] "2013-05-17"
## [1] "2013-05-18"
## [1] "2013-05-19"
## [1] "2013-05-20"
## [1] "2013-05-21"

What is really happening here is that we isolated what data we want, how we want it cleaned, and how that all has to happen. This only becomes possible once your data requirements have converged. It took some time for me to figure out exactly how I wanted the data stored and cleaned; only once the acquisition part was correct did I worry about refreshing it. Donald Knuth stated that premature optimization is the root of all evil. I agree, but I have an equally valid statement: premature automation is the root of all headaches. Once you really know what you want and you have some need to consume new data in a specified manner, then you can automate the extraction process.

Make Data Available

Above you saw how easy it was to get the data that I have made available on GitHub. I need to thank Hadley Wickham for his devtools package, which makes installing the package easy, as well as the people at GitHub for making the storage location so accessible.
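
If you want to do the same with your own data, dropping a cleaned data frame into a package's data directory is most of the work. Here is a minimal sketch, where the package name "myDataPkg" and the GitHub user name are placeholders:

# Save a cleaned data frame into a package's data/ directory so that others
# can load it with data() after installing the package.
save(mvData, file = "myDataPkg/data/mvData.rda")
# Once the package is pushed to GitHub, anyone can get it the same way as above:
# install_github("myDataPkg", "yourGitHubUser")
# library(myDataPkg); data(mvData)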

Data can be made available in other ways as well. Read my last post on basketball data to see another method, employed elsewhere, that made the acquisition stage easy on my part. Simply google 'open data' and you will find a vast array of current information on this problem. If you do some work that you think is relevant, you should try to give others access to it. I am not saying give away trade secrets or anything; make it open to your audience. That may mean ironing out the process for your coworkers, creating internal documentation, or writing something similar to this post. You will know the frustration of not doing so the first time you inherit a project where decisions were made based on some data but you have no way of validating that it still holds true. Have you ever heard of concept drift?

Reproducible

Making any analysis done on the data reproducible has been a nuisance in the past. There are many papers out there where you just have to take the author's word. There are also cases where, after the fact, it has been shown that the results were faked, so we can't trust everyone. If we make ourselves accountable, people have little choice but to believe us; they cannot merely say something is false. If the process is repeatable and reproducible, then they must redo your work, show where it falls short, and make sure their own work is reproducible too. Above I have shown just one way you can make data accessible, but in the process I have also given all of the code to do these things and tried to explain what I was doing and why. I will leave you with a small sample of analysis that I hope my next post will explore in more depth.

# Get rid of really old data.
mov <- mvData[year(mvData$date) > 2002, ]
# Movies that have a first day.
wholeMov <- unique(mov[mov$Day == 1, ]$name)
# Keep only the movies in that set.
mov <- mov[mov$name %in% wholeMov, ]

library(ggplot2)
# Let's look at one movie.
title <- "X-Men: First Class"
movie <- mov[mov$name == title, ]
qplot(Day, daily, data = movie, main = title)
# Zoom in on area of interest.
qplot(Day, daily, data = movie[1:40, ], main = title)

# The day of the week seems to have a big impact on sales. Let's break it
# out into a plot for each day.
ggplot(movie, aes(Day, daily)) + geom_point() + facet_grid(. ~ weekday)
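
If the weekday effect holds beyond this one title, a quick aggregate over the whole data set should show it as well. A small sketch, using the mvData frame loaded earlier:

# Median daily gross by day of the week, as a rough check on the weekday
# pattern seen in the facet plot above.
tapply(mvData$daily, mvData$weekday, median, na.rm = TRUE)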

Sunday, May 12, 2013

Getting Side-tracked with Basketball Data

The NBA playoffs are here. Recently I have been interested in analyzing sports data, and I found NBA data from older seasons to be the most accessible, so that is the data I used. The code below will take care of downloading and importing it for you. To start with I was not really sure what I wanted to look at; I just think it's interesting to explore new types of data in diverse areas. This set is pretty interesting in that it has a lot of shot data: who, when, the result, and most interesting, where, with the x and y coordinates of every shot. I thought it would be cool to look at hot spots of field goals. First though we need to pull the data.

# Load packages used.
library(rbenchmark)
library(plyr)

# This is useful for importing data.
options(stringsAsFactors = FALSE)

# Create dir for data files.
dir <- "BBdata"
dir.create(dir, showWarnings = FALSE)
temp <- tempfile()

# Location of the files.
url1 <- "http://www.basketballgeek.com/downloads/2008-2009.regular_season.zip"

# Take the base of the file name at this location.
file <- basename(url1)

# Download the file from the internet.
download.file(url1, file)

# Extract zipped contents to directory.
unzip(file, exdir = dir)

# The list of unzipped files.
fileList <- list.files(dir)

# Only the csv data files.
fileList <- fileList[grep(pattern = "csv", x = fileList)]

# Full location to the files.
fileList <- paste(dir, fileList, sep = "/")

Method 1: rbind

I want to explore this data as a whole season, so it all needs to be loaded into the workspace. The zip file comes with a csv file for each game played in the season, though. I tried to load each file and append it to a growing data set. I knew this would be slow, but I was in no rush; I was not aware it would take as long as it did though. You will see later just how slow. That got me thinking about the best way to do this.

The first method I tried was to simply read each file in and append it to a data frame. This is the easiest solution. Wrapping it in a function is just for benchmarking and profiling; it is an easy way to take imperative line-by-line instructions and call them later without copy and paste.

# Method which uses naive appending via rbind.
M1 <- function() {
    # Use rbind to append each file to the first.
    games <- read.csv(files[1], na.strings = "")
    # Drop first that was already loaded.
    file <- files[-1]
    # Loop over each remaining file and append it to the growing data frame.
    for (i in file) {
        tmpGame <- read.csv(i, na.strings = "")
        games <- rbind(games, tmpGame)
    }
    return(games)
}
files <- fileList[1:100]
system.time(games1 <- M1())[3]
## elapsed 
##   8.429

Method 2: preallocation

This method is pretty bad, and I may want to pull more seasons than just 2008-2009, which will make it even worse. Almost everywhere you look for advice on making R code faster you see two suggestions: vectorization for math operations and pre-allocation for data storage. Let's see how much pre-allocation helps with the task at hand.

First though, I want to create a useful function for initializing a data frame. This is a handy function I made that cleans up the look of preallocation when dealing with wide data frames: instead of verbosely listing out every variable, you just give it a vector of names and a number of rows and the preallocation is taken care of.

initDF <- function(name, row) {
    # String which starts the data frame instantiation.
    init <- "df <- data.frame("
    for (i in name) {
        init <- paste(init, i, " = rep(NA, ", row, "), ", sep = "")
    }
    init <- substr(init, 1, nchar(init) - 2)
    init <- paste(init, ")", sep = "")
    eval(parse(text = init))
    return(df)
}
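
As a quick sanity check on how it behaves, here is a small usage example with made-up column names:

# Three named columns, five rows, all filled with NA.
ex <- initDF(c("team", "player", "points"), 5)
dim(ex)    # 5 rows by 3 columns
names(ex)  # "team" "player" "points"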

# Method which uses preallocation.
M2 <- function() {
    # Read first file to get names in the data.
    game <- read.csv(files[1], na.strings = "")
    # The number of rows in the data.
    rows <- nrow(game)
    # It's hard to know this exactly beforehand, so be conservative with the guess.
    estRows <- ceiling(rows * length(files) * 1.2)
    # Preallocate the data frame.
    games <- initDF(names(game), estRows)
    # Initialize index.
    j <- 1
    # Loop over each file.
    for (i in files) {
        game <- read.csv(i, na.strings = "")
        # How many rows are in this data set.
        len <- nrow(game)
        # Insert these rows into their spots in the preallocated data frame.
        games[j:(j + len - 1), ] <- game
        # Increment the index
        j <- j + len
    }
    # Remove the excess from our conservative guess.
    games <- games[1:(j - 1), ]
    return(games)
}
files <- fileList[1:100]
system.time(games2 <- M2())[3]
## elapsed 
##   6.227

The first thing I notice here is that I have to guess the size of the data, which may not always be easy. Here I could have used the results from the first method, since I had already seen how many rows we would have; to make it more realistic I used an approximation and applied a fudge factor. The next problem is the hoops the code jumps through to index correctly, so as not to overwrite old data or leave gaps between appended sets. We did save some time, not much, but it does go faster. Was it worth it though, dealing with the indexing maze to gain a small speed improvement?

Method 3: do.call rbind

It appears there has been some discussion about this type of data wrangling. There is a better way to use rbind, via do.call: http://r.789695.n4.nabble.com/Concatenating-data-frame-td799749.html. This gives me another method to look into. It calls rbind once on the entire list instead of many times on growing pieces, and we also get rid of all of the indexing mess.


# Method which uses do.call, rbind all at once.
M3 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- do.call("rbind", g)
    return(games)
}
files <- fileList[1:100]
system.time(games3 <- M3())[3]
## elapsed 
##   1.228

Method 4: plyr's rbind.fill

Two things to note about the code itself: it is shorter than the pre-allocation method and there are no indices to maintain. But the best outcome is the speed, a good improvement. I could probably live with this approach, but I am intrigued by another similar post and its recommendation to use rbind.fill from the plyr package: http://r.789695.n4.nabble.com/Fast-dependable-way-to-quot-stack-together-quot-data-frames-from-a-list-td2532293.html. I have used it in the past when my sets may have different columns, since it nicely fills the missing columns with NA instead of throwing the error you get from regular rbind. Let's give it a try.

# Method which uses plyr, references say it is the fastest approach.
M4 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- rbind.fill(g)
    return(games)
}
files <- fileList[1:100]
system.time(games4 <- M4())[3]
## elapsed 
##    1.43

The code is still pretty clean; we basically just replaced one line. For the case of only the first 100 files it appears to be a little slower. Let's try each method on a larger set, and also check that everything is working as we expect.

Comparison


# Time the first 200
files <- fileList[1:200]

# Run a side by side comparison of each method.
benchmark(replications = rep(1, 1), M1(), M2(), M3(), M4(), columns = c("test", 
    "replications", "elapsed"))
##   test replications elapsed
## 1 M1()            1  21.349
## 2 M2()            1  12.610
## 3 M3()            1   3.274
## 4 M4()            1   3.599

# Check that each method works.
all(all(games1 == games2, na.rm = TRUE), all(games2 == games3, na.rm = TRUE), 
    all(games3 == games4, na.rm = TRUE))
## [1] TRUE

Method 5: Recursion

Thinking through what is happening here leaves me wondering if all of these methods are bad. In the cases where we use indexing we are getting beaten up by copy-on-modify: every time we add new rows we copy the whole data frame, due to immutability. The naive rbind approach is always looking for a larger section of contiguous memory and transporting everything there, so while the copies are small in the beginning, shuffling ever-increasing sets around keeps building up. The do.call and rbind.fill methods seem closer to the pre-allocation time; I want to see what they are doing under the hood before I say much about them. Knowing where some of the sluggishness comes from, I would imagine a divide and conquer approach would work better here. We can still use the naive rbind but never have to carry the whole data frame: only half of it once, a quarter of it twice, and so on. R deals with recursion by adding the environment to the call stack. In a normal function you don't worry about cleaning the local environment before you leave it, because it will be destroyed and only the return object moves on. With recursion, though, we add the whole environment to the stack and then jump to the next level. Due to the logarithmic nature of divide and conquer we don't have to worry about going infinitely deep, but we do have to worry about copying a lot of useless data at every new level of depth. So let's clean the environment of everything except what we actually care about.

# Recursive row binding: you don't carry huge sets around; it has divide and
# conquer memory usage, so large sets only appear at the top level of the
# recursion.
recurBind <- function(dList) {
    len <- length(dList)/2
    # Preallocate list for small improvement.
    data <- vector("list", len)
    j <- 1
    for (i in seq(len)) {
        # Merge each set of two sequential data sets together.
        data[[j]] <- rbind(dList[[(i * 2) - 1]], dList[[i * 2]])
        j <- j + 1
    }
    # In case length was odd, just add last set to the end.
    if (floor(len) != len) {
        data[[j]] <- dList[[len * 2]]
    }
    # Less data to store on the stack, tail call optimization would be nice
    # here.
    rm(dList, len, j)
    # Recursive call.
    if (length(data) > 1) {
        data <- recurBind(data)
    }
    return(data)
}

# Apply the recursive method.
M5 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- recurBind(g)[[1]]
    return(games)
}
files <- fileList[1:100]
system.time(games5 <- M5())[3]
## elapsed 
##   1.049

It's faster than all the other methods, although I actually expected it to be slower for this small case and only asymptotically faster once we take the rest of the files into consideration. Let's run the above test again and see how well it fares.

Comparison 2


# Time the first 500
files <- fileList[1:500]

# Run a side by side comparison of each method.
benchmark(replications = rep(1, 1), M1(), M2(), M3(), M4(), M5(), columns = c("test", 
    "replications", "elapsed"))
##   test replications elapsed
## 1 M1()            1  125.85
## 2 M2()            1   79.57
## 3 M3()            1   16.13
## 4 M4()            1   19.63
## 5 M5()            1    6.49

# Check that each method works.
all(games1 == games5, na.rm = TRUE)
## [1] TRUE

It seems I was right. The first two methods perform very poorly, so we can safely ignore them from here on out. To me this is a very interesting result: we used what is in reality the slowest approach possible, rbind, in a clever manner to get the fastest one. What if we applied some of the faster methods inside the recursion, or got rid of the recursion altogether with smart looping that is equivalent to the recursion but never has to recreate the local environment? Is there a way to mimic tail call optimization, or even compile this? I am sure there are even faster ways of doing this.
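
As one example of what that smart looping could look like, here is a sketch of the same pairwise binding done in a while loop, so no recursive call stack or local environments pile up. This is my own variation and I have not benchmarked it here; loopBind is a hypothetical name.

# Iterative version of the pairwise binding: keep merging adjacent pairs
# until only one data frame remains, with no recursive calls on the stack.
loopBind <- function(dList) {
    while (length(dList) > 1) {
        len <- length(dList)
        half <- len %/% 2
        merged <- vector("list", half + len %% 2)
        for (i in seq_len(half)) {
            # Merge each sequential pair of data frames.
            merged[[i]] <- rbind(dList[[2 * i - 1]], dList[[2 * i]])
        }
        # Carry an odd leftover element forward untouched.
        if (len %% 2 == 1) {
            merged[[half + 1]] <- dList[[len]]
        }
        dList <- merged
    }
    dList[[1]]
}

Swapping loopBind in for recurBind inside M5 should give the same result; whether it is actually faster would need to be measured.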

Basketball Analysis

What was all of this for again? Oh yeah, basketball.

files <- fileList
games <- M5()
made <- games[, c("team", "player", "etype", "result", "x", "y")]
made <- made[made$etype == "shot", ]
made <- made[made$result == "made", ]
made <- made[, c("result", "x", "y")]
made <- made[complete.cases(made), ]
made$result <- ifelse(made$result == "made", 1, 0)
x <- tapply(made$result, made[, c("x", "y")], sum)
x <- ifelse(is.na(x), 0, x)
image(log(x), col = rainbow(30))

missed <- games[, c("team", "player", "etype", "result", "x", "y")]
missed <- missed[missed$etype == "shot", ]
missed <- missed[missed$result == "missed", ]
missed <- missed[, c("result", "x", "y")]
missed <- missed[complete.cases(missed), ]
missed$result <- ifelse(missed$result == "missed", 1, 0)
y <- tapply(missed$result, missed[, c("x", "y")], sum)
y <- ifelse(is.na(y), 0, y)
image(log(y), col = rainbow(30))

This code is pretty gross; it is more or less hacking to get a few plots. I became so interested in importing the data in a fast and clean manner that I had little time to do anything with the actual data. At first glance I expected something like the first chart, but the second has me a little stumped. Maybe I will have time to look into this more later.

Does it Scale

Just to check how well this scales, let's run a few years. This adds the prior season.


# Data for the year prior.
url2 <- "http://www.basketballgeek.com/downloads/2007-2008.regular_season.zip"

# Take the base of the file name at this location.
file2 <- basename(url2)

# Download the file from the internet.
download.file(url2, file2)

# Extract zipped contents to directory.
unzip(file2, exdir = dir)

# The list of unzipped files.
newList <- list.files(dir)

# Only the csv data files.
newList <- newList[grep(pattern = "csv", x = newList)]

# Full location to the files.
newList <- paste(dir, newList, sep = "/")

length(newList)
## [1] 2359

# Time larger set
files <- newList
benchmark(replications = rep(1, 1), M3(), M4(), M5(), 
          columns = c("test", "replications", "elapsed"))
##   test replications elapsed
## 1 M3()            1  360.35
## 2 M4()            1  474.80
## 3 M5()            1   57.55