Sunday, May 26, 2013

Reproducibility Made Easy

A lot of the time the biggest hurdle to analyzing data and investigating something interesting is getting your hands on the data. You can look at my last post to find some basketball data and the steps needed to use it. Part of my interest there was in how the season progresses, but after thinking about ranking things over time, movies seemed a better fit. So my next hurdle was finding some data related to movies, particularly box office sales.

There are a few other issues I have wanted to explore recently as well. I have read a lot about the types of analysis people have done and what their results are. If the field is going to latch on to the term 'data science', one thing that needs to be addressed is verifiability and reproducibility, which are potentially the same thing, but for the sake of completeness I mention both. What I mean by these is that anybody should be able to read about the data I am using, the hypothesis I am making, the analysis employed, and the results found, and be able to verify that I am not full of shit.

I hope to address the problem of making data available. Another thing I find hard is keeping data up to date; to really verify my findings you should be able to run the analysis on data that is still relevant. Data can go stale fairly quickly if no method is in place to automate the process of pulling new data. The last issue is how to allow other people to reproduce the work you did. I hope to tackle all of these issues here, though some more than others.

Collect Data

The first step, getting data, is usually the hardest. It is usually the least documented as well. If you do get lucky and there is some data you can use, it often lacks any detail about how it was collected, when, from where, and by whom. Any questions you have simply become assumptions you are forced to make. In every data mining/data science project I've worked on this has been hard. Usually there are so many hurdles to jump through that merely having data counts as a success. Once you receive the data it is NOT and will NEVER be in the form that is most useful for your purpose. I have found that collecting it myself can sometimes be the easiest way to get data; that way you know all of the assumptions behind it. It does require a lot of work though, and I hope the underlying processes around data improve so that this is no longer the case.

Below you can see my initial attempt at finding some movie data. The data was collected from Box Office Mojo using the R XML package.

library(XML)
library(lubridate)
# This is useful for imported data.
options(stringsAsFactors = FALSE)

# Source where data comes from.  
site <- "http://www.boxofficemojo.com/daily/chart/?view=1day&sortdate=2013-03-15&p=.htm"

# Read data from that site.
siteData <- readHTMLTable(site)

# We need to investigate this to see what we really need. Looking at the site helps as well.
# We need to take the tenth element in this list.
mList <- readHTMLTable(site)[[10]]
# We need to remove some of the leading and trailing rows as well.
mList <- mList[4:(nrow(mList) - 1), ]
# Give the fields names.
names(mList) <- c("td", "yd", "Name", "Stu", "Daily", "peru", 
                  "perd", "num", "Avg", "Gross", "Day")
# Show a subset of the data.
mList[14:18, ]
##    td yd        Name    Stu    Daily  peru perd num  Avg        Gross Day
## 17 14 12  Life of Pi    Fox $328,783  +80% -17% 646 $509 $120,448,917 115
## 18 15 13     Quartet  Wein. $265,313  +72% -25% 688 $386  $14,162,205  64
## 19 16 14  Dark Skies W/Dim. $188,460  +45% -52% 703 $268  $16,290,971  22
## 20 17 15 Warm Bodies   LG/S $185,718  +72% -38% 659 $282  $64,196,628  43
## 21 18 18     Emperor  RAtt. $183,452 +119% -45% 311 $590   $1,586,515   8

Some cleaning was already done in pulling just the movie table out of the list and dropping the rows that are pure noise. We also had to clean up the column names a bit. Much more cleaning is required though. As with any data project, this falls into the 80/20 ratio: 80% of your time is spent cleaning and transforming the data into a usable format and 20% is spent doing actual analysis. All the interesting algorithms and visualizations are not applicable until the right amount of cleaning has happened. I often think of this task as data wrangling, and a person skilled in the craft as a data ninja. I have never seen a job posting for a data ninja, but the skill is crucial.
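
To give a flavor of the cleaning that is still needed, here is a minimal sketch using the column names assigned above; the helper names are just placeholders.

# Strip currency symbols and commas, and turn percentage strings into numbers.
cleanCurrency <- function(x) as.numeric(gsub("[$,]", "", x))
cleanPercent <- function(x) as.numeric(gsub("%", "", x)) / 100
mList$Daily <- cleanCurrency(mList$Daily)
mList$Gross <- cleanCurrency(mList$Gross)
mList$Avg <- cleanCurrency(mList$Avg)
mList$peru <- cleanPercent(mList$peru)
mList$perd <- cleanPercent(mList$perd)
mList$Day <- as.numeric(mList$Day)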

Please don't use the above code in a loop to pull more data: it would put a lot of stress on the site's server, it would take a while, I don't want the site to go away, and I have made a far easier way for you to get all of the data if you keep reading. If you just want the data you can use the following in R.

require(devtools)
## Loading required package: devtools
install_github("chRonos", "darrkj", quiet = TRUE)
## Installing github repo(s) chRonos/master from darrkj
## Installing chRonos.zip from
## https://github.com/darrkj/chRonos/archive/master.zip

suppressPackageStartupMessages(library(chRonos))
data(mvData)

str(mvData)
## 'data.frame':    196252 obs. of  13 variables:
##  $ date    : Date, format: "2009-07-17" "2009-07-18" ...
##  $ td      : chr  "12" "14" "11" "11" ...
##  $ yd      : chr  "-" "12" "14" "11" ...
##  $ name    : chr  "(500) Days of Summer" "(500) Days of Summer" "(500) Days of Summer" "(500) Days of Summer" ...
##  $ studio  : chr  "FoxS" "FoxS" "FoxS" "FoxS" ...
##  $ daily   : num  263178 310203 261120 125900 141175 ...
##  $ peru    : num  NA 0.18 -0.16 -0.52 0.12 -0.04 0.01 2.65 0.26 -0.18 ...
##  $ perd    : num  NA NA NA NA NA NA NA 0.9 1.02 0.96 ...
##  $ Theaters: num  27 27 27 27 27 27 27 85 85 85 ...
##  $ Avg     : num  9747 11489 9671 4663 5229 ...
##  $ Gross   : num  263178 573381 834501 960401 1101576 ...
##  $ Day     : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 6 7 1 2 3 4 5 6 7 1 ...

Update Data

The next issue is data falling out of relevancy. If you run the above I can guarantee that the data you have will be at least a week old. What does stale data mean? Think about opening your favorite weather resource and finding that it is predicting rain for last Tuesday. That is useless since it does not help you plan for your outdoor excursion tomorrow, and it is even worse if you know it did not rain last Tuesday; now your opinion of the source is tainted. You need to keep data up to date and relevant.

I could probably make a method to keep this data up to date whenever you pull it. I may do that eventually, but for now I think this is better. Included in the package is a function called mojoRef(). It takes care of refreshing the data for you. If you run it you can see the new dates streaming in as the data is pulled and cleaned to align with the current data frame. The result can easily be appended to the rest of the data. If you wanted something autonomous you could add your own line to the function to return nothing and instead overwrite the source data.

mvData <<- rbind(mvData, mojoRef())

mojoRef <- function() {
    library(chRonos)
    data(mvData)
    # Start from the day after the most recent day.
    start <- max(mvData$date) + 1
    # Pull everything from that date forward.
    return(mojo(start))
}

newData <- mojoRef()
## [1] "2013-05-16"
## [1] "2013-05-17"
## [1] "2013-05-18"
## [1] "2013-05-19"
## [1] "2013-05-20"
## [1] "2013-05-21"

What is really happening here is that we isolated what data we want, how we want it cleaned, and how that all has to happen. This is really only possible once your data requirements have converged. It took some time for me to figure out exactly how I wanted the data stored and cleaned; once the acquisition part was correct, then I worried about refreshing it. Donald Knuth stated that premature optimization is the root of all evil. I agree, and I have an equally valid statement: premature automation is the root of all headaches. Once you really know what you want and you have some need to consume new data in a specified manner, then you can automate the extraction process.
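
For what it is worth, a minimal sketch of what that automation could eventually look like: append the refresh and write the combined set to disk so a scheduled job can keep it current. The file name here is just an example.

# Append the newly pulled days and persist the result for later sessions.
mvData <- rbind(mvData, mojoRef())
saveRDS(mvData, file = "mvData.rds")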

Make Data Available

Above you saw how easy it was to get the data that I have made available on GitHub. I need to thank Hadley Wickham for his devtools package, which makes getting the package easy, as well as the people at GitHub for making the storage location very accessible.

Data can be made available in other ways as well. Read my last post on basketball data to see another method, employed elsewhere, that made the acquisition stage easy on my part. Simply google open data and you will find a vast array of current information on this problem. If you do some work that you think is relevant, you should try to give others access to it. I am not saying give away trade secrets or anything; make it open to your audience. That may mean ironing out the process for your coworkers, creating internal documentation, or writing something similar to this post. You will know the frustration of not doing so the first time you inherit a project where decisions were made based on some data but you have no way of validating that they still hold true. Have you ever heard of Concept Drift?

Reproducible

Making analysis reproducible has been a nuisance in the past. There are many papers out there where you just have to take the author's word. There are also cases where it was later shown that the results were faked, so we cannot simply trust everyone. If we make ourselves accountable, people have little choice but to believe us, or at least to engage: they cannot merely say something is false; if the process is repeatable and reproducible then they must redo the work, show where it breaks down, and make their own work reproducible in turn. Above I have shown just one way to make data accessible, and along the way I have also given all of the code to do these things and tried to explain what I was doing and why. I will leave you with a small sample of analysis that I hope my next post will explore in more depth.

# Get rid of really old data.
mov <- mvData[year(mvData$date) > 2002, ]
# Movies that have a first day.
wholeMov <- unique(mov[mov$Day == 1, ]$name)
# Create a list of movies in this set.
mov <- mov[mov$name %in% wholeMov, ]

library(ggplot2)
# Let's look at one movie.
title <- "X-Men: First Class"
movie <- mov[mov$name == title, ]
qplot(Day, daily, data = movie, main = title)
# Zoom in on area of interest.
qplot(Day, daily, data = movie[1:40, ], main = title)

# The day of the week seems to have a big impact on sales.  Let's break it
# out into a plot for each day.
ggplot(movie, aes(Day, daily)) + geom_point() + facet_grid(. ~ weekday)

Sunday, May 12, 2013

Getting Side-tracked with Basketball Data

The NBA playoffs are here. Recently I have been interested in analyzing sports data, and NBA data from older seasons turned out to be the most accessible, so that is what I used. The code below takes care of downloading and importing it for you. To start with I was not really sure what I wanted to look at; I just think it is interesting to explore new types of data in diverse areas. This set is pretty interesting in that it has a lot of shot data: who, when, the result and, most interesting, where. It has the x and y coordinates of every shot. I thought it would be cool to look at hot spots of field goals. First though we need to pull the data.

# Load packages used.
library(rbenchmark)
library(plyr)

# This is useful for importing data.
options(stringsAsFactors = FALSE)

# Create dir for data files.
dir <- "BBdata"
dir.create(dir, showWarnings = FALSE)
temp <- tempfile()

# Location of the files.
url1 <- "http://www.basketballgeek.com/downloads/2008-2009.regular_season.zip"

# Take the base of the file name at this location.
file <- basename(url1)

# Download the file from the internet.
download.file(url1, file)

# Extract zipped contents to directory.
unzip(file, exdir = dir)

# The list of unzipped files.
fileList <- list.files(dir)

# Only the csv data files.
fileList <- fileList[grep(pattern = "csv", x = fileList)]

# Full location to the files.
fileList <- paste(dir, fileList, sep = "/")

Method 1: rbind

I want to explore this data as a whole season, so it all needs to be loaded into the workspace. The zip file comes with a csv file for each game played in the season. I first tried to load each file and append it to a growing data set. I knew this would be slow but I was in no rush; I was not aware just how slow it would be though, as you will see later. That made me wonder what the best way to do this really is.

The first method I tried was to simply read each file in and append it to a data frame. This is the easiest solution. Wrapping it in a function is just for benchmarking and profiling; it is an easy way to take imperative line-by-line instructions and call them again later without copy and paste.

# Method which uses naive appending via rbind.
M1 <- function() {
    # Use rbind to append each file to the first.
    games <- read.csv(files[1], na.strings = "")
    # Drop first that was already loaded.
    file <- files[-1]
    # Loop over the remaining files and append each one to the growing data frame.
    for (i in file) {
        tmpGame <- read.csv(i, na.strings = "")
        games <- rbind(games, tmpGame)
    }
    return(games)
}
files <- fileList[1:100]
system.time(games1 <- M1())[3]
## elapsed 
##   8.429

Method 2: preallocation

This method is pretty bad. I may want to pull more seasons than just 2008-2009, which will make it even worse. Most everywhere you look on making R code faster you see two pieces of advice: vectorization for math operations and pre-allocation for data storage. Let's see how much these help with the task at hand.

First though I want to create a useful function for initializing a data frame. It is a handy function that cleans up the look of preallocation when dealing with wide data frames: you don't have to verbosely list out every variable. You just give it a vector of names and a number of rows and the preallocation is taken care of.

initDF <- function(name, row) {
    # String which starts the data frame instantiation.
    init <- "df <- data.frame("
    for (i in name) {
        init <- paste(init, i, " = rep(NA, ", row, "), ", sep = "")
    }
    init <- substr(init, 1, nchar(init) - 2)
    init <- paste(init, ")", sep = "")
    eval(parse(text = init))
    return(df)
}
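
For example, with some made-up column names, a call like this returns a five-row data frame full of NAs, ready to be filled in:

# Hypothetical columns, purely for illustration.
initDF(c("team", "player", "points"), 5)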

# Method which uses preallocation.
M2 <- function() {
    # Read first file to get names in the data.
    game <- read.csv(files[1], na.strings = "")
    # The number of rows in the data.
    rows <- nrow(game)
    # It's hard to know this exactly beforehand, so overestimate the guess a bit.
    estRows <- ceiling(rows * length(files) * 1.2)
    # Preallocate the data frame.
    games <- initDF(names(game), estRows)
    # Initialize index.
    j <- 1
    # Loop over each file.
    for (i in files) {
        game <- read.csv(i, na.strings = "")
        # How many rows are in this data set.
        len <- nrow(game)
        # Insert these rows into their spots in the preallocated data frame.
        games[j:(j + len - 1), ] <- game
        # Increment the index
        j <- j + len
    }
    # Remove the excess from our conservative guess.
    games <- games[1:(j - 1), ]
    return(games)
}
files <- fileList[1:100]
system.time(games2 <- M2())[3]
## elapsed 
##   6.227

The first thing I notice here is that I have to guess the size of the data, which may not always be easy. Here I could have used the result from the first method, since I had already seen how many rows there would be; to make it more realistic I used an approximation and applied a fudge factor. The next problem is that the code gets awkward with the hoops you jump through to index correctly, so that you neither overwrite old data nor leave gaps between appended sets. We did save some time, not much, but it does go faster. Was it worth it though, dealing with the indexing maze for a small speed improvement?

Method 3: do.call rbind

It appears there has been some discussion about this type of data wrangling. There is a better way to use rbind, via do.call: http://r.789695.n4.nabble.com/Concatenating-data-frame-td799749.html. This gives me another method to look into. It calls rbind once on the entire list instead of calling it repeatedly, and we also get rid of all of the indexing mess.


# Method which uses do.call, rbind all at once.
M3 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- do.call("rbind", g)
    return(games)
}
files <- fileList[1:100]
system.time(games3 <- M3())[3]
## elapsed 
##   1.228

Method 4: plyr's rbind.fill

Two things to note about the code itself: it is shorter than the pre-allocation method and there are no indices to maintain. But the best outcome is the speed, a good improvement. I could probably live with this approach, but I am intrigued by another similar post and its recommendation to use rbind.fill from the plyr package: http://r.789695.n4.nabble.com/Fast-dependable-way-to-quot-stack-together-quot-data-frames-from-a-list-td2532293.html. I have used rbind.fill in the past with sets that may have different columns, since it has the nice feature of filling the missing ones with NA instead of throwing the error that regular rbind gives. Let's give it a try.
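
As a tiny illustration of that fill-in behavior (toy data frames, nothing to do with the basketball set): plain rbind refuses to combine frames whose columns differ, while rbind.fill pads the gaps with NA.

a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))
# rbind(a, b) would throw an error because the column names do not match.
rbind.fill(a, b)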

# Method which uses plyr, references say it is the fastest approach.
M4 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- rbind.fill(g)
    return(games)
}
files <- fileList[1:100]
system.time(games4 <- M4())[3]
## elapsed 
##    1.43

The code is still pretty clean; we basically just replaced one line. For the case of only looking at the first 100 files it appears to be a little slower. Let's try each method on a larger set. We should also check that everything is working as we expect.

Comparison


# Time the first 200
files <- fileList[1:200]

# Run a side by side comparison of each method.
benchmark(replications = rep(1, 1), M1(), M2(), M3(), M4(), columns = c("test", 
    "replications", "elapsed"))
##   test replications elapsed
## 1 M1()            1  21.349
## 2 M2()            1  12.610
## 3 M3()            1   3.274
## 4 M4()            1   3.599

# Check that each method works.
all(all(games1 == games2, na.rm = TRUE), all(games2 == games3, na.rm = TRUE), 
    all(games3 == games4, na.rm = TRUE))
## [1] TRUE

Method 5: Recursion

Thinking through what is happening here leaves me wondering if all of these methods are bad. In the cases where we use indexing we are getting beaten up by copy-on-modify: every time we add new rows the whole data frame gets copied, due to immutability. The naive rbind approach is always looking for a larger chunk of contiguous memory and moving everything there, so while the set starts out small, copying ever-growing sets around keeps adding up. The do.call and rbind.fill methods come closer to the pre-allocation time; I want to see what they are doing under the hood before saying much more about them.

Knowing where some of the sluggishness comes from, I would imagine a divide and conquer approach would work better. We can still use the naive rbind but never have to carry the whole data frame around: at most half of it once, a quarter of it twice, and so on. R deals with recursion by adding the environment to the call stack. In a normal function you don't worry about cleaning the local environment before you leave it, because it is destroyed and only the return object moves on. With recursion, though, the whole environment goes onto the stack before we jump to the next level. Due to the logarithmic nature of divide and conquer we don't have to worry about going infinitely deep, but we do have to worry about copying a lot of useless data at every new level of depth. So let's clean the environment of everything except what we actually care about.

# Recursive row binding: you never carry huge sets around, memory usage is
# divide and conquer, and large sets only appear at the top levels of the
# recursion.
recurBind <- function(dList) {
    len <- length(dList)/2
    # Preallocate list for small improvement.
    data <- vector("list", len)
    j <- 1
    for (i in seq(len)) {
        # Merge each set of two sequential data sets together.
        data[[j]] <- rbind(dList[[(i * 2) - 1]], dList[[i * 2]])
        j <- j + 1
    }
    # In case length was odd, just add last set to the end.
    if (floor(len) != len) {
        data[[j]] <- dList[[len * 2]]
    }
    # Less data to store on the stack, tail call optimization would be nice
    # here.
    rm(dList, len, j)
    # Recursive call.
    if (length(data) > 1) {
        data <- recurBind(data)
    }
    return(data)
}

# Apply the recursive method.
M5 <- function() {
    # Initialize list that will store each loaded file.
    g <- vector("list", length(files))
    # Initialize the index.
    j <- 1
    # Loop over all of the files.
    for (i in files) {
        g[[j]] <- read.csv(i, na.strings = "")
        j <- j + 1
    }
    games <- recurBind(g)[[1]]
    return(games)
}
files <- fileList[1:100]
system.time(games5 <- M5())[3]
## elapsed 
##   1.049

It is faster than all the other methods. I was actually expecting it to be slower for this small case and only asymptotically faster once the rest of the files were taken into account. Let's run the test above again and see how well it fares.

Comparison 2


# Time the first 500
files <- fileList[1:500]

# Run a side by side comparison of each method.
benchmark(replications = rep(1, 1), M1(), M2(), M3(), M4(), M5(), columns = c("test", 
    "replications", "elapsed"))
##   test replications elapsed
## 1 M1()            1  125.85
## 2 M2()            1   79.57
## 3 M3()            1   16.13
## 4 M4()            1   19.63
## 5 M5()            1    6.49

# Check that each method works.
all(games1 == games5, na.rm = TRUE)
## [1] TRUE

It seems I was right. The first two methods perform very poorly, so we can safely ignore them from here on out. To me this is a very interesting result: we took what is in reality the slowest approach possible, rbind, and used it in a clever manner to get the fastest approach. What if we applied some of the faster methods inside the recursion, or got rid of the recursion altogether with smart looping that is equivalent to recursion but never has to recreate the local environment? Is there a way to mimic tail call optimization, or even compile this? I am sure there are even faster ways of doing it.
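
As a rough sketch of that last idea, here is the same pairwise merging written as a plain loop, so nothing piles up on the call stack; I have not benchmarked it here, so treat it as a starting point rather than a proven faster method.

# Iterative version of the pairwise rbind idea.
loopBind <- function(dList) {
    while (length(dList) > 1) {
        len <- length(dList)
        half <- len %/% 2
        merged <- vector("list", half + len %% 2)
        for (i in seq_len(half)) {
            # Merge each sequential pair of data frames.
            merged[[i]] <- rbind(dList[[2 * i - 1]], dList[[2 * i]])
        }
        # If the length was odd, carry the last element forward untouched.
        if (len %% 2 == 1) {
            merged[[half + 1]] <- dList[[len]]
        }
        dList <- merged
    }
    return(dList[[1]])
}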

Basketball Analysis

What was all of this for again? Oh yeah, basketball.

files <- fileList
games <- M5()
made <- games[, c("team", "player", "etype", "result", "x", "y")]
made <- made[made$etype == "shot", ]
made <- made[made$result == "made", ]
made <- made[, c("result", "x", "y")]
made <- made[complete.cases(made), ]
made$result <- ifelse(made$result == "made", 1, 0)
x <- tapply(made$result, made[, c("x", "y")], sum)
x <- ifelse(is.na(x), 0, x)
image(log(x), col = rainbow(30))

missed <- games[, c("team", "player", "etype", "result", "x", "y")]
missed <- missed[missed$etype == "shot", ]
missed <- missed[missed$result == "missed", ]
missed <- missed[, c("result", "x", "y")]
missed <- missed[complete.cases(missed), ]
missed$result <- ifelse(missed$result == "missed", 1, 0)
y <- tapply(missed$result, missed[, c("x", "y")], sum)
y <- ifelse(is.na(y), 0, y)
image(log(y), col = rainbow(30))

This code is pretty gross; it is more or less hacking to get a few plots. I became so interested in importing the data in a fast and clean manner that I had little time to do anything with the actual data. At first glance I expected to see the first chart, but the second has me a little stumped. Maybe I will have time to look into it more later.

Does it Scale

Just to check how well this scales, let's run a couple of seasons. This adds the prior season.


# Data for the year prior.
url2 <- "http://www.basketballgeek.com/downloads/2007-2008.regular_season.zip"

# Take the base of the file name at this location.
file2 <- basename(url2)

# Download the file from the internet.
download.file(url2, file2)

# Extract zipped contents to directory.
unzip(file2, exdir = dir)

# The list of unzipped files.
newList <- list.files(dir)

# Only the csv data files.
newList <- newList[grep(pattern = "csv", x = newList)]

# Full location to the files.
newList <- paste(dir, newList, sep = "/")

length(newList)
## [1] 2359

# Time larger set
files <- newList
benchmark(replications = rep(1, 1), M3(), M4(), M5(), 
          columns = c("test", "replications", "elapsed"))
##   test replications elapsed
## 1 M3()            1  360.35
## 2 M4()            1  474.80
## 3 M5()            1   57.55

Sunday, April 21, 2013

Trying Out Shiny with Cellular Automatons

A while back I created an implementation of the elementary cellular automaton from Stephen Wolfram's book A New Kind of Science. It was done using the manipulate package, which allows you to have a very simple interactive interface within the RStudio IDE.

A little background before we dive in may be useful if you are not familiar with the concept. The premise is a two dimensional grid where each cell can be in one of two states, on or off. The top row is initialized with the center cell on and every other cell off. The state of each cell in a row is determined by the row directly above it: a cell looks at the cell directly above and that cell's two diagonal neighbors, so it depends on three parent cells. Since each of those three cells can be in one of two states, there are 2 to the third power, or eight, parent patterns a child can react to. If each of those eight patterns can independently map to an alive or dead child, we end up with 2 to the eighth power, or 256, different rules we can follow. Each of these rules leads to very different outcomes. Below you can see one of these 256 sets of rules.

For every child here we could have alive or dead for each of the eight combinations of parents. Now we can construct an image of this model. To do so we need a few functions to handle the required computations. First we create a function that captures the state of a child's parents and returns a value from one to eight to indicate the specific pattern. We then compile this function to make it faster, about 3x to 5x faster.

cellEval <- function(parents) {
    if (all(parents == c(1, 1, 1))) {
        child <- 1
    } else if (all(parents == c(1, 1, 0))) {
        child <- 2
    } else if (all(parents == c(1, 0, 1))) {
        child <- 3
    } else if (all(parents == c(1, 0, 0))) {
        child <- 4
    } else if (all(parents == c(0, 1, 1))) {
        child <- 5
    } else if (all(parents == c(0, 1, 0))) {
        child <- 6
    } else if (all(parents == c(0, 0, 1))) {
        child <- 7
    } else if (all(parents == c(0, 0, 0))) {
        child <- 8
    }
    return(child)
}
require(compiler, quietly = T)
# Compile this function for speed.
ccEval <- cmpfun(cellEval)
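
A rough way to check that 3x to 5x claim (timings will vary by machine; this just reuses the rbenchmark package from elsewhere on this blog):

# Compare the interpreted and compiled versions on a single parent pattern.
library(rbenchmark)
p <- c(1, 0, 1)
benchmark(replications = 100000, cellEval(p), ccEval(p),
          columns = c("test", "replications", "elapsed"))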

Next we need to define a grid to contain each cell and apply this function to each cell below the first row. This function needs a parameter so that we can change the size of the grid. We also need to pass it a rule, that is, what to assign the child based on its parents. The rule we give here is just an integer, interpreted through its binary representation.

  • 00000000 -> Rule 0
  • 00000001 -> Rule 1
  • 00000010 -> Rule 2
  • 11111111 -> Rule 255

So this integer is broken back down into a rule that specifies the state of the child based on its parents. We also force the length of a row to be odd so that there is a center cell. We construct the grid as a matrix using the size argument for one dimension and roughly double that, made odd, for the other. We then set the midpoint of the first row to one and transform the rule to binary as mentioned above. You can investigate the details of the function, but the purpose here is to explore the interface, so I will leave that to the reader.
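
As a quick sanity check of that conversion, you can inspect the binary expansion the function relies on directly; rule 30 is used as an example here.

# The eight low-order bits of rule 30, most significant first.
rev(as.integer(intToBits(30))[1:8])
## [1] 0 0 0 1 1 1 1 0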

You may notice the cmpfun function as well; it is used to compile functions. If you take a look at my previous post you will see some discussion of vectorizing and compiling, namely that for loops are slow and you should vectorize when feasible. The problem here is that it is hard to vectorize a sequential problem: we do not have a full vector at the beginning, because each row depends on the one above it. Since for loops are slow and we cannot easily vectorize this, compiling the functions is one way to get some speedup.

cellAut <- function(rule, size = 101) {
    x1 <- size * 2 + 1
    grid <- matrix(0, nrow = x1, ncol = size, byrow = FALSE)

    mid <- ceiling(x1/2)
    grid[mid, 1] <- 1
    d <- rev(sapply(rule, function(x) {
        as.integer(intToBits(x))
    })[1:8])

    val <- mid + x1 - 2
    for (x in val:length(grid)) {
        grid[x] <- d[ccEval(grid[(x - (x1 + 1)):(x - (x1 - 1))])]
    }
    grid <- grid[, ncol(grid):1]
    image(1 - grid, axes = F)
}
cAut <- cmpfun(cellAut)

Below is one of the more interesting rules, Rule 30. It demonstrates that complexity can arise from a very small set of 'simple' rules. I place the emphasis on simple because the outcome may seem complex, yet the rules are static while the result looks very dynamic. Even though the rules are static and deterministic, it is hard to say what will happen a few rows down without actually going through each step. You can also see dramatic changes by changing the rule; switching to Rule 100 gives a result that is very boring.

cAut(30, 101)
cAut(100, 101)

If you load the manipulate package you can explore these rules interactively by running the following code.

require(manipulate)
manipulate(cAut(Rule, Size), Rule = slider(1, 256), Size = slider(11, 201))

I want to use Shiny to provide this same interface in the browser. After reading parts of the wonderful tutorial it seemed pretty straightforward. The only things needed are a ui.R file and a server.R file. The cellEval (ccEval) and cellAut (cAut) functions above need to be visible as well. This can be done by having the functions in the workspace, sourcing the file within the server.R code, or adding the function definitions to the server.R file. The code below assumes they are in the workspace; this is cleaner for the post but probably the least robust option.

First let's see the contents of the ui.R file.

shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Elementary Cellular Automata"),

  # Sidebar with a slider input for number of observations
  sidebarPanel(
    numericInput("size", "Size of Grid:", 51),
    numericInput("rule", "Rule:", 30)
  ),

  # Show a plot of the generated distribution
  mainPanel(
    plotOutput("CAPlot")
  )
))

Basically all that is happening here is that we are using the shinyUI function with three parts:

  • headerPanel
  • sidebarPanel
  • mainPanel

The headerPanel just gets a string that is displayed in the browser tab and as the title on the page. The sidebarPanel gets two numericInputs; their arguments are the name of the variable we will refer to in server.R, the text displayed to tell the user what it does, and an initial value. In the mainPanel we specify that we will show a plotOutput and that it comes from CAPlot.

The contents of the server.R file can be seen below.

shinyServer(function(input, output) {
    output$CAPlot <- renderPlot({
        cAut(input$rule, input$size)
    })
})

This is even easier and more concise than the ui.R file. Inside the shinyServer function we set output$CAPlot, which is what the mainPanel of ui.R asks for, using the renderPlot function. The renderPlot call simply runs our cAut function with the two input values. The reactive nature is really cool as well: any time the user changes the rule or the size, the plot is automatically updated.

To make all of this work you just need to put the server.R and ui.R files in a directory, say test, and run the following from the directory above it.

require(shiny)

runApp("test")

You can see a screenshot of the result below. I wish I could embed the live result here, but I am not yet sure how to do that. I hope to try out some other features of this package in the future, and I hope this shows how simple it is to make something with Shiny.

Saturday, April 13, 2013

Speeding up R Functions

Sometimes writing code in a matrix based language can be awkward. There are situations where you have to write code in a vectorized manner for it to be efficient, which makes it look different and forces you to think differently. In this post I want to present an example of vectorizing code useful to data mining and show how much better it performs. I also want to take a naive look at the somewhat newer compiler package in R.

library(rbenchmark)
library(rpart)
library(compiler)

# Useful for data imports.
options(stringsAsFactors = FALSE)

# Useful function for getting an average time.
timer <- function(fun, iters, ...) {
    system.time(for (n in seq(iters)) fun(...))[3]/iters
}

Model Development

The data used here is referred to as the adult data set, a common benchmark for machine learning algorithms. Commas were added to make it a csv. It is a binary decision problem: you have lots of independent variables and want to train a model to predict a dependent variable that has only two possible outcomes. R offers a variety of methods for training this kind of model; I am going to use the rpart package to build a decision tree.

adult.train <- read.csv("../../data/adulttrain.csv", na.strings = "?")

# Change dichotomous dependent variable to binary.
adult.train$y <- as.character(adult.train$y)
adult.train$y <- ifelse(adult.train$y == "<=50K", 0, 1)

# Set random seed to make repeatable.
set.seed(2)

# Number of observations in training set.
size <- nrow(adult.train)
n <- ceiling(size * 0.75)
z <- sample(size)

# Partition to train and test set.
train <- adult.train[z[1:n], ]
test <- adult.train[z[(n + 1):size], ]

# Create decision tree model on training set.
rmod <- rpart(factor(y) ~ ., train, control = rpart.control(minsplit = 2, cp = .0005))
# Prune the tree to reduce overfit.
rmod <- prune(rmod, cp = 0.0011)

# Evaluate model on test set.
pred <- predict(rmod, newdata = test, type = "prob")[, 2]

# Simple check of accuracy.
value <- ifelse(pred > 0.5, 1, 0)
accuracy <- sum(test$y == value)/length(value)

Accuracy is a very simple way to measure the quality of the model's predictions on data it was not trained on, but there is a lot it cannot tell us. Many real world problems weigh one outcome more heavily than another. In medical diagnosis, for instance, a false negative, saying something is not cancer when it is, can have severe consequences. You may not care that the accuracy is high because you need to focus on catching the positive cases correctly; if only five percent of cases are positive, you could reach 95% accuracy by predicting everything as negative. With mortgages the opposite is true: you do not want a false positive, approving a mortgage that is unlikely to be paid back (at least in theory). To determine how well the model works given these considerations we need to construct some new scoring mechanisms.
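
A toy example of that imbalance point: with five percent positives, a 'model' that always predicts the negative class already scores 95% accuracy.

# Hypothetical labels: 5 positives, 95 negatives, and an all-negative prediction.
actual <- c(rep(1, 5), rep(0, 95))
predicted <- rep(0, 100)
sum(actual == predicted) / length(actual)
## [1] 0.95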

Methods under consideration

Version 1: Naive Looping Approach

Most methods of evaluating how well a supervised model works depend on four quantities:

  • Number of true positives
  • Number of true negatives
  • Number of false positives
  • Number of false negatives

The following function computes these in a straightforward way. This is how many programmers coming from an imperative background would approach the task, and it is also very logical; thinking through how to solve the problem, you would probably end up with something similar.

params1 <- function(act, pred, cutoff = 0.5) {
    pred <- ifelse(pred > cutoff, 1, 0)
    tp <- 0
    tn <- 0
    fp <- 0
    fn <- 0
    for (i in 1:length(pred)) {
        if (act[i] == 1 && pred[i] == 1) {
            tp <- tp + 1
        } else if (act[i] == 0 && pred[i] == 0) {
            tn <- tn + 1
        } else if (act[i] == 1 && pred[i] == 0) {
            fn <- fn + 1
        } else if (act[i] == 0 && pred[i] == 1) {
            fp <- fp + 1
        }
    }
    return(list(tp = tp, fp = fp, fn = fn, tn = tn))
}

# Look at how well we did.
tmp <- params1(test$y, pred)
accuracy2 <- (tmp$tp + tmp$tn)/(tmp$tp + tmp$tn + tmp$fp + tmp$fn)

# Sanity check.
accuracy == accuracy2
## [1] TRUE

Next we create a function to calculate the ROC (receiver operating characteristic) curve. This will let us see how much better than random guessing the model performs. There is a lot that can be improved in both of these functions, and we will look at some of it in the later sections.

roc1 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params1(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

One thing you can see here is that these loops become rather verbose. There are a lot of increment operations happening: setting a variable equal to itself plus one. It would be cleaner if there were a ++ increment operator like in most other imperative languages. Part of the reason R has no increment (or decrement) operator is a side effect of it being an interpreted language: for loops are slow, so the loop-heavy style where increment operators usually live is discouraged anyway.
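
As a brief aside, you can approximate an increment with a replacement function; this is a curiosity more than a recommendation, and it does nothing for speed.

# A toy '++' stand-in via a replacement function.
`inc<-` <- function(x, value = 1) x + value
j <- 1
inc(j) <- 1
j
## [1] 2

With that out of the way, let's see how slow these loops actually are.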

r <- roc1(test$y, pred)
plot(r[, 1], r[, 2], ann = FALSE)
title(xlab = "False Positive Rate", ylab = "True Positive Rate")
# How long does this take?
system.time(roc1(train$y, pred, gran = 0.1))
##    user  system elapsed 
##    0.54    0.00    0.54
system.time(roc1(train$y, pred, gran = 0.01))
##    user  system elapsed 
##    5.11    0.00    5.13

# Kind of slow, but lets get a better estimate.
Time10a <- timer(roc1, 10, train$y, pred, gran = 0.1)
Time100a <- timer(roc1, 10, train$y, pred, gran = 0.01)
# I don't want to wait for this one to go through a loop.
Time1000a <- system.time(roc1(train$y, pred, gran = 0.001))[3]

This is not a rigorous asymptotic analysis, but there is almost a 10x jump from 0.557 to 5.1 seconds, and the 100x case gives 50.87, so the growth still looks linear.

Can we make this faster? Neither function follows all the idioms for fast R code. The roc function does pre-allocate s1 and s2, but it falls short by using a for loop and cbind instead of full pre-allocation. That loop is small though, and the cbind only happens once; calls to cbind (or rbind) are expensive when they happen inside a loop, and loops are expensive in general. I have a good guess as to what is making it slow, and maybe later I will do some actual profiling to find the hot spots. The heart of roc1 is params1, since basically all roc1 does is call it, so let's start there. Nothing is growing or being appended, but there is one big for loop being used to count the type of each observation. Let's vectorize this.

Version 2: Vectorized Operations


tpos <- params1(test$y, pred, 0.5)$tp
# We have 3490 true positives at a cutoff of .5.  These are the actual
# predicted classes with that threshold.
binPred <- ifelse(pred > 0.5, 1, 0)
acc1 <- sum(test$y == binPred)/length(binPred)
# We did this calculation earlier; it counted the cases where the actual
# value equaled the predicted value, in a vectorized manner.  We can take a
# similar approach for each case since they are binary.
tpos1 <- sum(test$y & binPred)
all(tpos1 == tpos)
## [1] TRUE

It worked. Now we can apply binary negations to calculate all of the rest of the cases, and these methods should be much faster. Let's create a vectorized version of the function.

params2 <- function(act, pred, cutoff = 0.5) {
    pred <- ifelse(pred > cutoff, 1, 0)
    tp <- sum(act & pred)
    tn <- sum(!act & !pred)
    fp <- sum(!act & pred)
    fn <- sum(act & !pred)
    return(list(tp = tp, fp = fp, fn = fn, tn = tn))
}

roc2 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params2(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

# How long does this take
system.time(roc2(test$y, pred, gran = 0.1))
##    user  system elapsed 
##    0.06    0.00    0.06
system.time(roc2(test$y, pred, gran = 0.01))
##    user  system elapsed 
##     0.5     0.0     0.5
# It already seems much faster
Time10b <- timer(roc2, 10, test$y, pred, gran = 0.1)
Time100b <- timer(roc2, 10, test$y, pred, gran = 0.01)

# It looks like we have a 10x gain in speed
Time1000b <- system.time(roc2(test$y, pred, gran = 0.001))[3]

This took 0.056, 0.519 and 5.18 seconds for the ten, one hundred, and one thousand iteration versions respectively.

Yep. Now the for loop in roc could be what is holding us back, but it is a fairly short loop and we will overlook it here. This code does look odd though, especially if you are not used to R or don't come from a MATLAB/Octave background. It can be very non-intuitive for anyone coming from something like C++ or Java who is using R for modeling, and it can also be strange for those new to R coming from a background like SAS (without IML).

Version 3: Compiled Function

The compiler library, a newer addition to R, may help. It can let you write code that is not vectorized and still get some speed improvement.

params3 <- cmpfun(params1)

roc3 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params3(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

You can try compiling the roc function as well, but it provides very little improvement. This is pretty interesting: we did not have to rewrite anything, we just compiled the function. That can help those not used to thinking in terms of vectorization, the cases where vectorizing is hard or impossible, and the cases where vectorized code takes a little longer to understand.

Is it any faster though? We need to see how long it takes before we can say it has any merit.

system.time(roc3(test$y, pred, gran = 0.1))
##    user  system elapsed 
##    0.17    0.00    0.18
system.time(roc3(test$y, pred, gran = 0.01))
##    user  system elapsed 
##     1.7     0.0     1.7

# It already seems much faster than the looping method.
Time10c <- timer(roc3, 10, test$y, pred, gran = 0.1)
Time100c <- timer(roc3, 10, test$y, pred, gran = 0.01)
Time1000c <- system.time(roc3(test$y, pred, gran = 0.001))[3]

This took 0.186, 1.706 and 16.88 seconds for the ten, one hundred, and one thousand iteration versions respectively. It looks like we get almost a 3x gain in speed over the baseline method, though it is still slower than the vectorized version. I am sure this will improve over time, as the functionality is fairly new. Java was thought to be slow 15 years ago, but continual work on its bytecode compiler has led to so many improvements that it is now considered fast. The same kinds of improvements could appear in R in the future, giving further speedups.

Version 4: Compiling the Vectorized Version


# Can we compile the vectorized for speed improvement?
params4 <- cmpfun(params2)

roc4 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params4(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

# How long does this take
system.time(roc4(test$y, pred, gran = 0.1))
##    user  system elapsed 
##    0.05    0.00    0.05
system.time(roc4(test$y, pred, gran = 0.01))
##    user  system elapsed 
##    0.53    0.00    0.54

Time10d <- timer(roc4, 10, test$y, pred, gran = 0.1)
Time100d <- timer(roc4, 10, test$y, pred, gran = 0.01)
Time1000d <- system.time(roc4(test$y, pred, gran = 0.001))[3]

Results

Below is the final table. We can see that we do not get any additional improvement from compiling the vectorized version. More runs would give more accurate results, and there could be other improvements to each case: I could compile or vectorize the roc functions themselves, there may be more places for pre-allocation, and there could be better ways to vectorize; it would be interesting if there were a way to compute all four of [tp, tn, fp, fn] in one pass. There are also additional features of the compiler, like enableJIT(), that may help; a brief sketch of that appears after the timings below. Before looking into any of these, though, it may be useful to explore some of R's profiling mechanisms, so that effort can be focused where it will make the biggest impact.

lst <- c(Time10a, Time100a, Time1000a, Time10b, Time100b, Time1000b, Time10c, 
    Time100c, Time1000c, Time10d, Time100d, Time1000d)

times <- matrix(lst, nrow = 3, ncol = 4, byrow = FALSE, dimnames = list(c("10 Iters", 
    "100 Iters", "1000 Iters"), c("Baseline", "Vectorized", "Compiled", "Vect + Comp")))
times
##            Baseline Vectorized Compiled Vect + Comp
## 10 Iters      0.557      0.056    0.186       0.054
## 100 Iters     5.100      0.519    1.706       0.530
## 1000 Iters   50.870      5.180   16.880       5.240


benchmark(roc1(test$y, pred, gran = 0.1), roc2(test$y, pred, gran = 0.1), roc3(test$y, 
    pred, gran = 0.1), roc4(test$y, pred, gran = 0.1))
##                             test replications elapsed relative user.self
## 1 roc1(test$y, pred, gran = 0.1)          100   52.02    9.223     51.97
## 2 roc2(test$y, pred, gran = 0.1)          100    5.66    1.004      5.66
## 3 roc3(test$y, pred, gran = 0.1)          100   19.12    3.390     19.11
## 4 roc4(test$y, pred, gran = 0.1)          100    5.64    1.000      5.64
##   sys.self user.child sys.child
## 1        0         NA        NA
## 2        0         NA        NA
## 3        0         NA        NA
## 4        0         NA        NA
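
One follow-up mentioned above but not benchmarked here: the compiler package can also just-in-time compile functions as they are used. A minimal sketch of turning that on for a session:

# Enable byte-code JIT compilation for the whole session (3 is the most aggressive level).
library(compiler)
enableJIT(3)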

Conclusion

Now we have the machinery to compute all sorts of model quality scores in an efficient manner. We can use the ROC curve to show the trade-offs of a model: its ability to capture all of the true positives versus how many false positives it lets through. We can also aggregate the four outcomes in numerous ways to build scoring methods for various scenarios; a few of these are precision and recall, and sensitivity and specificity.
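
As a pointer, here is a minimal sketch of how a couple of those scores fall straight out of the same four counts, reusing the params2 function defined above.

# Precision/recall and specificity from the confusion counts at a 0.5 cutoff.
x <- params2(test$y, pred, cutoff = 0.5)
precision <- x$tp / (x$tp + x$fp)
recall <- x$tp / (x$tp + x$fn)  # recall is the same quantity as sensitivity
specificity <- x$tn / (x$tn + x$fp)
c(precision = precision, recall = recall, specificity = specificity)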