Randomly Distributed Thoughts: April 2013

Sunday, April 21, 2013

Trying Out Shiny with Cellular Automatons

Try Out Shiny

I created an implementation of the Elementary Cellular Automaton from Stephen Wolfram's book A New Kind of Science a while back. This was done using the Manipulate package. It allows you to have a very simple interactive interface within the RStudio IDE.

A little background before we dive in may be useful if you are not familiar with this concept. The premise is that we have a two dimensional grid, each cell can be in one of two states, on or off. The top row is initialized with the center cell being on and all other cells being off. The state of each cell in a row is determined by the state of the cells in the row directly above that row. A given cell determines its state by looking at the cell directly above it and the diagonals. Thus it is dependent on three cells. Since each of these three cells can be in one of two states, we have 2 to the third power of permutations or eight ways the child can determine its state based on its three parents. If we take each of these eight permutations as being able to define the alive or dead state of the child we end up with 2 to the eighth power of permutations or 256 different rules we can follow. Each of these rules leads to very different outcomes. Below you can see one of these 256 sets of rules.

For every child here we could have alive or dead for each of the eight combinations of parents. Now we shall proceed to construct an image for this model. To do this we need to create a few functions that will handle the computations required. First we create function that will capture the state of a child's parents and return a value to indicate the specific sequence. We denote this sequence of the parent cells by return a value of one to eight. We then compile this function to make it faster, about 3x to 5x faster.

cellEval <- function(parents) {
    if (all(parents == c(1, 1, 1))) {
        child <- 1
    } else if (all(parents == c(1, 1, 0))) {
        child <- 2
    } else if (all(parents == c(1, 0, 1))) {
        child <- 3
    } else if (all(parents == c(1, 0, 0))) {
        child <- 4
    } else if (all(parents == c(0, 1, 1))) {
        child <- 5
    } else if (all(parents == c(0, 1, 0))) {
        child <- 6
    } else if (all(parents == c(0, 0, 1))) {
        child <- 7
    } else if (all(parents == c(0, 0, 0))) {
        child <- 8
    }
    return(child)
}
require(compiler, quietly = T)
# Compile this functon for speed.
ccEval <- cmpfun(cellEval)

Next we need to define a grid to contain each cell and apply this function to each cell below the first row. This function needs a parameter so that we can changes the size of the grid. We also need to pass this function a rule, what to assign the child based on the parents. The rule we give here is just an integer, similar to how an integer is represented in binary.

00000000 -> 1
00000001 -> 2
00000010 -> 3
…
11111111 -> 256

So this integer will be broken back down into a rule to specify the state of the child based on parents. We also force the length of a row to be odd so that we have a center cell. We construct the grid by using a matrix with the size argument as the height and the length of the row is this doubled and made odd. We then set the midpoint of the first row to one. We also transform the rule to binary as mentioned above. You can investigate the details of this function but the purpose here is to explore the interface so I will leave this to the reader.

You may notice the cmpfun function as well, this is used to compile functions. If you take a look at my previous post you will see some discussion of vectorizing and compiling, namely that using for loops is bad and you should vectorize when feasible. The problem here is that it is hard to vectorize problems that are sequential, we don't have a full vector at the beginning. Vectorization is hard when things are sequentially dependent. For loops are slow, very slow, and we can't easily vectorize this so to get some speedup we can compile these functions.

cellAut <- function(rule, size = 101) {
    x1 <- size * 2 + 1
    grid <- matrix(0, nrow = x1, ncol = size, byrow = FALSE)

    mid <- ceiling(x1/2)
    grid[mid, 1] <- 1
    d <- rev(sapply(rule, function(x) {
        as.integer(intToBits(x))
    })[1:8])

    val <- mid + x1 - 2
    for (x in val:length(grid)) {
        grid[x] <- d[ccEval(grid[(x - (x1 + 1)):(x - (x1 - 1))])]
    }
    grid <- grid[, ncol(grid):1]
    image(1 - grid, axes = F)
}
cAut <- cmpfun(cellAut)

Below is one of the more interesting rules, Rule 30. It demonstrates that given a very small set of 'simple' rules complexity can still arise. I have placed emphasis on simple here because all of this may seem complex but the rules are static whilst the outcome seems very dynamic. Even though the rules are static deterministic it is hard to say what will happen a few rows down without actually going through each step. You can also see very dramatic changes by changing the rules, a change to Rule 100 leads to a result which is very boring.

cAut(30, 101)

cAut(100, 101)

If you load the manipulate package you can explore these rules interactively by running the following code.

require(manipulate)
manipulate(cAut(Rule, Size), Rule = slider(1, 256), Size = slider(11, 201))

I want to try to use Shiny to have this same interface in the browser. After reading parts of the wonderful tutorial it seemed that it would be pretty straightforward to do this. The only things needed to do this was to create ui.R and server.R files. The above code for the cellEval (ccEval) and cellAut (cAut) functions need to be visible as well. This can be done by having these functions in the workspace, sourcing the file within the server.R code, or adding the function definiton into the server.R file. The code below assumes they are in the workspace. This is cleaner for the post but probably the least robust.

First lets see the contents of the ui.R file.

shinyUI(pageWithSidebar(
  # Application title
  headerPanel("Elementary Cellular Automata"),

  # Sidebar with a slider input for number of observations
  sidebarPanel(
    numericInput("size", "Size of Grid:", 51),
    numericInput("rule", "Rule:", 30)
  ),

  # Show a plot of the generated distribution
  mainPanel(
    plotOutput("CAPlot")
  )
))

Basically all that is happening here is that we are using the shinyUI function with three parts;

headerPanel
sidebarPanel
mainPanel

The headerPanel just gets a string that is displayed in the browser tab and the title on the page. The sidebarPanel gets two numericInputs. The arguments to these are the names of variables we will refer to in server.R, the text displayed to inform the user what they do and an initial value. In the mainPanel we specify that we will have a plotOutput and it comes from CAPlot.

The contents of the server.R file can be seen below.

shinyServer(function(input, output) {
    output$CAPlot <- renderPlot({
        cAut(input$rule, input$size)
    })
})

This is even easier and more concise than the ui.R file. We just use the shinyServer function and say the output is CAPlot, which is what we wanted in the mainPanel of ui.R, is set in the renderPlot function. The renderPlot function calls our cAut function with the two input values. The reactive nature is really cool as well, any time the user changes the rule or the size, the plot is automatically updated.

To make all of this work you just need to have the server.R file and the ui.R file in a directory called say test and run the following from a directory above the test directory.

require(shiny)

runApp("test")

You can see a screenshot of the result below. I wish I was able to include the result here but my I am yet unsure how to do this. I hope to try some other features out in this package in the future. I hope this presents how simple it is to make something with Shiny.

Saturday, April 13, 2013

Speeding up R Functions

Sometimes writing code in a matrix based language can be awkward. There can be situations where you have to write code in a vectorized manner for it to be efficient. This makes it look different and forces you to think differently. In this post I want to present an example of vectorizing code useful to data mining and how much better it performs. I also want to take a naive look into the somewhat newer compiler package in R.

library(rbenchmark)
library(rpart)
library(compiler)

# Useful for data imports.
options(stringsAsFactors = FALSE)

# Useful function for getting an average time.
timer <- function(fun, iters, ...) {
    system.time(for (n in seq(iters)) fun(...))[3]/iters
}

Model Development

The data used here is referred to as the adult data. It is a common data set used for bench-marking machine learning algorithms. Commas were added to make it a csv. It is a binary decision problem, you have lots of independent variables and want to train a model to predict the dependent variable which can have one of only two outcomes. R offers a variety of methods to train a model to make predictions of this type. I am going to use the rpart package in order to build a decision tree to solve this problem.

adult.train <- read.csv("../../data/adulttrain.csv", na.strings = "?")

# Change dichotomous dependent variable to binary.
adult.train$y <- as.character(adult.train$y)
adult.train$y <- ifelse(adult.train$y == "<=50K", 0, 1)

# Set random seed to make repeatable.
set.seed(2)

# Number of observations in training set.
size <- nrow(adult.train)
n <- ceiling(size * 0.75)
z <- sample(size)

# Partition to train and test set.
train <- adult.train[z[1:n], ]
test <- adult.train[z[(n + 1):size], ]

# Create decision tree model on training set.
rmod <- rpart(factor(y) ~ ., train, control = rpart.control(minsplit = 2, cp = .0005))
# Prune the tree to reduce overfit.
rmod <- prune(rmod, cp = 0.0011)

# Evaluate model on test set.
pred <- predict(rmod, newdata = test, type = "prob")[, 2]

# Simple check of accuracy.
value <- ifelse(pred > 0.5, 1, 0)
accuracy <- sum(test$y == value)/length(value)

The accuracy is a very simple method to determine the quality of the models predictions on data the model was not trained on. There are a lot of things that the accuracy cannot tell us though. Many real world problems weigh one outcome higher than another. For instance in medical diagnosis a false negative, saying something is not cancer when it is can have severe results. You may not care that the accuracy is high because you need to focus on picking positive cases correctly. If there are only five percent true you could get 95% by picking all false. When it comes to mortgages the opposite is true, you do not want a false positive, giving a mortgage that is unlikely to be paid back (at least in theory). To determine how well the model works given these influences we need to construct some new scoring mechanisms.

Methods under consideration

Version 1: Naive Looping Approach

Most methods of evaluating how good a supervised model work depend on four parameters;

Number of true positives
Number of true negatives
Number of false positives
Number of false negatives

The following function does this in a straightforward way. This is how a lot of programmers coming from an imperative background would approach this task. It is also very logical, when thinking through how to solve this problem you would probably end up with something similar.

params1 <- function(act, pred, cutoff = 0.5) {
    pred <- ifelse(pred > cutoff, 1, 0)
    tp <- 0
    tn <- 0
    fp <- 0
    fn <- 0
    for (i in 1:length(pred)) {
        if (act[i] == 1 && pred[i] == 1) {
            tp <- tp + 1
        } else if (act[i] == 0 && pred[i] == 0) {
            tn <- tn + 1
        } else if (act[i] == 1 && pred[i] == 0) {
            fn <- fn + 1
        } else if (act[i] == 0 && pred[i] == 1) {
            fp <- fp + 1
        }
    }
    return(list(tp = tp, fp = fp, fn = fn, tn = tn))
}

# Look at how well we did.
tmp <- params1(test$y, pred)
accuracy2 <- (tmp$tp + tmp$tn)/(tmp$tp + tmp$tn + tmp$fp + tmp$fn)

# Sanity check.
accuracy == accuracy2

## [1] TRUE

Create a function to calculate the ROC (receiver operating characteristic). This will allow us to determine how much better than random guessing our model is performing. There are a lot of things that can be improved in both of these functions and we will look at some of them in the later sections.

roc1 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params1(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

One thing that you can see here is that these loops become rather verbose. There are doing a lot of increment operations happening, set a variable equal to one plus itself. It would be cleaner if there was a ++ increment operation like most other imperative languages. Some of the reasons there is no increment (decrement) operator in R is a side effect of it being an interpreted language. For loops are slow in these languages so it is not advised to use loops approaches where increment operation usually reside. Next we can look at how slow they actually are.

r <- roc1(test$y, pred)

plot(r[, 1], r[, 2], ann = FALSE)
title(xlab = "False Positive Rate", ylab = "True Positive Rate")

# How long does this take?
system.time(roc1(train$y, pred, gran = 0.1))

##    user  system elapsed 
##    0.54    0.00    0.54

system.time(roc1(train$y, pred, gran = 0.01))

##    user  system elapsed 
##    5.11    0.00    5.13


# Kind of slow, but lets get a better estimate.
Time10a <- timer(roc1, 10, train$y, pred, gran = 0.1)
Time100a <- timer(roc1, 10, train$y, pred, gran = 0.01)
# I don't want to wait for this one to go through a loop.
Time1000a <- system.time(roc1(train$y, pred, gran = 0.001))[3]

Not a rigorous asymptotic analysis but we can see that there is almost a 10x jump from 0.557 to 5.1. A 100x gives 50.87, which still seems to be linear.

Can we make this faster? Neither function follows all the idioms for fast R code. The roc function does pre-allocate s1 and s2, but it falls short by having a for loop and using cbind instead of pre-allocation. This loop is small though, and the cbind only happens once. Calls to cbind (or rbind) can be expensive if it is done in a loop, and loops are expensive in general. I have a good guess as to what is making it slow. Maybe I will use some actual profiling to determine where the hot spots are. Let's revise the heart of the roc function though, params1. Basically all it does is call params1, so let's start there. Nothing is growing or being appended, but there is alone big for loop. It is being used to sum the type of each observation. Let's vectorize this.

Version 2: Vectorized Operations

tpos <- params1(test$y, pred, 0.5)$tp
# We have 3490 true positives at a cutoff of .5.  These are the actual
# predicted classes with that threshold.
binPred <- ifelse(pred > 0.5, 1, 0)
acc1 <- sum(test$y == binPred)/length(binPred)
# We did this calculation earlier which found all cases were the actual
# was equal to the predicted in a vectorized manner.  We can take a
# similar action on each case since they are binary.
tpos1 <- sum(test$y & binPred)
all(tpos1 == tpos)

## [1] TRUE

It worked, now we can apply binary negations to calculate all of the rest of the cases. These methods should be much faster. Lets create a vectorized version of the function.

params2 <- function(act, pred, cutoff = 0.5) {
    pred <- ifelse(pred > cutoff, 1, 0)
    tp <- sum(act & pred)
    tn <- sum(!act & !pred)
    fp <- sum(!act & pred)
    fn <- sum(act & !pred)
    return(list(tp = tp, fp = fp, fn = fn, tn = tn))
}

roc2 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params2(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

# How long does this take
system.time(roc2(test$y, pred, gran = 0.1))

##    user  system elapsed 
##    0.06    0.00    0.06

system.time(roc2(test$y, pred, gran = 0.01))

##    user  system elapsed 
##     0.5     0.0     0.5

# It already seems much faster
Time10b <- timer(roc2, 10, test$y, pred, gran = 0.1)
Time100b <- timer(roc2, 10, test$y, pred, gran = 0.01)

# It looks like we have a 10x gain in speed
Time1000b <- system.time(roc2(test$y, pred, gran = 0.001))[3]

This took 0.056, 0.519 and 5.18 for the ten, one hundred, and one thousand iteration version respectively.

Yep, now our for loop in roc could be hindering us. This is a fairly short loop and we will overlook it here. This code is odd though, especially if you are not used to R, or don't come from a Matlab/octave background. This can be very non-intuitive for anyone coming from something like C++ or java and are using R for modeling purposes. It can also be weird for those new to R but coming from a background like SAS (without using IML).

Version 3: Compiled Function

The compiler library which is newer to the R framework may help. It could allow you to write code that is not vectorized but still get speed improvements.

params3 <- cmpfun(params1)

roc3 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params3(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

You can try compiling the roc function as well but it provides very improvement. This is pretty interesting. We did not have to rewrite anything. We just compiled the function. This can help those not used to thinking in terms of vectorization, the cases where it is not easy or impossible to vectorize, and in some cases vectorized code takes a little longer to understand.

Is it any faster though? We need to see how long it takes before we can say it has any merit.

system.time(roc3(test$y, pred, gran = 0.1))

##    user  system elapsed 
##    0.17    0.00    0.18

system.time(roc3(test$y, pred, gran = 0.01))

##    user  system elapsed 
##     1.7     0.0     1.7


# It already seems much faster than the looping method.
Time10c <- timer(roc3, 10, test$y, pred, gran = 0.1)
Time100c <- timer(roc3, 10, test$y, pred, gran = 0.01)
Time1000c <- system.time(roc3(test$y, pred, gran = 0.001))[3]

This took 0.186, 1.706 and 16.88 for the ten, one hundred, and one thousand iterations respectively. It looks like we have almost 3x gain in speed, than the baseline method, but still slower than the vectorized version. I am sure that this will be improved over time as the functionality is fairly new. Java was thought to be slow 15 years ago but continual work on its byte code compiler has led to a lot of improvements, so much so that it is now considered to be fast. So these same improvements could start to appear in R in the future giving further speed improvements.

Version 4: Compiling the Vectorized Version

# Can we compile the vectorized for speed improvement?
params4 <- cmpfun(params2)

roc4 <- function(act, pred, gran = 0.1) {
    s1 <- rep(NA, (1/gran) + 1)
    s2 <- rep(NA, (1/gran) + 1)
    j <- 1
    for (i in seq(0, 1, gran)) {
        x <- params4(act, pred, cutoff = i)
        s1[j] <- (x$tp/(x$tp + x$fn))
        s2[j] <- (x$fp/(x$fp + x$tn))
        j <- j + 1
    }
    return(cbind(s2, s1))
}

# How long does this take
system.time(roc4(test$y, pred, gran = 0.1))

##    user  system elapsed 
##    0.05    0.00    0.05

system.time(roc4(test$y, pred, gran = 0.01))

##    user  system elapsed 
##    0.53    0.00    0.54


Time10d <- timer(roc4, 10, test$y, pred, gran = 0.1)
Time100d <- timer(roc4, 10, test$y, pred, gran = 0.01)
Time1000d <- system.time(roc4(test$y, pred, gran = 0.001))[3]

Results

Below is the final table. We can see that we don't get any additional improvements from compiling the vectorized version. More runs would give more accurate results. There could be other improvements to each case. I could compile/vectorize the roc functions respectively. There could more places for pre-allocation. There could also be better methods for vectorizing; it would be interesting if there was a way to vectorize the [tp, tn, fn, tp] all in one. There are also some additional features of the compiler that may add some improvements like the enableJIT() which may help. Before looking into these types of things though it may be useful to explore some of the profiling mechanisms in R. This way we can know which areas are the most in need of work. Effort can then be focused on making the biggest impacts.

lst <- c(Time10a, Time100a, Time1000a, Time10b, Time100b, Time1000b, Time10c, 
    Time100c, Time1000c, Time10d, Time100d, Time1000d)

times <- matrix(lst, nrow = 3, ncol = 4, byrow = FALSE, dimnames = list(c("10 Iters", 
    "100 Iters", "1000 Iters"), c("Baseline", "Vectorized", "Compiled", "Vect + Comp")))
times

##            Baseline Vectorized Compiled Vect + Comp
## 10 Iters      0.557      0.056    0.186       0.054
## 100 Iters     5.100      0.519    1.706       0.530
## 1000 Iters   50.870      5.180   16.880       5.240



benchmark(roc1(test$y, pred, gran = 0.1), roc2(test$y, pred, gran = 0.1), roc3(test$y, 
    pred, gran = 0.1), roc4(test$y, pred, gran = 0.1))

##                             test replications elapsed relative user.self
## 1 roc1(test$y, pred, gran = 0.1)          100   52.02    9.223     51.97
## 2 roc2(test$y, pred, gran = 0.1)          100    5.66    1.004      5.66
## 3 roc3(test$y, pred, gran = 0.1)          100   19.12    3.390     19.11
## 4 roc4(test$y, pred, gran = 0.1)          100    5.64    1.000      5.64
##   sys.self user.child sys.child
## 1        0         NA        NA
## 2        0         NA        NA
## 3        0         NA        NA
## 4        0         NA        NA

Conclusion

Now we have the machinery to look at all sorts of methods to score the quality of a model in an efficient manner. We can use the ROC curve to show us the trade offs of a model, its ability to focus more on getting all of the true positives or not having any false negatives. We can also aggregate the each of the four outcome in numerous ways to give scoring methods for various types of scenarios. A few of these methods are Precision and Recall and Sensitivity and Specificity.