Sunday, June 1, 2014

Using Writer Monads in R

Getting setup

I was somewhat mystified in finding a solution to the problem posed at the end of a previous blog. I was trying to find a way to attach meta data to pieces of data as they flowed through various analytic steps, a common task while constructing a dataset. I saw a great blog that seems to provide the perfect solution. I should have known that the world of functional programming has already solved this problem, only everyone outside of functional programming is probably scared of the answer becuase it contains the M word.

I have recreated that example here to see how easy it is to implement and understand the writer monad concept.

To restate the problem, it would be nice to know where the data you are working with came from, what steps were taken to create it and have that code to redo it yourself to validate concerns as to how it came to be.

Lets create a few functions to display the problem, basically trying to recreate the code from the source above.

Create a sine function.

sine <- function(x) sin(x)

Create a cube function.

cube <- function(x) x*x*x

We can call these functions pretty easily, but at the end we just have a value. That is fine for this simple case.

## [1] 0.9093
## [1] 8

In a normal data flow though we need to create a process for the data to flow through. A simple example would be to string the fucntions together.

Compose the functions into a new function.

sineCubed <- function(x) cube(sine(x))
## [1] 0.7518
# This works better though.
compose <- function(f, g) {
  function(x) f(g(x))

# Now we can apply any two functions we wish.
compose(sine, cube)(3)
## [1] 0.9564

Back to the probelm at hand, we wish to attach some metadata to pass around. Lets add this to the function call.

sine <- function(x) {
  list(sin(x), 'sine was called')
## [[1]]
## [1] 0.9093
## [[2]]
## [1] "sine was called"
cube <- function(x) {
  list(x*x*x, 'cube was called')
## [[1]]
## [1] 8
## [[2]]
## [1] "cube was called"

We now see along with the value we get some info on what happened. We can no longer compose anything though as this is no longer an acceptable input to our functions.

Error in x * x : non-numeric argument to binary operator
compose(sine, cube)(3)
Error in sin(x) : non-numeric argument to mathematical function

The problem here is that the second function being called gets passed a list from the first function which it does not accept. We need a way to wrap a function to get around this. This is exactly what a monadic container is for. The bind function helps with this. It takes a function and returns a new function that accepts a list of type T :: {number, string} and gives the function of interest the value while passing the string to the output.

bind <- function(f) {
  function(tuple) {
    x <- unlist(tuple[[1]])
    s <- tuple[2]
    fx <- f(x)
    y <- unlist(fx[1])
    t <- fx[2]
    list(y, paste(s, t, '.'))
## [[1]]
## [1] 0.9093
## [[2]]
## [1] "NA sine was called ."
bind(sine)(list(2, ''))
## [[1]]
## [1] 0.9093
## [[2]]
## [1] " sine was called ."
## [[1]]
## [1] 8
## [[2]]
## [1] "NA cube was called ."
bind(cube)(list(2, ''))
## [[1]]
## [1] 8
## [[2]]
## [1] " cube was called ."

We have to pass it a string or we get a missing value as the initial value of the string. We can now compose these types of functions like we did earlier except that we must bind each function first.

f <- compose(bind(sine), bind(cube))
## [[1]]
## [1] 0.9564
## [[2]]
## [1] "NA cube was called . sine was called ."
f(list(3, ''))
## [[1]]
## [1] 0.9564
## [[2]]
## [1] " cube was called . sine was called ."

We have to also give it the initialized string to start off with or we get an NA which will start to get annoying. Here is were unit comes in. The bind function extracted the value from the container, unit does the opposite, it puts a value in a container. At first this function seems almost to simple to be of any real value, but stick with me.

unit <- function(x) list(x, '')

## [[1]]
## [1] 3
## [[2]]
## [1] ""
## [[1]]
## [1] 0.9564
## [[2]]
## [1] " cube was called . sine was called ."
compose(f, unit)(3)
## [[1]]
## [1] 0.9564
## [[2]]
## [1] " cube was called . sine was called ."

We have just created to inverse operations, each counters what the other does. Paraphrasing Wikipedia Formally, a monad consists of a type constructor M and two operations, bind and unit. The unit operation takes a value from a plain type and puts it into a monadic container using the constructor, creating a monadic value. The bind operation performs the reverse process, extracting the original value from the container and passing it to the associated next function in the pipeline.

Now as we may want to add more functions to our pool we have to add the string construction to each function. We want to lift our functions up to be on par with our monadic container. R makes it very easy to fulfill the needs of lift by being very functional an allowing you to pass functions to functions as arguments and have functions return functions.

lift <- function(f) compose(unit, f)

roundDebug <- lift(round)

f <- compose(bind(roundDebug), bind(sine))

## [[1]]
## [1] 0
## [[2]]
## [1] " sine was called .  ."

This is the basic idea of the Writer Monad. It allows you to add context to by writing data to the computations and having machinery to extract and inject data and functions into this monadic container.

This can be super powerful if we tweak our functions just a bit.

arg <- function(x) {
  gsub('()', '', x, fixed = T)

Bind <- function(f) {
  fcall <- arg( = FALSE)[2])
  function(tuple) {
    x <- unlist(tuple[[1]])
    s <- tuple[2]
    fx <- f(x)
    y <- unlist(fx[1])
    t <- fcall
    list(y, paste(s, '>>=', t))

## [[1]]
## [1] 0.9093
## [[2]]
## [1] "NA >>= sin"

Now the Bind function tells us what it did. If we get a piece of data we have the start of an audit trail.

Unit <- function(x) {
  fcall <- arg( = FALSE)[2])
  list(x, fcall)

## [[1]]
## [1] 3
## [[2]]
## [1] "3"
## [[1]]
## [1] 0.1411
## [[2]]
## [1] "3 >>= sin"
f <- compose(Bind(sin), Bind(round))
## [[1]]
## [1] 0.1411
## [[2]]
## [1] "3 >>= round >>= sin"
f <- compose(Bind(tan), compose(Bind(sin), Bind(round)))
## [[1]]
## [1] -0.9444
## [[2]]
## [1] "4 >>= round >>= sin >>= tan"

Now we see the value that was pipe through this analysis.

Lift <- function(x) paste(x[[2]], '>>=', x[[1]])

## [1] "3 >>= sin >>= 0.141120008059867"
## [1] "4 >>= round >>= sin >>= tan >>= -0.94438388367123"

We now see the full pipeline with the result. What you may also notice if you are a functional programming connoisseur is that this is basically writing Haskell syntax for writer monads. I now think that the solution is the problem is solved but it will take more time to formulate how to implement it correctly.

Monday, May 19, 2014

Traversing the World of Networks

Getting setup

The areas of Graph Theory, Link Analysis and Social Network Analysis all hinge on similar underlying concepts. I find this area fascinating in respects to data science. One issue in this area is the large number of very diverse technologies. In many ways a more limited set of tools would free you to think about the problem at hand. If you have no constraints there are probably to many choices bogging you down.

If you need a graph database there are lots of options:

There are even different options for query languages such as Cypher and Gremlin and even the whole Tinkerpop stack to do all sorts of tasks.

For tools to do real analytic work there are different options. I think this consists of two main types.

Graph Processing engines

Graph Analytic libraries

As well as libraries in most general purpose languages

Once you do your analysis you need to visualize the data. There are tons of tools to visualize networks that offer point and click functionality:

If you need to make your own plots programmatically javascript has tons of tools:

What about file formats, there are tons here as well depending on what tool generates the data, can the data change over time, is it directed, can nodes have attributes, can edges? To many choices. Here is actually a site of the options you can use inside of Gephi. I am actually getting overwhelmed with the shear number of different types of tools, much less the specific tools in that domain.

I want to try to walk through a way that you can utilize some of the best methods and tools and have them all communicate with each other.

Getting and Loading the Data

The first thing to address is file formats for graphs. This is really about storing graphs which graph databases would come to mind first but that is a bigger topic. One that allows lots of flexibility in the types of graphs you describe is the gexf format. This was created by Gephi. It may be a little more complicated that some of the simpler methods but it adds a lot of flexibility. As an example the amazing site has gexf files related to movies. I thought it would be very fitting to use The Social Network but it was not there so we have to settle for Pulp Fiction.

Once we have this file we can use the rgexf package to load the data into R.

options(stringsAsFactors = F)

# Get the gexf file from the site an place it into a file
gex <- as.character(GET(''))
cat(gex, file = 'movie.gexf')

# Read it in with the gexf reader
pulp <- read.gexf('movie.gexf')

# Investigate file
## [1] "gexf"
## GEXF graph object
## $`N of nodes`
## [1] 38
## $`N of edges`
## [1] 102
## $`Node Attrs`
## $`Edge Attrs`
# plot(pulp)

I am not sure how to go about including the plot here. This plot is okay, it is using sigmajs, not as great as what Gephi could have done but it is in the browser over stuck in Gephi. Once we have this graph though what do we do with it. Usually we want to know something about the structure of the network. Who is the most central person in the network, does it have any interesting features, does it follow a power law distribution. The file type and the plot have no relevance here, technically I guess you could use the plot to get at these with the old pencil and paper method. We need to turn our network from its gexf form into something more capable for analysis.

Lets try igraph. The gexf package comes with a function to transform gexf to igraph. There is also the reverse to go from igraph to a gexf which can then be consumed by Gephi.


# Transform to igraph class
ipulp <-


The plot looks pretty bad but we were interested in analysis over aesthetics.

# Analysis at the individual level
scores <- data.frame(alpha = alpha.centrality(ipulp),
                     authority = authority.score(ipulp)[[1]],
                     closeness = closeness(ipulp))
##              alpha authority closeness
## BRETT            1    0.2136   0.01250
## BUDDY            1    0.0790   0.01075
## BUTCH            2    0.3828   0.01429
## CAPT KOONS       3    0.1124   0.01250
## ED SULLIVAN      1    0.0790   0.01075
## ENGLISH DAVE     3    0.1763   0.01053
# Analysis at the network level
cliques(ipulp, min = 6, max = 7)
## [[1]]
## [1]  1  3 14 17 18 27
## [[2]]
## [1]  1  3 14 17 18 32
## [[3]]
## [1]  1  3 14 17 27 32
## [[4]]
## [1]  1  3 14 18 27 32
## [[5]]
## [1]  1  3 17 18 27 32
## [[6]]
## [1]  1 14 17 18 27 32
## [[7]]
## [1]  3  4 20 21 32 35
## [[8]]
## [1]  3 14 17 18 27 32
## [[9]]
## [1] 11 14 16 22 25 32
## [[10]]
## [1]  1  3 14 17 18 27 32

Plotting with D3

Now that we did the math thing and know everything there is to know about this network we want to present our findings. The plot from igraph was pretty rough, the plot from gexf was good but only offered us the ability to zoom in and out and gave us some over capability. We need to give people the ability to interact with the presentation. What we really need here some d3. There is always a package to help us, here it comes in the form of d3network. There is a pretty good tutorial on the website.

We need a way to transform our igraph into the appropriate data structure for d3. The following function takes an igraph class and returns the data frame to send to d3Network plots.

igraph_2_d3 <- function(igr) {
  Source <- c()
  Target <- c()
  for (i in seq(length(E(igr)))) {
    e <- get.edge(igr, i)
    x <- V(igr)
    s <- if(is.null(x[e[1]]$name)) x[e[1]] else x[e[1]]$name
    t <- if(is.null(x[e[2]]$name)) x[e[2]] else x[e[2]]$name
    Source <- c(Source, s)
    Target <- c(Target, t)
  data.frame(Source, Target)


Now we can just use one of the network plots provided by the package and we get an interactive network that is quite fun to play with. For some reason it works in my browser but as soon as I put it in blogger it vanishes. I need to spend some time tracking this down where this goes wrong.


We can add more details by changing the the size and color of nodes and edges to have certain features from out analysis, for instance size the nodes by there closeness and color them to differentiate the 7-clique from the rest of the network.

If you are using RStudio you can use this function to make the plot appear in the viewer, which is very useful having everything in one IDE.

d3plot <- function(network, h = 300, w = 700) {
  # Create temporary html file
  htmlFile <- tempfile(fileext=".html")
  if(is.igraph(network)) network <- igraph_2_d3(network)
  # Write d3 network viz to html file
  d3SimpleNetwork(network, height = h, width = w, file = htmlFile)
  # (code to write some content to the file)


Persisting with Neo4j

I think my next step would be to persist this data to a graph database. I am going to use Neo4j as the database. Once you download and unzip you just need to run the following at the command line.

bin/neo4j start

To persist the data we need to be able to talk to the database. I looked around and found a few older examples but none actually worked. I think you could use rcurl but I am just going to write my own using system commands as a first go.


# Put data into format for neo4j
neo_pulp <- igraph_2_d3(ipulp)
##   Source    Target
## 3  BRETT     ROGER
## 5  BUDDY       MIA
# List of all nodes
nodes <- unique(c(neo_pulp$Source, neo_pulp$Target))
## [1] "BRETT"      "BUDDY"      "BUTCH"      "FABIENNE"   "FOURTH MAN"
## [6] "GAWKER #2"
# Create some common strings.
neoA <- 'Accept:application/json'
neoB <- 'http://localhost:7474/db/data/node/'
neoC <- 'Content-Type:application/json'

# Common helper functions.
qy <- function(x) fromJSON(paste(system(x, intern = T), collapse = ' '))
prep <- function(lst) paste("'", toJSON(lst), "'", sep = '')

# Function to write a node to a database.
create_node <- function(at) {
  x <- qy(paste('curl -H ', neoA, ' -H ', neoC, ' -X POST -d ', 
                prep(at), ' ', neoB, sep = ''))$self
  as.integer(gsub(neoB, '', x))

# Function to pull node from database.
get_node <- function(id) {
  qy(paste('curl -H ', neoA, ' ', neoB, id, sep = ''))$data

# Add node and check it exists.
lookup <- create_node(list(name = nodes[1]))
## $name
## [1] "BRETT"
# Need to create a lookup table, nodes are key: value
nodes_lk <- c()

# Now we just need to loop through all nodes and add them to the database.
for (i in nodes) {
  nodes_lk <- c(nodes_lk, create_node(list(name = i)))

# Check the lookup table
## $name
## [1] "JULES"
# You can query the database with the following query.
# MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r;

# Function to create edges between two nodes.
create_edge <- function(to, from) {
  from <- nodes_lk[which(nodes == from)]
  to <- nodes_lk[which(nodes == to)]
  x <- system(paste('curl -H ', neoA, ' -H ', neoC, ' -X POST -d ', 
    '\'{"to": "', neoB, to, '", "type": "KNOWS"}\' ', neoB, from, 
    '/relationships', sep = ''), intern = T)

# Add each edge to the database
for (i in 1:nrow(neo_pulp)) {
  create_edge(neo_pulp$Target[i], neo_pulp$Source[i])

This is what the console looks like in the Neo4j dashboard.


I think I may have actually gone in a very strange order. I started with data from Gephi and did some work to get it into R. After doing some analysis with I plot the data in D3. Then put the data in a graph database. Usually you start with data in a database then do the analysis and then one of the either presentation graphics in Gephi or interactive graphics in D3. It does do what I had hoped in making a bunch of desperate functionality together.

Another technology that I have not worked with here are the large scale graph processing frameworks like Graphx. I have managed to get spark setup so that I can try Graphx but there seems to be a lot more to learn before interacting with it.

Saturday, May 10, 2014

Detecting All Stars Using CADE

Getting setup

I recentely listend to David Jensen give a brief overview of an anomaly detection method called CADE or Classifier Adjusted Density Estimation. The method seemed very easy to grasp at first which is not usualy the case for recently published machine learning work. I can never resist trying to implement these things myself on a toy example to learn what makes them tick and if they really deliver on what they promise.

I wanted to try to create an example that will find anomolous players in the NBA. I think in order to do this I may have to expand on some of the later steps of this methodolgy though. But first we need some data. Lets walk through the steps I am using to get this data.


I have a package on Github that we will utilize to get data. First we need to get this functionality installed. The code is actaully very similar to some code I have used in previous posts except I realized some of the sites have changed and we only need a subset of that functionality.




Before we proceed I should explain a bit more about what I am attempting to do here. I am hoping to find the best players in the league, All Stars, in order to do that I am going to try to find those which are the largest outliers. In doing this I may find some players that are not all stars but different for some other reason which will also be interesting. The methodology being used looks at player-days, statistics from a single game for a specific player. It then finds the player-games that are anomolous to the rest of the player-game population. My intuition tells me that I can aggregate these games up to the player level and have and ordering of players. Those that are the furthest out should contain the group of all stars, possibly including other groups of players as well. I will attempt to explain the details of the actual outlier detection method as it is needed.

# You can pull all data for the 2013 season via the following code.
season <- seasonify(2013)
# Or use the 2013 season that comes with the R package.

head(season[, 1:3])
##                       away                home       date
## 72          Boston Celtics          Miami Heat 2012-10-30
## 269       Dallas Mavericks  Los Angeles Lakers 2012-10-30
## 1303    Washington Wizards Cleveland Cavaliers 2012-10-30
## 294       Dallas Mavericks           Utah Jazz 2012-10-31
## 335         Denver Nuggets  Philadelphia 76ers 2012-10-31
## 406  Golden State Warriors        Phoenix Suns 2012-10-31

Now we have an entire season of NBA games. This is still only the games though. We need to capture each players’ stats for each of these games. We are only going to pull the basic statistics. This will also take a while so I would recomend using the data in the package.

# Pull statistics for a season of games
stats <- pull_stats(season)

Or load it from the package.


head(stats[, 1:3])
##          player FG FGA
## 1   Rajon Rondo  9  14
## 2   Paul Pierce  6  15
## 3 Kevin Garnett  4   8
## 4  Brandon Bass  6  11
## 5  Courtney Lee  5   6
## 7   Jason Terry  2   7

Implement the Algorithm

Now on to the technichal aspects. The heart of CADE relies on two things:

  1. Creating uniform distributions over variables
  2. Generate a classifier which can produce a probability

The followng function takes care of the uniform distribution by taking a varaible that may have any type of distribution and returns one with the same length and range but is uniformly distributed.

uni <- function(x, len = length(x)) {
  if ( is.integer(x) ) {
    sample(min(x):max(x), len, replace = TRUE)
  } else if ( is.numeric(x) ) {
    runif(len, min(x), max(x))
  } else if ( is.factor(x) ) {
    factor(sample(levels(x), len, replace = TRUE))
  } else {
    sample(unique(x), len, replace = TRUE)

Here is how this looks for a Poisson distribution.

test <- rpois(10000, 35)
par(mfrow = c(2, 1))


Now the rest of CADE is just about creating predictions. The predictions are of cases that are outliers. To do this we make a naive assumption that none of our data is an outlier. Thus we will create a new target varaible y, and give it all values of 0. Then we need to take the fake uniform data that is tructurally the same and give it the same target variable y but only call of the outliers so they get a value of 1. Then we combine these data sets into one and run a machine learning algorithm on them which is able to return a probability instead of a predicted class outcome. Then we use this classifier to evaluate our original data. We call this predicted probability the probability of being an outlier.

cade <- function(df, numTree = 500) {

  # This is the data we will make the 'no' case
  real <- df
  # Create similar but uniform data
  fake <-, uni))
  real$y <- 0
  fake$y <- 1
  # Combine real and fake data
  data <- rbind(real, fake)
  # Build classifier
  tree <- randomForest(as.factor(y) ~ ., data = data, ntree = numTree)
  # The classifier probabilities
  df$prob <- predict(tree, newdata = df, type = 'prob')[, 2]
  df$prob <- df$prob / (1 - df$prob)


Everything is now in place to try out our experiment. We need to pass the data into the CADE function, only the fields that are needed though. Since the result of the function is a new data frame with a field called prob, we can just take take that field and assign it to the data we pass into the function.

# Run cade on data, this takes a minute, for analysis only use relevent fields.
stats$prob <- cade(subset(stats, select = -c(player, date, guid)))$prob

# Order by most likely to be an outlier.
stats <- stats[order(stats$prob, decreasing = TRUE), ]

# Do people appear in the top frequently.
##      Kevin Durant           James Harden       Kevin Love  
##                 7                 4                 3                 
##      Russell Westbrook      LeBron James       Kobe Bryant
##                 2                 2                 2 
##      Spencer Hawes          Paul Pierce        Paul George      
##                 1                 1                 1   
##      Nicolas Batum          Kyrie Irving       Josh Smith
##                 1                 1                 1 
##      John Wall              Jeff Green         Dwight Howard      
##                 1                 1                 1         
##      Byron Mullens        
##                 1

Even though it is somehat subjective to as to who is better than who in the NBA, I think there would be little down that these are not some of the best players of 2013. How would these results line up against the actual 2013 Allstar Game?

First we have to find a way to get rid of player-game values and just get values based on player. I thhought over just counts under some threshold but I could not think of any way to determine this threshold. It was all kind of arbitrary. Adding probabilities up is not something that commonly pops up. This may be okay here though, I know the sum is no longer any way a probability, some will rise above 1.

# Aggregate the per game score up to just the player
rank <- ddply(stats, .(player), summarise, score = sum(prob))

# Order largest sum
rank <- rank[order(rank$score, decreasing = TRUE), ]

# Rank each player by there sum total.
rank$rank <- 1:nrow(rank)
##              player score rank
## 266    Kevin Durant 7.668    1
## 191    James Harden 5.745    2
## 298    LeBron James 4.443    3
## 59  Carmelo Anthony 4.058    4
## 136   Dwight Howard 4.022    5
## 278     Kobe Bryant 3.938    6

This look great


# Load all star game data
al2013 <- ''
al2013 <- readHTMLTable(al2013)

# Join and clean these fields.
al2013 <- setdiff(c(al2013[[1]]$` EAST`, al2013[[2]]$` WEST`), 'TOTALS')

# Move all text to lower to be safe(er) in joining data.
al2013 <- data.frame(player = tolower(al2013))
rank$player <- tolower(rank$player)

# Join data
al2013 <- merge(al2013, rank)

# Order data by rank
al2013 <- al2013[order(al2013$rank), ]

# How far into the list do you need to go to capture the whole all star lineup
al2013$depth <- al2013$rank / nrow(rank)

##               player  score rank    depth
## 12      kevin durant 7.6682    1 0.002128
## 9       james harden 5.7449    2 0.004255
## 17      lebron james 4.4426    3 0.006383
## 3    carmelo anthony 4.0576    4 0.008511
## 7      dwight howard 4.0224    5 0.010638
## 14       kobe bryant 3.9376    6 0.012766
## 19       paul george 2.6216    7 0.014894
## 21 russell westbrook 2.3884    8 0.017021
## 10       joakim noah 1.4070   15 0.031915
## 22        tim duncan 1.3865   16 0.034043
## 8        dwyane wade 1.0168   21 0.044681
## 5         chris paul 0.9468   24 0.051064
## 2        brook lopez 0.8438   25 0.053191
## 16 lamarcus aldridge 0.7993   27 0.057447
## 15      kyrie irving 0.6615   34 0.072340
## 18         luol deng 0.6580   35 0.074468
## 25     zach randolph 0.6336   37 0.078723
## 11      jrue holiday 0.5908   40 0.085106
## 20       rajon rondo 0.5218   47 0.100000
## 1      blake griffin 0.4044   55 0.117021
## 4         chris bosh 0.3317   67 0.142553
## 24    tyson chandler 0.2165   89 0.189362
## 6          david lee 0.2083   92 0.195745
## 13     kevin garnett 0.1961   95 0.202128
## 23       tony parker 0.1893   97 0.206383


It seems to work pretty well. There are few ways I think it can be expanded. I only pulled the basic stats but there are is whole plethora of advanced stats available from the same source. There is also some newly available statistics that come from the SportVU cameras. These data points may give better lift over some of the more basic stats used here.

I also wanted to try it compared to a few outer methods to see how well it worked. I started down the path to do a comparison but but hit a roadbloack with the data preprocessing, most other methods depend on lost of up front normalization. One major advantage is that I had to do very little cleaning to get this method to work. Everything else depends on certain normal like distributions, this method is pretty robust leading to quicker results.

I also should look at some of the players between those in the Allstar game. Were they the outliers on the other side, the worst players in the league. Often figureing out what makes them an outlier is harder than finding out who are the actual outliers. It also takes a lot of domain knowledge.

Sunday, April 20, 2014

I Second the Notion

Getting setup

I recently read a blog with some support that got me thinking. I thought it was an interesting idea that seems valid, but there are other problems with data transmission in general that this method could help to solve. The three issues that R solved were all valid:

  • The storage mechanism is efficient
  • The platform is open source
  • Various functions and data can be bundled together

The potential really lies in the last item. Aside from a standalone data file you can add comments and metadata to this file to explain to the recipient what they are actually getting. You can also include functions that operate on the data or were used preprocessing and cleaning of the data.

I created a function, which is still in its infancy, to capture some of this idea. I was hoping to get other insights on how this could be useful. The notes part was added since it is hard to capture input calling it from knitr, not really a part of the function. Some functionality would also be useful to allow it to only save out a subset of the workspace.


write.manifest <- function(hist = F, file = stop("'file' must be specified")) {
  loc <- setdiff(ls(envir = .GlobalEnv), c('read.manifest', 'write.manifest'))
  evl <- function(x) eval(parse(text = x))
  print('Comments: ')
  notes <- scan(what = character(), quiet = T)
  if (length(notes) == 0) notes <- 'These are the notes you would have typed to explain the data.'

  manifest <- list(summary = ldply(loc, function(a) c(name = a, 
                                                      class = class(evl(a)), 
                                                      type = typeof(evl(a)), 
                                                      mode = mode(evl(a)), 
                                                      size = object.size(evl(a)))),
                   session = sessionInfo(),
                   time = Sys.time(),
                   createdby = Sys.getenv()[['USER']],
                   notes = paste(notes, collapse = ' '))
  manifest$history <- if (hist) readLines('.RHistory') else NULL
  loc <- c(loc, 'manifest')
  save(list = loc, file = file)

Now to show how this would work lets create some simple data.

x <- 1
y <- 2
z <- 3

myData <- cars

myData$date <- Sys.Date() + sample(1:10, nrow(myData), replace = TRUE)

myData2 <- myData[myData$date > Sys.Date()+1, ]

myData2$rand <- sample(c(x, y, z), nrow(myData2), replace = TRUE)

We now have three variables in our workspace. If I were to send this data out to someone it would not make much sense. Instead of using the save method lets create the output using the manifest. Ignore the comments, it makes more sense when you are running interactively.

write.manifest(file = 'x.RData')
## [1] "Comments: "

Now let’s act like we are the recipients of this data.

# Clear our workspace.
rm(list = ls())
## character(0)
read.manifest <- function(file) {
  load(file, envir = .GlobalEnv)
  disp <- paste('The file ' , file, ' was created by ', manifest$createdby, 
                ' on ', manifest$time, '.', sep = '')

## The file x.RData was created by kdarrell on 2014-04-20 10:12:04.
## These are the notes you would have typed to explain the data.
## $summary
##       name      class      type      mode size
## 1 metadata       list      list      list  800
## 2   myData data.frame      list      list 2304
## 3  myData2 data.frame      list      list 2808
## 4        x  character character character   96
## 5        y    numeric    double   numeric   48
## 6        z    numeric    double   numeric   48
## [1] "manifest"      "metadata"      "myData"        "myData2"      
## [5] "read.manifest" "x"             "y"             "z"
##       name      class      type      mode size
## 1 metadata       list      list      list  800
## 2   myData data.frame      list      list 2304
## 3  myData2 data.frame      list      list 2808
## 4        x  character character character   96
## 5        y    numeric    double   numeric   48
## 6        z    numeric    double   numeric   48

You can also see the session that was used to build the data.

## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] plyr_1.8.1
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.3   formatR_0.10     knitr_1.5        Rcpp_0.11.1     
## [5] rmarkdown_0.1.77 stringr_0.6.2    tools_3.1.0      yaml_2.1.11

There were other things I thought that would be useful to add to this train of thought as well, it would be nice to check for comments on every R object being written to this file. If there are no comments it should prompt you to explain what the object is. If you have never used comments before they are pretty easy to grasp.

value <- 1
## [1] 1
comment(value) <- 'This value is not useful but the example may be.'

## [1] 1
## [1] "This value is not useful but the example may be."

Changing Directions

At the onset this was what I thought would be very useful. As with any learning experience or solving any problem you usually end up facing a greater learning experience or a larger problem yet to solve but slightly more capable. When I got to this point I thought it would be useful to see more about the data than just some comments, size and type. My first idea was to add a list of each variable that has some informative summary to display, the comment above would be the first go. I think I know how to go about this and it should not bring much difficulty. After thinking more about it though, I thought it would be really useful to capture how the data was created, this way you could recreate the data yourself, or at least inspect how it was done, maybe even modify it at some intermediate point.

One interesting feature of R is that it captures the history of what you do inside its interpreter. The one downside I still need to figure out is how to get it to write to the .Rhistory file real time instead of storing it and writing once the session closes. (At least I think this is what it is doing, I still need to do some exploration)

Once the .Rhistory file has been created we can read it in.

read.hist <- function() {
  hist <- readLines('.Rhistory')

  # Comments id
  comments_index <- gregexpr('^#', hist, perl = T) == 1
  comments <- hist[comments_index]
  hist <- hist[!comments_index]

  # Does not account for for loops

  # Break the code up into assignments
  assignments <- grep('<-', hist, value = TRUE)
  assignments <- grep("'<-'", assignments, value = TRUE, invert = TRUE)
  assignments <- grep('"<-"', assignments, value = TRUE, invert = TRUE)
  assignments <- strsplit(assignments, '<-')

  history <- list(code = hist, comments = comments, 
                  name = lapply(assignments, `[`, 1),
                  value = lapply(assignments, `[`, 2))
  # Need better cleaning of leading and trailing spaces
  last <- function(str) substr(str, nchar(str), nchar(str))
  first <- function(str) substr(str, 1, 1)
  history$name <- ifelse(last(history$name) == ' ', 
                         substr(history$name, 1, nchar(history$name) - 1), 
  history$value <- ifelse(first(history$value) == ' ', 
                          substr(history$value, 2, nchar(history$value)), 
  history$name <- unlist(history$name)
  history$value <- unlist(history$value)

history <- read.hist()

You can now use this history to create the history of any data element. I had the sense that you could use attach the code for this purpose. However most of the time I realize that parts of the code may have been added via the command line as a quick fix. This may be a problem with some of the idioms around R, check out this blog for more info. A script is kind of like a claim that this was the process, it can, but should’t, be fudged. The history cannot be fudged.

To get the history of a variable (value, list, data frame, etc) we need to construct more machinery to process the history file.

Can we recreate how we arrived at the small analysis above? The final data set under consideration was data2. How did it come to be?

origin <- function(var, all = F) {
  name <- lapply(strsplit(history$name, split = '$', fixed = T), `[`, 1)
  x <- which(var == name)
  if( length(x) == 0) return(c())
  if (all) {
    unique(paste(history$name[x], '<-', history$value[x]))
  } else { 
    head(paste(history$name[x], '<-', history$value[x]), 1)

## [1] "myData2 <- myData[myData$date > Sys.Date()+1, ]"

We can see its origin or also trace it’s history.

origin('myData2', T)
## [1] "myData2 <- myData[myData$date > Sys.Date()+1, ]"                  
## [2] "myData2$rand <- sample(c(x, y, z), nrow(myData2), replace = TRUE)"

This piece of data stems from another piece of data though. We can see its origins by running the same code on this set but it would be useful if there was a recursive way to do this.

## [1] "myData <- cars"
origin('myData', T)
## [1] "myData <- cars"                                                        
## [2] "myData$date <- Sys.Date() + sample(1:10, nrow(myData), replace = TRUE)"
condense <- function(str) {
  if(is.null(origin(str))) return(str)
  # Just take the first and call recursively.
  elem <- gregexpr('$', str, fixed = T)[[1]][1]
  if (elem > 0) {
    chars <- strsplit(str, NULL)[[1]]
    ind <- chars %in% c(letters, LETTERS, as.character(0:9))
    # The plus one is to offset the dollar sign
    ind[1:elem] <- TRUE
    str <- paste(chars[c(1:(elem-1), min(which(!ind)):nchar(str))], collapse = '')

# Adds infix functions predicate, eg(<-, <<-, +)
is.func <- function(str) {
  if (exists(str)) {
    eval(parse(text = paste('is.function(`', str, '`)', sep = '')))
  } else {

#rm(myData, myData2)

depends <- function(str) {
  x <- condense(origin(str))
  if(!is.null(x)) {
    x <- all.names(parse(text = x), unique = TRUE)
    x <- setdiff(x, str)
    x <- x[!unlist(lapply(x, is.func))]
    if (length(x) > 0) unique(c(x, depends(x))) else x
  } else {

Now lets try this out.

## [1] "myData" "cars"

There are some major problems starting to appear. Having to close R to persist the session’s history to the .Rhistory file is a real burden, it means you can’t work with an objects history interactively. You are forced to close everything up before writing utilizing any of this, while it would be very useful for interactiveness, tracing back through your current analysis. There is also a lot of intelligent code needed to read through all of the R code and determine what is really part of the object’s history, did you run something twice because you wanted to walk back through it all or does it recurse on itself, things like loops also become a big deal here. R is a very complicated language (or it gives you the freedom to do some very strange things that are hard to deal with in this manner).

I realized after writing a few more functions that I was basically creating a layer to sit on top and analysis all of my code, re-parsing the whole language is not the path I want to go down. So I think parts of this may be useful but the overall design needs some thought. I would enjoy hearing if anyone else has ever done this or has any ideas of how implement this or requirements that may be useful. I do think that the early portion of adding metadata to your output files is pretty sound though.