Monday, May 19, 2014

Traversing the World of Networks

Getting setup

The areas of Graph Theory, Link Analysis and Social Network Analysis all hinge on similar underlying concepts. I find this area fascinating in respects to data science. One issue in this area is the large number of very diverse technologies. In many ways a more limited set of tools would free you to think about the problem at hand. If you have no constraints there are probably to many choices bogging you down.

If you need a graph database there are lots of options:

There are even different options for query languages such as Cypher and Gremlin and even the whole Tinkerpop stack to do all sorts of tasks.

For tools to do real analytic work there are different options. I think this consists of two main types.

Graph Processing engines

Graph Analytic libraries

As well as libraries in most general purpose languages

Once you do your analysis you need to visualize the data. There are tons of tools to visualize networks that offer point and click functionality:

If you need to make your own plots programmatically javascript has tons of tools:

What about file formats, there are tons here as well depending on what tool generates the data, can the data change over time, is it directed, can nodes have attributes, can edges? To many choices. Here is actually a site of the options you can use inside of Gephi. I am actually getting overwhelmed with the shear number of different types of tools, much less the specific tools in that domain.

I want to try to walk through a way that you can utilize some of the best methods and tools and have them all communicate with each other.

The first thing to address is file formats for graphs. This is really about storing graphs which graph databases would come to mind first but that is a bigger topic. One that allows lots of flexibility in the types of graphs you describe is the gexf format. This was created by Gephi. It may be a little more complicated that some of the simpler methods but it adds a lot of flexibility. As an example the amazing site has gexf files related to movies. I thought it would be very fitting to use The Social Network but it was not there so we have to settle for Pulp Fiction.

Once we have this file we can use the rgexf package to load the data into R.

require(rgexf)
require(httr)
options(stringsAsFactors = F)

# Get the gexf file from the site an place it into a file
gex <- as.character(GET('http://media.moviegalaxies.com/gexf/660.gexf'))
cat(gex, file = 'movie.gexf')

# Investigate file
class(pulp)
## [1] "gexf"
summary(pulp)
## GEXF graph object
## $N of nodes ## [1] 38 ## ##$N of edges
## [1] 102
##
## $Node Attrs ## NULL ## ##$Edge Attrs
## NULL
# plot(pulp)

I am not sure how to go about including the plot here. This plot is okay, it is using sigmajs, not as great as what Gephi could have done but it is in the browser over stuck in Gephi. Once we have this graph though what do we do with it. Usually we want to know something about the structure of the network. Who is the most central person in the network, does it have any interesting features, does it follow a power law distribution. The file type and the plot have no relevance here, technically I guess you could use the plot to get at these with the old pencil and paper method. We need to turn our network from its gexf form into something more capable for analysis.

Lets try igraph. The gexf package comes with a function to transform gexf to igraph. There is also the reverse to go from igraph to a gexf which can then be consumed by Gephi.

library(igraph)

# Transform to igraph class
ipulp <- gexf.to.igraph(pulp)

plot(ipulp)

The plot looks pretty bad but we were interested in analysis over aesthetics.

# Analysis at the individual level
scores <- data.frame(alpha = alpha.centrality(ipulp),
authority = authority.score(ipulp)[[1]],
closeness = closeness(ipulp))
head(scores)
##              alpha authority closeness
## BRETT            1    0.2136   0.01250
## BUDDY            1    0.0790   0.01075
## BUTCH            2    0.3828   0.01429
## CAPT KOONS       3    0.1124   0.01250
## ED SULLIVAN      1    0.0790   0.01075
## ENGLISH DAVE     3    0.1763   0.01053
# Analysis at the network level
cliques(ipulp, min = 6, max = 7)
## [[1]]
## [1]  1  3 14 17 18 27
##
## [[2]]
## [1]  1  3 14 17 18 32
##
## [[3]]
## [1]  1  3 14 17 27 32
##
## [[4]]
## [1]  1  3 14 18 27 32
##
## [[5]]
## [1]  1  3 17 18 27 32
##
## [[6]]
## [1]  1 14 17 18 27 32
##
## [[7]]
## [1]  3  4 20 21 32 35
##
## [[8]]
## [1]  3 14 17 18 27 32
##
## [[9]]
## [1] 11 14 16 22 25 32
##
## [[10]]
## [1]  1  3 14 17 18 27 32

Plotting with D3

Now that we did the math thing and know everything there is to know about this network we want to present our findings. The plot from igraph was pretty rough, the plot from gexf was good but only offered us the ability to zoom in and out and gave us some over capability. We need to give people the ability to interact with the presentation. What we really need here some d3. There is always a package to help us, here it comes in the form of d3network. There is a pretty good tutorial on the website.

We need a way to transform our igraph into the appropriate data structure for d3. The following function takes an igraph class and returns the data frame to send to d3Network plots.

igraph_2_d3 <- function(igr) {
Source <- c()
Target <- c()

for (i in seq(length(E(igr)))) {
e <- get.edge(igr, i)
x <- V(igr)

s <- if(is.null(x[e[1]]$name)) x[e[1]] else x[e[1]]$name
t <- if(is.null(x[e[2]]$name)) x[e[2]] else x[e[2]]$name
Source <- c(Source, s)
Target <- c(Target, t)
}

data.frame(Source, Target)
}

library(d3Network)

Now we can just use one of the network plots provided by the package and we get an interactive network that is quite fun to play with. For some reason it works in my browser but as soon as I put it in blogger it vanishes. I need to spend some time tracking this down where this goes wrong.

d3SimpleNetwork(igraph_2_d3(ipulp))

We can add more details by changing the the size and color of nodes and edges to have certain features from out analysis, for instance size the nodes by there closeness and color them to differentiate the 7-clique from the rest of the network.

If you are using RStudio you can use this function to make the plot appear in the viewer, which is very useful having everything in one IDE.

d3plot <- function(network, h = 300, w = 700) {
# Create temporary html file
htmlFile <- tempfile(fileext=".html")

if(is.igraph(network)) network <- igraph_2_d3(network)
# Write d3 network viz to html file
d3SimpleNetwork(network, height = h, width = w, file = htmlFile)

# (code to write some content to the file)
rstudio::viewer(htmlFile)
}

d3plot(barabasi.game(20))

Persisting with Neo4j

I think my next step would be to persist this data to a graph database. I am going to use Neo4j as the database. Once you download and unzip you just need to run the following at the command line.

bin/neo4j start

To persist the data we need to be able to talk to the database. I looked around and found a few older examples but none actually worked. I think you could use rcurl but I am just going to write my own using system commands as a first go.

library(bitops)
library(RCurl)
library(rjson)

# Put data into format for neo4j
neo_pulp <- igraph_2_d3(ipulp)
head(neo_pulp)
##   Source    Target
## 1  BRETT MARSELLUS
## 2  BRETT    MARVIN
## 3  BRETT     ROGER
## 4  BRETT   VINCENT
## 5  BUDDY       MIA
## 6  BUDDY   VINCENT
# List of all nodes
nodes <- unique(c(neo_pulp$Source, neo_pulp$Target))
head(nodes)
## [1] "BRETT"      "BUDDY"      "BUTCH"      "FABIENNE"   "FOURTH MAN"
## [6] "GAWKER #2"
# Create some common strings.
neoA <- 'Accept:application/json'
neoB <- 'http://localhost:7474/db/data/node/'
neoC <- 'Content-Type:application/json'

# Common helper functions.
qy <- function(x) fromJSON(paste(system(x, intern = T), collapse = ' '))
prep <- function(lst) paste("'", toJSON(lst), "'", sep = '')

# Function to write a node to a database.
create_node <- function(at) {
x <- qy(paste('curl -H ', neoA, ' -H ', neoC, ' -X POST -d ',
prep(at), ' ', neoB, sep = ''))$self as.integer(gsub(neoB, '', x)) } # Function to pull node from database. get_node <- function(id) { stopifnot(is.numeric(id)) qy(paste('curl -H ', neoA, ' ', neoB, id, sep = ''))$data
}

# Add node and check it exists.
lookup <- create_node(list(name = nodes[1]))
get_node(lookup)
## $name ## [1] "BRETT" # Need to create a lookup table, nodes are key: value nodes_lk <- c() # Now we just need to loop through all nodes and add them to the database. for (i in nodes) { nodes_lk <- c(nodes_lk, create_node(list(name = i))) } # Check the lookup table get_node(nodes_lk[10]) ##$name
## [1] "JULES"
# You can query the database with the following query.
# MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r;

# Function to create edges between two nodes.
create_edge <- function(to, from) {
from <- nodes_lk[which(nodes == from)]
to <- nodes_lk[which(nodes == to)]
x <- system(paste('curl -H ', neoA, ' -H ', neoC, ' -X POST -d ',
'\'{"to": "', neoB, to, '", "type": "KNOWS"}\' ', neoB, from,
'/relationships', sep = ''), intern = T)
}

# Add each edge to the database
for (i in 1:nrow(neo_pulp)) {
create_edge(neo_pulp$Target[i], neo_pulp$Source[i])
}

This is what the console looks like in the Neo4j dashboard.

Conclusion

I think I may have actually gone in a very strange order. I started with data from Gephi and did some work to get it into R. After doing some analysis with I plot the data in D3. Then put the data in a graph database. Usually you start with data in a database then do the analysis and then one of the either presentation graphics in Gephi or interactive graphics in D3. It does do what I had hoped in making a bunch of desperate functionality together.

Another technology that I have not worked with here are the large scale graph processing frameworks like Graphx. I have managed to get spark setup so that I can try Graphx but there seems to be a lot more to learn before interacting with it.

Saturday, May 10, 2014

Getting setup

I recentely listend to David Jensen give a brief overview of an anomaly detection method called CADE or Classifier Adjusted Density Estimation. The method seemed very easy to grasp at first which is not usualy the case for recently published machine learning work. I can never resist trying to implement these things myself on a toy example to learn what makes them tick and if they really deliver on what they promise.

I wanted to try to create an example that will find anomolous players in the NBA. I think in order to do this I may have to expand on some of the later steps of this methodolgy though. But first we need some data. Lets walk through the steps I am using to get this data.

Setup

I have a package on Github that we will utilize to get data. First we need to get this functionality installed. The code is actaully very similar to some code I have used in previous posts except I realized some of the sites have changed and we only need a subset of that functionality.

require(plyr)
require(XML)
require(devtools)
require(randomForest)

require(CadeAllStar)

Before we proceed I should explain a bit more about what I am attempting to do here. I am hoping to find the best players in the league, All Stars, in order to do that I am going to try to find those which are the largest outliers. In doing this I may find some players that are not all stars but different for some other reason which will also be interesting. The methodology being used looks at player-days, statistics from a single game for a specific player. It then finds the player-games that are anomolous to the rest of the player-game population. My intuition tells me that I can aggregate these games up to the player level and have and ordering of players. Those that are the furthest out should contain the group of all stars, possibly including other groups of players as well. I will attempt to explain the details of the actual outlier detection method as it is needed.

# You can pull all data for the 2013 season via the following code.
season <- seasonify(2013)
# Or use the 2013 season that comes with the R package.
data(season)

head(season[, 1:3])
##                       away                home       date
## 72          Boston Celtics          Miami Heat 2012-10-30
## 269       Dallas Mavericks  Los Angeles Lakers 2012-10-30
## 1303    Washington Wizards Cleveland Cavaliers 2012-10-30
## 294       Dallas Mavericks           Utah Jazz 2012-10-31
## 335         Denver Nuggets  Philadelphia 76ers 2012-10-31
## 406  Golden State Warriors        Phoenix Suns 2012-10-31

Now we have an entire season of NBA games. This is still only the games though. We need to capture each players’ stats for each of these games. We are only going to pull the basic statistics. This will also take a while so I would recomend using the data in the package.

# Pull statistics for a season of games
stats <- pull_stats(season)

Or load it from the package.

data(stats)

head(stats[, 1:3])
##          player FG FGA
## 1   Rajon Rondo  9  14
## 2   Paul Pierce  6  15
## 3 Kevin Garnett  4   8
## 4  Brandon Bass  6  11
## 5  Courtney Lee  5   6
## 7   Jason Terry  2   7

Implement the Algorithm

Now on to the technichal aspects. The heart of CADE relies on two things:

1. Creating uniform distributions over variables
2. Generate a classifier which can produce a probability

The followng function takes care of the uniform distribution by taking a varaible that may have any type of distribution and returns one with the same length and range but is uniformly distributed.

uni <- function(x, len = length(x)) {
if ( is.integer(x) ) {
sample(min(x):max(x), len, replace = TRUE)
} else if ( is.numeric(x) ) {
runif(len, min(x), max(x))
} else if ( is.factor(x) ) {
factor(sample(levels(x), len, replace = TRUE))
} else {
sample(unique(x), len, replace = TRUE)
}
}

Here is how this looks for a Poisson distribution.

test <- rpois(10000, 35)
par(mfrow = c(2, 1))

hist(test)
hist(uni(test))

Now the rest of CADE is just about creating predictions. The predictions are of cases that are outliers. To do this we make a naive assumption that none of our data is an outlier. Thus we will create a new target varaible y, and give it all values of 0. Then we need to take the fake uniform data that is tructurally the same and give it the same target variable y but only call of the outliers so they get a value of 1. Then we combine these data sets into one and run a machine learning algorithm on them which is able to return a probability instead of a predicted class outcome. Then we use this classifier to evaluate our original data. We call this predicted probability the probability of being an outlier.

cade <- function(df, numTree = 500) {
stopifnot(is.data.frame(df))

# This is the data we will make the 'no' case
real <- df

# Create similar but uniform data
fake <- as.data.frame(lapply(real, uni))

real$y <- 0 fake$y <- 1

# Combine real and fake data
data <- rbind(real, fake)

# Build classifier
tree <- randomForest(as.factor(y) ~ ., data = data, ntree = numTree)

# The classifier probabilities
df$prob <- predict(tree, newdata = df, type = 'prob')[, 2] df$prob <- df$prob / (1 - df$prob)

df
}

Application

Everything is now in place to try out our experiment. We need to pass the data into the CADE function, only the fields that are needed though. Since the result of the function is a new data frame with a field called prob, we can just take take that field and assign it to the data we pass into the function.

# Run cade on data, this takes a minute, for analysis only use relevent fields.
stats$prob <- cade(subset(stats, select = -c(player, date, guid)))$prob

# Order by most likely to be an outlier.
stats <- stats[order(stats$prob, decreasing = TRUE), ] # Do people appear in the top frequently. rev(sort(table(stats$player[1:30])))
##
##      Kevin Durant           James Harden       Kevin Love
##                 7                 4                 3
##      Russell Westbrook      LeBron James       Kobe Bryant
##                 2                 2                 2
##      Spencer Hawes          Paul Pierce        Paul George
##                 1                 1                 1
##      Nicolas Batum          Kyrie Irving       Josh Smith
##                 1                 1                 1
##      John Wall              Jeff Green         Dwight Howard
##                 1                 1                 1
##      Byron Mullens
##                 1



Even though it is somehat subjective to as to who is better than who in the NBA, I think there would be little down that these are not some of the best players of 2013. How would these results line up against the actual 2013 Allstar Game?

First we have to find a way to get rid of player-game values and just get values based on player. I thhought over just counts under some threshold but I could not think of any way to determine this threshold. It was all kind of arbitrary. Adding probabilities up is not something that commonly pops up. This may be okay here though, I know the sum is no longer any way a probability, some will rise above 1.

# Aggregate the per game score up to just the player
rank <- ddply(stats, .(player), summarise, score = sum(prob))

# Order largest sum
rank <- rank[order(rank$score, decreasing = TRUE), ] # Rank each player by there sum total. rank$rank <- 1:nrow(rank)
head(rank)
##              player score rank
## 266    Kevin Durant 7.668    1
## 191    James Harden 5.745    2
## 298    LeBron James 4.443    3
## 59  Carmelo Anthony 4.058    4
## 136   Dwight Howard 4.022    5
## 278     Kobe Bryant 3.938    6

This look great

Results

# Load all star game data
al2013 <- 'http://www.allstarnba.es/editions/2013.htm'

# Join and clean these fields.
al2013 <- setdiff(c(al2013[[1]]$ EAST, al2013[[2]]$ WEST), 'TOTALS')

# Move all text to lower to be safe(er) in joining data.
al2013 <- data.frame(player = tolower(al2013))
rank$player <- tolower(rank$player)

# Join data
al2013 <- merge(al2013, rank)

# Order data by rank
al2013 <- al2013[order(al2013$rank), ] # How far into the list do you need to go to capture the whole all star lineup al2013$depth <- al2013\$rank / nrow(rank)

al2013
##               player  score rank    depth
## 12      kevin durant 7.6682    1 0.002128
## 9       james harden 5.7449    2 0.004255
## 17      lebron james 4.4426    3 0.006383
## 3    carmelo anthony 4.0576    4 0.008511
## 7      dwight howard 4.0224    5 0.010638
## 14       kobe bryant 3.9376    6 0.012766
## 19       paul george 2.6216    7 0.014894
## 21 russell westbrook 2.3884    8 0.017021
## 10       joakim noah 1.4070   15 0.031915
## 22        tim duncan 1.3865   16 0.034043
## 8        dwyane wade 1.0168   21 0.044681
## 5         chris paul 0.9468   24 0.051064
## 2        brook lopez 0.8438   25 0.053191
## 16 lamarcus aldridge 0.7993   27 0.057447
## 15      kyrie irving 0.6615   34 0.072340
## 18         luol deng 0.6580   35 0.074468
## 25     zach randolph 0.6336   37 0.078723
## 11      jrue holiday 0.5908   40 0.085106
## 20       rajon rondo 0.5218   47 0.100000
## 1      blake griffin 0.4044   55 0.117021
## 4         chris bosh 0.3317   67 0.142553
## 24    tyson chandler 0.2165   89 0.189362
## 6          david lee 0.2083   92 0.195745
## 13     kevin garnett 0.1961   95 0.202128
## 23       tony parker 0.1893   97 0.206383

Conclusion

It seems to work pretty well. There are few ways I think it can be expanded. I only pulled the basic stats but there are is whole plethora of advanced stats available from the same source. There is also some newly available statistics that come from the SportVU cameras. These data points may give better lift over some of the more basic stats used here.

I also wanted to try it compared to a few outer methods to see how well it worked. I started down the path to do a comparison but but hit a roadbloack with the data preprocessing, most other methods depend on lost of up front normalization. One major advantage is that I had to do very little cleaning to get this method to work. Everything else depends on certain normal like distributions, this method is pretty robust leading to quicker results.

I also should look at some of the players between those in the Allstar game. Were they the outliers on the other side, the worst players in the league. Often figureing out what makes them an outlier is harder than finding out who are the actual outliers. It also takes a lot of domain knowledge.