UPDATE: Figured out how to do this using the corrplot package, which adds a lot more options. Check it out for GSW:
This post owes its genesis to Alex Konkel, who blogs at Sports Skeptic. He asked if I could calculate something called the variance inflation factor (VIF) for the adjusted +/- regressions I’ve been doing. This would apparently enable us to examine the collinearity between variables (i.e. players). We’re actually trying to work out some kinks in that analysis, but in the meantime, it gave me an idea. Why not just calculate the correlation matrix for all players?
For every stint of possessions, we know what players are on the floor (coded as dummy variables). It stands to reason that players who are on the floor more often with certain players will “correlate” more with those players. In R, this is an extremely easy calculation to do using the cor() function. And once you have the correlation matrix, the question is how to visualize the data. I thought about doing individual plots for each player, but that would quickly get crazy. Then I thought about creating a network graph which would have different colored edges representing the correlations, but with over 400 players, that would get messy. Another way to do it is to simply plot the entire matrix as an image, so that each entry in the matrix becomes a pixel whose color is proportional to the correlation value.
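To make the calculation concrete, here is a minimal sketch in R. The stint data is a made-up toy example (four hypothetical players labeled A to D over six stints), not the real play-by-play data, and the white-to-red palette is just one reasonable choice for the image.

```r
# Hypothetical toy data: 6 stints (rows) by 4 players (columns).
# A 1 means the player was on the floor during that stint.
stints <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 1, 1,
    0, 0, 1, 1,
    0, 1, 0, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C", "D"))
)

# Pairwise Pearson correlations between the player columns.
player_cor <- cor(stints)

# Plot the whole matrix as an image: one tile per pair of players,
# with color running from white (lowest value) to red (highest value).
n <- ncol(player_cor)
image(1:n, 1:n, player_cor, axes = FALSE, xlab = "", ylab = "",
      col = colorRampPalette(c("white", "red"))(20))
axis(1, at = 1:n, labels = colnames(player_cor))
axis(2, at = 1:n, labels = rownames(player_cor))
```

In this toy example, players A and D are never on the floor together and one of them is always playing, so their correlation comes out at the minimum of -1.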
I found an even more sophisticated way of doing this using the heatmap() function. Heatmaps are frequently used in biology to show microarray data (i.e. gene expression). I already knew about heatmaps from years of having to sit through presentation after presentation of microarray data in grad school. What I didn’t know as much about were dendrograms and hierarchical clustering. While I don’t understand all the nitty-gritty details, the basic idea appears to be that you can take the correlation matrix and identify highly related clusters of genes or whatever. In this post, the “whatever” is clusters of players who are on the floor at the same time. The neat thing about the R heatmap function is that it automatically rearranges the rows and columns, so that the clusters become more easily identifiable. In these plots, you will be able to see starting units and bench units quite clearly. You’ll also be able to identify who plays with whom (or who never plays with whom). Each team has a sort of unique fingerprint.
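Here is a rough sketch of the clustering idea, again on simulated data rather than real lineups. The two four-man units (S1 to S4 and B1 to B4), the 10% noise rate, and the use of 1 minus the correlation as the distance are all assumptions made for illustration; heatmap() called with its defaults uses Euclidean distance between rows and scales by row, so this is not necessarily how the plots in this post were produced.

```r
set.seed(42)  # simulated toy data only

# Hypothetical setup: each stint is either a "starters" stint (1) or a "bench" stint (0).
n_stints <- 300
unit <- rbinom(n_stints, 1, 0.5)

# Flip each on/off indicator with probability p so the units are not perfectly separated.
flip <- function(x, p = 0.1) ifelse(runif(length(x)) < p, 1 - x, x)

stints <- cbind(
  sapply(1:4, function(i) flip(unit)),      # four starters
  sapply(1:4, function(i) flip(1 - unit))   # four bench players
)
colnames(stints) <- c(paste0("S", 1:4), paste0("B", 1:4))

player_cor <- cor(stints)

# Hierarchical clustering on a correlation-based distance (an assumption: 1 - r).
hc <- hclust(as.dist(1 - player_cor))
groups <- cutree(hc, k = 2)  # cut the dendrogram into two clusters

# heatmap() reorders rows and columns by the dendrogram so clusters sit together.
heatmap(player_cor, symm = TRUE, scale = "none",
        distfun = function(m) as.dist(1 - m),
        col = colorRampPalette(c("white", "red"))(20))
```

Printing `groups` shows which cluster each simulated player lands in; with units this cleanly separated, the starters and the bench players end up in different clusters and appear as two red blocks along the diagonal of the heatmap.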
To “read” these, first note that each player has a perfect correlation with himself, and so the diagonal will be dark red. If two players never play together, the tile will be white. In between those two extremes is a range of different shades of red of increasing intensity.
Enjoy and feel free to ask questions.
2012 Player Floor Time Correlations