July | 2014 | A HopStat and Jump Away

The table command is great in its simplicity for cross tabulations. I have run into some settings where it is slow and I wanted to demonstrate one simple example here of why you may want to use other functions or write your own tabler. This example is a specific case where, for some examples and functions, you don't need all the good error-checking or flexibility that a function contains, but you want to do something specific and can greatly speed up computation.

Setup of example

I have some brain imaging data. I have a gold standard, where an expert hand-traced (on a computer) a brain scan delineating the brain. I'll refer to this as a brain “mask”. (We use the word mask in imaging to denote a segmented image – either done manually or automatically and I generally reserve the word mask for binary 0/1 values, but others use the term more broadly.)

Using automated methods, I can try to re-create this mask automatically. This image is also binary. I want to simply get a \(2\times2\) contingency table of the automated versus manual masks so I can get sensitivity/specificity/accuracy/etc.

The data

For simplicity and computation, let's consider the images as just a really long vectors. I'll call them manual and auto for the manual and automatic masks, respectively.

These are long logical vectors (9 million elements):

length(manual)

[1] 9175040

length(auto)

[1] 9175040

head(manual)

[1] FALSE FALSE FALSE FALSE FALSE FALSE

head(auto)

[1] FALSE FALSE FALSE FALSE FALSE FALSE

Naturally, you can run table on this data:

stime = system.time({ tab = table(manual, auto) })
print(stime)

   user  system elapsed 
  3.294   0.082   3.386

print(tab)

       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

The computation took about 3.4 seconds on my MacBook Pro (2013, 16Gb RAM, 2.8GHz Intel i7), which isn't that bad. Realize, though, that I could have hundreds or thousands of these images. We need to speed this up.

What is the essense of what we're doing?

Taking 3.4 seconds to get 4 numbers seems a bit long. As the data is binary, we can simply compute these with the sum command and logical operators.

Let's make the twoXtwo command:

twoXtwo = function(x, y, dnames=c("x", "y")){
    tt <- sum( x &  y)
        tf <- sum( x & !y)
        ft <- sum(!x &  y)
        ff <- sum(!x & !y)
        tab = matrix(c(ff, tf, ft, tt), ncol=2)
    n = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
    names(n) = dnames
    dimnames(tab) = n
    tab = as.table(tab)
    dim
        tab
}

And let's see how fast this is (and confirm the result is the same):

stime2 = system.time({ twotab = twoXtwo(manual, auto, dnames=c("manual", "auto")) })
print(stime2)

   user  system elapsed 
  0.828   0.006   0.835

print(twotab)

       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

identical(tab, twotab)

[1] TRUE

Viola, twoXtwo runs about 4.06 times faster than table, largely because we knew we did not have to check certain characteristics of the data and that it's a specific example of a table.

More speed captain!

This isn't something astronomical such as a 100-fold increase, but we can increase the speed by not doing all the logical operations on the vectors, but taking differences from the margin sums.

Let's confirm this is faster and accurate by running it on our data:

stime3 = system.time({ twotab2 = twoXtwo2(manual, auto, dnames=c("manual", "auto")) })
print(stime3)

   user  system elapsed 
  0.198   0.001   0.200

print(twotab2)

       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

identical(tab, twotab2)

[1] TRUE

Now, if I were going for speed, this code is good enough for me: it runs about 16.93 times faster than table. The one downside is that it is not as readable as twoXtwo. For even greater speed, I could probably move into C++ using the Rcpp package, but that seems overkill for a two by two table.

Other examples of speeding up the calculation can be found here.

Finishing up

I said I wanted sensitivity/specificity/accuracy/etc. so I will show how to get these. I'm going to use prop.table, which I didn't know about for a while when I first started using R (see margin.table too).

ptab = prop.table(twotab)
rowtab = prop.table(twotab, margin=1)
coltab = prop.table(twotab, margin=2)

As you can see, like the apply command, the prop.table command can either take no margin or take the dimension to divide over (1 for rows, 2 for columns). This means that in ptab, each cell of twotab was divided by the grand total (or sum(tab)). For rowtab, each cell was divided by the rowSums(tab) to get a proportion, and similarly cells in coltab were divided by colSums(tab). After the end of the post, I can show you these are the same.

Getting Performance Measures

Accuracy

Getting the accuracy is very easy:

accur = sum(diag(ptab))
print(accur)

[1] 0.997

Sensitivity/Specificity

For sensitivity/specificity, the “truth” is the rows of the table, so we want the row percentages:

sens = rowtab["TRUE", "TRUE"]
spec = rowtab["FALSE", "FALSE"]
print(sens)

[1] 0.9874

print(spec)

[1] 0.9985

Positive/Negative Predictive Value

We can also get the positive predictive value (PPV) and negative predictive value (NPV) from the column percentages:

ppv = coltab["TRUE", "TRUE"]
npv = coltab["FALSE", "FALSE"]
print(ppv)

[1] 0.9902

print(npv)

[1] 0.9981

Conclusions

After using R for years, I find things to still be very intuitive. Sometimes, though, for large data sets or specific examples, you may want to write your own function for speed, checking against the base functions for a few iterations as a double-check. I have heard this to be a nuisance for those who dislike R, as well as hampering reproducibility in some cases. Overall, I find that someone has made a vetted package that does what you want, but there are still simple cases (such as above) where “rolling your own” is OK and easier than adding a dependency to your code.

Aside: How prop.table works

Over all margins

For just doing prop.table without a margin, you can think of the table being divided by its sum.

print(round(ptab, 3))

       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

print(round(twotab / sum(twotab), 3))

       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

Over row margins

For the margin=1, or row percentages, you can think of dividing the table by the row sums.

print(round(rowtab, 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

print(round(twotab / rowSums(twotab), 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

Over column margins

Now for column percentages, you can think of R dividing each cell by its column's sum. This is what prop.table does.

Let's look at the result we should get:

print(round(coltab, 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

But in R, when you take a matrix and then add/divide/subtract/multiply it by a vector, R does the operation column-wise. So when you take column sums on the table, you get a vector with the same number of columns as the table:

print(colSums(twotab))

  FALSE    TRUE 
7956925 1218115

If you try to divide the table by this value, it will not give you the desired result:

print(round( twotab / colSums(twotab), 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.990

R operations with matrices and vectors

This is because R thinks of a vector as a column vector (or a matrix of 1 column). R then takes the first column of the table and divides the first element from the first column sum (which is correct), but take the second element of the first column and divides it by the second column sum (which is not correct).
A detailed discussion on SO is located here of how to do row-wise operations on matrices.

Back to column percentages

We can use the t( t() ) operation to get the correct answer:

print(round( t( t(twotab) / colSums(twotab)), 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

You can think of R implicitly making the matrix of the correct size with the correct values and then dividing:

cs = colSums(twotab)
cs = matrix(cs, nrow=nrow(tab), ncol=ncol(tab), byrow=TRUE)
print(cs)

        [,1]    [,2]
[1,] 7956925 1218115
[2,] 7956925 1218115

print(round( twotab/cs, 3))

       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

Happy tabling!

I'm helping out with teaching a class on an introduction to R for students this week. I figured it'd be a good time to discuss my thoughts on programming in R and how a newcomer should feel about learning the language.

Those Who Teach R, Should Use R

Many of the students in the class at the beginning are overwhelmed. They see a bunch of different symbols and syntax; being overwhelmed is understandable. Moreover, those teaching you R can seem lightning fast when running code, moving around screens, or figuring out problems, and they usually are. Do you know why?

Those who are teaching you R, use R. They use R a lot. They use it daily, and for hours a day usually. Don't try to be them immediately, get the basics.

I hear students saying “Oh, you're fast” when power users help them, usually lined with an undertone of low self-esteem. That's like me going to a basketball camp run by Michael Jordan and saying “Oh man Michael, you're really good on the court”. Be like Mike: work hard to learn the basics like the back of your hand. You'll be dunking in no time.

And, of course, those who teach you are fast. Why would you take class from someone slow or unsure while doing what they're teaching you? Yes, your teachers may be fast, but that's the point. Learn from those who do.

I feel as though I'm relatively fast on my machine and I felt this starting out. Even more, I didn't know what tab completion was when starting out my first class. I thought my professor could actually type that fast when writing variables or directory paths. I thought it was magic. If I didn't stick it out, I wouldn't figure out how to make that magic myself.

R is a language

I took Spanish in high school for 4 years. I remember some vocabulary words and some conjugation rules, but am far from conversational. I've never taken French – I don't know French. Now, if I took an introduction to French class for 4 hours, do you think I could speak (or write) French fluently? No, of course not. Yet students think they can with R. My Spanish is like a background in Stata: some words/phrases/commands are similar, other are misleading and can be confusing.

R is a programming language. Just like a foreign language, R has syntax and grammar. You must learn simple punctuation such as placement of commas, assigment using “=” or “<-”, where to close parentheses, and when to notice when a quote is unmatched. My overall message is:

R is a language, treat it like one.

Remember to tell students to hold themselves to the same level of comprehension as they would for a spoken (or signed) or written language. Hopefully, that will put learning R in perspective, even if it does not make it any less overwhelming. I wonder if Rosetta Stone will make an R module one of these days.

A HopStat and Jump Away

Trying to at least Doggie Paddle through the Sea of Data, Contributor to http://bmorebiostat.com

Monthly Archives: July 2014

Sometimes Table is not the Answer – a Faster 2×2 Table