#rstats Make arrays into vectors before running table

Setup of Problem

While working with nifti objects from the oro.nifti, I tried to table the values of the image. The table took a long time to compute. I thought this was due to the added information about a medical image, but I found that the same sluggishness happened when coercing the nifti object to an array as well.

Quick, illustrative simulation

But, if I coerced the data to a vector using the c function, things were much faster. Here's a simple example of the problem.

library(microbenchmark)
dim1 = 30
n = dim1 ^ 3
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = c(dim1, dim1, dim1))
microbenchmark(table(vec), table(arr), table(c(arr)), times = 100)
Unit: milliseconds
          expr       min        lq      mean    median        uq      max
    table(vec)  5.767608  5.977569  8.052919  6.404160  7.574409 51.13589
    table(arr) 21.780273 23.515651 25.050044 24.367534 25.753732 68.91016
 table(c(arr))  5.803281  6.070403  6.829207  6.786833  7.374568  9.69886
 neval cld
   100  a 
   100   b
   100  a 

As you can see, it's much faster to run table on the vector than the array, and the coercion of an array to a vector doesn't take much time compared to the tabling and is comparable in speed.

Explanation of simulation

If the code above is clear, you can skip this section. I created an array that was 30 × 30 × 30 from random binomial variables with half probabily and 15 Bernoulli trials. To keep things on the same playing field, the array (arr) and the vector (vec) have the same values in them. The microbenchmark function (and package of the same name) will run the command 100 times and displays the statistics of the time component.

Why, oh why?

I've looked into the table function, but cannot seem to find where the bottleneck occurs. Now, for and array of 30 × 30 × 30, it takes less than a tenth of a second to compute. The problem is when the data is 512 × 512 × 30 (such as CT data), the tabulation using the array form can be very time consuming.

I reduced the replicates, but let's show see this in a reasonable image dimension example:

library(microbenchmark)
dims = c(512, 512, 30)
n = prod(dims)
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = dims)
microbenchmark(table(vec), table(arr), table(c(arr)), times = 10)
Unit: seconds
          expr      min       lq     mean    median        uq       max
    table(vec) 1.871762 1.898383 1.990402  1.950302  1.990898  2.299721
    table(arr) 8.935822 9.355209 9.990732 10.078947 10.449311 11.594772
 table(c(arr)) 1.925444 1.981403 2.127866  2.018741  2.222639  2.612065
 neval cld
    10  a 
    10   b
    10  a 

Conclusion

I can't figure out why right now, but it seems that coercing an array (or nifti image) to a vector before running table can significantly speed up the procedure. If anyone has any intuition why this is, I'd love to hear it. Hope that helps your array tabulations!

Advertisements

3 thoughts on “#rstats Make arrays into vectors before running table

  1. Profiling is your friend! A quick profiling shows that the time is spent in factor() inside table(arr). If you benchmark factor instead of table, you will get almost the same times. I didn’t go on, but you could profile factor(arr) and see where the time was spent inside it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s