A Faster Scale Function

Problem Setup

In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ/σ are the mean/standard deviation of the column (typically the column).

The answer on that forum eventually came down to using the scale command. The scale function is great. It is simple and has 2 options other than passing in the matrix:

  1. Center – should the columns have their mean subracted off?
  2. Scale – should the columns have their standard deviation divided after/not centering?

Overall, the function is fast, but it can always be faster.

The matrixStats package has some very quick operations for row/column operations and computing statistics along these margins.

Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
    nrow = 1e5)
dim(mat)

How fast is scale?

Let’s see how fast the scale function is:

system.time({
    sx = scale(mat)
    })
   user  system elapsed
  0.971   0.285   1.262

That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in mat.

Defining a new function

First, we must load in the matrixStats package and the only function we really are using is the colSds.

library(matrixStats)

colScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the column means
  ################
    cm = colMeans(x, na.rm = TRUE)
  ################
  # Get the column sd
  ################
    if (scale) {
        csd = colSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = t( (t(x) - cm) / csd )
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}

Let’s break down what the function is doing:

  1. The function takes in a matrix x with options:
    1. subsetting rows or cols
    2. center each column (by the mean) or not
    3. scale each column (by the standard deviation) or not
    4. Add the attributes of center/scale, so they match the scale output.
  2. The functions subsets the matrix if options are passed.
  3. Column means are calculated
  4. Column standard deviations are calculated (using the colmeans) if scale = TRUE or simply set to 1 if scale = FALSE.
  5. If the data is not to be centered, the centers are set to 0.
  6. The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
    • The reason this is done is because R operates column-wise. Let p be the number of columns. The column means/sds are of length p. If one simply subtracted the column means, R would try to do this to each individual column. For instance, it would recycle the p numbers to get to length n (number of rows), and do that subtraction, which is not what we want.
  7. The attributes are added to the matrix to match scale output.

colScale timing

Now we can see how fast the colScale command would take:

system.time({
    csx = colScale(mat)
})
   user  system elapsed
  0.231   0.039   0.271

This is a lot faster than the scale function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)
[1] TRUE

Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark package to

library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)
Unit: seconds
          expr       min        lq      mean    median        uq      max
 colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507
    scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223
 neval cld
    20  a
    20   b

We can visualize the results using ggplot2 and some violin plots.

library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)")
print(g)

plot of chunk gg

What about scaling rows!

If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

system.time({
  scaled_row = t( scale(t(mat)) )
})
   user  system elapsed
  2.165   0.624   3.398
all(abs(rowMeans(scaled_row)) < 1e-15)
[1] TRUE

Again, we can do the same thing with colScale:

system.time({
  colscaled_row = t( colScale(t(mat)) )
})
   user  system elapsed
  0.426   0.097   0.542
all(abs(rowMeans(colscaled_row)) < 1e-15)
[1] TRUE
all.equal(colscaled_row, scaled_row)
[1] TRUE

And we see the results are identical

Creating rowScale

The above results are good for what we would like to do. We may want to define the rowScale function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.

Again, if we’re about improving speed, this may help.

rowScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the column means
  ################
    cm = rowMeans(x, na.rm = TRUE)
  ################
  # Get the column sd
  ################
    if (scale) {
        csd = rowSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = (x - cm) / csd
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}

Now let’s see how we do with rowScale:

system.time({
  rowscaled_row = rowScale(mat)
})
   user  system elapsed
  0.174   0.016   0.206
all(abs(rowMeans(rowscaled_row)) < 1e-15)
[1] TRUE
all.equal(rowscaled_row, scaled_row)
[1] TRUE

Let’s look at the times for this breakdown using microbenchmark:

mb_row = microbenchmark(t( colScale(t(mat)) ),
                        t( scale(t(mat)) ),
                        rowScale(mat),
                        times = 20, unit = "s")
print(mb_row)
Unit: seconds
                expr       min        lq      mean    median        uq
 t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429
    t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272
       rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327
       max neval cld
 0.9008130    20  a
 5.0518348    20   b
 0.5138103    20  a

and visualize the results:

g %+% mb_row

plot of chunk gg_row

Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale function is well suited for this purpose, but the matrixStats package allows for faster computation done in C. The scale function will have different behavior as the code below from base::scale.default:

f <- function(v) {
  v <- v[!is.na(v)]
  sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)

If the data is not centered and center = FALSE, the data will be divided by the squared sum of each column (divided by n-1). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and colScale/rowScale can do that if necessary. I will talk to Henrik Bengtsson (matrixStats author/maintainer) about incorporating these into matrixStats for general use. But for now, you can use the above code.

Advertisements

How I build up a ggplot2 figure

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes “The bottom line is for production graphics, any system requires work.” and describes a default plot that needs some work:

library(ggplot2)
ggplot(data = quakes, aes(x = lat,y = long,colour = stations)) + geom_point()

plot of chunk plot

To break down what is going on, here is what R interprets (more or less):

  1. Make a container for data ggplot.
  2. Use the quakes data.frame: data = quakes.
  3. Map certain “aesthetics” with the aes to three different aesthetics (x, y, z) to certain variables from the dataset lat, long, stations, respectively.
  4. Add a layer of geometric things, in this case points (geom_point).

Implicitly, ggplot2 notes that all 3 aesthetics are continuous, so maps them onto the plot using a “continuous” scale (color bar). If stations were a factor or character column, the plot would not have a color bar but a “discrete” scale.

Now, Jeff goes on to describe elements he believes required to make this plot “production ready”:

  1. make the axes bigger
  2. make the labels bigger
  3. make the labels be full names (latitude and longitude, ideally with units when variables need them
  4. make the legend title be number of stations reporting

As such, I wanted to go through each step and show how you can do each of these operations

Make the Axes/Labels Bigger

First off, let’s assign this plot to an object, called g:

g = ggplot(data = quakes,
           aes(x = lat,y = long,colour = stations)) +
  geom_point()

Now, you can simply call print(g) to show the plot, but the assignment will not do that by default. If you simply call g, it will print/show the object (as other R objects do), and plot the graph.

Theme – get to know it

One of the most useful ggplot2 functions is theme. Read the documentation (?theme). There is a slew of options, but we will use a few of them for this and expand on them in the next sections.

Setting a global text size

We can use the text argument to change ALL the text sizes to a value. Now this is where users who have never used ggplot2 may be a bit confused. The text argument (input) in the theme command requires that text be an object of class element_text. If you look at the theme help it says “all text elements (element_text)”. This means you can’t just say text = 5, you must specify text = element_text().

As text can have multiple properties (size, color, etc.), element_text can take multiple arguments for these properties. One of these arguments is size:

g + theme(text = element_text(size = 20))

plot of chunk bigger_axis

Again, note that the text argument/property of theme changes all the text sizes. Let’s say we want to change the axis tick text (axis.text), legend header/title (legend.title), legend key text (legend.text), and axis label text (axis.title) to each a different size:

gbig = g + theme(axis.text = element_text(size = 18),
                 axis.title = element_text(size = 20),
                 legend.text = element_text(size = 15),
                 legend.title = element_text(size = 15))
gbig

plot of chunk bigger_axis2

Now, we still have the plot g stored, but we make a new version of the graph, called gbig.

Make the Labels to be full names

To change the x or y labels, you can just use the xlab/ylab functions:

gbig = gbig + xlab(&quot;Latitude&quot;) + ylab(&quot;Longitude&quot;)
gbig

plot of chunk lab_full

We want to keep these labels, so we overwrote gbig.

Maybe add a title

Now, one may assume there is a main() function from ggplot2 to give the title of the graph, but that function is ggtitle(). Note, there is a title command in base R, so this was not overwritten. It can be used by just adding this layer:

gbig + ggtitle(&quot;Spatial Distribution of Stations&quot;)

plot of chunk title

Note, the title is smaller than the specified axes label sizes by default. Again if we wanted to make that title bigger, we can change that using theme:

gbig +
  ggtitle(&quot;Spatial Distribution of Stations&quot;) +
  theme(title = element_text(size = 30))

plot of chunk big_title

I will not reassign this to a new graph as in some figures for publications, you make the title in the figure legend and not the graph itself.

Making a better legend

Now let’s change the header/title of the legend to be number of stations. We can do this using the guides function:

gbigleg_orig = gbig + guides(colour = guide_colorbar(title = &quot;Number of Stations Reporting&quot;))
gbigleg_orig

plot of chunk leg

Here, guides takes arguments that are the same as the aesthetics from before in aes. Also note, that color and colour are aliased so that you can spell it either way you want.

I like the size of the title, but I don’t like how wide it is. We can put line breaks in there as well:

gbigleg = gbig + guides(colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;))
gbigleg

plot of chunk leg2

Ugh, let’s also adjust the horizontal justification, so the title is centered:

gbigleg = gbigleg +
  guides(colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;,
                                 title.hjust = 0.5))
gbigleg

plot of chunk leg_adjust

That looks better for the legend, but we still have a lot of wasted space.

Legend IN the plot

One of the things I believe is that the legend should be inside the plot. In order to do this, we can use the legend.position from the themes:

gbigleg +
  theme(legend.position = c(0.3, 0.35))

plot of chunk leg_inside

Now, there seems can be a few problems here:

  1. There may not be enough place to put the legend
  2. The legend may mask out points/data

For problem 1., we can either 1) make the y-axis bigger or the legend smaller or a combination of both. In this case, we do not have to change the axes, but you can use ylim to change the y-axis limits:

gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  ylim(c(160, max(quakes$long)))

plot of chunk change_ylim

I try to not do this as area has been added with no data information. We have enough space, but let’s make the legend “transparent” so we can at least see if any points are masked out and to make the legend look a more inclusive part of the plot.

Making a transparent legend

I have a helper “function” transparent_legend that will make the box around the legend (legend.background) transparent and the boxes around the keys (legend.key) transparent as well. Like text before, we have to specify boxes/backgrounds as an element type, but these are rectangles (element_rect) compared to text (element_text).

transparent_legend =  theme(
  legend.background = element_rect(fill = &quot;transparent&quot;),
  legend.key = element_rect(fill = &quot;transparent&quot;,
                            color = &quot;transparent&quot;)
)

One nice thing is that we can save this as an object and simply “add” it to any plot we want a transparent legend. Let’s add this to what we had and see the result:

gtrans_leg = gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  transparent_legend
gtrans_leg

plot of chunk leg_inside2

Moving the title of the legend

Now, everything in gtrans_leg looks acceptable (to me) except for the legend title. We can move the title of the legend to the left hand side:

gtrans_leg + guides(colour = guide_colorbar(title.position = &quot;left&quot;))

plot of chunk leg_left

Damnit! Note, that if you respecify the guides, you must make sure you do it all in one shot (easiest way):

gtrans_leg + guides(
  colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;,
                          title.hjust = 0.5,
                          title.position = &quot;left&quot;))

plot of chunk leg_left_correct

A little more advanced

The last statement is not entirely true, as we could dig into the ggplot2 object and assign a different title.position property to the object after the fact.

gtrans_leg$guides$colour$title.position = &quot;left&quot;
gtrans_leg

plot of chunk respec

“I don’t like that theme”

Many times, I have heard people who like the grammar of ggplot2 but not the specified theme that is default. The ggthemes package has some good extensions of theme from ggplot2, but there are also a bunch of themes included in ggplot2, which should be specified before changing specific elements of theme as done above:

g + theme_bw()

plot of chunk themes

g + theme_dark()

plot of chunk themes

g + theme_minimal()

plot of chunk themes

g + theme_classic()

plot of chunk themes

Conclusions

I agree that ggplot2 can deceive new users by making graphs that look “good”-ish. This may be a detriment as they may believe they are good enough, when they truly need to be changed. The changes are available in base or ggplot2 and the overall goal was to show how the recommendations can be achieved using ggplot2 commands.

Below, I discuss some other aspects of the post, where you can use ggplot2 to make quick-ish exploratory plots. I believe, however, that ggplot2 is not the fastest for quick basic exploratory plots. What is is better than base graphics is for making slightly more complex exploratory plots that are necessary for analysis, where base can take more code to do.

How to make quick exploratory plots

I agree with Jeff that the purpose of exploratory plots should be done quickly and a broad range of plots done with minimal code.

Now, I agree that plot is a great function. I do believe that you can create many quick plots using ggplot2 and can be faster than base in some instances. A specific case would be that you have a binary y variable and multiple continous x variables. Let’s say I want to plot jittered points, a fit from a binomial glm (logistic regression), and one from a loess.

Here we will use mtcars and say if the car is automatic or manual (am variable) is our outcome.

g = ggplot(aes(y = am), data = mtcars) +
  geom_point(position = position_jitter(height = 0.2)) +
  geom_smooth(method = &quot;glm&quot;,
              method.args = list(family = &quot;binomial&quot;), se = FALSE) +
  geom_smooth(method = &quot;loess&quot;, se = FALSE, col = &quot;red&quot;)

Then we can simply add the x variables as aesthetics to look at each of these:

g + aes(x = mpg)

plot of chunk unnamed-chunk-2

g + aes(x = drat)

plot of chunk unnamed-chunk-2

g + aes(x = qsec)

plot of chunk unnamed-chunk-2

Yes, you can create a function to do the operations above in base, but that’s 2 sides of the same coin: function versus ggplot2 object.

 

All code is located on my GitHub for my blog