## Problem Setup

In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:

(x – μ)/σ

where μ/σ are the mean/standard deviation of the column (typically the column).

The answer on that forum eventually came down to using the `scale`

command. The `scale`

function is great. It is simple and has 2 options other than passing in the matrix:

- Center – should the columns have their mean subracted off?
- Scale – should the columns have their standard deviation divided after/not centering?

Overall, the function is fast, but it can always be faster.

The `matrixStats`

package has some **very** quick operations for row/column operations and computing statistics along these margins.

## Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

set.seed(1) mat = matrix(rnorm(1e7, mean = 14, sd = 5), nrow = 1e5) dim(mat)

### How fast is scale?

Let’s see how fast the `scale`

function is:

system.time({ sx = scale(mat) })

user system elapsed 0.971 0.285 1.262

That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in `mat`

.

## Defining a new function

First, we must load in the `matrixStats`

package and the only function we really are using is the `colSds`

.

library(matrixStats) colScale = function(x, center = TRUE, scale = TRUE, add_attr = TRUE, rows = NULL, cols = NULL) { if (!is.null(rows) && !is.null(cols)) { x <- x[rows, cols, drop = FALSE] } else if (!is.null(rows)) { x <- x[rows, , drop = FALSE] } else if (!is.null(cols)) { x <- x[, cols, drop = FALSE] } ################ # Get the column means ################ cm = colMeans(x, na.rm = TRUE) ################ # Get the column sd ################ if (scale) { csd = colSds(x, center = cm) } else { # just divide by 1 if not csd = rep(1, length = length(cm)) } if (!center) { # just subtract 0 cm = rep(0, length = length(cm)) } x = t( (t(x) - cm) / csd ) if (add_attr) { if (center) { attr(x, "scaled:center") <- cm } if (scale) { attr(x, "scaled:scale") <- csd } } return(x) }

Let’s break down what the function is doing:

- The function takes in a matrix
`x`

with options:- subsetting
`rows`

or`cols`

- center each column (by the mean) or not
- scale each column (by the standard deviation) or not
- Add the attributes of center/scale, so they match the
`scale`

output.

- subsetting
- The functions subsets the matrix if options are passed.
- Column means are calculated
- Column standard deviations are calculated (using the colmeans) if
`scale = TRUE`

or simply set to 1 if`scale = FALSE`

. - If the data is not to be centered, the centers are set to 0.
- The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
- The reason this is done is because
`R`

operates column-wise. Let`p`

be the number of columns. The column means/sds are of length`p`

. If one simply subtracted the column means,`R`

would try to do this to each individual column. For instance, it would recycle the`p`

numbers to get to length`n`

(number of rows), and do that subtraction, which is not what we want.

- The reason this is done is because
- The attributes are added to the matrix to match
`scale`

output.

### colScale timing

Now we can see how fast the `colScale`

command would take:

system.time({ csx = colScale(mat) })

user system elapsed 0.231 0.039 0.271

This is a lot faster than the `scale`

function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)

[1] TRUE

## Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the `microbenchmark`

package to

library(microbenchmark) mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s") print(mb)

Unit: seconds expr min lq mean median uq max colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507 scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223 neval cld 20 a 20 b

We can visualize the results using `ggplot2`

and some violin plots.

library(ggplot2) g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)") print(g)

## What about scaling rows!

If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

system.time({ scaled_row = t( scale(t(mat)) ) })

user system elapsed 2.165 0.624 3.398

all(abs(rowMeans(scaled_row)) < 1e-15)

[1] TRUE

Again, we can do the same thing with `colScale`

:

system.time({ colscaled_row = t( colScale(t(mat)) ) })

user system elapsed 0.426 0.097 0.542

all(abs(rowMeans(colscaled_row)) < 1e-15)

[1] TRUE

all.equal(colscaled_row, scaled_row)

[1] TRUE

And we see the results are identical

### Creating rowScale

The above results are good for what we would like to do. We may want to define the `rowScale`

function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.

Again, if we’re about improving speed, this may help.

rowScale = function(x, center = TRUE, scale = TRUE, add_attr = TRUE, rows = NULL, cols = NULL) { if (!is.null(rows) && !is.null(cols)) { x <- x[rows, cols, drop = FALSE] } else if (!is.null(rows)) { x <- x[rows, , drop = FALSE] } else if (!is.null(cols)) { x <- x[, cols, drop = FALSE] } ################ # Get the column means ################ cm = rowMeans(x, na.rm = TRUE) ################ # Get the column sd ################ if (scale) { csd = rowSds(x, center = cm) } else { # just divide by 1 if not csd = rep(1, length = length(cm)) } if (!center) { # just subtract 0 cm = rep(0, length = length(cm)) } x = (x - cm) / csd if (add_attr) { if (center) { attr(x, "scaled:center") <- cm } if (scale) { attr(x, "scaled:scale") <- csd } } return(x) }

Now let’s see how we do with `rowScale`

:

system.time({ rowscaled_row = rowScale(mat) })

user system elapsed 0.174 0.016 0.206

all(abs(rowMeans(rowscaled_row)) < 1e-15)

[1] TRUE

all.equal(rowscaled_row, scaled_row)

[1] TRUE

Let’s look at the times for this breakdown using `microbenchmark`

:

mb_row = microbenchmark(t( colScale(t(mat)) ), t( scale(t(mat)) ), rowScale(mat), times = 20, unit = "s") print(mb_row)

Unit: seconds expr min lq mean median uq t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429 t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272 rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327 max neval cld 0.9008130 20 a 5.0518348 20 b 0.5138103 20 a

and visualize the results:

g %+% mb_row

## Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The `scale`

function is well suited for this purpose, but the `matrixStats`

package allows for faster computation done in C. The `scale`

function will have different behavior as the code below from `base::scale.default`

:

f <- function(v) { v <- v[!is.na(v)] sqrt(sum(v^2)/max(1, length(v) - 1L)) } scale <- apply(x, 2L, f)

If the data is not centered and `center = FALSE`

, the data will be divided by the squared sum of each column (divided by `n-1`

). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and `colScale/rowScale`

can do that if necessary. I will talk to Henrik Bengtsson (`matrixStats`

author/maintainer) about incorporating these into `matrixStats`

for general use. But for now, you can use the above code.