# A Faster Scale Function

## Problem Setup

In a recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. First, we must define what we mean by “normalizing” a matrix/data.frame.

One way to standardize/normalize a variable is to subtract the minimum and divide by the range (max − min), which puts the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:

(x – μ)/σ

where μ and σ are the mean and standard deviation of the variable (typically computed per column).
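A minimal base-`R` sketch of this z-score computation (toy values for illustration):

```r
# z-score normalize a single vector by hand
x <- c(2, 4, 6, 8, 10)
z <- (x - mean(x)) / sd(x)

mean(z)  # 0, up to floating point
sd(z)    # 1
```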

The answer on that forum eventually came down to using the `scale` command. The `scale` function is great. It is simple and has 2 options other than passing in the matrix:

1. `center` – should the columns have their means subtracted off?
2. `scale` – should the columns be divided by their standard deviations (after centering, if requested)?
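As a quick sketch of those options on a toy matrix (values made up for illustration):

```r
m <- matrix(1:6, nrow = 3)

scale(m)                  # center and scale: z-scores per column
scale(m, scale = FALSE)   # center only: subtract the column means
scale(m, center = FALSE)  # scale only: divides by root mean square, not sd
```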

Overall, the function is fast, but it can always be faster.

The `matrixStats` package has some very quick operations for row/column operations and computing statistics along these margins.
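For example, `colSds` gives the same values as looping `sd` over the columns with `apply`, but the computation is done in C (a small sketch, assuming `matrixStats` is installed):

```r
library(matrixStats)

m <- matrix(rnorm(20), nrow = 5)
all.equal(colSds(m), apply(m, 2, sd))  # TRUE, but colSds avoids the R-level loop
```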

## Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

```r
set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
             nrow = 1e5)
dim(mat)
```

### How fast is scale?

Let’s see how fast the `scale` function is:

```r
system.time({
  sx = scale(mat)
})
```
```
   user  system elapsed
  0.971   0.285   1.262
```

That’s pretty fast! Overall, there is not much room for improvement in this case, but the speed may matter if you have many matrices, or matrices bigger than `mat` as defined here.

## Defining a new function

First, we load the `matrixStats` package; the only function from it we really use is `colSds`.

```r
library(matrixStats)

colScale = function(x,
                    center = TRUE,
                    scale = TRUE,
                    rows = NULL,
                    cols = NULL) {

  if (!is.null(rows) && !is.null(cols)) {
    x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {
    x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) {
    x <- x[, cols, drop = FALSE]
  }

  ################
  # Get the column means
  ################
  cm = colMeans(x, na.rm = TRUE)
  ################
  # Get the column sds
  ################
  if (scale) {
    csd = colSds(x, center = cm)
  } else {
    # just divide by 1 if not scaling
    csd = rep(1, length = length(cm))
  }
  if (!center) {
    # just subtract 0 if not centering
    cm = rep(0, length = length(cm))
  }
  x = t( (t(x) - cm) / csd )
  if (center) {
    attr(x, "scaled:center") <- cm
  }
  if (scale) {
    attr(x, "scaled:scale") <- csd
  }
  return(x)
}
```

Let’s break down what the function is doing:

1. The function takes in a matrix `x` with options to:
    1. subset `rows` or `cols`,
    2. center each column (by its mean) or not,
    3. scale each column (by its standard deviation) or not,
    4. add the `scaled:center`/`scaled:scale` attributes, so the output matches that of `scale`.
2. The function subsets the matrix if subsetting options are passed.
3. Column means are calculated.
4. Column standard deviations are calculated (using the column means) if `scale = TRUE`, or simply set to 1 if `scale = FALSE`.
5. If the data are not to be centered, the centers are set to 0.
6. The data are transposed, the means are subtracted, the result is divided by the standard deviations, and the data are transposed back.
    - The reason for the transposing is that `R` operates column-wise. Let `p` be the number of columns; the column means/sds are vectors of length `p`. If one simply subtracted the column means from the matrix, `R` would recycle the `p` numbers down each column until reaching length `n` (the number of rows), which is not the subtraction we want.
7. The attributes are added to the matrix to match the `scale` output.
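A small base-`R` sketch of the recycling issue behind the double transpose (toy matrix for illustration):

```r
m  <- matrix(1:6, nrow = 3, ncol = 2)  # columns are 1:3 and 4:6
cm <- colMeans(m)                      # c(2, 5)

wrong <- m - cm        # recycles c(2, 5) DOWN the columns: not per-column centering
right <- t(t(m) - cm)  # transpose so the subtraction lines up with columns

colMeans(right)        # c(0, 0): each column correctly centered
# an equivalent without the transposes is sweep(m, 2, cm, "-")
```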

### colScale timing

Now we can see how fast the `colScale` command is:

```r
system.time({
  csx = colScale(mat)
})
```
```
   user  system elapsed
  0.231   0.039   0.271
```

This is a lot faster than the `scale` function. First and foremost, let us make sure that these give the same results:

```r
all.equal(sx, csx)
```
```
[1] TRUE
```

## Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the `microbenchmark` package to run each operation multiple times and compare the timing distributions:

```r
library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)
```
```
Unit: seconds
          expr       min        lq      mean    median        uq      max neval cld
 colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507    20  a 
    scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223    20   b
```

We can visualize the results using `ggplot2` and some violin plots.

```r
library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) +
  geom_violin() +
  theme_grey(base_size = 20) +
  xlab("Method") + ylab("Time (seconds)")
print(g)
```

If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running `scale`, and then transposing back:

```r
system.time({
  scaled_row = t( scale(t(mat)) )
})
```
```
   user  system elapsed
  2.165   0.624   3.398
```
```r
all(abs(rowMeans(scaled_row)) < 1e-15)
```
```
[1] TRUE
```

Again, we can do the same thing with `colScale`:

```r
system.time({
  colscaled_row = t( colScale(t(mat)) )
})
```
```
   user  system elapsed
  0.426   0.097   0.542
```
```r
all(abs(rowMeans(colscaled_row)) < 1e-15)
```
```
[1] TRUE
```
```r
all.equal(colscaled_row, scaled_row)
```
```
[1] TRUE
```

And we see the results are identical.

### Creating rowScale

The above results are good for what we would like to do. We may want to define a `rowScale` function (as below) that avoids the transposing and transposing back, as this may take some extra time.

Again, if we care about improving speed, this may help.

```r
rowScale = function(x,
                    center = TRUE,
                    scale = TRUE,
                    rows = NULL,
                    cols = NULL) {

  if (!is.null(rows) && !is.null(cols)) {
    x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {
    x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) {
    x <- x[, cols, drop = FALSE]
  }

  ################
  # Get the row means
  ################
  cm = rowMeans(x, na.rm = TRUE)
  ################
  # Get the row sds
  ################
  if (scale) {
    csd = rowSds(x, center = cm)
  } else {
    # just divide by 1 if not scaling
    csd = rep(1, length = length(cm))
  }
  if (!center) {
    # just subtract 0 if not centering
    cm = rep(0, length = length(cm))
  }
  x = (x - cm) / csd
  if (center) {
    attr(x, "scaled:center") <- cm
  }
  if (scale) {
    attr(x, "scaled:scale") <- csd
  }
  return(x)
}
```

Now let’s see how we do with `rowScale`:

```r
system.time({
  rowscaled_row = rowScale(mat)
})
```
```
   user  system elapsed
  0.174   0.016   0.206
```
```r
all(abs(rowMeans(rowscaled_row)) < 1e-15)
```
```
[1] TRUE
```
```r
all.equal(rowscaled_row, scaled_row)
```
```
[1] TRUE
```

Let’s look at the times for this breakdown using `microbenchmark`:

```r
mb_row = microbenchmark(t( colScale(t(mat)) ),
                        t( scale(t(mat)) ),
                        rowScale(mat),
                        times = 20, unit = "s")
print(mb_row)
```
```
Unit: seconds
                expr       min        lq      mean    median        uq       max neval cld
 t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429 0.9008130    20  a 
    t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272 5.0518348    20   b
       rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327 0.5138103    20  a 
```

and visualize the results:

```r
g %+% mb_row
```

## Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The `scale` function is well suited for this purpose, but the `matrixStats` package allows for faster computation done in C. Note that `scale` behaves differently when not centering, as seen in the code below from `base::scale.default`:

```r
f <- function(v) {
v <- v[!is.na(v)]
sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)
```

If the data are not centered (`center = FALSE`), `scale` divides each column by its root mean square: the square root of the sum of squares divided by `n - 1`, not by the standard deviation. This may be the desired behavior, but if the user wants to divide by the standard deviation instead, `colScale`/`rowScale` can do that. I will talk to Henrik Bengtsson (the `matrixStats` author/maintainer) about incorporating these into `matrixStats` for general use. But for now, you can use the above code.
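A quick sketch of that difference using only base `R` (toy data; the root mean square is computed by hand, matching the `f` above):

```r
set.seed(2)
v <- rnorm(100, mean = 3)  # a column that is NOT mean zero
n <- length(v)

rms <- sqrt(sum(v^2) / (n - 1))  # what scale(..., center = FALSE) divides by
sd(v)                            # a standard-deviation-based divisor instead

# the two only agree when the column mean is 0
rms - sd(v)  # clearly nonzero here
```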

# How I build up a ggplot2 figure

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes, “The bottom line is for production graphics, any system requires work,” and describes a default plot that needs some work:

```r
library(ggplot2)
ggplot(data = quakes, aes(x = lat, y = long, colour = stations)) +
  geom_point()
```

To break down what is going on, here is what R interprets (more or less):

1. Make a container for data `ggplot`.
2. Use the `quakes` `data.frame`: `data = quakes`.
3. Map certain “aesthetics” with `aes`, tying three aesthetics (`x`, `y`, `colour`) to the variables `lat`, `long`, and `stations` from the dataset, respectively.
4. Add a layer of geometric things, in this case points (`geom_point`).

Implicitly, `ggplot2` notes that all 3 aesthetics are continuous, so maps them onto the plot using a “continuous” scale (color bar). If `stations` were a factor or character column, the plot would not have a color bar but a “discrete” scale.
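As a sketch of that distinction (assuming `ggplot2` is available; the three-bin `cut` is an arbitrary choice for illustration), coercing `stations` to a factor flips the colour mapping to a discrete legend:

```r
library(ggplot2)

# Binning the continuous stations variable into a factor switches the
# colour mapping from a continuous colour bar to a discrete legend
ggplot(data = quakes,
       aes(x = lat, y = long, colour = cut(stations, 3))) +
  geom_point()
```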

Now, Jeff goes on to describe elements he believes required to make this plot “production ready”:

1. make the axes bigger
2. make the labels bigger
3. make the labels be full names (latitude and longitude), ideally with units when variables need them
4. make the legend title be number of stations reporting

As such, I wanted to go through each step and show how you can do each of these operations.

## Make the Axes/Labels Bigger

First off, let’s assign this plot to an object, called `g`:

```r
g = ggplot(data = quakes,
           aes(x = lat, y = long, colour = stations)) +
  geom_point()
```

Now, you can simply call `print(g)` to show the plot, but the assignment itself will not do that. If you simply call `g`, it will print/show the object (as other R objects do) and plot the graph.

### Theme – get to know it

One of the most useful `ggplot2` functions is `theme`. Read the documentation (`?theme`). There is a slew of options, but we will use a few of them for this and expand on them in the next sections.

### Setting a global text size

We can use the `text` argument to change ALL the text sizes to a value. Now this is where users who have never used `ggplot2` may be a bit confused. The `text` argument (input) in the `theme` command requires that `text` be an object of class `element_text`. If you look at the `theme` help it says “`all text elements (element_text)`”. This means you can’t just say `text = 5`, you must specify `text = element_text()`.

As text can have multiple properties (`size`, `color`, etc.), `element_text` can take multiple arguments for these properties. One of these arguments is `size`:

```r
g + theme(text = element_text(size = 20))
```

Again, note that the `text` argument/property of `theme` changes all the text sizes. Let’s say we want to change the axis tick text (`axis.text`), legend header/title (`legend.title`), legend key text (`legend.text`), and axis label text (`axis.title`) each to a different size:

```r
gbig = g + theme(axis.text = element_text(size = 18),
                 axis.title = element_text(size = 20),
                 legend.text = element_text(size = 15),
                 legend.title = element_text(size = 15))
gbig
```

Now, we still have the plot `g` stored, but we make a new version of the graph, called `gbig`.

## Make the Labels Be Full Names

To change the x or y labels, you can just use the `xlab`/`ylab` functions:

```r
gbig = gbig + xlab("Latitude") + ylab("Longitude")
gbig
```

We want to keep these labels, so we overwrote `gbig`.

Now, one may assume there is a `main` function in `ggplot2` to give the title of the graph (as in `base` plotting), but that function is `ggtitle()`. Note, there is a `title` command in `base` R, so that name was not overwritten. `ggtitle` can be used by just adding this layer:

```r
gbig + ggtitle("Spatial Distribution of Stations")
```

Note, the title is smaller than the specified axis label sizes by default. Again, if we wanted to make the title bigger, we can change that using `theme`:

```r
gbig +
  ggtitle("Spatial Distribution of Stations") +
  theme(title = element_text(size = 30))
```

I will not reassign this to a new graph, as for some figures in publications you put the title in the figure legend rather than in the graph itself.

## Making a better legend

Now let’s change the header/title of the legend to be number of stations. We can do this using the `guides` function:

```r
gbigleg_orig = gbig +
  guides(colour = guide_colorbar(title = "Number of Stations Reporting"))
gbigleg_orig
```

Here, `guides` takes arguments named the same as the aesthetics from `aes`. Also note that `color` and `colour` are aliased, so you can spell it either way you want.

I like the size of the title, but I don’t like how wide it is. We can put line breaks in there as well:

```r
gbigleg = gbig +
  guides(colour = guide_colorbar(title = "Number\nof\nStations\nReporting"))
gbigleg
```

Ugh, let’s also adjust the horizontal justification, so the title is centered:

```r
gbigleg = gbigleg +
  guides(colour = guide_colorbar(title = "Number\nof\nStations\nReporting",
                                 title.hjust = 0.5))
gbigleg
```

That looks better for the legend, but we still have a lot of wasted space.

## Legend IN the plot

One of the things I believe is that the legend should be inside the plot. In order to do this, we can use the `legend.position` from the themes:

```r
gbigleg +
  theme(legend.position = c(0.3, 0.35))
```

Now, there can be a few problems here:

1. There may not be enough place to put the legend
2. The legend may mask out points/data

For problem 1, we can make the y-axis bigger, make the legend smaller, or a combination of both. In this case, we do not have to change the axes, but you can use `ylim` to change the y-axis limits:

```r
gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  ylim(c(160, max(quakes$long)))
```

I try not to do this, as area has been added with no data information. We have enough space, but let’s make the legend “transparent” so we can at least see if any points are masked out, and to make the legend look a more inclusive part of the plot.

### Making a transparent legend

I have a helper “function” `transparent_legend` that will make the box around the legend (`legend.background`) transparent and the boxes around the keys (`legend.key`) transparent as well. Like `text` before, we have to specify boxes/backgrounds as an `element` type, but these are rectangles (`element_rect`) compared to text (`element_text`).

```r
transparent_legend = theme(
  legend.background = element_rect(fill = "transparent"),
  legend.key = element_rect(fill = "transparent",
                            color = "transparent")
)
```

One nice thing is that we can save this as an object and simply “add” it to any plot we want a transparent legend. Let’s add this to what we had and see the result:

```r
gtrans_leg = gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  transparent_legend
gtrans_leg
```

### Moving the title of the legend

Now, everything in `gtrans_leg` looks acceptable (to me) except for the legend title. We can move the title of the legend to the left hand side:

```r
gtrans_leg + guides(colour = guide_colorbar(title.position = "left"))
```

Damnit! Note that if you respecify the `guides`, you must make sure you do it all in one shot (the easiest way):

```r
gtrans_leg + guides(
  colour = guide_colorbar(title = "Number\nof\nStations\nReporting",
                          title.hjust = 0.5,
                          title.position = "left"))
```

The last statement is not entirely true, as we could dig into the `ggplot2` object and assign a different `title.position` property to the object after the fact:

```r
gtrans_leg$guides$colour$title.position = "left"
gtrans_leg
```

### “I don’t like that theme”

Many times, I have heard people who like the grammar of `ggplot2` but not the specified theme that is default. The `ggthemes` package has some good extensions of theme from `ggplot2`, but there are also a bunch of themes included in `ggplot2`, which should be specified before changing specific elements of `theme` as done above:

```r
g + theme_bw()
```

```r
g + theme_dark()
```

```r
g + theme_minimal()
```

```r
g + theme_classic()
```

## Conclusions

I agree that `ggplot2` can deceive new users by making graphs that look “good”-ish. This may be a detriment as they may believe they are good enough, when they truly need to be changed. The changes are available in `base` or `ggplot2` and the overall goal was to show how the recommendations can be achieved using `ggplot2` commands.

Below, I discuss some other aspects of his post, where you can use `ggplot2` to make quick-ish exploratory plots. I believe, however, that `ggplot2` is not the fastest for quick, basic exploratory plots. Where it is better than `base` graphics is in making slightly more complex exploratory plots that are necessary for analysis, which can take more code in `base`.

### How to make quick exploratory plots

I agree with Jeff that exploratory plots should be made quickly, with a broad range of plots produced with minimal code.

Now, I agree that `plot` is a great function. I do believe that you can create many quick plots using `ggplot2`, and that it can be faster than base in some instances. A specific case would be that you have a binary `y` variable and multiple continuous `x` variables. Let’s say I want to plot jittered points, a fit from a binomial `glm` (logistic regression), and one from a `loess`.

Here we will use `mtcars` and say if the car is automatic or manual (`am` variable) is our outcome.

```r
g = ggplot(aes(y = am), data = mtcars) +
  geom_point(position = position_jitter(height = 0.2)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"), se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red")
```

Then we can simply add the `x` variables as aesthetics to look at each of these:

```r
g + aes(x = mpg)
```

```r
g + aes(x = drat)
```

```r
g + aes(x = qsec)
```

Yes, you could create a function to do the operations above in `base`, but that’s two sides of the same coin: a function versus a `ggplot2` object.

All code is located on my GitHub for my blog.