R is a Language: Treat it Like One

I'm helping out with teaching an introductory R class this week. I figured it'd be a good time to discuss my thoughts on programming in R and how a newcomer should feel about learning the language.

Those Who Teach R Should Use R

At the beginning, many of the students in the class are overwhelmed. They see a bunch of unfamiliar symbols and syntax; being overwhelmed is understandable. Moreover, those teaching you R can seem lightning fast when running code, moving around screens, or figuring out problems, and they usually are. Do you know why?

Those who are teaching you R use R. They use R a lot. They use it daily, usually for hours a day. Don't try to be them immediately; get the basics first.

I hear students saying “Oh, you're fast” when power users help them, usually laced with an undertone of low self-esteem. That's like me going to a basketball camp run by Michael Jordan and saying “Oh man Michael, you're really good on the court”. Be like Mike: work hard to learn the basics like the back of your hand. You'll be dunking in no time.

And, of course, those who teach you are fast. Why would you take a class from someone who is slow or unsure while doing the very thing they're teaching you? Yes, your teachers may be fast, but that's the point. Learn from those who do.

I feel as though I'm relatively fast on my machine now, but I felt this way starting out. I didn't even know what tab completion was when starting my first class. I thought my professor could actually type that fast when writing variable names or directory paths. I thought it was magic. If I hadn't stuck it out, I would never have figured out how to make that magic myself.

R is a language

I took Spanish in high school for 4 years. I remember some vocabulary words and some conjugation rules, but am far from conversational. I've never taken French – I don't know French. Now, if I took an introductory French class for 4 hours, do you think I could speak (or write) French fluently? No, of course not. Yet students think they can with R. My Spanish is like a background in Stata: some words/phrases/commands are similar, others are misleading and can be confusing.

R is a programming language. Just like a foreign language, R has syntax and grammar. You must learn simple punctuation such as the placement of commas, assignment using “=” or “<-”, where to close parentheses, and how to notice when a quote is unmatched. My overall message is:

R is a language, treat it like one.

Remember to tell students to hold themselves to the same level of comprehension as they would for a spoken (or signed) or written language. Hopefully, that will put learning R in perspective, even if it does not make it any less overwhelming. I wonder if Rosetta Stone will make an R module one of these days.


Be Careful with Using Model Design in R

In R, useful functions for making design matrices are model.frame and model.matrix. I will discuss some of the differences in behavior across and within the two functions. I also have an example where I ran into this problem and it cost me time.

Using model.frame for a design matrix

Whenever I use the word “design”, I mean the systematic part of a model; in this case, linear models. For example, if you say

\displaystyle  Y = X\beta + \varepsilon

I'm referring to the X as the design.

model.frame creates a design data.frame of the covariates given, keeping any factor variables as factors with the same levels. Let's create a toy data.frame called df, where Y is a normal random variable linearly related to two variables in the dataset:

n = 100
df = data.frame(X1 = rnorm(n), 
                X2 = rpois(n, lambda = 5), 
                X3 = rnorm(n, mean = 4, sd = 2), 
                Sex = factor(rep(c("Male", "Female"), each = n/2)))
df$Y = with(df, X1 + 3*X2 + rnorm(n, sd = 10))

Now, if Y is included on the left hand side of the formula, then it is included in the output of model.frame as such:

model.df = model.frame(Y ~ X1 + X2 + X3 + Sex, data=df)
head(model.df, 2)
       Y      X1 X2    X3  Sex
1  9.223  0.3849  2 5.960 Male
2 12.467 -0.5061  5 1.651 Male

This gives you a data.frame with the outcome and the covariates fitting that outcome (not including an intercept).

If Y is not included on the left hand side of the formula:

model.df2 = model.frame(~ X1 + X2 + X3 + Sex, data=df)
head(model.df2, 2)
       X1 X2    X3  Sex
1  0.3849  2 5.960 Male
2 -0.5061  5 1.651 Male

then we see that Y is not included in the output of model.frame. Thus, if you want to create a “design data.frame”, then you likely will want to remove Y from the formula.

Note, in both cases, we see that there is no intercept term added to the data.frame and nothing is done to factor variables.
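As a quick check (illustrative only), the columns of the model frame keep their original classes, with Sex remaining a factor:

sapply(model.df, class)
        Y        X1        X2        X3       Sex 
"numeric" "numeric" "integer" "numeric"  "factor"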

Using model.matrix

Most of the time when I make model design elements, I use model.matrix so I can use matrix multiplications to make procedures faster or do “smarter” (i.e. fewer) computations. I will discuss the differences between model.frame and model.matrix using our toy dataset and also discuss one gotcha for using model.matrix with lm.

Let's use model.matrix with and without Y on the left hand side of the formula.

model.mat = model.matrix(Y ~ X1 + X2 + X3 + Sex, data=df)
model.mat2 = model.matrix(~ X1 + X2 + X3 + Sex, data=df)
all.equal(model.mat, model.mat2)
[1] TRUE

We see that using any element on the left hand side doesn't affect the output of model.matrix. Difference #1 from model.frame.

Let's look at the output from model.matrix.

head(model.mat, 3)
  (Intercept)      X1 X2    X3 SexMale
1           1  0.3849  2 5.960       1
2           1 -0.5061  5 1.651       1
3           1 -1.3739  3 3.197       1

We see a column named (Intercept) was added, filled with ones, for the \beta_0 usually in a model. Difference #2 from model.frame. Also, we see that our factor Sex was converted to an indicator (numeric) variable. Difference #3 from model.frame. We only have 2 levels in Sex in this example. In general, when an intercept is in the formula, a factor with L levels will generate L - 1 indicator variables using model.matrix.
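For example, a quick illustration with a hypothetical 3-level factor (not in df) shows two indicator columns being created:

grp = factor(rep(c("A", "B", "C"), length.out = n))
head(model.matrix(~ grp), 3)
  (Intercept) grpB grpC
1           1    0    0
2           1    1    0
3           1    0    1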

Review over, how did this affect me?

I wanted to discuss the differences above to note them if you haven't seen them before. Also, I want to show that using model.matrix and a -1 or 0 in a formula can affect how some of your results are calculated when running linear models with lm. Let's run the model with our now-ready model matrix:

mod = lm(df$Y ~ model.mat)
summary(mod)
Call:
lm(formula = df$Y ~ model.mat)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.644  -8.617   0.448   7.648  30.245 

Coefficients: (1 not defined because of singularities)
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -3.1871     3.9533   -0.81     0.42    
model.mat(Intercept)       NA         NA      NA       NA    
model.matX1            1.1894     1.0987    1.08     0.28    
model.matX2            3.6243     0.5790    6.26  1.1e-08 ***
model.matX3            0.0164     0.5422    0.03     0.98    
model.matSexMale       2.6174     2.2726    1.15     0.25    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.2 on 95 degrees of freedom
Multiple R-squared:  0.312, Adjusted R-squared:  0.283 
F-statistic: 10.8 on 4 and 95 DF,  p-value: 3.03e-07

We see that the intercept term created in model.matrix was made NA because it's identical to the intercept term inherently generated by R and is therefore linearly dependent. This is also flagged in the output: “(1 not defined because of singularities)”. This is good to know, but not revelatory or new; just be aware.
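If you want to avoid the redundant column entirely, one option (a small sketch, not the only way) is to drop the constructed (Intercept) column from the model matrix and let lm add its own:

# drop the constructed (Intercept) column; lm adds its own intercept
mod.drop = lm(df$Y ~ model.mat[, -1])
coef(mod.drop)  # no NA coefficient; same fitted model as mod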

When model.matrix goes … differently

Well model.mat already has an intercept, so why not just take out the intercept term with a -1? The model should be the same, right? I would assume this is the case, but let's do it:

mod.noint = lm(df$Y ~ model.mat - 1)
summary(mod.noint)
Call:
lm(formula = df$Y ~ model.mat - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.644  -8.617   0.448   7.648  30.245 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
model.mat(Intercept)  -3.1871     3.9533   -0.81     0.42    
model.matX1            1.1894     1.0987    1.08     0.28    
model.matX2            3.6243     0.5790    6.26  1.1e-08 ***
model.matX3            0.0164     0.5422    0.03     0.98    
model.matSexMale       2.6174     2.2726    1.15     0.25    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.2 on 95 degrees of freedom
Multiple R-squared:  0.717, Adjusted R-squared:  0.702 
F-statistic: 48.1 on 5 and 95 DF,  p-value: <2e-16

We see the coefficient estimates look exactly the same (except the NA has been removed). But note the r.squared, adjusted.r.squared, and F-statistic values!

Let's focus on R^2:

summary(mod)$r.squared
[1] 0.3121
summary(mod.noint)$r.squared
[1] 0.7168

These are different – way different – which seems off. Why? If you look into the summary.lm code, you will notice that some of the statements involve the expression:

attr(z$terms, "intercept")

and calculate quantities differently depending on whether that attribute evaluates to TRUE or FALSE.

Let's look at our two models from model.matrix again:

attr(mod$terms, "intercept")
[1] 1
attr(mod.noint$terms, "intercept")
[1] 0

We see that when you construct the intercept yourself, this attribute evaluates to FALSE, even though the model has an “intercept”. The model has an intercept, but R hasn't assigned it that attribute. This affects the calculation of the model sum of squares (from summary.lm):

mss <- if (attr(z$terms, "intercept")) 
    sum((f - mean(f))^2)
else sum(f^2)

as well as others. So be aware of this behavior.
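Concretely, here is a rough sketch of the two R^2 computations by hand, following the mss formula above (assuming no weights; the fitted values are the same for mod and mod.noint since both span the same column space):

f = fitted(mod.noint)                  # fitted values (identical to fitted(mod))
rss = sum(residuals(mod.noint)^2)

# model treated as having an intercept: centered model sum of squares
mss.int = sum((f - mean(f))^2)
mss.int / (mss.int + rss)              # should match summary(mod)$r.squared

# model treated as having no intercept: uncentered model sum of squares
mss.noint = sum(f^2)
mss.noint / (mss.noint + rss)          # should match summary(mod.noint)$r.squared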

Conclusion

I was writing something for a linear model that allowed me to compute a large number of regressions (> 1,000,000) on a matrix of outcomes with a fixed adjustment matrix and changing 1 piece of the design matrix. I was doing a voxel-on-scalar regression with covariate adjustment, but also wanted to incorporate the ability to compute the results on a matrix of permutations of Y.

Either way, I ran into a problem checking my results against the output from lm, and it took a while to see why the r.squared values were different while all other elements were the same. I realized that this was because I was constructing my own design matrix using model.matrix and was using -1 when running lm, so those results were not being calculated correctly. Hope you never run into this problem.

Aside: What I wanted to do

Just to be precise, my model was:

\displaystyle  Y = X\beta + Z\theta + \varepsilon

where X was n (n=100) \times V (V=100,000), Z was n \times p (p=5), and Y was n \times 1, but I wanted the model run 1000 times with different permutations of Y. If I simply wanted p-values, I could switch X and Y to get those and run 1000 lm commands versus running 100,000 lm commands. (Doing it that way is not efficient – solving the same matrix inversions repeatedly is time-consuming and redundant.) I wanted R^2 values and \beta coefficients as well, so I needed something more powerful. I know packages such as vows or limma can do these regressions – but they are usually for when the design is fixed and not changing for every voxel, and usually the end result is a p-value.
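As a rough sketch of the kind of matrix trick involved (illustrative only, with small made-up dimensions, and not my actual implementation): with a fixed adjustment matrix Z, you can residualize Y and every column of X once, and then get all of the voxel-specific slopes with a few matrix operations instead of V separate lm calls.

set.seed(1)
n = 100; V = 1000; p = 5                      # small, made-up dimensions
Z = cbind(1, matrix(rnorm(n * (p - 1)), n))   # adjustment matrix (with intercept)
X = matrix(rnorm(n * V), n, V)                # one column per voxel
Y = rnorm(n)

# residualize Y and X with respect to Z (Frisch-Waugh-Lovell)
Hz = Z %*% solve(crossprod(Z), t(Z))
Ry = Y - Hz %*% Y
Rx = X - Hz %*% X

# all V voxel-specific slopes at once
beta = colSums(Rx * as.vector(Ry)) / colSums(Rx^2)

# spot check against lm for the first voxel; should agree up to numerical precision
coef(lm(Y ~ X[, 1] + Z - 1))[1]
beta[1]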

I have it working and may release it into the wild soon. Let me know if you know of anything that will do this, including covariate adjustment, and that can run over a matrix of permuted Y values.

fslr: An R Package Interfacing with FSL for Neuroimaging Analysis

I use a set of neuroimaging tools, but my language of choice is R. One tool I commonly use is FSL (the FMRIB Software Library), from the University of Oxford's Functional MRI of the Brain (FMRIB) group. I wrote some wrapper functions into an R package called fslr and wanted to discuss some of the functions.

FSL – what is it?

FSL has a command line interface with shell-type syntax as well as a GUI (which I generally don't use). It has a lot of functions that are good for general imaging purposes – fslstats and fslmaths have loads of functionality. Using FSL during your pipeline is fine, but I don't like switching between the shell and R too much in an analysis and like to have scripts that don't jump around too much, so I created fslr.

One of the main problems I have is that I read an image into R, manipulate it in some way, and then want to use an FSL function on that manipulated image. This presents a problem because I have an R object and not a NIfTI file, which FSL expects. I also want the result back in R, not in a file that I have to then find. fslr bridges the gap between R and FSL.

fslr functionality

fslr is implicitly linked to the oro.nifti package by Brandon Whitcher. The workflow is this:
1. Read NIfTI data into R using readNIfTI from oro.nifti. This is now an R object of class nifti.
2. Manipulate the data in some way. Maybe you applied a mask or changed some values.
3. Pass this nifti object to a function in fslr, which will write the object to a temporary NIfTI file, run the FSL command, and read the result into an R nifti object again using readNIfTI.

From the user's perspective, it's how it always is in R: pass in R object to R function and get out R object.

Alternatively, if the user specifies a filename of a NIfTI file instead of passing in an R object, fslr commands will simply run the FSL command without the file ever having to be read into R, yet the result can still be returned as an R object. This route may be much faster, especially if the image the function operates on is already on disk (otherwise it would have to be read into R and written back out before the FSL command could run).
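To make the pattern concrete, here is a rough sketch of what such a wrapper does under the hood (illustrative only, not fslr's actual code), using a hypothetical smoothing wrapper built on FSL's fslmaths:

library(oro.nifti)

# Hypothetical wrapper: smooth a nifti object (or a filename) with fslmaths -s.
# Assumes options(fsl.path=) has been set and NIFTI_GZ output is acceptable.
smooth_with_fsl = function(img, sigma = 5) {
  if (is.character(img)) {
    infile = img                         # a filename was passed: use it as-is
  } else {
    stub = tempfile()
    writeNIfTI(img, stub)                # writes stub.nii.gz
    infile = paste0(stub, ".nii.gz")
  }
  outstub = tempfile()
  cmd = sprintf("FSLOUTPUTTYPE=NIFTI_GZ %s/bin/fslmaths %s -s %f %s",
                getOption("fsl.path"), infile, sigma, outstub)
  system(cmd)
  readNIfTI(paste0(outstub, ".nii.gz"), reorient = FALSE)   # back to an R object
}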

outfile and retimg options

Any function that has image output should have two arguments: outfile and retimg. The outfile argument defaults to NULL, which creates a temporary file using tempfile() that is deleted when your R session ends. Alternatively, you can specify a path for the output file if you want it saved to disk.

The retimg argument is a logical indicator for whether you want the output read into R and returned (get it – return image). If retimg=FALSE and outfile=NULL, then the function will throw an error, since the outfile would be a temporary file that gets deleted and no image would be returned. In that case the user could never use the output image from the function, which I believe would only happen by mistake.
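For example, a sketch of the intended usage (with a hypothetical nifti object img and a hypothetical output path):

# return the result as an R object; the output file is just a temporary file
sm1 = fslsmooth(img, sigma = 5, retimg = TRUE)

# save the result to disk and also return it as an R object
sm2 = fslsmooth(img, sigma = 5, outfile = "smoothed.nii.gz", retimg = TRUE)

# save the result to disk only; retimg = FALSE with outfile = NULL would error
fslsmooth(img, sigma = 5, outfile = "smoothed.nii.gz", retimg = FALSE)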

Function call-outs

All functions use the system command in R to execute FSL commands. If you are using an R GUI instead of the command line in a shell, then you will want to specify options(fsl.path=). If you are using the command line and FSL is in your PATH, the path to FSL will be found using Sys.getenv("FSLDIR"), since your shell environment variables will be available. The function have.fsl() provides a logical check as to whether you can run an fslr command. I have not tested this on a Windows machine. I did not use system2 because I ran into some problems, but may change this in a future release for portability. Let's look at an example of using fslr.

Example of fslr commands

Let's check to make sure we have FSL in our PATH:

library(fslr)
options(fsl.path="/usr/local/fsl")
have.fsl()
## [1] TRUE

As with fsl.path, you should specify an output type, usually NIFTI_GZ. See the FSL documentation for more information about output types.

options(fsl.outputtype = "NIFTI_GZ")

Reading in Data

Here I'm going to read in a template T1 MRI brain image (no skull), with 1mm resolution and visualize it.

fname = file.path( getOption("fsl.path"), "data", "standard", "MNI152_T1_1mm_brain.nii.gz")
img = readNIfTI(fname)
print(img)
## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 4 (INT16)
##   Bits per Pixel  : 16
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 4 (MNI_152)
##   Sform Code      : 4 (MNI_152)
##   Dimension       : 182 x 218 x 182
##   Pixel Dimension : 1 x 1 x 1
##   Voxel Units     : mm
##   Time Units      : sec
orthographic(img)

[plot of chunk readin: orthographic view of the template image]

Smoothing

Let's smooth the data using a 5mm Gaussian kernel and view the results:

smooth = fslsmooth(img, sigma = 5, retimg=TRUE)
orthographic(smooth)

[plot of chunk smooth: orthographic view of the smoothed image]

That example is a little boring in that it could be done more easily in FSL directly than with fslr. If the data is manipulated beforehand, in an exploratory way, fslr's usefulness is easier to see. Let's Z-score the image and then keep only the z-scores above 2.

Z-score and threshold

thresh.img = img
thresh.img = (thresh.img - mean(thresh.img))/sd(thresh.img)
thresh.img[thresh.img < 2] = NA
orthographic(thresh.img)

[plot of chunk theshimg: orthographic view of thresh.img]

What? An empty picture? That doesn't make sense. Looking at the histogram of thresh.img we see there is data:

hist(thresh.img)

[plot of chunk thresh_hist: histogram of thresh.img]

So what gives? The orthographic function from oro.nifti uses the slots cal_min and cal_max to determine the grayscale for the picture. fslr also has some helper functions to make it easier to rescale these values so that you can visualize the images again.

Setting cal_min and cal_max

thresh.img = cal_img(thresh.img)
orthographic(thresh.img)

[plot of chunk thresh_ortho: orthographic view after cal_img]

There we go. Looks like some white matter regions. Now let's smooth these z-scores using a 4mm Gaussian kernel.

smooth.thresh = fslsmooth(thresh.img, sigma = 4, retimg=TRUE)
orthographic(smooth.thresh)

[plot of chunk smooth.thresh: orthographic view of smooth.thresh]

Now you can do all your fun statistics and be able to call FSL and keep everything in R! Also for most of the functions, there is a FUNCTION.help function that will print out FSL's help file:

fslhd.help()
## 
## Usage: fslhd [-x] <input>
##        -x : instead print an XML-style NIFTI header

But other packages exist!!

Some may think “Hmm I had a package with that functionality decades ago”. Other packages in R exist and have functions for images. I know this. But I believe:

  1. R should be able to integrate other established and available neuroimaging software. Python is doing this everywhere and seems to be helpful for the Python part of the community.
  2. Some of the functions FSL has are not implemented in R. What package do I call for skull-stripping an image like BET in FSL? (You can call fslbet in fslr, by the way.)
  3. Some operations are faster in FSL than in R. Large 3D Gaussian smoothers take a long time in some R packages.
  4. It allows you to keep your pipeline written in R even if it calls FSL or other software.
  5. If someone wrote it before, I don't want to write it again.

I am writing up a more comprehensive white paper, hopefully in the coming weeks. Don't forget to check out other packages like ANTsR, which is a wrapper for ANTs in R. I'm excited to start using it, especially for its MRI inhomogeneity correction.

Work to be done

I haven't implemented much of the software for FEAT, which is part of FSL's fMRI analysis pipeline. I hope to do that in the future, but would love feedback on whether people want that integration. As always, feedback is welcome on other parts as well.

Making Back-to-Back Histograms

A colleague of mine asked me how to do back-to-back histograms (instead of on top of each other). I feel as though there should be a function for this, like the violin plot from the vioplot package. Violin plots are good for displaying data, but the left and right (or top and bottom) halves of the violin come from the same distribution, and therefore are symmetrical. Many times people want to compare two distributions.

Cookbook for R shows how to overlay histograms (or densities) on top of each other, so go there if that's what you want. (NB: that is the way I tend to compare distributions, especially more than 2. I provide the code below because some have different preferences.)

ggplot implementation:

library(ggplot2)
df = data.frame(x = rnorm(100), x2 = rnorm(100, mean=2))

g = ggplot(df, aes(x)) + geom_histogram( aes(x = x, y = ..density..),
                                             binwidth = diff(range(df$x))/30, fill="blue") + 
  geom_histogram( aes(x = x2, y = -..density..), binwidth = diff(range(df$x))/30, fill= "green")
print(g)
## Warning: Stacking not well defined when ymin != 0

[plot of chunk unnamed-chunk-1: back-to-back histograms]

I simply simulated 2 normal distributions of 100 points each and then plotted them. Note the ..density.. call in the aes for the histograms; this just scales the histogram to a density rather than a count. The -..density.. flips the second histogram around zero so that the two are back-to-back. We see that ggplot doesn't like stacking when you have negative data, but that's OK for this example since the histograms don't overlap.

print(g + coord_flip())
## Warning: Stacking not well defined when ymin != 0

[plot of chunk unnamed-chunk-2: horizontal back-to-back histograms]

Using coord_flip plots the back-to-back histograms horizontally. This code can easily be extended using geom_density; in fact, a violin-plot-like version is in the help for stat_density.
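For example, a mirrored density version might look something like this (a quick sketch using the same df):

g.dens = ggplot(df, aes(x)) +
  geom_density(aes(x = x,  y = ..density..),  fill = "blue") +
  geom_density(aes(x = x2, y = -..density..), fill = "green")
print(g.dens)
print(g.dens + coord_flip())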

Base implementation

Not everyone likes ggplot2, so I figured I would provide an implementation in base graphics.

## using base
h1 = hist(df$x, plot=FALSE)
h2 = hist(df$x2, plot=FALSE)
h2$counts = - h2$counts
hmax = max(h1$counts)
hmin = min(h2$counts)
X = c(h1$breaks, h2$breaks)
xmax = max(X)
xmin = min(X)
plot(h1, ylim=c(hmin, hmax), col="green", xlim=c(xmin, xmax))
lines(h2, col="blue")

[plot of chunk unnamed-chunk-3: base graphics back-to-back histograms]

The code calculates the histogram for each distribution and stores the information. I simply negate the counts of the second histogram to flip it below the x-axis.

Go forth and prosper

You can adjust the axes to show positive numbers (a small sketch of that is below), make more implementations with densities, etc., but this is a simple graphic I've seen people use. Hope this helps someone out.
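A sketch of relabeling the y-axis with positive counts on both sides, building on the base code above:

# redraw without axes, then label the y-axis with absolute counts
plot(h1, ylim = c(hmin, hmax), col = "green", xlim = c(xmin, xmax), axes = FALSE)
lines(h2, col = "blue")
axis(1)
at = pretty(c(hmin, hmax))
axis(2, at = at, labels = abs(at))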

SMART Hackathon: Day 2: Writing Packages in RStudio

So day 2 of the #JHUSMARTHack was last week, but I figured this would be a good time to discuss what was accomplished. I created some packages that are somewhat specialized and aren't fully finished yet, so I'll hold off. What I really want to discuss though is why I like using RStudio for making packages.

What was accomplished?

I had spoken before about the repositories I had worked on, and also about Developing Packages in RStudio. I'll discuss the workflow I settled into for making a package.

Workflow for an R package

I'm assuming your R code is already available, presumably functions you had created during a project or analysis. If code is not available, GREAT! You can start your workflow for your new package or product all the same. I'll try to put command-line equivalents in double brackets [[ ]].

Workflow:

  1. Start RStudio.
  2. Go to File -> New Project. (Save any unsaved work).
  3. Select New Directory. Now this may be counterintuitive if you have work saved, but if you're creating a package choose this. This will set up a new folder and copy over any code you have already created.
  4. Select R Package. This will allow you to name your package (it will be used in the library statement unless changed later); let's call it mypckg. You can also choose code you want to operationalize, e.g. put into a package. Also, select that you want to create a git repository.
  5. Voila! The folder is created where you have the components of an R package, such as the documentation (man), extra install dir (inst), the R code (R), etc. [[library(devtools); create("mypckg");]]

Now, here is one of the main reasons I like using RStudio for projects: the .Rproj file. The .Rproj file is an RStudio project file. It allows you to work on stuff in multiple tabs/scripts, then close the project and get back the other tabs/scripts you were working on before you opened that project. If you are in RStudio, the top right should show Project: None if you don't have a project loaded. These project files allow me to segregate my workflows and scripts, and they help me organize a bit more. I highly recommend checking out Hilary Parker's post before continuing, especially if you're not an RStudio fan.

Using RStudio Build Tools

Now, when I say RStudio Build Tools, I essentially mean wrappers for the devtools package. The package is amazing (hardly shocking since Hadley Wickham is the main author) and, along with the roxygen2 package, allows package creation to be as easy as possible.

Now, let's set up our options for Build Tools. In RStudio, go to Build -> Configure Build Tools (again, you must be in an RStudio project). For Check Package, I recommend adding the --as-cran option (especially if you plan to submit to CRAN). You should also see a checkbox saying Generate documentation using Roxygen. If this is not available, run install.packages("roxygen2"), then close and reopen the project. Check this box, click the Configure button, and I usually check all the options.

Setting up a remote git repository

Earlier, we checked the box for a git repository to be created. Now, you can create a new repository on your favorite remote host; mine is GitHub. You can use GUIs such as the GitHub GUI or SourceTree, but I generally set this up using the Terminal by just adding the remote. (Here is a link to create ssh keys so you don't have to type in passwords for git.) Now, if you restart the RStudio project and go to Build -> Configure Build Tools, you should see the remote repository if you click the Git/SVN tab.

Now that the repository is set up (even if you don't use a remote repository), you can go to Tools -> Commit to commit to the repository. This allows you to add and stage the changed files while adding a commit comment. You can also see a visual history of the differences and changes as well as do much of what you would need to from the command line. Again, I like the Terminal, but I like having this all in one program and not having to switch back and forth.

DOCUMENTATION! EXAMPLES! VIGNETTES!

Now that you have everything set up, you have to do the big things that differentiate a bunch of functions from a package: documentation and examples (including vignettes). Again, for documentation, we'll be using the roxygen2 package. Roxygen is essentially a format where a comment line starts with #', followed by @ and a “tag”. The tags can be found at ?rd_roclet. Now, I highly recommend vignettes, but I'm not an expert on these and think we'll just stick to function docs right now.

Jumping to Sublime Text

Before we start documentation, let me again describe MY workflow rather than Roxygen itself. My workflow now jumps to Sublime Text. I have Line-by-line installed (which you will need), which allows me to run a script to parse an R function and create the necessary Roxygen tags. See Alyssa's post for the description and a more command-line workflow for R packages.

Now that we're in Sublime Text, I open the .R files from my package's R directory. Select the function definition, such as x = function(z, y, l=4, ...){, and use CMD+D to create Roxygen tags! This is like meta-programming for documentation: running scripts to make documentation (granted, it's in Python). As an aside, one powerful feature of this documentation is that if you have code such as:

LFPCAg <- function(
Y,# an n x D matrix
                   # Y is assumed to be centered by its mean function
                   gridpoints = 1:ncol(Y),       # a vector of grid points along curves, corresponding to columns of Y 
                   Zlist=NA,
                   G=NA,
                   Ivec=NA,
                   ...){

this will parse the comments for each argument/parameter into the Roxygen tags (even if multi-line!):

#' @title <brief desc>
#'
#' @description <full description>
#' @param Y an n x D matrix Y is assumed to be centered by its mean function
#' @param gridpoints a vector of grid points along curves corresponding to columns of Y
#' @param Zlist
#' @param G
#' @param Ivec
#' @param ... 
#' @export
#' @keywords
#' @seealso
#' @return
#' @aliases
#' @examples \dontrun{
#'
#'}

This also puts in your mind that, even if you're only creating functions and not a package, you'll almost have ready-made documentation if you use this function format from day 1.

Jumping Back to RStudio

Now, opening these Roxygen-tagged functions, I can fill in the rest in RStudio. One thing to note is that RStudio will assume you're trying to stay in Roxygen notation when you hit return (which is great for multi-line descriptions/titles/etc.). Also, if you have #' @ starting a line, then RStudio will do tab completion of Roxygen tags. Not leaps and bounds saved on time, but hey, I like tab completion.

Now you have to write your examples, the descriptions of arguments (denoted as parameters), the overall function description and title, and use @export to make this function accessible to the user. One note: if you depend on another function or package, use the @import pkgname or @importFrom pkgname funcName tags. R CMD check will warn you if you don't have anything in @keywords, @aliases, or @examples, so remove these if not necessary.

Just let me check my functions!

If you're still working on the package and want to play with functions and not so much the documentation, you can use Build -> Load All [[devtools::load_all]] to load the functions (even those not exported) into memory.

Compile and Load

Now let's fast-forward to when you have created the documentation for your functions. While still in your project, go to Build -> Build and Reload to get your package loaded into memory [[devtools::build then devtools::install]]. Roxygen will create the docs. FYI – if you change around function names and recompile, the man folder may have obsolete .Rd files, so you can delete old ones.
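For reference, the devtools command-line equivalents look roughly like this (a sketch; paths assume you are one directory above the package folder):

library(devtools)
document("mypckg")   # run roxygen2 to (re)build the man/*.Rd files and NAMESPACE
load_all("mypckg")   # quickly load all functions (even unexported ones) for testing
check("mypckg")      # run R CMD check on the package
build("mypckg")      # build the package tarball
install("mypckg")    # install it so library(mypckg) works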

You should edit the DESCRIPTION file to change some specifications, such as the Depends: field for package dependencies. What goes in there is documented in many places on the web.

Now edit your functions and docs, push to the remote repository, and then let people install your package by using:

library(devtools)
install_github("mypckg", "myGitHubUserName")

and there you have it – you've released software. Build -> Check Package is good for testing your package (it will test your examples) and making sure everything looks OK.

Conclusion

R package creation seems like a daunting task. You can use tools in RStudio such as Code -> Extract Function to take loads of code and try to functional-ize it. When you have a collection of functions, creating an RStudio project allows you to separate your package creation process from regular RStudio analysis and use, and lets you have a one-stop shop for git version control, building, and checking of packages. It lets you get over any hurdles of learning new functions in devtools (which may not be a good thing) and gets you running in a short amount of time. The Sublime Text plugin is not a crucial step, but can allow you to parse semi-documented functions and create a Roxygen header that's partially filled in. This workflow allowed me to develop multiple projects and get them documented quickly at the hackathon.

Hopefully this helps and good luck packaging!

SMART Hackathon: Day 1

So day 1 of the #JHUSMARTHack 2014 is over and day 2 is underway. It's been awesome. One student estimated he did “one week's worth of work”. Some interesting (I think) things I thought about:

  1. Flooding rain is good weather for coding. (As long as no sinkholes arise)
  2. The Admiral Fell Inn is a pretty sweet place to host this event (and supposedly reasonably priced).
  3. A lot of other conferences have Hackathons, such as OHBM. We discussed the potential for these at ENAR and JSM.
  4. Headphones are good and bad at times. They can block out any talking but they also block out any cool discussions.
  5. I'm used to coding by myself a lot, which is a hard habit to break out of, as is knowing when to ask for help.
  6. Shiny apps are easy to implement and cool products. (Not really new for me, but worthwhile to say).
  7. It's good to have someone that knows languages you don't, such as Python.
  8. You want to have example data that your packages can use before you come, i.e. data that is IRB-approved for distribution.
  9. Doing some prep beforehand, like reading through Developing Packages with RStudio, is helpful. (I think Roxygen2 should be ON by default.)
  10. Using -- in Mac OSX when trying to put in options for R CMD check, such as --as-cran, turns -- into an em-dash (which fails on the check).

Check out my github repos (posted yesterday) for progress. I've added the most functionality to WhiteStripe (white matter “segmentation”) and fslr (a wrapper for FSL).

Today, I'm looking at the knitr vignettes, amongst others, to make my first vignette for these. I'm very excited I can write in R Markdown with these. Happy hacking!

SMART Hackathon 2014

Today is our SMART group's 2014 Hackathon! We will be at an undisclosed location, just plugging away at code/packages/apps. The mission statement:

There's no structure except that you have to work on a software
project and you have to present what you accomplished or didn't at a
subsequent SMART meeting.

We will be tweeting at #JHUSMARThack for the next two days.

I personally will be working on a suite of imaging tools for R for MS lesion imaging, white matter segmentation, CT image analysis, and a package integrating FSL and R: fslr.

I hope to get a lot of commits and some feedback on the package from the group. Let me know if you want to check any of them out and if you want any functionality with them!