Writing Accountability Groups (WAGs)

Recently, 7 other students and I joined the first writing accountability group (WAG) for students in our biostatistics department. Johns Hopkins has a set of WAGs all around campus, but we are the first student-only group.

What's a WAG?

I posted before about strategies from How to Write a Lot, based on Paul Silvia's book of the same name. One of his recommendations (especially for junior faculty) was to start what he calls an “agraphia group”, where you discuss goals, go on your way, and then come back and discuss whether you met those goals or not. He seemed rigorous in this regard (and I believe in this): if you came 2 times saying “I didn't do anything”, you were out.

The WAG is the implementation of such a group. The group consists of 3 parts:

  1. 15 minutes of discussing previous goals and goals for the 30 minute writing session.
  2. 30 minutes of “writing” – where writing means accomplishing the goals set in 1.
  3. 15 minutes of discussing if the goals were met from the 30 minute session and setting goals for the next week.

The WAG Makeup

The group consists of 4-8 participants. Don't have more than that – you won't have enough time in an hour and it will become unruly. Also, it's good when you can spot in a second who is missing. It runs for 10 weeks (it can run longer), which gives a good, concrete timeframe. This inhibits participants from saying “I'll come next time”. There are only 10 sessions, so show up – a maximum of 2 misses is allowed.

Why did we start the group?

Now, why did we start this group – why not “just write”?

  1. Peer pressure can be used for good. No one wants to say “I've done nothing”. Even if you get 1 paragraph done as a last-ditch effort before the meetings, it's > 0 paragraphs.
  2. Goals are hard to set; it's easier with others fleshing out your ideas.
  3. You are writing for 30 minutes even if you get nothing else done that week.

Overall Goals

Types of overall goals our group is trying to accomplish:
1. Write a paper.
2. Edit a paper for review/submission.
3. Write background/significance section on a grant.
4. Write more in a blog.

30-minute Session Goals

These are large and somewhat abstract. In a goal-setting session for the 30-minute writing block, we set more concrete goals, such as:
1. Write 2 paragraphs of the results section for a paper.
2. Make a patient demographics table.
3. Incorporate all comments from collaborator 1.
4. Get 1 polished paragraph for the background section.
5. Write 3 sections of a blog post; post by end of day.

15-minute End Goal Creation

Examples of these goals are:
1. Write 2 pages of paper 1.
2. Make patient demographics table and 3 figures.
3. Incorporate all comments from all collaborators and format for journal.
4. “Lock down” background section as done.
5. Post blog post, write second for editing next Monday.

How do I start one?

If you are at Hopkins, please visit the WAG facebook group above. If not, just get a few people together and follow the above setup. Make people commit to a timeline that is short, concrete, and has a clear end (maybe even sign a pseudo-contract so they know what they're getting into). I recommend buying a few copies of How to Write a Lot (as well as other books on writing) to share and discuss.

Additional WAG-related Time

The students in our group have also noted that the 30 minutes of writing in our WAG is helpful and that an additional 30 minutes would be as well. We are testing an additional writing “meeting” and will try to determine its effect, adapting the format based on our WAG (e.g. 10 minutes of goals, 45 minutes of writing, 5 minutes of end goals).


Statisticians in Neuroimaging Need to Learn Preprocessing

I have worked in neuroimaging for the past 4-5 years. I work on CT scans for stroke patients, but have also worked on fMRI and some structural image analyses. One important thing I have learned: (pre?)processing means a lot.

Take a note from Bioinformatics

In some respects, bioinformaticians had the best opportunity when sequencing became more affordable and the data exploded. I say they had it good because they were the ones who got the (mostly) raw data and had to figure out how to analyze it. That's not to say, in any way, that it was easy to figure out correct analysis methods, develop an entire industry from the ground up, or jump into big datasets that required memory far beyond what laptops could handle. The reason I say they had it good is that the expectation is for those working on the data (e.g. (bio)statisticians) to know (and usually agree with) how the data were (pre)processed.

You trust that data?

I remember a distinct conversation at a statistics conference when I spoke to a post-doc, trained in statistics, who worked in imaging. I sat next to him/her and asked, “How do you preprocess your data?” The response: “Oh, I don't know; my collaborator does it and I work on the processed data.” I was confused. “You trust that data?” I thought. I have heard that answer more times since then, but I increasingly hear of more people getting involved in the whole pipeline: from data collection to analysis.

An analogy to standard datasets

To those who don't work in bioinformatics or imaging, I'll make this analogy: someone gives you a dataset and one column has been transformed in a non-linear way, but those giving you the data can't really tell you how it was transformed. I think it would be hard for many to trust and accept that data. My biostat training has forever given me data trust issues. It's hard for me to trust people who give me data.

Questions I usually rattle off rapid-fire:
1. How was it collected?
2. Why is this missing?
3. Why is this point really weird?
4. What does -9/999/. mean?
5. Where is your codebook?
6. Is that patient information!? Ugh. I'm deleting this email and you can remove that and resend. Better yet, use DropBox. NO – don't keep the original with the patient info in there!

It sounds more like an investigation rather than a collaboration – I'm working on changing that. But I was trained to do that because those things are important.

Back to imaging

Many times, though, this is exactly how a dataset is given to a statistician: the images were processed in some way, and sometimes registered to a “template” image in a non-linear way.

Why do I think that this happens more often in neuroimaging? (It probably happens in bioinformatics too but I won't speak to that). I think it's because

Preprocessing is uninteresting/hard/non-rewarding/time-consuming

Moreover, I believe there is a larger

MISCONCEPTION that preprocessing is not important.

When I started in my fMRI lab, they had me preprocess data BY HAND (well, by click, but you get the idea). They had me go through each step so that I understood what that step did to the data. It made me realize why and where things would go wrong and also taught me an important lesson: decisions upstream in the processing can have tremendous effects downstream. I am forever grateful for that.

It also taught me that preprocessing can be a boring and painstaking business. Even after I got to scripting the preprocessing, there were still inherent manual checks. For example, if you co-register (think matching my brain and your brain together) images, you want to make sure it works right. Did this brain really match up well to that brain? There are some methods that try to estimate quality, but almost everyone has to look at the images.

Statisticians are trained to look at data, so we should be USED TO THIS PRACTICE. The problem is that 1) if it works, the response is “OK, next” and you feel like time was wasted (it wasn't), or 2) if it doesn't, you have to fix it or throw away the data, which can be painful and take a long time.

How long do you think looking at one scan would take? OK, looking at 1-2 scans is quick, but what about 100? What if I said 1000?

Before I discuss trusting collaborators let me make my message clear:

Even if you don't do the preprocessing yourself, you should know what preprocessing was done on your data. Period.

In my eyes, speakers lose a lot of credibility if they can't answer a few simple questions about how their data got to the analysis stage. Now, I haven't remembered every flip angle we've used, but I knew for sure whether the data were band-pass filtered.

Trusting Collaborators

Here comes the dilemma: get in the trenches and do hours of work preprocessing the data or get the data preprocessed from collaborators. I say a little of both. Sit down with one of the people that do the preprocessing and watch them/go over their scripts with them. Ask many questions – people may ask you these questions later.

The third option, and one I believe we strive for in our group, is to develop methods that require “light preprocessing”. That is, do things on a per-image or per-person level, derive measures, and then analyze these (usually low-dimensional) measures for groups/populations.

There are some steps that are unavoidable. If you want population information on a spatial brain level, you'll likely have to register/warp images to a template. But if this is the case, do some “procedure sensitivity analysis” – try a couple of different registration/warping procedures and see how sensitive your results are. If they are highly dependent on the registration, you should be sure the one you chose is “correct”. Dr. Ani Eloyan just had a paper accepted on this very topic, titled “Health effects of lesion localization in multiple sclerosis: Spatial registration and confounding adjustment”, to come out in PLoS ONE in the next month. If others are doing the processing and you don't know how it's done, it can be hard to figure out the right questions to ask. So learn.
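As a rough sketch of what such a sensitivity analysis could look like (purely illustrative, with simulated stand-in data and base R only), suppose you have voxel-wise group-difference estimates from two different registration procedures and want to see how much they agree:

# Hypothetical voxel-wise group-difference estimates from two registrations.
set.seed(20150101)
n_voxels = 1000
reg_A = rnorm(n_voxels)                     # stand-in results under registration/warping procedure A
reg_B = reg_A + rnorm(n_voxels, sd = 0.25)  # stand-in results under procedure B

# How much do the two sets of results agree?
cor(reg_A, reg_B)                  # overall agreement of the estimates
plot(reg_A, reg_B); abline(0, 1)   # look at the data, as always
mean(sign(reg_A) == sign(reg_B))   # do the procedures agree on the direction of the effect?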

Moreover, ask collaborators about the data they threw out along the way. Was it all the females? All under 5 years old? All the people who move too much? Don't stop asking the questions about missing data and potential biases lurking in those discarded (costly) images.

Large Benefits of Learning Preprocessing

Each preprocessing step is used for some sort of goal: to correct this or to normalize that, etc. Thus, there is an industry of developing and checking preprocessing steps. So not only can you help to develop statistical models for the data, you can also develop methods that may improve processing, check whether preprocessing steps are good, or test whether one preprocessing method is better than another (this would be huge). If you don't know about the processing, you're missing out on a large piece of the methodological work that can be done.

Conclusion

Learn about preprocessing. It's part of the game with imaging. This may scare some people – good. Let them leave; there are plenty of questions and problems for the rest of us. Looking at brain images (and showing your friends) is still pretty cool to me; that's why I'm still in the imaging game. But I needed to learn the basics.

One warning: if you do know how to process data, people will want you to do it for them. Try as best you can to fight this and instead train others how to do their own processing and convince them why it is useful.

Warning: Shameless Plug

At ENAR 2015, Ciprian Crainiceanu, Ani Eloyan, Elizabeth Sweeney, Taki Shinohara, and I will be presenting a 1-hour-45-minute tutorial on converting raw images, reading data into R, and some basic preprocessing methods.

Converting LaTeX to MS Word

Last year, Elizabeth Sweeney wrote about how she converts LaTeX to Word. If you're looking for all-open-source solutions to this problem, start there.

I was writing in LaTeX as well, but I had a journal that only accepted Word documents, so I had to convert from LaTeX to Word.

Same story, different day

I tried a lot of the solutions from StackExchange: latex2rtf, pandoc, TeX2Word, etc.

I think the best quote there is

There is no pain-free way to do this. Really.

And no, nothing really worked very well straight out of the box. My solution was hacky as well, but it worked the best for me while preserving a reasonable amount of formatting. The biggest problems were, not surprisingly, the equations. Some solutions garbled the text everywhere; others created image files of the equations that were embedded in the document.

What I used

My pipeline converted PDF to Word (.docx) using Adobe Acrobat Pro. It is a solid, though somewhat pricey, program that is relatively cheap for our students. The conversion handled headings/sections about as well as the solutions above, but the equations were “converted” to pseudo-equations; when I highlighted them in the Word doc and clicked Insert > Equation – voila! The equations looked pretty good (aka usable).

I would say, though, that the conversion was not perfect. I had some odd problems with superscripts and such, and ended up uploading the .docx to Google Drive since we were going to try editing together, though that never worked out. I did notice that the Google Drive document fixed many formatting issues the OCR from Acrobat Pro had caused. I downloaded it from Drive and the formatting issues were fixed…but the equations were a mess again! I ended up just copying and pasting the equations from the pre-upload .docx into the Google-Drive-converted .docx. That's where I had my best results.

Someone Please Stop the Madness!

Is this a good pipeline? No. Did it technically “work”? Yes. Why did I go through LaTeX in the first place? Well, 1) I didn't know they only took Word. This is my fault, but it could easily have been my 2nd submission of the paper, after another journal that accepted PDFs. 2) I had equations. MS equations, even though they “accept” LaTeX – no. 3) I know how to get LaTeX to format things well enough.

I will go on 1 rant and then discuss some light in this darkness.

Rant

Seriously, journals, you only accept Word documents? What is this bullshit? The journal even accepted PDFs in the supplements. Everyone can read and annotate PDFs nowadays; get rid of Word requirements.
I imagine this perpetuates because 1) it's easy to use Word's word count and say “that's how many words you sent”, even though that's ridiculous (you included references in your word count!?), 2) the editors/typesetters have used it for print journals, and 3) some reviewers only use Word and cannot annotate PDFs.

You know what – get rid of the reviewers from 3). You're reviewing cutting-edge research and didn't keep up with a technology that pretty much every journal uses for papers. Maybe you aren't the best person for that job.

Light up the Darkness

RStudio released the Knit to Word button in its recent versions. Now, many people who use pandoc, as discussed before, knew how to do this on some level. The big differences for me are that 1) I never thought to write only in R Markdown and skip LaTeX altogether, 2) it's a click-button in RStudio, which means more people will use it, and 3) I can switch between PDF and Word with one click. With citation style files and knitcitations, I think I can get close to LaTeX references and automated reporting.
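For example (a minimal sketch, assuming an R Markdown file named report.Rmd, the rmarkdown package, and a LaTeX installation for the PDF output), switching formats from the console is one line per format:

library(rmarkdown)

# Render the same R Markdown file to Word or to PDF; the Knit buttons in
# RStudio call rmarkdown in essentially this way.
render("report.Rmd", output_format = "word_document")
render("report.Rmd", output_format = "pdf_document")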

Next post to follow up on this.

How to Write a Lot

I recently finished reading How to Write a Lot: A Practical Guide to Productive Academic Writing by Paul Silvia. Hilary Parker recommended this book a few years ago and I just got around to reading it. I highly recommend it: it's not expensive ($10 on Amazon) and it's free if you swing by my office and borrow my copy. In this post I want to summarize some of the key points and reflect on my experience after trying the strategies recommended.

Make a Writing Schedule and Stick to It

If you didn't read the section header, let me reiterate:

Make a Writing Schedule and Stick to It

Silvia argues that making a schedule and sticking to it is the only strategy that works for writing. Though this one statement summarizes the book's message, you should still read it. The title of the book describes it as “A Practical Guide to Productive Academic Writing”. The book tells you not only that making a writing schedule and sticking to it is what you must do to write, but also how to do it. In addition, chapter 2 – my favorite – lists specious barriers to writing (aka excuses) that people (including me) make that stop them from even starting to write. This chapter helps you realize that no thing or person other than you is stopping you from writing.

Outline Your Writing

One of the things I've heard about writing since grade school, and that I'm still not good at doing, is outlining. You wouldn't build a car, toy, or building without a schematic, concept, or blueprint. Write an outline first – it can change later – before the full text.

Make Concrete Goals and Track your progress

As a biostatistician I'm trained to look at data – all kinds of data.

Whenever someone makes a claim, I reflexively think “Where is the data to back up that claim?” When I say I'm going to write more, I need data. Therefore, I have to track it, even if only for myself. Silvia promotes a database or Excel spreadsheet for tracking. Though I tend to discourage Excel for data collection, Excel is not a bad option for this single-user, single-use purpose. Pick something that's easy for you to use for tracking and format the data so it can be analyzed with statistical software. I will track my progress and may report the results in another post.

Similar to the stage of outlining your manuscript, tracking your progress takes planning. At the beginning of a session, you must set your goals (plan) and then record whether you met those goals or not. The goals must be concrete. “Write X paper” is not concrete; “write 100 words on paper X” is. You don't have to – and probably shouldn't – write 10,000 words in a session. Goals don't have to be actual “writing”; doing a literature review, editing a paper, incorporating comments, or formatting are all part of the writing process. Your scheduled time is when you should do all of these parts of writing.
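As a small illustration (my own sketch, not a format prescribed in the book), the tracking data can be as simple as one row per session, which is then trivial to summarize in R:

# Hypothetical tracking data: one row per writing session, recording the
# concrete goal, whether it was met, and (optionally) a word count.
writing_log = data.frame(
    date  = as.Date(c("2015-01-05", "2015-01-06", "2015-01-07")),
    goal  = c("100 words on paper X", "outline methods of paper Y", "edit intro of paper X"),
    met   = c(TRUE, FALSE, TRUE),
    words = c(250, 0, 120)
)

mean(writing_log$met)    # proportion of goals met
sum(writing_log$words)   # total words written so far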

Start an Agraphia (Writing) Group

Writing is hard – friends help. Silvia calls the writing group an agraphia group, as agraphia is the loss of the ability to communicate through writing. Peer pressure exists, even in graduate school; use it to your advantage. You are not alone in your fear/disdain of writing, and your fellow grad students are a valuable resource. Having a bunch of “Not met” goals on your progress sheet is different than telling someone that you didn't meet your goals 3 weeks in a row. No one wants to feel like a failure, so this positive peer pressure will push you to perform.

Also, editing papers can be boring; you've been with the topic and paper for so long it's no longer exciting to you. To others, it's usually novel and easily seen as great work. Mistakes and unclear thoughts can be corrected. You may think something is clear, but fresh eyes can determine that for sure. Use your group to peer edit. You can use this editing to find out what your classmates/colleagues are doing in their research as well.

You (and the Rest of Us) will get Rejected

Your paper will likely get rejected. Now you can submit to the Journal of Universal Rejection, and you'll have 100% guarantee of rejection. For some journals, that may not be much higher than their actual rejection rates. If you get rejected, you're in the majority. Silvia notes that getting a paper back for revision is a good thing – it passed the level of flat-out rejection. I didn't always see it that way before. Moreover, Silvia says to write assuming your paper will be rejected. He says that this will make your writing less defensive and better. So you'll get rejected, but remember:

  1. Take the criticism constructively – most reviewers want to make your paper better. Realize that.
  2. Be quick and methodical with revisions. Revisions are a higher priority than first drafts. Make sure you respond to all comments or explain why you haven't incorporated some of the reviewers' comments.
  3. Don't let mean reviewers get you down. One quote I remember from a friend when I was younger that stuck with me: “I gotta be doing something right – cause I got HATERS!”. Let them fuel your hate fire. If you've ever heard the phrase “dust your shoulders off” and didn't know where it came from, read this. If you can revise, incorporate their comments. Getting angry or writing angry letters just wastes time you could spend doing more writing on your topics.
  4. If it wasn't clear to the reviewers, it's not clear to the readers.

NB: What I could find for statistical journal acceptance rates: http://www.hindawi.com/journals/jps/, http://imstat.org/officials/reports/AnnualReports2010.pdf, and http://www.hsph.harvard.edu/bcoull/ENARJrWorkshop/XLPub2006.pdf.

Reflections

Time is a Zero-sum Game (or is it a flat circle?)

The time in a day is fixed and finite; each day is as long as the others. One of my friends and fellow Biostat grad students Alyssa Frazee likes to say frequently that “Time is a zero-sum game” in the sense that the activities we do now take up time that could be used for other activities.

As a result, I often ask myself “When is it okay to relax?” This feeling is common when I am writing a paper. Scheduling relieves much of the stress of deciding when I am supposed to write. Meeting the goals for the day allows me to let go more easily and feel that it is OK to relax once my duties are done. There are fringe benefits to making a schedule.

I Write More

Again – I've only done it for about 2 weeks, but I feel as though I'm getting more done for my papers and writing more. The data will tell, and I don't know if I have a good comparison sample.

Don't Stop Writing

At one point, Silvia notes that you should reward yourself when you meet your goals, but that the reward should NOT be skipping a writing session. He likens it to rewarding yourself with a cigarette after successfully not smoking for a period of time.

Conclusion

Creating a writing schedule is easy; sticking to it is hard. Try it for yourself and read his book. I think you'll be writing more than you think with this strategy. Let me know how things turn out!

Extra Links for Writing

Typinator: Text is Better Expanded

Last year, Aaron Fisher spoke at a computing club about a text expander named Typinator. In the past year, I have used it for the majority of my LaTeX and math writing and wanted to discuss a bit why I use Typinator.

Seeing your Math symbols

The main reason I use Typinator is to expand text to unicode – symbols such as β instead of writing \beta in LaTeX. When I say “expand text”, I mean I type a string that I set in Typinator and it replaces that string with the symbol or phrase that I designated as the replacement. I type :alpha and out comes an α symbol.

Why should you care

Writing \alpha or :alpha saves no time – it's the same number of characters. I like using unicode because I like reading, in the LaTeX source:

Y = X β + ε

instead of

Y = X \beta + \varepsilon

and “the β estimate is” versus “the $\beta$ estimate is”. I think it's cleaner, easier to read, and easier to edit. One problem is: unicode doesn't work with LaTeX right off.

pdflatex doesn't show my characters! Use XeLaTeX

Running pdflatex on your LaTeX document will not render these unicode symbols out of the box, depending on your encoding. Using the LaTeX package inputenc with a command such as \usepackage[utf8x]{inputenc} can incorporate unicode (according to this StackExchange post), but I have not used this approach, so I cannot confirm it.

I use XeLaTeX, which has inherent unicode support. In my preamble I have

\usepackage{ifxetex}
\ifxetex
  \usepackage{unicode-math}
  \setmathfont{[Asana-Math]}
\fi

to tell the compiler that I want this font for my math. I then run the xelatex command on the document and the unicode α symbol appears in the PDF and all is right with the world.

You can also use XeLaTeX in your knitr documents in RStudio by going to RStudio -> Preferences -> Sweave Tab -> “Typeset LaTeX into PDF using” and changing this option to XeLaTeX. Now you're ready to knit with unicode!

Other uses for Unicode than LaTeX

If you don't use LaTeX, the information above is not relevant to you, but Unicode can be used in settings other than LaTeX. Here are some instances where I use Unicode outside of LaTeX:

  1. Twitter. Using β or ↑/↓ can be helpful in conveying information as well as saving characters or writing things such as 𝜃̂.
  2. E-mail. Using symbols such as σ versus \sigma are helpful within Gmail or when emailing a class (such as in CoursePlus) for conveying information within the email compared to attaching a LaTeX'd PDF.
  3. Word Documents. I don't like the Microsoft Word Equation Editor. By “don't like” I mean get angry with and then stop using. Inserting symbols is more straightforward, and using a text expansion is easier than clicking them on the symbol keyboard.
  4. Grading. When annotating PDFs for grading assignments, many times I use the same comment – people tend to make the same errors. I make a grading typeset where I can write a small key such as :missCLT for missing the Central Limit Theorem in a proof so that I type less and grade faster. Who doesn't want to grade faster?
  5. Setting Variables. I don't do this nor do I recommend it, but technically in R you can use unicode to set a variable:
σ = 5
print(σ)
## [1] 5

My Typinator sets

My set of Typinator keys that you can download and import into Typinator are located here.

  1. Math Symbols for Greek and other math-related symbols. (This was my first typeset so not well organized.)
  2. Bars for making bars on letters such as 𝑥̄.
  3. Hats for making hats on letters such as 𝜃̂.
  4. Arrows just ↑ and ↓ for now.

NB: GitHub thinks the .tyset file is a folder and not an object, so the .txt files are here for Math Symbols, Bars, Hats, and Arrows, which can be imported into Typinator.

If you comment, be sure to use a Unicode symbol.

Non Academia: Starting a Scholarship for My High School

FYI: This post is non-academia in nature. TL;DR: Some friends and I created a scholarship for our high school and you should too.

What happened?

My high school alma mater is Sun Valley, located in Aston, Pennsylvania, my hometown. Last year around this time, I saw that two brothers had started a $1,000 micro-scholarship for their former high school (also in Pennsylvania). I loved the idea: be the other person at the scholarship award ceremony, the one giving the award.

A few friends of mine said why not, let's make our own scholarship: the Vanguard (our mascot) scholarship. With the help of one of our friends who happens to work in the guidance department at Sun Valley, we put together a requirement list and a description of the questions to answer.

The questions we asked were:

  1. If you could go back and go through high school over again, what would you do differently?
  2. What are the best and worst qualities of your generation?
  3. How did growing up in Delco (Delaware County) shape you?

How did it work out?

Overall, we collected $200 per person just for the starter year and hope to do some fundraising events next year. We received a small number of applicants, due to the inclusion criteria and the timeliness of our submission (we were late). I think it was a success and want to just make some notes if you want to try this out at your school.

  1. You'd be surprised how many people are willing to donate money to their alma mater high school.
  2. Begin early and work with the staff at school. They have done this before, many many times.
  3. Make the inclusion criteria liberal – I surprisingly liked reading the responses from students and would have liked more.
  4. Even if you cannot or do not want to donate money, you can always go back and donate your time. We have heard (and even thought in high school): “What has anyone done who graduated here?” This feeling may be more common in public schools, where alumni boards are non-existent or not as present as at private schools. Be a role model. You (literally) were in these students' seats not too long ago. Show them they have connections and can see someone who has succeeded. Don't sell yourself short – you've likely succeeded in far more ways than you perceive.
  5. Don't listen to haters. I believe if people are putting you down or saying that what you're doing is “dumb” or something else derogatory, they've got some stuff going on. Leave them and their words be.

But I already donate to my college!

Many people that have jobs at my age give infrequently or not at all to their undergrad alma mater. I get calls from the University of Scranton, my undergrad, to donate and give back. The callers are students at the university and tell me about what they are working on and how the school has changed.

Some people may give you the “I already donated to my college” line. I think the demographic that donates to their high school is different from the one that donates to their college. Our high school is public, so I think you get less of the “I already paid thousands to my university” dismissal. I like to think my schools had a strong hand in making me the man I am, and I would like to repay them.

How (and why) to start

Just think about this before reading the list and going on your way: saying you created a scholarship is an awesome thing to put on your resume and a conversation starter; it separates you from many others.

  1. Just get 5-10 people who are willing to donate a moderate amount of money. It doesn't have to be sizable; many scholarships are under $300 – that's only $30 per person if you have 10 people.
  2. Talk to your school and ask about any paperwork you need to fill out.
  3. Figure out an inclusion/exclusion criteria.
  4. Make a prompt/question for students to answer. Allow cool submissions like videos, Vines, etc.
  5. Select your winner. We just used Survey Monkey to have each person rank the applicants and chose the best-ranked applicant as the winner. Ties were broken by the total number of #1 votes (this happened).
  6. Give the prize. Try to get your donors to the awards banquet.

We will also ask students to give any updates on post-high-school endeavors so we can see how our recipients are doing.

Side note: Network and political capital

I am not very political in nature, but for those who are, this is a great way to meet and network in your area. Many of the local political and non-political organizations and clubs donate scholarships to students. This gets you a seat (literally) at the table. Also, most people who donate are successful, and you can enhance your network.

Creating Smaller PDFs in R

Making plots with many points

Whenever I make a lot of plots in R, I tend to make a multi-page PDF document. PDF is usually great: it's vectorized, which means it will scale no matter how much I zoom in or out. The problem is that for “large” plots – those with a lot of points or lines, or that just generally have a lot going on – the size of the PDF can become very big. I got a new laptop this year and Preview still pinwheels (for Windows people, hour-glasses) and freezes for some PDFs that are 16Mb in size.

Alternatively, for one-off plots, I use PNGs. What if I want multiple PNGs but in one file? I have found that for most purposes, creating a bunch of PNG files, then concatenating them into a PDF gives smaller-size PDFs that do not cause Preview or Adobe Reader to choke, while still giving good-enough quality plots.

Quick scatterplot example: using pdf

For example, let's say you wanted to make 5 different scatterplots of 1,000,000 bivariate normal points. You can do something like:

library(rbenchmark)  # provides benchmark(), used for timing below

nplots = 5
x = matrix(rnorm(1e+06 * nplots), ncol = nplots)
y = matrix(rnorm(1e+06 * nplots), ncol = nplots)

tm1 <- benchmark({
    pdfname = "mypdf.pdf"

    pdf(file = pdfname)
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    dev.off()
}, replications = 1)

The syntax is easy: open a PDF device with pdf(), run your plotting commands, and then close your pdf device using dev.off(). If you've ever forgotten to close your device, you know that you cannot open the PDF and you will be told it is corrupted.

The plot is relatively simple, and there are other packages that can do better visualization, based on the number of pixels plotted and overplotting, such as bigvis, but let's say this is my plot to make.

Let's look at the size of the PDF.

fsize = file.info(pdfname)$size/(1024 * 1024)
cat(fsize, "Mb\n")
32.49 Mb

As we see, the file is about 32 Mb. I probably don't need a file that large or that level of granularity for zooming and vectorization.

Quick scatterplot example: using multiple png devices

I could also make a series of 5 PNG files and put them into a folder. I could then open Preview, drag and drop them into a PDF, and then save them or scroll through them using a quick preview. This is not a terrible solution, but it's not very reproducible, especially within a larger framework for creating multi-page PDFs.

One alternative is to give R a temporary filename to give to the set of PNGs, create them in a temporary directory, and then concatenate the PNGs using ImageMagick.

tm2 <- benchmark({
    pdfname2 = "mypdf2.pdf"
    tdir = tempdir()
    mypattern = "MYTEMPPNG"
    fname = paste0(mypattern, "%05d.png")
    gpat = paste0(mypattern, ".*\\.png")
    takeout = list.files(path = tdir, pattern = gpat, full.names = TRUE)
    if (length(takeout) > 0) 
        file.remove(takeout)
    pngname = file.path(tdir, fname)
    # png(pngname)
    png(pngname, res = 600, height = 7, width = 7, units = "in")
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    dev.off()

    pngs = list.files(path = tdir, pattern = gpat, full.names = TRUE)
    mystr = paste(pngs, collapse = " ", sep = "")
    system(sprintf("convert %s -quality 100 %s", mystr, pdfname2))
}, replications = 1)

One thing of note is that I visited the png help page many times, but never stopped to see:

The page number is substituted if a C integer format is included in the character string, as in the default.

which tells me that I don't need to change around the filename for each plot – R will do that automatically.
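To see concretely what that substitution does (a quick illustration using the pattern from the code above):

sprintf("MYTEMPPNG%05d.png", 1:3)
[1] "MYTEMPPNG00001.png" "MYTEMPPNG00002.png" "MYTEMPPNG00003.png"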

Let's look at how big this file is:

fsize = file.info(pdfname2)$size/(1024 * 1024)
cat(fsize, "Mb\n")
4.94 Mb

We see that there is a significant reduction in size. The resolution of the PNGs is 600 ppi, which is sufficient (actually good) for most applications and many journal requirements.

So what are the downsides?

  1. You have to run more code.
  2. You can't simply reuse the pdf() and dev.off() syntax.

To combat these two downsides, I wrapped these into functions mypdf and mydev.off.

mypdf = function(pdfname, mypattern = "MYTEMPPNG", ...) {
    fname = paste0(mypattern, "%05d.png")
    gpat = paste0(mypattern, ".*\\.png")
    takeout = list.files(path = tempdir(), pattern = gpat, full.names = TRUE)
    if (length(takeout) > 0) 
        file.remove(takeout)
    pngname = file.path(tempdir(), fname)
    png(pngname, ...)
    return(list(pdfname = pdfname, mypattern = mypattern))
}
# copts are options to send to convert
mydev.off = function(pdfname, mypattern, copts = "") {
    dev.off()
    gpat = paste0(mypattern, ".*\\.png")
    pngs = list.files(path = tempdir(), pattern = gpat, full.names = TRUE)
    mystr = paste(pngs, collapse = " ", sep = "")
    system(sprintf("convert %s -quality 100 %s %s", mystr, pdfname, copts))
}

mypdf opens the device and sets up the filename format for the PNGs (allowing options to be passed to png). It returns the pdfname and the regular expression pattern for the PNG files. The mydev.off function takes these two arguments, along with any options for ImageMagick's convert, closes the device, and concatenates the PNGs into a multi-page PDF.

Let's see how we could implement this.

tm3 <- benchmark({
    res = mypdf("mypdf3.pdf", res = 600, height = 7, width = 7, units = "in")
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    mydev.off(pdfname = res$pdfname, mypattern = res$mypattern)
}, replications = 1)

And just for good measure, show that this PDF is the same size as before:

fsize = file.info("mypdf3.pdf")$size/(1024 * 1024)
cat(fsize, "Mb\n")
4.94 Mb

Of note, the main difference between using this and pdf with respect to syntax is that dev.off() usually doesn't take an argument (it defaults to the current device).

This process is slow

Let's look at how long it takes to create the PDF for each scenario, using benchmark from the rbenchmark package.

print(tm1$elapsed)
[1] 16.74
print(tm2$elapsed)
[1] 294.7
print(tm3$elapsed)
[1] 276.3

We see that it takes longer (by a factor of around 17) to make the PDF with PNGs and then concatenate them. This is likely because 1) there may be some overhead with creating multiple PNGs versus one device and 2) there is the added PNG concatenation into a PDF step.

But the files are smaller and quicker to render

######### Ratio of file sizes
ratio = file.info("mypdf.pdf")$size/file.info("mypdf2.pdf")$size
print(ratio)
[1] 6.577

Here we see the gain in file size (and quickness of rendering) is about a factor of 7, but again that gain is traded off against the speed of the code. You can see the result of using pdf() and using mypdf.

Post-hoc compression

Obviously I'm not the only one who has had this problem; others have created tools to make smaller PDFs. For example, tools::compactPDF, which uses qpdf or Ghostscript, compresses already-made PDFs. Also, there are other reasons to use other formats, such as TIFF (which many journals prefer), but I'm just using PNG as my preference. JPEG, BMP, TIFF, etc. should work equally well in the code above.
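For instance (a minimal sketch; it assumes qpdf and/or Ghostscript are installed and on your path), compressing an already-written PDF could look like:

# Compress an existing PDF in place; tools::compactPDF uses qpdf by default
# and Ghostscript when a gs_quality other than "none" is requested.
tools::compactPDF("mypdf.pdf")
tools::compactPDF("mypdf.pdf", gs_quality = "ebook")  # more aggressive compression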

BONUS!

Here are some helper functions that I made to make things easier for viewing PDFs directly from R (calling bash). Other functions exist in packages, such as openPDF from Biobase, but these are simple to implement. (Note: I use xpdf as my PDF viewer, because getOption("pdfviewer") returns a different viewer that failed on our cluster.) The first two are viewers for PDFs and PNGs, and the third tries to guess the viewer from the filename. The fourth, open.dev, uses the filename extension to open the matching graphics device. This allows you to switch the filename from .pdf to .png and run the same code.

view.pdf = function(fname, viewer = getOption("pdfviewer")) {
    stopifnot(length(fname) == 1)
    if (is.null(viewer)) {
        viewer = getOption("pdfviewer")
    }
    system(sprintf("%s %s&", viewer, fname))
}

view.png = function(fname, viewer = "display") {
    stopifnot(length(fname) == 1)
    system(sprintf("%s %s&", viewer, fname))
}

view = function(fname, viewer = NULL) {
    stopifnot(length(fname) == 1)
    get.ext = gsub("(.*)\\.(.*)$", "\\2", fname)
    stopifnot(get.ext %in% c("pdf", "bmp", "svg", "png", "jpg", "jpeg", "tiff"))
    if (get.ext == "pdf") {
        if (is.null(viewer)) {
            viewer = getOption("pdfviewer")
        }
    }
    if (is.null(viewer)) {
        warning("No viewer given, trying open")
        viewer = "open"
    }
    system(sprintf("%s %s&", viewer, fname))
}

#### open a device from the filename extension
open.dev = function(file, type = "cairo", ...) {
    get.ext = gsub("(.*)\\.(.*)$", "\\2", file)
    stopifnot(get.ext %in% c("pdf", "bmp", "svg", "png", "jpg", "jpeg", "tiff"))

    ## device is jpeg
    if (get.ext == "jpg") 
        get.ext = "jpeg"
    ### different arguments for different devices
    if (get.ext %in% c("pdf")) {
        do.call(get.ext, list(file = file, ...))
    } else if (get.ext %in% c("bmp", "jpeg", "png", "tiff", "svg")) {
        do.call(get.ext, list(filename = file, type = type, ...))
    }
}