A Graduate School Open House: Words from a Student

So you got invited to an open house after you applied for your PhD. Now you get to visit the university, see the faculty and students, and meet the department overall. Here are some pointers that I have picked up over the years, things to ask if you freeze up and can't think, and things to look for when visiting. I'll try to discuss things that are relevant to PhD and Master's students, but I'm currently a PhD student, so these tips may be more applicable to a PhD.

What is grad school?

I've been around for some time, and let me tell you one thing: you are more prepared than I was when I was looking at graduate schools and programs. When I applied, I essentially Googled “What can you do with a biomathematics degree?”, biostatistics came up, and off I went applying. Not a very rigorous or studious way of doing things, but I think it worked out. So what is grad school? One thing I can say definitively, at least for me: it is not undergrad. My undergraduate school is not a research institution, and I hadn't done any heavy programming or even known what Markdown was (or LaTeX or R). I also didn't realize it wasn't common to wear shorts, a t-shirt, and a backwards hat to class. So if you know this, you are more prepared than I was.

School Determination

It's a little late for this, but the first thing you need to do is pick a school that has the program you want. For a Master's degree, that means the program you want to study and at an affordable price. I didn't thoroughly weigh the cost of my education across programs, which I regret. I believe I still made the best choice, but a more informed financial decision would have been better.

Questions

Here I'll go through some questions you can ask if you don't feel like you know what to ask when you get there. I'll break them up into overall, for other students, for professors, as these 3 audiences are different in what and how you should ask questions.

Questions to ask overall

These are questions you can, and should, ask in a big group so that other students can hear the responses as well. They may be asked to administrators or the graduate program director, as well as other students.

  1. What are some examples of departmental events?
  2. What resources does the department have for new students?
  3. What resources does the department have? e.g. a computing cluster, money for students' books/a new computer
  4. Do students (which students) get offices (if any)?
  5. How big is the department?
  6. What are the requirements for graduation (comps, orals, dissertation, etc)?

Questions to ask students

These are usually much more informal than those to the professors or staff. The level of informality can inform whether the student body fits well with your personality.

  1. What is student life like?
  2. Does the stipend cover the cost of living?
  3. What neighborhoods do you live in?
  4. How much does rent usually cost?
  5. How much does a cheeseburger and/or beer cost at a restaurant you usually go to? This is a question that can give you an idea of how much the cost of living is.
  6. Do you have/need a car?
  7. What is the public transportation like (do you ride the bus, do you usually cab, etc)?
  8. How much interaction do you have with professors that aren't your advisor/teacher? When I was interviewing at another department, I told one of their students that I was interviewed by Professor X (not Xavier), and they replied “Oh, I don't know who that is.” That told me a lot about how the department operated, and I realized I liked smaller departments.
  9. What student groups exist? This is interesting because it'll let you know how the students interact. For example, we have a computing club, a journal club, and an informal blog meeting.

Questions to ask professors

These are more formal in some respects and allow you to find out what research people are really doing and see if you connect with anyone.

  1. What is the coolest thing you've done recently?
  2. How many students do you have under you, on average?
  3. How often do you meet with your students?
  4. How many classes do you teach a year?
  5. How many students of yours have graduated? This is biased against new professors, but they should let you know that they are new.
  6. How are your students funded? Do they find funding or do you have funding usually available?

Be yourself

The main message I can send is: be yourself. However cliché it is, being yourself will let you accurately judge whether you get along with the department and its students. Some personalities just don't go with certain departments or certain professors, but that's not constrained to grad school. By the end of the visit, it's good to have as much information as you can about the “feel” of the department and whether you fit there.

Like where you live

Lastly, I think the best advice about choosing a department for a PhD was given from a friend of mine: “live somewhere you would like to live for the next 5 years”. I'd say the same thing for a 2-year or 18-month long Master's degree. The department can be great, but liking where you live is a huge component of maximizing happiness while you're in graduate school. And that should be #2 on your list, right after “getting it done”.

Changing the Peer Review Process: Thinking like a 10-Year Old

Abstract

I discuss my idea for another type of peer review process, where there would be a general requirement for needing to review before getting reviewed.

A story about when I was a 10-Year-Old

I couldn't really sleep this morning, so naturally I started thinking about the peer review process. I started thinking about elementary school and when teachers would make us grade each other's homework. When I had to grade my neighbors' assignments and they had to grade mine, I usually thought 3 things: 1) the teacher was too lazy to grade our homework (how wrong I was!), 2) I was happy to get feedback so quickly, and 3) the grading was generally less harsh than if the teacher had done it.

Back to Peer Review

I was thinking about how this idea could apply to a peer-reviewed journal. I think it would be interesting if whenever you submitted your paper, then the only way you get feedback is if you reviewed another paper from that journal. I think this would be interesting for a few reasons.

  1. It would encourage fast feedback from reviewers. (I'll get to how this may be a bad thing).
  2. You'd have a large pool of reviewers.
  3. Although many (if not most) academics reviewing papers are paying forward from getting their own papers reviewed, the pay forward here would be immediate. Also, you wouldn't have any “leechers” publishing in that journal.
  4. It would promote the idea that one aspect of research is “giving back”.

How to Select Which Papers to Review

You could select from a list of papers that are currently available to be reviewed and pick the one (or more) that you feel qualified to review. You may have to give a reason why you feel qualified (or not). If you didn't feel qualified for any, then you can either pay a fee, or give fields you do feel qualified in and wait for the journal to respond. Editors would still check the paper is in line with the journal's mission and can double check if they don't see a good fit for a paper reviewer.

Caching

I like the idea that you could “cache” reviews so that if you (or a co-author) submitted to this journal (or a network of journals) and had more reviews than submissions, you could simply submit. This would be useful because:

  1. Younger academics (i.e. students) would be encouraged to get into the peer review process earlier.
  2. You may review at any time if you think you have a submission in the future.
  3. You know you still “gained” something from the review above and beyond what you gain from being a reviewer in the current system (new knowledge, seeing other writing styles, etc.)
  4. Having a co-author that reviews a lot of papers may be more desirable to collaborate with (not a great reason, but still gives incentive to review).

Drawbacks

Potential drawbacks for this system obviously exist, such as:

  1. Quality of review. You'd run the risk of people trying to review too quickly or reviewing papers they are not qualified for.
  2. Using co-authors' “caches” more often than giving back, so that only a few are still doing the reviews.
  3. Competing papers might get rival authors reviewing them (could be rare, but worth considering).
  4. Allowing people to select papers that need to be reviewed may neglect some more-complicated papers. (You could maybe assign these).

Conclusions

The idea may not be fully hashed out, and there are probably unforeseen problems, but I think this system would be interesting to try out, if it's not being used already. (Incidentally, if you know of a place where this system is used, links/comments are welcome!). The peer review process would be even more of a “scratch my back, I'll scratch yours” situation. Also, it would give direct incentives for reviewing, which I believe is a good thing.

Changing Terminal Tab Names

So I was looking around how to change Terminal tab names. I want the tab name to change to the current working directory if I'm on my local system and to “Enigma” if I'm on our cluster host computer and “Node” if I'm on a cluster node.

After some tweaking, I found a solution that I like.

In my ~/.bashrc file, I have:

function tabname {
  x="$1"
  export PROMPT_COMMAND='echo -ne "\033]0;${x}\007"'
  # printf "\e]1;$1\a"
}
### changing tab names
tname=`hostname | awk '/enigma/ { print "Enigma"; next; } { print "Node" }'`
tabname "$tname"

which essentially just matches the output of the hostname command against the regular expression enigma. The result is assigned to a bash variable tname, and then tabname sets that as the tab name.
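To see the awk mapping on its own, here is a small sketch that feeds sample hostnames through the same pattern. The function name label_for_host and both hostnames are made up for illustration; only the awk program comes from the snippet above.

```shell
# Hypothetical helper: map a hostname string to a tab label using the
# same awk program as above (match "enigma", otherwise fall back to "Node").
label_for_host () {
  printf '%s\n' "$1" | awk '/enigma/ { print "Enigma"; next } { print "Node" }'
}

label_for_host "enigma2.example.org"   # matches /enigma/
label_for_host "compute-0-12"          # no match, falls through to "Node"
```

Testing the pattern on strings like this is a quick way to check it before relying on it in ~/.bashrc.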

In my personal ~/.bashrc, I added:

function tabname {
  x="$1"
  export PROMPT_COMMAND='echo -ne "\033]0;${x}\007"'
  # printf "\e]1;$1\a"
}
### changing tab names
tname=`hostname | awk -v PWD="$PWD" '/macbook/ { print PWD; next; }'`
tabname "$tname"

so that when I'm on my macbook (change this as needed for your machine), it will have the working directory as the tab name. Now, yes, I know that Terminal usually puts the working directory in the window name, but I find that I tend not to look at that, only at the tab names.

Now, you can combine these to have:

tname=`hostname | awk -v PWD="$PWD" '/enigma/ { print "Enigma: " PWD; next; } { print "Node: " PWD }'`

if you want to describe where you are on the cluster.

Here's the result:

[Image: Tabs]

This worked great on our cluster, but the cluster tab name remained after I exited the ssh session, so I'm still tweaking. Any comments would be appreciated.
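One direction I might try, as an untested sketch rather than a confirmed fix: in the snippets above, tname is computed once when ~/.bashrc is read, so the label never changes afterwards. Recomputing it before every prompt would let the title follow cd, and, assuming the same lines are in the local ~/.bashrc, the local shell would rewrite the title as soon as its prompt returns after ssh exits. The function names tab_label and set_tab_title are made up for this example.

```shell
# Hypothetical helper: build the label from a hostname and a directory.
tab_label () {
  case "$1" in
    *enigma*)  printf 'Enigma: %s' "$2" ;;
    *macbook*) printf '%s' "$2" ;;
    *)         printf 'Node: %s' "$2" ;;
  esac
}

# Runs before every prompt (via PROMPT_COMMAND), so the label is rebuilt
# from the current hostname and working directory each time.
set_tab_title () {
  printf '\033]0;%s\007' "$(tab_label "$(hostname)" "$PWD")"
}
PROMPT_COMMAND=set_tab_title
```

Keeping the label logic in its own function means it can be checked on plain strings, e.g. tab_label compute-0-3 /scratch, without touching the terminal title.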

Faster XML conversion to Data Frames

Problem Setup

I had noted in a previous post that I have been using the XML package in R to process an XML export of our database. I used xmlToDataFrame to convert the XML to an R data.frame and I have found it to be remarkably slow. After some Googling, I found a link where the author states that xmlToDataFrame is a generic function and if you know the structure of the data, you can leverage that to speed up the conversion.

So, that's what I did for my data. I think this structure is applicable to similar data structures in XML, so I thought I'd share.

Data Structure

Let's look at the data structure. For my data, an example XML would be:

<?xml version="1.0" encoding="UTF-8"?>
<export date="13-Jan-2014 14:08 -0600" createdBy="John Muschelli" role="Data Manager">

  <dataset1>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </dataset1>
  <dataset2>
    <ID>001</ID>
    <visit>1</visit>
    <scale1>20</scale1>
    <scale2 />
    <scale3>20</scale3>
  </dataset2>
  <dataset1>
    <ID>002</ID>
    <age>40</age>
    <field4 />
  </dataset1>  
</export>  

which tells me a few things:

  1. I'm XML (first line). There are other pieces of information which can be extracted as tags, but we won't cover that here.
  2. Everything is part of one large element called export (the root, or a parent in XML talk I believe) (second line)
  3. Datasets are children of export (they're nested in export). For example, we have dataset1 and dataset2 in this export.
  4. There are missing data points, referenced either by <tag></tag> or <tag />. Both are valid XML.
  5. Not all records of the datasets have all fields. The second record of dataset1 doesn't have field3 but has field4.

So I wrote this function to make my data.frames, which I found to be much faster for conversion for large datasets.

require(XML)
xmlToDF = function(doc, xpath, isXML = TRUE, usewhich = TRUE, verbose = TRUE) {

    if (!isXML) 
        doc = xmlParse(doc)
    #### get the records for that form
    nodeset <- getNodeSet(doc, xpath)

    ## get the field names
    var.names <- lapply(nodeset, names)

    ## get the total fields that are in any record
    fields = unique(unlist(var.names))

    ## extract the values from all fields
    dl = lapply(fields, function(x) {
        if (verbose) 
            print(paste0("  ", x))
        xpathSApply(doc, paste0(xpath, "/", x), xmlValue)
    })

    ## make logical matrix whether each record had that field
    name.mat = t(sapply(var.names, function(x) fields %in% x))
    df = data.frame(matrix(NA, nrow = nrow(name.mat), ncol = ncol(name.mat)))
    names(df) = fields

    ## fill in that data.frame
    for (icol in 1:ncol(name.mat)) {
        rep.rows = name.mat[, icol]
        if (usewhich) 
            rep.rows = which(rep.rows)
        df[rep.rows, icol] = dl[[icol]]
    }

    return(df)
}

Function Options

So how do I use this?:

  • You need the XML package.
  • doc is a parsed XML document. For example, run:
doc = xmlParse("xmlFile.xml")
  • xpath is an XPath expression extracting the dataset you want. For example if I wanted dataset1, I'd run:
doc = xmlParse("xmlFile.xml")
xmlToDF(doc, xpath = "/export/dataset1")
  • You can set isXML=FALSE and pass in a character string of the xml filename, which just parses it for you.
xmlToDF("xmlFile.xml", xpath = "/export/dataset1", isXML = FALSE)
  • usewhich just flags whether to use which for subsetting. It seems faster, though I'm still trying to think of reasons logical subsetting would be faster. This doesn't really change functionality as long as which returns something of length at least 1, which it should by construction, but it may speed up the code for large datasets.
  • verbose – do you want things printed to screen?

Function Explanation

So what is this code doing?:

  1. Parses the document (if isXML = FALSE)
  2. Extracts the nodes that are for that specific dataset.
  3. Gets the variable names for each record (var.names)
  4. Takes the union of all those variable names (fields). This will be the variable names of the resultant dataset. If every record had all fields, then this would be redundant, but this is a safer way of getting the column/variable names.
  5. Extract all the values from each field for each record (dl, which is a list).
  6. For each record, a logical matrix is made to record if that record had that field represented in XML.
  7. A loop over each field then fills in the values to the data.frame.
  8. data.frame is returned.

Timing differences

Obviously, I wanted to use this because I think it'd be faster than xmlToDataFrame. First off, what was the size of the dataset I was converting?

dim(df$df.list[[1]])
# [1] 16824 161

So only 16824 rows and 161 columns. Let's see how long it took to convert using xmlToDataFrame:

    user   system  elapsed 
4194.900   93.590 4288.996 

Where each measurement is in seconds, so that's over 1 hour! I think this is pretty long, and don't know all the checks going on, so that may not be unreasonable to those who have used this package a lot. But I think that's unscalable for large datasets.

What about xmlToDF?

   user  system elapsed 
225.004   0.356 225.391 

which takes about 4 minutes. This is roughly 19 times faster (about 4289 vs. 225 seconds elapsed), and makes it reasonable to parse the 150 or so datasets I have.

Conclusion

This function (xmlToDF) may be useful if you're converting XML with a similar structure to a data.frame. If your data is different, then you may have to tweak it to fit your needs. I understand that the for loop is probably not the most efficient, but it was clearer in code to those I'm writing for (other collaborators) and the efficiency gains from using this function over xmlToDataFrame were enough for our needs.

The code is hosted here. Also, you can use this function (and any updates that are made) by using the processVISION package:

require(devtools)
install_github("muschellij2/processVISION")

Stata Markdown

Abstract/Summary

This blog post is about options for making dynamic documents in Stata using Markdown, discussing StatWeave and a user-written do file, knitr.do. I will discuss some of the capabilities of these options and show options for custom use if you know how to use RMarkdown.

Knitr: Dynamic Documents

If you use R, or even if you don't, you may have heard of the phrases “dynamic documents”, “reproducible reports”, “markdown”, “Rmarkdown”, or more specifically “knitting/knitr”. For pronunciation: according to the knitr documentation in R:

The pronunciation of knitr is similar to neater (neater than what?) or you can think of knitter (but it is single t). The name comes from knit + R (while Sweave = S + weave).

Now, if you haven't heard it, well I guess now you have. But more importantly, do some research on knitr. It's awesome, and there's even a Book from the author of knitr, Yihui Xie and a corresponding GitHub repository. Also, you may want to read why you should care about reproducible results.

Overall, knitr is a system that allows for dynamic documents, which I will define as files that contain code and prose/text/words/comments/notes.

Knitr Languages

Why am I talking about Stata? Well, I use Stata. Also, if you're using SAS, Python, CoffeeScript or some other languages, then knitr has already incorporated these into the R system: http://yihui.name/knitr/demo/engines/.

Let's just list some resources for doing some knitting in Stata:

Now, I highly suggest taking a look at the github repo and knitr.do and StatWeave. Actually, no. Stop reading this post and check it out. I can wait. Go. I'm going to talk about how to do this within R.

And… We're back

So these options are good and are mainly options to create a markdown document that Stata will run/manipulate. This is vital for someone who doesn't know R. Here are some notes I have:

  1. knitr has a lot of good options already made and is expanding. No reinventing the wheel with respect to highlighting/parsing/etc. Also, a large community uses it.
  2. I want to know one syntax for markdown. OK, maybe two, one for html, the other for LaTeX.
  3. knitr.do uses parsing based on indenting from what I can see. I like demarcating code; I feel like it's safer/robust. This could easily be changed if the user wanted it. StatWeave allows code demarcation by \begin{Statacode} and \end{Statacode}.
  4. knitr.do didn't seem to have an inline code option. StatWeave allows you to add inline text. For example, stating that the dataset had 100 rows and the maximum age was 30. StatWeave uses the Sweave syntax, but uses \Stataexpr instead of \Sexpr, so that you could fill in that 30 by using \Stataexpr{max(age)} instead of writing 30. This is a huge capability for truly dynamic documents.
  5. StatWeave is maintained mainly, I believe, by one person (Russell Lenth). This is how knitr started in some capacity before it became more popular, but it was built upon a community-used system R that had a pre-existing software that was similar (Sweave). Hence, I think knitr has more longevity and more learning capital compared to either option. Also, StatWeave (or its functionality) may be integrated into knitr.
  6. StatWeave can only be written in LaTeX syntax (since an OpenOffice bug precludes it from making odt docs). knitr.do can do markdown, which can be converted to pdf, docx, html, or many other formats using pandoc.
  7. Neither option allows for automatically saving and naming plots in any system I can see. This must be done in Stata code using normal graph saving methods, e.g. graph export.
  8. knitr.do inherently uses logs. I can't really determine what StatWeave uses because it's written in Java.

Now, I'm going to assume you know how to use knitr and see how we could do some reporting using knitr.

99 Problems and they're Stata problems

If you are running knitr from R, again, Yihui has incorporated a lot of other languages to process. What are some potential problems with processing the same way in Stata?

  • Stata is inherently just a command line, but when you call it, it calls a GUI if you don't have Stata(console). More on Stata(console) later.
    • You can start Stata by the command line in Unix or Windows with help from those links.
  • In order to use Stata from the command line, you probably need to put the path to Stata in your PATH variable: http://www.stata.com/support/faqs/mac/advanced-topics/. For example, the path /Applications/Stata/Stata.app/Contents/MacOS/ is in my PATH, so that I can go to the Terminal and type Stata. (Side note: this is the way to start multiple Stata sessions on a Mac). Let's assume you didn't do this though.

Let's just make a test .do file:

  cat Stata_Markdown.do
clear 
disp "hello world!"
exit, clear

Now how to run it? Let's use bash, which is supported by knitr. So I just have in my knitr code chunk options, engine='bash'. Don't forget comment="" if you don't want # to be printed (which is the default comment character).

stata -b Stata_Markdown.do 
echo $? ### print last return
127

Since echo $? prints 0 when there is no error, the 127 means there was an error (127 is the status bash returns when it cannot find a command). Worse, it was a silent error, in the sense that no error message showed up in the output from bash. The error occurs because my bash doesn't have a stata or Stata command. We could either make aliases in .bash_profile or .bashrc or, again, put Stata in my PATH, but let's just be explicit about the Stata command by using the full path: for me, it's /Applications/Stata/Stata.app/Contents/MacOS/stata. We also don't see anything from the log file, which makes sense because nothing happened.
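To make that failure loud instead of silent, one option is to check for the command before running anything. This is a minimal sketch, not part of the original workflow: resolve_cmd is a made-up helper name, and the fallback path is simply the one from my machine.

```shell
# Hypothetical helper: print the first usable command, trying a name on
# the PATH first and then a full-path fallback; return 127 if neither works.
resolve_cmd () {
  if command -v "$1" >/dev/null 2>&1; then
    command -v "$1"
  elif [ -x "$2" ]; then
    printf '%s\n' "$2"
  else
    echo "cannot find $1 (also tried $2)" >&2
    return 127
  fi
}

# The full path is the one from this post; adjust it for your machine.
STATA_BIN=$(resolve_cmd stata "/Applications/Stata/Stata.app/Contents/MacOS/stata") ||
  echo "Stata not found, so the batch run would fail here"
```

If the lookup succeeds, "$STATA_BIN" -b Stata_Markdown.do would then run the do file with whichever Stata was found.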

  • But a real problem is the Stata log file is not made in a “timely” manner in this process. Let's rerun the code with the full path for Stata:
/Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown.do" 
echo $?
cat Stata_Markdown.log
0

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
                                      College Station, Texas 77845 USA
                                      800-STATA-PC        http://www.stata.com
                                      979-696-4600        stata@stata.com
                                      979-696-4601 (fax)

35-student Stata lab perpetual license:
       Serial number:  30110513240
         Licensed to:  Biostat
                       Johns Hopkins University

Notes:
      1.  10.00 MB allocated to data
      2.  Stata running in batch mode

. do Stata_Markdown.do 

. clear 

. disp "hello world!"
hello world!

. exit, clear

end of do-file
  • Success! Well, it worked in that the exit status was 0, but not really a “success” as nothing was printed. So what does this code for running Stata mean?

    • /Applications/Stata/Stata.app/Contents/MacOS/stata says “run stata”
    • -b says I want to run in “batch mode”, which is much different than “beast mode”.
    • Stata_Markdown.do is the file I want to run
      Now, if there was a space in the path to Stata, it would need to be quoted with ". But IMPORTANTLY, the Stata console came up and I had to hit “OK”, INTERACTIVELY!! Not very automated, but we'll fix this in a moment.
  • But what about the cat Stata_Markdown.log, which is auto-generated by the Stata command? Was the log empty?

    cat Stata_Markdown.log
    
    ___  ____  ____  ____  ____ (R)
    /__    /   ____/   /   ____/
    ___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
    Statistics/Data Analysis            StataCorp
                                        4905 Lakeway Drive
                                        College Station, Texas 77845 USA
                                        800-STATA-PC        http://www.stata.com
                                        979-696-4600        stata@stata.com
                                        979-696-4601 (fax)
    
    35-student Stata lab perpetual license:
         Serial number:  30110513240
           Licensed to:  Biostat
                         Johns Hopkins University
    
    Notes:
        1.  10.00 MB allocated to data
        2.  Stata running in batch mode
    
    . do Stata_Markdown.do 
    
    . clear 
    
    . disp "hello world!"
    hello world!
    
    . exit, clear
    
    end of do-file
    

    WHAT? Running the command again gives us what we want? Now, we could use 2 code chunks, but if we set the results='hold' option in knitr, then things work fine.

    • You can get around this unwanted “interactivity” using the console version of Stata, but I didn't set it up and Stata for Mac says:
      > Can I display graphs with Stata(console)?
      > No. Stata(console) is a text-based application and has no graphical display capabilities. However, it can generate and save Stata graphs, which can then be viewed with Stata(GUI). Stata(console) can also convert Stata graphs to PostScript and save them as files.

Also, Stata(console) for Mac needs Stata/SE or Stata/MP (aka more costly Stata) according to Section C.4 Stata(console) for Mac OS X. So for most users you'd have to buy a different Stata.

  • Another way of getting around this interaction would be having Stata auto-exit; let's do that. Exiting Stata is possible without having interaction with a specific option when you exit, so you have exit, clear STATA. Let's look at our new script Stata_Markdown_Exit.do:
cat Stata_Markdown_Exit.do
clear 
disp "hello world!"
exit, clear STATA

Now let's run it.

/Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown_Exit.do"
echo $?
cat Stata_Markdown_Exit.log
0

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
                                      College Station, Texas 77845 USA
                                      800-STATA-PC        http://www.stata.com
                                      979-696-4600        stata@stata.com
                                      979-696-4601 (fax)

35-student Stata lab perpetual license:
       Serial number:  30110513240
         Licensed to:  Biostat
                       Johns Hopkins University

Notes:
      1.  10.00 MB allocated to data
      2.  Stata running in batch mode

. do Stata_Markdown_Exit.do 

. clear 

. disp "hello world!"
hello world!

. exit, clear STATA

It looks the same as before with no output, but I did not have to interact with Stata. Note: if you use & at the end of the command, echo $? will come up zero, because bash runs it as a background process and only reports that the job was started.
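The background-process note can be checked without Stata. In this sketch a subshell that exits with status 3 stands in for a failing batch job (an assumption for illustration); wait then recovers the job's real exit status.

```shell
# Stand-in for a failing batch job (not Stata): a subshell that exits with 3.
sh -c 'exit 3' &
launch_status=$?            # status of starting the background job

job_status=0
wait $! || job_status=$?    # block until the job finishes, keep its exit status

echo "launch: $launch_status, job: $job_status"   # prints "launch: 0, job: 3"
```

So if you do background the Stata call, check the status returned by wait rather than the echo $? right after the &.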

But I don't want to show the whole script all the time

You may notice that I printed with cat the entire log that was created with Stata. Honestly, I don't like Stata logs. They seem like a nuisance. I have a script and can make outputs, so why do I need a log? But here, it seems useful. But what happens when you want to show parts of a script at different points? You can obviously make a series of .do files. Not really a good solution.

What's a better solution? Create logs in your Stata code and then cat them to different code chunks. Here's an example:

cat Stata_Markdown_logs.do
clear 
log using print_hello.log, replace
disp "hello world!"
log close

log using run_summ.log, replace
set obs 100
gen x = rnormal(100)
summ x
log close 
exit, clear STATA
/Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown_logs.do" 

Now, since print_hello.log, and run_summ.log were created, I can just do:

cat print_hello.log
--------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/print_hello.log
  log type:  text
 opened on:  11 Jan 2014, 18:20:29

. disp "hello world!"
hello world!

. log close
      name:  <unnamed>
       log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/print_hello.log
  log type:  text
 closed on:  11 Jan 2014, 18:20:29
--------------------------------------------------------------------------------------------------------

and then later print:

cat run_summ.log
--------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/run_summ.log
  log type:  text
 opened on:  11 Jan 2014, 18:20:29

. set obs 100
obs was 0, now 100

. gen x = rnormal(100)

. summ x

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       100    100.0006    1.061928   97.11491   101.8377

. log close 
      name:  <unnamed>
       log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/run_summ.log
  log type:  text
 closed on:  11 Jan 2014, 18:20:29
--------------------------------------------------------------------------------------------------------

No header/footer from log

This works, but you have a header and footer, that you probably can't delete with some simple option. Now, obviously you can read them in R and do string manipulation and then print them back out, but that's a little convoluted. Regardless, I wrote a simple function in R that will do it (R code):

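catlog <- function(filename, runcat = TRUE, comment = "") {
    ## read the log, one element per line
    x = readLines(filename)
    lenx = length(x)
    ## drop the 6-line header and the 6-line footer
    x = x[7:(lenx - 6)]
    ## overwrite the log with the trimmed version (so run this only once per log)
    writeLines(x, filename)
    if (runcat) 
        cat(x, sep = "\n")
}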
catlog("run_summ.log")
. set obs 100
obs was 0, now 100

. gen x = rnormal(100)

. summ x

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       100    100.0006    1.061928   97.11491   101.8377

which simply drops the first 6 and last 6 lines of the log. You can then print it entirely from R or, since the trimmed log is saved back to the file, print it using cat from bash:

cat run_summ.log
. set obs 100
obs was 0, now 100

. gen x = rnormal(100)

. summ x

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |       100    100.0006    1.061928   97.11491   101.8377

or in bash, one example would be:

nlines=`awk 'END{print NR}' print_hello.log`
nhead=`expr $nlines - 6`
ntail=`expr $nlines - 12`
head -$nhead print_hello.log | tail -$ntail
. disp "hello world!"
hello world!

or even better, let's make a function for bash that will do it:

catlog () {
  # $1 is the log file; quoted so paths with spaces work
  nlines=`awk 'END{print NR}' "$1"`
  nhead=`expr $nlines - 6`
  ntail=`expr $nlines - 12`
  head -n $nhead "$1" | tail -n $ntail
}

catlog print_hello.log
. disp "hello world!"
hello world!

OK – I can see the allure of using StatWeave in some capacity at this point. But still, if you use knitr, this may make sense or the way you want to do it without going to StatWeave.
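The line counts in catlog assume the header and footer are always exactly 6 lines. As a hedged alternative, here is a sketch that finds them by their text instead, and that leaves the log file untouched. The function name catlog_body is made up, and the sample log is a shortened, invented copy of the layout of print_hello.log.

```shell
# Hypothetical alternative to catlog: find the header and footer by their
# text instead of by fixed line counts.
catlog_body () {
  awk '
    /^ *opened on:/         { header = 1; next }            # last line of the header
    header && /^ *$/        { header = 0; body = 1; next }  # blank line that ends it
    body && /^\. log close/ { exit }                        # footer starts here
    body                    { print }
  ' "$1"
}

# Small made-up log in the same layout as print_hello.log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
----------
      name:  <unnamed>
       log:  /tmp/print_hello.log
  log type:  text
 opened on:  11 Jan 2014, 18:20:29

. disp "hello world!"
hello world!

. log close
      name:  <unnamed>
       log:  /tmp/print_hello.log
  log type:  text
 closed on:  11 Jan 2014, 18:20:29
----------
EOF
catlog_body "$sample_log"
```

This stops at the first log close it sees, so it assumes one log per file, which is how the .do file above writes them.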

Cleanup

You can just do some .log clean up using:

rm *.log  

if you want to delete all logs in your folder (assuming you never changed directories).
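
If other programs also drop .log files in the same folder, `rm *.log` takes those with it. Here is a more conservative sketch, assuming each Stata log sits next to a .do file with the same base name (I'm assuming, for example, that `run_summ.log` came from a `run_summ.do`). The function name `clean_stata_logs` is made up.

```shell
# Hypothetical helper: delete NAME.log only when NAME.do exists in the current directory.
clean_stata_logs () {
  for dofile in *.do; do
    [ -e "$dofile" ] || continue       # the glob matched nothing
    log="${dofile%.do}.log"
    if [ -f "$log" ]; then
      rm -- "$log"
    fi
  done
}

# Demo in a scratch directory so nothing real gets deleted.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch run_summ.do run_summ.log other_tool.log
clean_stata_logs
ls
```

In the demo, `run_summ.log` is removed while `other_tool.log` and `run_summ.do` are left alone.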

Thoughts

You can do “markdown” in Stata. My thoughts:
1. It's complicated.
2. The knitr.do file is a good start, and it's great that it's totally within Stata (you still need a Markdown converter), but it doesn't have code demarcation. It also doesn't do inline commands, which are a requirement for a dynamic doc: they let you fill in the numbers dynamically with code instead of typing them by hand.
3. StatWeave has more functionality than knitr.do, including inline functions, but it uses added software (a Java program) and can't do general markdown; the user needs to understand LaTeX.
4. Plotting hasn't really been integrated. You can always do a graph export myplot.pdf, as(pdf) (on Mac) or whatever and then just put in <img src="myplot.pdf"> in your html, or \includegraphics{myplot.pdf} in LaTeX, but that's not as integrated as it is in other systems.
5. If you make it to a Markdown document, you can use the great pandoc to potentially then just make it a Word .docx.
6. It will likely be integrated in the future. The question is how close is that “future”?
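
For the pandoc point in the list above, the conversion itself is one line. This is only a sketch: `report.md` is a made-up stand-in for whatever Markdown your Stata workflow produced, and the check for pandoc is there so the snippet doesn't fall over on a machine without it.

```shell
# Sketch: convert a Markdown file to Word with pandoc, if pandoc is installed.
# report.md is a hypothetical stand-in for the Markdown produced from Stata.
workdir=$(mktemp -d)
printf '# Summary\n\nThe mean of x was 100.0006.\n' > "$workdir/report.md"

if command -v pandoc >/dev/null 2>&1; then
  # pandoc infers the output format from the .docx extension
  pandoc "$workdir/report.md" -o "$workdir/report.docx"
  status="converted"
else
  status="pandoc not installed"
fi
echo "$status"
```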

Conclusion

I like both options for their respective pieces, but my main concern with either option is putting in a lot of time and then having them become obsolete with knitr integration. That's not a big problem for me, since I know knitr, but it's something to think about for someone who doesn't. My recommendation: if you know and want to use LaTeX, or need inline numbers, go with StatWeave. Otherwise knitr.do may do the trick. Also, I've given you some directions on “growing your own”, which is the most customizable for you but even worse with respect to time, reinventing the wheel, and lack of support from others.

Anyway, those are the current options I know about when doing Markdown with Stata.

Creating Stata dtas from R, Issues and Resolutions

So I am currently on a clinical trial and we have a very interesting export process. Essentially, the data comes in an XML format, then is converted to data sets using SAS. That’s all fine and good, because SAS is good at converting things (I’ve heard). The problem is, the people who started writing code to process the data and the majority of people maintaining the code use Stata.

Now that’s a problem. OK, well let’s take these SAS data sets and convert them to Stata using Stat-Transfer. Not totally reprehensible; it at least preserves some form versus a CSV to dta (Stata format) nightmare. The only problem with this is that for some reason, the SAS parsing of the XML started to chew up a bit of memory, for about 140 data sets (it’s a lot of forms). Oh, by the way, a bit of memory was about 16 gigs from a 100 meg file. That’s atrocious. I don’t care what it’s doing, but a 160-fold increase from just converting some XML and copying the datasets is too much. Not only that, it took over 4 hours. What the hell, SAS?

Anyway, we just started a phase III trial collecting data similar to before. We’re using the same database. I decided to stop the insanity and convert the XML in R. The data still need to end up as Stata data sets, but at least I could likely control the memory consumption and the run time. At least I could throw it onto our computing cluster if things got out of control (I guess I could have done that with SAS, but I’m not doing that).

Now, thank God that the XML package exists. So pretty much just using some xmlParse and xmlToDataFrame commands, I had some data.frames! Now, just make them into Stata data sets right? Not really.

Let me first say that the foreign package is awesome. You can read pretty much any respectable statistical software dataset into R and do some exporting. Also, the SASxport package allows you to export to the XPORT format using write.xport, which is a widely used (and FDA-compliant) format.

Now what problems did I have with the foreign package?

  1. I believe that Stata data sets can have length-32 variable names.
    After some correspondence, the maintainers argue that Stata’s
    documentation only supports “up to 32” characters, which they
    interpret as only 31. The documentation states:

    varlist contains the names of the Stata variables 1, …, nvar, each up
    to 32 characters in length

    A week after my discussion, foreign had noted in their ChangeLog:

    man/{read,write}.dta: Freeze Stata support.

    Well, I guess I’ll just change write.dta to do what I want.

    a. My solution: Copy write.dta and change the 31L to 32L. Better yet, I could have had the user pass a truncation length, but let’s default some stuff. The only concern is the command do_writeStata, which is a hidden (non-exported) function from foreign. So I just slapped a foreign:::do_writeStata on there, and away we go (not the best practice, but the only way I could think of; importFrom did not work).

  2. Empty strings in R are represented as "". According to the foreign package, Stata documentation states that empty strings are not supported, which is true:

    Strings in Stata may be from 1 to 244 bytes long.

    and "" has 0 bytes:

    nchar("", "bytes")
    
    ## [1] 0
    

    I know from reading in Stata dta files that character variables can contain "", which is treated as missing. See the Stata code below (I’m going to post about how to knit with Stata in a followup):

    /Applications/Stata/Stata.app/Contents/MacOS/stata "test.do"
    cat test.log
    
    --------------------------------------------------------------------------------------------------------
        name:  <unnamed>
         log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
    log type:  text
    opened on:  11 Jan 2014, 13:39:54
    
    . set obs 1
    obs was 0, now 1
    
    . gen x = ""
    (1 missing value generated)
    
    . count if mi(x)
      1
    
    . count if x == ""
      1
    
    . log close
        name:  <unnamed>
         log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
    log type:  text
    closed on:  11 Jan 2014, 13:39:54
    --------------------------------------------------------------------------------------------------------
    

    This isn’t a major problem, as long as you know about it.

    a. My solution? Make the "" a " " (a space). Is this optimal? No. If there are true spaces in the data, then these are aliased. But who really cares about them? If you do, then change the code. If not, then great, use my code. If you have a large problem with that, throw it in the comments and someone will probably read it.

Then, in Stata, you can define the program:

*** recaststr: turn the " " back into "" (to get around str0 cases)
capture program drop recaststr
program define recaststr
    foreach var of varlist * {
        * look up the storage type; only string variables are touched
        local vtype : type `var'
        if (index("`vtype'", "str") > 0) replace `var' = "" if `var' == " "
    }
end

OK, so if you ever want to use these functions,

library(devtools)
install_github("muschellij2/processVISION")
library(processVISION)

should start you off, and the functions write32.dta and create_stata_dta should be what you’re looking for. I make some attempt at converting the columns into numerics and dates in create_stata_dta. If you don’t want that, just use the argument tryConvert=FALSE. Happy converting.

Followup Post: How to make a feeble attempt at knitting a Stata .do file and explaining some attempts/packages out there.