# Stata Markdown

## Abstract/Summary

This blog post covers options for making dynamic documents in Stata using Markdown, namely StatWeave and knitr.do, a user-written do file. I will discuss some of the capabilities of these options and show how to build a custom approach if you know how to use RMarkdown.

## Knitr: Dynamic Documents

If you use R, or even if you don't, you may have heard of the phrases “dynamic documents”, “reproducible reports”, “markdown”, “Rmarkdown”, or more specifically “knitting/knitr”. For pronunciation: according to the knitr documentation in R:

> The pronunciation of knitr is similar to neater (neater than what?) or you can think of knitter (but it is single t). The name comes from knit + R (while Sweave = S + weave).

Now, if you haven't heard of it, well, I guess now you have. But more importantly, do some research on knitr. It's awesome, and there's even a book from the author of knitr, Yihui Xie, and a corresponding GitHub repository. Also, you may want to read why you should care about reproducible results.

Overall, knitr is a system that allows for dynamic documents, which I will define as files that contain code and prose/text/words/comments/notes.

### Knitr Languages

Why am I talking about Stata? Well, I use Stata. Also, if you're using SAS, Python, CoffeeScript or some other languages, then knitr has already incorporated these into the R system: http://yihui.name/knitr/demo/engines/.

Let's just list some resources for doing some knitting in Stata:

Now, I highly suggest taking a look at the github repo and knitr.do and StatWeave. Actually, no. Stop reading this post and check it out. I can wait. Go. I'm going to talk about how to do this within R.

## And… We're back

So these options are good and are mainly options to create a markdown document that Stata will run/manipulate. This is vital for someone who doesn't know R. Here are some notes I have:

1. knitr has a lot of good options already made and is expanding. No reinventing the wheel with respect to highlighting/parsing/etc. Also, a large community uses it.
2. I want to know one syntax for markdown. OK, maybe two, one for html, the other for LaTeX.
3. knitr.do uses parsing based on indenting from what I can see. I like demarcating code; I feel like it's safer/robust. This could easily be changed if the user wanted it. StatWeave allows code demarcation by \begin{Statacode} and \end{Statacode}.
4. knitr.do didn't seem to have an inline code option. StatWeave allows you to add inline text, for example, stating that the dataset had 100 rows and the maximum age was 30. StatWeave uses the Sweave syntax, but with \Stataexpr instead of \Sexpr, so you could fill in that 30 by using \Stataexpr{max(age)} instead of writing 30. This is a huge capability for truly dynamic documents.
5. StatWeave is maintained mainly, I believe, by one person (Russell Lenth). knitr started out this way too before it became more popular, but it was built on a community-used system (R) that already had similar software (Sweave). Hence, I think knitr has more longevity and more learning capital compared to either option. Also, StatWeave (or its functionality) may be integrated into knitr.
6. StatWeave can only be written in LaTeX syntax (since an OpenOffice bug precludes it from making odt docs). knitr.do can do markdown, which can be converted to pdf, docx, html, or many other formats using pandoc.
7. Neither option allows for automatically saving and naming plots in any system I can see. This must be done in Stata code using normal graph saving methods, e.g. graph export.
8. knitr.do inherently uses logs. I can't really determine what StatWeave uses because it's written in Java.

Now, I'm going to assume you know how to use knitr and see how we could do some reporting with it.

### 99 Problems and they're Stata problems

If you are running knitr from R, again, Yihui has incorporated a lot of other languages to process. What are some potential problems with processing the same way in Stata?

• Stata is inherently just a command line, but when you call it, it opens a GUI if you don't have Stata(console). More on Stata(console) later.
• You can start Stata by the command line in Unix or Windows with help from those links.
• In order to use Stata from the command line, you probably need to put the path to Stata in your PATH variable: http://www.stata.com/support/faqs/mac/advanced-topics/. For example, the path /Applications/Stata/Stata.app/Contents/MacOS/ is in my PATH, so that I can go to the Terminal and type Stata. (Side note: this is the way to start multiple Stata sessions on a Mac). Let's assume you didn't do this though.
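
As a rough sketch of the PATH setup described above: the directory is the one from my machine and will differ on yours, and `on_path` is a hypothetical helper name, not a Stata or bash built-in. It appends the Stata folder to PATH for the current session and checks whether the shell can then find the command:

```shell
# Assumed install location (from my machine); adjust for your own.
STATA_DIR="/Applications/Stata/Stata.app/Contents/MacOS"

# Hypothetical helper: succeeds (exit status 0) if a command is on PATH.
on_path () {
    command -v "$1" > /dev/null 2>&1
}

# Append the folder for this session only; put the same line in
# ~/.bash_profile or ~/.bashrc to make it permanent.
export PATH="$PATH:$STATA_DIR"

if on_path stata; then
    echo "stata found at: $(command -v stata)"
else
    echo "stata is still not on PATH; check STATA_DIR"
fi
```

If the second message prints, the folder is probably wrong for your install, and the full-path approach used in the rest of this post still works.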

Let's just make a test .do file:

    cat Stata_Markdown.do

    clear
    disp "hello world!"
    exit, clear
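
If you'd rather create that test file straight from the terminal, a here-document is one option. This is only a sketch; the file name and contents are the ones shown above:

```shell
# Write the three-line test script; quoting 'EOF' stops the shell
# from expanding anything inside the here-document.
cat > Stata_Markdown.do <<'EOF'
clear
disp "hello world!"
exit, clear
EOF

# Print it back to confirm what was written.
cat Stata_Markdown.do
```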


Now how to run it? Let's use bash, which is supported by knitr. So I just have in my knitr code chunk options, engine='bash'. Don't forget comment="" if you don't want # to be printed (which is the default comment character).

    stata -b Stata_Markdown.do
    echo $? ### print last return

    127

Since echo $? is supposed to print 0 if there is no error, there was an error. Worse, it was a silent error, in the sense that no error message was printed as output for bash. This error occurs because my bash doesn't have a stata or Stata command. We can either make aliases in .bash_profile or .bashrc or again put Stata in my path, but let's just be explicit about the Stata command by using the full path: for me, it's /Applications/Stata/Stata.app/Contents/MacOS/stata. We also don't see anything from the log file, which makes sense because nothing happened.
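
To see that 127 is the shell's generic "command not found" status rather than anything Stata-specific, here is a small sketch. The command name `no_such_stata_binary` is made up for illustration, and `run_and_report` is a hypothetical helper:

```shell
# Hypothetical helper: run a command quietly, then print its exit status.
run_and_report () {
    if "$@" 2> /dev/null; then
        status=0
    else
        status=$?
    fi
    echo "exit status: $status"
}

# A command that does not exist on PATH (made-up name) reports 127.
run_and_report no_such_stata_binary -b Stata_Markdown.do

# A command that exists and succeeds reports 0.
run_and_report true
```

Because the helper hides stderr, it reproduces the "silent" failure described above; drop the `2> /dev/null` to see the shell's own error message.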

• But a real problem is the Stata log file is not made in a “timely” manner in this process. Let's rerun the code with the full path for Stata:

    /Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown.do"
    echo $?
    cat Stata_Markdown.log

    0

    ___  ____  ____  ____  ____ (R)
    /__    /   ____/   /   ____/
    ___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
    Statistics/Data Analysis            StataCorp
    4905 Lakeway Drive
    College Station, Texas 77845 USA
    800-STATA-PC        http://www.stata.com
    979-696-4600        stata@stata.com
    979-696-4601 (fax)

    35-student Stata lab perpetual license:
    Serial number:  30110513240
    Licensed to:  Biostat
    Johns Hopkins University

    Notes:
    1.  10.00 MB allocated to data
    2.  Stata running in batch mode

    . do Stata_Markdown.do

    . clear

    . disp "hello world!"
    hello world!

    . exit, clear

    end of do-file

• Success! Well, it worked in that the error code was 0, but not really a “success” as nothing was printed. So what does this code for running Stata mean?
  • /Applications/Stata/Stata.app/Contents/MacOS/stata says “run stata”
  • -b says I want to run in “batch mode”, which is much different than “beast mode”.
  • Stata_Markdown.do is the filename I want to run

Now, if there was a space in the path to Stata, it needs to be quoted with ". But IMPORTANTLY, the Stata console came up and I had to hit “OK”, INTERACTIVELY!! Not very automated, but we'll fix this in a moment.

• But what about the cat Stata_Markdown.log, which is auto-generated by the Stata command? Was the log empty?

    cat Stata_Markdown.log

    ___  ____  ____  ____  ____ (R)
    /__    /   ____/   /   ____/
    ___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
    Statistics/Data Analysis            StataCorp
    4905 Lakeway Drive
    College Station, Texas 77845 USA
    800-STATA-PC        http://www.stata.com
    979-696-4600        stata@stata.com
    979-696-4601 (fax)

    35-student Stata lab perpetual license:
    Serial number:  30110513240
    Licensed to:  Biostat
    Johns Hopkins University

    Notes:
    1.  10.00 MB allocated to data
    2.  Stata running in batch mode

    . do Stata_Markdown.do

    . clear

    . disp "hello world!"
    hello world!

    . exit, clear

    end of do-file

WHAT? Running the command again gives us what we want? Now, we could use 2 code chunks, but if we set the results='hold' option in knitr, then things work fine.
• You can get around this unwanted “interactivity” using the console version of Stata, but I didn't set it up and Stata for Mac says:

> Can I display graphs with Stata(console)?
> No. Stata(console) is a text-based application and has no graphical display capabilities. However, it can generate and save Stata graphs, which can then be viewed with Stata(GUI). Stata(console) can also convert Stata graphs to PostScript and save them as files.

Also, Stata(console) for Mac needs Stata/SE or Stata/MP (aka more costly Stata) according to Section C.4 Stata(console) for Mac OS X. So for most users you'd have to buy a different Stata.

• Another way of getting around this interaction would be having Stata auto-exit; let's do that. You can exit Stata without any interaction by adding a specific option to exit, so you have exit, clear STATA. Let's look at our new script Stata_Markdown_Exit.do:

    cat Stata_Markdown_Exit.do

    clear
    disp "hello world!"
    exit, clear STATA

Now let's run it.

    /Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown_Exit.do"
    echo $?
    cat Stata_Markdown_Exit.log

    0

    ___  ____  ____  ____  ____ (R)
    /__    /   ____/   /   ____/
    ___/   /   /___/   /   /___/   11.2   Copyright 1985-2009 StataCorp LP
    Statistics/Data Analysis            StataCorp
    4905 Lakeway Drive
    College Station, Texas 77845 USA
    800-STATA-PC        http://www.stata.com
    979-696-4600        stata@stata.com
    979-696-4601 (fax)

    Serial number:  30110513240
    Johns Hopkins University

    Notes:
    1.  10.00 MB allocated to data
    2.  Stata running in batch mode

    . do Stata_Markdown_Exit.do

    . clear

    . disp "hello world!"
    hello world!

    . exit, clear STATA


It looks the same as before with no output, but I did not have to interact with Stata. Note: if you use & at the end of the command, the echo $? will come up zero, because bash will see it as a background process.

### But I don't want to show the whole script all the time

You may notice that I printed with cat the entire log that was created with Stata. Honestly, I don't like Stata logs. They seem like a nuisance. I have a script and can make outputs, so why do I need a log? But here, it seems useful. But what happens when you want to show parts of a script at different points? You can obviously make a series of .do files. Not really a good solution. What's a better solution? Create logs in your Stata code and then cat them to different code chunks. Here's an example:

    cat Stata_Markdown_logs.do

    clear
    log using print_hello.log, replace
    disp "hello world!"
    log close

    log using run_summ.log, replace
    set obs 100
    gen x = rnormal(100)
    summ x
    log close
    exit, clear STATA

    /Applications/Stata/Stata.app/Contents/MacOS/stata -b "Stata_Markdown_logs.do"

Now, since print_hello.log and run_summ.log were created, I can just do:

    cat print_hello.log

    --------------------------------------------------------------------------------------------------------
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/print_hello.log
      log type:  text
     opened on:  11 Jan 2014, 18:20:29

    . disp "hello world!"
    hello world!

    . log close
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/print_hello.log
      log type:  text
     closed on:  11 Jan 2014, 18:20:29
    --------------------------------------------------------------------------------------------------------

and then later print:

    cat run_summ.log

    --------------------------------------------------------------------------------------------------------
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/run_summ.log
      log type:  text
     opened on:  11 Jan 2014, 18:20:29

    . set obs 100
    obs was 0, now 100

    . gen x = rnormal(100)

    . summ x

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
               x |       100    100.0006    1.061928   97.11491   101.8377

    . log close
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/Stata_Markdown/run_summ.log
      log type:  text
     closed on:  11 Jan 2014, 18:20:29
    --------------------------------------------------------------------------------------------------------

### No header/footer from log

This works, but you have a header and footer that you probably can't delete with some simple option. Now, obviously you can read them in R and do string manipulation and then print them back out, but that's a little convoluted. Regardless, I wrote a simple function in R that will do it (R code):

    catlog <- function(filename, runcat = TRUE, comment = "") {
        x = readLines(filename)
        lenx = length(x)
        x = x[7:(lenx - 6)]
        writeLines(x, filename)
        if (runcat)
            cat(x, sep = "\n")
    }
    catlog("run_summ.log")

    . set obs 100
    obs was 0, now 100

    . gen x = rnormal(100)

    . summ x

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
               x |       100    100.0006    1.061928   97.11491   101.8377

which simply drops the first 6 and last 6 lines of the log and saves the trimmed log back to the same file. Thus, you can print it entirely from R, or just print the saved log file using cat from bash:

    cat run_summ.log

    . set obs 100
    obs was 0, now 100

    . gen x = rnormal(100)

    . summ x

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
               x |       100    100.0006    1.061928   97.11491   101.8377

or in bash, one example would be:

    nlines=$(awk 'END{print NR}' print_hello.log)
    nhead=$(expr $nlines - 6)
    ntail=$(expr $nlines - 12)
    head -n $nhead print_hello.log | tail -n $ntail

    . disp "hello world!"
    hello world!

or even better, let's make a function for bash that will do it:

    catlog () {
        nlines=$(awk 'END{print NR}' "$1")
        nhead=$(expr $nlines - 6)
        ntail=$(expr $nlines - 12)
        head -n $nhead "$1" | tail -n $ntail
    }
    catlog print_hello.log

    . disp "hello world!"
    hello world!

OK, I can see the allure of using StatWeave in some capacity at this point. But still, if you use knitr, this may make sense or be the way you want to do it without going to StatWeave.

### Cleanup

You can just do some .log clean up using:

    rm *.log

if you want to delete all logs in your folder (assuming you never changed directories).

### Thoughts

You can do “markdown” in Stata. My thoughts:

1. It's complicated.
2. The knitr.do file is a good start, and it's great that it's totally within Stata (you still need a Markdown converter), but it doesn't have code demarcation. It also doesn't do inline commands, which are a requirement for a dynamic doc, so you don't have to fill in the numbers and can do it dynamically with code.
3. StatWeave has more functionality than knitr.do and inline functions, but uses added software (a Java program), and can't do general markdown; the user needs to understand LaTeX.
4. Plotting hasn't really been integrated. You can always do a graph export myplot.pdf, as(pdf) (on Mac) or whatever and then just put in <img src="myplot.pdf"> in your html, or \includegraphics{myplot.pdf} in LaTeX, but that's not as integrated as it is in other systems.
5. If you make it to a Markdown document, you can use the great pandoc to potentially then just make it a Word .doc.
6. It will likely be integrated in the future. The question is how close is that “future”?

## Conclusion

I like both options for their respective pieces, but my main concern with either is putting in a lot of time and then having them become obsolete with knitr integration. That's not a big problem for me, since I know knitr, but it's something to think about for someone who doesn't. My recommendation: if you know and want to use LaTeX or need inline numbers, go with StatWeave. Otherwise knitr.do may do the trick.
Also, I've given you some directions on “growing your own”, which is the most customizable for you but even worse with respect to time, reinventing the wheel, and no support from others. Anyway, those are the current options I know about when doing Markdown with Stata.

# MagSafe Charger Reinforced

So I posted on Reinforcing a MacBook MagSafe Charger and got around to it. I know that there are ways to properly wrap a charger, but I wanted to reinforce the ends and wanted to post on the results.

Looks a little odd, but I’m happy with the Sugru and don’t expect it to fail anytime soon.

# Creating Stata dtas from R, Issues and Resolutions

So I am currently on a clinical trial and we have a very interesting export process. Essentially, the data comes in an XML format, then is converted to data sets using SAS. That’s all fine and good, because SAS is good at converting things (I’ve heard). The problem is, the people who started writing code to process the data and the majority of people maintaining the code use Stata. Now that’s a problem.

OK, well let’s take these SAS data sets and convert them to Stata using Stat/Transfer. Not totally reprehensible; it at least preserves some form versus a CSV to dta (Stata format) nightmare. The only problem with this is that for some reason, the SAS parsing of the XML started to chew up a bit of memory, for about 140 data sets (it’s a lot of forms). Oh, by the way, a bit of memory was about 16 gigs from a 100 meg file. That’s atrocious. I don’t care what it’s doing, but that’s a 160-fold increase just for converting some XML and copying the datasets. Not only that, it took over 4 hours. What the hell SAS?

Anyway, we just started a phase III trial collecting similar data from before. We’re using the same database. I decided to stop the insanity and convert the XML in R. The data still needs to end up in Stata data sets, but at least I could likely control the memory consumption and the time limits.
At least I could throw it onto our computing cluster if things got out of control (I guess I could have done that with SAS, but I’m not doing that).

Now, thank God that the XML package exists. So pretty much just using some xmlParse and xmlToDataFrame commands, I had some data.frames! Now, just make them into Stata data sets right? Not really.

Let me first say that the foreign package is awesome. You can read in pretty much any respectable statistical software dataset into R and do some exporting. Also, the SASxport package allows you to export to the XPORT format using write.xport, which is a widely used (and FDA-compliant) format.

Now what problems did I have with the foreign package?

1. I believe that Stata data sets can have length-32 variable names. After some correspondence, the maintainers argue that Stata’s documentation only supports “up to 32” characters, which they interpret as only 31. The documentation states:

> varlist contains the names of the Stata variables 1, …, nvar, each up to 32 characters in length

A week after my discussion, foreign had noted in their ChangeLog:

> man/{read,write}.dta: Freeze Stata support.

Well, I guess I’ll just change write.dta to do what I want.

a. My solution: Copy write.dta and change the 31L to 32L. Or moreover, I could have had the user pass a truncation length. But let’s default some stuff. The only concern is the command do_writeStata, which is a hidden (non-exported) function from foreign. So I just slapped a foreign:::do_writeStata on there, and away we go (not the best practice, but the only way I could think of; importFrom did not work).

2. Empty strings in R are represented as "". According to the foreign package, Stata documentation states that empty strings are not supported, which is true:

> Strings in Stata may be from 1 to 244 bytes long.

and "" has 0 bytes:

    nchar("", "bytes")

    ## [1] 0

I know from reading in Stata dta files that character variables can have data ""; it’s treated as missing.
See the Stata code below (I’m going to post about how to knit with Stata in a followup):

    /Applications/Stata/Stata.app/Contents/MacOS/stata "test.do"
    cat test.log

    --------------------------------------------------------------------------------------------------------
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
      log type:  text
     opened on:  11 Jan 2014, 13:39:54

    . set obs 1
    obs was 0, now 1

    . gen x = ""
    (1 missing value generated)

    . count if mi(x)
    1

    . count if x == ""
    1

    . log close
          name:  <unnamed>
           log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
      log type:  text
     closed on:  11 Jan 2014, 13:39:54
    --------------------------------------------------------------------------------------------------------

This isn’t a major problem, as long as you know about it.

a. My solution? Make the "" a " " (a space). Is this optimal? No. If there are true spaces in the data, then these are aliased. But who really cares about them? If you do, then change the code. If not, then great, use my code. If you have a large problem with that, throw it in the comments and someone will probably read it.

Then, in Stata, you can define the function:

    *** recaststr- making the " " to "" (to get around str0 cases)
    capture program drop recaststr
    program define recaststr
        foreach var of varlist * {
            * storage type of the current variable, e.g. str5 or float
            local vtype : type `var'
            if ( index("`vtype'", "str") > 0) replace `var' = "" if `var' == " "
        }
    end

OK, so if you ever want to use these functions,

    require(devtools)
    install_github("processVISION", "muschellij2")
    library(processVISION)

should start you off, and the functions write32.dta and create_stata_dta should be what you’re looking for. I do some attempt at formatting the columns into numeric and dates in create_stata_dta. If you don’t want that, just use the argument tryConvert=FALSE.

Happy converting.

Followup Post: How to make a feeble attempt at knitting a Stata .do file and explaining some attempts/packages out there.
# R as Food: Lists & Referencing

In R, lists can be the most powerful yet most confusing objects, specifically with respect to referencing. Essentially, lists are a general type of container that can hold almost any type of R object. For example, let’s say we have a list of foods I ate for the day. Note: whenever I’m talking about “referencing”, I’m referring to R extracting things (usually called elements or objects) from another thing (an R object).

    mylist = list(breakfast = c("eggs"), lunch = c("salad", "dressing"), dinner = c("chicken"))

Conceptually, I will refer to a piece of tupperware as a list. Now, if we use this figure (source http://3.bp.blogspot.com/-5vJn-RLtceE/UB_iAOhdw3I/AAAAAAAACPU/f9D1c4tXOE4/s1600/20120805_234940_rounded_corners.jpg), then the middle piece of tupperware would be mylist. Now, let’s say the two smaller tupperwares are breakfast and dinner and the two-compartment container is lunch. In this case, I threw the salad and dressing together. So now, to reference the object, I can say:

    mylist["breakfast"]

    $breakfast
    [1] "eggs"

    mylist[1]

    $breakfast
    [1] "eggs"

    class(mylist["breakfast"])

    [1] "list"

This is telling R that I want to grab the element named breakfast in the first line, or the first element of the list in the second (which returns the same thing because breakfast is the first element). This returns an object of type list. This would be like opening up the mylist tupperware, and taking out the breakfast tupperware. The food for breakfast is still eggs, but it’s in the tupperware breakfast. I’m hungry, so I just want to take out the eggs. I would do this by:

    mylist$breakfast

    [1] "eggs"

    mylist[["breakfast"]]

    [1] "eggs"

    mylist[[1]]

    [1] "eggs"

    class(mylist[["breakfast"]])

    [1] "character"

This tells R that I want you to return (aka give me) the objects in the element breakfast. This is like saying, I want the food in the breakfast tupperware; give me those eggs. This is the “double bracket” [[ notation for referencing, and it has the same behavior as dollar sign ($) referencing with a data.frame if your list has names. As always, you can use positional referencing by using [[1]], saying I want the contents of the first list element. The same applies to the other meals of the day, with the exception that lunch returns a vector of length 2, instead of breakfast and dinner, which have length 1 (only one piece of food).

### Lists of Lists

Now, let’s say I don’t want my salad and dressing all together in the salad, as it gets soggy by lunch time. So I put my salad in its own tupperware container and the dressing in its own. In R, this would be:

    mylist2 = list(breakfast = c("eggs"), lunch = list("salad", "dressing"), dinner = c("chicken"))

Now, breakfast and dinner are contained in the same way, but lunch is different. Now let’s take out my lunch:

    mylist2["lunch"]

    $lunch
    $lunch[[1]]
    [1] "salad"

    $lunch[[2]]
    [1] "dressing"

    mylist2[2]

    $lunch
    $lunch[[1]]
    [1] "salad"

    $lunch[[2]]
    [1] "dressing"

    class(mylist2["lunch"])

    [1] "list"

Ok, this is returning a list as before. Let’s use the double bracket or ($) referencing:

    mylist2$lunch

    [[1]]
    [1] "salad"

    [[2]]
    [1] "dressing"

    mylist2[["lunch"]]

    [[1]]
    [1] "salad"

    [[2]]
    [1] "dressing"

    mylist2[[2]]

    [[1]]
    [1] "salad"

    [[2]]
    [1] "dressing"

    class(mylist2[["lunch"]])

    [1] "list"

What gives? I used the “$”! Yes, this takes out the element lunch, but lunch is another list! It’s kind of like those Matryoshka dolls.

Then mylist2 is a list of 2 vectors (breakfast and dinner), and a 2-element list (lunch). Now if we wanted to get the first element of lunch, we could run:

    mylist2$lunch[1]

    [[1]]
    [1] "salad"

    class(mylist2$lunch[1])

    [1] "list"

    mylist2$lunch[[1]]

    [1] "salad"

    class(mylist2$lunch[[1]])

    [1] "character"


where we saw that mylist2$lunch returned a list, so we can handle referencing the same way we did with mylist from the beginning of the article.

### Why lists? WHYYY?

Now, a lot of new users approach this as: “lists are complicated/dumb/useless/too confusing/whatever” and I like to use this example:

    dataset <- data.frame(outcome = rnorm(100, mean = 2), x = rep(c(0, 1), each = 50))
    mod = lm(outcome ~ x, data = dataset)
    smod = summary(mod)
    MSE = mean((dataset$outcome - predict(mod, newdata = dataset))^2)
    mod.results = list(model = mod, smod = smod, data = dataset, MSE = MSE)


The first element of the list is a model, the second element is the summary of the model, the third element is the dataset used to fit that model, and the fourth element is the mean squared error (MSE) of that model. Linear models in R (fit using the lm function) have the class lm, but can be thought of as a list of elements:

    names(mod)

     [1] "coefficients"  "residuals"     "effects"       "rank"
     [5] "fitted.values" "assign"        "qr"            "df.residual"
     [9] "xlevels"       "call"          "terms"         "model"


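Now let’s say I wanted to get the adjusted R^2 and MSE from my results. Both are one reference away: mod.results$smod$adj.r.squared pulls the adjusted R^2 out of the stored model summary, and mod.results$MSE returns the MSE.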

### Conclusion

Overall, lists are powerful, but can be confusing when you start doing referencing. You can use single brackets [, which will return a list; that is what you would want if, say, you want mylist without breakfast:

    mylist[c("lunch", "dinner")]

    $lunch
    [1] "salad"    "dressing"

    $dinner
    [1] "chicken"


(don’t skip breakfast, it’s the most important meal of the day). Also, you can use “$” or double bracket ([[) referencing when you want to get the contents of the elements of a list, which may be a list as well. Complicated lists may not seem useful initially, but can be very convenient when storing results or things of many different types that don’t easily “fit together”.

PS. This is the way I think of σ-fields as well, but that is a whole other topic altogether.