Creating Stata dtas from R, Issues and Resolutions

So I am currently on a clinical trial and we have a very interesting export process. Essentially, the data comes in an XML format, then is converted to data sets using SAS. That’s all fine and good, because SAS is good at converting things (I’ve heard). The problem is, the people who started writing code to process the data and the majority of people maintaining the code use Stata.

Now that’s a problem. OK, well let’s take these SAS data sets and convert them to Stata using Stat-Transfer. Not totally reprehensible, it at least preserves some form versus a CSV to dta (Stata format) nightmare. The only problem with this is that for some reason, the SAS parsing of the XML started to chew up a bit of memory, for about 140 data sets (it’s a lot of forms). Oh, by the way, a bit of memory was about 16 gigs from a 100 meg file. That’s atrocious. I don’t care what it’s doing but an 160 fold increase in just converting some XML and copying the datasets. Not only that, it took over 4 hours. What the hell SAS?

Anyway, we just started a phase III trial collecting similar data from before. We’re using the same database. I decided to stop the insanity and convert the XML in R. The data still needs to produce in Stata data sets, but at least I could likely control the memory consumption and the time limits. At least I could throw it onto our computing cluster if things got out of control (I guess I could have done that with SAS, but I’m not doing that).

Now, thank God that the XML package exists. So pretty much just using some xmlParse and xmlToDataFrame commands, I had some data.frames! Now, just make them into Stata data sets right? Not really.

Let me first say that the foreign package is awesome. You can read in pretty much any respectable statistical software dataset into R and do some exporting. Also, the SASxport package allows you to export to the XPORT format using write.xport, which is a widely used (and FDA-compliant) format.

Now what problems did I have with the foreign package?

  1. I believe that Stata data sets can have length-32 variable names.
    After some correspondence, the maintainers argue that Stata’s
    documentation only support “up to 32” characters, which they
    interpret as only 31. The
    documentation states:

    varlist contains the names of the Stata variables 1, …, nvar, each up
    to 32 characters in length

    A week after my discussion, foreign had noted in their ChangeLog:

    man/{read,write}.dta: Freeze Stata support.

    Well, I guess I’ll just change write.dta to do what I want.

    a. My solution: Copy write.dta and change the 31L to 32L. Or moreover, I could have had the user pass a truncation length. But let’s default some stuff. The only concern is the command do_writeStata which is a hidden (non-exported) function from foreign. So I just slapped a foreign:::do_writeStata on there, and away we go (not the best practice, but the only way I could think – importFrom did not work).

  2. Empty strings in R are represented as "". According to the foreign package, Stata documentation states that empty strings is not supported, which is true:

    Strings in Stata may be from 1 to 244 bytes long.

    and "" has 0 bytes:

    nchar("", "bytes")
    
    ## [1] 0
    

    I know from reading in Stata dta files, character variables can have data "", it’s treated as missing. See the Stata code below (I’m going to post about how to knit with Stata in followup)

    /Applications/Stata/Stata.app/Contents/MacOS/stata "test.do"
    cat test.log
    
    --------------------------------------------------------------------------------------------------------
        name:  <unnamed>
         log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
    log type:  text
    opened on:  11 Jan 2014, 13:39:54
    
    . set obs 1
    obs was 0, now 1
    
    . gen x = ""
    (1 missing value generated)
    
    . count if mi(x)
      1
    
    . count if x == ""
      1
    
    . log close
        name:  <unnamed>
         log:  /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log
    log type:  text
    closed on:  11 Jan 2014, 13:39:54
    --------------------------------------------------------------------------------------------------------
    

    This isn’t a major problem, as long as you know about it.

    a. My solution? Make the "" a " " (a space). Is this optimal? No. If there are true spaces in the data, then these are aliased. But who really cares about them? If you do, then change the code. If not, then great, use my code. If you have a large problem with that, and throw it in the comments and someone will probably read it.

Then, in Stata, you can define the function:

*** recaststr- making the " "  to ""; (to get around str0 cases)
capture program drop recaststr
program define recaststr
  foreach var of varlist * {
        local vtype : type `var';
        if ( index("`vtype'", "str") > 0) replace `var' = "" if `var' == " ";
    }
end

OK, so if you ever want to use these functions,

require(devtools)
install_github("processVISION", "muschellij2")
library(processVISION)

should start you off: and the functions write32.dta and create_stata_dta should be what you’re looking for. I do some attempt at formatting the columns into numeric and dates in create_stata_dta. If you don’t want that, just use the argument tryConvert=FALSE. Happy converting.

Followup Post: How to make a feeble attempt at knitting a Stata .do file and explaining some attempts/packages out there.

Advertisements

3 thoughts on “Creating Stata dtas from R, Issues and Resolutions

  1. Pingback: Faster XML conversion to Data Frames | A HopStat and Jump Away

  2. Just discovered: When writing a Stata .dta-file, the procedure should check whether the variable names conform to Stata definitions!
    We get invalid variable names into Stata through a series of conversions:
    We read an Excel file into R, which creates variables named “2014”, “2015” etc. Then, the data are written to a Stata .dta-file, which is subsequently read into Stata 14 with the USE command.
    In Stata, all variable names must start with a letter (or underscore). Stata loads the file with no protests, but then we are stuck: The variables named with numbers only are there, but Stata refuses to save a file with such names, refuses to export it to CSV (even with option “do not include variable names on the first line”), and refuses to rename the guilty variables.
    Conclusion: The Write.dta package needs some refining – or one must check the variable names before using it …

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s