So I am currently on a clinical trial and we have a very interesting export process. Essentially, the data comes in an XML format, then is converted to data sets using SAS. That’s all fine and good, because SAS is good at converting things (I’ve heard). The problem is, the people who started writing code to process the data and the majority of people maintaining the code use Stata.
Now that’s a problem. OK, well let’s take these SAS data sets and convert them to Stata using Stat-Transfer. Not totally reprehensible, it at least preserves some form versus a CSV to dta (Stata format) nightmare. The only problem with this is that for some reason, the SAS parsing of the XML started to chew up a bit of memory, for about 140 data sets (it’s a lot of forms). Oh, by the way, a bit of memory was about 16 gigs from a 100 meg file. That’s atrocious. I don’t care what it’s doing but an 160 fold increase in just converting some XML and copying the datasets. Not only that, it took over 4 hours. What the hell SAS?
Anyway, we just started a phase III trial collecting similar data from before. We’re using the same database. I decided to stop the insanity and convert the XML in R
. The data still needs to produce in Stata data sets, but at least I could likely control the memory consumption and the time limits. At least I could throw it onto our computing cluster if things got out of control (I guess I could have done that with SAS, but I’m not doing that).
Now, thank God that the XML
package exists. So pretty much just using some xmlParse
and xmlToDataFrame
commands, I had some data.frame
s! Now, just make them into Stata data sets right? Not really.
Let me first say that the foreign
package is awesome. You can read in pretty much any respectable statistical software dataset into R
and do some exporting. Also, the SASxport
package allows you to export to the XPORT format using write.xport
, which is a widely used (and FDA-compliant) format.
Now what problems did I have with the foreign
package?
- I believe that Stata data sets can have length-32 variable names.
After some correspondence, the maintainers argue that Stata’s
documentation only support “up to 32” characters, which they
interpret as only 31. The
documentation states:varlist contains the names of the Stata variables 1, …, nvar, each up
to 32 characters in lengthA week after my discussion,
foreign
had noted in their ChangeLog:man/{read,write}.dta: Freeze Stata support.
Well, I guess I’ll just change
write.dta
to do what I want.a. My solution: Copy
write.dta
and change the31L
to32L
. Or moreover, I could have had the user pass a truncation length. But let’s default some stuff. The only concern is the commanddo_writeStata
which is a hidden (non-exported) function fromforeign
. So I just slapped aforeign:::do_writeStata
on there, and away we go (not the best practice, but the only way I could think –importFrom
did not work). - Empty strings in
R
are represented as""
. According to theforeign
package, Stata documentation states that empty strings is not supported, which is true:
Strings in Stata may be from 1 to 244 bytes long.
and
""
has 0 bytes:nchar("", "bytes")
## [1] 0
I know from reading in Stata dta files, character variables can have data
""
, it’s treated as missing. See the Stata code below (I’m going to post about how to knit with Stata in followup)/Applications/Stata/Stata.app/Contents/MacOS/stata "test.do" cat test.log
-------------------------------------------------------------------------------------------------------- name: <unnamed> log: /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log log type: text opened on: 11 Jan 2014, 13:39:54 . set obs 1 obs was 0, now 1 . gen x = "" (1 missing value generated) . count if mi(x) 1 . count if x == "" 1 . log close name: <unnamed> log: /Users/muschellij2/Dropbox/Public/WordPress_Hopstat/XML_to_Stata/test.log log type: text closed on: 11 Jan 2014, 13:39:54 --------------------------------------------------------------------------------------------------------
This isn’t a major problem, as long as you know about it.
a. My solution? Make the
""
a" "
(a space). Is this optimal? No. If there are true spaces in the data, then these are aliased. But who really cares about them? If you do, then change the code. If not, then great, use my code. If you have a large problem with that, and throw it in the comments and someone will probably read it.
Then, in Stata, you can define the function:
*** recaststr- making the " " to ""; (to get around str0 cases) capture program drop recaststr program define recaststr foreach var of varlist * { local vtype : type `var'; if ( index("`vtype'", "str") > 0) replace `var' = "" if `var' == " "; } end
OK, so if you ever want to use these functions,
require(devtools) install_github("processVISION", "muschellij2") library(processVISION)
should start you off: and the functions write32.dta
and create_stata_dta
should be what you’re looking for. I do some attempt at formatting the columns into numeric and dates in create_stata_dta
. If you don’t want that, just use the argument tryConvert=FALSE
. Happy converting.
Followup Post: How to make a feeble attempt at knitting a Stata .do file and explaining some attempts/packages out there.
Pingback: Faster XML conversion to Data Frames | A HopStat and Jump Away
Just discovered: When writing a Stata .dta-file, the procedure should check whether the variable names conform to Stata definitions!
We get invalid variable names into Stata through a series of conversions:
We read an Excel file into R, which creates variables named “2014”, “2015” etc. Then, the data are written to a Stata .dta-file, which is subsequently read into Stata 14 with the USE command.
In Stata, all variable names must start with a letter (or underscore). Stata loads the file with no protests, but then we are stuck: The variables named with numbers only are there, but Stata refuses to save a file with such names, refuses to export it to CSV (even with option “do not include variable names on the first line”), and refuses to rename the guilty variables.
Conclusion: The Write.dta package needs some refining – or one must check the variable names before using it …
1) Why not simply export the data in CSV from R? Or moreover from Excel? I guess you’d be doing some data manipulations.
2) Do you need to go to Stata? What can’t you do in R?
I believe write.dta in the foreign package is not being maintained. I’d suggest the haven package f/Hadley Wickham (https://cran.r-project.org/web/packages/haven/index.html) for read/write Stata files.