GivenTheData: R and the web (for beginners), Part II: XML in R

Friday, June 22, 2012

R and the web (for beginners), Part II: XML in R

This second post of my little series on R and the web deals with how to access and process XML-data with R. XML is a markup language that is commonly used to interchange data over the Internet. If you want to access some online data over a webpage's API you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.

Duncan Temple Lang wrote a very helpful R-package which makes it quite easy to parse, process and generate XML-data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictive plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it to a data frame. The analysis is up to you...

How to parse/read this XML-document into R?

# install and load the necessary package

install.packages("XML")

library(XML)

# Save the URL of the xml file in a variable

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreePares-function to parse xml file directly from the web

xmlfile <- xmlTreeParse(xml.url)

# the xml file is now saved as an object you can easily work with in R:

class(xmlfile)

# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

# have a look at the XML-code of the first subnodes:

print(xmltop)[1:2]

This should look more or less like:

$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"

One can already assume how this data should look like in a matrix or data frame. The goal is to extract the XML-values from each XML-tag <> for all $PLANT nodes and save them in a data frame with a row for each plant ($PLANT-node) and a column for each tag (variable) describing it. How can you do that?

# To extract the XML-values from the document, use xmlSApply:

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))

# Finally, get the data in a data-frame and have a look at the first rows and columns

plantcat_df <- data.frame(t(plantcat),row.names=NULL)

plantcat_df[1:5,1:4]

The first rows and columns of that data frame should look like this:

               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady

Which is exactly what we need to analyze this data in R.