This second post in my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data through a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package (XML) which makes it quite easy to parse, process and generate XML data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictitious plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you...
How do you parse/read this XML document into R?
# install and load the necessary package
install.packages("XML")
library(XML)
# Save the URL of the xml file in a variable
xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
# Use the xmlTreeParse function to parse the XML file directly from the web
xmlfile <- xmlTreeParse(xml.url)
# the xml file is now saved as an object you can easily work with in R:
class(xmlfile)
# Use the xmlRoot function to access the top node
xmltop <- xmlRoot(xmlfile)
# have a look at the XML code of the first two subnodes
# (subset first, then print; print(xmltop)[1:2] would print the whole document)
print(xmltop[1:2])
This should look more or less like:
$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"
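Before extracting anything, it can help to poke around the parsed tree a little. The following is only a minimal sketch: it assumes a tiny hand-written XML string (xml.text, made up for illustration and much smaller than the real catalog) so that it runs without a network connection, and the variable names are my own.

```r
library(XML)

# A tiny stand-in document (an assumption for illustration, not the original file)
xml.text <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"

# asText = TRUE tells xmlTreeParse that the input is XML content, not a file name or URL
top <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

root.name    <- xmlName(top)                     # name of the top node
n.plants     <- xmlSize(top)                     # number of child nodes
child.names  <- unname(names(top[[1]]))          # tag names inside the first PLANT node
first.common <- xmlValue(top[[1]][["COMMON"]])   # text content of one tag

print(root.name)
print(n.plants)
print(child.names)
print(first.common)
```

The same calls should work on xmltop from the example above; only the numbers and names returned will differ.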
One can already guess what this data should look like in a matrix or data frame. The goal is to extract the value enclosed by each XML tag for all $PLANT nodes and save these values in a data frame with a row for each plant ($PLANT node) and a column for each tag (variable) describing it. How can you do that?
# To extract the XML-values from the document, use xmlSApply:
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
# Finally, get the data in a data frame and have a look at the first rows and columns
plantcat_df <- data.frame(t(plantcat), row.names = NULL)
plantcat_df[1:5,1:4]
The first rows and columns of that data frame should look like this:
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
Which is exactly what we need to analyze this data in R.
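One thing to keep in mind is that xmlValue returns text, so every column extracted this way holds strings rather than numbers. Here is a minimal sketch of the whole pipeline plus a type conversion, again assuming a small hand-written XML string (xml.text, invented for illustration) instead of the real catalog. The names plants and vals are my own, and the conversion of ZONE only works here because the made-up values are plain integers.

```r
library(XML)

# A tiny stand-in document (an assumption for illustration, not the original file)
xml.text <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE><PRICE>$2.44</PRICE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><PRICE>$9.37</PRICE></PLANT>
</CATALOG>"

# Parse the text and grab the top node
top <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

# Same extraction as in the post: one column per PLANT node, one row per tag
vals <- xmlSApply(top, function(x) xmlSApply(x, xmlValue))

# Transpose so that each plant becomes a row; keep strings as strings
plants <- data.frame(t(vals), row.names = NULL, stringsAsFactors = FALSE)

# Convert the text columns to numbers: strip the dollar sign from PRICE first
plants$PRICE <- as.numeric(sub("$", "", plants$PRICE, fixed = TRUE))
plants$ZONE  <- as.integer(plants$ZONE)

str(plants)
```

With the columns converted, the usual tools such as summary() or aggregate() can be applied directly.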
Hi
How can I pass the parameters year="2012" and month="August" to this URL?
It gives me no tables. Why?
thanks
veepsirtt
options(RCurlOptions = list(useragent = "R"))
library(RCurl)
url <- "http://www.bseindia.com/histdata/categorywise_turnover.asp"
wp = getURLContent(url)
library(RHTMLForms)
library(XML)
doc = htmlParse(wp, asText = TRUE)
form = getHTMLFormDescription(doc)[[1]]
fun = createFunction(form)
o = fun(mmm = "9", yyy = "2012",url="http://www.bseindia.com/histdata/categorywise_turnover.asp")
table = readHTMLTable(htmlParse(o, asText = TRUE),
header = TRUE,
stringsAsFactors = FALSE)
table
Hi veepsirtt,
I'm not very familiar with the RHTMLForms package, so I might be the wrong guy to answer this question. Nevertheless, I guess the problem already occurs in your call to createFunction(). With your code I get the following from that line:
Error in if (action != "") formDescription$url = toString.URI(mergeURI(URI(action), :
missing value where TRUE/FALSE needed
Something seems to be wrong with the formDescription argument you are passing to createFunction().
I'd recommend that you carefully check the documentation of this function and, in the worst case, contact the author of the function if that problem doesn't show up in any forum or mailing list.
Hi, any thoughts on how to extract data from an embedded spreadsheet, as is in the following example: http://pakistanbodycount.org/drone_attack
Thanks in advance!
Hi Andrew,
A good general starting point is to use Firebug (a Firefox extension) to inspect the website with the data you are interested in.
What you refer to in your example as an "embedded spreadsheet" seems in the end to be an HTML table, for which the same techniques as described in my post on web scraping should work: http://giventhedata.blogspot.com/2012/08/r-and-web-for-beginners-part-iii.html
Mind, though, that scraping data from a website, as in your example, is often a lot trickier than querying/extracting data from an XML document.
best regards
Thanks! Seem to be getting an error at the second step (mps.doc <- htmlParse(mps)) but am new to this and will keep playing. Appreciate the feedback!
D
Hi
You can do this and get the same result :D
plantcat_df <- xmlToDataFrame(xml.url)
Hi Claudio
You've correctly pointed out that the XML package also comes with a convenient function (xmlToDataFrame) to "extract data from a simple XML document". There are mainly two reasons why I didn't want to point to that function in this post:
1) If you are a novice in XML/R, you don't learn anything by just using xmlToDataFrame in the above example. The explicit aim of the post is to give some insight into how one can work with XML documents in R.
2) As the documentation of xmlToDataFrame mentions, this function is made for "simple" XML documents. You will notice what this means as soon as you try to use xmlToDataFrame on an XML structure that is more complex than the very simple example above.
A third, rather minor point is that even if xmlToDataFrame works in your setting, it is likely to be less efficient than a self-made function written with the functions pointed out in the example.
Anyway, thanks for pointing this out! Mentioning the convenient function as a concluding remark in my post would not have been a bad idea.
Best,