Friday, June 22, 2012

R and the web (for beginners), Part II: XML in R


This second post in my little series on R and the web deals with how to access and process XML data with R. XML is a markup language that is commonly used to exchange data over the Internet. If you access online data through a website's API, you are likely to get it in XML format. So here is a very simple example of how to deal with XML in R.
Duncan Temple Lang wrote a very helpful R package (XML) which makes it quite easy to parse, process and generate XML data with R. I use that package in this example. The XML document (taken from w3schools.com) used in this example describes a fictitious plant catalog. Not that thrilling, I know, but the goal of this post is not to analyze the given data but to show how to parse it and transform it into a data frame. The analysis is up to you...

How to parse/read this XML-document into R?
 
# install and load the necessary package

install.packages("XML")
library(XML)


# Save the URL of the xml file in a variable

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

# Use the xmlTreeParse function to parse the XML file directly from the web
 
xmlfile <- xmlTreeParse(xml.url)


# the xml file is now saved as an object you can easily work with in R:

class(xmlfile)



# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

# have a look at the XML-code of the first subnodes:

print(xmltop[1:2])

This should look more or less like:


$PLANT
<PLANT>
 <COMMON>Bloodroot</COMMON>
 <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
 <ZONE>4</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$2.44</PRICE>
 <AVAILABILITY>031599</AVAILABILITY>
</PLANT>

$PLANT
<PLANT>
 <COMMON>Columbine</COMMON>
 <BOTANICAL>Aquilegia canadensis</BOTANICAL>
 <ZONE>3</ZONE>
 <LIGHT>Mostly Shady</LIGHT>
 <PRICE>$9.37</PRICE>
 <AVAILABILITY>030699</AVAILABILITY>
</PLANT>

attr(,"class")
[1] "XMLNodeList"

One can already guess what this data should look like in a matrix or data frame. The goal is to extract the values from each XML tag for all $PLANT nodes and save them in a data frame with one row per plant ($PLANT node) and one column per tag (variable) describing it. How can you do that?
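Before extracting anything, it can help to look at how the tree is organized. The following is a minimal sketch and not part of the original example: it uses a shortened excerpt of the catalog, inlined as a string so that it runs without a network connection. The names xml.text and top are made up for illustration.

```r
library(XML)

# Shortened, inlined excerpt of the catalog (illustration only)
xml.text <- '<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL><ZONE>3</ZONE></PLANT>
</CATALOG>'

# Parse the string instead of a URL and take the top node
top <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

xmlName(top)                       # name of the top node
xmlSize(top)                       # number of child nodes (one per plant)
names(xmlChildren(top[[1]]))       # tag names inside the first plant
xmlValue(top[[1]][["COMMON"]])     # text stored in one of those tags
```

Each plant node has the same set of child tags here, which is what makes the matrix/data frame layout described above possible.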


# To extract the XML-values from the document, use xmlSApply:

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))


# Finally, get the data in a data-frame and have a look at the first rows and columns

plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]

The first rows and columns of that data frame should look like this:
 
               COMMON              BOTANICAL ZONE        LIGHT
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
2           Columbine   Aquilegia canadensis    3 Mostly Shady
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
4             Cowslip       Caltha palustris    4 Mostly Shady
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
Which is exactly what we need to analyze this data in R.
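
Keep in mind that xmlValue returns text, so every column produced this way holds strings (or factors, depending on the stringsAsFactors default of your R version). As a sketch of one possible next step, assuming you want PRICE as a number: the two-plant excerpt is inlined for illustration, and the names xml.text, top, values and plants are not from the example above.

```r
library(XML)

# Inlined two-plant excerpt so the sketch runs offline (illustration only)
xml.text <- '<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE><PRICE>$2.44</PRICE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><PRICE>$9.37</PRICE></PLANT>
</CATALOG>'

# Same steps as in the post: parse, extract values, build the data frame
top    <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))
values <- xmlSApply(top, function(x) xmlSApply(x, xmlValue))
plants <- data.frame(t(values), row.names = NULL, stringsAsFactors = FALSE)

# Strip the leading dollar sign and convert the remaining text to a number
plants$PRICE <- as.numeric(sub("$", "", plants$PRICE, fixed = TRUE))

str(plants)
```

Setting stringsAsFactors = FALSE keeps the columns as plain character vectors, which makes this kind of clean-up simpler than working with factors.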



41 comments:

  1. Hi
    How do I pass the parameters year="2012" and month="August" to this URL?
    It gives me no tables. Why?
    thanks
    veepsirtt

    options(RCurlOptions = list(useragent = "R"))
    library(RCurl)
    url <- "http://www.bseindia.com/histdata/categorywise_turnover.asp"
    wp = getURLContent(url)

    library(RHTMLForms)
    library(XML)
    doc = htmlParse(wp, asText = TRUE)
    form = getHTMLFormDescription(doc)[[1]]
    fun = createFunction(form)
    o = fun(mmm = "9", yyy = "2012",url="http://www.bseindia.com/histdata/categorywise_turnover.asp")

    table = readHTMLTable(htmlParse(o, asText = TRUE),
    header = TRUE,
    stringsAsFactors = FALSE)
    table

  2. Hi veepsirtt,

    I'm not very familiar with the RHTMLForms package, so I might be the wrong guy to answer this question. Nevertheless, I guess the problem already occurs in your call to createFunction(). With your code, I get the following from that line:

    Error in if (action != "") formDescription$url = toString.URI(mergeURI(URI(action), :
    missing value where TRUE/FALSE needed

    Something seems to be wrong with the formDescription argument you are passing to createFunction().

    I'd recommend carefully checking the documentation of this function and, in the worst case, contacting the author of the function if the problem doesn't show up in any forum or mailing list.

  3. Hi, any thoughts on how to extract data from an embedded spreadsheet, as is in the following example: http://pakistanbodycount.org/drone_attack

    Thanks in advance!

  4. Hi Andrew,

    A good general starting point is to use Firebug (a Firefox extension) to inspect the website with the data you are interested in.

    What you refer to in your example as an "embedded spreadsheet" seems in the end to be an HTML table, for which the same techniques as described in my post on web scraping should work: http://giventhedata.blogspot.com/2012/08/r-and-web-for-beginners-part-iii.html

    Mind, though, that scraping data from a website, as in your example, is often a lot trickier than querying/extracting data from an XML document.

    best regards

  5. Thanks! Seem to be getting an error at the second step (mps.doc <- htmlParse(mps)) but am new to this and will keep playing. Appreciate the feedback!

    D

  6. Hi

    You can do this and get the same result :D

    plantcat_df <-xmlToDataFrame(xml.url)

    Replies
    1. Hi Claudio

      you've correctly pointed out that the XML package also comes with a convenient function (xmlToDataFrame) to "extract data from a simple XML document". There are mainly two reasons why I didn't want to point to that function in this post:

      1) If you are a novice in XML/R, you don't learn anything by just using xmlToDataFrame in the above example. The explicit aim of the post is to give some insight into how one can work with XML documents in R.

      2) As the documentation of xmlToDataFrame mentions, this function is made for "simple" XML documents. You will notice what this means as soon as you try to use xmlToDataFrame on a more complex XML structure than the very simple example above.

      A third, rather minor point is that even if xmlToDataFrame works in your setting, it is likely to be less efficient than a self-made function written with the functions pointed out in the example.

      Anyway, thanks for pointing this out! Mentioning the convenient function in the concluding remarks of my post would not have been a bad idea.

      Best,

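
To make the comparison in this exchange concrete, here is a small sketch that runs both approaches on the same flat document. It is an illustration only: the two-plant excerpt is inlined, and the names xml.text, manual_df and quick_df are made up. On a document this simple, both routes should yield the same columns.

```r
library(XML)

# Inlined two-plant excerpt (illustration only)
xml.text <- '<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>'

# Explicit approach from the post: walk the nodes and collect their values
top <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))
manual_df <- data.frame(t(xmlSApply(top, function(x) xmlSApply(x, xmlValue))),
                        row.names = NULL, stringsAsFactors = FALSE)

# Convenience function: one call on the parsed document
doc <- xmlParse(xml.text, asText = TRUE)
quick_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

manual_df
quick_df
```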
  7. Hi there. Any thoughts on good ways to access very long and complicated XML documents? For example this: http://pastebin.com/tFVwyJgt

  14. Thank you this was very useful!

  15. Hi, I tried to replicate this code for a different XML file. However, when I use xmlSApply to create a matrix object, I get a list object instead. I know that this is because some observations contain more values than others, but I don't know how to overcome the problem. Can you please help me? Thanks
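
One way to handle records with differing tags is to collect the union of all tag names and fill the gaps with NA. The following is only a sketch under assumptions: the inlined document is invented to reproduce the situation (the second plant has an extra LIGHT tag), the names records, fields, rows and ragged_df are hypothetical, and a tag that repeats within one record would need extra handling because only its first occurrence is picked up.

```r
library(XML)

# Invented example: the second record has one tag more than the first
xml.text <- '<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT></PLANT>
</CATALOG>'

top <- xmlRoot(xmlTreeParse(xml.text, asText = TRUE))

# One named character vector per record; their lengths differ
records <- lapply(xmlChildren(top), function(x) xmlSApply(x, xmlValue))

# Union of all tag names, in order of first appearance
fields <- unique(unlist(lapply(records, names)))

# Look up every field in every record; tags a record lacks become NA
rows <- lapply(records, function(r) unname(r[fields]))

ragged_df <- data.frame(do.call(rbind, rows),
                        row.names = NULL, stringsAsFactors = FALSE)
names(ragged_df) <- fields
ragged_df
```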

  25. Error: 1: Unknown IO error
    2: failed to load external entity "http://www.w3schools.com/xml/plant_catalog.xml"
    Why is it showing this error?
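
This message indicates that the parser could not retrieve the document. One possible cause, which is an assumption here and not something confirmed in the post, is that the address has moved or is now served only over HTTPS, which the parser may not be able to fetch by itself. A workaround sketch is to read the text with base R first and hand it to the parser as a string. The helper name parse_xml_text is hypothetical, and the demonstration uses a temporary file in place of the download so that it runs offline.

```r
library(XML)

# Hypothetical helper: read the document with base R, then parse the text
# instead of letting the XML package open the address itself.
parse_xml_text <- function(path_or_url) {
  xml.lines <- readLines(path_or_url, warn = FALSE)
  xmlTreeParse(paste(xml.lines, collapse = "\n"), asText = TRUE)
}

# Offline demonstration: a temporary file stands in for the downloaded document
tmp <- tempfile(fileext = ".xml")
writeLines("<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT></CATALOG>", tmp)

xmltop <- xmlRoot(parse_xml_text(tmp))
xmlName(xmltop)
```

With a live connection, the call would be parse_xml_text(xml.url), possibly with https:// in place of http:// in the address. Whether that resolves the error depends on where the file is currently hosted and on how your R installation handles HTTPS.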
