Thursday, August 23, 2012

R and the web (for beginners), Part III: Scraping MPs' expenses in detail from the web

In this last post of my little series on R and the web (see my previous post) I explain how to extract data from a website (web scraping/screen scraping) with R. If the data you want to analyze are part of a web page, for example an HTML table (or hundreds of them), it can be very time-consuming (and boring!) to manually copy/paste all of its content, or even to type it into a spreadsheet or data frame. Instead, you can let R do the job for you!

This post is really aimed at beginners. Thus, to keep things simple, it only deals with scraping one data table from one web page: a table published by BBC NEWS containing the full range of British Members of Parliament's expenses in 2007-2008. Quite an interesting data set if you are into political scandals...


Web scraping with R

There are several R packages that might be helpful for web scraping, such as XML, RCurl, and scrapeR. In this example only the XML package is used. As a first step, you parse the whole HTML file and extract all HTML tables in it:

library(XML)

# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm" 

# parse the document for R representation:
mps.doc <- htmlParse(mps)

# get all the tables in mps.doc as data frames
mps.tabs <- readHTMLTable(mps.doc) 

mps.tabs is a list containing one data frame per HTML table found in the parsed website (mps.doc). The website contains several HTML tables (some are used to structure the page rather than to present data). The list mps.tabs has seven entries, so there were seven HTML tables in the parsed document:

length(mps.tabs)

To proceed, you need to check which of these data frames (list entries) contains the table you want (the MPs' expenses). You can do that "manually" by checking how each data frame starts and ends and comparing it with the original table on the website:

head(mps.tabs[[1]])  # and
tail(mps.tabs[[1]])  # for each of the entries 1 to 7

With only seven entries this is quite fast. Alternatively, you could write a little loop to do the job for you. The loop checks each data frame for certain conditions; in this case, the string in the first row and first column and the string in the last row and last column. According to the original table on the website these should be:
first <- "Abbott, Ms Diane"
last <- "157,841"

# ... and the loop:

for (i in seq_along(mps.tabs)) {

  lastrow <- nrow(mps.tabs[[i]])  # number of rows
  lastcol <- ncol(mps.tabs[[i]])  # number of columns

  # isTRUE() guards against empty tables, where the comparison would
  # return a zero-length logical and make if() fail:
  if (isTRUE(as.character(mps.tabs[[i]][1, 1]) == first) &&
      isTRUE(as.character(mps.tabs[[i]][lastrow, lastcol]) == last)) {
    tabi <- i
  }
}
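If you prefer to avoid the explicit loop, the same search can be written more compactly with sapply() and which(). A small self-contained sketch, with a toy list of data frames standing in for mps.tabs:

```r
first <- "Abbott, Ms Diane"
last  <- "157,841"

# toy stand-ins for mps.tabs: a layout table and the data table
tabs <- list(
  data.frame(a = "nav", b = "menu"),
  data.frame(Name = "Abbott, Ms Diane", Total = "157,841")
)

# test every table at once; which() returns the index of the match
matches <- sapply(tabs, function(tab) {
  isTRUE(as.character(tab[1, 1]) == first) &&
    isTRUE(as.character(tab[nrow(tab), ncol(tab)]) == last)
})
tabi <- which(matches)[1]
tabi  # 2 in this toy example
```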

Check if that is really what you want and extract the relevant table as a data frame:

head(mps.tabs[[tabi]])
tail(mps.tabs[[tabi]])
mps <- mps.tabs[[tabi]] 

Before you can properly analyze this data set, you have to remove the commas in the expenses columns and convert them to numeric:

money <- sapply(mps[, -1:-3], function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))

mps2 <- cbind(mps[,1:3],money)
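To see what the gsub()/as.numeric() step does, here it is on a toy expenses column:

```r
# remove the thousands separators, then convert to numeric
x <- c("157,841", "1,234", "98")
money.toy <- as.numeric(gsub(",", "", x, fixed = TRUE))
money.toy  # 157841 1234 98
```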

Now you are ready to go. For example, you could compare how total expenses are distributed across the five biggest parties:

# which are the five biggest parties (by number of MPs)?
nbig5 <- names(sort(table(mps2$Party), decreasing = TRUE)[1:5])

# subset of MPs from the five biggest parties only:
big5 <- subset(mps2, Party %in% nbig5)

# load the lattice package for a nice plot

library(lattice)

bwplot(Total ~  Party, data=big5, ylab="Total expenses per MP (in £)")


Here is the resulting plot:

[Box plot: total expenses per MP (in £), by party]

And the relevant R code in one piece:

library(XML)

# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"

# parse the document for R representation:
mps.doc <- htmlParse(mps)

# get all the tables in mps.doc as data frames:
mps.tabs <- readHTMLTable(mps.doc)

# loop to find the relevant table:
first <- "Abbott, Ms Diane"
last <- "157,841"

for (i in seq_along(mps.tabs)) {

  lastrow <- nrow(mps.tabs[[i]])  # number of rows
  lastcol <- ncol(mps.tabs[[i]])  # number of columns

  # isTRUE() guards against empty tables, where the comparison would
  # return a zero-length logical and make if() fail:
  if (isTRUE(as.character(mps.tabs[[i]][1, 1]) == first) &&
      isTRUE(as.character(mps.tabs[[i]][lastrow, lastcol]) == last)) {
    tabi <- i
  }
}

# extract the relevant table and format it:
mps <- mps.tabs[[tabi]]

money <- sapply(mps[, -1:-3], function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))

mps2 <- cbind(mps[, 1:3], money)

# which are the five biggest parties (by number of MPs)?
nbig5 <- names(sort(table(mps2$Party), decreasing = TRUE)[1:5])

# subset of MPs from the five biggest parties only:
big5 <- subset(mps2, Party %in% nbig5)

# load the lattice package for a nice plot:
library(lattice)

bwplot(Total ~ Party, data = big5, ylab = "Total expenses per MP (in £)")
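Besides the box plot, the same comparison can be made numerically, e.g. as the median total expenses per party. A minimal sketch with toy data standing in for the big5 columns:

```r
# toy stand-in for the Party/Total columns of big5
big5.toy <- data.frame(
  Party = c("Lab", "Lab", "Con", "Con"),
  Total = c(150000, 160000, 140000, 155000)
)

# median total expenses per party
med <- tapply(big5.toy$Total, big5.toy$Party, median)
med
#    Con    Lab
# 147500 155000
```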

Comments:

  1. Wonderful! Thanks so much!
    Sincerely,
    Erin

  2. Hi Erin,
    Thanks a lot for the positive feedback!
    I'm glad you like it.
    best regards,
    gtd

  3. I am in the process of creating an engine on the cloud that allows scraping data from online listings that span multiple pages, with links to embedded pages.

    Got a demo deployed here:
    http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm

    where I scraped price and image url from the main listing page, while getting description, seller_name, seller_profile_url from the corresponding embedded page.

    {
      origin_url: 'http://www.ebay.com/sch/?_nkw=GM%20Part&_pgn=1',
      columns: [{
        col_name: 'item_name',
        dom_query: 'h4 a'
      }, {
        col_name: 'item_detail_url',
        dom_query: 'h4 a',
        required_attribute: 'href',
        options: {
          columns: [{
            col_name: 'description',
            dom_query: '#desc_div'
          }, {
            col_name: 'seller_name',
            dom_query: '.mbg a[[0]]'
          }, {
            col_name: 'seller_profile_url',
            dom_query: '.mbg a[[0]]',
            required_attribute: 'href'
          }]
        }
      }, {
        col_name: 'item_image',
        dom_query: '.img img',
        required_attribute: 'src'
      }],
      next_page: {
        dom_query: '.next'
      }
    };


    I would like to get your feedback: would integrating it with R be useful?

    1. Hi,

      I am not exactly sure what you mean by "integrate into R". I think the question you should ask yourself is who will use your application, and for what (in order to answer your question above). However, here are some of my thoughts on why one would write a scraper in R (or interface it with R):
      In terms of web scraping, I use R to directly integrate the data-gathering process into the statistical analysis (on the one hand for convenience, on the other hand for reproducibility). Hence, in a broader sense, I use R to write scrapers for scientific purposes.
      However, if your application is meant to retrieve data and directly reprocess it in a web environment, R might not be the best choice. In that case, I think Perl would make more sense.

  4. Hi,

    Great work buddy!
    But this works only if the site has tables in it, right? What if I want to collect every text available on the website and then analyze it? How can I do that? Do you have code for that? Reply awaited. Thanks in advance

    1. thanks! yes, this post is specifically about how to scrape tables. probably the simplest way to "collect every text available on a website" is to

      first: read in the whole html document as text/string

      second: remove all html tags (http://stackoverflow.com/questions/3765754/remove-html-tags-from-string-r-programming)

      third: use a text mining tool to further process/analyze the remaining text (such as the tm package: http://cran.r-project.org/web/packages/tm/index.html)

      note though, that this is rather a brute force approach. for more sophisticated analyses you might want to only extract certain text elements of a website. this might be a good starting point: http://www.stat.berkeley.edu/classes/s133/Readexample.html
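A minimal self-contained sketch of steps one and two above, using the XML package on an inline HTML string instead of a live URL (asText = TRUE tells htmlParse that the input is the document itself, not a file name):

```r
library(XML)

html <- "<html><body><h1>A title</h1><p>Some text.</p><p>More text.</p></body></html>"
doc  <- htmlParse(html, asText = TRUE)

# xmlValue strips the tags and keeps only the text nodes:
txt <- xpathSApply(doc, "//body//text()", xmlValue)
txt  # "A title" "Some text." "More text."
```

From here, the remaining character vector can be handed to a text-mining tool such as the tm package (step three).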

  5. Nice post. If I run your loop though I get an error. Any idea how to fix this? Thank you.
    Error in if (as.character(mps.tabs[[i]][1, 1]) == first & as.character(mps.tabs[[i]][lastrow, :
    argument is of length zero

    1. I encountered the same error in my environment:
      R version 3.1.1 (2014-07-10) -- "Sock it to Me"
      Copyright (C) 2014 The R Foundation for Statistical Computing
      Platform: x86_64-apple-darwin13.1.0 (64-bit)

      I modified the if condition as follows and it works for me:

      if (isTRUE(as.character(mps.tabs[[i]][lastrow,lastcol])==last.entry) & isTRUE(as.character(mps.tabs[[i]][1,1])==first.entry)) {
      tabi <- i
      }
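To see why the original condition fails and isTRUE() helps, a small self-contained illustration: comparing a zero-length value yields a zero-length logical, which if() cannot handle.

```r
# indexing an empty table cell behaves like a zero-length vector:
x <- character(0)
length(x == "foo")  # 0 -> 'argument is of length zero' inside if()
isTRUE(x == "foo")  # FALSE -> always length one, safe inside if()
```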

  6. Nice Post. But when I am running the code initially

    for (i in 1:length(mps.tabs)) {

    lastrow <- nrow(mps.tabs[[i]]) # get number of rows
    lastcol <- ncol(mps.tabs[[i]])

    if (isTRUE(as.character(mps.tabs[[i]][1,1])==first) & isTRUE(as.numeric(mps.tabs[[i]][lastrow,lastcol])==last)) {

    tabi <- i

    }
    }

    used to throw the error 'argument is of length zero'. After that I changed it to

    if (isTRUE(as.character(mps.tabs[[i]][lastrow,lastcol])==last.entry) & isTRUE(as.character(mps.tabs[[i]][1,1])==first.entry)) {
    tabi <- i
    }

    but it still says 'Error: object 'tabi' not found'. Any idea how to fix it?

    Thank you

  7. I am also getting the same error as Yasho Joshi is getting... Please help
