In this last post of my little series on R and the web (see my latest post), I explain how to extract data from a website (web scraping/screen scraping) with R. If the data you want to analyze are part of a web page, for example an HTML table (or hundreds of them), it can be very time-consuming (and boring!) to manually copy and paste all of the content, or even to type it into a spreadsheet or data frame. Instead, you can let R do the job for you!
This post is really aimed at beginners. To keep things simple, it only deals with scraping one data table from one web page: a table published by BBC NEWS containing the full range of British Members of Parliament's expenses in 2007-2008. Quite an interesting data set if you are into political scandals...
Web scraping with R
library(XML)
# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"
# parse the document for R representation:
mps.doc <- htmlParse(mps)
# get all the tables in mps.doc as data frames
mps.tabs <- readHTMLTable(mps.doc)
mps.tabs is a list in which each element holds one HTML table from the parsed website (mps.doc) as a data frame. The website contains several HTML tables (some are used to structure the page layout rather than to present data). The list mps.tabs actually has seven entries, so there were seven HTML tables in the parsed document:
length(mps.tabs)
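To see what readHTMLTable() returns without depending on the BBC page still being online, here is a minimal sketch that parses a small HTML string instead. The string toy.html and everything in it are made up for illustration; only htmlParse() and readHTMLTable() come from the XML package used above.

```r
library(XML)

# made-up HTML with one layout table and one data table (illustration only)
toy.html <- paste0(
  "<html><body>",
  "<table><tr><td>navigation</td></tr></table>",
  "<table>",
  "<tr><th>Name</th><th>Total</th></tr>",
  "<tr><td>Abbott, Ms Diane</td><td>1,000</td></tr>",
  "<tr><td>Zed, Mr Example</td><td>2,500</td></tr>",
  "</table>",
  "</body></html>"
)

# asText = TRUE: the input is HTML content, not a URL or a file name
toy.doc <- htmlParse(toy.html, asText = TRUE)

# one list entry per <table> element, each converted to a data frame
toy.tabs <- readHTMLTable(toy.doc, stringsAsFactors = FALSE)

length(toy.tabs)       # number of tables found in the document
lapply(toy.tabs, dim)  # rows and columns of each table
```

As on the real page, the layout table shows up in the list next to the data table, which is why you have to check which entry is the one you want.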
To proceed, you need to check which of these data frames (list entries) contains the table you want (the MPs' expenses). You can do that "manually" by checking how each data frame starts and ends and comparing it with the original table on the website:
head(mps.tabs[[1]]) # first rows of entry 1 ...
tail(mps.tabs[[1]]) # ... and its last rows; repeat for entries 2 to 7
With only seven entries this is quite fast. Alternatively, you could write a little loop to do the job for you. The loop checks each data frame for certain conditions, in this case the string in the first row and first column and the string in the last row and last column. According to the original table on the website, these should be:
first <- "Abbott, Ms Diane"
last <- "157,841"
# ... and the loop:
for (i in seq_along(mps.tabs)) {
  lastrow <- nrow(mps.tabs[[i]]) # get number of rows
  lastcol <- ncol(mps.tabs[[i]]) # get number of columns
  # compare the first and the last cell with the expected strings
  if (as.character(mps.tabs[[i]][1, 1]) == first &&
      as.character(mps.tabs[[i]][lastrow, lastcol]) == last) {
    tabi <- i
  }
}
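The same check can be wrapped in a small function that also copes with list entries that are NULL, empty, or contain NA in the compared cells, any of which would make the plain if() condition fail with an error. The following is only a sketch: find_table() is a hypothetical helper (not part of the XML package), and toy.tabs is a made-up list standing in for mps.tabs.

```r
# hypothetical helper: index of the first data frame whose top-left and
# bottom-right cells match the given strings, or NA if none matches
find_table <- function(tabs, first, last) {
  for (i in seq_along(tabs)) {
    tab <- tabs[[i]]
    # skip entries that are NULL or have no cells at all
    if (is.null(tab) || nrow(tab) == 0 || ncol(tab) == 0) next
    top_left <- as.character(tab[1, 1])
    bottom_right <- as.character(tab[nrow(tab), ncol(tab)])
    # isTRUE() treats NA comparisons as "no match" instead of raising an error
    if (isTRUE(top_left == first) && isTRUE(bottom_right == last)) return(i)
  }
  NA_integer_ # no table matched
}

# toy list standing in for mps.tabs (made-up values)
toy.tabs <- list(
  NULL,
  data.frame(a = "menu", stringsAsFactors = FALSE),
  data.frame(Name = c("Abbott, Ms Diane", "Zed, Mr Example"),
             Total = c("1,000", "157,841"),
             stringsAsFactors = FALSE)
)

tabi <- find_table(toy.tabs, "Abbott, Ms Diane", "157,841")
tabi
```

Returning NA when nothing matches makes it obvious that the page layout has changed, instead of silently reusing an old value of tabi.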
Check if that is really what you want and extract the relevant table as a data frame:
head(mps.tabs[[tabi]])
tail(mps.tabs[[tabi]])
mps <- mps.tabs[[tabi]]
Before you can properly analyze this data set, you have to remove the commas in the expense columns and convert them to numeric:
money <- sapply(mps[, -1:-3], FUN = function(x)
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))
mps2 <- cbind(mps[, 1:3], money)
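To check what this conversion does, you can try it on a tiny made-up table with the same shape as the scraped one (three descriptive columns followed by amounts stored as text). All values and column names below are invented for illustration.

```r
# made-up expense table: three descriptive columns, then amounts as text
toy <- data.frame(
  Name = c("A", "B"),
  Party = c("X", "Y"),
  Seat = c("S1", "S2"),
  Travel = c("1,250", "300"),
  Total = c("157,841", "12,000"),
  stringsAsFactors = FALSE
)

# drop the first three columns, strip the thousands separators,
# and convert the remaining columns to numeric
money <- sapply(toy[, -(1:3)], function(x)
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))
toy2 <- cbind(toy[, 1:3], money)

str(toy2)        # Travel and Total are now numeric
sum(toy2$Total)  # arithmetic works on the converted column
```

The as.character() call matters when the columns arrive as factors: calling as.numeric() directly on a factor would return the internal level codes, not the amounts.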
Now you are ready to go... For example, you could compare how the total expenses are distributed for each of the five biggest parties:
# which are the five biggest parties by number of MPs?
nbig5 <- names(summary(mps2$Party)[order(summary(mps2$Party) * -1)][1:5])
# subset of MPs only with the five biggest parties:
big5 <- subset(mps2, mps2$Party %in% nbig5)
# load the lattice package for a nice plot
library(lattice)
bwplot(Total ~ Party, data = big5, ylab = "Total expenses per MP (in £)")
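The counting-and-subsetting step can be tried on a small made-up vector of party labels. This sketch uses table() instead of summary(): summary() only returns counts when Party is a factor, while table() also works when the column is plain character (which depends on how the table was read in). The party labels and amounts are invented.

```r
# made-up party labels, for illustration only
party <- c("LAB", "CON", "LAB", "LD", "CON", "LAB", "SNP", "DUP", "PC", "LD")

# count MPs per party and sort from largest to smallest
counts <- sort(table(party), decreasing = TRUE)

# names of the three biggest parties in this toy vector
top3 <- names(counts)[1:3]
top3

# keep only the rows that belong to those parties
toy <- data.frame(Party = party, Total = seq_along(party) * 1000)
big <- toy[toy$Party %in% top3, ]

# droplevels() removes unused factor levels so they do not show up as
# empty boxes in a plot (harmless if Party is a character column)
big <- droplevels(big)
nrow(big)
```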
And the relevant R code in one piece:
library(XML)
# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"
# parse the document for R representation:
mps.doc <- htmlParse(mps)
# get all the tables in mps.doc as data frames
mps.tabs <- readHTMLTable(mps.doc)
# loop to find relevant table:
first <- "Abbott, Ms Diane"
last <- "157,841"
for (i in seq_along(mps.tabs)) {
  lastrow <- nrow(mps.tabs[[i]]) # get number of rows
  lastcol <- ncol(mps.tabs[[i]]) # get number of columns
  if (as.character(mps.tabs[[i]][1, 1]) == first &&
      as.character(mps.tabs[[i]][lastrow, lastcol]) == last) {
    tabi <- i
  }
}
# extract the relevant table and format it:
mps <- mps.tabs[[tabi]]
money <- sapply(mps[, -1:-3], FUN = function(x)
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE)))
mps2 <- cbind(mps[, 1:3], money)
# which are the five biggest parties by number of MPs?
nbig5 <- names(summary(mps2$Party)[order(summary(mps2$Party) * -1)][1:5])
# subset of MPs only with the five biggest parties:
big5 <- subset(mps2, mps2$Party %in% nbig5)
# load the lattice package for a nice plot
library(lattice)
bwplot(Total ~ Party, data = big5, ylab = "Total expenses per MP (in £)")
More web scraping examples on r-bloggers.com
If you are interested in more web scraping with R, check out the following links to posts with more advanced/specific examples presented on r-bloggers.com:
http://www.r-bloggers.com/web-scraping-in-r/
http://www.r-bloggers.com/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors%E2%80%99-r-blog-posts-e-g-number-of-likes-comments-shares-etc/
http://www.r-bloggers.com/web-scraping-yahoo-search-page-via-xpath/
http://www.r-bloggers.com/how-to-buy-a-used-car-with-r-part-1/

Wonderful! Thanks so much!
Sincerely,
Erin
Hi Erin,
Thanks a lot for the positive feedback!
I'm glad you like it.
best regards,
gtd
I am in the process of creating an engine on the cloud that allows scraping of data from online listings that span multiple pages with links to embedded pages.
Got a demo deployed here
http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm
where I scraped price and image url from the main listing page, while getting description, seller_name, seller_profile_url from the corresponding embedded page.
{
  origin_url: 'http://www.ebay.com/sch/?_nkw=GM%20Part&_pgn=1',
  columns: [
    {
      col_name: 'item_name',
      dom_query: 'h4 a'
    }, {
      col_name: 'item_detail_url',
      dom_query: 'h4 a',
      required_attribute: 'href',
      options : {
        columns: [{
          col_name: 'description',
          dom_query: '#desc_div'
        },{
          col_name: 'seller_name',
          dom_query: '.mbg a[[0]]'
        },{
          col_name: 'seller_profile_url',
          dom_query: '.mbg a[[0]]',
          required_attribute: 'href'
        }]
      }
    }, {
      col_name: 'item_image',
      dom_query: '.img img',
      required_attribute: 'src'
    }
  ],
  next_page: {
    dom_query: '.next'
  }
};
Would like to get your feedback to know if having it integrated into R would be useful?
Hi,
I am not exactly sure what you mean by "integrate into R". I think the question you should ask yourself (in order to answer your question above) is who will use your application and for what. However, here are some of my thoughts on why one would write a scraper in R (or interface it with R...):
In terms of web scraping, I use R to integrate the data gathering process directly into the statistical analysis (on the one hand for convenience, on the other hand for reproducibility). Hence, in a broader sense, I use R to write scrapers for scientific purposes.
However, if your application is meant to retrieve data and directly reprocess it in a web environment, R might not be the best choice. In that case, I think, Perl would make more sense.
Hi,
Great work buddy!
But this works only if the site has tables in it, right? What if I want to collect all the text available on the website and then analyze it? How can I do that? Do you have code for that? Reply awaited. Thanks in advance.
thanks! yes, this post is specifically about how to scrape tables. probably the simplest way to "collect every text available on a website" is to
first: read in the whole html document as text/string
second: remove all html tags (http://stackoverflow.com/questions/3765754/remove-html-tags-from-string-r-programming)
third: use a text mining tool to further process/analyze the remaining text (such as the tm package: http://cran.r-project.org/web/packages/tm/index.html)
note, though, that this is a rather brute-force approach. for more sophisticated analyses you might want to extract only certain text elements of a website. this might be a good starting point: http://www.stat.berkeley.edu/classes/s133/Readexample.html
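The three steps above can be sketched in base R roughly as follows. This is only an illustration under simple assumptions: the string page is made up and stands in for a downloaded document, the regular expressions are a crude way to strip tags (they are not an HTML parser), and the word count at the end is a stand-in for a proper text mining step.

```r
# first: the whole html document as one string (made-up content here; for a
# real page one could use paste(readLines(url), collapse = "\n") instead)
page <- paste0(
  "<html><head><style>p {color: red;}</style></head><body>",
  "<h1>Expenses</h1><p>MPs claimed <b>a lot</b>.</p>",
  "<script>var x = 1;</script></body></html>"
)

# second: drop <script> and <style> blocks including their content ...
txt <- gsub("(?s)<(script|style)[^>]*>.*?</\\1>", " ", page,
            ignore.case = TRUE, perl = TRUE)
# ... then remove all remaining tags and collapse repeated whitespace
txt <- gsub("<[^>]+>", " ", txt)
txt <- trimws(gsub("\\s+", " ", txt))
txt

# third: a very simple word count on the remaining text
words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))
words <- words[words != ""]
sort(table(words), decreasing = TRUE)
```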