In this last post of my little series on R and the web (see my latest post), I explain how to extract data from a website (web scraping/screen scraping) with R. If the data you want to analyze are part of a web page, for example an HTML table (or hundreds of them), it can be very time-consuming (and boring!) to copy and paste all of its content manually, or even type it into a spreadsheet or data frame. Instead, you can let R do the job for you!
This post is aimed at beginners. To keep things simple, it deals only with scraping one data table from one web page: a table published by BBC NEWS containing the full range of British MPs' expenses in 2007-2008. Quite an interesting data set if you are into political scandals...
Web scraping with R
library(XML)
# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"
# parse the document for R representation:
mps.doc <- htmlParse(mps)
# get all the tables in mps.doc as data frames
mps.tabs <- readHTMLTable(mps.doc)
mps.tabs is a list in which each element holds one HTML table from the parsed website (mps.doc) as a data.frame. The website contains several HTML tables (some are used to structure the page rather than to present data). The list mps.tabs has seven entries, hence there were seven HTML tables in the parsed document:
length(mps.tabs)
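If you want to try this step without depending on the BBC page, a minimal sketch along the following lines may help. The HTML string and the names `toy.doc` and `toy.tabs` are made up for illustration; the only assumption beyond the code above is that `htmlParse()` is given `asText = TRUE` so that it parses a string instead of a URL.

```r
library(XML)

# A tiny stand-in page: one layout table and one data table (made-up values)
html <- '<html><body>
<table><tr><td>navigation</td></tr></table>
<table>
<tr><th>Name</th><th>Total</th></tr>
<tr><td>Doe, Ms Jane</td><td>1,200</td></tr>
<tr><td>Roe, Mr John</td><td>3,400</td></tr>
</table>
</body></html>'

toy.doc <- htmlParse(html, asText = TRUE)  # parse the string instead of a URL
toy.tabs <- readHTMLTable(toy.doc)         # one list entry per <table> element

length(toy.tabs)  # number of tables found in the toy page
toy.tabs[[2]]     # the entry for the data table
```

As on the real page, the list has one entry per table element, whether or not that table holds data.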
To proceed, you need to check which of these data frames (list entries) contains the table you want (the MPs' expenses). You can do that "manually" by checking how each data frame starts and ends and comparing it with the original table on the website:
head(mps.tabs[[1]]) # and
tail(mps.tabs[[1]]) # repeat for entries 1 to 7
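Instead of calling head() and tail() seven times, you could first get an overview of the size of every list entry. The sketch below uses a hypothetical stand-in list called `tabs` (not the real mps.tabs) and assumes that some entries may be NULL, which would be consistent with the "argument is of length zero" error some readers have reported.

```r
# Hypothetical stand-in for mps.tabs: two data frames and one NULL entry
tabs <- list(
  data.frame(V1 = "menu", stringsAsFactors = FALSE),
  NULL,
  data.frame(Name = c("Doe, Ms Jane", "Roe, Mr John"),
             Total = c("1,200", "3,400"),
             stringsAsFactors = FALSE)
)

# Rows and columns per entry; NULL entries are reported as NA
tab.dims <- t(sapply(tabs, function(x) {
  if (is.null(x)) c(rows = NA, cols = NA) else c(rows = nrow(x), cols = ncol(x))
}))
tab.dims
```

A table with several hundred rows is then easy to spot as the likely data table.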
With only seven entries this is quite fast. Alternatively, you could write a little loop to do the job for you. The loop checks each data frame for certain conditions, in this case the string in the first row and first column and the string in the last row and last column. According to the original table on the website, these should be:
first <- "Abbott, Ms Diane"
last <- "157,841"
# ... and the loop:
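tabi <- NA # stays NA if no table matches
for (i in seq_along(mps.tabs)) {
  tab <- mps.tabs[[i]]
  # skip entries that are NULL or empty (they cause "argument is of length zero")
  if (is.null(tab) || nrow(tab) == 0) next
  lastrow <- nrow(tab) # get number of rows
  lastcol <- ncol(tab) # get number of columns
  if (isTRUE(as.character(tab[1,1])==first) &&
      isTRUE(as.character(tab[lastrow,lastcol])==last)) {
    tabi <- i
  }
}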
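The same check can also be wrapped in a small function, which makes it easier to test on toy data before running it on the real list. This is only a sketch: `find_table()` and `toy.tabs` are hypothetical names, and the function returns NA instead of leaving the index undefined when nothing matches.

```r
# Hypothetical helper: index of the first table whose top-left and
# bottom-right cells match the expected strings, or NA if none does
find_table <- function(tabs, first, last) {
  hit <- vapply(tabs, function(x) {
    if (is.null(x) || nrow(x) == 0 || ncol(x) == 0) return(FALSE)
    isTRUE(as.character(x[1, 1]) == first) &&
      isTRUE(as.character(x[nrow(x), ncol(x)]) == last)
  }, logical(1))
  if (any(hit)) which(hit)[1] else NA_integer_
}

# Made-up list with one NULL entry and one data table
toy.tabs <- list(
  NULL,
  data.frame(Name = c("Doe, Ms Jane", "Roe, Mr John"),
             Total = c("1,200", "3,400"),
             stringsAsFactors = FALSE)
)

find_table(toy.tabs, first = "Doe, Ms Jane", last = "3,400")  # 2
find_table(toy.tabs, first = "Doe, Ms Jane", last = "0")      # NA
```

An NA result is a hint that the page no longer contains the strings you expect, for example because the website has changed since this post was written.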
Check whether that is really the table you want, and extract it as a data frame:
head(mps.tabs[[tabi]])
tail(mps.tabs[[tabi]])
mps <- mps.tabs[[tabi]]
Before you can properly analyze this data set, you have to remove the commas in the expense columns and convert them to numeric:
money <- sapply(mps[,-1:-3], FUN= function(x)
as.numeric(gsub(",", "", as.character(x), fixed = TRUE) ))
mps2 <- cbind(mps[,1:3],money)
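To see why the commas have to go, note that as.numeric("157,841") returns NA (with a warning). Here is the same cleaning step on a made-up two-row data frame; `toy`, `to_number()` and the values are purely illustrative.

```r
# Made-up expense columns stored as text with thousands separators
toy <- data.frame(Name = c("Doe, Ms Jane", "Roe, Mr John"),
                  Travel = c("1,200", "950"),
                  Total = c("157,841", "3,400"),
                  stringsAsFactors = FALSE)

# Strip the commas, then convert; fixed = TRUE treats "," as a literal string
to_number <- function(x) as.numeric(gsub(",", "", as.character(x), fixed = TRUE))

toy.money <- sapply(toy[, -1], to_number)         # numeric matrix, one column per expense
toy2 <- cbind(toy[, 1, drop = FALSE], toy.money)  # put the name column back in front

str(toy2)
sum(toy2$Total)  # 161241
```

The as.character() call matters when the columns come in as factors: converting a factor directly with as.numeric() would return the internal level codes, not the amounts.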
Now you are ready to go... For example, you could compare how the total expenses are distributed for each of the five biggest parties:
# which are the five biggest parties by number of MPs?
nbig5 <- names(sort(table(mps2$Party), decreasing = TRUE))[1:5]
# subset of MPs from the five biggest parties only:
big5 <- subset(mps2, Party %in% nbig5)
# load the lattice package for a nice plot
library(lattice)
bwplot(Total ~ Party, data=big5, ylab="Total expenses per MP (in £)")
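The logic of this step (count rows per party, keep the largest groups, subset) can be checked on a few made-up rows first. In this sketch `toy`, `top2` and `toy.top` are hypothetical, and table() with sort() is used for the counting because it works whether Party is stored as a factor or as character.

```r
# Made-up party labels and totals; not the BBC data
toy <- data.frame(
  Party = c("A", "A", "A", "B", "B", "C"),
  Total = c(100, 120, 140, 200, 220, 50),
  stringsAsFactors = FALSE
)

# Count MPs per party and keep the two largest parties
counts <- sort(table(toy$Party), decreasing = TRUE)
top2 <- names(counts)[1:2]

# Keep only the rows that belong to those parties
toy.top <- subset(toy, Party %in% top2)

# Median total per party as a quick numeric summary
tapply(toy.top$Total, toy.top$Party, median)
```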
And the relevant R code in one piece:
library(XML)
# URL of interest:
mps <- "http://news.bbc.co.uk/2/hi/uk_politics/8044207.stm"
# parse the document for R representation:
mps.doc <- htmlParse(mps)
# get all the tables in mps.doc as data frames
mps.tabs <- readHTMLTable(mps.doc)
# loop to find relevant table:
first <- "Abbott, Ms Diane"
last <- "157,841"
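tabi <- NA # stays NA if no table matches
for (i in seq_along(mps.tabs)) {
  tab <- mps.tabs[[i]]
  # skip entries that are NULL or empty (they cause "argument is of length zero")
  if (is.null(tab) || nrow(tab) == 0) next
  lastrow <- nrow(tab) # get number of rows
  lastcol <- ncol(tab) # get number of columns
  if (isTRUE(as.character(tab[1,1])==first) &&
      isTRUE(as.character(tab[lastrow,lastcol])==last)) {
    tabi <- i
  }
}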
# extract the relevant table and format it:
mps <- mps.tabs[[tabi]]
money <- sapply(mps[,-1:-3], FUN= function(x)
as.numeric(gsub(",", "", as.character(x), fixed = TRUE) ))
mps2 <- cbind(mps[,1:3],money)
# which are the five biggest parties by number of MPs?
nbig5 <- names(sort(table(mps2$Party), decreasing = TRUE))[1:5]
# subset of MPs from the five biggest parties only:
big5 <- subset(mps2, Party %in% nbig5)
# load the lattice package for a nice plot
library(lattice)
bwplot(Total ~ Party, data=big5, ylab="Total expenses per MP (in £)")
More web scraping examples on r-bloggers.com
If you are interested in more web scraping with R, check out the following links to posts with more advanced/specific examples presented on r-bloggers.com:
http://www.r-bloggers.com/web-scraping-in-r/
http://www.r-bloggers.com/r-web-scraping-r-bloggers-facebook-page-to-gain-further-information-about-an-authors%E2%80%99-r-blog-posts-e-g-number-of-likes-comments-shares-etc/
http://www.r-bloggers.com/web-scraping-yahoo-search-page-via-xpath/
http://www.r-bloggers.com/how-to-buy-a-used-car-with-r-part-1/
Wonderful! Thanks so much!
Sincerely,
Erin
Hi Erin,
Thanks a lot for the positive feedback!
I'm glad you like it.
best regards,
gtd
I am in the process of creating an engine on the cloud that allows scraping of data from online listings that span multiple pages with links to embedded pages.
Got a demo deployed here
http://ec2-204-236-207-28.compute-1.amazonaws.com/scrap-gm
where I scraped price and image url from the main listing page, while getting description, seller_name, seller_profile_url from the corresponding embedded page.
{
  origin_url: 'http://www.ebay.com/sch/?_nkw=GM%20Part&_pgn=1',
  columns: [
    {
      col_name: 'item_name',
      dom_query: 'h4 a'
    }, {
      col_name: 'item_detail_url',
      dom_query: 'h4 a',
      required_attribute: 'href',
      options : {
        columns: [{
          col_name: 'description',
          dom_query: '#desc_div'
        },{
          col_name: 'seller_name',
          dom_query: '.mbg a[[0]]'
        },{
          col_name: 'seller_profile_url',
          dom_query: '.mbg a[[0]]',
          required_attribute: 'href'
        }]
      }
    }, {
      col_name: 'item_image',
      dom_query: '.img img',
      required_attribute: 'src'
    }
  ],
  next_page: {
    dom_query: '.next'
  }
};
Would like to get your feedback to know if having it integrated into R would be useful?
Hi,
I am not exactly sure what you mean by "integrate into R". I think the question you should ask yourself is who will use your application and for what (in order to answer your question above). However, here are some of my thoughts about why one would write a scraper in R (or interface it with R...):
In terms of web scraping, I use R to directly integrate the data gathering process into the statistical analysis (on the one hand for convenience, on the other hand for reproducibility). Hence, (in a broader sense) I use R to write scrapers for scientific purposes.
However, if your application is meant to retrieve data and directly reprocess it in a web environment, R might not be the best choice. In that case, I think Perl would make more sense.
Hi,
Great work buddy!
But this works only if the site has tables in it, right? What if I want to collect every text available on the website and then analyze it? How can I do that? Do you have code for that? Reply awaited. Thanks in advance.
thanks! yes, this post is specifically on how to scrape tables. probably the simplest way to "collect every text available on a website" is to
first: read in the whole html document as text/string
second: remove all html tags (http://stackoverflow.com/questions/3765754/remove-html-tags-from-string-r-programming)
third: use a text mining tool to further process/analyze the remaining text (such as the tm package: http://cran.r-project.org/web/packages/tm/index.html)
note though, that this is rather a brute force approach. for more sophisticated analyses you might want to only extract certain text elements of a website. this might be a good starting point: http://www.stat.berkeley.edu/classes/s133/Readexample.html
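The three steps above can be sketched in base R as follows. This is a rough illustration, not a robust HTML parser: the page is a made-up string, the tag removal relies on simple regular expressions, and a plain word count stands in for a proper text mining tool such as the tm package.

```r
# first: the whole document as one string (a made-up stand-in page here;
# for a real site one could use paste(readLines(url), collapse = "\n"))
html <- "<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Web scraping</h1><p>Some <b>bold</b> text.</p>
<script>var x = 1;</script></body></html>"

# second: drop script and style blocks, then all remaining tags
txt <- gsub("(?is)<(script|style)[^>]*>.*?</\\1>", " ", html, perl = TRUE)
txt <- gsub("<[^>]+>", " ", txt)
txt <- trimws(gsub("\\s+", " ", txt))  # collapse runs of whitespace
txt

# third: a crude word frequency table as a first look at the text
words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
sort(table(words[words != ""]), decreasing = TRUE)
```

Removing the script and style blocks first keeps JavaScript and CSS code out of the extracted text, which plain tag stripping alone would not do.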
Nice post. If I run your loop though I get an error. Any idea how to fix this? Thank you.
Error in if (as.character(mps.tabs[[i]][1, 1]) == first & as.character(mps.tabs[[i]][lastrow, :
argument is of length zero
I encountered the same error in my environment:
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.1.0 (64-bit)
I modified the if condition as follows and it works for me:
if (isTRUE(as.character(mps.tabs[[i]][lastrow,lastcol])==last.entry) & isTRUE(as.character(mps.tabs[[i]][1,1])==first.entry)) {
tabi <- i
}
Nice Post. But when I am running the code initially
for (i in 1:length(mps.tabs)) {
lastrow <- nrow(mps.tabs[[i]]) # get number of rows
lastcol <- ncol(mps.tabs[[i]])
if (isTRUE(as.character(mps.tabs[[i]][1,1])==first) & isTRUE(as.numeric(mps.tabs[[i]][lastrow,lastcol])==last)) {
tabi <- i
}
}
it used to throw an error that 'argument is of length zero'. After which I changed it to
if (isTRUE(as.character(mps.tabs[[i]][lastrow,lastcol])==last.entry) & isTRUE(as.character(mps.tabs[[i]][1,1])==first.entry)) {
tabi <- i
}
but still it says 'Error: object 'tabi' not found' any idea how to fix it.
Thank you
i am also getting the same error as yasho joshi is getting ...Pls help
Nice article... While running the code I get an "object not found" error. How can I fix that error?