Tuesday, August 9, 2016

R, the Programmable Web, and Transparency in Social Science Research

Alois Stutzer and I recently contributed a guest post to the BITSS-Blog (of the Berkeley Initiative for Transparency in the Social Sciences). As a big part of it focuses on R-related topics, I figured it might also be of interest to readers of this blog. Here is the gist:

"The replicability of social science research is becoming more demanding in the age of big data. First, researchers aiming to replicate a study based on massive data face substantial computational costs. Second, and probably more challenging, they are often confronted with “highly unique” data sets derived and compiled from sources with different and unusual formats (as they are originally generated and recorded for purposes other than data analysis or research). This holds in particular for Internet data from social media, new e-businesses, and digital government. More and more social scientists attempt to exploit these new data sources following ad hoc procedures in the compilation of their data sets."

The entire post can be found here.

In this context, I also want to explicitly point to all the very relevant contributions listed in the CRAN Task View on Web Technologies and Services, the CRAN Open Data Task View, as well as the contributions by rOpenSci.

Tuesday, July 19, 2016

Easy access to data on US politics: New version of pvsR now on Bitbucket

I am happy to announce a new release (version 0.4) of the R-package pvsR on Bitbucket. pvsR facilitates data retrieval from Project Vote Smart's rich online database on US politics via the Project Vote Smart application programming interface (PVS API). The functions in this package cover most PVS API classes and methods and return the requested data as data frames (with the additional classes "tbl_df" and "tbl"). See here for extended examples and background. The new version includes the following improvements:

  • Replaced the internal function dfList() with the faster implementation in dplyr::bind_rows and removed dfList() from the package.
  • To make interactive sessions more comfortable, all high-level functions for querying data from the PVS API now return objects of class "tbl_df"/"tbl"/"data.frame".
  • Improved code readability and formatting.
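To illustrate the first point, here is a minimal sketch of what dplyr::bind_rows does when stacking a list of per-request results. The data frames and their column names are hypothetical placeholders, not actual PVS API output, and the snippet is not the package's internal code.

```r
library(dplyr)

# Hypothetical per-request results with partly different columns
# (placeholder values, not real PVS API output)
responses <- list(
  data.frame(candidateId = "1", lastName = "Smith",
             stringsAsFactors = FALSE),
  data.frame(candidateId = "2", lastName = "Jones", party = "Independent",
             stringsAsFactors = FALSE)
)

# Stack all list elements into one data frame; columns are matched by name
# and columns missing in an element are filled with NA
combined <- bind_rows(responses)
print(combined)
```

Matching columns by name is what makes this convenient for API results, where individual responses do not always contain the same set of fields.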

How to install/use

# load the package (after installing it from the Bitbucket repository)
library(pvsR)
# define api-key variable (use personal key, see http://votesmart.org/share/api)
pvs.key <- "<YOUR-KEY-HERE>"
# get biographical data on Hillary Clinton and Donald Trump
Candidates.getByLastname(list("Clinton", "Trump"))

Suggestions and issue reports very welcome


Please feel free to make suggestions and report issues (preferably via the issue-tracker in the Bitbucket-repository).

Monday, March 14, 2016

RWebData V. 0.1 on Bitbucket: A High-Level Interface to the Programmable Web

I am happy to announce the first release of the R-package RWebData on Bitbucket. The main aim of the package is to provide high-level functions that facilitate the access to and systematic collection of data from REST APIs for the purpose of statistical analysis. RWebData is thus made for users who predominantly use R as statistical software but do not have experience with web APIs and/or web data formats. In a broader sense (and in the long run), the package should serve as a high-level interface to the programmable web for research in the social sciences (i.e., accessing the programmable web as a data source). The package thus takes up some of the broader ideas discussed in our paper on the pvsR-package. A short paper with a broader motivation for the package, some discussion of the package's architecture, as well as a practical introduction with several examples can be found here.

RWebData builds on many important packages that facilitate client-server interaction via R/HTTP as well as different parsers for web-data formats (including: RCurl, jsonlite, XML, XML2R, httr, mime, yaml, RJSONIO). At its core, the package provides a generic approach to map nested web data to a flat data representation in the form of one or several (non-nested) data-frames. 
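To make the idea of mapping nested web data to a flat table more concrete, here is a minimal sketch that uses jsonlite's own flattening on a small, made-up JSON document. It is only an illustration of the general problem and not RWebData's mapping algorithm, which is more generic and can return several data-frames; the field names and values are hypothetical.

```r
library(jsonlite)

# Made-up nested JSON document (hypothetical fields and values)
json <- '[
  {"id": 1, "name": {"first": "A", "last": "B"}, "office": {"title": "X"}},
  {"id": 2, "name": {"first": "C", "last": "D"}, "office": {"title": "Y"}}
]'

# fromJSON() returns a data frame whose columns "name" and "office"
# are themselves data frames (i.e., the result is still nested)
nested <- fromJSON(json)

# jsonlite::flatten() turns the nested columns into regular columns,
# named by joining the levels with a dot
flat <- jsonlite::flatten(nested)
print(names(flat))
```

After flattening, every column is an atomic vector, so the result can be passed directly to standard functions for statistical analysis.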

A simple example

This example is taken from the working paper on arXiv and illustrates the most basic usage of the package. Say you want to statistically analyze or visualize data provided by a web API. All you have is a URL pointing to the data of interest; you do not know (or care) what JSON, XML, and the like are, and you simply want the data in a format that is suitable for statistical analysis in R.
Here, we want to fetch data from the World Bank Indicators API, which provides time series data on financial indicators of different countries (as XML in a compressed text file). In the example, we query data from that API in order to investigate how the United States' public debt was affected by the financial crisis in 2008.

# load the packages (RWebData must first be installed from its Bitbucket repository)
library(RWebData)
library(zoo)  # provides zoo() and as.yearqtr() used below
# fetch the data and map it to a table-like representation (a data-frame)
u <- "http://api.worldbank.org/countries/USA/indicators/DP.DOD.DECN.CR.GG.CD?&date=2005Q1:2013Q4"
usdebt <- getTabularData(u)
# analyze/visualize the data
plot(as.ts(zoo(usdebt$value, as.yearqtr(usdebt$date))),
     ylab="U.S. public debt (in USD)")

More examples will follow...

Comments etc. very welcome

Please feel free to comment, make suggestions, and report issues (preferably via the issue-tracker in the Bitbucket-repository). As mentioned above, this is the first release. While I have already used the package to collect data for several of my own research projects, there are certainly still a lot of issues to be resolved...