Thursday, March 14, 2013

Data Science in Business/Computational Social Science in Academia?

Nomen Est Omen?

Lately, the terms "data science" and "data scientist" have been turning up at an increasing pace in the R-blog-sphere. Since its first occurrence (to my knowledge, "data scientist" was coined by DJ Patil and Jeff Hammerbacher in 2008), the term "data scientist" has become established and accepted not only in the data-blog-sphere but also in the corporate world and in academia. Its frequent occurrence as a job title, as well as some controversial discussions on the state of the respective labor market, are evidence that we understand "data scientist" as an occupational title. At the same time, data science is being established as a course taught at universities (with the first drafts of dedicated textbooks for learning data science; see, e.g., Jeffrey Stanton's free book on data science).

Interestingly, because job descriptions for data scientists and the respective skill sets exist, our understanding of data science is increasingly defined by what the corporate labor market demands, and hence by business. As I see it, this development is also taken up by the scholars teaching data science at universities. A data science course is quite specifically a preparation for a future job as a data scientist. In that sense, data science is not a science itself but the application of various sciences (computer science, statistics, etc.). This notion, I think, is also present when reading the JDS.

Empirical Computational Social Science

The corporate labor market asks for data scientists, and universities are offering new courses in order to fill the gaps. But is there also room for data science skills in a purely academic research environment?

I think there is very much room for it. At around the same time the term "data scientist" came up, Science magazine published Lazer et al.'s manifesto on data-driven computational social science (or, the term I prefer, empirical computational social science). Historically, the term computational social science has referred to the application of numerical methods and simulation (i.e., agent-based modelling) to complex issues of social science research. What Lazer et al. understand as computational social science, however, is social science research that draws on the enormous potential of vast amounts of digital data on social interactions (made available through the Internet, mobile applications, etc.). Handling this data in order to conduct empirical social science research clearly requires data science skills. To come full circle, I have revisited Drew Conway's post and Venn diagram on data science and drafted another Venn diagram to illustrate how data-driven computational social science could be interpreted in the framework discussed above.

Whether or not you generally share my point of view concerning data science and computational social science, I am pretty sure you will agree on one thing: R will play an important role in the further development of these fields.

Thursday, February 7, 2013

My R-Package Development Cheat Sheet

If you have no experience in writing an R-package yourself but would like to start developing one right away, this post might be helpful.

I'm about to finish my first own (serious) R-package these days (more on the package itself later). While writing my package, I collected a handful of commands and notes that proved to be helpful and saved them in an R-script. I usually had that script open in one window while writing and testing parts of my package in RStudio. I figured that it might help someone in a similar situation. Note, though, that this little code collection has no ambition whatsoever to be anything like a complete guide to developing your own R-package. That said, here it is (you can also download it from my github repo):


#######################################################
# This was contributed by giventhedata.blogspot.com   #
#######################################################
 
 
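# I. Very useful tools when writing an R-package:
#------------------------------------------------
# pass both package names as one character vector; a second
# positional argument would be interpreted as the library path
install.packages(c("devtools", "roxygen2"))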
library(devtools)
library(roxygen2)
 
 
# II. getting started
#--------------------
 
# assuming your package is to be called 'MyRpackage', 
# all the scripts that contain functions that should be part
# of your package are in your current working directory, and
# there are no functions loaded in the workspace of your 
# current R-session...
 
# source all scripts:
myscripts <- c("script1.R", "script2.R", "script3.R") #...
 
for (i in myscripts) source(i)
 
 
# get all the names of the functions in the workspace
fs <- c(lsf.str()) 
 
# create package skeleton:
package.skeleton("MyRpackage", fs)
 
 
# III. While working on your package...
#--------------------------------------
 
# renew documentation
MyRpackage_package <- as.package("MyRpackage")
document(MyRpackage_package)
 
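# build and check
# (check the built tarball rather than the source directory;
#  the version in the file name must match the DESCRIPTION file)
system("R CMD build MyRpackage")
system("R CMD check MyRpackage_0.1.tar.gz")
system("R CMD Rd2pdf MyRpackage") # update/check manual.pdf (while working on the documentation)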
 
# install your package from the local directory after successfully building it
install.packages(file.path(getwd(), "MyRpackage_0.1.tar.gz"), repos=NULL, type="source")
 
# load it for tests
library(MyRpackage)
 
# unload old version of package after changes (in order to install the new build)
detach(package:MyRpackage, unload=TRUE)
 
 
# Note:
# For internal functions: delete the Rd-file and leave out the export-command in the NAMESPACE file
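To make the documentation step and the note on internal functions a bit more concrete, here is a minimal sketch of how a source file in the package's R/ directory might look when roxygen2 is used. The file name and the functions rescale01 and value_range are made up for illustration and are not part of my package. The sketch also assumes that the NAMESPACE file is managed by roxygen2; as far as I know, roxygen2 does not overwrite a NAMESPACE file it did not generate itself, so the one created by package.skeleton may have to be deleted first.

```r
# Hypothetical file R/rescale.R inside the package directory 'MyRpackage'.
# Both function names are assumptions made for this example.

#' Rescale a numeric vector to the unit interval
#'
#' @param x A numeric vector.
#' @return A numeric vector of the same length as \code{x}.
#' @examples
#' rescale01(c(2, 4, 6))
#' @export
rescale01 <- function(x) {
  rng <- value_range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Internal helper: there is no @export tag, so roxygen2 does not add it
# to NAMESPACE, and @noRd tells roxygen2 not to write an Rd-file for it.
#' @noRd
value_range <- function(x) {
  range(x, na.rm = TRUE)
}
```

With comments like these in place, calling document() as in section III should regenerate the Rd-file for the exported function, while the helper stays internal without any manual deletion.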