Making R Work with Big Data
Oct 23, 2013
What is R?
R is an open-source statistical language primarily used for data analytics. It provides a wide collection of statistical and graphical techniques and is highly extensible. R already has a strong presence in the data science community, and the number of contributors is growing at a fast pace. The Comprehensive R Archive Network (CRAN) has about 4000 packages (and growing) today, and many companies already have a dedicated team of data scientists specializing in R for data analytics. R is increasingly preferred as a replacement for other analytical solutions like SAS. Read more about R here.
Standalone R is not naturally suited for Big Data Analytics
Although R is powerful and flexible, with a rich set of statistical algorithms and graphical capabilities, it is single-threaded and keeps all data in memory, which makes it almost impossible to scale to large data sizes. Because of this limitation, data scientists usually sample the big data sets sitting on a platform and then run their analytics on the reduced data. This loses valuable insights that could have been gained had the whole data set been analyzed. In addition, data scientists comfortable with R need to understand the query language of each data source (e.g. SQL) and write source-specific scripts to pull the data into R.
An effort to reduce the gap between R and big data
To address the above limitations in R, we worked towards building an R connector package during our most recent Hack Week at RichRelevance. Using this package, one can potentially perform analytics entirely in R, agnostic to the data store in the back end (HIVE, HDFS, HBase, PostgreSQL, etc.). In addition, it can give R much-needed scalability by pushing the computation to the platform (e.g. Hadoop).
Open-Source R on Hadoop
There are quite a few packages which address the scalability issue of R to different extents. Popular ones include:
- RHive: This open-source R package allows R users to run HIVE queries and provides custom HIVE UDFs to embed R code into the HIVE queries. However, there is no inherent 'transparency' involved: R users need to know HQL to write the queries explicitly and pass them into the RHive functions. Also, one needs to install the Rserve and rJava packages to make it work, which is not very easy to do. The performance impact of these additional packages is unclear.
- RHadoop/rmr2: This is one of the most commonly used R packages for Hadoop. It has a simple interface for running MapReduce jobs using Hadoop streaming. It also has a set of interfaces to interact with HDFS files and HBase tables. However, there is no transparency layer to interact with HDFS/HIVE objects and perform data preparation and ad-hoc analytics in R on Hadoop data.
- RHIPE: This package lets you run mappers and reducers in R, and exposes a wrapper function to run MR jobs. A custom jar file invokes the R processes on the cluster and uses protocol buffers for data serialization/de-serialization. In addition to not being user-friendly, its dependency on protobuf makes it non-trivial to install. Last but not least, there is no transparency layer to perform ad-hoc analytics and data preparation.
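For readers new to the model these packages expose, here is a minimal base-R sketch of the map, shuffle, and reduce phases. It runs in a single R process and does not use the RHIPE or rmr2 APIs; the `records` data and the `mapper` and `reducer` names are made up for illustration.

```r
# Toy input standing in for records stored on the cluster (values are made up).
records <- data.frame(
  location = c("SF", "NY", "SF", "NY"),
  amount   = c(10, 30, 40, 20),
  stringsAsFactors = FALSE
)

# Map: turn each record into a (key, value) pair.
mapper <- function(record) list(key = record$location, value = record$amount)
pairs <- lapply(seq_len(nrow(records)), function(i) mapper(records[i, ]))

# Shuffle: group the values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: collapse each group of values to one result per key.
reducer <- function(vals) max(vals)
max_by_location <- vapply(grouped, reducer, numeric(1))

print(max_by_location)
```

On a real cluster the map and reduce functions would run in separate R processes on different machines, with the framework doing the shuffle in between.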
Advantages of having an R to HIVE connector
- R proxy object extensions: The connector would provide the much-needed transparency between R and the data source (HIVE, PostgreSQL, etc.) and computation platform (Hadoop).
- Pluggable query generation: The connector could separate the physical query generation for a given data source from the analytical logic written in R, thus making the data source pluggable.
- R as a scalable analytical tool: Data analysts are likely to perform the following steps once they have the raw data. All of these steps can be performed using the proposed R connector:
  - Data cleanup: Filter out NAs, outliers, etc.
  - Ad-hoc analytics: fivenum, sd, summary, etc.
  - Data preparation: Join tables, project out columns, filter values, etc.
  - Distributed analytics using Hadoop: Submit MR jobs on the Hadoop cluster from R and apply rich R analytical functions on the data sets in a distributed fashion
  - Result summarization and publishing: Accumulated results are readily available in the client for inferences, reporting, etc.
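To make the cleanup, ad-hoc analytics, and data preparation steps concrete, here is how they look in plain in-memory R on a toy data set. This is only a sketch: the tables and values are invented, and everything runs locally rather than through the connector, whose goal is to let the same style of code operate on data that stays in the back end.

```r
# Toy data standing in for two tables (values are made up for illustration).
purchases <- data.frame(
  customer = c(1, 2, 3, 4, 5),
  amount   = c(10, NA, 30, 20, 40)
)
customers <- data.frame(
  customer = c(1, 2, 3, 4, 5),
  location = c("SF", "SF", "NY", "NY", "SF"),
  stringsAsFactors = FALSE
)

# Data cleanup: drop rows with a missing amount.
clean <- purchases[!is.na(purchases$amount), ]

# Ad-hoc analytics on the cleaned column.
five <- fivenum(clean$amount)   # min, lower hinge, median, upper hinge, max
spread <- sd(clean$amount)
print(summary(clean$amount))

# Data preparation: join the tables, then project out two columns.
joined   <- merge(clean, customers, by = "customer")
prepared <- joined[, c("amount", "location")]

# Result summarization: maximum amount per location.
result <- aggregate(amount ~ location, data = prepared, FUN = max)
print(result)
```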
How this works:
R supports object-oriented programming through its S3 and S4 class systems (short introduction here). We have used the S4 class system, which is closer to the OO paradigm, to extend native R types like data.frame, numeric, vector, etc. and create derived types (e.g. rr.frame, rr.vector, etc.). These types store metadata (e.g., the HIVE query corresponding to the transformed R object) about the data source they represent. We have used HIVE as the data source for this exercise.
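As a rough illustration of deriving an S4 type from a native R type, the sketch below defines a class that extends data.frame and adds one metadata slot. The class name `proxy.frame` and its `query` slot are hypothetical and chosen only for this example; they are not the connector's actual rr.frame definition.

```r
library(methods)

# Hypothetical derived type: a data.frame that also carries a query string.
setClass("proxy.frame",
         contains = "data.frame",
         slots = c(query = "character"))

# Build an instance from an ordinary data.frame plus the extra metadata.
pf <- new("proxy.frame",
          data.frame(amount = c(10, 20), location = c("SF", "NY")),
          query = "SELECT * FROM purchases")

print(is(pf, "data.frame"))   # TRUE
print(pf@query)               # the metadata travels with the object
```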
The following R code highlights the above design. We create a proxy R object called "purchases" which represents a HIVE table containing the purchase history of customers. We show some basic SQL-like operations in R, such as projection and filtering, on this R object. These operations are translated into HIVE queries without actually being run (delayed execution). The final step runs the query prepared on the purchases table, pulls the data and creates an in-memory R frame object.
> # purchases is an rr.frame object which represents the HIVE table "purchases" containing the purchase history of customers
> class(purchases)
"rr.frame"
> # filter all purchases on date 1st Jan 2013 and project out two columns: amount and location
> # (SQL equivalent of selecting columns and filtering with a where clause)
> purchases_filt <- purchases[purchases$eventdate == "'2013-01-01'", c("amount", "location")]
> # find the max order amount grouped by location (SQL equivalent of group by)
> aggquery <- aggregate(purchases_filt$amount, purchases_filt[,c("location")], max)
> # pull the data set into R memory by running the HIVE query corresponding to the aggquery object (group by and filtering above)
> res <- rr.get(aggquery)
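The delayed-execution idea can be sketched in a few lines of S4 code. This is a simplified illustration under our own assumptions, not the connector's implementation: the `lazy.table` class and the `lazy_table`, `project`, `keep_where` and `to_sql` helpers are hypothetical names, and the sketch uses plain functions where the connector overloads R's own operators. Each operation only records what should happen, and the query text is produced at the very end.

```r
library(methods)

# Hypothetical lazy proxy: stores the pieces of a query, never the data itself.
setClass("lazy.table",
         slots = c(table = "character",
                   columns = "character",
                   where = "character"))

lazy_table <- function(name) {
  new("lazy.table", table = name, columns = "*", where = character(0))
}

# Record a projection without running anything.
project <- function(x, cols) {
  x@columns <- cols
  x
}

# Record a filter condition without running anything.
keep_where <- function(x, cond) {
  x@where <- c(x@where, cond)
  x
}

# Generate the query text from the recorded operations.
to_sql <- function(x) {
  sql <- paste("SELECT", paste(x@columns, collapse = ", "), "FROM", x@table)
  if (length(x@where) > 0) {
    sql <- paste(sql, "WHERE", paste(x@where, collapse = " AND "))
  }
  sql
}

purchases <- lazy_table("purchases")
q <- keep_where(project(purchases, c("amount", "location")),
                "eventdate = '2013-01-01'")
print(to_sql(q))
# [1] "SELECT amount, location FROM purchases WHERE eventdate = '2013-01-01'"
```

Only a final "get" step, like rr.get above, would need to send the generated query to the data source and bring the result back into R memory.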
We’d love to hear your feedback and experience on making R work with big data. From our end, we’ll be following up with more blog posts focused on the design, architecture and an end-to-end use case from data preparation to distributed analytics.
About Sukhendu Chakraborty
Sukhendu Chakraborty is a Senior Software Engineer at RichRelevance working on DataMesh, RichRelevance's big data platform. Prior to RichRelevance, Sukhendu was a Principal Member of Technical Staff at Oracle, where he worked for over 6 years. At Oracle, he was a member of the core database team working on storage optimization and retrieval of XML documents in Oracle's XML Database. He was also one of the initial members of Oracle's Advanced Analytics team, working on building scalable analytics solutions on Oracle's Big Data Appliance. Sukhendu has a Master's degree in Computer Science from Duke University and a Bachelor's degree in Computer Engineering from the Indian Institute of Technology, Roorkee, India.