Gaurav Tungatkar

Smart API Publishing using Circuit Breakers


This is the first in a series of blog posts about how we have architected a scalable and distributed API platform for retailers.

Introduction

DataMesh is our self-service flexible platform that enables rapid application development, be it the creation of new recommendation algorithms, social media integrations or enablement of a single view of a customer.

How API publishing works

API publishing takes high-value operational data defined by customers from backend data processing systems and makes it available in the front-end data centers, where it is served with geo load balancing to meet tight 65ms SLAs.

Our push-button API dashboard gives access to rich event data within Hive tables, data querying capabilities and end-to-end API building capabilities. It provides a simple interface to users while orchestrating (behind the scenes) a complex data processing pipeline involving Hadoop for data processing and transformations, Hive and Impala for querying, Oozie for workflow management and custom RichRelevance web services that process and serve the API under 65ms SLAs.

From a high level, API publishing workflow involves:

  1. Creating a Hive table for an API through data manipulation and transformation of the event data
  2. Defining an API by selecting the key and value columns from the table (see the sketch after this list)
  3. Aggregating and making the data available in edge caches in the right format
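
To make step 2 concrete, here is a small illustrative sketch in Java of what an API definition boils down to: the backing Hive table plus the key and value columns selected from it. The class and accessor names are hypothetical, not the actual DataMesh model.

```java
import java.util.List;

// Illustrative only: a hypothetical holder for an API definition.
public class ApiDefinition {
    private final String hiveTable;           // table created in step 1
    private final List<String> keyColumns;    // columns used to look up a record
    private final List<String> valueColumns;  // columns returned by the API

    public ApiDefinition(String hiveTable, List<String> keyColumns, List<String> valueColumns) {
        this.hiveTable = hiveTable;
        this.keyColumns = keyColumns;
        this.valueColumns = valueColumns;
    }

    public String hiveTable()          { return hiveTable; }
    public List<String> keyColumns()   { return keyColumns; }
    public List<String> valueColumns() { return valueColumns; }
}
```

A "recent activity per user" API, for example, would select the user ID as the key column and the recent click and purchase columns as the values.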

Challenges

As we started building this platform, we quickly realized that we wanted to support a wide variety of APIs on top of this infrastructure. Some examples that customers wanted to build include:

  1. An API that returns top 10 products by social media referrals
  2. An API that gives the recent activity (clicks, purchases) of a particular user. This involves per-user history for millions of users that the API must serve.

For APIs closer to example 2, we leverage Hadoop and Hive for data processing and manage the workflow using Oozie, a stack designed to handle large volumes of data. But Hadoop and its ecosystem require significant ramp-up and involve overhead to set up jobs and schedule workflows. This overhead quickly became a problem when we tried to use the same flow for APIs that were closer to example 1. We wanted to handle both scenarios but keep the API management dashboard simple and unified for all types of APIs. The customer should not need to decide whether her data is small enough, or whether she needs Hadoop for the processing.

How we tackled challenges

We introduced a fast track path for the data processing, delegating some steps to periodic jobs and doing the rest in memory. The data processor server decides at runtime whether to take this fast track path or fall back to the Oozie workflow. To make that decision, it uses heuristics based on record structure and record counts, as well as hard limits on parameters like data size and the number of part files.
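
A minimal sketch of what such a runtime decision might look like, assuming hypothetical thresholds and a hypothetical TableStats holder for the statistics gathered about the table (the real heuristics also inspect record structure):

```java
// Illustrative sketch only: names and thresholds are assumptions, not the
// actual DataMesh values.
public class ProcessingPathSelector {

    public enum Path { FAST_TRACK, OOZIE_WORKFLOW }

    // Assumed hard limits.
    private static final long MAX_RECORDS    = 1_000_000L;
    private static final int  MAX_PART_FILES = 64;
    private static final long MAX_BYTES      = 512L * 1024 * 1024;

    // Hypothetical holder for the statistics gathered about the Hive table.
    public interface TableStats {
        long estimatedRecordCount();
        int partFileCount();
        long totalBytes();
    }

    public Path select(TableStats stats) {
        // If any hard limit is exceeded, route to the Oozie/Hadoop workflow.
        if (stats.estimatedRecordCount() > MAX_RECORDS) return Path.OOZIE_WORKFLOW;
        if (stats.partFileCount() > MAX_PART_FILES)     return Path.OOZIE_WORKFLOW;
        if (stats.totalBytes() > MAX_BYTES)             return Path.FAST_TRACK == null ? Path.OOZIE_WORKFLOW : Path.OOZIE_WORKFLOW;
        return Path.FAST_TRACK;
    }
}
```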

Fast track path

The fast track manager reads raw data from the Hive table by following a short-circuit path. It connects directly to the Hive metastore using the Hive metastore client and pulls the important metadata for the table: the location of its data files and its schema description. It then directly reads the data files in HDFS backing that table, parsing the records based on the metadata obtained from the metastore, and executes a series of data transformations and aggregations on the fly in memory.
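
A simplified sketch of this short-circuit read, using the standard Hive metastore client and Hadoop filesystem APIs. The delimiter handling and the process() hook are placeholders: this assumes a plain text table with Hive's default field separator, whereas the real parser is driven by the table metadata.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class FastTrackReader {

    public void read(String dbName, String tableName) throws Exception {
        HiveConf conf = new HiveConf();
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // Metadata: schema description and location of the data files.
            Table table = client.getTable(dbName, tableName);
            List<FieldSchema> columns = table.getSd().getCols();
            Path location = new Path(table.getSd().getLocation());

            // Read the part files in HDFS that back the table.
            FileSystem fs = location.getFileSystem(conf);
            for (FileStatus partFile : fs.listStatus(location)) {
                if (partFile.isDirectory()) {
                    continue; // simplified: partitions not handled here
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(partFile.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Parse each record using the column metadata,
                        // assuming Hive's default '\u0001' field delimiter.
                        String[] fields = line.split("\u0001");
                        process(columns, fields);
                    }
                }
            }
        } finally {
            client.close();
        }
    }

    private void process(List<FieldSchema> columns, String[] fields) {
        // Placeholder for the in-memory transformations and aggregation.
    }
}
```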


After data processing is complete, the fast track processor pushes the data out to the caches at the edge nodes, and the API is ready to be served. It essentially executes a workflow in memory, with error handling and chaining at each step.
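
One way to express that kind of in-memory workflow, sketched with illustrative names (this is not the actual DataMesh implementation): each stage implements a common step interface, and a failure in any stage aborts the chain so the caller can fall back.

```java
import java.util.List;

// Illustrative sketch: a tiny in-memory workflow runner.
public class InMemoryWorkflow {

    public interface Step {
        String name();
        void run() throws Exception;
    }

    // Runs the steps in order; a failure aborts the chain.
    public boolean execute(List<Step> steps) {
        for (Step step : steps) {
            try {
                step.run();
            } catch (Exception e) {
                // Error handling at each step: log and stop, so the caller
                // can fall back to the Oozie workflow.
                System.err.println("Step '" + step.name() + "' failed: " + e.getMessage());
                return false;
            }
        }
        return true;
    }
}
```

The steps chained here would correspond to reading the table, transforming and aggregating in memory, and pushing the result to the edge caches.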

Don’t blow up the in-memory engine

Doing everything in memory runs the risk of running out of memory if we accidentally process a huge Hive table. At each step, there are aggressive circuit breaker safeguards that ensure we don't bite off more than we can chew. Circuit breakers look at:

  1. Number of records in the table
  2. Schema of each record
  3. Number and size of the part files backing that Hive table
  4. Size/number of records in the in-memory data structures as we do the data aggregation

If one of these trips, we fall back to the Oozie workflow based on Hadoop, which can easily handle large volumes of data.
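
A minimal sketch, with an assumed limit and assumed names (not the actual DataMesh code), of the fourth check above: the aggregation watches its own size and trips before the in-memory structures grow too large.

```java
// Illustrative sketch of a circuit breaker on the in-memory aggregation.
public class AggregationCircuitBreaker {

    public static class Tripped extends RuntimeException {
        public Tripped(String reason) { super(reason); }
    }

    private static final long MAX_IN_MEMORY_RECORDS = 2_000_000L; // assumed limit

    private long recordsAggregated = 0;

    // Called for every record added to the in-memory data structures.
    public void onRecordAggregated() {
        recordsAggregated++;
        if (recordsAggregated > MAX_IN_MEMORY_RECORDS) {
            throw new Tripped("in-memory aggregation exceeded "
                    + MAX_IN_MEMORY_RECORDS + " records");
        }
    }
}
```

In this sketch, the processor would catch the trip (and the analogous checks on record count, schema and part files) and resubmit the same publish through the Oozie workflow, matching the fallback described above.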

Conclusion

This in-memory fast track path has given us significant flexibility when creating APIs. It keeps things responsive and interactive when creating APIs based on distilled operational data artifacts like in example 1, yet handles large volumes of data within the same user experience. Smaller APIs that used to take more than 5 minutes to become ready end-to-end are now available in less than 10 seconds. This had an enormous impact on the usability of the product while keeping things scalable.

About :

Gaurav Tungatkar is a Software Engineer at RichRelevance working on DataMesh, RichRelevance's big data platform. Prior to RichRelevance, he was part of the Cloud Services group at Samsung Mobile Research Lab. He has a Master's degree in Computer Science from North Carolina State University. Gaurav loves being innovative and making things from scratch, be it software or food!

3 Comments

  • Mahesh Sharma says:

    If the API is selected to be processed via fast track processing, what was the reasoning behind not pre-computing the results before storing the data in the Hive tables (if required, after reading the concerned historical data)?

    Furthermore, with Hive 0.13+, which provides row-level transactions, would it complicate managing locks outside of Hive? (Assuming you get a partition-level lock in the fast track processing.)

    • Gaurav Tungatkar says:

      Hi Mahesh,
      Thanks for the comment.
      These are lookup APIs; the computation is mainly to aggregate and convert the records to a specific format to cache at the edge nodes. This has to be done after the data table has been defined.

      As far as locking and transactions go, yes, generally that can be an issue, but not a major one in our current workflow. The table update and the fast track publish are chained such that the publish triggers only after the Hive table is updated. Additionally, the data files are moved to a temp dir before processing them.

  • Mahesh Sharma says:

    I see, Thanks much for clarifying.
