Smart API Publishing using Circuit Breakers
May 09, 2014
This is the first in a series of blog posts about how we have architected a scalable and distributed API platform for retailers.
DataMesh is our self-service flexible platform that enables rapid application development, be it the creation of new recommendation algorithms, social media integrations or enablement of a single view of a customer.
How API publishing works
API publishing takes high-value operational data, defined by customers, from backend data processing systems and makes it available in the front-end data centers, where geo load balancing helps meet tight 65ms SLAs.
Our push-button API dashboard gives access to rich event data within Hive tables, data querying capabilities and end-to-end API building capabilities. It provides a simple interface to users while orchestrating (behind the scenes) a complex data processing pipeline involving Hadoop for data processing and transformations, Hive and Impala for querying, Oozie for workflow management and custom RichRelevance web services that process and serve the API under 65ms SLAs.
At a high level, the API publishing workflow involves:
- Creating a Hive table for an API through data manipulation and transformation of the event data
- Defining an API by selecting the keys and value columns from the table
- Aggregating and making the data available in edge caches in the right format
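The three steps above can be sketched as a tiny pipeline. This is a minimal illustration with plain Python structures standing in for Hive tables and edge caches; none of the names below come from the actual DataMesh code.

```python
# Illustrative sketch of the three publishing steps. Python dicts and lists
# stand in for Hive tables and edge caches; all names are hypothetical.

def transform_events(events):
    # Step 1: "create a Hive table" -- here, just filter/shape raw events.
    return [e for e in events if e.get("type") == "purchase"]

def define_api(keys, values):
    # Step 2: select the key and value columns the API will serve.
    return {"keys": keys, "values": values}

def publish(events, api_def, edge_cache):
    # Step 3: aggregate per key and push the result into the edge cache.
    table = transform_events(events)
    for row in table:
        key = tuple(row[k] for k in api_def["keys"])
        edge_cache.setdefault(key, []).append(
            {v: row[v] for v in api_def["values"]})
    return edge_cache

events = [
    {"type": "purchase", "user": "u1", "product": "p9"},
    {"type": "click", "user": "u1", "product": "p3"},
]
cache = publish(events, define_api(["user"], ["product"]), {})
# cache -> {("u1",): [{"product": "p9"}]}
```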
As we started building this platform, we quickly realized that we wanted to support a wide variety of APIs on top of this infrastructure. Some examples that customers wanted to build include:
- An API that returns top 10 products by social media referrals
- An API that gives the recent activity (clicks, purchases) of a particular user. This involves per-user history for millions of users that the API must serve.
For APIs closer to example 2, we leverage Hadoop and Hive for data processing and manage a workflow using Oozie, which is designed to handle large volumes of data. But Hadoop and its ecosystem require significant ramp-up and involve overhead to set up jobs and schedule workflows. This overhead quickly became a problem when we tried to use the same flow for APIs that were closer to example 1. We wanted to handle both scenarios, but keep the API management dashboard simple and unified for all types of APIs. The customer should not need to decide whether her data is small enough or whether she needs Hadoop for the processing.
How we tackled challenges
We introduced a fast track path for the data processing, delegating some steps to periodic jobs and doing the rest in memory. The data processor server decides at runtime whether to take this fast track path or fall back to the Oozie workflow. It uses heuristics based on record structure and record counts, as well as hard limits on a set of parameters like size and number of part files, to make that decision.
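A sketch of that runtime decision might look like the following. The thresholds and the stats fields are invented for illustration; they are not our actual limits.

```python
# Sketch of the fast-track vs. Oozie decision. The TableStats fields and
# all thresholds are hypothetical; the real service applies its own
# heuristics and hard limits.
from dataclasses import dataclass

@dataclass
class TableStats:
    record_count: int
    total_bytes: int
    part_file_count: int

MAX_RECORDS = 1_000_000          # illustrative hard limit on record count
MAX_BYTES = 512 * 1024 * 1024    # illustrative hard limit on total size
MAX_PART_FILES = 64              # illustrative hard limit on part files

def choose_path(stats: TableStats) -> str:
    """Return 'fast-track' for small tables, 'oozie' otherwise."""
    if (stats.record_count <= MAX_RECORDS
            and stats.total_bytes <= MAX_BYTES
            and stats.part_file_count <= MAX_PART_FILES):
        return "fast-track"
    return "oozie"

print(choose_path(TableStats(10_000, 4 * 1024 * 1024, 3)))   # fast-track
print(choose_path(TableStats(50_000_000, 10**10, 900)))      # oozie
```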
Fast track path
The fast track manager reads raw data from the Hive table by following a short circuit path. It connects directly to the Hive metastore using the Hive metastore client and pulls the important metadata for a Hive table: the location of the data files and the schema description. It then directly reads the data files in HDFS backing that table, parsing the data based on the metadata obtained from the metastore. Finally, it executes a series of data transformations and aggregations on the fly, in memory.
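In outline, the short-circuit read does the following. This is a self-contained sketch where dictionaries stand in for the metastore client and HDFS; the real path talks to the Hive metastore's Thrift API and reads the backing HDFS part files directly.

```python
# Sketch of the short-circuit read path. The metastore and HDFS are mocked
# as plain dictionaries; in the real flow this metadata comes from the Hive
# metastore client and the lines come from HDFS part files.

FAKE_METASTORE = {
    "api_events": {
        "location": "/warehouse/api_events",
        "schema": ["user", "product"],   # column names, in order
        "delimiter": "\x01",             # Hive's default field delimiter
    }
}

FAKE_HDFS = {
    "/warehouse/api_events": ["u1\x01p9", "u2\x01p3"],  # part-file lines
}

def read_table(table_name):
    # 1. Fetch metadata (data location + schema) from the metastore.
    meta = FAKE_METASTORE[table_name]
    # 2. Read the backing part files directly from HDFS.
    lines = FAKE_HDFS[meta["location"]]
    # 3. Parse each line using the schema from the metastore.
    return [dict(zip(meta["schema"], line.split(meta["delimiter"])))
            for line in lines]

rows = read_table("api_events")
# rows -> [{"user": "u1", "product": "p9"}, {"user": "u2", "product": "p3"}]
```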
After data processing is complete, the fast track processor pushes the data out to the caches at the edge nodes and the API is ready to be served. It essentially executes a workflow in memory, with error handling and chaining at each step.
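One way to picture that in-memory workflow, with chaining and error handling at each step (illustrative only, not our actual workflow engine):

```python
# Illustrative in-memory workflow runner: steps are chained in order, and a
# failure at any step aborts the run with context about where it failed.

def run_workflow(data, steps):
    for name, step in steps:
        try:
            data = step(data)
        except Exception as exc:
            raise RuntimeError(f"workflow failed at step '{name}'") from exc
    return data

result = run_workflow(
    [3, 1, 2],
    [("sort", sorted),
     ("top-2", lambda xs: xs[:2])],
)
# result -> [1, 2]
```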
Don’t blow up the in-memory engine
Doing everything in memory runs the risk of running out of memory if we accidentally process a huge Hive table. At each step, there are aggressive circuit breaker safeguards that ensure that we don't bite off more than we can chew. Circuit breakers look at:
- Number of records in the table
- Schema of each record
- Number and size of part files backing that Hive table
- Size/number of records in the in-memory data structures as we do the data aggregation
If one of these trips, we fall back to the Hadoop-based Oozie workflow, which can easily handle a large volume of data.
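Conceptually, each safeguard acts like a trip wire around the in-memory path. The sketch below shows the pattern with one invented check and an invented limit; the actual checks and fallback are internal to the data processor.

```python
# Sketch of the circuit-breaker pattern around the in-memory path. A
# CircuitBreakerTripped exception at any checkpoint triggers the Hadoop
# fallback. The check and the limit are illustrative, not real values.

class CircuitBreakerTripped(Exception):
    pass

MAX_IN_MEMORY_RECORDS = 100_000  # illustrative limit

def check_record_count(records):
    if len(records) > MAX_IN_MEMORY_RECORDS:
        raise CircuitBreakerTripped("too many records for in-memory path")

def aggregate_in_memory(records):
    # Toy aggregation: count occurrences of each record.
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def submit_oozie_workflow(records):
    # Stand-in for kicking off the Oozie/Hadoop workflow.
    return "submitted"

def process(records):
    try:
        check_record_count(records)   # trip wire before aggregation
        return ("fast-track", aggregate_in_memory(records))
    except CircuitBreakerTripped:
        return ("oozie", submit_oozie_workflow(records))

print(process(["a", "b", "a"]))   # ('fast-track', {'a': 2, 'b': 1})
print(process(["x"] * 200_000))   # ('oozie', 'submitted')
```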
This in-memory fast track path has given us significant flexibility when creating APIs. It keeps things responsive and interactive when creating APIs based on distilled operational data artifacts like in example 1, yet handles large volumes of data within the same user experience. Smaller APIs that used to take more than 5 minutes to become ready end-to-end now take less than 10 seconds. This had an enormous impact on the usability of the product while keeping things scalable.