How We Selected Apache Kafka on our Path to Real-time Data Ingestion
Oct 29, 2013/
About a year ago, RichRelevance was looking for a way to enhance our existing approach for collecting clickstream data and to move it reliably in a scalable way in real time from our front-end data centers to our back-end cloud-based platform. As the global leader in omni-channel personalization, we serve more than 160 global clients in over 40 countries, and sit on petabytes of customer data. What made this effort extra challenging was the fact that our infrastructure is globally distributed.
We used several key requirements to guide our search, including:
- An overall design principle that supported streaming over a batch approach
- Low operational complexity while being reliable and scalable
- A pub-sub model that incorporates push and pull mechanism
- A system that fit very well with our existing use cases (involving pacing of ad campaigns, omni-channel retail use cases, etc.)
- The ability to pull data at different rates to effectively take the data and use it for analytics, use it operationally to create models, or use it for BI purposes, rather than ingesting data and getting into ETL processing with the platform; (Being able to pull this streaming data for whatever use case, and integrate it with other streaming computation frameworks was critical.)
After we evaluated several technologies including Flume and Active MQ, we considered Kafka 0.7—before it was even an Apache project. We saw it evolve, then adopted and incorporated it in shadow mode, while still keeping our production system in place. After evaluating this version—and having good success on the forum with a lot of help from community contributors—we gained confidence in Kafka. The publisher-subscriber model allowed us to push data to the Kafka brokers and pull data from it using various consumers. Being able to pull data at different rates meant that we could decouple the rate at which messages are produced and transferred to the Kafka server from the rate at which the messages are consumed.
Another important criterion was fault-tolerance. The 0.7 version of Kafka did not have broker level replication baked into it, so we developed a home-grown approach. Afterwards, we operationalized the 0.8 version of Kafka in its early stages and deployed it to production, which gave us the needed fault-tolerance. The win here is that we operationalized Kafka 0.8 and mitigated all the risks by branching it off and making the necessary changes to get production value from it before companies jumped on it.
There were a few key factors that led to our successful adoption of Kafka. First and foremost, we’re not afraid to use and contribute to open source technology here at RichRelevance, and have a long-standing tradition of doing so. We’re at the bleeding edge of many OSTs, and contribute our fair share toward them. The fact that we witnessed many startups utilizing Kafka in this initial 3-month period increased our confidence in the technology.
Equally important, the LinkedIn Data Infrastructure Team maintained its exemplary tradition in open sourcing previously in-house software technologies; Not only is the technology high quality, the manner in which it was open sourced for community usage encouraged a high level of responsiveness within the user base.
We were so impressed with the community, and our success as early adopters, that we shared our story with the Kafka User Group at LinkedIn (view video starting at 28 minutes), explaining how we operationalized version 0.8 for real-time data ingestion.
In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault-tolerance—which makes it an operationally scalable solution for large scale message processing applications. From an internal operations perspective, using and managing the framework to monitor the health of our internal processes has made our life easier in many ways. For example, we can perform time-series analysis on our clickstream data to compute trends on traffic—such as analyzing the number of page views on Black Friday. We also built a real-time system/operations monitoring tool around Kafka and Hbase, since we had out grown the open source options.
Today, we leverage Kafka to get all of our clickstream data to our back-end infrastructure. We are always innovating our technology in real-time—and are fortunate to be able to do so, because it’s not just software, but also our massively scalable infrastructure, operations, and software joined together that helps us achieve true returns.
Our next steps include continuing to invest, partner, and contribute to Kafka. Our platform team is investing into streaming computation frameworks around Kafka, and we are also adopting it internally for different products. We look forward to sharing our ongoing progress with the Kafka community!