zeotap makes large scale, deterministic data assets easily available within the digital advertising ecosystem and other industries. Our data engineering team transforms unstructured raw data that we receive from a multitude of data partners into structured readily queryable data which can then be monetized through different distribution channels. In this article, I will begin with a high-level introduction to our data pipeline and elaborate on how we use Apache Oozie as a distributed controller for our data platform with some customisations to make it work for our use case.
Our whole data platform runs on Amazon Web Services (AWS) making heavy use of a diverse set of tools and services provided by AWS. We receive data from various data partners every day in designated S3 buckets in pre-decided formats. We then process it on AWS Elastic MapReduce (EMR) using Apache Spark through various stages in the pipeline. The processed data is uploaded to Amazon Redshift Clusters (an OLAP MPP Database) and made available for efficient querying by different teams as well as by various internal APIs that we have built on top of it. All of this is scheduled and monitored through Apache Oozie that comes along with Amazon’s EMR service.
Apache Oozie is a workflow scheduler for the HDFS architecture ecosystem that can schedule various MapReduce jobs across an EMR cluster with the help of yarn resource manager. Each Oozie workflow consists of a directed acyclic graph of control flow nodes called actions. A workflow can be executed once using Oozie controller or scheduled periodically using an Oozie Coordinator. An Oozie workflow is described in an XML file. Those are usually not a favourite to work with, but in Oozie they are supported with some really powerful abstractions such as workflow parametrization, fork-join action nodes for parallel execution, a rich library of EL (Expression Language) expressions and email notifications so we can quickly get over it.
DATA PIPELINE IN OOZIE
In order to set up our data pipeline in Oozie, we composed each stage of the pipeline as a sub-workflow consisting of one or more Spark actions. We also made use of HDFS actions to create the necessary folder structure and shell actions to ensure that the expected data is in fact available. We wrapped all the sub-workflows in yet another parent workflow to run them in sequence and perform the bookkeeping operations as needed. The decoupling of a parent workflow with sub-workflows for each stage enabled us to run one or more stages independently for testing and debugging purposes.
We parametrized the parent workflow to process data for every data partner using Oozie variables. We set up Oozie coordinators to run the corresponding parent workflow every day at specific times of the day. Here, we made use of EL expressions to compute paths to a folder in an S3 bucket containing the latest data for a given data partner. Finally, we added another parent workflow consisting of a Spark action for aggregating data across all data partners followed by a Shell action to upload aggregated data to Redshift.