How Zeotap’s Data Science Team Continues to Pave the Way for Industry Innovation

Behind the success of Zeotap’s Customer Intelligence Platform is a team of Data Scientists who are paving the way for innovation in our industry. Not only are they incredibly skilled at their craft, but they continuously work towards finding new solutions to the industry’s most critical data challenges. Along the way, the team has put together a number of comprehensive research papers that deep dive into these challenges and solutions – which have since been accepted by some of the industry’s most prestigious journals and data science organisations.

To highlight these successes, we’ve put together an overview of our Data Science team’s most recent papers accepted for publication (including their most recently published paper):

1. Multigraph Approach Towards a Scalable, Robust Look-alike Audience Extension System
Written by: Ernest Kirubakaran Selvaraj, Tushar Agarwal, Nilamadhaba Mohapatra and Swapnasarit Sahu
Accepted by: AdKDD (Adtech branch of The Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining – SIGKDD)

Abstract: In online advertising, finding the right audience is critical for a campaign’s success. One common way of finding the right audience is to look for users with traits similar to others who have historically responded positively to the campaign. This pool of users is known as the ‘seed set’, and the goal here is to reach a bigger audience with traits very similar to those of the seed set. This technique, popularly known as look-alike audience extension, becomes increasingly challenging with the scale and high sparsity of data commonly encountered in the advertising domain.

In this paper, we present a novel multigraph-based audience extension and scoring system which works well with high-dimensional sparse data and can be scaled easily to millions of users. Our experimental results on large real-world data demonstrate significant improvement in the performance of our approach over the existing architectures.
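To make the idea of look-alike scoring concrete, here is a minimal sketch that ranks candidate users by how closely their traits overlap with a seed set. It is not the multigraph system from the paper: it swaps in plain Jaccard similarity over sparse trait sets, and the user IDs, traits and function names are all hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sparse trait sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_candidates(seed: dict, candidates: dict) -> list:
    """Rank candidates by mean similarity to the seed users, highest first."""
    if not seed:
        raise ValueError("seed set must contain at least one user")
    scores = {
        user: sum(jaccard(traits, s) for s in seed.values()) / len(seed)
        for user, traits in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: user id -> set of observed traits.
seed = {"s1": {"sports", "travel"}, "s2": {"sports", "music"}}
candidates = {"u1": {"sports", "travel", "music"}, "u2": {"cooking"}}

ranked = score_candidates(seed, candidates)
```

Comparing every candidate with every seed user like this does not scale to millions of users, which is the kind of limitation the paper sets out to address.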

2. Estimating the instantaneous survival rate of digital advertising and marketing IDs: LIFESPAN by Cox-Proportional
Written by: Nilamadhaba Mohapatra, Humeil Makhija and Swapnasarit Sahu
Accepted by: AdKDD

Abstract: Identifying active and inactive device IDs in the digital advertising and marketing domain is one of the most crucial tasks in terms of cost and quality. Keeping IDs for longer periods increases the load on downstream pipelines, which incurs higher storage and computation costs. At the same time, these IDs are treated as currency in the digital domain: setting a short time to live (TTL) purges IDs prematurely and loses the information attached to them, while setting a long TTL brings back the original problem of storage and computation cost. Checking in real time whether an individual ID is active is almost impossible.

While most of the non-feedback systems run on TTL-based methods to purge the IDs and clean the database, in this paper we propose a granular machine learning-based approach which learns from implicit feedback.
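The paper’s title points to the Cox proportional hazards model. As a rough illustration of how such a model turns per-ID features into a survival estimate, the sketch below applies the standard Cox relationship S(t | x) = S0(t) ** exp(beta · x). The feature names, coefficients and baseline value are hypothetical and are not taken from the paper.

```python
import math


def relative_hazard(features: dict, coefficients: dict) -> float:
    """exp(beta . x): how much riskier this ID is than the baseline ID."""
    return math.exp(sum(coefficients[name] * value
                        for name, value in features.items()))


def survival_probability(baseline_survival: float,
                         features: dict,
                         coefficients: dict) -> float:
    """Cox model: S(t | x) = S0(t) ** exp(beta . x)."""
    return baseline_survival ** relative_hazard(features, coefficients)


# Hypothetical fitted coefficients (assumed, for illustration only).
coefficients = {"days_since_last_seen": 0.05, "events_per_week": -0.3}

# Hypothetical device ID and an assumed baseline survival S0(t) at some time t.
device = {"days_since_last_seen": 10, "events_per_week": 2}
p_active = survival_probability(0.8, device, coefficients)
```

An ID whose estimated survival probability falls below a chosen threshold could then be purged individually, rather than waiting for one fixed TTL to expire for every ID.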

3. Ensemble model for chunking
Written by: Nilamadhaba Mohapatra, Namrata Sarraf and Swapnasarit Sahu
Accepted at: The 2nd International Conference on Natural Language Computing and AI (part of the 11th International Conference on Computer Science, Engineering and Applications)

Abstract: ‘Chunking’ – the process of splitting the words of a sentence into tokens and then grouping these tokens in a meaningful way – is a task that has gradually moved from POS tag-based statistical models to neural models such as LSTMs, bidirectional LSTMs and attention models. Deep neural net models are deployed indirectly to classify tokens as different tags defined under Named Entity Recognition tasks. Later, these tags are used in conjunction with pointer frameworks for the final chunking task.

In this paper, we propose an ensemble model that uses a fine-tuned transformer model and a recurrent neural network model together to predict tags and chunk substructures of a sentence. To achieve this, we analysed the shortcomings of the transformer models in predicting different tags and then trained the BiLSTM+CNN model accordingly to compensate.
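For readers new to chunking, the sketch below shows the last step in isolation: grouping per-token tags into chunks. It assumes the common BIO tagging scheme (B- opens a chunk, I- continues it, O is outside any chunk), which this post does not confirm the paper uses, and it does not reproduce the paper’s models or pointer framework. The function name and example sentence are hypothetical.

```python
def bio_to_chunks(tokens: list, tags: list) -> list:
    """Group BIO-tagged tokens into (label, text) chunks."""
    chunks = []
    label, words = None, []

    def flush():
        # Close the chunk currently being built, if there is one.
        if words:
            chunks.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:  # "O": token belongs to no chunk
            flush()
            label, words = None, []
    flush()
    return chunks


tokens = ["The", "quick", "fox", "jumps", "over", "the", "fence"]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
chunks = bio_to_chunks(tokens, tags)
```

In an ensemble, the tag sequence passed to a function like this would come from combining the predictions of the individual models.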

4. Domain-based chunking
Written by: Nilamadhaba Mohapatra, Namrata Sarraf and Swapnasarit Sahu
Accepted by: The International Journal on Natural Language Computing (IJNLC)

Abstract: Manually annotating and producing a large, high-quality training set is a tedious and costly affair in terms of time and resources.

In this paper, we discuss annotated data generation for specific and diverse domains – proposing a novel grammar-based text generation mechanism (which takes care of annotation). The annotated queries were used to train an ensemble, transformer-based deep neural network model.
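As a rough illustration of why grammar-based generation “takes care of annotation”, the sketch below expands a tiny slot grammar into every sentence it can produce and emits the tags alongside the tokens. The grammar, tag names and BIO tagging scheme are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical toy grammar: an ordered list of (tag, options) slots.
GRAMMAR = [
    ("ACTION", ["show", "find"]),
    ("ENTITY", ["users", "devices"]),
    ("O", ["in"]),
    ("LOCATION", ["Germany", "New Delhi"]),
]


def generate_annotated(grammar):
    """Yield (tokens, tags) for every sentence the slot grammar can produce."""
    slot_tags = [tag for tag, _ in grammar]
    for choice in product(*(options for _, options in grammar)):
        tokens, tags = [], []
        for tag, phrase in zip(slot_tags, choice):
            words = phrase.split()
            tokens.extend(words)
            if tag == "O":
                tags.extend(["O"] * len(words))
            else:
                # First word opens the chunk, remaining words continue it.
                tags.extend([f"B-{tag}"] + [f"I-{tag}"] * (len(words) - 1))
        yield tokens, tags


examples = list(generate_annotated(GRAMMAR))
```

Because each token’s tag comes from the slot that produced it, the generated sentences need no manual labelling.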