Probabilistic vs Deterministic Data: What’s the Difference?
In today’s digital-first world, marketers need ways to interact with customers across multiple customer journey touchpoints. But customer journeys are now more complex than ever: the majority of shoppers follow a zig-zagging path through a wealth of touchpoints, both online and offline, and aren’t always logged into every device they use. This makes it harder to identify customers and build customer profiles that deliver the personalised experiences they’ve grown to expect.
There are two primary identity resolution models used to bridge this identity gap: probabilistic data modelling and deterministic data matching. Each one serves a different purpose, so it’s important to understand how they’re used and the information they offer.
In this blog post, we compare probabilistic vs deterministic data to help you choose a model that fits your business needs.
What is probabilistic data?
Probabilistic data is data based on behavioural events like page views, time spent on page, or click-throughs. This data is analysed and grouped by the likelihood that a user belongs to a certain demographic, socio-economic status or class.
To generate probabilistic data, algorithms will identify pre-defined behavioural patterns such as interests or browsing behaviours to determine the probability of the user’s age, gender or socio-economic status. Behavioural patterns could be as general as grouping users according to the types of media they’re most likely to consume, or they could be more precise and group audiences by the type of device they’re most likely to use to access a touchpoint.
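As a rough illustration of the idea above, the sketch below estimates how likely a user is to fall into an age band based on the content category they viewed. It is a minimal example, not a production model: the `labelled_views` data, the age bands and the `age_band_probabilities` helper are all hypothetical, and the estimate is a simple relative frequency rather than a trained algorithm.

```python
from collections import Counter

# Hypothetical labelled observations: (content category viewed, known age band).
# In practice the labels would come from users whose age is already known.
labelled_views = [
    ("gaming", "18-24"), ("gaming", "18-24"), ("gaming", "25-34"),
    ("gardening", "55+"), ("gardening", "55+"), ("gardening", "35-54"),
    ("gaming", "18-24"),
]


def age_band_probabilities(category, observations):
    """Estimate P(age band | category viewed) as relative frequencies."""
    counts = Counter(band for cat, band in observations if cat == category)
    total = sum(counts.values())
    if total == 0:
        return {}  # no evidence for this category, so no estimate
    return {band: n / total for band, n in counts.items()}


# A user seen viewing gaming content is assigned a probability per age band.
probs = age_band_probabilities("gaming", labelled_views)
for band, p in sorted(probs.items()):
    print(f"{band}: {p:.2f}")
```

The output is a probability per segment rather than a fact about the user, which is what distinguishes this kind of data from the deterministic data described below.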
How is probabilistic data used?
Probabilistic data can be used to add more value to deterministic datasets and to scale deterministic data models. If something is unknown in a deterministic dataset, enriching the data with probabilistic data can offer more accurate insights.
What is deterministic data?
Deterministic data is linked to something which identifies a user, like an email address or a cookie ID, and is treated as known to be true rather than inferred. Deterministic data provides a solid foundation for marketing operations because it is based on fact. For example, if a user signs up in one year and gives their current age, it is a fact that the following year they will be a year older.
In addition to demographic information, deterministic data can also take the form of a user’s interests or commonly visited geographical locations. Having factual data of this kind is critical to helping marketers refine the accuracy of their personalised and targeted marketing efforts.
How is deterministic data used?
Deterministic data can be used to provide accuracy and clarity in targeted marketing campaigns and to enhance probabilistic segments.
One effective use case for deterministic data is in the creation of granular segmentation to target users with relevant campaigns. For example, grouping users who you know for a fact share an interest in cycling.
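A deterministic segment of this kind can be sketched as a simple filter over declared attributes. The user records, the `declared_interests` field and the `build_segment` helper below are hypothetical and only illustrate the principle that membership depends on known facts, not on inferred likelihoods.

```python
# Hypothetical user records keyed by a known identifier (email address).
users = [
    {"email": "ana@example.com", "declared_interests": {"cycling", "running"}},
    {"email": "ben@example.com", "declared_interests": {"cooking"}},
    {"email": "cat@example.com", "declared_interests": {"cycling"}},
    {"email": "dev@example.com"},  # no interests declared
]


def build_segment(records, interest):
    """Return identifiers of users who have explicitly declared the interest."""
    return [
        r["email"]
        for r in records
        if interest in r.get("declared_interests", set())
    ]


cycling_segment = build_segment(users, "cycling")
print(cycling_segment)
```

Users without a declared interest are simply left out of the segment instead of being guessed into it.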
Deterministic data can also be used to supplement and enhance the accuracy of marketing prediction. Prediction is used to make educated guesses about users when the information is not available in deterministic data. Marketers may attempt to guess ages, genders or interests and then create probabilistic segments from those predictions, which can be fed into CDPs.
However, predictions can be wholly inaccurate, which can then lead machine-learning algorithms to produce unsatisfactory results. For this reason, filling in unknown information with deterministic data wherever it is available gives the algorithm more accurate inputs to work from.
The Difference Between Probabilistic and Deterministic Matching
Deterministic matching:
- Looks for an exact match between two pieces of data
- Creates device relationships by using personally identifiable information (PII) to join devices, like email addresses, names and phone numbers. Links can only be made if they directly tie the PII to a consumer, to prioritise accuracy and prevent false positives.
- Relies heavily on data being as close to 100% quality as possible, achieved through cleansing and standardising
- Best suited for source systems which consistently collect unique identifiers like PII (e.g. driver's licence numbers or passport numbers)
Probabilistic matching:
- Uses a statistical approach to assess the probability that two records represent the same individual
- Works best when given up-front access to the data and uses wider sets of data elements to create matches
- Uses weights to calculate matching scores, and thresholds to determine whether there is a match, non-match or possible match. Also takes into account how frequently a particular data value occurs among all the values in that data. For example, the first name Jack matching with another Jack would result in a low score or weight because Jack is considered a common name.
- Requires DPOs or data stewards to manually review matching results to ensure their accuracy
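The two approaches in the list above can be contrasted in a short sketch. Everything here is an assumption made for illustration: the field names, the value counts, the disagreement penalty and the two thresholds are hypothetical, and the weight formula (log base 2 of the inverse relative frequency) is one common way to make rare values count for more, not the method of any particular platform.

```python
import math
from collections import Counter


def normalise_email(email):
    """Cleanse and standardise an identifier before exact comparison."""
    return email.strip().lower()


def deterministic_match(a, b):
    """Exact match on a unique identifier; no identifier means no link."""
    ea, eb = a.get("email"), b.get("email")
    if not ea or not eb:
        return False
    return normalise_email(ea) == normalise_email(eb)


def agreement_weight(value, value_counts):
    """Rarer values earn a higher weight: log2(total / occurrences)."""
    total = sum(value_counts.values())
    occurrences = max(value_counts[value], 1)  # unseen values count as rarest
    return math.log2(total / occurrences)


def probabilistic_score(a, b, counts_by_field, disagreement_penalty=-2.0):
    """Sum frequency-based weights for agreeing fields, penalise the rest."""
    score = 0.0
    for field, value_counts in counts_by_field.items():
        if a[field] == b[field]:
            score += agreement_weight(a[field], value_counts)
        else:
            score += disagreement_penalty
    return score


def classify(score, lower=0.0, upper=4.0):
    """Apply two hypothetical thresholds to get a three-way decision."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"  # would go to a data steward for manual review


# Hypothetical value counts observed across the whole data set.
counts_by_field = {
    "first_name": Counter({"jack": 4, "zebedee": 1, "amira": 1, "li": 2}),
    "postcode": Counter({"SW1A": 1, "M1": 3, "EH1": 4}),
}

common_pair = probabilistic_score(
    {"first_name": "jack", "postcode": "EH1"},
    {"first_name": "jack", "postcode": "EH1"},
    counts_by_field,
)
rare_pair = probabilistic_score(
    {"first_name": "zebedee", "postcode": "SW1A"},
    {"first_name": "zebedee", "postcode": "SW1A"},
    counts_by_field,
)
print(classify(common_pair))  # common values agree: possible match
print(classify(rare_pair))    # rare values agree: match
```

Two records that agree on common values land in the review band, while agreement on rare values clears the upper threshold, which mirrors the Jack example in the list.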
Probabilistic vs. Deterministic: Which One is Best?
Neither probabilistic nor deterministic data can be called “the best”, because each kind of data matching suits different campaign types and goals. Additionally, neither is immune to the data-specific factors below, which affect matching accuracy.
1. Knowledge of the data source and its systems
Before deciding on which data model to use, it’s important to assess the source systems of your data. Ask why data is being collected from the source system, what value the data from the source system will have and how the data will be useful to the overall enterprise.
Further, the data elements of each source system will need to be evaluated. How the data is captured, added, updated and deleted should be assessed, as well as the level of data validation and cleansing performed by the source system when capturing the data.
2. The quality of the data
Data profiling is a crucial step in compiling data for an organisation. It should be done early, to determine the quality of your information before using it for anything important, such as matching and searching records for similarities. Also profile the specific parts of your data set that feed into those matches, so you know which fields will need more attention if something goes wrong later, during use or transfer.
Profiling will show not only how accurate the content of certain fields is, but also which anonymous values (or equivalences) may match easily across different sources when you look for duplicate records.
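A basic profiling pass over a single field might look like the sketch below. It assumes that the "anonymous values" mentioned above are placeholders such as "unknown" or "n/a", which is one reading of the term; the `PLACEHOLDERS` list, the sample records and the `profile_field` helper are all hypothetical.

```python
# Hypothetical placeholder values that carry no identifying information.
PLACEHOLDERS = {"", "n/a", "na", "unknown", "none", "test"}


def profile_field(records, field):
    """Count missing, placeholder and usable values for one field."""
    total = len(records)
    missing = 0
    placeholders = 0
    distinct = set()
    for record in records:
        raw = record.get(field)
        if raw is None:
            missing += 1
            continue
        value = str(raw).strip().lower()
        if value in PLACEHOLDERS:
            placeholders += 1
        else:
            distinct.add(value)
    usable = total - missing - placeholders
    return {
        "total": total,
        "missing": missing,
        "placeholders": placeholders,
        "usable": usable,
        "distinct": len(distinct),
        "fill_rate": usable / total if total else 0.0,
    }


sample = [
    {"email": "a@example.com"},
    {"email": "A@example.com "},  # same address after standardising
    {"email": "unknown"},         # placeholder, would match other placeholders
    {"email": None},
    {},
]
report = profile_field(sample, "email")
print(report)
```

A field with a low fill rate or many placeholder values is a poor candidate for matching rules, and placeholder values in particular should be excluded so that two "unknown" entries are never treated as the same person.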
3. The completeness of the data
Ideally, each data element used for matching should always contain a value. However, this is not always the case. It comes back to how source systems enforce rules when capturing information: some make a field mandatory, while others do not require it to be filled in on every record.
When matching data elements that may not be populated 100% of the time, it’s important to consider how they’ll affect your searching and matching rules. For example, if a particular piece of information is present in one record but not the other, should you still call it an exact match? And when two values don’t line up exactly as expected, can the records still be treated as matched because they share some similar characteristics?
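One possible answer to these questions, sketched below, is to treat a missing value as neither agreement nor disagreement, so that it adds nothing to the score, and to report how many fields were actually compared. This is an assumption for illustration rather than a recommended rule; the `weights` values and both helper functions are hypothetical.

```python
def compare_field(a, b, field):
    """Three-way comparison: agree, disagree, or missing on either side."""
    va, vb = a.get(field), b.get(field)
    if va in (None, "") or vb in (None, ""):
        return "missing"
    return "agree" if va == vb else "disagree"


def score_pair(a, b, weights):
    """Score two records, skipping fields that are empty on either side.

    weights maps field -> (agreement weight, disagreement weight).
    Returns the score and the number of fields that were compared.
    """
    score = 0.0
    compared = 0
    for field, (agree_w, disagree_w) in weights.items():
        outcome = compare_field(a, b, field)
        if outcome == "missing":
            continue  # absence of data is not evidence either way
        compared += 1
        score += agree_w if outcome == "agree" else disagree_w
    return score, compared


# Hypothetical weights per field.
weights = {"email": (5.0, -3.0), "phone": (4.0, -2.0)}

full = {"email": "jo@example.com", "phone": "0123"}
no_phone = {"email": "jo@example.com", "phone": None}
other_phone = {"email": "jo@example.com", "phone": "0999"}

print(score_pair(full, no_phone, weights))
print(score_pair(full, other_phone, weights))
```

Returning the number of compared fields alongside the score lets a matching rule demand a minimum amount of evidence before calling two sparse records a match.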
4. The age of the data
Bringing outdated and irrelevant information into the hub will degrade performance and waste processing power on cleansing records that will never be matched against any other system.
In most cases, old data sets are also incomplete because they were collected without many validation rules enforced during collection, which will require more quality control down the line.
When to choose a probabilistic model
If your goal is to target specific audiences who might be interested in buying certain types of products, probabilistic data can help you reach a larger audience rather than pinpointing precisely which consumers qualify as prospects.
When to choose a deterministic model
A deterministic model is appropriate when the probability of an outcome can be determined with certainty. For example, a software platform selling its technology products may use this type of model to set prices or forecast demand for new products. In general, this type of modelling is used in situations where it is important to make decisions based on objective facts rather than subjective opinions about what might happen in the future.
CIPs like Zeotap’s use AI and machine learning to collect, manage and analyse both deterministic and probabilistic data from multiple disparate sources at breakneck speeds. Other CDPs also use deterministic data (unique identifiers such as email addresses and phone numbers), but will need access to both deterministic and probabilistic data to perform customer identity resolution and build a complete customer profile.
Understanding the Data Sources That Power a CDP
CDPs collect a wealth of data from different sources and platforms, including event-based data and transactional data. Understanding the data sources that power a ‘winning’ CDP is critical to choosing the right platform. Download The Enterprise Marketer’s Guide to Customer Data Platforms for an in-depth look at the top characteristics and considerations behind an investment-worthy CDP.