Identity-Providing Datasets
Legacy Identity
If you are a legacy identity customer, see Contact Identity, Contact Data, Attributes & Aggregates, and Contact Segment Datasets.
Identity-Providing Datasets (Next-Gen Identity)
If your organization is on Next-Gen Identity, an Identity-Providing Dataset (IPD) is required to bring identifiers into Simon. These datasets can be used in a couple of ways, depending on how many identifiers you want to bring in and how many data sources they come from.
Each data source must have its own Identity-Providing Dataset. Here’s an example that illustrates the relationship between them:
- Your data warehouse contains customer_id and email.
- A separate Shopify table contains email and phone_number.
- Each data source gets its own IPD to bring in its identifiers. The identifier common to both sources in this example is email.
- When events enter the identity service, email acts as a join key: it associates the customer_id from the data warehouse with the phone_number from the Shopify table, and the identifiers from the two sources are stitched together into a single view of the contact.
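The example above can be sketched as two separate dataset queries, one per data source. This is only an illustration: the table names analytics.customers and shopify.customers and the updated_at column are hypothetical, the identifier columns are assumed to be strings, and date_part('EPOCH_SECOND', ...) assumes a Snowflake-style warehouse.

```sql
-- IPD 1 (data warehouse): brings in customer_id and email.
-- Table and column names are hypothetical; adjust to your schema.
SELECT
    lower(customer_id) AS customer_id,
    lower(email) AS email,
    date_part('EPOCH_SECOND', updated_at) AS association_timestamp
FROM analytics.customers;

-- IPD 2 (Shopify): brings in email and phone_number.
-- This is a separate dataset with its own query.
SELECT
    lower(email) AS email,
    lower(phone_number) AS phone_number,
    date_part('EPOCH_SECOND', updated_at) AS association_timestamp
FROM shopify.customers;
```

Both queries expose email, which is the shared identifier that lets customer_id and phone_number end up on the same contact.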
If only one identifier is being brought into Simon from a single data source, the Identity-Providing Dataset simply brings that identifier into the platform. The dataset query would look like this:
SELECT
  lower(id) as id,
  -- Current time in epoch seconds, minus 60 seconds
  date_part('EPOCH_SECOND', current_timestamp()) - 60 as association_timestamp
FROM prod.demo_contact;
If there is more than one identifier in the Identity-Providing Dataset, this creates an association between the identifiers. In the example below, email and user_id are present in the Identity-Providing Dataset. This means that if Simon ingests an event with one or both of these identifiers, an association is created that ties those two identifiers to a single contact profile.
SELECT
  lower(email) as email,
  lower(id) as user_id,
  -- Current time in epoch seconds, minus 60 seconds
  date_part('EPOCH_SECOND', current_timestamp()) - 60 as association_timestamp
FROM prod.demo_contact;
Note
All identifier values must be lowercase. You can do this in the SQL query like this:
lower(email) as email
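As an optional sanity check before committing a dataset, a query along these lines can show how many source values are not already lowercase. It reuses the prod.demo_contact table and email column from the examples above, and it assumes case-sensitive string comparison (the default in Snowflake); treat it as a sketch rather than a required step.

```sql
-- Count rows whose email differs from its lowercase form.
-- These are the rows that lower() will change in the IPD query.
SELECT count(*) AS mixed_case_emails
FROM prod.demo_contact
WHERE email <> lower(email);
```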
Creating an Identity-Providing Dataset
- Click on Datasets, then Datasets again in the left-hand navigation.
- Click the Create Dataset button.
- Select Identity-Providing as the dataset purpose, then click Next.
- Name your dataset. The name should clearly describe the identifiers you’re bringing in via this dataset, such as “Email & User_ID Identity” or “Phone Number Identity”.
- Select Database Query as your source.
- Choose your identifier. If the dataset you’re creating will contain your stable identifier, be sure to select that identifier here.
- Choose the database from which you want to bring in your identifiers, then click Start.
- Follow the SQL prompt to write your identity query.
Association Timestamp Field
It’s highly recommended that each Identity-Providing Dataset include an updated timestamp field. This is important for a couple of reasons:
- Efficiency. The updated timestamp allows Simon to extract data incrementally rather than extracting all of the data every time the pipe runs. If there is no updated timestamp, a significant amount of time could be added to the pipe run.
- Audit log. The updated timestamp also enables Simon to keep track of identifier associations. Without this field, it will be impossible to know when a new association is made, which may lead to confusion when debugging any potential issues that arise.
Notes
- It’s not possible to delete an updated timestamp field from an IPD after the dataset has already been committed.
- Be sure to alias your timestamp field as ‘association_timestamp’ and format it in epoch seconds (whole seconds since 1970-01-01 00:00:00 UTC).
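A minimal sketch of an IPD query that aliases the timestamp as association_timestamp in epoch seconds. The updated_at column is hypothetical (this page does not name a source timestamp column), and date_part('EPOCH_SECOND', ...) assumes a Snowflake-style warehouse; other warehouses use different functions for the same conversion.

```sql
SELECT
    lower(email) AS email,
    lower(id) AS user_id,
    -- Convert the row's last-modified time to whole seconds since
    -- 1970-01-01 00:00:00 UTC. updated_at is an assumed column name.
    date_part('EPOCH_SECOND', updated_at) AS association_timestamp
FROM prod.demo_contact;
```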
Timestamp must be accurate
If your timestamp is not accurate, some contacts may be skipped during the pipe run, because only records that are new since the previous run are ingested. For example, if the pipe runs on August 23rd and a contact's association timestamp is 2am on August 22nd, this contact will likely be skipped during extraction. When extracting data, we look for new records based on the timestamp provided, and in this case the timestamp falls before the period being incrementally extracted.
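The skipping behavior can be illustrated with a query that mimics an incremental extraction. This is a simplified sketch, not Simon's actual extraction logic: the sample rows, the year, and the previous-run time are all made up for illustration, and the syntax assumes a Snowflake-style warehouse.

```sql
-- Two sample contacts with their association timestamps (made-up data).
WITH ipd_rows AS (
    SELECT 'a@example.com' AS email,
           date_part('EPOCH_SECOND', '2023-08-22 02:00:00'::timestamp_ntz) AS association_timestamp
    UNION ALL
    SELECT 'b@example.com',
           date_part('EPOCH_SECOND', '2023-08-23 01:00:00'::timestamp_ntz)
),
-- Assumed time of the previous pipe run: noon on August 22nd.
previous_run AS (
    SELECT date_part('EPOCH_SECOND', '2023-08-22 12:00:00'::timestamp_ntz) AS last_run
)
-- Only rows with a timestamp newer than the previous run are picked up.
SELECT r.email
FROM ipd_rows r
CROSS JOIN previous_run p
WHERE r.association_timestamp > p.last_run;
```

Under these assumptions only b@example.com is returned. The row for a@example.com carries a timestamp of 2am on August 22nd, which is earlier than the previous run, so it is skipped even if it was actually added to the source afterwards.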
Editing an IPD
There are a couple of reasons why you’d want to edit an Identity-Providing Dataset.
- You need to bring an additional identifier from the data source into Simon.
- You need to remove an identifier from the dataset.
To add a new identifier to an existing Identity-Providing Dataset, the identifier must exist in the data source from which the dataset is pulling. Simply add a new line to the IPD’s SQL query that selects the new identifier.
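For instance, extending the email and user_id query shown earlier with a phone number would mean selecting one more column. The phone_number column and alias are hypothetical here, and date_part('EPOCH_SECOND', ...) assumes a Snowflake-style warehouse.

```sql
SELECT
    lower(email) AS email,
    lower(id) AS user_id,
    -- New line: the additional identifier (hypothetical column name).
    lower(phone_number) AS phone_number,
    date_part('EPOCH_SECOND', current_timestamp()) - 60 AS association_timestamp
FROM prod.demo_contact;
```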
There are a couple of things to keep in mind when adding a new identifier to an existing IPD:
- Saving a new query version triggers a backfill with the new data, which may make it more difficult to debug where an identifier came from in the future.
- A backfill could delay the next pipe run, depending on the number of contacts in your account and the size of the dataset.
Note
If you’re adding an identifier that does not currently exist elsewhere in your account, speak to your account manager to ensure it’s available for use in the IPD you’re editing.
Removing an identifier
Because Identity-Providing Datasets are the bedrock of your customer table, it’s extremely risky to remove an identifier from an existing IPD. Removing an identifier will cause stale data to be present in your account. Please speak to your account manager if you need to remove an identifier from an existing Identity-Providing Dataset.
Archiving an IPD
Speak to your account manager for information on archiving an Identity-Providing Dataset.