Simon Data supports a number of dataset types, including:
## Contact Identity (Legacy Identity)
Every dataset you bring into Simon must contain a customer identifier: a field that uniquely identifies the customer. Usually this identifier is an email address, or an identifier that can be mapped to an email address via an Identity dataset. By uniquely identifying customers, you ensure every segment and campaign has a set of distinct customers.
If your data uses multiple identifiers, you must have an Identity dataset that associates all your identifiers with one another. Because all your identifiers are explicitly connected in a single, authoritative source, you can reference any of them in another dataset, and Simon will join customer data together accordingly. The dataset can tap into an existing mapping of identities from your database or a third-party provider, or it can contain logic that deterministically performs an identity association based on one or more criteria.
For example, say you want to associate email address, user ID, and phone number together from an existing mapping in your database. In that case, each row of the Identity dataset pairs an email address with the user ID and phone number mapped to it.
If you want to associate email address, user ID, and client ID together based on web and in-app event data, the following SQL query and table demonstrate how this would look:
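As a minimal sketch of such a deterministic association, the snippet below runs a query against an in-memory SQLite database through Python's `sqlite3` module. The tables `web_events` and `app_events` and their columns are hypothetical stand-ins, and your warehouse's SQL dialect may differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical event tables; real table and column names will differ.
    CREATE TABLE web_events (email TEXT, user_id TEXT, client_id TEXT);
    CREATE TABLE app_events (email TEXT, user_id TEXT, client_id TEXT);
    INSERT INTO web_events VALUES ('Ann@Example.com', 'u1', 'c-web-1');
    INSERT INTO web_events VALUES ('ann@example.com', 'u1', 'c-web-1');
    INSERT INTO app_events VALUES ('ann@example.com', 'u1', 'c-app-9');
""")

# One row per distinct (email, user_id, client_id) combination seen in events.
IDENTITY_QUERY = """
    SELECT DISTINCT LOWER(email) AS email, user_id, client_id
    FROM (
        SELECT email, user_id, client_id FROM web_events
        UNION ALL
        SELECT email, user_id, client_id FROM app_events
    )
    WHERE email IS NOT NULL
    ORDER BY client_id
"""

identity_rows = conn.execute(IDENTITY_QUERY).fetchall()
for row in identity_rows:
    print(row)
```

Each returned row ties one email address to a user ID and a client ID, so the same customer can appear once per client ID observed in the event data.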
**Note**. Contact Identity datasets can't bring in fields beyond identifiers.
## Identity-Providing Datasets (Next-Gen Identity)
If your organization is on Next-Gen Identity, an Identity-Providing Dataset (IPD) is required to bring identifiers into Simon. These datasets can be used in a couple of ways, depending on how many identifiers you want to bring in and how many data sources those identifiers come from.
Each data source must have its own Identity-Providing Dataset. Here’s an example that illustrates the relationship between them:
- Your data warehouse contains _customer_id_ and _email_.
- A separate Shopify table contains _email_ and _phone_number_.
- Each data source will have an IPD to bring in the respective identifiers, and the common denominator in this example is _email_.
- When events enter the identity service, _email_ will be used essentially as a join key to associate the _customer_id_ from the data warehouse to the _phone_number_ in the Shopify table, and the identifiers from these separate data sources will be stitched together to create a single view of the contact.
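The stitching described in this example can be pictured as a join on _email_. The sketch below is only a conceptual illustration using an in-memory SQLite database; it is not how the identity service is implemented, and the table names `warehouse_ipd` and `shopify_ipd` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical outputs of two Identity-Providing Datasets.
    CREATE TABLE warehouse_ipd (customer_id TEXT, email TEXT);
    CREATE TABLE shopify_ipd (email TEXT, phone_number TEXT);
    INSERT INTO warehouse_ipd VALUES ('cust-1', 'ann@example.com');
    INSERT INTO warehouse_ipd VALUES ('cust-2', 'bob@example.com');
    INSERT INTO shopify_ipd VALUES ('ann@example.com', '+15550100');
""")

# email acts as the join key between the two sources.
stitched = conn.execute("""
    SELECT w.customer_id, w.email, s.phone_number
    FROM warehouse_ipd AS w
    LEFT JOIN shopify_ipd AS s ON s.email = w.email
    ORDER BY w.customer_id
""").fetchall()

for row in stitched:
    print(row)
```

In this sketch the first contact ends up with all three identifiers, while the second has no phone number because its email never appears in the Shopify source.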
If there is only one identifier being brought into Simon from a single data source, the Identity-Providing Dataset is used to simply bring it into the platform, and the dataset query would look like this:
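A single-identifier query might be sketched as follows. The `customers` table is a hypothetical stand-in, and the query is run against in-memory SQLite only so that it can be tried locally; your warehouse's dialect may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table holding one identifier.
    CREATE TABLE customers (email TEXT);
    INSERT INTO customers VALUES ('Ann@Example.com');
    INSERT INTO customers VALUES ('ann@example.com');
    INSERT INTO customers VALUES ('bob@example.com');
    INSERT INTO customers VALUES (NULL);
""")

# Bring in only the identifier, lowercased and de-duplicated.
SINGLE_IDENTIFIER_QUERY = """
    SELECT DISTINCT LOWER(email) AS email
    FROM customers
    WHERE email IS NOT NULL
    ORDER BY 1
"""

single_rows = conn.execute(SINGLE_IDENTIFIER_QUERY).fetchall()
for row in single_rows:
    print(row)
```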
If there is more than one identifier in the Identity-Providing Dataset, this creates an association between the identifiers. In the example below, _email_ and _user_id_ are present in the Identity-Providing Dataset. This means that if Simon ingests an event with one or both of these identifiers, an association is created that ties those two identifiers to a single contact profile.
All identifier values must be lowercase. You can do this in the SQL query like this: _lower(email) as email_
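To illustrate a two-identifier dataset with lowercased values, here is a small sketch run against in-memory SQLite. The `accounts` table and its columns are assumptions made for the example, not names Simon requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table with mixed-case identifier values.
    CREATE TABLE accounts (email TEXT, user_id TEXT);
    INSERT INTO accounts VALUES ('Ann@Example.com', 'U-1');
    INSERT INTO accounts VALUES ('ann@example.com', 'u-1');
""")

# Selecting both identifiers in one row is what associates them;
# LOWER() keeps case differences from producing separate values.
PAIR_QUERY = """
    SELECT DISTINCT LOWER(email) AS email, LOWER(user_id) AS user_id
    FROM accounts
    ORDER BY 1
"""

pair_rows = conn.execute(PAIR_QUERY).fetchall()
for row in pair_rows:
    print(row)
```

Without the `LOWER()` calls, the two source rows would come through as two different identifier pairs for what is presumably the same customer.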
### Creating an Identity-Providing Dataset
1. Click on **Datasets**, then **Datasets** again in the left-hand navigation.
2. Click the **Create Dataset** button.
3. Select **Identity-Providing** as the dataset purpose, then click **Next**.
4. Name your dataset. This should clearly explain the identifiers you’re bringing in via this dataset, such as “Email & User_ID Identity” or “Phone Number Identity”.
5. Select **Database Query** as your source.
6. Choose your identifier. If the dataset you’re creating will contain your stable identifier, be sure to select that identifier here.
7. Choose the database from which you want to bring in your identifiers, then click **Start**.
8. Follow the SQL prompt to write your identity query.
### Association Timestamp Field
It’s highly recommended that each Identity-Providing Dataset include an updated timestamp field. This is important for a couple of reasons:
**Efficiency**. The updated timestamp allows Simon to extract data incrementally rather than extracting all of the data every time the pipe runs. If there is no updated timestamp, a significant amount of time could be added to the pipe run.
**Audit log**. The updated timestamp also enables Simon to keep track of identifier associations. Without this field, it will be impossible to know when a new association is made, which may lead to confusion when debugging any potential issues that arise.
It’s not possible to delete an updated timestamp field from an IPD after the dataset has already been committed.
Be sure to alias your timestamp field as _association_timestamp_ and format it in epoch seconds.
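As a sketch of the alias and format, the query below converts a hypothetical `updated_at` column to epoch seconds. It uses SQLite's `strftime('%s', ...)`, which is specific to SQLite; your warehouse will have its own function for this conversion. The `customer_identities` table is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table; updated_at is stored as a UTC datetime string.
    CREATE TABLE customer_identities (email TEXT, user_id TEXT, updated_at TEXT);
    INSERT INTO customer_identities
        VALUES ('Ann@Example.com', 'u-1', '2023-08-22 02:00:00');
""")

# strftime('%s', ...) is SQLite's way to get epoch seconds (as text),
# so the result is cast to an integer before aliasing.
TIMESTAMP_QUERY = """
    SELECT
        LOWER(email) AS email,
        LOWER(user_id) AS user_id,
        CAST(strftime('%s', updated_at) AS INTEGER) AS association_timestamp
    FROM customer_identities
"""

timestamp_rows = conn.execute(TIMESTAMP_QUERY).fetchall()
for row in timestamp_rows:
    print(row)
```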
**Timestamp must be accurate**. If your timestamp is not accurate, some contacts may be skipped during the pipe run, since we only ingest records that are new since the previous run. For example, if the pipe runs on August 23rd and a contact's association timestamp is 2am on August 22nd, this contact will likely be skipped during extraction. This is because extraction looks for new records based on the timestamp provided, and in this case the timestamp is dated _before_ the window being incrementally extracted.
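The effect can be illustrated with a simplified filter. This is a conceptual sketch only, not Simon's extraction logic: it assumes the previous run happened at noon UTC on August 22nd (the text above does not state a time), and the record layout is hypothetical.

```python
from datetime import datetime, timezone


def epoch_seconds(year, month, day, hour=0):
    """Return the UTC epoch seconds for the given date and hour."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())


# Assumption for this sketch: the previous pipe run was at noon on August 22nd.
previous_run = epoch_seconds(2023, 8, 22, 12)

records = [
    # Stamped 2am August 22nd, i.e. before the previous run.
    {"email": "late@example.com", "association_timestamp": epoch_seconds(2023, 8, 22, 2)},
    # Stamped after the previous run, so it counts as new.
    {"email": "new@example.com", "association_timestamp": epoch_seconds(2023, 8, 22, 18)},
]

# An incremental extract keeps only records newer than the previous run.
extracted = [r["email"] for r in records if r["association_timestamp"] > previous_run]
skipped = [r["email"] for r in records if r["association_timestamp"] <= previous_run]

print("extracted:", extracted)
print("skipped:", skipped)
```

If the first record was only added to the source after the previous run, its old timestamp keeps it out of every later incremental extract, which is why the timestamp should reflect when the association actually changed.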
## Contact Data
The most common dataset type is user data; that is, data fields that map to a customer and can be used in segmentation and personalization. Simon supports two ingestion sources for user data: SQL query or CSV upload. In addition, user datasets will fall into one of the two following categories.
## Attributes & Aggregates (One-to-One)
In a one-to-one dataset, every entry is a unique customer, and therefore each customer is associated with exactly one data point per field. When written in SQL, these datasets often include a 'GROUP BY' clause on the identifier along with the aggregated user data.
One-to-one datasets pull in contact information such as first/last purchase date, total lifetime value, total orders, etc. The majority of datasets in Simon are one-to-one.
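A one-to-one aggregate query might look like the sketch below, which groups a hypothetical `orders` table by email so that each customer produces exactly one row. It runs against in-memory SQLite for illustration; table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical orders table: several rows per customer.
    CREATE TABLE orders (email TEXT, order_total REAL, ordered_at TEXT);
    INSERT INTO orders VALUES ('ann@example.com', 20.0, '2023-01-05');
    INSERT INTO orders VALUES ('Ann@Example.com', 30.0, '2023-03-10');
    INSERT INTO orders VALUES ('bob@example.com', 15.5, '2023-02-01');
""")

# GROUP BY on the identifier collapses many orders into one row per customer.
ONE_TO_ONE_QUERY = """
    SELECT
        LOWER(email)     AS email,
        COUNT(*)         AS total_orders,
        SUM(order_total) AS lifetime_value,
        MIN(ordered_at)  AS first_purchase_date,
        MAX(ordered_at)  AS last_purchase_date
    FROM orders
    GROUP BY LOWER(email)
    ORDER BY 1
"""

aggregate_rows = conn.execute(ONE_TO_ONE_QUERY).fetchall()
for row in aggregate_rows:
    print(row)
```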
## Contact Segment
Sometimes it is useful to directly transform a list of customers into a segment within Simon. This is possible via the user segment dataset type. Again, Simon supports two ingestion sources for this dataset type: SQL query or CSV upload. This dataset type requires a single column of customers, with no additional fields beyond the email or user identifier.
Once a segment dataset has been added, it will be available in the segment builder under the Customer segment filter using the same name as the dataset itself ("NYC Event Attendees" in the example below).
This type of dataset is also useful in a situation where a segment has complex criteria that are better mapped in SQL than using Simon's segment builder.
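As a sketch of a segment dataset query, the example below expresses the selection criteria in the `WHERE` clause and returns only the identifier column. The `event_rsvps` table and its columns are hypothetical, and the query is run against in-memory SQLite purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical RSVP table used to build the customer list.
    CREATE TABLE event_rsvps (email TEXT, city TEXT, attended INTEGER);
    INSERT INTO event_rsvps VALUES ('ann@example.com', 'NYC', 1);
    INSERT INTO event_rsvps VALUES ('bob@example.com', 'NYC', 0);
    INSERT INTO event_rsvps VALUES ('cy@example.com', 'Boston', 1);
    INSERT INTO event_rsvps VALUES ('Ann@Example.com', 'NYC', 1);
""")

# The result has a single column: the customer identifier.
SEGMENT_QUERY = """
    SELECT DISTINCT LOWER(email) AS email
    FROM event_rsvps
    WHERE city = 'NYC' AND attended = 1
    ORDER BY 1
"""

cursor = conn.execute(SEGMENT_QUERY)
segment_columns = [col[0] for col in cursor.description]
segment_rows = cursor.fetchall()
print(segment_columns, segment_rows)
```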
For information on _Upload List_ datasets, see [Lists](🔗).