Dataset validation

Before allowing you to create or edit a dataset, Simon runs certain validation checks on the dataset’s configuration. These validations are designed to ensure the dataset can be successfully ingested by Simon for use in your account. To see if a dataset is valid, click Validate.

If successful, a small sample result displays.
If unsuccessful, a description of the error(s) and remediation steps display.

Error Type	Description
Identifier	Your dataset must contain a customer identifier (e.g. ‘email’).
Unique field names	Each dataset is associated with a unique set of fields, and no two datasets can be associated with a field with the same name. If your dataset returns a field that already exists in your data, the dataset will be invalid. An exception to this unique constraint is the identifier, which must be present in every dataset.
No field deletion	Fields cannot be removed from a dataset. If for some reason an existing field is causing issues and needs to be removed, please contact your Client Solutions Manager.
Non-zero rows returned	The dataset must contain data.
Valid syntax (queries only)	Your query must have valid syntax and successfully run against your database.
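
As a rough illustration of how the first four checks in the table relate to one another, here is a minimal Python sketch. It is not Simon's implementation: `DatasetConfig`, `validate_config`, and all parameter names are hypothetical, and the syntax check is left out because it requires running the query against your database.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetConfig:
    """Hypothetical stand-in for a dataset's configuration (not Simon's data model)."""
    name: str
    fields: list[str]
    sample_rows: list[dict] = field(default_factory=list)


def validate_config(new, identifier, fields_in_other_datasets, previous_fields=()):
    """Return a list of error strings; an empty list means the dataset looks valid."""
    errors = []

    # Identifier: the customer identifier must be one of the dataset's fields.
    if identifier not in new.fields:
        errors.append("Identifier: dataset must contain a customer identifier")

    # Unique field names: no clash with other datasets, except for the identifier.
    clashes = (set(new.fields) - {identifier}) & set(fields_in_other_datasets)
    if clashes:
        errors.append(f"Unique field names: already in use: {sorted(clashes)}")

    # No field deletion: every previously saved field must still be present.
    removed = set(previous_fields) - set(new.fields)
    if removed:
        errors.append(f"No field deletion: missing fields: {sorted(removed)}")

    # Non-zero rows returned: the sample must contain data.
    if not new.sample_rows:
        errors.append("Non-zero rows returned: dataset must contain data")

    return errors
```

Returning every error at once, rather than stopping at the first, mirrors the description above of a list of error(s) displayed after you click Validate.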

Created and Updated Timestamp Validations for Contact Event and Object Datasets

The Contact Event or Object Dataset Settings tab has Extract Schedule configurations that require a Created Timestamp and an Updated Timestamp.

The fields you select for these configurations go through additional validation, so click Validate again after choosing them. An error message appears if any of the following conditions is not met:

Timestamps can't be in the future.
Timestamps can't contain milliseconds.
Timestamps can't be null.
Updated Timestamp must be greater than or equal to Created Timestamp.
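
If you want to pre-check your data before clicking Validate, the four conditions above can be sketched in Python roughly as follows. This is an assumption-laden sketch, not Simon's code: `timestamp_errors` is a hypothetical helper, the timestamps are assumed to be timezone-aware UTC `datetime` values, and "contains milliseconds" is interpreted as "has any sub-second component".

```python
from datetime import datetime, timezone


def timestamp_errors(created, updated, now=None):
    """Check one row's Created/Updated timestamps; return a list of error strings.

    Assumes timezone-aware datetimes (comparing naive and aware values raises TypeError).
    """
    now = now or datetime.now(timezone.utc)
    errors = []

    for label, ts in (("Created", created), ("Updated", updated)):
        if ts is None:
            errors.append(f"{label} Timestamp can't be null")
            continue
        if ts > now:
            errors.append(f"{label} Timestamp can't be in the future")
        # Python stores sub-second precision as microseconds.
        if ts.microsecond != 0:
            errors.append(f"{label} Timestamp can't contain milliseconds")

    # Only compare the two values when both are present.
    if created is not None and updated is not None and updated < created:
        errors.append("Updated Timestamp must be greater than or equal to Created Timestamp")

    return errors
```

One way to avoid the milliseconds error is to truncate values upstream, for example with `ts.replace(microsecond=0)` in Python or the equivalent truncation in your query.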

Extract Validation Rules

The dataset details page has a tab called Settings, which contains additional dataset-level and field-level validation rules that you can configure to operate on your datasets.

📘
Note

Unlike the query-level validations above, these rules are run against the entire dataset during extraction, not just a sample. While validations are designed to confirm a dataset’s configuration, dataset rules are able to detect anomalies in the data itself.

Unlike the mandatory configuration validations listed above, these dataset rules can be edited and removed as appropriate.

A. Dataset Rules

Dataset rules are applied by default to every new dataset. These validations currently exist (and more are coming soon!):

Rule	Description
Dataset should not be empty	The dataset is always expected to contain some data. A failure occurs if there are no rows during the dataset extract. If you are creating a dataset that is expected to only occasionally contain data, toggle this rule to off.
Row count should not decrease below threshold on refresh	In general, Simon expects that datasets will contain consistent or increasing amounts of data, which means that row count should not decrease dramatically. After a dataset is extracted, Simon will compare the total number of rows in the dataset to the previous extracts. If there is a significant decrease in row count, the extract job will fail. This rule is created by default for all new datasets with a threshold of 75% (in other words, the row count should never decrease by more than 25%). The threshold may be manually adjusted from the rules tab. If you change the threshold, remember to click Save at the top of the dataset details page or you will lose your changes. If you expect your datasets to have large fluctuations in data quantity, toggle this rule to off.
Skippable (only applicable to Contact Data)	If the dataset fails during extract for any reason, Simon continues with the remaining Contact Data extraction and uses data from the last successful extract.
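
To make the threshold arithmetic concrete, here is a minimal Python sketch of the first two rules. It is illustrative only: `dataset_rule_failures` and its parameters are hypothetical, and it assumes the comparison is against the row count of the most recent previous extract, which the description above does not spell out.

```python
def dataset_rule_failures(previous_rows, current_rows, threshold=0.75,
                          check_not_empty=True, check_row_count=True):
    """Return the names of the dataset rules that would fail for this extract."""
    failures = []

    # "Dataset should not be empty": fails when the extract has no rows.
    if check_not_empty and current_rows == 0:
        failures.append("Dataset should not be empty")

    # "Row count should not decrease below threshold on refresh":
    # with threshold=0.75 the new count must be at least 75% of the previous one.
    # Skipped when there is no previous count to compare against (an assumption).
    if check_row_count and previous_rows > 0 and current_rows < threshold * previous_rows:
        failures.append("Row count should not decrease below threshold on refresh")

    return failures
```

With the default 75% threshold and a previous extract of 10,000 rows, an extract of 7,500 rows passes and one of 7,499 rows fails. Setting `check_not_empty` or `check_row_count` to `False` corresponds to toggling the rule off.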

B. Field Rules

These rules can be applied at the individual field level. If there are currently no field-level rules, only the Add Field Rules button displays.

To configure new field-level rules:

Click Add Field Rules
From the drop-down, choose a dataset field to add rules to.
Toggle validation options on or off:

Rule	Description
The number of null values should not increase on refresh	Alerts you when the number of null values within a column increases unexpectedly. This validation is meant to detect upstream data issues that need to be resolved before retrying the extract.
Field should always have data from today	Enable this validation for any dataset where you want to ensure there has been an update before Simon extracts. It ensures that each time your data is extracted, the specified field contains fresh data. If the latest timestamp in the field does not have today’s date, this rule prevents the dataset from extracting successfully.
Duplicate values should not exist	Simon expects that certain types of non-event contact datasets will only contain one record per user. This rule evaluates whether there are multiple rows mapped to a single identifier and alerts you if that validation fails.
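
The three field rules can be approximated in Python as shown below. This is a sketch under stated assumptions rather than Simon's logic: the function names are hypothetical, rows are modeled as plain dictionaries, any increase in nulls counts as a failure (the actual tolerance is not documented here), and "today" is taken from the local date unless you pass one in.

```python
from collections import Counter
from datetime import date, datetime


def null_count(rows, field_name):
    """Count rows where the field is missing or None."""
    return sum(1 for row in rows if row.get(field_name) is None)


def nulls_increased(previous_rows, current_rows, field_name):
    """True when the current extract has more nulls in the field than the previous one."""
    return null_count(current_rows, field_name) > null_count(previous_rows, field_name)


def has_data_from_today(rows, field_name, today=None):
    """True when the latest timestamp in the field falls on today's date."""
    today = today or date.today()
    stamps = [row[field_name] for row in rows if row.get(field_name) is not None]
    return bool(stamps) and max(stamps).date() == today


def duplicate_identifiers(rows, identifier):
    """Return the identifier values that are mapped to more than one row."""
    counts = Counter(row[identifier] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)
```

Running checks like these against your source table before an extract can help you find the upstream issue that a failed rule is pointing to.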

To “delete” field-level rules, just toggle them off.

C. Skippable

You can explicitly flag specific datasets as skippable to keep them from holding up your pipe.

From the dataset view, click the Rules tab.
Toggle the skippable option to On.

📘
Skippable notes

If a dataset is not flagged as skippable, is not used within your account, and there has been at least one successful pipe run (data refresh), it is implicitly skippable so that your campaigns keep going. You don't need to toggle it to skippable yourself, but doing so causes no harm.

You can configure a notification to let you know when an extract fails for a dataset that has been flagged as skippable.
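
The explicit and implicit skippable conditions described in this section can be summarized with a short Python sketch. The names `is_skippable` and `resolve_extract` are hypothetical and the logic is a reading of the notes above, not Simon's implementation.

```python
def is_skippable(flagged_skippable, used_in_account, successful_pipe_runs):
    """Return True when a failed extract of this dataset should not hold up the pipe."""
    # Explicit: the skippable option has been toggled on.
    if flagged_skippable:
        return True
    # Implicit: the dataset is unused and at least one pipe run has succeeded.
    return (not used_in_account) and successful_pipe_runs >= 1


def resolve_extract(extract_succeeded, new_data, last_successful_data, skippable):
    """Pick the data to continue with after an extract attempt."""
    if extract_succeeded:
        return new_data
    if skippable:
        # Fall back to the data from the last successful extract.
        return last_successful_data
    raise RuntimeError("Extract failed and the dataset is not skippable")
```

For example, a dataset that is in use but not flagged as skippable raises an error on a failed extract, while the same dataset with the flag on continues with its last successful data.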

Additional rules

If there are other rules that you would like to apply to your datasets, please reach out to your Account Manager and let us know!