Datasets Best Practices
Customer Dataset Query Writing
- Avoid full outer joins in your queries. Simon’s data pipe assembles a fully materialized customer 360 on your behalf by joining your datasets together. There may be some rare instances where a full outer join is required to compute the metric you’re looking for, but they’re generally to be avoided.
- If you find yourself wanting to use a full outer join, it may be a sign that you should split your dataset up into multiple datasets. Each dataset should be logically consistent, pulling different aggregates as fields from the same set of tables. When you’re reaching for new sets of tables, consider building a new dataset rather than updating an existing one.
- Be thoughtful about the Customer fields you bring into the platform. You don’t need to add fields to a dataset one-by-one as you need them, and can proactively add fields to power use cases that are on deck for your team to tackle. But you also shouldn’t bring in hundreds of fields you might need at some point in the future. Reducing the number of unnecessary fields in your account can improve your user experience by keeping your account clean and speeding up your data refresh.
- Ensure each Customer purpose query isn’t unintentionally returning more rows than the number of total contacts in your account. This is often indicative of duplicates in your query. Our data pipe will deduplicate these rows for you, but it will do so arbitrarily (albeit deterministically), and processing that extra data can dramatically increase the runtime of your data pipe. A duplicate check along the lines of the sketch after this list can surface offending keys before you commit.
- When adding new Customer fields to your account, take a peek to see if any of your existing live datasets leverage the set of source tables containing the new fields you intend to add. Adding these fields to existing datasets where possible can ensure you’re not running the same joins and aggregations over and over again across different datasets.
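A minimal sketch of that duplicate check, assuming a hypothetical analytics.customer_aggregates source keyed by email (swap in your own dataset query and identifier column):

```sql
-- Hypothetical duplicate check for a Customer purpose query.
-- Table and column names are illustrative stand-ins for your own warehouse objects.
with customer_query as (
    -- paste your dataset's query here
    select email, lifetime_order_count
    from analytics.customer_aggregates
)
select email, count(*) as duplicate_rows
from customer_query
group by email
having count(*) > 1
order by duplicate_rows desc
limit 100;
```

If this returns any rows, aggregate or filter within the dataset query itself rather than relying on the data pipe’s arbitrary deduplication.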
Contact Event and Object Query Writing and Configuration
- Avoid the use of full overwrite Contact Event and Object datasets at all costs.
- Ensure all incrementally updating Contact Event and Object datasets leverage a true updated timestamp (i.e. a timestamp that updates any time any value in its row updates); see the sketch after this list for the query shape this implies.
- Don’t mark datasets for use in event content unless you actually need the content from those datasets to personalize a campaign.
- Even if you do need event content, leverage linked lookups wherever possible instead of event content, even if that means duplicating entire datasets.
- Ensure all Contact Event and Object datasets don’t update more frequently than the data they sit on top of – i.e. if a table updates daily, don’t have the dataset that extracts from it update hourly.
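A minimal sketch of an incrementally extracted Contact Event query, assuming a hypothetical analytics.orders table whose updated_at column is bumped whenever any value in the row changes (all names here are illustrative):

```sql
-- Illustrative shape for an incremental Contact Event dataset.
-- updated_at must be a "true" updated timestamp: it changes whenever any
-- cell in the row changes, so changed rows are picked up on each refresh.
select
    o.email,
    o.order_id,
    o.order_total,
    o.created_at,
    o.updated_at
from analytics.orders o;
```

However your incremental configuration designates the timestamp column, the point of the sketch is that the column you choose must reflect every update, not just row creation.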
Data sources
Data sources should be preferred in a hierarchy as follows:
- Direct cloud data warehouse connection (database ingestion)
  - Direct connections provide the lowest latency and the most flexibility
  - Use data shares where necessary to address security or access concerns (Snowflake only)
- Minimum required webhook and event integration for data enhancement (webhooks, Signal, and the EIAPI)
  - The emphasis here should be on marketable event triggers as well as marketing-related behavioral event streams
- Minimum required 2p API integration (ingestion integrations)
  - Ideally, you should manage an ETL partner of your own (e.g. Fivetran, Stitch, etc.) upstream of Simon, because you can then take consistent advantage of this data across your organization, beyond marketing use cases
- File feeds (recurring file feeds, SFTP, etc.)
  - These are ideal for supporting datasets, but should not be used exclusively, nor for core contact datasets
Doing as much data centralization, modeling, and QA as possible upstream of Simon, in your own cloud data warehouse, lets you take consistent advantage of that data across your organization, beyond marketing use cases. However, Simon can facilitate rapid integration with your data until you’re able to “pull back” that processing into your own infrastructure.
Dataset development
Simon supports a standard workflow for creating new datasets:
Develop
- Configure and define your dataset, typically by authoring a SQL query
- Validate your dataset using a sample extracted by the dataset management tool
- Configure the type and use of fields in the query’s result set
- Use the development-time QA process below
- Save and continue to develop
Live
- Commit the dataset to make it live
- Address any Action Panel items
  - Action Panel items correspond to failures of the dataset or field rules described in the QA process below
  - Either correct the dataset’s definition or adjust the rules
  - Use the development-time QA process below
  - If new issues arise, cover them with additional rules so they don’t cause downstream issues
Pause (by request to your account manager)
- The dataset will remain visible in the account, but will not be updated by Simon’s data refresh, and therefore will go stale
Archive
- Use the dataset explorer to verify that the dataset is not in use by any downstream segments, templates, flows, or journeys
- If it is still in use, delete the downstream usages before proceeding
- Archive the dataset to remove it from the account
Quality assurance and validation
Example validations to perform before committing a dataset:
- Does the dataset cover as many contacts as I expect it to, or am I unintentionally missing data?
- Does a dataset intended to be user-level (i.e. 1:1) return more rows than I have customers? Do I need to be aggregating / grouping rather than just selecting?
- Does a field have as many non-empty values as I expect it to, or am I unintentionally pulling sparse data? (A summary query along the lines of the sketch below can answer all three of these questions.)
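A minimal sketch of those pre-commit checks, assuming a hypothetical dataset query over analytics.customer_aggregates keyed by email (all names are illustrative):

```sql
-- Hypothetical pre-commit validation; replace the CTE body with your dataset query.
with dataset_query as (
    select email, loyalty_tier, last_purchase_at
    from analytics.customer_aggregates
)
select
    count(*)              as row_count,          -- compare to your expected contact count
    count(distinct email) as distinct_contacts,  -- row_count > distinct_contacts suggests missing aggregation
    count(loyalty_tier)   as non_null_loyalty,   -- low counts here indicate sparse data
    round(100.0 * count(loyalty_tier) / nullif(count(*), 0), 1) as pct_loyalty_populated
from dataset_query;
```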
Example validations to coordinate with your data team and assert via Simon rules:
- When does my data refresh in my cloud data warehouse? Will it be available when Simon refreshes or is it unavailable during specific times of day?
- How does my data update? Is data constantly updated and/or appended, or is the entire table emptied and refilled on some cadence?
- How are “updated” and “created” timestamps calculated on tables incrementally extracted by Simon? If any cell in a row is updated, will that bump the “updated” timestamp? Simon will miss updates that don’t bump the “updated” timestamp. A spot check along the lines of the sketch after this list can help confirm the behavior.
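A minimal sanity check on a source table’s timestamps, assuming a hypothetical analytics.orders table with created_at and updated_at columns (illustrative names):

```sql
-- Illustrative spot check: rows updated without bumping updated_at will be
-- missed by an incremental extraction keyed on that column.
select
    count(*)                                              as total_rows,
    count(case when updated_at < created_at then 1 end)   as suspicious_timestamp_rows,
    max(updated_at)                                       as most_recent_update  -- compare to your warehouse refresh schedule
from analytics.orders;
```

The definitive answer still has to come from your data team; a query like this only surfaces obvious inconsistencies.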
Development-time QA process
Use the “READ-ONLY” mode of the dataset tool to replace your query with summary aggregates
This allows you to run summary statistics at the dataset and field levels and return arbitrary result sets to validate your dataset. Use a query such as the following:
select count(*), count(field_name), count(distinct field_name) from schema.table_name;
- Remember, you may use CTEs, joins, subqueries, and any existing QA facilities or tables you may already have in your cloud data warehouse; the sketch after this list shows one such query
- You can use this exploration to define dataset rules (applied by default, but configurable by you) and field rules (configured by you) for automated validation once the dataset is set live
- Dataset rules
  - Skippable (off by default)
  - Non-empty (on by default)
  - Row count should not decrease beyond a threshold (75% by default)
- Field rules
  - The number of null values should not increase on data refresh
  - The timestamp field should always contain data from today or later
  - Duplicate values should not exist
  - Column should not contain invalid data formatting
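A minimal sketch of a READ-ONLY QA query that joins against a contacts source through CTEs, assuming hypothetical analytics.contacts and analytics.customer_aggregates tables (all names are illustrative):

```sql
-- Hypothetical READ-ONLY coverage check: how many contacts does the dataset
-- cover, and how populated is a given field? Swap in your own dataset query.
with dataset_query as (
    select email, last_purchase_at
    from analytics.customer_aggregates
),
contacts as (
    select distinct email
    from analytics.contacts
)
select
    count(*)                  as total_contacts,
    count(d.email)            as contacts_covered_by_dataset,
    count(d.last_purchase_at) as contacts_with_last_purchase
from contacts c
left join dataset_query d
    on d.email = c.email;
```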
Once validated, switch back to “BUILDING” and save the dataset