Summary
This document outlines our specifications for ingesting recurring file feeds from an S3 bucket or SFTP server. All specifications must be adhered to in order for Simon to ingest your data. If you find that your feed does not meet these specifications, please reach out to your Client Solutions Manager.
File Attributes
File Formats
Character-delimited (CSV, TSV, etc.)
- If the files are character-delimited, the separator character must not appear in the values of the file itself; be certain that none of the values you are putting into the file contain your delimiter.
- For example, the following is invalid, and should use a different delimiter
first_name,last_name,location
jane,doe,Dallas, Texas
- If the delimiter character must appear in the values of the file itself, the values in your file must be enclosed in double quotes (see the sketch following this list). We use double quotes as text qualifiers; unpaired quotation marks cause records to drop.
- For example, the following is valid:
first_name,last_name,location
"jane","doe","Dallas, Texas"
- We require the delimiter to be one of the following: a tab (\t) or a comma (,).
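If you generate delimited files programmatically, a standard CSV writer will handle the quoting for you. Below is a minimal sketch in Python (the file name and rows are illustrative, not a Simon requirement) that quotes any value containing the comma delimiter:

```python
import csv

# Illustrative rows; "Dallas, Texas" contains the comma delimiter and must be quoted.
rows = [
    {"first_name": "jane", "last_name": "doe", "location": "Dallas, Texas"},
]

with open("users.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["first_name", "last_name", "location"],
        quoting=csv.QUOTE_MINIMAL,  # wraps values containing the delimiter in double quotes
    )
    writer.writeheader()
    writer.writerows(rows)
```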
JSON
Files must be newline-delimited JSON (NDJSON); we cannot ingest JSON in any other layout. This means a single JSON object per line of the file, not a single massive JSON array containing multiple objects.
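For example, a minimal sketch of producing newline-delimited JSON in Python (the record contents and file name are illustrative):

```python
import json

records = [
    {"user_id": "u_123", "item": "shoes", "price": 59.99},
    {"user_id": "u_456", "item": "hat", "price": 19.99},
]

# Write one JSON object per line, rather than one JSON array for the whole file.
with open("purchases.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```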
Parquet
If the files are in Parquet format, you must provide a schema dictating the underlying structure of the data. We also accept only primitive data types in Parquet files.
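As an illustration, here is a minimal sketch using pyarrow (one common choice, not a requirement) that writes a Parquet file from an explicit schema of primitive types, using Snappy compression (see the compression table below):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Explicit schema using only primitive types.
schema = pa.schema([
    ("user_id", pa.string()),
    ("amount", pa.float64()),
    ("purchased_at", pa.string()),  # ISO 8601 timestamp stored as a string
])

table = pa.table(
    {
        "user_id": ["u_123"],
        "amount": [59.99],
        "purchased_at": ["2020-03-17T01:45:31+00:00"],
    },
    schema=schema,
)

pq.write_table(table, "purchases.parquet", compression="snappy")
```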
Compression
| File Format | Accepted Compression |
| --- | --- |
| Character Delimited (.CSV, .TSV) | Preferred: GZIP. Accepted: ZIP, none. We do not accept zipped directories; each file must be compressed individually. |
| JSON | Preferred: GZIP. Accepted: ZIP, none. We do not accept zipped directories; each file must be compressed individually. |
| Parquet | Preferred: Snappy. Accepted: none. |
- Compressed files must have the suffix corresponding to the compression format (e.g. .zip for ZIP, or .gz for Gzip, etc.)
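For instance, a minimal sketch of compressing a single file with GZIP in Python, keeping the .gz suffix (file names are illustrative):

```python
import gzip
import shutil

# Compress each file individually; do not zip a whole directory.
with open("purchases.json", "rb") as src, gzip.open("purchases.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```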
Encryption
We can ingest PGP / GPG encrypted files transferred via AWS to our S3 environment in two ways:
- (preferred) We generate a public key/private key pair, and give you the public key to encrypt your files before sending to us
- (less secure) You can also generate your own public key/private key pair and send the private key to us
Encoding
All files must be encoded in UTF-8.
Folder structure
Datasets
- All files being extracted for a given logical dataset (e.g. users) must be in the same folder, and no folder should contain more than one logical dataset.
Schema changes
Schema changes are not accepted at this time. We do not support deleting or renaming fields. If a field is no longer populated, please persist the column with null or empty values.
Incremental Extracts (preferred)
We load any records that have been created or updated in the period since the last run. Save files under the date they were updated, using either a dt key or year/month/day keys (see the upload sketch at the end of this section).
Specify the date of the files in the folder of the dataset in one of three ways:
- Date partitioned
  - i.e. {root_path}/year={year}/month={month}/day={day}/{files}
  - e.g. s3://ORG.simondata.com/purchases/year=2020/month=03/day=17/purchases.json
- Date foldered
  - i.e. {root_path}/{year}/{month}/{day}/{files}
  - e.g. s3://ORG.simondata.com/purchases/2020/03/17/purchases.json
- Date encoded
  - i.e. {root_path}/{year}{month}{day}/{files} or {root_path}/{year}-{month}-{day}/{files}
  - e.g. s3://ORG.simondata.com/purchases/20200317/purchases.json or s3://ORG.simondata.com/purchases/2020-03-17/purchases.json
- If writing more than one file per day, add a unique key to the end of the file name to prevent overwrites.
- Incremental datasets may have multiple files per dataset per day, provided they are all in the same folder
- All event data must be shared in an incremental manner
Timestamps are strongly recommended for incremental file feeds, but not required; we suggest epoch format.
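To tie the pieces together, here is a minimal sketch of an incremental upload using boto3 (the bucket name and dataset path come from the examples above and are illustrative); it builds a date-partitioned key and appends a unique suffix so multiple files per day do not overwrite one another:

```python
from datetime import datetime, timezone
import uuid

import boto3

run_time = datetime.now(timezone.utc)

# Date-partitioned key: {root_path}/year={year}/month={month}/day={day}/{files}
key = (
    f"purchases/year={run_time:%Y}/month={run_time:%m}/day={run_time:%d}/"
    f"purchases_{uuid.uuid4().hex}.json.gz"  # unique suffix prevents overwrites
)

boto3.client("s3").upload_file("purchases.json.gz", "ORG.simondata.com", key)
```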
Overwrite Extracts
The entire table/schema that has been agreed upon will be dropped into S3 on each load. The previous data in S3 will be fully replaced with this new full load.
- Types of data
- Overwrite extracts can't be used to extract event data; event data must be shared in an incremental manner (see above)
- Some examples where overwrite extracts are especially useful:
- Product catalogs
- Offers
- Wherever possible, share all other types of data in an incremental manner (transferring adds, updates and deletes rather than full overwrites). We will deduplicate records on our side.
- Overwrite datasets may only point to one file in a given folder.
- Overwrite datasets in your S3 environment should genuinely overwrite the previous day’s files (i.e. all files in the given folder should be extracted each day, without extracting previous day’s files).
If you dropped malformed/incorrect data and need to overwrite
Please overwrite the bad file with your new file while maintaining the same file name
For example:
s3://ORG.simondata.com/table_name/table_name.tsv.gz
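A minimal sketch of that overwrite with boto3 (the bucket, folder, and file names are illustrative, taken from the example above); writing to the same key replaces the previous object:

```python
import boto3

boto3.client("s3").upload_file(
    "table_name.tsv.gz",             # corrected full export, gzip-compressed
    "ORG.simondata.com",             # destination bucket
    "table_name/table_name.tsv.gz",  # same key as the bad file, so it is overwritten
)
```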
Data format
- Timestamp data is in ISO format: 2017-08-11T01:45:31+00:00.
- If fields have a tab or newline (or the specified delimiter of your file) within them, escape the character with a backslash (\).
- Delimited files (e.g. TSV or CSV format) must have headers.
- Headers in TSV / CSV files and field names in JSONL / columnar formats must meet the following requirements (see the sketch after this list):
  - No spaces
  - All lowercase
  - No special characters
  - Unique within a single feed (no repeated column names)
- File names must include the proper extension (e.g. file_name.csv must have a .csv extension).
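As a convenience, here is a minimal sketch (not a Simon utility) of normalizing column headers to meet the requirements above:

```python
import re

def normalize_header(name: str) -> str:
    """Lowercase, replace spaces with underscores, and drop special characters."""
    name = name.strip().lower().replace(" ", "_")
    return re.sub(r"[^a-z0-9_]", "", name)

print(normalize_header("First Name"))  # first_name
print(normalize_header("Zip/Postal"))  # zippostal
```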
Global Limit
We limit loading to a total of 1GB of data into Simon per day across all of your datasets, whether batch or incremental, via S3. There is also a limit of 1 file per 15 minutes. If you need more than that, please speak with your Client Solutions Manager.
This product is intended to ingest periodic file uploads as a method of batch integration. It is not intended to support event streams.