Recurring File Feeds - Specifications

Summary

This document outlines our specifications for ingesting recurring file feeds from an S3 bucket or SFTP server. All specifications must be adhered to in order for Simon to ingest your data. If your feed does not meet these specifications, please reach out to your Client Solutions Manager.

File Attributes

File Formats

Character-delimited (CSV, TSV, etc.)

  • If the files are character-delimited, the separator character must not appear in the values of the file itself; be certain that none of the values you're putting into the file contain your delimiter
    • For example, the following is invalid, and should use a different delimiter
first_name,last_name,location
jane,doe,Dallas, Texas
  • If the delimiter must appear in the values of the file itself, those values must be enclosed in double quotes
    • For example, the following is valid:
first_name,last_name,location
"jane","doe","Dallas, Texas"
  • We require the delimiter to be one of the following: a tab (\t) or a comma (,)
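
The quoting rule above can be applied automatically with Python's standard csv module, which wraps only the values containing the delimiter in double quotes. A minimal sketch, using the column names from the example:

```python
import csv
import io

# Write rows whose values may contain the delimiter; QUOTE_MINIMAL wraps
# only those values in double quotes, as the spec requires.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_MINIMAL)
writer.writerow(["first_name", "last_name", "location"])
writer.writerow(["jane", "doe", "Dallas, Texas"])
print(buf.getvalue())
```

The location value is emitted as "Dallas, Texas", so the row still parses to exactly three columns.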

JSON

Files must be newline-delimited JSON; we cannot ingest them otherwise. That is, each line of the file must contain a single JSON object, rather than the whole file being one large JSON array of objects.
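
The distinction can be seen in a short Python sketch (the record fields here are hypothetical):

```python
import json

records = [
    {"user_id": 1, "email": "jane@example.com"},
    {"user_id": 2, "email": "john@example.com"},
]

# Newline-delimited JSON: one object per line, no enclosing array.
ndjson = "\n".join(json.dumps(r) for r in records) + "\n"
print(ndjson)
```

`json.dump(records, f)` on the whole list would instead produce a single JSON array, which is the format we cannot ingest.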

Parquet

If the files are in Parquet format, you must provide a schema dictating the underlying structure of the data. We also accept only primitive data types in Parquet files.
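
The exact form the schema should take is best confirmed with your Client Solutions Manager; as an illustration only, a schema document might list each field with a primitive type (dataset and field names here are hypothetical):

```json
{
  "dataset": "purchases",
  "fields": [
    {"name": "purchase_id",  "type": "string"},
    {"name": "user_id",      "type": "int64"},
    {"name": "amount",       "type": "double"},
    {"name": "purchased_at", "type": "timestamp"}
  ]
}
```

Note that nested types (structs, lists, maps) are not represented here, consistent with the primitive-types-only requirement.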

Compression

File Format                       Preferred Compression   Accepted Compression
Character-delimited (.csv, .tsv)  GZIP                    ZIP, None
JSON                              GZIP                    ZIP, None
Parquet                           Snappy                  None

  • For character-delimited and JSON files, we do not accept zipped directories; each file must be compressed individually

  • Compressed files must have the suffix corresponding to the compression format (e.g. .zip for ZIP, or .gz for Gzip, etc.)
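
A minimal sketch of producing a correctly named compressed file with Python's standard gzip module (file and column names are hypothetical):

```python
import gzip
import os
import tempfile

# Compress a delimited file with gzip, giving the output the .gz suffix
# so the compression format can be identified from the file name.
data = "first_name,last_name\njane,doe\n".encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "users.csv.gz")

with gzip.open(path, "wb") as f:
    f.write(data)

# Round-trip check: the file decompresses back to the original bytes.
with gzip.open(path, "rb") as f:
    assert f.read() == data
```

Note the stacked extension (.csv.gz): both the underlying format and the compression are visible in the name.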

Encryption

We can ingest PGP / GPG encrypted files transferred via AWS to our S3 environment in two ways:

  • (preferred) We generate a public key/private key pair, and give you the public key to encrypt your files before sending to us
  • (less secure) You generate your own public key/private key pair and send the private key to us

Encoding

All files must be encoded in UTF-8.
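
When generating files, encode explicitly rather than relying on the platform default, which on some systems is Windows-1252 or Latin-1. A small Python sketch:

```python
# Encode file contents as UTF-8 explicitly; a mismatched default encoding
# is a common source of ingestion failures for accented characters.
row = "first_name,location\nrené,montréal\n"
encoded = row.encode("utf-8")

# UTF-8 round-trips cleanly; decoding with the wrong codec would garble é.
assert encoded.decode("utf-8") == row
```

When writing to disk, the same applies: pass encoding="utf-8" to open() rather than leaving it implicit.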

Folder structure

Datasets

  • All files being extracted for a given logical dataset (e.g. users) must be in the same folder, and no folder should contain more than one logical dataset.

🚧 Schema changes

Schema changes are not accepted at this time, and we do not support the deletion of fields. If a field is no longer populated, persist the column with null or empty values.
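
A short sketch of what persisting a retired column looks like in practice (column names are hypothetical):

```python
import csv
import io

# "loyalty_tier" is no longer populated upstream, but the column stays in
# the feed and is filled with empty values rather than being removed.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["user_id", "email", "loyalty_tier"])
writer.writerow(["1", "jane@example.com", ""])  # empty value, column kept
print(buf.getvalue())
```

Every row continues to carry the same number of columns, so the schema seen by the ingestion pipeline never changes.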

Incremental Extracts (preferred)

We load any records that have been created or updated in the period since the last run. Save each file under the date it was updated, using either a dt key or year/month/day keys.

Specify the date of the files in the folder of the dataset in one of three ways:

  • Date partitioned
    • i.e. {root_path}/year={year}/month={month}/day={day}/{files}
    • e.g. s3://ORG.simondata.com/purchases/year=2020/month=03/day=17/purchases.json
  • Date foldered
    • i.e. {root_path}/{year}/{month}/{day}/{files}
    • e.g. s3://ORG.simondata.com/purchases/2020/03/17/purchases.json
  • Date encoded
    • i.e. {root_path}/{year}{month}{day}/{files} or {root_path}/{year}-{month}-{day}/{files}
    • e.g. s3://ORG.simondata.com/purchases/20200317/purchases.json or s3://ORG.simondata.com/purchases/2020-03-17/purchases.json
  • If writing more than one file per day, add a unique key to the end of the file name to prevent overwrites.
  • Incremental datasets may have multiple files per dataset per day, provided they are all in the same folder
  • All event data must be shared in an incremental manner
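
The date-partitioned layout above, plus the unique-key requirement for multiple daily files, can be sketched as follows (the helper name incremental_key is hypothetical; the bucket and dataset names follow the examples above):

```python
import datetime
import uuid

def incremental_key(root: str, dataset: str, day: datetime.date) -> str:
    """Build a date-partitioned key with a unique suffix so multiple
    files written on the same day don't overwrite each other."""
    unique = uuid.uuid4().hex[:8]
    return (f"{root}/{dataset}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{dataset}_{unique}.json")

key = incremental_key("s3://ORG.simondata.com", "purchases",
                      datetime.date(2020, 3, 17))
print(key)
```

Each run of the extract yields a distinct file name within the same day's folder, satisfying the one-folder-per-day rule without overwrites.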

Overwrite Extracts

The entire table/schema that has been agreed upon will be dropped into S3 on each load. The previous data in S3 will be fully dropped and replaced with the new full load.

  • Types of data
    • Overwrite extracts can't be used to extract event data; event data must be shared in an incremental manner (see above)
    • Some examples where overwrite extracts are especially useful:
      • Product catalogs
      • Offers
    • Wherever possible, share all other types of data in an incremental manner (transferring adds, updates and deletes rather than full overwrites). We will deduplicate records on our side.
    • Overwrite datasets may only point to one file in a given folder.
    • Overwrite datasets in your S3 environment should genuinely overwrite the previous day’s files (i.e. all files in the given folder should be extracted each day, without extracting previous day’s files).

For example:

  • s3://ORG.simondata.com/table_name/table_name.tsv.gz

Data format

  • Timestamp data must be in ISO 8601 format: 2017-08-11T01:45:31+00:00.
  • If fields have a tab or newline (or the specified delimiter of your file) within them, escape the character with \
  • Delimited files (e.g. TSV or CSV format) must have headers
  • Headers in TSV / CSV formats, and field names in JSONL / columnar formats, must meet the following requirements:
    • No spaces
    • All lowercase
    • No special characters
  • File names must include the proper extension (e.g. a file named file_name.csv must actually be a CSV file)
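
The header/field-name rules (no spaces, all lowercase, no special characters) can be enforced with a small normalization step; a sketch, with a hypothetical helper name:

```python
import re

def normalize_header(name: str) -> str:
    """Lowercase, replace spaces with underscores, and drop any
    character outside a-z, 0-9 and underscore."""
    name = name.strip().lower().replace(" ", "_")
    return re.sub(r"[^a-z0-9_]", "", name)

headers = ["First Name", "E-mail Address", "Signup Date"]
print([normalize_header(h) for h in headers])
```

Running normalization once before writing the header row keeps every feed consistent with the requirements above.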

Global Limit

We limit each client to loading a total of 1 GB of data into Simon per day across all of your datasets, whether batch or incremental, via S3. If you need more than that, please speak with your Client Solutions Manager.

