The Import from S3 (script template ID 296147) connects to a specified AWS S3 bucket and imports data onto Civis Platform in one of the following ways:
-
As a Civis table:
All specified S3 files (determined by the parameters “S3 Path”, “Path Pattern”, and “S3 Last Modified Days / Start Date / End Date”) will be treated as CSV-like files and imported to the specified Civis table. For this use case, all of the parameters “Database”, “Database credential”, and “Schema and Tablename” must be provided.
The CSV data can be either plain text or compressed files. For file compression, only gzip (.gz) and bzip2 (.bz2) are auto-detected and supported.
If the CSV data needs to be processed in any way before it can be imported to a Civis table, it can first be imported as Civis files – please see below.
For importing data to a Civis table, users of the older script template ID 181199, whose documentation is kept in the second half of this page below, are encouraged to migrate existing custom scripts based on template 181199 to this newer template 296147. The older template 181199 will remain as-is and won’t be removed.
-
As Civis files:
All specified S3 files (determined by the parameters “S3 Path”, “Path Pattern”, and “S3 Last Modified Days / Start Date / End Date”) will be directly copied onto Civis Platform as Civis files, accessible as run outputs in a custom script run. For this use case, all of the parameters “Database”, “Database credential”, and “Schema and Tablename” must be left empty.
This use case is suitable for handling CSV data that needs custom processing before importing it to a Civis table, or for bringing in any data files onto Civis Platform in general.
To access the imported Civis files as run outputs from a custom script run, they are available from the logs pane for a given custom script run on Civis Platform. They can also be accessed programmatically via the Civis API, e.g., client.jobs.list_runs_outputs from the Civis Python client.
For importing data as Civis files, users of the older script template ID 9245 are encouraged to migrate existing custom scripts based on template 9245 to this newer template 296147. The older template 9245 will remain as-is and won’t be removed.
Parameters
- AWS Keypair Credential (required): The AWS keypair credential to access the data on S3.
- S3 Path (required): The S3 path (file or folder) containing the data to import, e.g., "my-bucket/my-prefix/" or "my-bucket/my-file.csv".
- Path Pattern (optional): A pattern to match S3 file paths. Only files with paths matching this pattern will be imported. Regular expression syntax is supported for pattern matching. For example, to import only CSV files, you might use ".*\.csv$". If this parameter is left blank, all files under the specified S3 path will be imported.
- S3 Last Modified Days (optional): If specified, only import files that were last modified within the given number of days. For example, if you enter "7", only files modified in the last 7 days will be imported. This parameter can optionally be used in conjunction with either "S3 Last Modified Start Date" or "S3 Last Modified End Date" (but not both).
- S3 Last Modified Start Date (optional): If specified, only import files that were last modified on or after the given date. The date should be in YYYY-MM-DD format. This parameter can optionally be used in conjunction with "S3 Last Modified Days".
- S3 Last Modified End Date (optional): If specified, only import files that were last modified on or before the given date. The date should be in YYYY-MM-DD format. This parameter can optionally be used in conjunction with "S3 Last Modified Days".
- Days After Which Civis Files Expire (optional): The number of days (as an integer) after which the imported Civis files expire. If not specified, the files will expire in 30 days. To have the files never expire, set this parameter to 0. You are encouraged to keep Civis files only as long as needed for security and storage cost reasons.
- Has Header (optional): If importing data to a Civis table, whether the data files have a header row. If this parameter is not selected because your S3 data doesn't have a header row, the destination Civis table is assumed to exist (if not, create it via a query such as CREATE TABLE before running this S3 import).
- Column Delimiter (optional): Column delimiters, one of comma, tab, or pipe.
- Database (optional): The Civis Platform database to which to write records. Provide all of the parameters “Database”, “Database credential” as well as “Schema and Tablename” if importing data to a Civis database table, or specify none of them when importing data as Civis files instead.
- Database Credential (optional): A valid database credential for the Civis Platform database. To use the credential associated with the logged-in user, choose from the dropdown the credential that has this user’s Civis Platform username. Provide all of the parameters “Database”, “Database credential” as well as “Schema and Tablename” if importing data to a Civis database table, or specify none of them when importing data as Civis files instead.
- Schema and Tablename (optional): The schema and table name you would like to import the data into, e.g., schema.table. Provide all of the parameters “Database”, “Database credential” as well as “Schema and Tablename” if importing data to a Civis database table, or specify none of them when importing data as Civis files instead.
- Column Data Type Overrides (optional): A mapping from column names to data types to use instead of the data types from the data source. Each line should be in the form of "column_name: DATA_TYPE" (without quotes), e.g., "foo_bar: VARCHAR". Only provide an entry for columns for which you want to override the data type; do not provide entries for columns that should use the default data type from the data source. This parameter is useful for data type issues that would otherwise cause the import to fail, e.g., use "foobar: VARCHAR" if the data source says "foobar" should be a numeric type but it actually contains non-numeric values.
-
Existing Table Exists (optional): The action to take if a table with the requested name already exists. Options are:
- Append: Records will be appended to the existing table
- Fail: The sync job fails if the table exists
- Drop: Removes the entire table including column information.
- Truncate: Records on the table will be wiped out while maintaining the current columns.
Note: This is the end of the documentation for the script template 296147. The following is the documentation for the older template 181199 kept for reference.
How it works
For a given S3 bucket and set of matching strings e.g. "cust_data", "int_data", and "ret_data", each file in the bucket that matches the string "cust_data" will be put into one table, "int_data" matches into another, and "ret_data" into a third. Assumptions:
All the files are plain text CSV files. Compressed files cannot be processed with this import script.
All the files that match a string have the same column schema.
Any future files of that type will also match that column schema. Any changes will cause those files to fail when inserting into existing tables.
If the given S3 bucket is empty, the script will fail.
Getting started
Fill out the following parameters (required parameters are marked with a red asterisk in the template).
Parameters
S3 bucket (required): S3 bucket to get objects from
S3 Creds (required): Valid S3 credential with a list and read permissions
Match strings (optional): A list of comma separated strings to match objects against e.g. key_name,other_key_name
Path (optional): This is the path or directory to check for objects
Delimiter (optional): Delimiter to use for importing data for file (e.g, for a csv, use a comma). All matched files are expected to have the same delimiter value.
Database (required): Database to write to
- Database Credential (required): A valid database credential for the database.
Schema Name (required): Schema name to put table(s) into.
Full refresh (optional): Perform a full refresh i.e. gets all matching files and drop any existing tables before importing them.
Refresh Date (optional): Date to use as start date. This will get all data after the specified date by comparing it with the last modified date of the object. This will be ignored if the Full Refresh parameter is selected. If the Full Refresh is not selected, and the refresh date is not specified also, the last successful run date of the job will be used. Allowed formats are: YYYY-MM-DD, YYYY/MM/DD, MM-DD-YYYY, MM/DD/YYYY.
End Refresh Date (optional): Date to use as end date. This will get all data before the specified date by comparing it with the last modified date of the object (used with Refresh date to make a range). This will be ignored if Full Refresh is selected. Defaults to None if not specified and Full refresh is not selected also which will pull data up to the most recent. Allowed formats are: YYYY-MM-DD, YYYY/MM/DD, MM-DD-YYYY, MM/DD/YYYY.
Skip bad rows (optional): Allows bad rows to be skipped during import. When bad data is encountered (i.e. misaligned column splits), the file is skipped.
Ignore list (optional): List of S3 objects to ignore (must be the full object name), one object per line
Log level (optional): Log level for output logs (can be DEBUG (most output), INFO (some output), WARNING (least output))
Comments
0 comments
Please sign in to leave a comment.