Introduction
In addition to the File Imports in the Civis Platform UI, a more streamlined CSV import is available via the API. This API endpoint offers a number of benefits, including:
- Faster Imports. The API endpoint circumvents the cleaning and inference steps performed by the UI-based import, which allows for faster processing.
- Upsert Mode. The API endpoint allows users to upsert (update or insert) data into a table, in addition to the current options for working with existing table data. Upsert Mode modifies the existing table, leveraging the primary key and last-modified fields to add new records and update modified ones. For more information, see “Upsert Mode” below.
- Importing a Subset of Table Columns. Where the UI import requires that a file match all columns of the destination table, this import allows users to specify which columns their file contains, and imports only those columns.
- Multi-File Import. The API endpoint allows users to import multiple files to the same table at once. These files must have the same data schema and file format.
- Import from S3. The API endpoint allows users to connect to their own S3 Buckets and import files or directories without the need to copy to Civis’ S3 bucket first.
However, with these benefits come some additional constraints. These are:
- Files Must Be Pre-Cleaned. Because the import does not perform cleaning, it is the responsibility of the user to ensure that their file is valid for import. This includes verifying quote escaping, checking for truncated lines, and ensuring that data is consistently delimited by a common separator such as a comma or tab. Please see “Cleaning Data” below. Data files that do not meet these specifications will fail to import properly.
- Users Must Provide Table Columns. Since the import does not perform type inference on the file, users must provide column type information. This can be done in one of two ways:
- The `table_columns` parameter can be passed in the API call that creates the import (see the sketch after this list).
- The user can create the table into which they will be importing files ahead of time; the import will use the column types from that table.
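For illustration, here is a minimal sketch of the `table_columns` structure: a list of entries, each pairing a column name with a SQL type. The column names and types below are hypothetical, and the exact key names (for example, snake_case versus camelCase) should be confirmed against the Civis Platform API Documentation.

```python
# Hypothetical table_columns value for a four-column file. Each entry pairs a
# column name with its SQL type; key casing may differ by client, so check the
# API documentation for the exact spelling.
table_columns = [
    {"name": "id", "sql_type": "INTEGER"},
    {"name": "email", "sql_type": "VARCHAR(256)"},
    {"name": "signup_date", "sql_type": "DATE"},
    {"name": "updated_at", "sql_type": "TIMESTAMP"},
]
```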
Cleaning Data
The CSV Imports API expects data that has been cleaned and prepared for import. For data files to be accepted for import, the following should be true:
- Data files must be UTF-8 encoded, and compressed with gzip, or not at all.
- Data files must not feature a byte-order mark (BOM).
- Data files must have either UNIX-style line endings ("\n"), or else Windows-style line endings ("\r\n").
- Columns must be delimited by a comma (","), tab ("\t"), or pipe ("|") character, and the `column_delimiter` parameter must be set to "comma", "tab", or "pipe" as appropriate.
- All rows must feature the same number of columns, and must share the same column delimiter.
- Data file lines must not have a trailing delimiter.
- Fields that contain the column delimiter character as data must be quoted.
- If fields are quoted, interior quotes must either be doubled or escaped with a single backslash; if backslash escaping is used, the `escaped` parameter should be set to "true".
It is imperative that users take steps to ensure data is clean before import. Data files that do not meet these specifications will fail to import properly.
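As a rough illustration of these requirements, the sketch below uses only the Python standard library to write a UTF-8, gzip-compressed CSV with comma delimiters, Unix line endings, no BOM, and doubled interior quotes (the csv module's default quoting behavior). The file name and rows are placeholders.

```python
import csv
import gzip

# Placeholder data, including a field with embedded quotes and a comma.
rows = [
    ["id", "name", "notes"],
    [1, "Ada", 'Says "hello", likes commas'],
    [2, "Grace", "No special characters"],
]

# newline="" hands line-ending control to the csv module; lineterminator="\n"
# yields Unix-style line endings. QUOTE_MINIMAL quotes only fields containing
# the delimiter or quote character, doubling interior quotes to escape them.
with gzip.open("cleaned.csv.gz", "wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    writer.writerows(rows)
```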
Using the API Endpoint
A CSV import can be created by sending a `POST` request to `/imports/files/csv`. For additional information on available parameters, please refer to the Civis Platform API Documentation. Helper functions are also available in the Civis Python and R clients.
It is important to note that this endpoint only creates the import object in Platform. It does not move any data. To run the newly created import, users will need to send a `POST` request to the `/jobs/{id}/runs` endpoint, providing the `id` of the newly created import. For more information on this endpoint, see the Civis Platform API Documentation, as well as the Python client and R client documentation. Python client users can also import the `civis.utils` module and use the `civis.utils.run_job` helper function.
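For example, a minimal sketch with the Civis Python client is shown below. The schema, table, database ID, credential ID, and file name are placeholders, and the exact shape of the `source` and `destination` objects should be confirmed against the Civis Platform API Documentation.

```python
import civis
from civis.utils import run_job

client = civis.APIClient()

# Upload the local file to Civis; this returns a file ID for the import source.
file_id = civis.io.file_to_civis("cleaned.csv.gz", name="cleaned.csv.gz")

# Create the import object. This call does not move any data by itself;
# the parameter shapes below are illustrative.
import_job = client.imports.post_files_csv(
    source={"file_ids": [file_id]},
    destination={
        "schema": "scratch",        # placeholder destination schema
        "table": "my_table",        # placeholder destination table
        "remote_host_id": 123,      # placeholder database ID
        "credential_id": 456,       # placeholder database credential ID
    },
    first_row_is_header=True,
    column_delimiter="comma",
    existing_table_rows="append",
)

# Run the import (equivalent to POST /jobs/{id}/runs) and wait for it to finish.
future = run_job(import_job.id, client=client)
future.result()
```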
Upsert Mode
The CSV Imports API supports `upsert` as an option for the `existing_table_rows` parameter. An upsert either appends new rows of data to the Platform table or updates existing rows if the data has changed in the source. This determination is based on the `primary_keys` and `last_modified_keys` provided by the user (these fields are required when using `upsert` mode). If the value of the `primary_keys` is not already found in the Platform table, a new row is created. If it is already found, the `last_modified_keys` are used to determine whether that row has been updated in the source since the last import. If it has, the job replaces the entire row in the Platform table with the updated row from the source.
This mode requires the following parameters:
- `primary_keys` - A list of column(s) that together uniquely identify a row in the data set. The values in these columns must be unique across rows and must not contain NULLs. N.B. The import will not validate the uniqueness of values in columns designated as primary keys. Users are responsible for validating these properties, and upserts may behave in unexpected ways if the designated primary key columns do not form a true primary key.
- `last_modified_keys` - A column (or columns) representing when the data in a given row was last changed. This allows the import to determine whether a row in the dataset needs to be updated.
Together, these parameters allow Civis Platform to identify which rows in the new data are truly new and which need to be updated. If the destination table does not already exist, or does not have these attributes set on it in the Civis API, these parameters must be provided for the upsert to proceed.
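A hedged sketch of an upsert-mode import using the Python client follows. The column names are hypothetical, and whether `primary_keys` and `last_modified_keys` are passed inside the `destination` object or at the top level should be confirmed against the Civis Platform API Documentation.

```python
import civis

client = civis.APIClient()
file_id = 789  # placeholder: a previously uploaded Civis file ID

# "id" and "updated_at" are hypothetical stand-ins for a real primary key
# column and a last-modified timestamp column.
upsert_import = client.imports.post_files_csv(
    source={"file_ids": [file_id]},
    destination={
        "schema": "scratch",
        "table": "upsert_funk",
        "remote_host_id": 123,                 # placeholder database ID
        "credential_id": 456,                  # placeholder credential ID
        "primary_keys": ["id"],                # uniquely identifies a row
        "last_modified_keys": ["updated_at"],  # flags rows that have changed
    },
    first_row_is_header=True,
    column_delimiter="comma",
    existing_table_rows="upsert",
)
```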
Data upserted into a table takes a different path to its final destination than other imports. First, new data is loaded into a staging table. The staging table for a given final table will be created in the same schema as the final table; it will share the column structure of the final table, and its name will be the name of the final table with `_stg` appended. For example, an upsert into a table called `upsert_funk` in a schema `scratch` would have a staging table called `upsert_funk_stg` with the same columns in the same order. Using a staging table allows Civis to validate that the data can be loaded and matches the schema of the final table. It also pools multiple smaller imports, if they happen in quick succession, into a single larger move, which is more performant for analytics databases like Redshift.
Once data has been loaded into the staging table, Platform will enqueue a background process to move data between the staging and final tables. In the time between the initial copy into the staging table and the final copy into the final table, a “near-real-time” view containing the data from both the final and staging tables is available in the table’s schema, named after the final table with `_nrt` appended. In the example above, the near-real-time view unifying `upsert_funk` and `upsert_funk_stg` would be called `upsert_funk_nrt`.
Data will be moved from the staging table to the final table 15 minutes after the last upsert.
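While the background move is pending, the near-real-time view can be queried like any other relation. A small sketch using the Python client's SQL reader is shown below; the database name is a placeholder.

```python
import civis

# Query the near-real-time view, which unifies upsert_funk and upsert_funk_stg
# until the background move into the final table completes.
df = civis.io.read_civis_sql(
    "SELECT COUNT(*) AS n FROM scratch.upsert_funk_nrt",
    database="my_database",  # placeholder database name
    use_pandas=True,
)
print(df)
```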
Storage Hosts
You can use the CSV Imports API to import files from your Amazon S3 bucket.
To create an Import from S3:
- Register your S3 bucket as a Storage Host
- Create a credential for the bucket in Credentials
- Use the CSV Imports API to copy a collection of files directly into Civis Platform.
Notes
- The import can accept a list of both files and directories; if a directory is provided, all files in that directory will be imported. As noted above, all files within the same import should share a data schema and file format.
- The S3 credential will need read access to the files and locations that are being imported. If multiple files are being imported to a Redshift database, this credential will also need write access to the location of at least one of the files being imported. Please see this Helpdesk page for more information.
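Putting these steps together, a hedged sketch of an S3-backed import is shown below. The storage host ID, credential IDs, paths, and destination details are placeholders, and the exact shape of the `source` object (rendered here as a `storage_path` entry) should be confirmed against the Civis Platform API Documentation.

```python
import civis
from civis.utils import run_job

client = civis.APIClient()

# Create a CSV import whose source is a registered S3 storage host rather
# than a Civis file. The nested structure below is illustrative.
s3_import = client.imports.post_files_csv(
    source={
        "storage_path": {
            "storage_host_id": 10,            # placeholder storage host ID
            "credential_id": 20,              # placeholder S3 credential ID
            "file_paths": ["exports/2024/"],  # files and/or directories
        }
    },
    destination={
        "schema": "scratch",
        "table": "s3_load",
        "remote_host_id": 123,                # placeholder database ID
        "credential_id": 456,                 # placeholder DB credential ID
    },
    first_row_is_header=True,
    column_delimiter="comma",
    existing_table_rows="truncate",
)

# Run the import and wait for completion.
run_job(s3_import.id, client=client).result()
```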
Adding an S3 Credential
To add your S3 credential to Platform, go to New Credential, fill out the following fields, and click "Save":
- Name: what you would like this credential to be named
- Credential Type: Amazon Web Services S3
- Username: AWS Access Key
- Password: AWS Secret Key
- Belongs To: select your user
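The same credential can also be created programmatically. Below is a sketch using the Python client's credentials endpoint, with placeholder values read from environment variables; the `type` string mirrors the Credential Type listed above.

```python
import os
import civis

client = civis.APIClient()

# The AWS access key goes in "username" and the secret key in "password",
# matching the UI form fields above. The credential name is a placeholder.
cred = client.credentials.post(
    type="Amazon Web Services S3",
    name="my-s3-import-credential",
    username=os.environ["AWS_ACCESS_KEY_ID"],
    password=os.environ["AWS_SECRET_ACCESS_KEY"],
)
print(cred.id)
```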
The Execution Parameter
The `execution` parameter allows users to tailor the execution strategy of the file import to their use case. This parameter has two valid values: `delayed` and `immediate`.
- Immediate executions refresh column statistics as part of the run. These statistics will be immediately available in the data catalog once the import has completed.
- Delayed executions flag the table for a deferred statistics update; column statistics may not be available until the next day. In addition, for imports in upsert mode, delayed executions move data from the staging table to the final table after a brief delay, in order to accommodate multiple concurrent imports to the same destination table.
The `execution` parameter defaults to `delayed`.
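For example, extending the earlier import-creation sketch, an immediate execution can be requested when the import is created; all other values remain placeholders.

```python
import civis

client = civis.APIClient()
file_id = 789  # placeholder: a previously uploaded Civis file ID

# Passing execution="immediate" refreshes column statistics as part of the
# run; omitting it (or passing "delayed") defers the statistics update.
import_job = client.imports.post_files_csv(
    source={"file_ids": [file_id]},
    destination={"schema": "scratch", "table": "my_table",
                 "remote_host_id": 123, "credential_id": 456},
    first_row_is_header=True,
    column_delimiter="comma",
    existing_table_rows="append",
    execution="immediate",
)
```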
Examples
- This sample Python code uses the Civis Python client to upload a file to Civis and import it using the CSV Imports API.
- This sample Python code runs a CSV import from a Civis S3 bucket.