Database imports often need custom configuration. For example, your import might be scheduled to run weekly, and you may want to choose whether the entire table is refreshed on each run or only new rows are added. Or your table might have multiple identifier columns, and you want to specify which one should serve as the primary key; otherwise, the import will try to auto-detect it. To configure your import, modify the advanced settings for the tables you are syncing: you can manage additional settings for any individual table by clicking the down caret to the right of the destination schema and table field. Note that you only need to edit the settings relevant to your job; untouched settings keep their default values.
Verify Table Row Count
After the sync completes, check that the number of rows imported to the destination matches the number of rows in the data source. If Verify Table Row Count is enabled, your job will fail when the row counts do not match. To verify the row count accurately, we compare the rows imported up to the maximum value of the Primary Key column, if one is provided or detected from the source database. This accounts for rows added to the source table after the sync began; those rows will be imported the next time the sync runs.
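The check described above can be sketched roughly as follows. This is a minimal illustration, not the importer's actual code; the function name and the use of SQLite are assumptions made for the example. Rows are counted only up to the primary-key value observed when the sync began, so rows added to the source mid-sync do not cause a spurious failure.

```python
import sqlite3

# Hypothetical helper illustrating the row-count verification: count source
# rows only up to the max primary-key value seen at the start of the sync,
# then compare against the destination's row count.
def verify_row_count(source, destination, table, pk_column, max_pk_at_start):
    src_count = source.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {pk_column} <= ?",
        (max_pk_at_start,),
    ).fetchone()[0]
    dst_count = destination.execute(
        f"SELECT COUNT(*) FROM {table}"
    ).fetchone()[0]
    if src_count != dst_count:
        raise RuntimeError(
            f"Row count mismatch: source={src_count}, destination={dst_count}"
        )
    return src_count
```

If two rows land in the source after the sync starts, bounding the source count by the starting max primary key keeps the comparison fair; the late rows are left for the next run.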
Full Refresh On Each Run
Deletes any existing data in the destination table before syncing. Turn this on if your table cannot be incrementally synced (it lacks an auto-incrementing primary key or a modified_at-type column, or it deletes records) and needs to be fully refreshed on each run.
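In effect, a full refresh replaces the destination's contents wholesale. The sketch below is an assumption-laden illustration (using SQLite and in-memory rows, not the importer's actual mechanism): delete everything in the destination table, then insert the freshly extracted rows, so the destination mirrors the current source state, including any deletions.

```python
import sqlite3

# Hypothetical full-refresh helper: delete all destination rows, then insert
# the new rows, inside a single transaction so readers never see a half-empty
# table if the insert fails.
def full_refresh(dest, table, rows):
    with dest:  # sqlite3 connection as context manager = one transaction
        dest.execute(f"DELETE FROM {table}")
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            dest.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows
            )
```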
Truncate Long Text
Truncate data to fit the destination table.
Invalid Character Replacement
Specifies an ASCII character used to replace any invalid characters. This setting applies only when the destination database is Redshift.
Primary Key
Sets a primary key, a unique identifier for each row in the table. This value overrides the auto-detected identity column(s). If a table has more than one identity column, you must specify which one to use as the primary key.
We recommend that the primary key be a sequential data type such as an integer, double, timestamp, date, or float. String data types such as varchar and text might not sync all data from the source to the destination. If your primary key is a string type, we recommend also setting a Last Modified Column (see below) to ensure all data is synced to the destination table.
For example, suppose that the first time you sync your table, the max value of the primary key is "winston". A new row is then added with the primary key value "salem". Without a Last Modified Column set on the job, this new row will not be picked up by subsequent incremental syncs, since "salem" sorts lower than the previous max value "winston".
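The pitfall above can be demonstrated in a few lines. This is an illustration only (the function name and row format are hypothetical, not the importer's implementation): an incremental sync keeps only rows whose primary key exceeds the previous maximum, which works for sequential keys but silently skips rows whose string keys sort below the earlier maximum.

```python
# Hypothetical incremental filter: keep rows whose primary key is greater
# than the maximum value recorded on the previous run.
def rows_to_sync(source_rows, previous_max_pk):
    return [row for row in source_rows if row["pk"] > previous_max_pk]

# Sequential integer key: the new row 6 is correctly picked up.
print(rows_to_sync([{"pk": 6}], 5))                 # [{'pk': 6}]

# String key: "salem" sorts below "winston", so the new row is skipped.
print(rows_to_sync([{"pk": "salem"}], "winston"))   # []
```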
Last Modified Column
Used for incremental syncs. Select a column so that the import pulls only rows where the last modified column's value in the source is greater than or equal to its value in the destination. The default is "Auto-Detect", in which case the column name must contain "modif" or "update"; the import uses the first column it finds containing either substring. You can also specify a custom column name, or leave the field blank if the table has no last modified column. Note that Full Refresh On Each Run must be turned off for this setting to be used.
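The "Auto-Detect" rule described above amounts to a simple substring scan. A minimal sketch, with an illustrative function name rather than the importer's actual code: return the first column whose name contains "modif" or "update", or None when the table has no such column.

```python
# Hypothetical auto-detect helper: pick the first column whose (lowercased)
# name contains "modif" or "update"; None means no last modified column.
def autodetect_last_modified(columns):
    for column in columns:
        name = column.lower()
        if "modif" in name or "update" in name:
            return column
    return None

print(autodetect_last_modified(["id", "name", "updated_at"]))     # updated_at
print(autodetect_last_modified(["id", "last_modified", "note"]))  # last_modified
print(autodetect_last_modified(["id", "name"]))                   # None
```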
Row Chunk Size
Breaks the SQL request that pulls data from the source database into chunks. This is useful for very large imports, where the value overrides the automatically calculated chunk size. A value of zero, or leaving the field blank, disables chunking.
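Chunked extraction can be sketched as below. This is an illustration under stated assumptions (SQLite, keyset pagination on the primary key, which is assumed to be the first column), not the importer's actual query strategy: each iteration issues one SQL request for at most `chunk_size` rows, so a very large table is pulled in bounded pieces, and a chunk size of zero falls back to a single query.

```python
import sqlite3

# Hypothetical chunked puller: page through the table by primary key,
# chunk_size rows per query; chunk_size <= 0 disables chunking.
def pull_in_chunks(conn, table, pk_column, chunk_size):
    if chunk_size <= 0:  # chunking disabled: one query for everything
        yield from conn.execute(f"SELECT * FROM {table} ORDER BY {pk_column}")
        return
    last_pk = None
    while True:
        if last_pk is None:
            rows = conn.execute(
                f"SELECT * FROM {table} ORDER BY {pk_column} LIMIT ?",
                (chunk_size,),
            ).fetchall()
        else:
            rows = conn.execute(
                f"SELECT * FROM {table} WHERE {pk_column} > ? "
                f"ORDER BY {pk_column} LIMIT ?",
                (last_pk, chunk_size),
            ).fetchall()
        if not rows:
            return
        yield from rows
        last_pk = rows[-1][0]  # primary key assumed to be the first column
```

Keyset pagination (`WHERE pk > last_seen`) is used here rather than `OFFSET` because offsets force the source database to rescan skipped rows on every chunk.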