dbt is a popular tool for building data transformation pipelines in SQL. Civis Platform enables you to run dbt pipelines via dbt Scripts, either by themselves or as part of larger Workflows.
dbt Scripts
If you want to run dbt by itself, you can create a dbt Script.
Step 1: Run Arguments
To run dbt, first connect your script to your Git repository and select which dbt command you’d like to run. If you’d like to pass in any additional arguments, like node selection, you can add these to the Command Line Arguments field. There is usually no need to change the dbt version in use.
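For example, to build a single model and everything downstream of it, you might enter node-selection arguments like the following in the Command Line Arguments field (the model name here is illustrative):
--select orders+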
Note that Platform does not yet support all dbt commands. If there's a command you'd like to use that we don't support yet, please let us know by reaching out to Support@CivisAnalytics.com!
Step 2: Profiles
Choose whether you'd like to use a profiles.yml file from your Git repository or have Platform autogenerate one for you.
Step 3: Database Variables
Select which database and credential to use and the schema where dbt will create the output tables. This must be a schema you have create permission on.
Step 4: dbt Variables
These variables tell Platform where to look for your dbt configuration.
If your dbt_project.yml file lives in a sub-directory of your repo, specify that directory path in the “dbt Project Directory” field.
When not auto-generating a profile, you may also set the Profiles Directory and Target for dbt to use. You do not need to set a profiles directory if your profiles.yml lives in the root of your repository or in the ~/.dbt folder. If target is unspecified, it will default to the default target defined in your profiles.yml.
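For example, if your profiles.yml lives in a config folder at the root of your repository, you could set the Profiles Directory to config and set Target to civis_redshift to use that output from the example profile shown later in this article (both names are illustrative).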
Step 5: Parameters (Optional)
Like in any Platform Script, you may add additional parameters for use in your dbt script using the “Set Parameters” side pane.
You may reference these parameters as environment variables in your dbt project:
"{{ env_var('MY_PARAM') }}"
dbt & Workflows
In this section, we’ll show an example Workflow based on dbt’s jaffle_shop example. If you want to try this out and see dbt running in Platform, you can 1) create a new Workflow object in Platform, 2) copy-paste the example Workflow definition below into the YAML tab, and 3) start an execution after specifying IDs for your database and database credentials.
In the example Workflow below, we have just a single task that runs dbt, for simplicity. A more realistic Workflow might have tasks for imports or exports, tasks for data processing in Python, tasks to publish to a Report, etc.
The script uses the generate_profiles setting to automatically create a profiles.yml file that tells dbt how to connect to your database and what schema the output tables should be created in. For production pipelines, you may want to use a profiles.yml file stored in your GitHub repository instead.
The script runs the dbt build command to seed the database, build tables, and run tests.
version: '2.0'
Workflow:
  input:
    # Find the ID of your database here: https://api.civisanalytics.com/databases
    - database_id:
    # Find your default database credential ID here: https://platform.civisanalytics.com/spa/#/credentials
    - database_cred_id:
    # Desired output schema. Must have create permission on it.
    - output_schema: scratch
  tasks:
    run_dbt:
      action: civis.scripts.dbt
      input:
        name: Run dbt
        repo_http_uri: github.com/civisanalytics/platform-code-examples
        repo_ref: main
        target_database:
          remote_host_id: <% int($.database_id) %>
          credential_id: <% int($.database_cred_id) %>
        dbt_project:
          schema: <% $.output_schema %>
          project_dir: dbt
        generate_profiles: true
        dbt_command: "build"
dbt Docs
A useful feature of dbt is that it can produce interactive documentation to help you understand your data pipeline. Platform automatically generates this documentation for you and uploads it to a report for easy viewing and sharing.
To view this documentation, click the “dbt Docs Report” button next to the Run History. This will open the dbt docs report in a new tab.
Constructing a Profile
If you’d like to use a custom profiles.yml file, you can still take advantage of Platform’s database and credential support.
Example profiles.yml
Here is an example profiles.yml file showing how to connect to each different database type. You are unlikely to need both targets; choose whichever is applicable to you.
my-project-name:
  target: civis_redshift # sets the default target
  outputs:
    # Works for redshift or postgres
    civis_redshift:
      type: "{{ env_var('DATABASE_TYPE') }}"
      host: "{{ env_var('DATABASE_HOST') }}"
      port: "{{ env_var('DATABASE_PORT') | as_number }}"
      dbname: "{{ env_var('DATABASE_NAME') }}"
      user: "{{ env_var('DATABASE_USERNAME') }}"
      password: "{{ env_var('DBT_ENV_SECRET_DATABASE_PASSWORD') }}"
      schema: "{{ env_var('DBT_SCHEMA') }}"
      threads: 4
    civis_bigquery_service_token:
      type: "{{ env_var('DATABASE_TYPE') }}"
      method: "{{ env_var('GCP_AUTH_METHOD') }}"
      project: "{{ env_var('GCP_PROJECT_ID') }}"
      schema: "{{ env_var('DBT_SCHEMA') }}"
      keyfile: "{{ env_var('GCP_SERVICE_ACCOUNT_KEYFILE_JSON') }}"
      threads: 4
dbt recommends that you save your profiles.yml configuration file either to the same directory as your dbt_project.yml file or to a ~/.dbt folder. To read more about dbt profiles, see the dbt documentation on profiles.
Note: dbt expects the port parameter to be an integer, so the environment variable needs to be piped through the as_number jinja function. For more information about jinja functions in dbt, see the dbt jinja functions documentation.
Custom Parameters
You can also define arbitrary parameters for any values that Platform isn't providing and reference them in a similar way. For example, if you want to parameterize the number of threads in use, you could define a THREADS custom parameter and reference it in your profiles file, as shown below. Use the "Set Parameters" pane to set custom parameters.
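For instance, assuming you defined a custom parameter named THREADS, the threads line of a profile output could reference it like this (a sketch, since threads must be an integer it is piped through as_number):
threads: "{{ env_var('THREADS') | as_number }}"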
Debugging
Memory issues
dbt scripts default to 1 GB of memory. If your logs include the message Killed, this may be because the dbt process attempted to use too much memory. In a dbt run, this is often caused by a macro, such as run_query, that attempts to process too many rows. In general, it is more efficient in both time and memory to refactor the affected model to process fewer rows locally, for example by using a LIMIT or GROUP BY clause. If optimizing your models isn't an immediate option, you can also adjust the script memory setting under the Settings dropdown or via the API.