Back to tutorials

dbt sources & dbt source freshness Commands: Usage & Examples

Introduction

Data transformation is the backbone of modern analytics, and dbt has emerged as a pivotal player in this space. The dbt source command, a cornerstone of dbt, allows users to manage and reference raw data tables seamlessly. This tutorial delves into the intricacies of this command, ensuring you can harness its full potential.

dbt sources Overview

dbt sources represent raw data tables in your data warehouse, serving as the foundation for dbt transformations. Instead of directly referencing these tables, dbt sources provide an abstraction, enhancing maintainability and clarity. By declaring a table as a source, you're marking it as a trusted base for transformations, enabling functionalities like data freshness checks. In business contexts, sources could represent raw sales data, customer interactions, or inventory details.

Understanding dbt sources

dbt sources are designed to manage source data within dbt. Its primary subcommand, dbt source freshness, allows you to query the "freshness" of your source tables, ensuring data is up-to-date and reliable.

Working with Source Data

In analytics, source data is the raw, unprocessed information directly from data stores. For instance, if you're an e-commerce business, your source data might be raw sales transactions, a raw list of customers, and raw historical inventory data. With dbt sources, you can create references to these tables, streamlining the transformation process.

Using the dbt source freshness Command

Add loaded_at_field to your sources yaml file

Before you can use dbt source freshness you need to define your sources and a loaded_at_field in your sources yaml file(s). In this example we two tables from our Stripe data: subbscriptions and customers that both have a field called _synced_at that tells us when each row was pulled from Stripe. We’re also including a freshness warning if data is more than 2 days old and an error if it’s more than 7 days old.

sources:
  - name: stripe
    database: analytics_db
    freshness:
      warn_after: {count: 2, period: day}
      error_after: {count: 7, period: day}
    loaded_at_field: _synced_at

    tables:
      - name: subscriptions
      - name: customers

Use the dbt source freshness

Imagine you want to ensure that your sales transactions data is updated at least once a day. Using dbt source freshness, you can run:

dbt source freshness

If any source table hasn't been updated as per the specified freshness configuration, dbt will raise a warning or error. In our example above, if the data is 3 days old, we’ll get a warning, since the warn_after threshold was set to 2 days.

Configuring Source Freshness Output

By default, the freshness information is saved to target/sources.json. To customize this:

dbt source freshness --output target/my_custom_path.json

Advanced Usage of dbt Source

Specifying Sources to Snapshot

If you only want to check the freshness of specific tables, use selectors like --select, --exclude , etc. This command will check source freshness for only the Stripe Customers source.

dbt source freshness --select source:stripe.customers

Understanding Data Lineage

For a business, understanding where data comes from is crucial. Using the {{ source() }} function, you can define the lineage. For instance, to reference our Stripe customers source in a model:

select 
  customer_id, 
  customer_name, 
  created_at

from {{ source('stripe', 'customers') }}

Most of the time you’ll select fields you need out of each source in a staging model, so each source has its own staging model file. That way you can choose which fields you need from the table and handle renaming and recasting data in one place.

Testing Your Sources

To ensure data quality, you can write tests for your sources. For instance, to ensure that each Stripe customer has a unique ID:

version: 2

sources:
  - name: stripe
    tables:
      - name: customers
        tests:
          - unique:
              column_name: customer_id

Benefits of Using dbt Sources

For businesses, dbt sources offer:

  • Data Consistency: Ensure that your analytics are based on reliable, consistent data.
  • Efficiency: Avoid rewriting transformation logic by referencing standardized data.
  • Freshness Monitoring: Ensure that your business decisions are based on the latest data.

Common Pitfalls and Best Practices

  • Stay Updated: Source data can change. Regularly run dbt source freshness to avoid disruptions.
  • Regular Checks: Schedule the freshness command to run at regular intervals, ensuring data quality.

Conclusion

dbt sources are a powerful tool in the dbt arsenal, especially for businesses that rely on timely and accurate data. By mastering the dbt source freshness command and setting up sources correctly, you ensure a robust foundation for your analytics.

database icon
Unified workspace for your dbt workflow
Forget about the painful parts of dbt development, focus on what matters the most - data analysis