
Analytics craft

The art of being an analytics practitioner.


Test smarter not harder: add the right tests to your dbt project

· 11 min read
Faith McKenna
Jerrie Kumalah Kenney

The Analytics Development Lifecycle (ADLC) is a workflow for improving data maturity and velocity, and testing is a key phase of it. Many dbt developers tend to focus on primary keys and source freshness. We think there is a more holistic and in-depth path to tread, one where testing drives data quality.

In this blog, we’ll walk through a plan to define data quality. This will look like:

  • identifying data hygiene issues
  • identifying business-focused anomaly issues
  • identifying stats-focused anomaly issues

Once we have defined data quality, we’ll move on to prioritizing those concerns. We will:

  • think through each concern in terms of the breadth of impact
  • decide if each concern should be at error or warning severity

Who are we?

Let’s start with introductions - we’re Faith and Jerrie, and we work on dbt Labs’s training and services teams, respectively. By working closely with countless companies using dbt, we’ve gained unique perspectives of the landscape.

The training team collates the problems organizations are thinking about today and gauges how our solutions fit. These are shorter engagements, which means we see the data world shift and change in real time. Resident Architects spend much more time with teams to craft much more in-depth solutions, figure out where those solutions are helping, and where problems still need to be addressed. Trainers help identify patterns in the problems data teams face, and Resident Architects dive deep on solutions.

Today, we’ll guide you through a particularly thorny problem: testing.

Why testing?

Mariah Rogers broke early ground on data quality and testing in her Coalesce 2022 talk. We’ve seen similar talks again at Coalesce 2024, like this one from the data team at Aiven and this one from the co-founder at Omni Analytics. These talks share a common theme: testing your dbt project too much can get out of control quickly, leading to alert fatigue.

In our customer engagements, we see wildly different approaches to testing data. We’ve definitely seen what Mariah, the Aiven team, and the Omni team have described, which is so many tests that errors and alerts just become noise. We’ve also seen the opposite end of the spectrum—only primary keys being tested. From our field experiences, we believe there’s room for a middle path. A desire for a better approach to data quality and testing isn’t just anecdotal to Coalesce, or to dbt’s training and services. The dbt community has long called for a more intentional approach to data quality and testing - data quality is on the industry’s mind! In fact, 57% of respondents to dbt’s 2024 State of Analytics Engineering survey said that data quality is a predominant issue facing their day-to-day work.

What does d@tA qUaL1Ty even mean?!

High-quality data is trusted and used frequently. It doesn’t get argued over or endlessly scrutinized for matching to other data. Data testing should lead to higher data quality and insights, period.

Best practices in data quality are still nascent. That said, a lot of important baseline work has been done here. There are case studies on implementing dbt testing well. dbt Labs also has an Advanced Testing course, emphasizing that testing should spur action and be focused and informative enough to help address failures. You can even enforce testing best practices and dbt Labs’s own best practices using the dbt_meta_testing or dbt_project_evaluator packages and dbt Explorer’s Recommendations page.

The missing piece is still cohesion and guidance for everyday practitioners to help develop their testing framework.

To recap, we’re going to start with:

  • identifying data hygiene issues
  • identifying business-focused anomaly issues
  • identifying stats-focused anomaly issues

Next, we’ll prioritize. We will:

  • think through each concern in terms of the breadth of impact
  • decide if each concern should be at error or warning severity

Get a pen and paper (or a Google Doc) and join us in constructing your own testing framework.

Identifying data quality issues in your pipeline

Let’s start our framework by identifying types of data quality issues.

In our daily work with customers, we find that data quality issues tend to fall into one of three broad buckets: data hygiene, business-focused anomalies, and stats-focused anomalies. Read the bucket descriptions below, and list 2-3 data quality concerns in your own business context that fall into each bucket.

Bucket One: Data Hygiene

Data hygiene issues are concerns you address in your staging layer. Hygienic data meets your expectations around formatting, completeness, and granularity requirements. Here are a few examples.

  • Granularity: primary keys are unique and not null. Duplicates throw off calculations!
  • Completeness: columns that should always contain text, do. Incomplete data is less useful!
  • Formatting: email addresses always have a valid domain. Incorrect emails may affect marketing outreach!
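As a sketch, hygiene checks like these map naturally onto generic tests in a schema file. The model, column names, and regex below are hypothetical; the email check assumes you have the dbt-expectations package installed:

```yaml
# models/staging/_stg_customers.yml (hypothetical model and columns)
models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique      # granularity: one row per customer
          - not_null    # duplicates and nulls throw off calculations
      - name: customer_name
        tests:
          - not_null    # completeness: text column is always populated
      - name: email
        tests:
          # formatting: requires the dbt-expectations package
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
```

Each of these runs as part of `dbt test` (or `dbt build`) against your staging models.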

Bucket Two: Business-focused Anomalies

Business-focused anomalies catch unexpected behavior. You can flag unexpected behavior by clearly defining expected behavior. Business-focused anomalies are when aspects of the data differ from what you know to be typical in your business. You’ll know what’s typical either through your own analyses, your colleagues’ analyses, or things your stakeholder homies point out to you.

Since business-focused anomaly testing is set by a human, it will be fluid and need to be adjusted periodically. Here’s an example.

Imagine you’re a sales analyst. Generally, you know that if your daily sales amount goes up or down by more than 20% daily, that’s bad. Specifically, it’s usually a warning sign for fraud or the order management system (OMS) dropping orders. You set a test in dbt to fail if any given day’s sales amount is a delta of 20% from the previous day. This works for a while.

Then, you have a stretch of 3 months where your test fails 5 times a week! Every time you investigate, it turns out to be valid consumer behavior. You’re suddenly in hypergrowth, and sales are legitimately increasing that much.

Your 20%-change fraud and OMS failure detector is no longer valid. You need to investigate anew which sales spikes or drops indicate fraud or OMS problems. Once you figure out a new threshold, you’ll go back and adjust your testing criteria.

Although your data’s expected behavior will shift over time, you should still commit to defining business-focused anomalies to grow your understanding of what is normal for your data.
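One way to express a day-over-day threshold like the 20% sales check above is a singular test: a SQL file in your `tests/` directory that fails if it returns any rows. The model and column names here are hypothetical:

```sql
-- tests/assert_daily_sales_delta_within_20_pct.sql (hypothetical names)
-- Fails if any day's total sales differ from the prior day by more than 20%.
with daily as (
    select
        order_date,
        sum(sales_amount) as daily_sales
    from {{ ref('fct_orders') }}
    group by order_date
),

deltas as (
    select
        order_date,
        daily_sales,
        lag(daily_sales) over (order by order_date) as prev_sales
    from daily
)

select *
from deltas
where prev_sales is not null
  and abs(daily_sales - prev_sales) / prev_sales > 0.20
```

When your definition of “normal” shifts, the threshold lives in one obvious place to adjust.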

Here’s how to identify potential anomalies.

Start at your business intelligence (BI) layer. Pick 1-3 dashboards or tables that you know are used frequently. List these 1-3 dashboards or tables. For each dashboard or table you have, identify 1-3 “expected” behaviors that your end-users rely on. Here are a few examples to get you thinking:

  • Revenue numbers should not change by more than X% in Y amount of time. This could indicate fraud or OMS problems.
  • Monthly active users should not decline more than X% after the initial onboarding period. This might indicate user dissatisfaction, usability issues, or that users are not finding a feature valuable.
  • Exam passing rates should stay above Y%. A decline below that threshold may indicate recent content changes or technical issues are affecting understanding or accessibility.

You should also consider what data issues you have had in the past! Look through recent data incidents and pick out 3 or 4 to guard against next time. These might be in a #data-questions channel or perhaps a DM from a stakeholder.

Bucket 3: Stats-focused Anomalies

Stats-focused anomalies are fluctuations that go against your expected volumes or metrics. Some examples include:

  • Volume anomalies. This could be site traffic amounts that may indicate illicit behavior, or perhaps site traffic dropping one day then doubling the next, indicating that data were not loaded properly.
  • Dimensional anomalies, like too many product types underneath a particular product line that may indicate incorrect barcodes.
  • Column anomalies, like sale values more than a certain number of standard deviations from a mean, that may indicate improper discounting.

Overall, stats-focused anomalies can indicate system flaws, illicit site behavior, or fraud, depending on your industry. They also tend to require more advanced testing practices than we are covering in this blog. We feel stats-based anomalies are worth exploring once you have a good handle on your data hygiene and business-focused anomalies. We won’t give recommendations on stats-focused anomalies in this post.

How to prioritize data quality concerns in your pipeline

Now, you have a written and categorized list of data hygiene concerns and business-focused anomalies to guard against. It’s time to prioritize which quality issues deserve to fail your pipelines.

To prioritize your data quality concerns, think about real-life impact. A couple of guiding questions to consider are:

  • Are your numbers customer-facing? For example, maybe you work with temperature-tracking devices. Your customers rely on these devices to show them average temperatures on perishable goods like strawberries in transit. What happens if the temperature of the strawberries reads as 300°C when they know their refrigerated truck was working just fine? How is your brand perception impacted when the numbers are wrong?
  • Are your numbers used to make financial decisions? For example, is the marketing team relying on your numbers to choose how to spend campaign funds?
  • Are your numbers executive-facing? Will executives use these numbers to reallocate funds or shift priorities?

We think these 3 categories above constitute high-impact, pipeline-failing events. We think these categories should be your top priorities. Of course, adjust priority order if your business context calls for it.

Consult your list of data quality issues in the categories we mention above. Decide and mark whether any are customer-facing, used for financial decisions, or executive-facing. Mark any data quality issues in those categories as “error”. These are your pipeline-failing events.

If any data quality concerns fall outside of these 3 categories, we classify them as nice-to-knows. Nice-to-know data quality testing can be helpful. But if you don’t have a specific action you can immediately take when a nice-to-know quality test fails, the test should be a warning, not an error.

You could also remove nice-to-know tests altogether. Data testing should drive action. The more alerts you have in your pipeline, the less action you will take. Configure alerts with care!

However, we do think nice-to-know tests are worth keeping if and only if you are gathering evidence for action you plan to take within the next 6 months, like product feature research. In a scenario like that, those tests should still be set to warning.
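In dbt, this prioritization translates directly into the `severity` config on each test. A minimal sketch, with hypothetical model and column names:

```yaml
# Hypothetical schema file illustrating error vs. warning severity
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                severity: error   # customer-facing concern: fail the pipeline
      - name: coupon_code
        tests:
          - not_null:
              config:
                severity: warn    # nice-to-know: surface it without failing
```

Error-severity failures stop downstream work; warnings are reported but let the run continue.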

Start your action plan

Now, your data quality concerns are listed and prioritized. Next, add 1 or 2 initial debugging steps you will take if/when the issues surface. These steps should get added to your framework document. Additionally, consider adding them to a test’s description.

This step is important. Data quality testing should spur action, not accumulate alerts. Listing initial debugging steps for each concern will refine your list to the most critical elements.

If you can't identify an action step for any quality issue, remove it. Put it on a backlog and research what you can do when it surfaces later.

Here are a few examples from our list of unexpected behaviors above.

  • For calculated field X, a value above Y or below Z is not possible.
    • Debugging initial steps
      • Use dbt test SQL or recent test results in dbt Explorer to find problematic rows
      • Check these rows in staging and first transformed model
      • Pinpoint where unusual values first appear
  • Revenue shouldn’t change by more than X% in Y amount of time.
    • Debugging initial steps:
      • Check recent revenue values in staging model
      • Identify transactions near min/max values
      • Discuss outliers with sales ops team
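One way to keep those debugging steps close to the test itself is the test’s `meta` config (recent dbt versions also support a `description` on data tests). The model, test, and step text below are hypothetical, and the range test assumes the dbt_utils package:

```yaml
# Hypothetical example: attaching initial debugging steps to a test
models:
  - name: fct_revenue
    columns:
      - name: revenue_amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              config:
                severity: error
                meta:
                  debugging_steps: >
                    1. Check recent revenue values in the staging model.
                    2. Identify transactions near the min/max values.
                    3. Discuss outliers with the sales ops team.
```

Whoever picks up the failing test then sees where to start without hunting through a separate document.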

You now have written out a prioritized list of data quality concerns, as well as action steps to take when each concern surfaces. Next, consult hub.getdbt.com and find tests that address each of your highest priority concerns. dbt-expectations and dbt_utils are great places to start.

The data tests you’ve marked as “errors” above should get error-level severity. Any concerns falling into that nice-to-know category should either not get tested or have their tests set to warning.

Your data quality priorities list is a living reference document. We recommend linking it in your project’s README so that you can go back and edit it as your testing needs evolve. Additionally, developers in your project should have easy access to this document. Maintaining good data quality is everyone’s responsibility!

As you try these ideas out, come to the dbt community Slack and let us know what works and what doesn’t. Data is a community of practice, and we are eager to hear what comes out of yours.

How Hybrid Mesh unlocks dbt collaboration at scale

· 7 min read
Jason Ganz

One of the most important things that dbt does is unlock the ability for teams to collaborate on creating and disseminating organizational knowledge.

In the past, this primarily looked like a team working in one dbt Project to create a set of transformed objects in their data platform.

As dbt was adopted by larger organizations and began to drive workloads at a global scale, it became clear that we needed mechanisms to allow teams to operate independently from each other, creating and sharing data models across teams — dbt Mesh.

How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers

· 10 min read

The dbt Semantic Layer is founded on the idea that data transformation should be both flexible (allowing for on-the-fly aggregations grouped and filtered by definable dimensions) and version-controlled and tested. Like any other codebase, you should have confidence that your transformations express your organization’s business logic correctly. Historically, you had to choose between these options, but the dbt Semantic Layer brings them together. This has required new paradigms for how you express your transformations, though.

Putting Your DAG on the internet

· 5 min read
Ernesto Ongaro
Sebastian Stan
Filip Byrén

New in dbt: allow Snowflake Python models to access the internet

With dbt 1.8, dbt released support for Snowflake’s external access integrations, further enabling the use of dbt + AI to enrich your data. This allows querying of external APIs within dbt Python models, functionality that was requested by dbt Cloud customer EQT AB. Learn about why they needed it and how they helped build the feature and get it shipped!

Unit testing in dbt for test-driven development

· 9 min read
Doug Beatty

Do you ever have "bad data" dreams? Or am I the only one that has recurring nightmares? 😱

Here's the one I had last night:

It began with a midnight bug hunt. A menacing insect creature had locked my colleagues in a dungeon, and they were pleading for my help to escape. Finding the key was elusive and always seemed just beyond my grasp. The stress was palpable, a physical weight on my chest, as I raced against time to unlock them.

Of course I wake up without actually having saved them, but I am relieved nonetheless. And I've had similar nightmares involving a heroic code refactor or the launch of a new model or feature.

Good news: beginning in dbt v1.8, we're introducing a first-class unit testing framework that can handle each of the scenarios from my data nightmares.

Before we dive into the details, let's take a quick look at how we got here.

Maximum override: Configuring unique connections in dbt Cloud

· 6 min read

dbt Cloud now includes a suite of new features that enable configuring precise and unique connections to data platforms at the environment and user level. These enable more sophisticated setups, like connecting a project to multiple warehouse accounts, first-class support for staging environments, and user-level overrides for specific dbt versions. This gives dbt Cloud developers the features they need to tackle more complex tasks, like Write-Audit-Publish (WAP) workflows and safely testing dbt version upgrades. While you still configure a default connection at the project level and per-developer, you now have tools to get more advanced in a secure way. Soon, dbt Cloud will take this even further by letting connections be set once and reused everywhere with global connections.

LLM-powered Analytics Engineering: How we're using AI inside of our dbt project, today, with no new tools.

· 10 min read
Joel Labes

Cloud Data Platforms make new things possible; dbt helps you put them into production

The original paradigm shift that enabled dbt to exist and be useful was databases going to the cloud.

All of a sudden it was possible for more people to do better data work as huge blockers became huge opportunities:

  • We could now dynamically scale compute on-demand, without upgrading to a larger on-prem database.
  • We could now store and query enormous datasets like clickstream data, without pre-aggregating and transforming it.

Today, the next wave of innovation is happening in AI and LLMs, and it's coming to the cloud data platforms dbt practitioners are already using every day. For one example, Snowflake has just released its Cortex functions to access LLM-powered tools tuned for running common tasks against your existing datasets. This opens up a new set of opportunities for us:

Column-Level Lineage, Model Performance, and Recommendations: ship trusted data products with dbt Explorer

· 9 min read
Dave Connors

What’s in a data platform?

Raising a dbt project is hard work. We, as data professionals, have poured ourselves into raising happy healthy data products, and we should be proud of the insights they’ve driven. It certainly wasn’t without its challenges though — we remember the terrible twos, where we worked hard to just get the platform to walk straight. We remember the angsty teenage years where tests kept failing, seemingly just to spite us. A lot of blood, sweat, and tears are shed in the service of clean data!

Once the project could dress and feed itself, we also worked hard to get buy-in from our colleagues who put their trust in our little project. Without deep trust and understanding of what we built, our colleagues who depend on our data (or even those involved in developing it with us — it takes a village after all!) are more likely to be in our DMs with questions than in their BI tools, generating insights.

When our teammates ask about where the data in their reports come from, how fresh it is, or about the right calculation for a metric, what a joy! This means they want to put what we’ve built to good use — the challenge is that, historically, it hasn’t been all that easy to answer these questions well. That has often meant a manual, painstaking process of cross checking run logs and your dbt documentation site to get the stakeholder the information they need.

Enter dbt Explorer! dbt Explorer centralizes documentation, lineage, and execution metadata to reduce the work required to ship trusted data products faster.

More time coding, less time waiting: Mastering defer in dbt

· 9 min read
Dave Connors

Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way — a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at your lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.

“OK,” you think further, “I’ll just run a dbt build -s +my_changed_model to make sure I have everything I need built into my dev schema and I can test my changes”. You run the command. You wait. You wait some more. You get some coffee, and completely take yourself out of your dbt development flow state. A lot of time and money down the drain to get to a point where you can start your work. That’s no good!

Luckily, dbt’s defer functionality allows you to build only what you care about when you need it, and nothing more. This feature helps developers spend less time and money in development, helping ship trusted data products faster. dbt Cloud offers native support for this workflow in development, so you can start deferring without any additional overhead!

To defer or to clone, that is the question

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?

Optimizing Materialized Views with dbt

· 11 min read
Amy Chen
note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, JetBlue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry, the cats have multiplied as well). JetBlue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

Create dbt Documentation and Tests 10x faster with ChatGPT

· 8 min read
Pedro Brito de Sa

Whether you are creating your pipelines in dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague then? Because it’s a hassle having to write down each individual field, its description in layman’s terms, and figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.

Data Vault 2.0 with dbt Cloud

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system, rigorously detailed in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

Building a historical user segmentation model with dbt

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. The users will create tickets to ask for assistance, and the CX team will start attending them in the order that they are created. This is a good first approach, but not a data driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.

Modeling ragged time-varying hierarchies

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

How we reduced a 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx may struggle to cope with increasing complexity in an organization’s data pipeline, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. In this case, the transition also involved migrating from on-premises SQL Server to Snowflake cloud computing. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client's 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

Building a Kimball dimensional model with dbt

· 20 min read
Jonathan Neo

Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.

Data modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling techniques for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model

dbt Squared: Leveraging dbt Core and dbt Cloud together at scale

· 12 min read
João Antunes
Yannick Misteli
Sean McIntyre

Teams thrive when each team member is provided with the tools that best complement and enhance their skills. You wouldn’t hand Cristiano Ronaldo a tennis racket and expect a perfect serve! At Roche, getting the right tools in the hands of our teammates was critical to our ability to grow our data team from 10 core engineers to over 100 contributors in just two years. We embraced both dbt Core and dbt Cloud at Roche (a dbt-squared solution, if you will!) to quickly scale our data platform.

The missing guide to debug() in dbt

· 7 min read
Benoit Perigaud

Editor's note: this post assumes intermediate knowledge of Jinja and macro development in dbt. For an introduction to Jinja in dbt, check out the documentation and the free self-serve course on Jinja, Macros, Packages.

Jinja brings a lot of power to dbt, allowing us to use ref(), source(), conditional code, and macros. But while Jinja brings flexibility, it also brings complexity, and as so often with code, things can run in unexpected ways.

The debug() macro in dbt is a great tool to have in the toolkit for someone writing a lot of Jinja code, but it might be difficult to understand how to use it and what benefits it brings.

Let’s dive into the last time I used debug() and how it helped me solve bugs in my code.

Audit_helper in dbt: Bringing data auditing to a higher level

· 15 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Auditing tables is a major part of analytics engineers’ daily tasks, especially when refactoring tables that were built using SQL Stored Procedures or Alteryx Workflows. In this article, we present how the audit_helper package can (as the name suggests) help the table auditing process to make sure a refactored model provides (pretty much) the same output as the original one, based on our experience using this package to support our clients at Indicium Tech®.