"Your Data Is Dirty. Here's What to Do About It."


Last weekend I was pressure washing my driveway.

I remember thinking, “This will be quick! A couple hours and I’ll be done. Then I can enjoy the rarity of sunshine in Seattle in the afternoon.” I pulled out the pressure washer and hooked up the hose and the power cord. I had a great podcast in my AirPods and got to work.

My podcast wrapped up 75 minutes later and I looked around, astonished I had only made it through 25% of my driveway. That’s when I realized the driveway was far worse than I thought. This was going to take all day. Maybe multiple days. Multiple weekends lost to the joys of homeownership and a clean driveway.

That’s exactly how data quality problems work in enterprise applications. Most teams aren’t ignoring their data. They just don’t have a clear picture of how bad it actually is, who owns it, or what to fix first. Or worse yet, it is not a priority, at least not right now. So it sits. And it compounds.

After years managing data platforms at scale, here’s what I’ve learned about treating data quality as an operational discipline rather than a one-time cleanup project.


Step One: You Have to Actually Find the Problems

Data quality problems are rarely obvious from the outside. They surface in the worst possible moments, like when a campaign goes out to the wrong audience, when a real-time system makes a decision on stale records, or when a regulatory audit asks you to account for data you can’t trace.

Discovery is the unglamorous first step that teams consistently skip because they want to jump to fixing things. Before you can remediate anything, you need to know what’s broken, how it broke in the first place, how broadly it’s broken, and where it lives.

This means profiling your data: running null checks, format validations, duplicate analysis, referential integrity checks, and volume comparisons across your datasets. It sounds tedious because it is. But it’s also clarifying. At Starbucks, a profiling exercise on our Customer Data Platform surfaced 14 million customer accounts with data integrity issues. Nobody had put a number to it before. Once you have a number, you can have a real conversation with leadership.
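
To make those checks concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the sample rows, the field names, the set of valid loyalty IDs, and the deliberately loose email pattern are assumptions for illustration. Real profiling would run against your warehouse or a profiling tool rather than an in-memory list.

```python
import re
from collections import Counter

# Hypothetical sample rows; field names are illustrative only.
customers = [
    {"id": 1, "email": "ana@example.com", "loyalty_id": "L-100"},
    {"id": 2, "email": None, "loyalty_id": "L-101"},
    {"id": 3, "email": "not-an-email", "loyalty_id": "L-999"},
    {"id": 4, "email": "ana@example.com", "loyalty_id": "L-100"},
]
# Assumed keys of the parent table that loyalty_id should point to.
loyalty_accounts = {"L-100", "L-101"}

# Loose format check: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile(rows, valid_loyalty_ids):
    """Count how many rows fail each basic profiling check."""
    emails = [r["email"] for r in rows]
    email_counts = Counter(e for e in emails if e is not None)
    return {
        "total_rows": len(rows),
        # Null check
        "null_email": sum(e is None for e in emails),
        # Format validation (nulls are counted separately above)
        "bad_email_format": sum(
            e is not None and not EMAIL_RE.match(e) for e in emails
        ),
        # Duplicate analysis: rows sharing an email with another row
        "duplicate_email_rows": sum(c for c in email_counts.values() if c > 1),
        # Referential integrity: loyalty_id with no matching parent
        "orphan_loyalty_id": sum(
            r["loyalty_id"] not in valid_loyalty_ids for r in rows
        ),
    }


report = profile(customers, loyalty_accounts)
print(report)
```

The output is just a dictionary of counts, which is the point: it turns “our data feels off” into numbers you can put in front of leadership.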

Discovery answers three simple questions: how big is the problem, what do we actually have, and how much of it is wrong?


Step Two: Build Visibility Into the Pipeline

Finding problems once is not enough. You need to be able to see them continuously. That’s where observability comes in.

Data observability is the practice of monitoring your data in motion, not just at a point in time. Think of it like application performance monitoring, but applied to your data pipelines. Are records arriving on schedule? Are schemas drifting between environments? Are volumes outside normal ranges? Is an upstream system suddenly producing nulls it wasn’t producing last week?
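
Those four questions map directly onto checks you can run against every batch. The sketch below is one way to express them, assuming a hypothetical batch summary and a hand-written baseline. The dictionary keys, thresholds, and alert strings are invented for illustration and do not come from any particular observability product.

```python
from datetime import datetime, timedelta, timezone


def check_batch(batch, baseline, now):
    """Return a list of alert strings for one pipeline batch."""
    alerts = []
    # Are records arriving on schedule?
    if now - batch["arrived_at"] > baseline["max_delay"]:
        alerts.append("late arrival")
    # Has the schema drifted from what we expect?
    if set(batch["columns"]) != set(baseline["columns"]):
        alerts.append("schema drift")
    # Is the volume outside the normal range?
    low, high = baseline["row_range"]
    if not low <= batch["row_count"] <= high:
        alerts.append("volume out of range")
    # Is a column suddenly producing more nulls than allowed?
    for col, rate in batch["null_rates"].items():
        allowed = baseline["max_null_rate"].get(col, 0.0)
        if rate > allowed:
            alerts.append(f"null spike in {col}")
    return alerts


# Hypothetical baseline and batch summary.
baseline = {
    "max_delay": timedelta(hours=1),
    "columns": ["id", "email", "segment"],
    "row_range": (900, 1100),
    "max_null_rate": {"email": 0.05},
}
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
batch = {
    "arrived_at": now - timedelta(hours=3),
    "columns": ["id", "email"],  # "segment" went missing upstream
    "row_count": 1000,
    "null_rates": {"email": 0.40},
}

alerts = check_batch(batch, baseline, now)
print(alerts)
```

In practice the baseline would be learned from history rather than typed in by hand, and the alerts would go to whatever paging or chat tool your team already watches.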

Without observability, you’re flying blind. A data quality issue can persist for days or weeks before a downstream system fails or a business user notices something is off. By then, the problem has already affected decisions, campaigns, or customers and the root cause is harder to isolate.

Consider what happens without it:

  • A retail organization launches a personalization campaign using customer segment data that was quietly corrupted three weeks earlier by an upstream schema change. Nobody knew because nobody was watching.
  • A financial services firm discovers during a quarterly audit that a core customer attribute has been null for 40% of records for the past two months. The data team had no alerting on that field.
  • A healthcare company’s reporting layer starts producing incorrect patient counts. The issue traces back to a pipeline that stopped deduplicating records after a routine system update. No monitoring, no alert, no catch.

Good observability means your team knows about a data quality issue before your stakeholders do.


Step Three: Know Where Your Data Has Been

Observability tells you what is happening now. Traceability tells you why it happened and how it got there.

Data lineage is the ability to follow a record from its origin through every transformation, integration, and landing point it touches along the way. When something is wrong, you need to answer three questions quickly: Where did this come from? What touched it along the way? Where did it go after?

Without lineage, every data quality investigation becomes an archaeological dig. You’re interviewing engineers, reading old pipeline code, and reverse-engineering integrations that were built years ago by people who no longer work there. With lineage, you can isolate the problem source and scope the downstream impact in a fraction of the time.
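
At its simplest, lineage is a graph you can walk in both directions. Here is a small sketch, assuming a hypothetical map of which dataset feeds which. The dataset names are made up, and a real lineage tool would capture these edges from your pipelines instead of a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it feeds.
FEEDS = {
    "pos_orders": ["orders_clean"],
    "crm_export": ["customers_clean"],
    "orders_clean": ["customer_360"],
    "customers_clean": ["customer_360"],
    "customer_360": ["campaign_segments", "loyalty_report"],
}


def downstream(start, edges):
    """Breadth-first walk: every dataset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip the edges so the same walk can run toward the sources."""
    reversed_edges = {}
    for src, targets in edges.items():
        for dst in targets:
            reversed_edges.setdefault(dst, []).append(src)
    return reversed_edges


def upstream(start, edges):
    """Every dataset that start depends on, directly or indirectly."""
    return downstream(start, invert(edges))


# Where did it go after? (impact of a bad CRM export)
impacted = downstream("crm_export", FEEDS)
# Where did this come from? (possible sources of bad campaign segments)
sources = upstream("campaign_segments", FEEDS)
print(sorted(impacted))
print(sorted(sources))
```

The first call scopes the downstream impact, and the second narrows the list of places to look for the root cause.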

Traceability also matters beyond quality. If you’re handling PII, regulations like HIPAA, GDPR, and CCPA expect you to explain how customer data flows through your systems, who can access it, and where it ends up. That’s not optional, so “we’re working on it” is not an acceptable answer during an audit.


Step Four: Triage Before You Start Fixing

Before your team touches a single record, you need to answer one question: what do we fix first?

Data problems are rarely finite. There will always be more issues than capacity to address them. Teams that make real progress are the ones that prioritize by business impact, not record count.

A useful triage framework asks three questions for each identified issue:

What downstream systems or decisions does this data support? Bad data in a field that feeds real-time loyalty transactions is a different priority than bad data in a field used for quarterly reporting. Know the blast radius.

What is the cost of inaction? Some data problems grow. Duplicates compound. Corrupted records propagate downstream. Others are stable and bounded. Prioritize problems that are actively spreading or that sit in high-traffic data paths.

Is this a symptom or a root cause? This is the question most teams skip. Cleaning up bad records without addressing what created them is a temporary fix. If bad data is consistently entering your system from a specific upstream source, cleaning the output is less valuable than fixing the input.
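
One way to keep those three questions honest is to turn them into a rough score. The sketch below is only an illustration: the `Issue` fields, the weights, and the example issues are all assumptions I made up, and any real scoring should be tuned with the business rather than copied from here.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    blast_radius: int    # 1 = one quarterly report, 5 = real-time customer path
    spreading: bool      # True if the problem grows when left alone
    is_root_cause: bool  # True if the fix stops new bad records at the source


def triage_score(issue):
    """Arbitrary illustrative weights: impact first, then growth, then root cause."""
    score = issue.blast_radius * 2
    if issue.spreading:
        score += 3
    if issue.is_root_cause:
        score += 2
    return score


# Hypothetical backlog.
backlog = [
    Issue("abandoned accounts", blast_radius=1, spreading=False, is_root_cause=False),
    Issue("broken active accounts", blast_radius=5, spreading=True, is_root_cause=True),
    Issue("duplicate emails in quarterly report", blast_radius=2, spreading=True, is_root_cause=False),
]

ranked = sorted(backlog, key=triage_score, reverse=True)
for issue in ranked:
    print(triage_score(issue), issue.name)
```

Notice that record count does not appear anywhere in the score. That is deliberate: the ranking follows business impact, not volume.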

Answering these questions before you start keeps your team focused on the work that actually matters. At Starbucks, fixing broken accounts took priority over eliminating abandoned ones. A broken account meant something was blocking it from working correctly while a customer was still attached to it. Abandoned accounts were not hurting sales because no active customers were tied to them. That distinction drove our prioritization and kept the team working on what the business needed first.


Step Five: Run the Cleanup

The cleanup itself taught me something too.

As I worked through pressure washing the driveway, I found there were several reasons for the poor state it was in. We live in the forest, and the forest has a way of reclaiming man-made improvements, so I added a soap solution to remove the film. Where we parked and washed our cars had more dirt and debris than other areas, so I let those sections dry in the sunshine and used a blower to clear loose material first. The rest was mostly surface dirt that came right off with a concrete cleanser. Finally, I sprayed a sealer on the driveway to make future cleanings easier.

The same is true for data remediation. Different causes call for different treatments, and the right tooling changes the outcome.

Effective data remediation follows a few consistent patterns regardless of scale:

Scope it clearly. “Fix our customer data” is not a project. Define which records, which fields, and which systems are in scope before any work begins. Vague cleanup efforts produce vague results.

Automate where the problem is repeating. Manual data correction at scale is slow, expensive, and error-prone. If the same type of bad record keeps appearing, build the correction logic into the pipeline so it applies to future records automatically.

Define done before you start. Establish what “clean” looks like with specific, measurable criteria before remediation begins. Then validate against those criteria when the work is complete. Without a definition of done, cleanup efforts drift and never officially finish.

Communicate the outcome. Stakeholders should know what changed, when it changed, and what it means for them in terms of business value or impact. A significant cleanup effort that nobody knows about is a missed opportunity to build trust in your platform. Make it visible.
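
Two of those patterns, automating the repeating fix and defining done up front, lend themselves to a short sketch. The example below assumes a hypothetical email-normalization rule and made-up thresholds. The names `normalize_email`, `DONE_CRITERIA`, and the sample rows are illustrative, not a prescribed standard.

```python
def normalize_email(value):
    """Correction rule: trim whitespace, lowercase, empty string becomes None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def remediate(rows):
    """Apply the correction to every in-scope row without mutating the input."""
    return [{**r, "email": normalize_email(r["email"])} for r in rows]


def measure(rows):
    """Compute the metrics the definition of done is written against."""
    emails = [r["email"] for r in rows]
    return {
        "non_canonical": sum(e != normalize_email(e) for e in emails),
        "null_rate": emails.count(None) / len(emails),
    }


def is_done(metrics, criteria):
    """Done means every metric is at or below its agreed limit."""
    return all(metrics[name] <= limit for name, limit in criteria.items())


# Agreed before the work starts (hypothetical limits).
DONE_CRITERIA = {"non_canonical": 0, "null_rate": 0.25}

# Hypothetical in-scope records.
in_scope = [
    {"id": 1, "email": " Ana@Example.com "},
    {"id": 2, "email": ""},
    {"id": 3, "email": "bo@example.com"},
    {"id": 4, "email": "CY@example.com"},
]

before = measure(in_scope)
after = measure(remediate(in_scope))
print(before, is_done(before, DONE_CRITERIA))
print(after, is_done(after, DONE_CRITERIA))
```

The same `normalize_email` function can sit in the ingestion path so future records get the fix automatically, and the before and after metrics double as the numbers you share with stakeholders.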


Step Six: Governance Keeps It Clean

Cleanup is reactive. Governance is what makes the cleanup stick.

If you run a major remediation effort without changing the upstream rules, standards, or processes that created the problem in the first place, you will be back in the same position in 18 months. This is the most common failure mode in data quality programs. Teams treat quality as a project with a finish line rather than an ongoing practice with standards.

The hardest part of governance isn’t the technical controls. It’s the organizational ones. Getting the teams that produce data to care as much about quality as the teams that depend on it is a long conversation. In my experience, getting the business to prioritize that conversation is the step that stalls most governance programs before they get started.

Data governance means establishing the controls that prevent bad data from entering your systems at scale:

  • Data contracts between producing and consuming teams define what each side is responsible for and what the expected format, frequency, and quality standards are for every data exchange.
  • Schema validation at ingestion catches malformed records before they propagate downstream rather than after.
  • Defined ownership means that when a field is wrong, someone is accountable for fixing it, not just impacted by it. Ownership without accountability is just a label.
  • Standards documentation that is actually maintained keeps teams from making well-intentioned decisions that introduce new quality problems.
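
The first two controls can be sketched in a few lines. The example below treats a data contract as a plain mapping of field names to expected types and quarantines anything that violates it at ingestion. The contract fields and sample records are hypothetical, and a production setup would more likely use a schema registry or a validation library than a hand-rolled check like this one.

```python
# Hypothetical contract agreed between the producing and consuming teams.
CONTRACT = {
    "customer_id": int,
    "email": str,
    "signup_date": str,
}


def validate(record, contract):
    """Return a list of contract violations for one record (empty if clean)."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the contract does not know about are flagged too.
    for field in sorted(record.keys() - contract.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


def ingest(records, contract):
    """Split incoming records into accepted and quarantined (with reasons)."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record, contract)
        if errors:
            quarantined.append((record, errors))
        else:
            accepted.append(record)
    return accepted, quarantined


incoming = [
    {"customer_id": 1, "email": "a@example.com", "signup_date": "2024-01-05"},
    {"customer_id": "2", "email": "b@example.com"},
]
accepted, quarantined = ingest(incoming, CONTRACT)
print(len(accepted), "accepted")
for record, errors in quarantined:
    print("quarantined:", errors)
```

Quarantining with a reason attached, rather than silently dropping the record, gives the owning team something specific to fix.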

None of these controls hold without buy-in from the teams that produce data, not just the teams that consume it. That conversation is harder than the technical work, but it determines whether your data quality improves over time and, once improved, stays clean.


Data Quality Is a Practice, Not a Project

Today, the driveway looks great. But I know the forest will try to take it back. Moss will grow. Leaves will fall. We’ll add our own dirt to it. That means this wasn’t a one-time event, and I will have to go through this again. To keep it looking good I’ll have to build a habit of cleaning and sealing it every few months. Otherwise the forest wins, and it might be easier to cut my losses and start over with a new driveway at a much higher expense. That’s why regular, routine maintenance matters.

The organizations that get this right are not the ones that ran the biggest cleanup project. They’re the ones that built the discipline to find problems early, watch their data continuously, trace issues to their source, prioritize by business impact, remediate systematically, and govern upstream to prevent recurrence.

Find it. Watch it. Trace it. Triage it. Fix it. Govern it.

That’s not a project plan. It’s a practice.


How does your organization approach data quality: reactive cleanup or ongoing discipline? Connect with me on LinkedIn or Substack to continue the conversation.