Finding Patterns in Missing Data
You open a dataset and notice gaps. Some cells are blank, some columns are mostly empty, and the row count doesn't match what you expected. The question isn't just "is data missing?" - it's "why is it missing, and does the pattern tell me something?"
Random NULLs scattered across a dataset are annoying but manageable. Systematic gaps - where every row from a particular source or time period is missing the same fields - are a signal. Maybe a form field was added in March, so everything before that is blank. Maybe one regional office never fills in the "phone" column. Maybe an API started returning a new field halfway through the year.
The difference matters. Random gaps just need filling or filtering. Systematic gaps mean your data has structural problems that affect any analysis you do with it. Here's how to figure out which you're dealing with.
Step 1: See which columns have the most NULLs
Open your file in ExploreMyData and expand the Column Explorer panel. Each column shows a quick summary including the count of NULL values and the percentage of the column that's empty.
Scan through the columns. A dataset with 10,000 rows where "email" has 47 NULLs is probably fine - some people didn't provide an email. But if "phone" has 3,200 NULLs, that's a third of your data. Worth investigating.
| Column | Type | Nulls | % Null |
|---|---|---|---|
| customer_id | BIGINT | 0 | 0% |
| name | VARCHAR | 0 | 0% |
| email | VARCHAR | 47 | 0.5% |
| phone | VARCHAR | 3,200 | 32% |
| signup_date | DATE | 0 | 0% |
| source | VARCHAR | 12 | 0.1% |
At a glance, phone stands out with 3,200 nulls (32%). All other columns are nearly complete.
Make a mental list of the columns with the highest NULL rates. Those are the ones you'll investigate next.
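If you want the same per-column NULL tally outside the UI, it's a few lines of Python. This is a minimal sketch with made-up sample rows standing in for the customer dataset above; `None` plays the role of NULL:

```python
# Hypothetical sample rows standing in for the customer dataset above.
rows = [
    {"customer_id": 1001, "name": "Alice Chen", "email": "alice@example.com", "phone": "555-0142"},
    {"customer_id": 1002, "name": "Bob Lund",   "email": "bob@example.com",   "phone": None},
    {"customer_id": 1003, "name": "Priya Shah", "email": None,                "phone": None},
]

def null_summary(rows):
    """Count Nones per column, like the Column Explorer summary."""
    total = len(rows)
    summary = {}
    for col in rows[0]:
        nulls = sum(1 for r in rows if r[col] is None)
        summary[col] = (nulls, 100 * nulls / total)
    return summary

# Print columns with the most NULLs first.
for col, (nulls, pct) in sorted(null_summary(rows).items(), key=lambda kv: -kv[1][0]):
    print(f"{col}: {nulls} nulls ({pct:.1f}%)")
```

The sort puts the worst columns at the top, which is exactly the scan you do by eye in the panel.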
Step 2: Filter to just the rows with missing data
Pick the column with the most NULLs. Open the Filter operation and set the condition to phone IS NULL. Now you're looking at only the rows where that field is empty.
Scroll through the results. Do you notice anything? Are the dates clustered in a particular range? Do the rows share a common value in another column - the same region, source system, or account type?
This is the eyeball test. It's not rigorous, but it's often enough to spot obvious patterns.
If 3,200 rows with missing phone numbers all have source = "web_import", you've found your answer - the web import doesn't collect phone numbers.
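The eyeball test translates directly to code: filter to the rows where the field is `None`, then tally another column over just that subset. A small sketch with illustrative rows and column names:

```python
from collections import Counter

# Illustrative rows; "source" is the column we suspect explains the gaps.
rows = [
    {"customer_id": 1, "phone": None,       "source": "web_import"},
    {"customer_id": 2, "phone": None,       "source": "web_import"},
    {"customer_id": 3, "phone": "555-0107", "source": "crm_sync"},
    {"customer_id": 4, "phone": None,       "source": "sales_team"},
]

# Equivalent of filtering with phone IS NULL.
missing_phone = [r for r in rows if r["phone"] is None]

# Tally what the filtered rows have in common.
by_source = Counter(r["source"] for r in missing_phone)
print(by_source.most_common())  # [('web_import', 2), ('sales_team', 1)]
```

If one value dominates the tally, you've found the pattern without scrolling.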
Step 3: Create a "has missing data" flag
For a more systematic view, use Add Column to create a flag that marks whether each row has missing data in the columns you care about.
Set the expression to something like:
CASE WHEN phone IS NULL OR email IS NULL THEN 'missing' ELSE 'complete' END
Name the new column data_completeness.
Now every row is tagged, and you can use this flag in further analysis.
| customer_id | name | email | phone | data_completeness |
|---|---|---|---|---|
| 1001 | Alice Chen | alice@example.com | 555-0142 | complete |
| 1002 | Bob Lund | bob@example.com | NULL | missing |
| 1003 | Priya Shah | NULL | NULL | missing |
| 1004 | Tom Reed | tom@example.com | 555-0299 | complete |
New data_completeness column added with the CASE WHEN expression. Each row is now tagged, ready for grouping.
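The same CASE WHEN expression works anywhere SQL runs, so you can reproduce the flag outside the tool. Here's a sketch using Python's built-in sqlite3 module and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id, name, email, phone)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1001, "Alice Chen", "alice@example.com", "555-0142"),
        (1002, "Bob Lund",   "bob@example.com",   None),
        (1003, "Priya Shah", None,                None),
    ],
)

# Tag every row with the same expression used in Add Column.
tagged = conn.execute("""
    SELECT customer_id,
           CASE WHEN phone IS NULL OR email IS NULL
                THEN 'missing' ELSE 'complete'
           END AS data_completeness
    FROM customers
    ORDER BY customer_id
""").fetchall()
print(tagged)  # [(1001, 'complete'), (1002, 'missing'), (1003, 'missing')]
```

Note the OR: a row counts as missing if any of the columns you care about is NULL.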
Step 4: Pivot to see the pattern
This is where it gets interesting. Use the Pivot operation to count missing vs. complete rows broken down by another column.
For example, pivot with:
- Rows: source (or region, or signup_month)
- Columns: data_completeness
- Values: COUNT
The result is a table showing how many complete vs. missing rows exist for each source. If one source has 3,100 of your 3,200 missing rows, that's your pattern.
| source | complete | missing |
|---|---|---|
| crm_sync | 4,821 | 63 |
| sales_team | 1,903 | 37 |
| web_import | 100 | 3,100 |
The pattern is clear: 3,100 of the 3,200 missing phone rows come from web_import. The web import flow simply doesn't collect phone numbers.
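Under the hood, this pivot is just a conditional count grouped by the breakdown column. In SQL terms, sketched with sqlite3 and toy data (column names match the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (phone, source)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("555-0142", "crm_sync"),
        (None,       "web_import"),
        (None,       "web_import"),
        ("555-0299", "sales_team"),
        (None,       "sales_team"),
    ],
)

# Rows: source; Columns: complete/missing; Values: COUNT.
pivot = conn.execute("""
    SELECT source,
           SUM(CASE WHEN phone IS NOT NULL THEN 1 ELSE 0 END) AS complete,
           SUM(CASE WHEN phone IS NULL     THEN 1 ELSE 0 END) AS missing
    FROM customers
    GROUP BY source
    ORDER BY missing DESC
""").fetchall()
print(pivot)  # [('web_import', 0, 2), ('sales_team', 1, 1), ('crm_sync', 1, 0)]
```

Swapping `source` for a month or category column in the GROUP BY gives you the other breakdowns mentioned below.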
Try different breakdowns. Pivot by month to see if the gaps are time-based. Pivot by category to see if certain product lines have worse data quality. Each pivot takes a few seconds and gives you a different angle.
What to do once you find the pattern
Once you know why data is missing, you can decide what to do about it:
- Source-based gaps: The data was never collected. You can filter out those rows, fill with defaults, or fix the upstream system.
- Time-based gaps: A field was added later. Everything before the cutoff date is legitimately empty. Acknowledge this in your analysis.
- Category-based gaps: Certain categories don't have certain attributes. "Shipping weight" being NULL for digital products isn't a data quality issue - it's correct.
- Truly random: No pattern. Some values are just missing. Use Fill Missing with a sensible default, or exclude those rows from calculations.
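For the truly random case, the fill itself is a one-liner once you've chosen a default: COALESCE in SQL, or a dict rewrite in Python. The placeholder value here is just an example:

```python
rows = [
    {"customer_id": 1, "phone": "555-0142"},
    {"customer_id": 2, "phone": None},
]

# Replace NULL phones with an explicit placeholder; keep real values as-is.
# Equivalent to COALESCE(phone, 'not provided') in SQL.
filled = [
    {**r, "phone": r["phone"] if r["phone"] is not None else "not provided"}
    for r in rows
]
print(filled[1]["phone"])  # not provided
```

An explicit placeholder like "not provided" is often better than an empty string, because it keeps "we know there's no value" distinguishable from "the value is blank".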
A real example
Say you have a customer support dataset with columns for ticket_id, category, priority, assigned_to, resolution_time, and satisfaction_score. The satisfaction_score column is 40% NULL.
You filter to satisfaction_score IS NULL and notice most of these tickets have category = "billing". You pivot by category and confirm: billing tickets have a 72% NULL rate for satisfaction, while technical support tickets are only 15% NULL.
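The per-category NULL rates in that comparison are just conditional counts divided by group sizes. A minimal sketch with made-up ticket rows:

```python
from collections import defaultdict

# Hypothetical tickets; None stands for a missing satisfaction_score.
tickets = [
    {"category": "billing",   "satisfaction_score": None},
    {"category": "billing",   "satisfaction_score": None},
    {"category": "billing",   "satisfaction_score": 4},
    {"category": "technical", "satisfaction_score": 5},
    {"category": "technical", "satisfaction_score": None},
]

counts = defaultdict(lambda: [0, 0])  # category -> [nulls, total]
for t in tickets:
    counts[t["category"]][0] += t["satisfaction_score"] is None
    counts[t["category"]][1] += 1

for cat, (nulls, total) in counts.items():
    print(f"{cat}: {100 * nulls / total:.0f}% NULL")
```

When the rates differ this sharply between groups, averaging the column across all rows quietly weights the result toward whichever group answers more often - exactly the bias described above.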
The reason? Billing tickets get resolved through an automated system that doesn't send satisfaction surveys. This isn't a data quality problem - it's a process gap. And now you know that any "average satisfaction score" metric you calculate is biased toward non-billing tickets.
That's the kind of insight you only get when you look at the shape of what's missing, not just what's present.