← All posts
4 min read

Understanding Category Distributions in Your Data

Your orders table has a "category" column. You assume sales are spread fairly evenly across categories. But are they? Maybe "Electronics" accounts for 60% of all orders and "Garden Supplies" is less than 1%. Maybe there are 47 categories when you expected 12, because someone entered "electronics", "Electronics", and "ELECTRONICS" as three different things.

You won't know until you look at the distribution. And scrolling through rows won't show you the distribution. You need counts by category, sorted, all in one place.

Open the categorical explorer

In ExploreMyData, click any text column header to open the Column Explorer. For categorical columns, this shows every unique value alongside its count, sorted by frequency.

This single view answers the three most common questions about categorical data:

  • How many distinct categories exist?
  • Which ones dominate?
  • Are there values that shouldn't be there?
categoryCountShare
Electronics24,84059.1%
Clothing9,20321.9%
Home & Kitchen4,91111.7%
Books2,1875.2%
Sports6111.5%
Garden Supplies2480.6%

6 distinct values, sorted by frequency. Electronics dominates at 59% - good to know before building any category-level report.

Spot the long tail

Most categorical columns follow a power law: a few values account for the bulk of the data, and a long tail of rare values makes up the rest. The explorer makes this immediately visible because values are sorted by count.

Consider a "country" column in a customer table. You might see: United States (84,201), United Kingdom (12,433), Canada (9,871), Germany (4,502)... and then 60 more countries with counts under 100. That long tail matters. If you're building a report grouped by country, those 60 countries will each get their own row, making the report unusable. You might want to group them into an "Other" bucket.

The explorer shows you this problem before you waste time building the report.

Find dirty data hiding in categories

Categorical columns are magnets for data quality issues. Common problems that jump out in the explorer:

  • Inconsistent casing: Shipped (14,201) and shipped (83) side by side. That's one category, not two.
  • Leading/trailing whitespace: New York and New York (with a space before it). Hard to see in a spreadsheet, obvious in the explorer because they show up as separate entries.
  • Abbreviation mix: California, CA, and Calif. are the same state but three different values.
  • Unexpected values: A "payment_method" column with "Credit Card", "Debit Card", "Wire Transfer", and "asdf". That last one is probably a test row.

The search bar helps here. Type "new" and you'll see all the New York variants at once. Type "ship" and you'll catch "Shipped", "shipped", "SHIPPED", and "Ship" in one view.

Search results for "ship" in the status column:

status (matched)Count
Shipped14,201
shipped83
SHIPPED12
Ship5

Four variants of the same status - all lowercase, uppercase, and mixed. Searching for "ship" reveals them all at once, which scrolling through rows never would.

Click to filter, then investigate

When you see a suspicious value in the explorer, click it. The grid immediately filters to show only rows with that value. Now you can see the full context: when were these rows created? By whom? What do the other columns look like?

This is the fastest path from "that looks weird" to "I understand what happened." A category with 3 records out of 50,000 might be test data, a migration artifact, or a legitimate edge case. The surrounding columns tell you which.

Get a formal breakdown with Top/Bottom

The Column Explorer gives you a quick, visual distribution. If you need a more structured view, use the Top/Bottom operation to show the top N categories by count. This gives you a clean, sorted view in the main grid.

For a full summary table, use Pivot with COUNT as the aggregation. Set the category column as the row grouping. This produces a table with one row per category and a count column, which you can sort, filter, or export.

categorycount
Electronics24,840
Clothing9,203
Home & Kitchen4,911
Books2,187
Sports611
Garden Supplies248

Pivot with GROUP BY category, COUNT(*). Clean summary table you can sort, filter, or export directly.

What to look for in any categorical column

Keep a mental checklist when you open the categorical explorer:

  1. Count of unique values: Does it match expectations? A "status" column with 200 unique values is suspicious. A "product_name" column with 3 unique values might mean data is filtered already.
  2. Dominant value: If one category has 95% of the data, any analysis grouped by this column will be skewed. That's not wrong, but you should know it going in.
  3. Null or blank entries: How many records have no category? Are they evenly distributed or concentrated in a specific time period?
  4. Near-duplicates: Two values that look almost the same are usually the same, entered differently. Search for partial strings to find them.

Five minutes with the categorical explorer tells you whether a column is clean enough to use for grouping and filtering, or whether it needs cleanup first.

Explore your categories →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData