How to Find Outliers in Numeric Data Without Code
You're looking at a revenue column. Most orders are between $20 and $300. But there are three orders over $50,000. Are those enterprise deals? Bulk purchases? Or did someone accidentally enter 50000 instead of 500?
You can't answer that question by scrolling through rows. You need to see the distribution first, then drill into the extremes. Here's how to do that without writing any code.
Start with the histogram
Click any numeric column header in ExploreMyData to open the Column Explorer. The top section shows summary stats: min, max, average, sum, standard deviation, and null count. Below that is an interactive histogram.
The histogram is the fastest way to spot outliers because it shows the shape of your data. If 99% of values cluster between 0 and 500 and there's a single bar way out at 50,000, you have outliers. If the distribution is smooth and continuous, you probably don't.
| Bin range (revenue) | Count | Bar |
|---|---|---|
| $0 – $100 | 8,241 | ████████ |
| $100 – $200 | 11,047 | ███████████ |
| $200 – $500 | 9,388 | █████████ |
| $500 – $1,000 | 3,102 | ███ |
| $1,000 – $5,000 | 211 | ▏ |
| $50,000+ | 3 | ▏ |
The gap between $5,000 and $50,000 is the signal. Three orders sit far outside the main cluster and warrant investigation. Pay attention to that gap between the main cluster and the extreme values: a revenue column where values taper smoothly from $0 to $5,000 tells a different story than one where everything is under $500 and then three values jump to $50,000+. A smooth taper suggests a genuine long tail; a wide empty gap suggests errors.
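If you want to reproduce this gap check programmatically, the same signal the histogram shows visually can be found by sorting the values and looking for the widest gap between neighbors. This is a sketch on hypothetical data, not anything ExploreMyData does under the hood:

```python
# Sketch: find the widest gap between consecutive sorted values.
# A large empty gap between the main cluster and the extremes is the
# same signal the histogram shows. `revenues` is hypothetical data.
revenues = [84, 120, 95, 310, 450, 37, 52_000, 51_500, 50_200]

sorted_vals = sorted(revenues)
gaps = [(b - a, a, b) for a, b in zip(sorted_vals, sorted_vals[1:])]
widest, low, high = max(gaps)

print(f"Largest gap: {low} to {high} ({widest} wide)")
# Values above `high` sit far outside the main cluster.
```

Here the widest gap runs from 450 to 50,200, which separates the main cluster from the three extreme orders.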
Adjust the bins
The default bin count works for most data, but sometimes you need to change it. Fewer bins (10-15) give you a high-level shape. More bins (50-100) reveal detail in the distribution.
When outliers compress the histogram, try increasing the bin count. With 10 bins, a dataset ranging from 0 to 100,000 gives you bins of 10,000 each. That puts 99% of your data into the first bin and hides all the interesting variation. With 50 bins, you get 2,000-wide bins and can actually see the distribution of the main cluster.
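The bin-width arithmetic above can be checked with a quick sketch. This uses hypothetical data (a main cluster under $5,000 plus three $50,000+ outliers) and NumPy's `histogram`, purely to show why coarse bins hide the cluster:

```python
# Sketch: how bin count changes what a histogram reveals.
# With 10 bins over a 0-100,000 range, each bin is 10,000 wide and the
# whole main cluster collapses into the first bin; 50 bins are 2,000
# wide and start to resolve it. Data here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.uniform(0, 5_000, 9_997),  # main cluster
                         [50_000, 51_000, 52_000]])     # outliers

coarse, _ = np.histogram(values, bins=10, range=(0, 100_000))
fine, _ = np.histogram(values, bins=50, range=(0, 100_000))

print("10 bins:", coarse[0], "of", len(values), "values in the first bin")
print("50 bins: first three bins hold", fine[:3].tolist())
```

With 10 bins, all 9,997 cluster values land in one bar; with 50, the cluster spreads across several bars and its shape becomes visible.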
Read the summary stats
The stats panel above the histogram gives you quick diagnostic numbers:
- Min and Max show the full range. If min is negative in a column that shouldn't have negatives (like quantity or age), that's a problem.
- Average vs. Max tells you about skew. If the average is $312 but the max is $847,000, the max is roughly 2,700x the average. That's worth investigating.
- Standard deviation quantifies spread. A high stddev relative to the mean suggests either genuine variation or outlier contamination.
- Null count matters too. Missing values in a numeric column might be the real outliers: records where someone left the field blank because the number didn't make sense.
| Stat | Value |
|---|---|
| Min | -3,200.00 |
| Max | 52,400.00 |
| Average | 284.73 |
| Sum | 9,072,140.00 |
| Std Dev | 1,847.22 |
| Nulls | 63 |
The max is 184x the average, and the std dev is 6.5x the average. Both are strong indicators that outliers are distorting the distribution.
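The two ratios quoted above are simple divisions you can verify anywhere. A minimal sketch, using the numbers from the stats table:

```python
# Sketch: the same quick skew checks the stats panel enables,
# computed from the values in the table above.
stats = {"min": -3200.00, "max": 52400.00, "avg": 284.73,
         "std": 1847.22, "nulls": 63}

max_ratio = stats["max"] / stats["avg"]  # how far the max sits from typical
std_ratio = stats["std"] / stats["avg"]  # spread relative to the mean

print(round(max_ratio), round(std_ratio, 1))  # 184 6.5
```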
Pull up the extreme values
Once you know outliers exist, you need to see the actual rows. Use the Top/Bottom operation to show the top N rows by a specific column. Set it to show the top 10 rows sorted by revenue descending. Now you can see the full context of each outlier: who's the customer, what's the product, when did the order happen.
Context is everything. An order for $52,000 from "Acme Corp" for "Enterprise License - Annual" is probably legitimate. An order for $52,000 from "John Smith" for "Widget (Blue)" is probably a data entry error.
Do the same for the bottom. Show the bottom 10 by revenue. Negative values, zeros, and unusually small amounts often tell you as much as the large outliers. A $0.01 order might be a test transaction. A -$3,200 order is likely a refund that someone entered as a new order instead of an adjustment.
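For readers who want to reproduce the Top/Bottom check programmatically, pandas' `nlargest` and `nsmallest` do the equivalent. The DataFrame and column names here are hypothetical:

```python
# Sketch of a Top/Bottom-style check in pandas. The data is hypothetical;
# the point is pulling extreme rows with their context columns intact.
import pandas as pd

orders = pd.DataFrame({
    "customer": ["Acme Corp", "John Smith", "Beta LLC", "Cara Diaz"],
    "product": ["Enterprise License - Annual", "Widget (Blue)",
                "Widget (Red)", "Refund?"],
    "revenue": [52_000.0, 52_000.0, 180.0, -3_200.0],
})

top = orders.nlargest(2, "revenue")      # highest-revenue rows, with context
bottom = orders.nsmallest(2, "revenue")  # negatives and tiny amounts
```

Seeing the customer and product columns next to each extreme value is what lets you judge "legitimate enterprise deal" versus "entry error".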
Filter to isolate them
Now that you've identified the threshold, use Filter to isolate the outliers. Set a condition like `revenue > 10000` to see only the high-end outliers, or `revenue < 0` for negatives.
With the outliers isolated, you can examine each one. Look at the other columns for patterns. Are all the $50K+ orders from the same customer? The same date? The same sales rep? Patterns in the context columns help you decide whether these are real data or errors.
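The same filter conditions translate directly to boolean masks if you ever want to replicate them in code. A minimal sketch on hypothetical data:

```python
# Sketch: the Filter conditions above as pandas boolean masks.
# Thresholds mirror the examples in the text; the data is hypothetical.
import pandas as pd

orders = pd.DataFrame({"revenue": [120.0, 52_000.0, -3_200.0, 310.0, 50_200.0]})

high = orders[orders["revenue"] > 10_000]  # high-end outliers
negative = orders[orders["revenue"] < 0]   # refunds entered as orders

print(len(high), len(negative))  # 2 1
```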
Common outlier patterns
After looking at enough datasets, certain patterns keep showing up:
- Powers of 10: A value of $5,000 when everything else is around $50 usually means a missing decimal point.
- Round numbers: $100,000 exactly, in a column of precise dollar-and-cent values, often indicates a placeholder or estimate.
- Negatives in positive-only fields: Quantity, age, or price columns shouldn't have negative values. These are usually refunds, corrections, or entry errors.
- Zeros: A zero in a revenue or price column might mean "free" or might mean "unknown." Check the context.
- Same extreme value repeated: If you see 999999 or -1 multiple times, those are likely sentinel values used to mean "not applicable" or "missing."
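The "same extreme value repeated" pattern is easy to check mechanically: count how often each out-of-range value occurs. A sketch on a hypothetical age column, where anything negative or above 150 is treated as out of range:

```python
# Sketch: spotting repeated sentinel values like 999999 or -1 by
# counting occurrences of out-of-range values. Data and the 0-150
# valid range are hypothetical.
from collections import Counter

ages = [34, 29, 999999, 41, -1, 999999, 56, -1, -1]

counts = Counter(ages)
suspects = {v: n for v, n in counts.items()
            if n > 1 and (v < 0 or v > 150)}  # repeated AND out of range

print(suspects)  # {999999: 2, -1: 3}
```

A genuine data-entry typo is usually unique; a value that is both impossible and repeated is almost always a sentinel.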
Decide, then act
Finding outliers is step one. Deciding what to do with them is step two. Your options are usually:
- Keep them: They're real data points. Enterprise deals really do happen.
- Fix them: $50,000 should be $500. Use Find & Replace or a formula to correct them.
- Remove them: They're test data or errors. Filter them out.
- Flag them: Add a column that marks them for review by someone who knows the business context.
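The "flag them" option maps to a derived column. A minimal sketch, assuming a hypothetical threshold of $10,000 and a made-up `needs_review` column name:

```python
# Sketch: flagging outliers in a derived column so a reviewer can
# filter to flagged rows later. Threshold and column name are hypothetical.
import pandas as pd

orders = pd.DataFrame({"revenue": [120.0, 52_000.0, -3_200.0, 310.0]})
orders["needs_review"] = (orders["revenue"] > 10_000) | (orders["revenue"] < 0)

print(int(orders["needs_review"].sum()))  # 2
```

Flagging keeps the original data intact while still separating the questionable rows, which is usually the safest default when you lack the business context to fix or delete.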
The histogram and stats tell you the outliers exist. Top/Bottom and Filter let you see what they are. The decision about what to do with them requires judgment, but at least now you have the information to make that call.