
What to Do First When You Open a New CSV File

Someone sends you a CSV and says "can you look at this?" You open it and see 50,000 rows and 25 columns. The column names are abbreviations that don't mean much. There's no data dictionary. Where do you even start?

Most people scroll around aimlessly for a few minutes, maybe sort a column or two, then go back to the sender with questions they could have answered themselves. There's a better way. Here's a five-minute routine that gives you a solid understanding of any data file before you start asking questions or doing analysis.

Step 1: Check the basics

Before anything else, look at the numbers in the status bar. How many rows? How many columns? This tells you the scale of what you're dealing with. A 500-row file with 8 columns is a different animal than 500,000 rows with 60 columns.

In ExploreMyData, the row count shows immediately after the file loads. Scroll the grid horizontally to see every column. Read the column headers. Even cryptic ones like amt_due_net or cust_seg_cd start to make sense once you see a few values beneath them.

| #  | order_id | cust_seg_cd | amt_due_net | order_dt   | status  | region  |
|----|----------|-------------|-------------|------------|---------|---------|
|    | BIGINT   | VARCHAR     | DOUBLE      | DATE       | VARCHAR | VARCHAR |
| 1  | 10041    | ENT         | 1,240.00    | 2025-01-03 | Active  | West    |
| 2  | 10042    | SMB         | 89.99       | 2025-01-03 | Closed  | East    |
| 3  | 10043    | MID         | 430.50      | 2025-01-04 | Pending | Central |
| 4  | 10044    | ENT         | 5,820.00    | 2025-01-04 | Active  | West    |
| 5  | 10045    | SMB         | 212.00      | 2025-01-05 | active  | East    |

49,832 rows

Column types are shown below each header. Row 5 has "active" (lowercase) - a casing inconsistency to watch for.
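If you'd rather script this first check, the same idea takes a few lines of plain Python. The inline sample below is made-up data mirroring the table above (in practice you'd pass a real file to `csv.reader`):

```python
import csv
import io

# Hypothetical inline sample standing in for the real file; for an actual
# CSV, replace io.StringIO(SAMPLE) with open("orders.csv", newline="").
SAMPLE = """order_id,cust_seg_cd,amt_due_net,order_dt,status,region
10041,ENT,1240.00,2025-01-03,Active,West
10042,SMB,89.99,2025-01-03,Closed,East
10043,MID,430.50,2025-01-04,Pending,Central
10044,ENT,5820.00,2025-01-04,Active,West
10045,SMB,212.00,2025-01-05,active,East
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]

# The two numbers to look at before anything else: scale and shape
print(f"{len(data)} rows x {len(header)} columns")
print("columns:", ", ".join(header))
```

Five rows here, 49,832 in the real file - the point is that the check is the same either way.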

Step 2: Check column types

Each column header shows its detected type: text, number, date, boolean. This matters more than you'd think. A "zip_code" column detected as a number means leading zeros got stripped (00501 became 501). A "price" column detected as text means it has dollar signs or commas that need cleaning before you can do math.

Scan the types quickly. Are dates actually dates, or strings? Are numbers actually numbers? Type mismatches are one of the most common problems in CSV files, and catching them early saves a lot of confusion later.
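Type detection in any tool boils down to "what's the narrowest type every value fits?" Here's a rough sketch of that logic in stdlib Python (the sample data is hypothetical, and real detectors handle many more formats):

```python
import csv
import io
from datetime import date

SAMPLE = """order_id,amt_due_net,order_dt,status
10041,1240.00,2025-01-03,Active
10042,89.99,2025-01-03,Closed
10043,430.50,2025-01-04,Pending
"""

def detect_type(values):
    """Return the narrowest type that every non-empty value parses as."""
    def fits(parse):
        try:
            for v in values:
                if v:  # skip empty strings (treated as nulls)
                    parse(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "integer"
    if fits(float):
        return "number"
    if fits(date.fromisoformat):
        return "date"
    return "text"

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]
columns = {name: [r[i] for r in data] for i, name in enumerate(header)}
types = {name: detect_type(vals) for name, vals in columns.items()}
```

Note the order: an integer also parses as a float, so the check tries the strictest type first. This is also why a zip code full of digits gets detected as a number unless the detector knows better.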

Step 3: Explore a few text columns

Click on a text column header to open the Column Explorer. For categorical columns, this shows every unique value and how many times it appears. This is where you start to understand what the data actually represents.

Say you open a "status" column and see: Active (31,204), Closed (14,891), Pending (3,712), active (193). Now you know three things: this is probably a customer or account file, most records are active, and there's an inconsistent casing issue to deal with.

The search bar in the categorical explorer is useful when a column has hundreds of unique values. Type a few characters to narrow down results. Click any value to instantly filter the grid to just those rows.

| status  | Count  | Share |
|---------|--------|-------|
| Active  | 31,204 | 62.6% |
| Closed  | 14,891 | 29.9% |
| Pending | 3,712  | 7.4%  |
| active  | 193    | 0.4%  |
| NULL    | 47     | 0.1%  |

5 distinct values - 4 unique statuses plus NULL. The lowercase "active" (193 rows) is the same category as "Active" entered inconsistently.
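Under the hood, this view is just a frequency count. A scaled-down sketch with made-up values shows both the count and how collapsing case exposes the duplicate category:

```python
from collections import Counter

# Hypothetical status values, scaled down from the article's example
statuses = ["Active", "Active", "Active", "Closed", "Closed", "Pending", "active"]

counts = Counter(statuses)
total = len(statuses)
for value, n in counts.most_common():
    print(f"{value:<8} {n}  {n / total:.1%}")

# Normalizing case merges "active" into "Active" - one category, not two
normalized = Counter(s.strip().lower() for s in statuses)
```

The normalized counter is the quick way to confirm a casing issue: if `counts` has more keys than `normalized`, some values differ only by case or whitespace.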

Step 4: Check numeric columns for range and distribution

Click on a numeric column to see its stats: min, max, average, sum, standard deviation, and null count. These six numbers tell you a lot.

For example, suppose an "order_total" column shows min = -$450, max = $847,000, avg = $312. That negative value is probably a refund. The $847K outlier needs investigation. And the average suggests most orders are modest.

Below the stats, the histogram shows the full distribution. Most numeric data is heavily skewed: a few extreme values with the bulk clustered at one end. The histogram makes that immediately visible. You can adjust the number of bins to get more or less granularity.

| Stat    | Value      | Notes                                    |
|---------|------------|------------------------------------------|
| Min     | -450.00    | Likely a refund - worth checking         |
| Max     | 847,000.00 | ~2,700x the average - investigate        |
| Average | 312.40     | Typical order size                       |
| Std Dev | 2,841.17   | High relative to mean - outliers present |
| Nulls   | 214        | 0.4% of rows - likely draft orders       |

Column: amt_due_net - histogram shows 99% of orders clustered between $0–$1,000 with a few extreme values stretching the scale.
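These stats, and a crude histogram, are easy to reproduce with the `statistics` module. The values below are invented to echo the example (a negative "refund" and one big outlier):

```python
import statistics

# Hypothetical order totals: the negative entry plays the refund role,
# the large one the outlier
totals = [-450.0, 120.0, 89.99, 430.5, 212.0, 5820.0]

stats = {
    "min": min(totals),
    "max": max(totals),
    "avg": round(statistics.mean(totals), 2),
    "sum": round(sum(totals), 2),
    "stdev": round(statistics.stdev(totals), 2),
}

# A crude 3-bin histogram over the min..max range; skew shows up as
# most values landing in the first bin
lo, hi, bins = min(totals), max(totals), 3
width = (hi - lo) / bins
hist = [0] * bins
for t in totals:
    hist[min(int((t - lo) / width), bins - 1)] += 1
```

Even at this tiny scale the skew is visible: five of six values land in the first bin, and the outlier sits alone at the far end, exactly the shape the real histogram shows.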

Step 5: Check date columns for range and gaps

Date columns get their own explorer with a time-series chart. This answers questions you didn't even know you had. Does the data cover a full year or just three months? Is there a gap in July where no records exist? Did the volume spike in December?

Switch the granularity between day, week, month, and year to see patterns at different scales. Daily granularity shows weekday vs. weekend patterns. Monthly smooths things out and shows seasonal trends.
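The gap check is just "bucket by month, look for empty buckets." A minimal sketch with made-up dates (the expected month list is an assumption you'd derive from the actual date range):

```python
from collections import Counter
from datetime import date

# Hypothetical order dates with no records in February
dates = [date(2025, 1, 3), date(2025, 1, 3), date(2025, 1, 4),
         date(2025, 3, 2), date(2025, 3, 15)]

first, last = min(dates), max(dates)
by_month = Counter(f"{d.year}-{d.month:02d}" for d in dates)

# Months the data should cover (assumed here); zero rows means a gap
expected = ["2025-01", "2025-02", "2025-03"]
gaps = [m for m in expected if by_month[m] == 0]
```

Switching granularity is the same trick with a different bucket key: day (`d.isoformat()`), week (`d.isocalendar().week`), or year (`d.year`).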

Step 6: Check for nulls

Every Column Explorer shows the null count, but do a quick mental pass: which columns have missing data, and how much? A column that's 80% null is probably not useful. A column that's 99.9% filled but has 50 nulls might be important to investigate.

The numeric explorer puts the null count right in the stats panel. For text columns, NULL shows up as a value in the categorical list (usually at the bottom). Pay attention to the distinction between NULL and empty string, as they can behave differently in filters and aggregations.
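The NULL-vs-empty-string distinction is easy to see in code. CSV has no true null type, so a tool has to pick a convention; this sketch (with hypothetical values) counts the two cases separately:

```python
# Hypothetical raw column values: "" is an empty string, "NULL" is a
# literal token - two different conventions for "missing" in a CSV
region = ["West", "", "East", "NULL", "West", ""]

empty_count = sum(1 for v in region if v == "")
null_token_count = sum(1 for v in region if v == "NULL")
filled = len(region) - empty_count - null_token_count
```

A filter on "not null" may keep the empty strings, and a filter on "not empty" may keep the NULL tokens - which is exactly why the distinction matters in aggregations.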

Why this matters

This whole routine takes about five minutes. At the end, you know:

  • How big the dataset is
  • What each column represents (roughly)
  • Whether types are correct
  • The range and distribution of key numeric fields
  • The time period the data covers
  • Where the data quality problems are

That's enough to have an informed conversation with whoever sent you the file, or to start your actual analysis with confidence instead of guesswork.

The alternative is jumping straight into analysis and discovering twenty minutes later that the "revenue" column is text, half the dates are from a test environment, and the status column has six different spellings of "cancelled." Do the five-minute check first. It always pays off.


Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData