6 min read

DuckDB SQL Cheatsheet for CSV Analysis

ExploreMyData runs DuckDB WASM under the hood. When you use the SQL operation or write expressions for Add Column and Filter, you're writing DuckDB SQL. It's PostgreSQL-like but has its own functions and quirks.

This is a quick reference for the functions you'll actually use when analyzing CSV, Parquet, and JSON files. Bookmark it. Every example below works in the SQL operation or expression fields.

Aggregations

The basics. These go in GROUP BY queries or the Aggregate operation.

-- Count all rows
SELECT COUNT(*) FROM data

-- Count non-null values in a column
SELECT COUNT(email) FROM data

-- Count distinct values
SELECT COUNT(DISTINCT region) FROM data

-- Sum, average, min, max
SELECT SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM data

-- Median (50th percentile); DuckDB also offers MEDIAN(amount) as a shortcut
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) FROM data

-- Standard deviation
SELECT STDDEV(amount) FROM data
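These combine naturally in a GROUP BY. A sketch of a per-group summary, assuming illustrative region and amount columns:

-- Summary stats per region
SELECT
  region,
  COUNT(*) AS orders,
  SUM(amount) AS total,
  ROUND(AVG(amount), 2) AS avg_amount
FROM data
GROUP BY region
ORDER BY total DESC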

String functions

Text manipulation. Useful in Add Column expressions and filter conditions.

-- Case conversion
LOWER(name)                          -- 'john doe'
UPPER(name)                          -- 'JOHN DOE'
INITCAP(name)                        -- 'John Doe'

-- Trimming whitespace
TRIM(name)                           -- remove leading/trailing spaces
LTRIM(name)                          -- left trim only
RTRIM(name)                          -- right trim only

-- Substrings and splitting
SPLIT_PART(email, '@', 2)            -- extract domain from email
SUBSTRING(phone, 1, 3)               -- first 3 characters
LEFT(zip_code, 5)                    -- first 5 characters
RIGHT(sku, 4)                        -- last 4 characters

-- Search and replace
REPLACE(city, 'NYC', 'New York')     -- exact replacement
REGEXP_REPLACE(phone, '[^0-9]', '', 'g')  -- strip non-digits
REGEXP_EXTRACT(text, '([A-Z]{2}-\d+)')    -- extract pattern

-- Concatenation
CONCAT(first_name, ' ', last_name)   -- combine strings
city || ', ' || state                -- alternative syntax

-- Length and position
LENGTH(name)                         -- character count
POSITION('@' IN email)               -- position of substring
CONTAINS(name, 'corp')               -- true/false search
STARTS_WITH(url, 'https')            -- prefix check
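These compose. A sketch that normalizes a hypothetical email column and pulls out the domain:

-- Lowercase, trim, and extract the email domain
SELECT
  LOWER(TRIM(email)) AS email_clean,
  SPLIT_PART(LOWER(TRIM(email)), '@', 2) AS domain
FROM data
WHERE CONTAINS(email, '@')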

Date and time functions

DuckDB is strict about date types. If your column is a string, you'll need to parse it first with strptime or cast it.

-- Parse a date string; strptime always returns a TIMESTAMP (cast to DATE if needed)
strptime(date_col, '%Y-%m-%d')           -- '2026-01-15'
strptime(date_col, '%m/%d/%Y %H:%M')     -- '01/15/2026 14:30'

-- Format a date as string
strftime(order_date, '%B %Y')             -- 'January 2026'

-- Extract parts
EXTRACT(YEAR FROM order_date)             -- 2026
EXTRACT(MONTH FROM order_date)            -- 1
EXTRACT(DOW FROM order_date)              -- day of week (0=Sun)
EXTRACT(QUARTER FROM order_date)          -- 1-4

-- Truncate to period
DATE_TRUNC('month', order_date)           -- first day of month
DATE_TRUNC('week', order_date)            -- Monday of that week
DATE_TRUNC('quarter', order_date)         -- first day of quarter

-- Date arithmetic
order_date + INTERVAL '30 days'           -- add 30 days
CURRENT_DATE - INTERVAL '1 year'          -- one year ago
DATEDIFF('day', start_date, end_date)     -- days between two dates
DATEDIFF('month', hire_date, CURRENT_DATE) -- months since hire

-- Current date/time
CURRENT_DATE                              -- today's date
CURRENT_TIMESTAMP                         -- current timestamp
NOW()                                     -- same as CURRENT_TIMESTAMP
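Put together, a typical monthly rollup looks like this — assuming order_date arrives as a string like '2026-01-15' and an amount column exists:

-- Monthly totals from a string date column
SELECT
  DATE_TRUNC('month', strptime(order_date, '%Y-%m-%d')) AS month,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM data
GROUP BY month
ORDER BY month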

Window functions

Window functions let you compute values across rows without collapsing them. They're the answer to "I want a running total" or "rank within each group."

-- Row numbering
ROW_NUMBER() OVER (ORDER BY amount DESC)                     -- 1, 2, 3...
ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) -- restarts at 1 per region

-- Ranking (handles ties differently)
RANK() OVER (ORDER BY score DESC)          -- 1, 2, 2, 4 (gaps after ties)
DENSE_RANK() OVER (ORDER BY score DESC)    -- 1, 2, 2, 3 (no gaps)

-- Running calculations
SUM(amount) OVER (ORDER BY order_date)                      -- running total
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)  -- 7-row moving avg (7 days only if there's one row per day)

-- Previous/next row values
LAG(amount, 1) OVER (ORDER BY order_date)   -- previous row's amount
LEAD(amount, 1) OVER (ORDER BY order_date)  -- next row's amount
amount - LAG(amount) OVER (ORDER BY order_date)  -- change from previous

-- Partition calculations
SUM(amount) OVER (PARTITION BY region)      -- total per region (on every row)
amount / SUM(amount) OVER () * 100          -- percentage of grand total
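A common combination is a running total alongside the change from the previous row — a sketch assuming illustrative order_date and amount columns:

-- Running total plus row-over-row change
SELECT
  order_date,
  amount,
  SUM(amount) OVER (ORDER BY order_date) AS running_total,
  amount - LAG(amount) OVER (ORDER BY order_date) AS change
FROM data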

Type casting

Use TRY_CAST instead of CAST when your data might have invalid values. It returns NULL instead of throwing an error.

-- Safe casting (returns NULL on failure)
TRY_CAST(price AS DOUBLE)                 -- string to number
TRY_CAST(date_col AS DATE)                -- string to date
TRY_CAST(id AS INTEGER)                   -- string to integer
TRY_CAST(active AS BOOLEAN)               -- string to boolean

-- Hard casting (throws error on failure)
CAST(amount AS VARCHAR)                    -- number to string
CAST(quantity AS BIGINT)                   -- integer upcast

-- Shorthand syntax
price::DOUBLE                              -- same as CAST(price AS DOUBLE)
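In practice you often cast and then drop the rows that failed to parse. A sketch assuming a messy price column stored as text:

-- Keep only rows where price parses as a number
SELECT *, TRY_CAST(price AS DOUBLE) AS price_num
FROM data
WHERE TRY_CAST(price AS DOUBLE) IS NOT NULL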

Conditionals

Branching logic for computed columns and conditional aggregation.

-- CASE expression
CASE
  WHEN amount >= 1000 THEN 'high'
  WHEN amount >= 100 THEN 'medium'
  ELSE 'low'
END

-- COALESCE: first non-null value
COALESCE(preferred_name, first_name, 'Unknown')

-- NULLIF: return NULL if values match
NULLIF(status, '')                         -- treat empty strings as NULL
amount / NULLIF(quantity, 0)               -- avoid division-by-zero errors

-- IIF: inline if (shorthand for a two-branch CASE)
IIF(amount > 0, 'profit', 'loss')
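CASE also works inside GROUP BY for quick breakdowns. A sketch with an illustrative amount column:

-- Count rows per tier in one pass
SELECT
  CASE
    WHEN amount >= 1000 THEN 'high'
    WHEN amount >= 100 THEN 'medium'
    ELSE 'low'
  END AS tier,
  COUNT(*) AS n
FROM data
GROUP BY tier
ORDER BY n DESC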

JSON functions

If you load a JSON or JSONL file, or have JSON strings inside a CSV column, these are the extraction functions.

-- Extract a string value from JSON
json_extract_string(payload, '$.name')         -- extract as VARCHAR
json_extract_string(payload, '$.address.city') -- nested path

-- Extract typed values
json_extract(payload, '$.count')::INTEGER      -- extract as number
json_extract(payload, '$.active')::BOOLEAN     -- extract as boolean

-- Check if key exists
json_extract(payload, '$.email') IS NOT NULL
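A full extraction query, assuming a payload column holding JSON strings with name and count keys (names are illustrative):

-- Flatten JSON fields into regular columns
SELECT
  json_extract_string(payload, '$.name') AS name,
  json_extract(payload, '$.count')::INTEGER AS item_count
FROM data
WHERE payload IS NOT NULL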

Useful patterns

A few combinations that come up constantly when analyzing real data.

-- Percentage of total
ROUND(amount / SUM(amount) OVER () * 100, 2) AS pct_of_total

-- Top N per group
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rn
  FROM data
) WHERE rn <= 3

-- Deduplication keeping latest
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM data
) WHERE rn = 1

-- Pivot-like: count by category
SELECT
  COUNT(*) FILTER (WHERE status = 'active') AS active_count,
  COUNT(*) FILTER (WHERE status = 'inactive') AS inactive_count
FROM data

-- Bucket numeric values
CASE
  WHEN age < 25 THEN '18-24'
  WHEN age < 35 THEN '25-34'
  WHEN age < 45 THEN '35-44'
  ELSE '45+'
END AS age_group

All of these work in the SQL operation and expression inputs. If you're not sure about a function, type it in and check the live preview. Five rows of output tells you immediately whether it does what you expect.

Open the SQL editor →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData