DuckDB SQL Cheatsheet for CSV Analysis
ExploreMyData runs DuckDB WASM under the hood. When you use the SQL operation or write expressions for Add Column and Filter, you're writing DuckDB SQL. It's PostgreSQL-like but has its own functions and quirks.
This is a quick reference for the functions you'll actually use when analyzing CSV, Parquet, and JSON files. Bookmark it. Every example below works in the SQL operation or expression fields.
Aggregations
The basics. These go in GROUP BY queries or the Aggregate operation.
-- Count all rows
SELECT COUNT(*) FROM data
-- Count non-null values in a column
SELECT COUNT(email) FROM data
-- Count distinct values
SELECT COUNT(DISTINCT region) FROM data
-- Sum, average, min, max
SELECT SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM data
-- Median (50th percentile)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) FROM data
-- Standard deviation
SELECT STDDEV(amount) FROM data
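These become per-group summaries once you add GROUP BY. A sketch, assuming a table named data with hypothetical region and amount columns:

```sql
-- Order count, total, and average per region
SELECT
  region,
  COUNT(*) AS orders,
  SUM(amount) AS total,
  ROUND(AVG(amount), 2) AS avg_order
FROM data
GROUP BY region
ORDER BY total DESC
```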
String functions
Text manipulation. Useful in Add Column expressions and filter conditions.
-- Case conversion
LOWER(name) -- 'john doe'
UPPER(name) -- 'JOHN DOE'
INITCAP(name) -- 'John Doe'
-- Trimming whitespace
TRIM(name) -- remove leading/trailing spaces
LTRIM(name) -- left trim only
RTRIM(name) -- right trim only
-- Substrings and splitting
SPLIT_PART(email, '@', 2) -- extract domain from email
SUBSTRING(phone, 1, 3) -- first 3 characters
LEFT(zip_code, 5) -- first 5 characters
RIGHT(sku, 4) -- last 4 characters
-- Search and replace
REPLACE(city, 'NYC', 'New York') -- exact replacement
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') -- strip non-digits
REGEXP_EXTRACT(text, '([A-Z]{2}-\d+)') -- extract pattern
-- Concatenation
CONCAT(first_name, ' ', last_name) -- combine strings
city || ', ' || state -- alternative syntax
-- Length and position
LENGTH(name) -- character count
POSITION('@' IN email) -- position of substring
CONTAINS(name, 'corp') -- true/false search
STARTS_WITH(url, 'https') -- prefix check
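These compose, which is how most real cleanup happens. A sketch, assuming a hypothetical email column with inconsistent casing and stray whitespace:

```sql
-- Normalize the address, then pull out the domain
SELECT
  LOWER(TRIM(email)) AS clean_email,
  SPLIT_PART(LOWER(TRIM(email)), '@', 2) AS domain
FROM data
```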
Date and time functions
DuckDB is strict about date types. If your column is a string, you'll need to parse it first with strptime or cast it.
-- Parse a date string into a DATE/TIMESTAMP
strptime(date_col, '%Y-%m-%d') -- '2026-01-15' -> TIMESTAMP (add ::DATE if you want a date)
strptime(date_col, '%m/%d/%Y %H:%M') -- '01/15/2026 14:30' -> TIMESTAMP
-- Format a date as string
strftime(order_date, '%B %Y') -- 'January 2026'
-- Extract parts
EXTRACT(YEAR FROM order_date) -- 2026
EXTRACT(MONTH FROM order_date) -- 1
EXTRACT(DOW FROM order_date) -- day of week (0=Sun)
EXTRACT(QUARTER FROM order_date) -- 1-4
-- Truncate to period
DATE_TRUNC('month', order_date) -- first day of month
DATE_TRUNC('week', order_date) -- Monday of that week
DATE_TRUNC('quarter', order_date) -- first day of quarter
-- Date arithmetic
order_date + INTERVAL '30 days' -- add 30 days
CURRENT_DATE - INTERVAL '1 year' -- one year ago
DATEDIFF('day', start_date, end_date) -- days between two dates
DATEDIFF('month', hire_date, CURRENT_DATE) -- months since hire
-- Current date/time
CURRENT_DATE -- today's date
CURRENT_TIMESTAMP -- current timestamp
NOW() -- same as CURRENT_TIMESTAMP
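Parsing and truncating usually combine into one monthly rollup. A sketch, assuming order_date is a 'YYYY-MM-DD' string and amount is numeric:

```sql
-- Monthly totals from a string date column
SELECT
  DATE_TRUNC('month', strptime(order_date, '%Y-%m-%d')) AS month,
  SUM(amount) AS total
FROM data
GROUP BY month
ORDER BY month
```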
Window functions
Window functions let you compute values across rows without collapsing them. They're the answer to "I want a running total" or "rank within each group."
-- Row numbering
ROW_NUMBER() OVER (ORDER BY amount DESC) -- 1, 2, 3...
ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) -- rank within region
-- Ranking (handles ties differently)
RANK() OVER (ORDER BY score DESC) -- 1, 2, 2, 4 (gaps after ties)
DENSE_RANK() OVER (ORDER BY score DESC) -- 1, 2, 2, 3 (no gaps)
-- Running calculations
SUM(amount) OVER (ORDER BY order_date) -- running total
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) -- moving avg over current + 6 previous rows (7-day only if there's one row per day)
-- Previous/next row values
LAG(amount, 1) OVER (ORDER BY order_date) -- previous row's amount
LEAD(amount, 1) OVER (ORDER BY order_date) -- next row's amount
amount - LAG(amount) OVER (ORDER BY order_date) -- change from previous
-- Partition calculations
SUM(amount) OVER (PARTITION BY region) -- total per region (on every row)
amount / SUM(amount) OVER () * 100 -- percentage of grand total
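In a full query these sit alongside regular columns. A sketch of period-over-period change, assuming hypothetical month and total columns:

```sql
-- Change from the previous period, plus share of the grand total
SELECT
  month,
  total,
  total - LAG(total) OVER (ORDER BY month) AS change,
  ROUND(total / SUM(total) OVER () * 100, 2) AS pct_of_total
FROM data
```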
Type casting
Use TRY_CAST instead of CAST when your data might have invalid values. It returns NULL instead of throwing an error.
-- Safe casting (returns NULL on failure)
TRY_CAST(price AS DOUBLE) -- string to number
TRY_CAST(date_col AS DATE) -- string to date
TRY_CAST(id AS INTEGER) -- string to integer
TRY_CAST(active AS BOOLEAN) -- string to boolean
-- Hard casting (throws error on failure)
CAST(amount AS VARCHAR) -- number to string
CAST(quantity AS BIGINT) -- integer upcast
-- Shorthand syntax
price::DOUBLE -- same as CAST(price AS DOUBLE)
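TRY_CAST pairs well with REGEXP_REPLACE for formatted values like '$1,234.56'. A sketch, assuming a hypothetical price column stored as text:

```sql
-- Strip currency symbols and commas, then cast; anything still invalid becomes NULL
TRY_CAST(REGEXP_REPLACE(price, '[^0-9.]', '', 'g') AS DOUBLE)
```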
Conditionals
Branching logic for computed columns and conditional aggregation.
-- CASE expression
CASE
  WHEN amount >= 1000 THEN 'high'
  WHEN amount >= 100 THEN 'medium'
  ELSE 'low'
END
-- COALESCE: first non-null value
COALESCE(preferred_name, first_name, 'Unknown')
-- NULLIF: return NULL if values match
NULLIF(status, '') -- treat empty strings as NULL
amount / NULLIF(quantity, 0) -- avoid division-by-zero errors
-- IIF: inline if (shorthand for a two-branch CASE)
IIF(amount > 0, 'profit', 'loss')
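CASE also works inside aggregates for conditional sums and rates. A sketch, assuming hypothetical status and amount columns:

```sql
-- Net revenue treating refunds as negative, plus share of high-value orders
SELECT
  SUM(CASE WHEN status = 'refunded' THEN -amount ELSE amount END) AS net_revenue,
  AVG(CASE WHEN amount >= 1000 THEN 1.0 ELSE 0.0 END) AS high_value_rate
FROM data
```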
JSON functions
If you load a JSON or JSONL file, or have JSON strings inside a CSV column, these are the extraction functions.
-- Extract a string value from JSON
json_extract_string(payload, '$.name') -- extract as VARCHAR
json_extract_string(payload, '$.address.city') -- nested path
-- Extract typed values
json_extract(payload, '$.count')::INTEGER -- extract as number
json_extract(payload, '$.active')::BOOLEAN -- extract as boolean
-- Check if key exists
json_extract(payload, '$.email') IS NOT NULL
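Put together in one query, a sketch assuming a JSON string column named payload with hypothetical name, count, address.city, and email keys:

```sql
-- Flatten JSON fields into regular columns
SELECT
  json_extract_string(payload, '$.name') AS name,
  json_extract(payload, '$.count')::INTEGER AS item_count,
  COALESCE(json_extract_string(payload, '$.address.city'), 'unknown') AS city
FROM data
WHERE json_extract(payload, '$.email') IS NOT NULL
```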
Useful patterns
A few combinations that come up constantly when analyzing real data.
-- Percentage of total
ROUND(amount / SUM(amount) OVER () * 100, 2) AS pct_of_total
-- Top N per group
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rn
  FROM data
) WHERE rn <= 3
-- Deduplication keeping latest
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM data
) WHERE rn = 1
-- Pivot-like: count by category
SELECT
  COUNT(*) FILTER (WHERE status = 'active') AS active_count,
  COUNT(*) FILTER (WHERE status = 'inactive') AS inactive_count
FROM data
-- Bucket numeric values
CASE
  WHEN age < 25 THEN '18-24'
  WHEN age < 35 THEN '25-34'
  WHEN age < 45 THEN '35-44'
  ELSE '45+'
END AS age_group
All of these work in the SQL operation and expression inputs. If you're not sure about a function, type it in and check the live preview. Five rows of output tell you immediately whether it does what you expect.