Analyzing Data Across Multiple Files
You have three files. orders.csv
has order IDs, customer IDs, product names, and amounts.
customers.csv has customer
IDs, names, and tier levels. products.csv
has product names, categories, and margins.
The question you need to answer: "What's the total revenue by customer tier for each product category?" No single file has all the columns. You need to combine them.
In a traditional workflow, you'd load everything into a database or write a Python script. In ExploreMyData, you load all three files as tabs, join them together, and build your analysis on the combined result. Everything stays in the browser.
Step 1: Load all your files
Open ExploreMyData and load your first file. Then load the second and third. Each file gets its own tab at the top of the workspace. You can switch between tabs to preview each file independently.
With all three files loaded, you can see:
- orders - 10,000 rows: order_id, customer_id, product_name, amount, order_date
- customers - 500 rows: customer_id, name, email, tier, signup_date
- products - 50 rows: product_name, category, margin_pct, supplier
| Tab | File | Rows | Columns |
|---|---|---|---|
| 1 | orders.csv | 10,000 | order_id, customer_id, product_name, amount, order_date |
| 2 | customers.csv | 500 | customer_id, name, email, tier, signup_date |
| 3 | products.csv | 50 | product_name, category, margin_pct, supplier |
All three files are available as separate tabs. Each can be previewed and cleaned independently before joining.
Step 2: Join orders with customers
On the orders tab, open the toolbar and select Join. You'll see a dialog that lets you pick the other table and the join columns.
- Select customers as the table to join with.
- Set the join column:
customer_idfrom orders matchescustomer_idfrom customers. - Choose the join type. For this analysis, LEFT JOIN makes sense - keep all orders, even if a customer record is missing.
- Click Apply.
The result is a new view with all the order columns plus the customer columns (name, email, tier, signup_date). The pipeline now shows the join as a step. Under the hood, it's running:
SELECT *
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.customer_id
Join configuration
- Join with: customers
- Join type: LEFT JOIN
- Left key (orders): customer_id
- Right key (customers): customer_id
Generated SQL: SELECT * FROM orders LEFT JOIN customers ON orders.customer_id = customers.customer_id
Result: 10,000 rows with all order columns plus name, email, tier, signup_date from customers.
Step 3: Join the result with products
Now that orders and customers are combined, you can add the product data. Add another Join step:
- Select products as the table to join with.
- Set the join column:
product_namefrom the current view matchesproduct_namefrom products. - LEFT JOIN again.
- Apply.
The pipeline now has two join steps. The view contains columns from all three files: order details, customer tier, and product category. You're ready to analyze.
Step 4: Build the analysis
With the combined data, answering "total revenue by customer tier and product category" is straightforward. Add an Aggregate operation:
- Group by:
tierandcategory - Aggregate:
SUM(amount)as total_revenue - Apply, then add a Sort step to order by total_revenue descending.
The result is a clean table showing exactly what was asked. The pipeline has four steps: join customers, join products, aggregate, sort. Each step is visible, editable, and deletable.
| tier | category | total_revenue |
|---|---|---|
| Enterprise | Electronics | 98,400 |
| Enterprise | Software | 76,200 |
| Growth | Electronics | 54,100 |
| Growth | Software | 38,900 |
| Starter | Electronics | 21,600 |
| Starter | Software | 14,300 |
Total revenue by customer tier and product category, combining data from all three files. Sorted by total_revenue descending.
How joins work in the pipeline
A join is just another pipeline step. It combines two tables into one view. Every operation after the join works on the combined data. This means you can:
- Filter the joined data (e.g., only "premium" tier customers)
- Add calculated columns using fields from both tables (e.g.,
amount * margin_pct) - Aggregate across dimensions that came from different files
- Chain multiple joins to bring in three, four, or more files
If you delete a join step, subsequent steps that reference columns from the joined table will break. The pipeline will show an error, and you can fix it by re-adding the join or adjusting the downstream steps.
Multi-file tabs vs. multi-view tabs
There's an important distinction. Each loaded file gets its own tab. You can apply independent pipelines to each file. When you join, the result appears in the current tab's pipeline.
This means you can clean each file independently before joining. For example:
- On the customers tab: trim whitespace from names, fill missing tier values with "standard".
- On the products tab: lowercase all category names for consistency.
- On the orders tab: filter out cancelled orders, then join with the cleaned customers and products.
The join references the other file's current state, including any pipeline steps applied to it. Clean first, then join.
Tips for multi-file analysis
- Check join keys first. Before joining, switch to each tab and look at the join column. Are the values consistent? Does one file use "US" and the other "United States"? Clean that before joining.
- Watch for duplicate columns. If both files have a column called "name", the join will create two columns. Use Select Columns to drop the one you don't need.
- Start with the largest table. Join the smaller lookup tables into the main fact table, not the other way around. It's more intuitive and easier to verify.
- Use LEFT JOIN by default. It preserves all rows from your main table even if there's no match in the lookup table. INNER JOIN drops unmatched rows, which can silently lose data.