Parsing and Analyzing Server Log Files
You have a CSV export of server log entries. Maybe you pulled them from CloudWatch, maybe someone ran a script that dumped the last 50,000 lines into a spreadsheet. Each row has a single column with a raw log line like:
2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com from 192.168.1.1
That is a wall of text. You cannot filter by log level, you cannot group by service, you cannot chart errors over time. You need to break this apart into structured columns first, and then you can actually analyze it.
Here is how to go from raw log lines to structured, filterable data in ExploreMyData using seven pipeline steps.
Step 1: Extract the timestamp
Open the toolbar and select Extract Text from the Transform group. This operation pulls substrings out of a text column using patterns.
- Source column: `log_line` (or whatever your raw column is called)
- Method: regex
- Pattern: `\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}`
- New column name: `timestamp_raw`
This regex matches the standard datetime format at the start of each log line. The result is a new VARCHAR column containing just the timestamp string, like 2024-03-15 09:23:45.
| log_line (raw) | timestamp_raw (extracted) |
|---|---|
| 2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com | 2024-03-15 09:23:45 |
| 2024-03-15 09:24:02 INFO [api-gateway] Request completed in 142ms | 2024-03-15 09:24:02 |
| 2024-03-15 09:24:18 WARN [db-connector] Slow query: 3.2s on users table | 2024-03-15 09:24:18 |
| 2024-03-15 09:25:01 ERROR [auth-service] Token expired for session abc123 | 2024-03-15 09:25:01 |
The regex `\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}` pulls the timestamp from each line. The raw log_line column is kept until all extractions are done.
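ExploreMyData generates the SQL for this step behind the scenes. As a standalone illustration (not the tool's actual implementation), the same extraction can be sketched in plain Python with the `re` module, using the exact pattern from the step above; the sample lines are the ones from the article:

```python
import re

# Same pattern as the Extract Text step: datetime at the start of each line
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def extract_timestamp(line):
    """Return the matched timestamp string, or None (a NULL-like result)."""
    m = TIMESTAMP_RE.search(line)
    return m.group(0) if m else None

line = "2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com from 192.168.1.1"
print(extract_timestamp(line))  # 2024-03-15 09:23:45
```

A line with no timestamp yields `None` rather than raising, which mirrors how a non-matching row would produce a NULL in the extracted column.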
Step 2: Extract the log level
Apply Extract Text again. The log level (ERROR, WARN, INFO, DEBUG) sits between the timestamp and the bracketed service name. You have two options:
- Regex: pattern `(ERROR|WARN|INFO|DEBUG)`
- Between delimiters: after the timestamp's trailing space and before the `[` character

Either way, name the result `log_level`. The regex approach is more reliable here because it matches the known set of levels regardless of extra whitespace.
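To see why the regex approach is robust, here is a minimal sketch of the same match in Python. The `\b` word boundaries are an assumption added for safety, so a level name buried inside the free-text message is less likely to match by accident:

```python
import re

# The known set of levels; \b avoids matching level names glued to other text
LEVEL_RE = re.compile(r"\b(ERROR|WARN|INFO|DEBUG)\b")

def extract_level(line):
    m = LEVEL_RE.search(line)
    return m.group(1) if m else None

print(extract_level("2024-03-15 09:24:18 WARN [db-connector] Slow query"))  # WARN
```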
Step 3: Extract the service name
One more Extract Text. The service name is wrapped in square brackets: [auth-service]. Use the between-delimiters method:
- Method: between delimiters
- Start delimiter: `[`
- End delimiter: `]`
- New column name: `service`
Now you have three new columns alongside the original log line: timestamp_raw, log_level, and service.
| timestamp_raw | log_level | service | log_line (original) |
|---|---|---|---|
| 2024-03-15 09:23:45 | ERROR | auth-service | 2024-03-15 09:23:45 ERROR [auth-service] Failed login… |
| 2024-03-15 09:24:02 | INFO | api-gateway | 2024-03-15 09:24:02 INFO [api-gateway] Request completed… |
| 2024-03-15 09:24:18 | WARN | db-connector | 2024-03-15 09:24:18 WARN [db-connector] Slow query… |
| 2024-03-15 09:25:01 | ERROR | auth-service | 2024-03-15 09:25:01 ERROR [auth-service] Token expired… |
Three new structured columns extracted from a single raw text column. Now filterable, sortable, and groupable by level and service.
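The between-delimiters method is simple enough to sketch directly. This hypothetical helper (not the tool's code) shows the behavior you would expect: the text between the first start delimiter and the next end delimiter, or a NULL-like `None` if either is missing:

```python
def between(text, start, end):
    """Return the substring between the first `start` and the next `end`,
    or None if either delimiter is missing."""
    i = text.find(start)
    if i == -1:
        return None
    j = text.find(end, i + len(start))
    if j == -1:
        return None
    return text[i + len(start):j]

line = "2024-03-15 09:23:45 ERROR [auth-service] Failed login"
print(between(line, "[", "]"))  # auth-service
```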
Step 4: Convert the timestamp
The extracted timestamp is still a string. Select Convert Type from the Transform group:
- Column: `timestamp_raw`
- Target type: TIMESTAMP
DuckDB's TRY_CAST() handles the conversion. Any malformed timestamps become NULL instead of causing an error. Once this is a proper TIMESTAMP, you can sort chronologically, filter by date ranges, and extract date parts.
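The key property of TRY_CAST is that bad input degrades to NULL instead of aborting the whole conversion. A Python sketch of that contract (illustrative only, assuming the article's fixed timestamp format):

```python
from datetime import datetime

def try_cast_timestamp(value):
    # Mimics DuckDB's TRY_CAST: malformed input yields None instead of raising
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError):
        return None

print(try_cast_timestamp("2024-03-15 09:23:45"))  # 2024-03-15 09:23:45
print(try_cast_timestamp("not a date"))           # None
```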
Step 5: Extract the hour for time-of-day analysis
Select Extract Date from the Transform group:
- Source column: `timestamp_raw`
- Part to extract: hour
- New column name: `hour_of_day`
This gives you a number from 0 to 23. Now you can group by hour to see when errors spike. Are most failures happening at 2 AM during batch jobs? Or at 9 AM when users start logging in?
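Under the hood this kind of step maps to a date-part extraction (in DuckDB, something like `EXTRACT(hour FROM ...)`). The equivalent on a parsed Python timestamp is just the `.hour` attribute:

```python
from datetime import datetime

ts = datetime.strptime("2024-03-15 09:23:45", "%Y-%m-%d %H:%M:%S")
print(ts.hour)  # 9  (an integer from 0 to 23)
```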
Step 6: Find the most common entries
Select Top/Bottom from the Filter & Sort group:
- Mode: Top
- Column: `log_level` (or `service`)
- Count: 10
This surfaces the most frequent log levels or the services generating the most entries. If one service accounts for 80% of your ERROR lines, that is where to look first.
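Conceptually this is a frequency count followed by a top-N cut. In Python, `collections.Counter` captures the idea in two lines (hypothetical sample data):

```python
from collections import Counter

levels = ["ERROR", "INFO", "WARN", "ERROR", "ERROR", "INFO"]
top = Counter(levels).most_common(10)  # at most 10 (value, count) pairs, most frequent first
print(top)  # [('ERROR', 3), ('INFO', 2), ('WARN', 1)]
```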
Step 7: Filter to errors only
Select Filter from the Filter & Sort group:
- Column: `log_level`
- Operator: equals
- Value: `ERROR`
Now you are looking at only the error entries, with structured columns you can sort and group. Want to drill into a specific service? Add another filter for service = 'auth-service'.
| timestamp_raw | log_level | service | hour_of_day | message |
|---|---|---|---|---|
| 2024-03-15 09:23:45 | ERROR | auth-service | 9 | Failed login for user@example.com from 192.168.1.1 |
| 2024-03-15 09:25:01 | ERROR | auth-service | 9 | Token expired for session abc123 |
| 2024-03-15 02:11:44 | ERROR | batch-processor | 2 | Timeout after 30s processing job JOB-9902 |
| 2024-03-15 02:14:08 | ERROR | batch-processor | 2 | Retry failed for job JOB-9902 |
Filtered to ERROR level only (the message column here is the remainder of each line after the extracted fields). The batch-processor errors cluster at 2 AM - a batch job failure pattern visible at a glance now that the data is structured.
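Stacking filters like this is just successive predicates over the structured rows. A sketch with hypothetical row dicts shows the equals filter and the drill-down into one service:

```python
rows = [
    {"log_level": "ERROR", "service": "auth-service"},
    {"log_level": "INFO",  "service": "api-gateway"},
    {"log_level": "ERROR", "service": "batch-processor"},
]

# Filter: log_level equals ERROR
errors = [r for r in rows if r["log_level"] == "ERROR"]

# Second filter stacked on top: service = 'auth-service'
auth_errors = [r for r in errors if r["service"] == "auth-service"]

print(len(errors), len(auth_errors))  # 2 1
```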
Going further
Once the data is structured, you can branch your analysis in several directions:
- Use Group By to count errors per service per hour
- Use Extract Text again to pull IP addresses or email addresses from the message portion
- Use Window with COUNT to calculate a rolling error rate
- Export the structured version as a Parquet file for faster re-analysis later
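The first of those branches, counting errors per service per hour, amounts to grouping on a composite key. A minimal sketch with hypothetical (service, hour) pairs taken from the structured columns:

```python
from collections import Counter

# Hypothetical (service, hour_of_day) pairs from ERROR rows
pairs = [
    ("auth-service", 9), ("auth-service", 9),
    ("batch-processor", 2), ("batch-processor", 2),
]

per_service_hour = Counter(pairs)  # group by (service, hour), count rows
print(per_service_hour[("batch-processor", 2)])  # 2
```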
The full pipeline
- Extract Text - regex for timestamp
- Extract Text - regex for log level
- Extract Text - between delimiters for service name
- Convert Type - timestamp_raw to TIMESTAMP
- Extract Date - hour from timestamp
- Top/Bottom - most common error types or services
- Filter - log_level = ERROR
Seven steps to go from a blob of text to a structured, filterable log analysis table. No grep, no awk, no regex debugging in a terminal. The pipeline is visible, each step is editable, and the SQL is generated for you.
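For comparison, the whole extraction side of the pipeline can be condensed into a single regex over the article's line format. This is a sketch under the assumption that every line follows `timestamp LEVEL [service] message` exactly; the tool's step-by-step approach is more forgiving of irregular lines:

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "  # timestamp
    r"(ERROR|WARN|INFO|DEBUG) "                 # log level
    r"\[([^\]]+)\] "                            # [service]
    r"(.*)"                                     # free-text message
)

def parse_line(line):
    """Return a structured row dict, or None for a non-conforming line."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts_raw, level, service, message = m.groups()
    ts = datetime.strptime(ts_raw, "%Y-%m-%d %H:%M:%S")
    return {"timestamp": ts, "log_level": level, "service": service,
            "hour_of_day": ts.hour, "message": message}

row = parse_line("2024-03-15 02:11:44 ERROR [batch-processor] Timeout after 30s processing job JOB-9902")
print(row["service"], row["hour_of_day"])  # batch-processor 2
```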