
Parsing and Analyzing Server Log Files

You have a CSV export of server log entries. Maybe you pulled them from CloudWatch, maybe someone ran a script that dumped the last 50,000 lines into a spreadsheet. Each row has a single column with a raw log line like:

2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com from 192.168.1.1

That is a wall of text. You cannot filter by log level, you cannot group by service, you cannot chart errors over time. You need to break this apart into structured columns first, and then you can actually analyze it.

Here is how to go from raw log lines to structured, filterable data in ExploreMyData using seven pipeline steps.

Step 1: Extract the timestamp

Open the toolbar and select Extract Text from the Transform group. This operation pulls substrings out of a text column using patterns.

  1. Source column: log_line (or whatever your raw column is called)
  2. Method: regex
  3. Pattern: \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
  4. New column name: timestamp_raw

This regex matches the standard datetime format at the start of each log line. The result is a new VARCHAR column containing just the timestamp string, like 2024-03-15 09:23:45.

| log_line (raw)                                                              | timestamp_raw (extracted) |
| --------------------------------------------------------------------------- | ------------------------- |
| 2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com   | 2024-03-15 09:23:45       |
| 2024-03-15 09:24:02 INFO [api-gateway] Request completed in 142ms            | 2024-03-15 09:24:02       |
| 2024-03-15 09:24:18 WARN [db-connector] Slow query: 3.2s on users table      | 2024-03-15 09:24:18       |
| 2024-03-15 09:25:01 ERROR [auth-service] Token expired for session abc123    | 2024-03-15 09:25:01       |

Regex \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} pulls the timestamp from each line. The raw log_line column is kept until all extractions are done.
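If you want to sanity-check the pattern before typing it into the dialog, the same extraction can be sketched in plain Python with the standard `re` module (this is an illustrative analog, not the SQL the tool generates):

```python
import re

# Sample line from the export; the pattern is the one entered in Extract Text.
line = "2024-03-15 09:23:45 ERROR [auth-service] Failed login for user@example.com from 192.168.1.1"

match = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line)
timestamp_raw = match.group(0) if match else None
print(timestamp_raw)  # → 2024-03-15 09:23:45
```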

Step 2: Extract the log level

Apply Extract Text again. The log level (ERROR, WARN, INFO, DEBUG) sits between the timestamp and the bracketed service name. You have two options:

  • Regex: pattern (ERROR|WARN|INFO|DEBUG)
  • Between delimiters: after the timestamp's trailing space and before the [ character

Either way, name the result log_level. The regex approach is more reliable here because it matches the known set of levels regardless of extra whitespace.
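As a quick sketch of why the alternation pattern is robust, here is the same extraction in plain Python (illustrative only; the first matching level in the line wins):

```python
import re

# The alternation matches exactly one of the four known levels.
line = "2024-03-15 09:24:18 WARN [db-connector] Slow query: 3.2s on users table"

match = re.search(r"(ERROR|WARN|INFO|DEBUG)", line)
log_level = match.group(1) if match else None
print(log_level)  # → WARN
```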

Step 3: Extract the service name

One more Extract Text. The service name is wrapped in square brackets: [auth-service]. Use the between-delimiters method:

  1. Method: between delimiters
  2. Start delimiter: [
  3. End delimiter: ]
  4. New column name: service

Now you have three new columns alongside the original log line: timestamp_raw, log_level, and service.

| timestamp_raw       | log_level | service      | log_line (original)                                       |
| ------------------- | --------- | ------------ | --------------------------------------------------------- |
| 2024-03-15 09:23:45 | ERROR     | auth-service | 2024-03-15 09:23:45 ERROR [auth-service] Failed login…    |
| 2024-03-15 09:24:02 | INFO      | api-gateway  | 2024-03-15 09:24:02 INFO [api-gateway] Request completed… |
| 2024-03-15 09:24:18 | WARN      | db-connector | 2024-03-15 09:24:18 WARN [db-connector] Slow query…       |
| 2024-03-15 09:25:01 | ERROR     | auth-service | 2024-03-15 09:25:01 ERROR [auth-service] Token expired…   |

Three new structured columns extracted from a single raw text column. Now filterable, sortable, and groupable by level and service.
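The between-delimiters method amounts to taking the text after the first `[` and before the next `]`. A minimal Python sketch of that behavior:

```python
# Between-delimiters extraction: substring after the first "[" and
# before the next "]". Illustrative analog of the Extract Text step.
line = "2024-03-15 09:25:01 ERROR [auth-service] Token expired for session abc123"

start = line.index("[") + 1
end = line.index("]", start)
service = line[start:end]
print(service)  # → auth-service
```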

Step 4: Convert the timestamp

The extracted timestamp is still a string. Select Convert Type from the Transform group:

  1. Column: timestamp_raw
  2. Target type: TIMESTAMP

DuckDB's TRY_CAST() handles the conversion. Any malformed timestamps become NULL instead of causing an error. Once this is a proper TIMESTAMP, you can sort chronologically, filter by date ranges, and extract date parts.
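A rough Python analog of TRY_CAST's behavior, for intuition: parse the string, and return None (NULL) on anything malformed instead of raising. The format string below assumes the timestamp format shown above.

```python
from datetime import datetime

def try_cast_timestamp(s):
    """Rough analog of DuckDB's TRY_CAST: None on malformed input, not an error."""
    try:
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError):
        return None

print(try_cast_timestamp("2024-03-15 09:23:45"))  # → 2024-03-15 09:23:45
print(try_cast_timestamp("not a timestamp"))      # → None
```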

Step 5: Extract the hour for time-of-day analysis

Select Extract Date from the Transform group:

  1. Source column: timestamp_raw
  2. Part to extract: hour
  3. New column name: hour_of_day

This gives you a number from 0 to 23. Now you can group by hour to see when errors spike. Are most failures happening at 2 AM during batch jobs? Or at 9 AM when users start logging in?
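In Python terms, this step is equivalent to reading the `.hour` attribute of a parsed timestamp (again, just an analog of what the Extract Date operation does):

```python
from datetime import datetime

ts = datetime.strptime("2024-03-15 09:23:45", "%Y-%m-%d %H:%M:%S")
hour_of_day = ts.hour  # integer from 0 to 23
print(hour_of_day)  # → 9
```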

Step 6: Find the most common entries

Select Top/Bottom from the Filter & Sort group:

  1. Mode: Top
  2. Column: log_level (or service)
  3. Count: 10

This surfaces the most frequent log levels or the services generating the most entries. If one service accounts for 80% of your ERROR lines, that is where to look first.
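Conceptually this is a frequency count with the top N kept, which you can sketch with Python's `collections.Counter` (the service values below are hypothetical sample data):

```python
from collections import Counter

# Hypothetical already-extracted service values.
services = ["auth-service", "auth-service", "api-gateway",
            "db-connector", "auth-service", "batch-processor"]

top = Counter(services).most_common(10)
print(top[0])  # → ('auth-service', 3): the noisiest service
```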

Step 7: Filter to errors only

Select Filter from the Filter & Sort group:

  1. Column: log_level
  2. Operator: equals
  3. Value: ERROR

Now you are looking at only the error entries, with structured columns you can sort and group. Want to drill into a specific service? Add another filter for service = 'auth-service'.

| timestamp_raw       | log_level | service         | hour_of_day | message                                            |
| ------------------- | --------- | --------------- | ----------- | -------------------------------------------------- |
| 2024-03-15 09:23:45 | ERROR     | auth-service    | 9           | Failed login for user@example.com from 192.168.1.1 |
| 2024-03-15 09:25:01 | ERROR     | auth-service    | 9           | Token expired for session abc123                   |
| 2024-03-15 02:11:44 | ERROR     | batch-processor | 2           | Timeout after 30s processing job JOB-9902          |
| 2024-03-15 02:14:08 | ERROR     | batch-processor | 2           | Retry failed for job JOB-9902                      |

Filtered to ERROR level only. (The message column here is the remainder of each line after the closing bracket, which one more Extract Text pass can pull out.) The batch-processor errors cluster at 2 AM: a batch job failure pattern visible at a glance now that the data is structured.
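Stacking filters is just successive row selection. A minimal Python sketch with hypothetical sample rows:

```python
# Hypothetical rows of already-extracted (log_level, service) columns.
rows = [
    ("ERROR", "auth-service"),
    ("INFO", "api-gateway"),
    ("WARN", "db-connector"),
    ("ERROR", "batch-processor"),
]

errors = [r for r in rows if r[0] == "ERROR"]                 # first filter
auth_errors = [r for r in errors if r[1] == "auth-service"]   # drill-down filter
print(len(errors), len(auth_errors))  # → 2 1
```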

Going further

Once the data is structured, you can branch your analysis in several directions:

  • Use Group By to count errors per service per hour
  • Use Extract Text again to pull IP addresses or email addresses from the message portion
  • Use Window with COUNT to calculate a rolling error rate
  • Export the structured version as a Parquet file for faster re-analysis later
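For the second bullet, patterns along these lines would work in another Extract Text pass (shown here in Python for easy testing; both regexes are simplified illustrations, not fully RFC-strict):

```python
import re

message = "Failed login for user@example.com from 192.168.1.1"

# Simplified patterns: dotted-quad IP and a loose email shape.
ip = re.search(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", message)
email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", message)

print(ip.group(0))     # → 192.168.1.1
print(email.group(0))  # → user@example.com
```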

The full pipeline

  1. Extract Text - regex for timestamp
  2. Extract Text - regex for log level
  3. Extract Text - between delimiters for service name
  4. Convert Type - timestamp_raw to TIMESTAMP
  5. Extract Date - hour from timestamp
  6. Top/Bottom - most common error types or services
  7. Filter - log_level = ERROR

Seven steps to go from a blob of text to a structured, filterable log analysis table. No grep, no awk, no regex debugging in a terminal. The pipeline is visible, each step is editable, and the SQL is generated for you.

Analyze your log files →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData