Parser Node

The parser node is a versatile ZephFlow component that extracts structured data from string fields in your events. It parses a variety of log formats into key-value pairs, turning raw log data into structured events for easier analysis and processing.

Key Features

  • Multi-format Parsing: Support for various log formats including Grok patterns, Syslog, CEF, Windows multiline logs, delimited text (CSV/TSV), JSON, and key-value pairs
  • Flexible Field Selection: Parse specific fields within your events
  • Field Management: Option to remove original raw fields after parsing

Attaching Parser Node Using SDK

ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);

Basic Usage

The parser node requires a ParserConfig object that defines which field to parse and how to parse it:

// Create a parser config for Grok parsing
ParserConfigs.ParserConfig parserConfig = ParserConfigs.ParserConfig.builder()
    .targetField("message")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:content}"))
    .build();

// Add the parser to your flow
ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);

Configuration Details

ParserConfig

The ParserConfig class contains the main configuration parameters:

| Parameter | Description |
|---|---|
| targetField | Field to be parsed. Must be a string field, otherwise processing will fail. |
| removeTargetField | When true, the original field is removed after parsing. When false, both original and parsed fields are present in output. |
| extractionConfig | Defines the parsing method and format (Grok, Syslog, CEF, Windows multiline, delimited text, JSON, or key-value pairs). |
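
For example, with the Grok configuration from Basic Usage but removeTargetField(false), the original field is kept alongside the parsed fields (an illustrative input and output, not captured from a real run):

Input:

{
  "message": "2023-10-15T14:32:01Z ERROR disk full"
}

Output:

{
  "message": "2023-10-15T14:32:01Z ERROR disk full",
  "timestamp": "2023-10-15T14:32:01Z",
  "level": "ERROR",
  "content": "disk full"
}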

Extraction Configurations

ZephFlow provides multiple extraction configurations to handle different log formats efficiently. Each extraction method is optimized for specific log structures and conventions.

Grok Extraction

Grok is a powerful pattern-matching syntax that combines named regular expressions to parse unstructured text into structured data.

Configuration

GrokExtractionConfig grokConfig = GrokExtractionConfig.builder()
    .grokExpression("%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}")
    .build();

| Parameter | Description |
|---|---|
| grokExpression | A pattern string that defines how to extract fields from the text |

Grok Pattern Syntax

Grok patterns use the format %{SYNTAX:SEMANTIC} where:

  • SYNTAX is the pattern name (like IP, TIMESTAMP, NUMBER)
  • SEMANTIC is the field name you want to assign the matched value to
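
For instance, the pattern %{IP:client} %{WORD:action} applied to the input text 10.0.0.1 DENY yields two fields (a minimal illustration; the field names here are arbitrary):

{
  "client": "10.0.0.1",
  "action": "DENY"
}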

Common Grok Patterns

| Pattern | Description |
|---|---|
| %{NUMBER} | Matches decimal numbers |
| %{IP} | Matches IPv4 addresses |
| %{TIMESTAMP_ISO8601} | Matches ISO8601 timestamps |
| %{LOGLEVEL} | Matches log levels (INFO, ERROR, etc.) |
| %{GREEDYDATA} | Matches everything remaining |
| %{WORD} | Matches word characters (a-z, A-Z, 0-9, _) |
| %{NOTSPACE} | Matches everything until a space |

Example

// Grok pattern for Apache access logs
ParserConfigs.ParserConfig apacheConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(new GrokExtractionConfig(
        "%{IPORHOST:client_ip} %{NOTSPACE:ident} %{NOTSPACE:auth} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response_code} %{NUMBER:bytes}"
    ))
    .build();

Input:

{
  "__raw__": "192.168.1.1 - - [10/Oct/2023:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326"
}

Output:

{
  "client_ip": "192.168.1.1",
  "ident": "-",
  "auth": "-",
  "timestamp": "10/Oct/2023:13:55:36 -0700",
  "method": "GET",
  "request": "/index.html",
  "httpversion": "1.1",
  "response_code": "200",
  "bytes": "2326"
}

Windows Multiline Extraction

This extraction method is specifically designed to parse Windows event logs that span multiple lines, a format common in Windows application and service logs.

Configuration

WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
    .timestampLocationType(TimestampLocationType.FIRST_LINE)
    .config(Map.of("key", "value"))
    .build();

| Parameter | Description |
|---|---|
| timestampLocationType | Specifies where to find the timestamp in the log entry |
| config | Additional configuration parameters as key-value pairs |

Timestamp Location Types

| Type | Description |
|---|---|
| NO_TIMESTAMP | Log entries don't contain timestamps |
| FIRST_LINE | Timestamp appears in the first line of each log entry |
| FROM_FIELD | Timestamp is found in a specific field (requires setting target_field in config) |

Using FROM_FIELD Timestamp Location

When your timestamp is in a specific field, configure as follows:

WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
    .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FROM_FIELD)
    .config(Map.of("target_field", "event_time"))
    .build();
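
To attach either variant to a flow, wrap the extraction config in a ParserConfig as with the other formats (a minimal sketch following the Basic Usage pattern; the event_time field name carries over from the snippet above):

// Windows multiline parser attached to a flow (illustrative sketch)
ParserConfigs.ParserConfig windowsParserConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .removeTargetField(true)
    .extractionConfig(WindowsMultilineExtractionConfig.builder()
        .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FROM_FIELD)
        .config(Map.of("target_field", "event_time"))
        .build())
    .build();

ZephFlow flow = ZephFlow.startFlow();
flow.parse(windowsParserConfig);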

Syslog Extraction

The Syslog extraction configuration parses standard syslog formatted messages, which are widely used in system and network device logging.

Configuration

SyslogExtractionConfig syslogConfig = SyslogExtractionConfig.builder()
    .timestampPattern("MMM d HH:mm:ss")
    .componentList(List.of(
        SyslogExtractionConfig.ComponentType.TIMESTAMP,
        SyslogExtractionConfig.ComponentType.DEVICE,
        SyslogExtractionConfig.ComponentType.APP
    ))
    .messageBodyDelimiter(':')
    .build();

| Parameter | Description |
|---|---|
| timestampPattern | Java date format pattern for parsing the timestamp component; required when the TIMESTAMP component is present |
| componentList | Ordered list of syslog components present in the log |
| messageBodyDelimiter | Character that separates the header from the message body (optional) |

Syslog Components

| Component | Description | Parsed Field Name | Example |
|---|---|---|---|
| PRIORITY | Log priority enclosed in angle brackets | priority | <13> |
| VERSION | Syslog protocol version | version | 1 |
| TIMESTAMP | Timestamp of the log event; if present, timestampPattern is required | timestamp | Oct 11 22:14:15 |
| DEVICE | Host name or IP address | deviceId | server1 |
| APP | Application or process name | appName | sshd |
| PROC_ID | Process ID | procId | 12345 |
| MSG_ID | Message identifier | msgId | ID47 |
| STRUCTURED_DATA | Structured data in the format [id@domain key="value"] | structuredData | [exampleSDID@32473 iut="3" eventSource="App"] |
| Remaining log | Whatever content remains after the syslog header | content | Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2 |

Example

// Syslog parser for standard BSD format logs
ParserConfigs.ParserConfig syslogConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(SyslogExtractionConfig.builder()
        .timestampPattern("MMM d HH:mm:ss")
        .componentList(List.of(
            SyslogExtractionConfig.ComponentType.TIMESTAMP,
            SyslogExtractionConfig.ComponentType.DEVICE,
            SyslogExtractionConfig.ComponentType.APP,
            SyslogExtractionConfig.ComponentType.PROC_ID
        ))
        .messageBodyDelimiter(':')
        .build())
    .build();

Input:

{
  "__raw__": "Oct 11 22:14:15 server1 sshd 12345: Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}

Output:

{
  "timestamp": "Oct 11 22:14:15",
  "deviceId": "server1",
  "appName": "sshd",
  "procId": "12345",
  "content": "Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}

RFC5424 Format Example

For modern RFC5424 format syslog messages:

// Syslog parser for RFC5424 format
ParserConfigs.ParserConfig syslog5424Config = ParserConfigs.ParserConfig.builder()
    .targetField("log_message")
    .extractionConfig(SyslogExtractionConfig.builder()
        .timestampPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
        .componentList(List.of(
            SyslogExtractionConfig.ComponentType.PRIORITY,
            SyslogExtractionConfig.ComponentType.VERSION,
            SyslogExtractionConfig.ComponentType.TIMESTAMP,
            SyslogExtractionConfig.ComponentType.DEVICE,
            SyslogExtractionConfig.ComponentType.APP,
            SyslogExtractionConfig.ComponentType.PROC_ID,
            SyslogExtractionConfig.ComponentType.MSG_ID,
            SyslogExtractionConfig.ComponentType.STRUCTURED_DATA
        ))
        .build())
    .build();
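
As a rough illustration of how such a message decomposes (hypothetical input and output assembled from the components table above; exact value formatting, such as whether priority retains its angle brackets, may differ):

Input:

{
  "log_message": "<13>1 2023-10-11T22:14:15.003Z server1 sshd 12345 ID47 [exampleSDID@32473 iut=\"3\" eventSource=\"App\"] Connection established"
}

Output:

{
  "priority": "<13>",
  "version": "1",
  "timestamp": "2023-10-11T22:14:15.003Z",
  "deviceId": "server1",
  "appName": "sshd",
  "procId": "12345",
  "msgId": "ID47",
  "structuredData": "[exampleSDID@32473 iut=\"3\" eventSource=\"App\"]",
  "content": "Connection established"
}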

CEF Extraction

Common Event Format (CEF) is a logging and auditing file format developed by ArcSight, widely used in security information and event management (SIEM) systems.

Configuration

CefExtractionConfig cefConfig = new CefExtractionConfig();

The CEF extraction config doesn't require any additional parameters.

CEF Format Structure

CEF logs follow this structure:

CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension

The header part (before the Extension) contains pipe-delimited fields, while the Extension part contains key-value pairs.

Example

// CEF parser for security logs
ParserConfigs.ParserConfig cefConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(new CefExtractionConfig())
    .build();

Input:

{
  "__raw__": "CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked"
}

Output:

{
  "severity": 10,
  "dst": "10.0.0.1",
  "src": "192.168.1.1",
  "deviceVendor": "Vendor",
  "__raw__": "CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked",
  "dpt": "80",
  "deviceVersion": "1.0",
  "version": 0,
  "deviceEventClassId": "100",
  "act": "blocked",
  "spt": "1234",
  "name": "Intrusion Detected",
  "deviceProduct": "Product"
}

Delimited Text Extraction

The Delimited Text extraction configuration parses structured text files where values are separated by a specific delimiter, such as CSV (comma-separated values), TSV (tab-separated values), or custom delimiters.

Configuration

DelimitedTextExtractionConfig delimiterConfig = DelimitedTextExtractionConfig.builder()
    .delimiter(",")
    .columns(List.of("user_id", "username", "email", "status", "last_login"))
    .build();

| Parameter | Description |
|---|---|
| delimiter | The character(s) used to separate values (e.g., ",", "\t", "\|") |
| columns | Ordered list of column names corresponding to the delimited values |

Features

  • Correctly handles quoted values containing delimiters
  • Supports escaped quotes within values

Example

// CSV parser for user activity logs
ParserConfigs.ParserConfig csvConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(DelimitedTextExtractionConfig.builder()
        .delimiter(",")
        .columns(List.of("timestamp", "user_id", "action", "resource", "status"))
        .build())
    .build();

Input:

{
  "__raw__": "2023-10-15T14:32:01Z,12345,LOGIN,/dashboard,SUCCESS"
}

Output:

{
  "timestamp": "2023-10-15T14:32:01Z",
  "user_id": "12345",
  "action": "LOGIN",
  "resource": "/dashboard",
  "status": "SUCCESS"
}

Handling Quoted Values

The parser correctly handles quoted values that contain the delimiter:

Input:

{
  "__raw__": "101,\"Smith, John\",john@example.com,ACTIVE"
}

Output:

{
  "user_id": "101",
  "name": "Smith, John",
  "email": "john@example.com",
  "status": "ACTIVE"
}

JSON Extraction

The JSON extraction configuration parses a JSON string and stores the resulting structured object in a specified field.

Configuration

JsonExtractionConfig jsonConfig = JsonExtractionConfig.builder()
    .outputFieldName("parsed_data")
    .build();

| Parameter | Description |
|---|---|
| outputFieldName | The field name to store the parsed JSON structure |

Example

// JSON parser for embedded JSON strings
ParserConfigs.ParserConfig jsonConfig = ParserConfigs.ParserConfig.builder()
    .targetField("json_string")
    .removeTargetField(true)
    .extractionConfig(JsonExtractionConfig.builder()
        .outputFieldName("event_data")
        .build())
    .build();

Input:

{
  "json_string": "{\"user\":\"alice\",\"action\":\"login\",\"ip\":\"192.168.1.100\"}"
}

Output:

{
  "event_data": {
    "user": "alice",
    "action": "login",
    "ip": "192.168.1.100"
  }
}

Key-Value Pair Extraction

The Key-Value Pair extraction configuration parses strings containing key-value pairs with configurable separators.

Configuration

KvPairExtractionConfig kvConfig = KvPairExtractionConfig.builder()
    .pairSeparator(',')
    .kvSeparator('=')
    .build();

| Parameter | Description |
|---|---|
| pairSeparator | Character that separates key-value pairs (e.g., ',') |
| kvSeparator | Character that separates keys from values (e.g., '=') |

Features

  • Handles quoted values containing separators
  • Supports escaped quotes within values
  • Flexible configuration for different key-value formats

Example

// Key-value parser for application logs
ParserConfigs.ParserConfig kvConfig = ParserConfigs.ParserConfig.builder()
    .targetField("metadata")
    .removeTargetField(true)
    .extractionConfig(KvPairExtractionConfig.builder()
        .pairSeparator(' ')
        .kvSeparator('=')
        .build())
    .build();

Input:

{
  "metadata": "user=john status=active role=admin last_login=2023-10-15"
}

Output:

{
  "user": "john",
  "status": "active",
  "role": "admin",
  "last_login": "2023-10-15"
}

Handling Quoted Values

The parser correctly handles quoted values containing separators:

Input:

{
  "metadata": "name=\"Smith, John\" email=john@example.com dept=\"Sales, North\""
}

Output:

{
  "name": "Smith, John",
  "email": "john@example.com",
  "dept": "Sales, North"
}

Multi-Stage Parsing with DAG

For complex log formats that require multiple parsing stages, you can chain multiple parser nodes together in a DAG (Directed Acyclic Graph). Each parser node processes the output of the previous node, enabling sophisticated processing pipelines for nested or multi-layered log formats.

How It Works

Multi-stage parsing is achieved by connecting parser nodes sequentially in your flow:

ZephFlow flow = ZephFlow.startFlow();

// Stage 1: Parse syslog header
ZephFlow stage1 = flow.fileSource("logs.txt", EncodingType.STRING_LINE)
    .parse(ParserConfigs.ParserConfig.builder()
        .targetField("__raw__")
        .extractionConfig(SyslogExtractionConfig.builder()
            .timestampPattern("MMM dd yyyy HH:mm:ss")
            .componentList(List.of(
                SyslogExtractionConfig.ComponentType.TIMESTAMP,
                SyslogExtractionConfig.ComponentType.DEVICE,
                SyslogExtractionConfig.ComponentType.APP
            ))
            .messageBodyDelimiter(':')
            .build())
        .build());

// Stage 2: Parse application-specific format
ZephFlow stage2 = stage1.parse(ParserConfigs.ParserConfig.builder()
    .targetField("content")
    .removeTargetField(true)
    .extractionConfig(GrokExtractionConfig.builder()
        .grokExpression("%{WORD:severity}-%{INT:code}: %{GREEDYDATA:message}")
        .build())
    .build());

// Stage 3: Branch and parse message-specific details
ZephFlow type1Flow = stage2
    .filter("$.code == '305011'")
    .parse(ParserConfigs.ParserConfig.builder()
        .targetField("message")
        .extractionConfig(GrokExtractionConfig.builder()
            .grokExpression("%{WORD:action} %{WORD:protocol} from %{IP:src_ip}/%{INT:src_port}")
            .build())
        .build());

ZephFlow type2Flow = stage2
    .filter("$.code == '106023'")
    .parse(ParserConfigs.ParserConfig.builder()
        .targetField("message")
        .extractionConfig(/* different pattern for this message type */)
        .build());

// Merge branches and output
ZephFlow output = ZephFlow.merge(type1Flow, type2Flow)
    .stdoutSink(EncodingType.JSON_OBJECT);

Key Benefits

  • Modularity: Each parsing stage is a separate, testable node
  • Flexibility: Easy to add, remove, or modify parsing stages
  • Conditional Processing: Use filter nodes to branch based on extracted fields
  • Reusability: Common parsing stages can be shared across different pipelines

Complete Example

For a comprehensive example of multi-stage parsing, see the Cisco ASA Log Processing Tutorial, which demonstrates:

  • Parsing syslog headers
  • Extracting application-specific metadata
  • Branching based on message types
  • Message-specific field extraction
  • Transforming to standardized formats (OCSF)

Best Practices

Selecting the Right Extraction Configuration

  1. Analyze your log format first to determine which extraction configuration best matches your needs:
     • Use Grok for most text-based logs with consistent formats
     • Use Syslog for standard system and network device logs
     • Use Windows Multiline for Windows Event logs
     • Use CEF for security and SIEM-related logs
     • Use Delimited Text for CSV, TSV, or other delimiter-separated logs
     • Use JSON for parsing embedded JSON strings
     • Use Key-Value Pair for logs with key=value formatted data

  2. Test with sample data to verify your configuration handles all variations in your log format (see the sketch after this list)

  3. Create targeted parsers rather than trying to parse everything with one complex configuration

  4. Use DAG composition for multi-stage parsing - chain parser nodes together rather than creating overly complex single-stage parsers
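
A quick way to test with sample data is to run your parser over a file of representative raw lines and inspect the printed events, using only the flow primitives shown in this document (a sketch; samples.txt is a hypothetical file of sample log lines):

// Feed sample log lines through the parser and print the parsed
// events to stdout for inspection (illustrative sketch).
ZephFlow.startFlow()
    .fileSource("samples.txt", EncodingType.STRING_LINE)
    .parse(parserConfig)
    .stdoutSink(EncodingType.JSON_OBJECT);

Checking a handful of representative lines this way catches most pattern mismatches and timestamp-format surprises before the parser sees production traffic.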

Common Pitfalls to Avoid

  1. Overly complex Grok patterns can be difficult to maintain - break them down into smaller, reusable patterns

  2. Missing components in Syslog configuration - ensure your component list matches the exact format of your logs

  3. Incorrect timestamp patterns - test thoroughly with various date formats that appear in your logs

  4. Performance considerations - very complex parsing on high-volume logs can impact performance; consider pre-filtering or using simpler patterns where possible (see the sketch below)
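
As one example of pre-filtering, when only a subset of events needs an expensive second-stage Grok pattern, filter on a cheap field extracted in an earlier stage (a sketch building on stage1 from the multi-stage example; it assumes the filter expression language can reference appName the same way it references code there, and the Grok pattern itself is hypothetical):

// Only sshd events reach the costly Grok stage; all other events
// skip it entirely (illustrative sketch).
ZephFlow sshdDetails = stage1
    .filter("$.appName == 'sshd'")
    .parse(ParserConfigs.ParserConfig.builder()
        .targetField("content")
        .extractionConfig(GrokExtractionConfig.builder()
            .grokExpression("%{GREEDYDATA:event} from %{IP:src_ip} port %{INT:src_port} %{WORD:protocol}")
            .build())
        .build());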