Parser Node

The parser node is a versatile component in ZephFlow that extracts structured data from string fields in your events. It can parse various log formats into structured key-value pairs, allowing you to transform raw log data into structured events for easier analysis and processing.

Key Features

  • Multi-format Parsing: Support for various log formats including Grok patterns, Syslog, CEF, and Windows multiline logs
  • Flexible Field Selection: Parse specific fields within your events
  • Conditional Parsing: Apply different parsing rules based on field values using dispatch configuration
  • Nested Parsing: Chain multiple parsers together for complex log structures
  • Field Management: Option to remove original raw fields after parsing

Attaching Parser Node Using SDK

ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);

Basic Usage

The parser node requires a ParserConfig object that defines which field to parse and how to parse it:

// Create a parser config for Grok parsing
ParserConfigs.ParserConfig parserConfig = ParserConfigs.ParserConfig.builder()
    .targetField("message")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:content}"))
    .build();

// Add the parser to your flow
ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);

Configuration Details

ParserConfig

The ParserConfig class contains the main configuration parameters:

| Parameter | Description |
|---|---|
| targetField | Field to be parsed. Must be a string field, otherwise processing will fail. |
| removeTargetField | When true, the original field is removed after parsing. When false, both original and parsed fields are present in output. |
| extractionConfig | Defines the parsing method and format (Grok, Syslog, CEF, or Windows multiline). |
| dispatchConfig | Optional configuration for further parsing based on field values. |

Extraction Configurations

ZephFlow provides multiple extraction configurations to handle different log formats efficiently. Each extraction method is optimized for specific log structures and conventions.

Grok Extraction

Grok is a powerful pattern-matching syntax that combines named regular expressions to parse unstructured text into structured data.

Configuration

GrokExtractionConfig grokConfig = GrokExtractionConfig.builder()
    .grokExpression("%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}")
    .build();

| Parameter | Description |
|---|---|
| grokExpression | A pattern string that defines how to extract fields from the text |

Grok Pattern Syntax

Grok patterns use the format %{SYNTAX:SEMANTIC} where:

  • SYNTAX is the pattern name (like IP, TIMESTAMP, NUMBER)
  • SEMANTIC is the field name you want to assign the matched value to
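
For example, the pattern %{IP:client_ip} %{WORD:method} %{NUMBER:status} applied to the text "192.168.1.1 GET 200" would produce three fields: client_ip = "192.168.1.1", method = "GET", and status = "200".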

Common Grok Patterns

| Pattern | Description |
|---|---|
| %{NUMBER} | Matches decimal numbers |
| %{IP} | Matches IPv4 addresses |
| %{TIMESTAMP_ISO8601} | Matches ISO8601 timestamps |
| %{LOGLEVEL} | Matches log levels (INFO, ERROR, etc.) |
| %{GREEDYDATA} | Matches everything remaining |
| %{WORD} | Matches word characters (a-z, A-Z, 0-9, _) |
| %{NOTSPACE} | Matches everything until a space |

Example

// Grok pattern for Apache access logs
ParserConfigs.ParserConfig apacheConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(new GrokExtractionConfig(
        "%{IPORHOST:client_ip} %{NOTSPACE:ident} %{NOTSPACE:auth} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response_code} %{NUMBER:bytes}"
    ))
    .build();

Input:

{
"__raw__": "192.168.1.1 - - [10/Oct/2023:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326"
}

Output:

{
"client_ip": "192.168.1.1",
"ident": "-",
"auth": "-",
"timestamp": "10/Oct/2023:13:55:36 -0700",
"method": "GET",
"request": "/index.html",
"httpversion": "1.1",
"response_code": "200",
"bytes": "2326"
}

Windows Multiline Extraction

This extraction method is specifically designed to parse Windows event logs that span multiple lines, which is common in Windows application and service logs.

Configuration

WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
    .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FIRST_LINE)
    .config(Map.of("key", "value"))
    .build();

| Parameter | Description |
|---|---|
| timestampLocationType | Specifies where to find the timestamp in the log entry |
| config | Additional configuration parameters as key-value pairs |

Timestamp Location Types

| Type | Description |
|---|---|
| NO_TIMESTAMP | Log entries don't contain timestamps |
| FIRST_LINE | Timestamp appears in the first line of each log entry |
| FROM_FIELD | Timestamp is found in a specific field (requires setting target_field in config) |

Using FROM_FIELD Timestamp Location

When your timestamp is in a specific field, configure as follows:

WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
    .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FROM_FIELD)
    .config(Map.of("target_field", "event_time"))
    .build();
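
As with the other extraction types, the Windows multiline config is attached to a ParserConfig. The following is a minimal sketch, assuming the raw multiline event text lives in a field named windows_event (an illustrative name, not a required one):

// Sketch: wiring Windows multiline extraction into a parser config
ParserConfigs.ParserConfig windowsParser = ParserConfigs.ParserConfig.builder()
    .targetField("windows_event")   // illustrative field name
    .removeTargetField(true)
    .extractionConfig(WindowsMultilineExtractionConfig.builder()
        .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FROM_FIELD)
        .config(Map.of("target_field", "event_time"))
        .build())
    .build();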

Syslog Extraction

The Syslog extraction configuration parses standard syslog formatted messages, which are widely used in system and network device logging.

Configuration

SyslogExtractionConfig syslogConfig = SyslogExtractionConfig.builder()
.timestampPattern("MMM d HH:mm:ss")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP
))
.messageBodyDelimiter(':')
.build();

| Parameter | Description |
|---|---|
| timestampPattern | Java date format pattern for parsing the timestamp component. Required when the TIMESTAMP component is present. |
| componentList | Ordered list of syslog components present in the log |
| messageBodyDelimiter | Character that separates the header from the message body (optional) |

Syslog Components

| Component | Description | Parsed Field Name | Example |
|---|---|---|---|
| PRIORITY | Log priority enclosed in angle brackets | priority | <13> |
| VERSION | Syslog protocol version | version | 1 |
| TIMESTAMP | Timestamp of the log event. If present, timestampPattern is required | timestamp | Oct 11 22:14:15 |
| DEVICE | Host name or IP address | deviceId | server1 |
| APP | Application or process name | appName | sshd |
| PROC_ID | Process ID | procId | 12345 |
| MSG_ID | Message identifier | msgId | ID47 |
| STRUCTURED_DATA | Structured data in the format [id@domain key="value"] | structuredData | [exampleSDID@32473 iut="3" eventSource="App"] |
| Remaining log | Whatever content remains after the syslog header | content | Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2 |

Example

// Syslog parser for standard BSD format logs
ParserConfigs.ParserConfig syslogConfig = ParserConfigs.ParserConfig.builder()
.targetField("__raw__")
.extractionConfig(SyslogExtractionConfig.builder()
.timestampPattern("MMM d HH:mm:ss")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP,
SyslogExtractionConfig.ComponentType.PROC_ID
))
.messageBodyDelimiter(':')
.build())
.build();

Input:

{
"__raw__": "Oct 11 22:14:15 server1 sshd 12345: Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}

Output:

{
"timestamp": "Oct 11 22:14:15",
"deviceId": "server1",
"appName": "sshd",
"procId": "12345",
"content": "Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}

RFC5424 Format Example

For modern RFC5424 format syslog messages:

// Syslog parser for RFC5424 format
ParserConfigs.ParserConfig syslog5424Config = ParserConfigs.ParserConfig.builder()
.targetField("log_message")
.extractionConfig(SyslogExtractionConfig.builder()
.timestampPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.PRIORITY,
SyslogExtractionConfig.ComponentType.VERSION,
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP,
SyslogExtractionConfig.ComponentType.PROC_ID,
SyslogExtractionConfig.ComponentType.MSG_ID,
SyslogExtractionConfig.ComponentType.STRUCTURED_DATA
))
.build())
.build();

CEF Extraction

Common Event Format (CEF) is a logging and auditing file format developed by ArcSight, widely used in security information and event management (SIEM) systems.

Configuration

CefExtractionConfig cefConfig = new CefExtractionConfig();

The CEF extraction config doesn't require any additional parameters.

CEF Format Structure

CEF logs follow this structure:

CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension

The header part (before the Extension) contains pipe-delimited fields, while the Extension part contains key-value pairs.

Example

// CEF parser for security logs
ParserConfigs.ParserConfig cefConfig = ParserConfigs.ParserConfig.builder()
.targetField("security_log")
.extractionConfig(new CefExtractionConfig())
.build();

Input:

CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked

Output:

{
"severity": 10,
"dst": "10.0.0.1",
"src": "192.168.1.1",
"deviceVendor": "Vendor",
"__raw__": "CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked",
"dpt": "80",
"deviceVersion": "1.0",
"version": 0,
"deviceEventClassId": "100",
"act": "blocked",
"spt": "1234",
"name": "Intrusion Detected",
"deviceProduct": "Product"
}

Advanced Usage: Multi-level Log Parsing

For complex log formats that require multiple parsing stages, ZephFlow's dispatch configuration enables sophisticated processing pipelines. This example demonstrates how to handle Cisco ASA network security logs.

Complex Example: Cisco ASA Log Parsing

This example shows how to parse complex Cisco ASA logs with multiple levels of extraction.

Input Log:

{
"__raw__": "Oct 10 2018 12:34:56 localhost CiscoASA[999]: %ASA-6-305011: Built dynamic TCP translation from inside:172.31.98.44/1772 to outside:100.66.98.44/8256"
}

Parser Configuration:

{
  "targetField": "__raw__",
  "removeTargetField": false,
  "extractionConfig": {
    "type": "syslog",
    "timestampPattern": "MMM dd yyyy HH:mm:ss",
    "componentList": [
      "TIMESTAMP",
      "DEVICE",
      "APP"
    ],
    "messageBodyDelimiter": ":"
  },
  "dispatchConfig": {
    "dispatchField": "content",
    "dispatchMap": {},
    "defaultConfig": {
      "targetField": "content",
      "removeTargetField": true,
      "extractionConfig": {
        "type": "grok",
        "grokExpression": "%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}"
      },
      "dispatchConfig": {
        "dispatchField": "message_number",
        "dispatchMap": {
          "305011": {
            "targetField": "message_text",
            "removeTargetField": true,
            "extractionConfig": {
              "type": "grok",
              "grokExpression": "%{WORD:action} %{WORD:translation_type} %{WORD:protocol} translation from %{WORD:source_interface}:%{IP:source_ip}/%{INT:source_port} to %{WORD:dest_interface}:%{IP:dest_ip}/%{INT:dest_port}"
            },
            "dispatchConfig": null
          }
        },
        "defaultConfig": null
      }
    }
  }
}

Expected Output:

{
"level": "6",
"message_number": "305011",
"source_interface": "inside",
"__raw__": "Oct 10 2018 12:34:56 localhost CiscoASA[999]: %ASA-6-305011: Built dynamic TCP translation from inside:172.31.98.44/1772 to outside:100.66.98.44/8256",
"appName": "CiscoASA[999]",
"dest_interface": "outside",
"source_ip": "172.31.98.44",
"translation_type": "dynamic",
"deviceId": "localhost",
"protocol": "TCP",
"source_port": "1772",
"dest_ip": "100.66.98.44",
"action": "Built",
"dest_port": "8256",
"timestamp": "Oct 10 2018 12:34:56"
}

How It Works

This example processes logs through three sequential stages:

  1. First Stage (Syslog Parsing):
  • Parses the syslog header format from the __raw__ field
  • Extracts timestamp, device ID, and application name
  • Places the message content after the colon in a field called content
  2. Second Stage (Cisco ASA Format Parsing):
  • The default config in the first dispatchConfig targets the content field
  • Uses Grok to extract the ASA log format with level, message number, and message text
  • For example, parses %ASA-6-305011: Built dynamic TCP translation...
  • Places the message text into the message_text field
  3. Third Stage (Message-Type Specific Parsing):
  • Uses the message_number (305011) to select the appropriate parser
  • Each message type has specific fields to extract
  • For 305011 messages, extracts details about the network translation

This multi-stage approach demonstrates the power of nested dispatch configurations for handling complex, structured logs with varying formats.
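
The same pipeline can also be assembled with the SDK builders shown in the previous sections. The following is a minimal sketch mirroring the JSON configuration above, assuming the ParserConfig, DispatchConfig, GrokExtractionConfig, and SyslogExtractionConfig APIs introduced earlier; treat it as an illustration rather than a drop-in snippet:

// Innermost stage: fields specific to ASA message number 305011
ParserConfigs.ParserConfig asa305011Parser = ParserConfigs.ParserConfig.builder()
    .targetField("message_text")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig(
        "%{WORD:action} %{WORD:translation_type} %{WORD:protocol} translation from %{WORD:source_interface}:%{IP:source_ip}/%{INT:source_port} to %{WORD:dest_interface}:%{IP:dest_ip}/%{INT:dest_port}"))
    .build();

// Middle stage: generic ASA header, dispatching on the extracted message number
ParserConfigs.ParserConfig asaParser = ParserConfigs.ParserConfig.builder()
    .targetField("content")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig("%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}"))
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("message_number")
        .dispatchMap(Map.of("305011", asa305011Parser))
        .build())
    .build();

// Outer stage: syslog header, with the ASA parser as the default next stage
ParserConfigs.ParserConfig ciscoAsaParser = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .removeTargetField(false)
    .extractionConfig(SyslogExtractionConfig.builder()
        .timestampPattern("MMM dd yyyy HH:mm:ss")
        .componentList(List.of(
            SyslogExtractionConfig.ComponentType.TIMESTAMP,
            SyslogExtractionConfig.ComponentType.DEVICE,
            SyslogExtractionConfig.ComponentType.APP))
        .messageBodyDelimiter(':')
        .build())
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("content")
        .defaultConfig(asaParser)
        .build())
    .build();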

Best Practices

Selecting the Right Extraction Configuration

  1. Analyze your log format first to determine which extraction configuration best matches your needs:
  • Use Grok for most text-based logs with consistent formats
  • Use Syslog for standard system and network device logs
  • Use Windows Multiline for Windows Event logs
  • Use CEF for security and SIEM-related logs
  2. Test with sample data to verify your configuration handles all variations in your log format

  3. Create targeted parsers rather than trying to parse everything with one complex configuration

Using Dispatch Configuration

For mixed log formats, use the dispatch configuration to apply different parsing strategies based on initial parsing results:

ParserConfigs.ParserConfig initialParser = ParserConfigs.ParserConfig.builder()
.targetField("raw_log")
.extractionConfig(new GrokExtractionConfig("%{WORD:log_type}: %{GREEDYDATA:log_content}"))
.dispatchConfig(ParserConfigs.DispatchConfig.builder()
.dispatchField("log_type")
.dispatchMap(Map.of(
"apache", apacheParserConfig,
"syslog", syslogParserConfig,
"windows", windowsParserConfig,
"security", cefParserConfig
))
.defaultConfig(defaultParserConfig)
.build())
.build();

Common Pitfalls to Avoid

  1. Overly complex Grok patterns can be difficult to maintain - break them down into smaller, reusable patterns (see the sketch after this list)

  2. Missing components in Syslog configuration - ensure your component list matches the exact format of your logs

  3. Incorrect timestamp patterns - test thoroughly with various date formats that appear in your logs

  4. Performance considerations - very complex parsing on high-volume logs can impact performance; consider pre-filtering or using simpler patterns where possible
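
For the first pitfall, one way to keep Grok patterns small is to split a single large expression into a coarse first pass plus a focused second pass applied through the dispatch default config, as in the multi-level example above. A minimal sketch, reusing only the builder APIs shown earlier (field names such as details are illustrative):

// Second pass: parse the remainder captured by the first pass
ParserConfigs.ParserConfig detailParser = ParserConfigs.ParserConfig.builder()
    .targetField("details")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig("user=%{NOTSPACE:user} ip=%{IP:client_ip} action=%{WORD:action}"))
    .build();

// First pass: a small pattern that peels off the timestamp and level,
// leaving everything else in "details" for the second pass
ParserConfigs.ParserConfig appLogParser = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .removeTargetField(true)
    .extractionConfig(new GrokExtractionConfig("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:details}"))
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("details")
        .dispatchMap(Map.of())
        .defaultConfig(detailParser)
        .build())
    .build();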