Parser Node
The parser node is a versatile component in ZephFlow that extracts structured data from string fields in your events. It can parse various log formats into structured key-value pairs, allowing you to transform raw log data into structured events for easier analysis and processing.
Key Features
- Multi-format Parsing: Support for various log formats including Grok patterns, Syslog, CEF, and Windows multiline logs
- Flexible Field Selection: Parse specific fields within your events
- Conditional Parsing: Apply different parsing rules based on field values using dispatch configuration
- Nested Parsing: Chain multiple parsers together for complex log structures
- Field Management: Option to remove original raw fields after parsing
Attaching Parser Node Using SDK
ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);
Basic Usage
The parser node requires a ParserConfig object that defines which field to parse and how to parse it:
// Create a parser config for Grok parsing
ParserConfigs.ParserConfig parserConfig = ParserConfigs.ParserConfig.builder()
.targetField("message")
.removeTargetField(true)
.extractionConfig(new GrokExtractionConfig("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:content}"))
.build();
// Add the parser to your flow
ZephFlow flow = ZephFlow.startFlow();
flow.parse(parserConfig);
Configuration Details
ParserConfig
The ParserConfig class contains the main configuration parameters:
Parameter | Description |
---|---|
targetField | Field to be parsed. Must be a string field, otherwise processing will fail. |
removeTargetField | When true, the original field is removed after parsing. When false, both original and parsed fields are present in output. |
extractionConfig | Defines the parsing method and format (Grok, Syslog, CEF, or Windows multiline). |
dispatchConfig | Optional configuration for further parsing based on field values. |
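To make the removeTargetField behavior concrete, here is a minimal sketch; the message field and the sample values are hypothetical:
// Keep the original field in the output alongside the parsed fields
ParserConfigs.ParserConfig keepRawConfig = ParserConfigs.ParserConfig.builder()
    .targetField("message")
    .removeTargetField(false)
    .extractionConfig(new GrokExtractionConfig("%{LOGLEVEL:level} %{GREEDYDATA:content}"))
    .build();

// Given {"message": "ERROR disk full"}, the output would contain "message",
// "level", and "content"; with removeTargetField(true), "message" would be dropped.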
Extraction Configurations
ZephFlow provides multiple extraction configurations to handle different log formats efficiently. Each extraction method is optimized for specific log structures and conventions.
Grok Extraction
Grok is a powerful pattern-matching syntax that combines named regular expressions to parse unstructured text into structured data.
Configuration
GrokExtractionConfig grokConfig = GrokExtractionConfig.builder()
.grokExpression("%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}")
.build();
Parameter | Description |
---|---|
grokExpression | A pattern string that defines how to extract fields from the text |
Grok Pattern Syntax
Grok patterns use the format %{SYNTAX:SEMANTIC}, where:
- SYNTAX is the pattern name (like IP, TIMESTAMP, NUMBER)
- SEMANTIC is the field name you want to assign the matched value to
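For example, the sketch below names three captures; the field names are illustrative:
// Each %{SYNTAX:SEMANTIC} pair produces one named field
GrokExtractionConfig exampleGrok = GrokExtractionConfig.builder()
    .grokExpression("%{IP:client_ip} %{WORD:method} %{NUMBER:status}")
    .build();

// Applied to "10.0.0.5 GET 200", this would yield
// { "client_ip": "10.0.0.5", "method": "GET", "status": "200" }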
Common Grok Patterns
Pattern | Description |
---|---|
%{NUMBER} | Matches decimal numbers |
%{IP} | Matches IPv4 addresses |
%{TIMESTAMP_ISO8601} | Matches ISO8601 timestamps |
%{LOGLEVEL} | Matches log levels (INFO, ERROR, etc.) |
%{GREEDYDATA} | Matches everything remaining |
%{WORD} | Matches word characters (a-z, A-Z, 0-9, _) |
%{NOTSPACE} | Matches any sequence of non-whitespace characters |
Example
// Grok pattern for Apache access logs
ParserConfigs.ParserConfig apacheConfig = ParserConfigs.ParserConfig.builder()
.targetField("__raw__")
.extractionConfig(new GrokExtractionConfig(
"%{IPORHOST:client_ip} %{NOTSPACE:ident} %{NOTSPACE:auth} \\[%{HTTPDATE:timestamp}\\] \"%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response_code} %{NUMBER:bytes}"
))
.build();
Input:
{
"__raw__": "192.168.1.1 - - [10/Oct/2023:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326"
}
Output:
{
"client_ip": "192.168.1.1",
"ident": "-",
"auth": "-",
"timestamp": "10/Oct/2023:13:55:36 -0700",
"method": "GET",
"request": "/index.html",
"httpversion": "1.1",
"response_code": "200",
"bytes": "2326"
}
Windows Multiline Extraction
This extraction method is specifically designed to parse Windows event logs that span multiple lines, which is common in Windows applications and services logs.
Configuration
WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
.timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FIRST_LINE)
.config(Map.of("key", "value"))
.build();
Parameter | Description |
---|---|
timestampLocationType | Specifies where to find the timestamp in the log entry |
config | Additional configuration parameters as key-value pairs |
Timestamp Location Types
Type | Description |
---|---|
NO_TIMESTAMP | Log entries don't contain timestamps |
FIRST_LINE | Timestamp appears in the first line of each log entry |
FROM_FIELD | Timestamp is found in a specific field (requires setting target_field in config) |
Using FROM_FIELD Timestamp Location
When your timestamp is in a specific field, configure as follows:
WindowsMultilineExtractionConfig windowsConfig = WindowsMultilineExtractionConfig.builder()
.timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FROM_FIELD)
.config(Map.of("target_field", "event_time"))
.build();
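A complete parser configuration for Windows event logs could look like the sketch below. The win_event field name is hypothetical, and passing an empty config map for FIRST_LINE is an assumption:
// Windows multiline parser wired into a full ParserConfig (sketch)
ParserConfigs.ParserConfig windowsParserConfig = ParserConfigs.ParserConfig.builder()
    .targetField("win_event")  // hypothetical field holding the multiline event text
    .removeTargetField(true)
    .extractionConfig(WindowsMultilineExtractionConfig.builder()
        .timestampLocationType(WindowsMultilineExtractionConfig.TimestampLocationType.FIRST_LINE)
        .config(Map.of())      // assumption: no extra parameters for FIRST_LINE
        .build())
    .build();

// Attach it to a flow like any other parser
ZephFlow flow = ZephFlow.startFlow();
flow.parse(windowsParserConfig);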
Syslog Extraction
The Syslog extraction configuration parses standard syslog formatted messages, which are widely used in system and network device logging.
Configuration
SyslogExtractionConfig syslogConfig = SyslogExtractionConfig.builder()
.timestampPattern("MMM d HH:mm:ss")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP
))
.messageBodyDelimiter(':')
.build();
Parameter | Description |
---|---|
timestampPattern | Java date format pattern for parsing the timestamp component. Required when the TIMESTAMP component is present. |
componentList | Ordered list of syslog components present in the log |
messageBodyDelimiter | Character that separates the header from the message body (optional) |
Syslog Components
Component | Description | Parsed Field Name | Example |
---|---|---|---|
PRIORITY | Log priority enclosed in angle brackets | priority | <13> |
VERSION | Syslog protocol version | version | 1 |
TIMESTAMP | Timestamp of the log event. If present, timestampPattern is required | timestamp | Oct 11 22:14:15 |
DEVICE | Host name or IP address | deviceId | server1 |
APP | Application or process name | appName | sshd |
PROC_ID | Process ID | procId | 12345 |
MSG_ID | Message identifier | msgId | ID47 |
STRUCTURED_DATA | Structured data in the format [id@domain key="value"] | structuredData | [exampleSDID@32473 iut="3" eventSource="App"] |
Remaining log | Any remaining content after the syslog header | content | Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2 |
Example
// Syslog parser for standard BSD format logs
ParserConfigs.ParserConfig syslogConfig = ParserConfigs.ParserConfig.builder()
.targetField("__raw__")
.extractionConfig(SyslogExtractionConfig.builder()
.timestampPattern("MMM d HH:mm:ss")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP,
SyslogExtractionConfig.ComponentType.PROC_ID
))
.messageBodyDelimiter(':')
.build())
.build();
Input:
{
"__raw__": "Oct 11 22:14:15 server1 sshd 12345: Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}
Output:
{
"timestamp": "Oct 11 22:14:15",
"deviceId": "server1",
"appName": "sshd",
"procId": "12345",
"content": "Failed password for invalid user admin from 192.168.1.10 port 55279 ssh2"
}
RFC5424 Format Example
For modern RFC5424 format syslog messages:
// Syslog parser for RFC5424 format
ParserConfigs.ParserConfig syslog5424Config = ParserConfigs.ParserConfig.builder()
.targetField("log_message")
.extractionConfig(SyslogExtractionConfig.builder()
.timestampPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
.componentList(List.of(
SyslogExtractionConfig.ComponentType.PRIORITY,
SyslogExtractionConfig.ComponentType.VERSION,
SyslogExtractionConfig.ComponentType.TIMESTAMP,
SyslogExtractionConfig.ComponentType.DEVICE,
SyslogExtractionConfig.ComponentType.APP,
SyslogExtractionConfig.ComponentType.PROC_ID,
SyslogExtractionConfig.ComponentType.MSG_ID,
SyslogExtractionConfig.ComponentType.STRUCTURED_DATA
))
.build())
.build();
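For reference, a hypothetical RFC5424 message matching this component list is shown below. Based on the component table above, the parser would populate priority, version, timestamp, deviceId, appName, procId, msgId, structuredData, and content; the exact formatting of each value may vary:
{
  "log_message": "<165>1 2023-10-11T22:14:15.003Z server1 appd 1234 ID47 [exampleSDID@32473 iut=\"3\"] Application started"
}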
CEF Extraction
Common Event Format (CEF) is a logging and auditing file format developed by ArcSight, widely used in security information and event management (SIEM) systems.
Configuration
CefExtractionConfig cefConfig = new CefExtractionConfig();
The CEF extraction config doesn't require any additional parameters.
CEF Format Structure
CEF logs follow this structure:
CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
The header part (before the Extension) contains pipe-delimited fields, while the Extension part contains key-value pairs.
Example
// CEF parser for security logs
ParserConfigs.ParserConfig cefConfig = ParserConfigs.ParserConfig.builder()
.targetField("security_log")
.extractionConfig(new CefExtractionConfig())
.build();
Input:
{
"__raw__": "CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked"
}
Output:
{
"severity": 10,
"dst": "10.0.0.1",
"src": "192.168.1.1",
"deviceVendor": "Vendor",
"__raw__": "CEF:0|Vendor|Product|1.0|100|Intrusion Detected|10|src=192.168.1.1 dst=10.0.0.1 spt=1234 dpt=80 act=blocked",
"dpt": "80",
"deviceVersion": "1.0",
"version": 0,
"deviceEventClassId": "100",
"act": "blocked",
"spt": "1234",
"name": "Intrusion Detected",
"deviceProduct": "Product"
}
Advanced Usage: Multi-level Log Parsing
For complex log formats that require multiple parsing stages, ZephFlow's dispatch configuration enables sophisticated processing pipelines. This example demonstrates how to handle Cisco ASA network security logs.
Complex Example: Cisco ASA Log Parsing
This example shows how to parse complex Cisco ASA logs with multiple levels of extraction.
Input Log:
{
"__raw__": "Oct 10 2018 12:34:56 localhost CiscoASA[999]: %ASA-6-305011: Built dynamic TCP translation from inside:172.31.98.44/1772 to outside:100.66.98.44/8256"
}
Parser Configuration:
{
"targetField": "__raw__",
"removeTargetField": false,
"extractionConfig": {
"type": "syslog",
"timestampPattern": "MMM dd yyyy HH:mm:ss",
"componentList": [
"TIMESTAMP",
"DEVICE",
"APP"
],
"messageBodyDelimiter": ":"
},
"dispatchConfig": {
"dispatchField": "content",
"dispatchMap": {},
"defaultConfig": {
"targetField": "content",
"removeTargetField": true,
"extractionConfig": {
"type": "grok",
"grokExpression": "%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}"
},
"dispatchConfig": {
"dispatchField": "message_number",
"dispatchMap": {
"305011": {
"targetField": "message_text",
"removeTargetField": true,
"extractionConfig": {
"type": "grok",
"grokExpression": "%{WORD:action} %{WORD:translation_type} %{WORD:protocol} translation from %{WORD:source_interface}:%{IP:source_ip}/%{INT:source_port} to %{WORD:dest_interface}:%{IP:dest_ip}/%{INT:dest_port}"
},
"dispatchConfig": null
}
},
"defaultConfig": null
}
}
}
}
Expected Output:
{
"level": "6",
"message_number": "305011",
"source_interface": "inside",
"__raw__": "Oct 10 2018 12:34:56 localhost CiscoASA[999]: %ASA-6-305011: Built dynamic TCP translation from inside:172.31.98.44/1772 to outside:100.66.98.44/8256",
"appName": "CiscoASA[999]",
"dest_interface": "outside",
"source_ip": "172.31.98.44",
"translation_type": "dynamic",
"deviceId": "localhost",
"protocol": "TCP",
"source_port": "1772",
"dest_ip": "100.66.98.44",
"action": "Built",
"dest_port": "8256",
"timestamp": "Oct 10 2018 12:34:56"
}
How It Works
This example processes logs through three sequential stages:
- First Stage (Syslog Parsing):
  - Parses the syslog header format from the __raw__ field
  - Extracts timestamp, device ID, and application name
  - Places the message content after the colon in a field called content
- Second Stage (Cisco ASA Format Parsing):
  - The default config in the first dispatchConfig targets the content field
  - Uses Grok to extract the ASA log format with level, message number, and message text
  - For example, parses %ASA-6-305011: Built dynamic TCP translation...
  - Places the message text into the message_text field
- Third Stage (Message-Type Specific Parsing):
  - Uses the message_number (305011) to select the appropriate parser
  - Each message type has specific fields to extract
  - For 305011 messages, extracts details about the network translation
This multi-stage approach demonstrates the power of nested dispatch configurations for handling complex, structured logs with varying formats.
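For completeness, the same three-stage pipeline can be sketched with the SDK builders, assuming ParserConfigs.DispatchConfig exposes the builder shown in the Using Dispatch Configuration section below:
// Stage 3: parse the message_text of ASA message 305011
ParserConfigs.ParserConfig asa305011Config = ParserConfigs.ParserConfig.builder()
    .targetField("message_text")
    .removeTargetField(true)
    .extractionConfig(GrokExtractionConfig.builder()
        .grokExpression("%{WORD:action} %{WORD:translation_type} %{WORD:protocol} translation from %{WORD:source_interface}:%{IP:source_ip}/%{INT:source_port} to %{WORD:dest_interface}:%{IP:dest_ip}/%{INT:dest_port}")
        .build())
    .build();

// Stage 2: split the ASA header out of the syslog content, then dispatch on message_number
ParserConfigs.ParserConfig asaHeaderConfig = ParserConfigs.ParserConfig.builder()
    .targetField("content")
    .removeTargetField(true)
    .extractionConfig(GrokExtractionConfig.builder()
        .grokExpression("%ASA-%{INT:level}-%{INT:message_number}: %{GREEDYDATA:message_text}")
        .build())
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("message_number")
        .dispatchMap(Map.of("305011", asa305011Config))
        .build())
    .build();

// Stage 1: parse the syslog envelope, then hand the content to the ASA parser
ParserConfigs.ParserConfig ciscoAsaConfig = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .removeTargetField(false)
    .extractionConfig(SyslogExtractionConfig.builder()
        .timestampPattern("MMM dd yyyy HH:mm:ss")
        .componentList(List.of(
            SyslogExtractionConfig.ComponentType.TIMESTAMP,
            SyslogExtractionConfig.ComponentType.DEVICE,
            SyslogExtractionConfig.ComponentType.APP))
        .messageBodyDelimiter(':')
        .build())
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("content")
        .dispatchMap(Map.of())
        .defaultConfig(asaHeaderConfig)
        .build())
    .build();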
Best Practices
Selecting the Right Extraction Configuration
- Analyze your log format first to determine which extraction configuration best matches your needs:
- Use Grok for most text-based logs with consistent formats
- Use Syslog for standard system and network device logs
- Use Windows Multiline for Windows Event logs
- Use CEF for security and SIEM-related logs
- Test with sample data to verify your configuration handles all variations in your log format
- Create targeted parsers rather than trying to parse everything with one complex configuration
Using Dispatch Configuration
For mixed log formats, use the dispatch configuration to apply different parsing strategies based on initial parsing results:
ParserConfigs.ParserConfig initialParser = ParserConfigs.ParserConfig.builder()
.targetField("raw_log")
.extractionConfig(new GrokExtractionConfig("%{WORD:log_type}: %{GREEDYDATA:log_content}"))
.dispatchConfig(ParserConfigs.DispatchConfig.builder()
.dispatchField("log_type")
.dispatchMap(Map.of(
"apache", apacheParserConfig,
"syslog", syslogParserConfig,
"windows", windowsParserConfig,
"security", cefParserConfig
))
.defaultConfig(defaultParserConfig)
.build())
.build();
Common Pitfalls to Avoid
- Overly complex Grok patterns are difficult to maintain; break them down into smaller, reusable patterns (see the sketch after this list)
- Missing components in the Syslog configuration: ensure your component list matches the exact format of your logs
- Incorrect timestamp patterns: test thoroughly with the date formats that appear in your logs
- Performance considerations: very complex parsing on high-volume logs can impact performance; consider pre-filtering or using simpler patterns where possible
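To illustrate the first pitfall, here is a minimal sketch that splits one long pattern into two smaller stages; the field names and patterns are hypothetical:
// Parse the stable header first...
ParserConfigs.ParserConfig headerParser = ParserConfigs.ParserConfig.builder()
    .targetField("__raw__")
    .extractionConfig(new GrokExtractionConfig("%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:body}"))
    // ...then let a second, smaller pattern handle the body
    .dispatchConfig(ParserConfigs.DispatchConfig.builder()
        .dispatchField("level")
        .dispatchMap(Map.of())
        .defaultConfig(ParserConfigs.ParserConfig.builder()
            .targetField("body")
            .removeTargetField(true)
            .extractionConfig(new GrokExtractionConfig("%{WORD:component} - %{GREEDYDATA:detail}"))
            .build())
        .build())
    .build();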