Grok is a regular expression dialect that supports reusable aliased expressions. Grok works well with syslog logs, Apache and other web server logs, MySQL logs, and generally any log format that is written for humans rather than for machine consumption.
Grok sits on top of the Oniguruma regular expression library, so any regular expression that Oniguruma accepts is valid in grok. Grok builds on this regular expression language by letting you name existing patterns and combine them into more complex patterns that match your fields.
The {stack} ships with numerous predefined grok patterns that simplify working with grok. The syntax for reusing grok patterns takes one of the following forms:
%{SYNTAX}
%{SYNTAX:ID}
%{SYNTAX:ID:TYPE}
SYNTAX
- The name of the pattern that will match your text. For example, NUMBER and IP are both patterns that are provided within the default patterns set. The NUMBER pattern matches data like 3.44, and the IP pattern matches data like 55.3.244.1.
ID
- The identifier you give to the piece of text being matched. For example, 3.44 could be the duration of an event, so you might call it duration. The string 55.3.244.1 might identify the client making a request.
TYPE
- The data type you want to cast your named field. int, long, double, float and boolean are supported types.
For example, let’s say you have message data that looks like this:
3.44 55.3.244.1
The first value is a number, followed by what appears to be an IP address. You can match this text by using the following grok expression:
%{NUMBER:duration} %{IP:client}
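To make the %{SYNTAX:ID} and %{SYNTAX:ID:TYPE} forms concrete, the following Python sketch expands a grok-style expression into an ordinary regular expression with named groups. It is only an illustration of the idea: the PATTERNS table holds simplified stand-ins for NUMBER and IP rather than the definitions that ship with the {stack}, and the helper names grok_to_regex and grok_match are hypothetical.

```python
import re

# Simplified stand-ins for two predefined patterns (assumption: the shipped
# NUMBER and IP definitions are more thorough than these).
PATTERNS = {
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}

# Casts for the optional TYPE suffix; boolean is left out to keep this short.
CASTS = {"int": int, "long": int, "float": float, "double": float}

# Matches %{SYNTAX}, %{SYNTAX:ID} and %{SYNTAX:ID:TYPE}.
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?(?::(\w+))?\}")


def grok_to_regex(expression):
    """Expand pattern references into a compiled regex plus per-field casts."""
    casts = {}

    def expand(ref):
        syntax, ident, type_name = ref.groups()
        body = PATTERNS[syntax]
        if ident is None:
            return f"(?:{body})"       # no ID: match without capturing
        if type_name is not None:
            casts[ident] = CASTS[type_name]
        return f"(?P<{ident}>{body})"  # ID becomes the capture group name

    return re.compile(REFERENCE.sub(expand, expression)), casts


def grok_match(expression, text):
    """Return a dict of named fields, or None when the text does not match."""
    regex, casts = grok_to_regex(expression)
    match = regex.fullmatch(text)
    if match is None:
        return None
    return {name: casts.get(name, str)(value)
            for name, value in match.groupdict().items()}


print(grok_match("%{NUMBER:duration} %{IP:client}", "3.44 55.3.244.1"))
print(grok_match("%{NUMBER:duration:float} %{IP:client}", "3.44 55.3.244.1"))
```

The first call returns both fields as strings. The second adds the float TYPE, so duration comes back as a number instead of text.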
To ease migration to the {ecs-ref}[Elastic Common Schema (ECS)], a new set of ECS-compliant patterns is available in addition to the existing patterns. The new ECS pattern definitions capture event field names that are compliant with the schema.
The ECS pattern set has all of the pattern definitions from the legacy set, and
is a drop-in replacement. Use the
{logstash-ref}/plugins-filters-grok.html#plugins-filters-grok-ecs_compatibility[ecs_compatibility]
setting to switch modes.
New features and enhancements will be added to the ECS-compliant files. The legacy patterns may still receive backwards-compatible bug fixes.
You can incorporate predefined grok patterns into Painless scripts to extract data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless execute API or create a runtime field that includes the script. Runtime fields offer greater flexibility and accept multiple documents, but the Painless execute API is a great option if you don’t have write access on a cluster where you’re testing a script.
Tip: If you need help building grok patterns to match your data, use the {kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool in {kib}.
For example, if you’re working with Apache log data, you can use the
%{COMMONAPACHELOG} syntax, which understands the structure of Apache logs. A
sample document might look like this:
"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - -
[30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
To extract the IP address from the message field, you can write a Painless
script that incorporates the %{COMMONAPACHELOG} syntax. You can test this
script using the {painless}/painless-execute-api.html#painless-runtime-ip[ip
field context] of the Painless execute API, but let’s use a runtime field
instead.
Based on the sample document, index the @timestamp and message fields. To
remain flexible, use wildcard as the field type for message:
PUT /my-index/
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_second",
        "type": "date"
      },
      "message": {
        "type": "wildcard"
      }
    }
  }
}
Next, use the bulk API to index some log data into my-index.
POST /my-index/_bulk?refresh
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
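The bulk request body is newline-delimited JSON: each document is preceded by an action line, and the body ends with a newline. If you generate such a body from code, a minimal Python sketch might look like the following. The helper name build_bulk_body is hypothetical, and only two of the sample documents are included to keep it short.

```python
import json

# Two of the sample log lines from the request above.
documents = [
    {"timestamp": "2020-04-30T14:30:17-05:00",
     "message": '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] '
                '"GET /images/hm_bg.jpg HTTP/1.0" 200 24736'},
    {"timestamp": "2020-04-30T14:31:28-05:00",
     "message": "not a valid apache log"},
]


def build_bulk_body(docs):
    """Return an NDJSON body: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        # Empty action: the target index comes from the request URL.
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the final line to end with a newline as well.
    return "\n".join(lines) + "\n"


body = build_bulk_body(documents)
print(body, end="")
```

Letting json.dumps do the serialization also takes care of escaping the double quotes inside each log message.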
Now you can define a runtime field in the mappings that includes your Painless
script and grok pattern. If the pattern matches (clientip != null), the script
emits the value of the matching IP address. If the pattern doesn’t match, the
script emits nothing for that document and exits without an error.
PUT my-index/_mappings
{
  "runtime": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  }
}
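The control flow of that script can be mimicked outside the cluster to see what it emits for matching and non-matching messages. The Python sketch below is an approximation under stated assumptions: COMMON_LOG is a simplified stand-in for %{COMMONAPACHELOG}, not the shipped pattern, and the function name client_ips is hypothetical.

```python
import re

# Simplified stand-in for %{COMMONAPACHELOG} (assumption: the shipped pattern
# is more permissive and captures more fields than this regex).
COMMON_LOG = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def client_ips(message):
    """Return the values the script would emit for one document."""
    match = COMMON_LOG.fullmatch(message)
    # Plays the role of the null-safe ?.clientip access in the script.
    clientip = match.group("clientip") if match else None
    # Emit only when the pattern matched; otherwise emit nothing.
    return [clientip] if clientip is not None else []


messages = [
    '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] '
    '"GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] '
    '"GET /images/hm_nbg.jpg HTTP/1.0" 304 0',
    "not a valid apache log",
]
for message in messages:
    print(client_ips(message))
```

The first two messages each yield one IP address. The last message yields an empty list, which corresponds to a document that has no value for the runtime field.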
Alternatively, you can define the same runtime field in the context of a
search request. The runtime definition and the script are exactly the same as
the ones defined previously in the index mapping. Copy that definition into
the search request under the runtime_mappings section and include a query
that matches on the runtime field. This query returns the same results as a
search against the http.clientip runtime field defined in your index mappings,
but the field exists only in the context of this specific search:
GET my-index/_search
{
  "runtime_mappings": {
    "http.clientip": {
      "type": "ip",
      "script": """
        String clientip = grok('%{COMMONAPACHELOG}').extract(doc["message"].value)?.clientip;
        if (clientip != null) emit(clientip);
      """
    }
  },
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
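Because the definition is identical in both places, client code can keep it in one object and attach it under either key. The Python sketch below only builds the two request bodies as dictionaries; it does not send them, and the variable names are hypothetical.

```python
import json

# The runtime field definition, written once. The script text is taken from
# the requests above.
CLIENTIP_FIELD = {
    "http.clientip": {
        "type": "ip",
        "script": (
            "String clientip=grok('%{COMMONAPACHELOG}')"
            '.extract(doc["message"].value)?.clientip; '
            "if (clientip != null) emit(clientip);"
        ),
    }
}

# Body for PUT my-index/_mappings: the definition lives under "runtime".
mapping_body = {"runtime": CLIENTIP_FIELD}

# Body for GET my-index/_search: the same definition under "runtime_mappings".
search_body = {
    "runtime_mappings": CLIENTIP_FIELD,
    "query": {"match": {"http.clientip": "40.135.0.0"}},
    "fields": ["http.clientip"],
}

print(json.dumps(search_body, indent=2))
```

Only the enclosing key changes between the two bodies: runtime in the mapping request, runtime_mappings in the search request.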
Using the http.clientip runtime field, you can define a simple query to run a
search for a specific IP address and return all related fields. The fields
parameter on the _search API works for all fields, even those that weren’t
sent as part of the original _source:
GET my-index/_search
{
  "query": {
    "match": {
      "http.clientip": "40.135.0.0"
    }
  },
  "fields" : ["http.clientip"]
}
The response includes the specific IP address indicated in your search query.
The grok pattern within the Painless script extracted this value from the
message field at runtime.
{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "1iN2a3kBw4xTzEDqyYE0",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:30:17-05:00",
          "message" : "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        "fields" : {
          "http.clientip" : [
            "40.135.0.0"
          ]
        }
      }
    ]
  }
}
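When reading a response like this from code, note that the fields section returns each field as an array of values, and a hit has no entry for a field when the script emitted nothing. A minimal Python sketch, using a trimmed copy of the response above and a hypothetical helper named field_values:

```python
# Trimmed copy of the response above.
response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "my-index",
                "_id": "1iN2a3kBw4xTzEDqyYE0",
                "fields": {"http.clientip": ["40.135.0.0"]},
            }
        ],
    }
}


def field_values(search_response, field):
    """Collect every value returned for `field` across all hits."""
    values = []
    for hit in search_response["hits"]["hits"]:
        # Fall back to an empty list when the hit has no entry for the field.
        values.extend(hit.get("fields", {}).get(field, []))
    return values


print(field_values(response, "http.clientip"))
```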