Field Reference: handle special characters #14044

Merged
merged 12 commits into from
May 24, 2022
15 changes: 14 additions & 1 deletion docs/static/field-reference.asciidoc
@@ -4,7 +4,8 @@
It is often useful to be able to refer to a field or collection of fields by name. To do this,
you can use the Logstash field reference syntax.

-The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets.
+The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets;
+when a field name contains square brackets, they must be properly <<formal-grammar-escape-sequences, _escaped_>>.

_Field References_ can be expressed literally within <<conditionals,_Conditional_>> statements in your pipeline configurations,
as string arguments to your pipeline plugins, or within sprintf statements that will be used by your pipeline plugins:
@@ -133,3 +134,15 @@ embeddedFieldReference
;

An _Embedded Field Reference_ is a _Field Reference_ that is itself wrapped in square brackets (`[` and `]`), and can be a component of a _Composite Field Reference_.

[float]
[[formal-grammar-escape-sequences]]
=== Escape Sequences

In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
Contributor

Suggested change
-In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
+For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.

Logstash can be globally configured to use one of two field reference escape modes:

- `NONE` (default): no escape sequence processing is done; fields containing literal square brackets cannot be referenced by the Event API.
- `PERCENT`: URI-style percent encoding of UTF-8 bytes; the left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
Contributor

Suggested change
-// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
+// Note that the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.

Contributor

Suggest using regular words rather than the asciidoc admonition format. The comment treatment would keep it from getting formatted, but it keeps catching my eye.

- `AMPERSAND`: HTML-style ampersand encoding (`&#` + decimal unicode codepoint + `;`); the left square bracket (`[`) is expressed as `&amp;#91;`, and the right square bracket (`]`) is expressed as `&amp;#93;`.
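
As a rough illustration of the two escape modes listed above, the following Ruby sketch builds a field reference from raw path segments and encodes only the literal square brackets. The `field_reference` helper and the `BRACKET_ESCAPES` table are hypothetical names chosen for this example and are not part of Logstash; the sketch also ignores the extra escaping that literal `%HH` or `&#NN;` sequences in a field name would need.

```ruby
# Hypothetical helper (not a Logstash API): encode literal square brackets
# in each path segment, then wrap every segment in [ ].
BRACKET_ESCAPES = {
  'PERCENT'   => { '[' => '%5B',   ']' => '%5D' },
  'AMPERSAND' => { '[' => '&#91;', ']' => '&#93;' }
}.freeze

def field_reference(segments, style)
  # fetch raises KeyError for NONE or any unknown style, since neither
  # offers a way to express a literal bracket.
  table = BRACKET_ESCAPES.fetch(style)
  segments.map do |segment|
    escaped = segment.gsub(/[\[\]]/, table) # hash form: each match is looked up in the table
    "[#{escaped}]"
  end.join
end

puts field_reference(['host', 'tags[0]'], 'PERCENT')
puts field_reference(['host', 'tags[0]'], 'AMPERSAND')
```

With `NONE` there is no encoding for a literal bracket, which is why the sketch refuses that style instead of emitting an ambiguous reference.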
11 changes: 11 additions & 0 deletions docs/static/settings-file.asciidoc
@@ -178,6 +178,17 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| When set to `true`, quoted strings will process the following escape sequences: `\n` becomes a literal newline (ASCII 10). `\r` becomes a literal carriage return (ASCII 13). `\t` becomes a literal tab (ASCII 9). `\\` becomes a literal backslash `\`. `\"` becomes a literal double quotation mark. `\'` becomes a literal quotation mark.
| `false`

| `config.field_reference.escape_style`
a| _EXPERIMENTAL_ setting that provides a way to reference fields that contain <<formal-grammar-escape-sequences,field reference special characters>> `[` and `]`.
Contributor

@karenzone, May 20, 2022


Should be "Technical preview" instead of "Experimental." Reference: https://github.com/elastic/docs#using-the-technical-preview-admonition

I tried adding/formatting this nugget in a variety of ways. So far, I don't really like any of them. Here are two of several things I tried:

[Screenshot: Screen Shot 2022-05-20 at 4 31 18 PM]

[Screenshot: Screen Shot 2022-05-20 at 4 39 51 PM]

Admonitions are supposed to handle formatting, but I haven't hit on a combination that looks good and provides adequate info for the user. Only tagging the option with "Preview" doesn't convey the risk that the option might change or go away. I can sync with @gtback next week for ideas and design intent.

Member

I wish the admonitions contained a link to a better place where we define what we mean and add more detail. I've never seen these used in a table, but let me know what I can do to help.


Current options are:

* `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes (`[` -> `%5B`; `]` -> `%5D`)
* `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points (`[` -> `&amp;#91;`; `]` -> `&amp;#93;`)
* `NONE`: field names containing special characters _cannot_ be referenced.

| `NONE`
Contributor

Is this entry case sensitive? In this topic, we already have two instances of None and one instance of none.

Member Author

@yaauie, May 23, 2022


It is. And there really is no reason for it to be upcase, so I have changed the implementation to be downcase throughout to match the other none.

Both other instances of "None" in the file should actually be "N/A" since they represent an absence of a default value instead of a default value that is the literal N+o+n+e


| `modules`
| When configured, `modules` must be in the nested YAML structure described above this table.
| None
8 changes: 8 additions & 0 deletions logstash-core/lib/logstash/agent.rb
@@ -66,6 +66,14 @@ def initialize(settings = LogStash::SETTINGS, source_loader = nil)
# Generate / load the persistent uuid
id

field_reference_escape_style_setting = settings.get_setting('config.field_reference.escape_style')
if field_reference_escape_style_setting.set?
logger.warn(I18n.t("logstash.settings.experimental.set", canonical_name: field_reference_escape_style_setting.name))
end
field_reference_escape_style = field_reference_escape_style_setting.value
logger.debug("Setting global FieldReference escape style: #{field_reference_escape_style}")
org.logstash.FieldReference::set_escape_style(field_reference_escape_style)

# Initialize, but do not start the webserver.
@webserver = LogStash::WebServer.from_settings(@logger, self, settings)

1 change: 1 addition & 0 deletions logstash-core/lib/logstash/environment.rb
@@ -49,6 +49,7 @@ module Environment
Setting::Boolean.new("config.reload.automatic", false),
Setting::TimeValue.new("config.reload.interval", "3s"), # in seconds
Setting::Boolean.new("config.support_escapes", false),
Setting::String.new("config.field_reference.escape_style", "NONE", true, %w(NONE PERCENT AMPERSAND)),
Setting::Boolean.new("metric.collect", true),
Setting::String.new("pipeline.id", "main"),
Setting::Boolean.new("pipeline.system", false),
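
The two Ruby hunks above register the setting with a fixed list of allowed values and warn only when the user has set it explicitly. A minimal standalone sketch of that pattern follows; `EnumSetting` and `experimental_warnings` are hypothetical stand-ins written for this example, and they do not reproduce the real `LogStash::Setting` classes or their constructor signature.

```ruby
# Hypothetical stand-in for a Logstash setting; the class and method names are
# illustrative only and do not mirror LogStash::Setting exactly.
class EnumSetting
  attr_reader :name, :value

  def initialize(name, default, possible_values)
    @name = name
    @possible_values = possible_values
    @value = default
    @explicitly_set = false
  end

  # Reject anything outside the allowed list, then remember that the user
  # supplied the value (as opposed to the default being used).
  def set(new_value)
    unless @possible_values.include?(new_value)
      raise ArgumentError, "Invalid value #{new_value.inspect} for #{@name}"
    end
    @value = new_value
    @explicitly_set = true
  end

  def set?
    @explicitly_set
  end
end

# Returns the warnings that would be logged before the style is applied.
def experimental_warnings(setting)
  return [] unless setting.set?
  ["The setting `#{setting.name}` is EXPERIMENTAL and its implementation is subject to change"]
end

setting = EnumSetting.new('config.field_reference.escape_style', 'NONE', %w[NONE PERCENT AMPERSAND])
puts experimental_warnings(setting).inspect # default untouched, so nothing to warn about
setting.set('PERCENT')
puts experimental_warnings(setting).inspect
```

Keeping the "explicitly set" flag separate from the value means users who never touch the setting do not see the experimental warning.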
5 changes: 5 additions & 0 deletions logstash-core/lib/logstash/runner.rb
@@ -87,6 +87,11 @@ class LogStash::Runner < Clamp::StrictCommand
:default => LogStash::SETTINGS.get_default("config.string"),
:attribute_name => "config.string"

option ["--field-reference-escape-style"], "STYLE",
Contributor

got confused that this does not take effect when using irb e.g.
bin/logstash --field-reference-escape-style PERCENT -i irb but that's for a separate issue

Member Author

Resolved in d58447d by moving the setting's application from the agent (which isn't started for shell sessions) to the runner before shell sessions are invoked.

I18n.t("logstash.runner.flag.field-reference-escape-style"),
:default => LogStash::SETTINGS.get_default("config.field_reference.escape_style"),
:attribute_name => "config.field_reference.escape_style"

# Module settings
option ["--modules"], "MODULES",
I18n.t("logstash.runner.flag.modules"),
29 changes: 29 additions & 0 deletions logstash-core/locales/en.yml
@@ -219,6 +219,31 @@ en:
"%{default_output}"
If you wish to use both defaults, please use
the empty string for the '-e' flag.
field-reference-escape-style: |+
Use the given STYLE when parsing field
references. This allows you to reference fields
whose name includes characters that are
meaningful in a field reference including square
brackets (`[` and `]`).

This feature is EXPERIMENTAL, and implementations
are subject to change.

Available escape styles are:
- `NONE`: escape sequences in field references
are not processed, which means fields that
contain special characters cannot be
referenced.
- `PERCENT`: characters may be encoded with
URI-style percent notation representing UTF-8
bytes (`[` is `%5B`; `]` is `%5D`).
Unlike URI-encoding, literal percent characters
do not need to be escaped unless followed by a
sequence of 2 capital hexadecimal characters.
- `AMPERSAND`: characters may be encoded with
HTML-style ampersand-hash encoding notation
representing decimal unicode codepoints
(`[` is `&#91;`; `]` is `&#93;`).
modules: |+
Load Logstash modules.
Modules can be defined using multiple instances
@@ -431,4 +456,8 @@ en:
ambiguous: >-
Both `%{canonical_name}` and its deprecated alias `%{deprecated_alias}` have been set.
Please only set `%{canonical_name}`
experimental:
set: >-
The setting `%{canonical_name}` is EXPERIMENTAL and its implementation is subject to change
in a future release of Logstash

42 changes: 33 additions & 9 deletions logstash-core/src/main/java/org/logstash/FieldReference.java
@@ -27,8 +27,10 @@
import java.util.Map;

import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.jruby.RubyString;
import org.logstash.util.EscapeHandler;

/**
* Represents a reference to another field of the event {@link Event}
@@ -45,19 +47,39 @@ public static class IllegalSyntaxException extends RuntimeException {
}
}

private static EscapeHandler ESCAPE_HANDLER = EscapeHandler.NONE;

public static void setEscapeStyle(final String escapeStyleSpec) {
final EscapeHandler newEscapeHandler;
switch(escapeStyleSpec) {
case "NONE":
newEscapeHandler = EscapeHandler.NONE;
break;
case "PERCENT":
newEscapeHandler = EscapeHandler.PERCENT;
break;
case "AMPERSAND":
newEscapeHandler = EscapeHandler.AMPERSAND;
break;
default:
throw new IllegalArgumentException(String.format("Invalid escape style: `%s`", escapeStyleSpec));
}
ESCAPE_HANDLER = newEscapeHandler;
}

/**
* This type indicates that the referenced data is the metadata of an {@link Event} found in
-* {@link Event#metadata}.
+* {@link Event#getMetadata()}.
*/
public static final int META_PARENT = 0;

/**
-* This type indicates that the referenced data must be looked up from {@link Event#metadata}.
+* This type indicates that the referenced data must be looked up from {@link Event#getMetadata()}.
*/
public static final int META_CHILD = 1;

/**
-* This type indicates that the referenced data must be looked up from {@link Event#data}.
+* This type indicates that the referenced data must be looked up from {@link Event#getData()}.
*/
private static final int DATA_CHILD = -1;

@@ -73,11 +95,6 @@ public static class IllegalSyntaxException extends RuntimeException {
*/
private static final StrictTokenizer TOKENIZER = new StrictTokenizer();

-/**
-* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
-*/
-public static final FieldReference TIMESTAMP_REFERENCE = FieldReference.from(Event.TIMESTAMP);

/**
* Cache of all existing {@link FieldReference} by their {@link RubyString} source.
*/
@@ -90,6 +107,11 @@ public static class IllegalSyntaxException extends RuntimeException {
private static final Map<String, FieldReference> CACHE =
new ConcurrentHashMap<>(64, 0.2F, 1);

+/**
+* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
+*/
+public static final FieldReference TIMESTAMP_REFERENCE = FieldReference.from(Event.TIMESTAMP);

private final String[] path;

private final String key;
@@ -217,7 +239,9 @@ private static FieldReference parseToCache(final String reference) {
}

private static FieldReference parse(final CharSequence reference) {
-final List<String> path = TOKENIZER.tokenize(reference);
+final List<String> path = TOKENIZER.tokenize(reference).stream()
+        .map(ESCAPE_HANDLER::unescape)
+        .collect(Collectors.toList());

final String key = path.remove(path.size() - 1);
final boolean empty = path.isEmpty();
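
The change to `parse` applies the unescape step to each fragment only after tokenizing, so a decoded bracket can no longer be mistaken for a delimiter. The Ruby sketch below illustrates that ordering under simplifying assumptions: `parse_reference` is a hypothetical, much-reduced parser (the real `StrictTokenizer` is stricter), and `UNESCAPE_BRACKETS` is a stand-in handler that decodes only `%5B` and `%5D`.

```ruby
# Stand-in for the configured escape handler: decodes only %5B and %5D.
UNESCAPE_BRACKETS = ->(fragment) { fragment.gsub('%5B', '[').gsub('%5D', ']') }

# Hypothetical, simplified parser: split on bracket delimiters first, then
# unescape each fragment, so decoded brackets are never treated as delimiters.
def parse_reference(reference, unescape)
  fragments = reference.scan(/\[([^\[\]]*)\]/).flatten
  fragments = [reference] if fragments.empty? # bare field name such as "message"
  path = fragments.map { |fragment| unescape.call(fragment) }
  key = path.pop # the last fragment is the key; the rest is the parent path
  { path: path, key: key }
end

p parse_reference('[labels][app%5Bv2%5D]', UNESCAPE_BRACKETS)
```

Unescaping before tokenizing would instead turn `app%5Bv2%5D` into `app[v2]` too early, and the tokenizer would then see an extra nested fragment.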
73 changes: 73 additions & 0 deletions logstash-core/src/main/java/org/logstash/util/EscapeHandler.java
@@ -0,0 +1,73 @@
package org.logstash.util;

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public interface EscapeHandler {
    String unescape(String escaped);
    String escape(String unescaped);

    EscapeHandler NONE = new EscapeHandler() {
        @Override
        public String unescape(final String escaped) {
            return escaped;
        }

        @Override
        public String escape(final String unescaped) {
            return unescaped;
        }
    };

    EscapeHandler PERCENT = new EscapeHandler() {
        private final Pattern ESCAPE_REQUIRED_PERCENT_LITERAL = Pattern.compile("%(?=[0-9A-F]{2})");

        @Override
        public String escape(final String unescaped) {
            // When a percent-literal is followed by a pair of hex digits, we must escape it.
            return ESCAPE_REQUIRED_PERCENT_LITERAL.matcher(unescaped).replaceAll("%25")
                    .replace("[", "%5B")
                    .replace("]", "%5D");
        }

        private final Pattern PERCENT_ENCODED_SEQUENCE = Pattern.compile("%[0-9A-F]{2}");
        private final Pattern UNESCAPED_PERCENT_LITERAL = Pattern.compile("%(?![0-9A-F]{2})");

        public String unescape(String escaped) {
            if (!PERCENT_ENCODED_SEQUENCE.matcher(escaped).find()) { return escaped; }

            // In order to support unescaped percent-literals without implementing
            // our own percent-decoder, we need to detect them and escape them before
            // handing off to java's URLDecoder.
            escaped = UNESCAPED_PERCENT_LITERAL.matcher(escaped).replaceAll("%25");

            return URLDecoder.decode(escaped, StandardCharsets.UTF_8);
        }
    };

    EscapeHandler AMPERSAND = new EscapeHandler() {
        private final Pattern AMPERSAND_ENCODED_SEQUENCE = Pattern.compile("&#([0-9]{2,});");

        @Override
        public String escape(final String unescaped) {
            return unescaped.replaceAll(AMPERSAND_ENCODED_SEQUENCE.pattern(), "&#38;#$1;")
                    .replace("[", "&#91;")
                    .replace("]", "&#93;");
        }

        @Override
        public String unescape(final String escaped) {
            if (!escaped.contains("&")) { return escaped; }

            return AMPERSAND_ENCODED_SEQUENCE.matcher(escaped).replaceAll(matchResult -> {
                final int codePoint = Integer.parseInt(matchResult.group(1));
                final char[] chars = Character.toChars(codePoint);
                return String.copyValueOf(chars);
            });
        }
    };
}
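
To make the round-trip behaviour of the `PERCENT` and `AMPERSAND` handlers easier to experiment with, here is a rough Ruby approximation. It is a sketch, not a port: the `EscapeSketch` module and its method names are hypothetical, and the percent decoder is hand-written here rather than delegating to a library decoder as the Java code does.

```ruby
# Illustrative Ruby approximations of the PERCENT and AMPERSAND handlers.
# The module and method names are hypothetical, chosen for this example.
module EscapeSketch
  module_function

  # A literal % needs escaping only when two capital hex digits follow it.
  def percent_escape(text)
    text.gsub(/%(?=[0-9A-F]{2})/, '%25').gsub('[', '%5B').gsub(']', '%5D')
  end

  # Decode each run of %HH sequences as UTF-8 bytes; any other % is left as-is.
  def percent_unescape(text)
    text.gsub(/(?:%[0-9A-F]{2})+/) do |run|
      [run.delete('%')].pack('H*').force_encoding(Encoding::UTF_8)
    end
  end

  # An existing "&#NN;" in the name has its ampersand encoded as &#38; first.
  def ampersand_escape(text)
    text.gsub(/&#([0-9]{2,});/, '&#38;#\1;').gsub('[', '&#91;').gsub(']', '&#93;')
  end

  # Replace each "&#NN;" with the character for that decimal code point.
  def ampersand_unescape(text)
    text.gsub(/&#([0-9]{2,});/) { [Regexp.last_match(1).to_i].pack('U') }
  end
end

puts EscapeSketch.percent_unescape(EscapeSketch.percent_escape('rate[5m] at 100%'))
puts EscapeSketch.ampersand_unescape(EscapeSketch.ampersand_escape('R&D [draft]'))
```

One difference worth noting: this sketch leaves `+` untouched, whereas `java.net.URLDecoder.decode` also converts `+` to a space, so the two may disagree for field names that contain both a `+` and a percent-encoded sequence.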