Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Field Reference: handle special characters #14044

Merged
merged 12 commits into from
May 24, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 14 additions & 1 deletion docs/static/field-reference.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,8 @@
It is often useful to be able to refer to a field or collection of fields by name. To do this,
you can use the Logstash field reference syntax.

The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets.
The syntax to access a field specifies the entire path to the field, with each fragment wrapped in square brackets;
when a field name contains square brackets, they must be properly <<formal-grammar-escape-sequences, _escaped_>>.
yaauie marked this conversation as resolved.
Show resolved Hide resolved

_Field References_ can be expressed literally within <<conditionals,_Conditional_>> statements in your pipeline configurations,
as string arguments to your pipeline plugins, or within sprintf statements that will be used by your pipeline plugins:
Expand Down Expand Up @@ -133,3 +134,15 @@ embeddedFieldReference
;

An _Embedded Field Reference_ is a _Field Reference_ that is itself wrapped in square brackets (`[` and `]`), and can be a component of a _Composite Field Reference_.

[float]
[[formal-grammar-escape-sequences]]
=== Escape Sequences

In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to reference a field whose name contains a character that has special meaning in the field reference grammar, it needs to be escaped.
For {ls} to reference a field whose name contains a character that has special meaning in the field reference grammar, the character must be escaped.

Logstash can be globally configured to use one of two field reference escape modes:

- `NONE` (default): no escape sequence processing is done; fields containing literal square brackets cannot be referenced by the Event API.
yaauie marked this conversation as resolved.
Show resolved Hide resolved
- `PERCENT`: URI-style percent encoding of UTF-8 bytes; the left square bracket (`[`) is expressed as `%5B`, and the right square bracket (`]`) is expressed as `%5D`.
yaauie marked this conversation as resolved.
Show resolved Hide resolved
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NOTE: the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.
// Note that the following is _also_ HTML-escaped in the asciidoc source document so that browsers rendering the HTML will unwrap one escape and leave the remaining.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using regular words rather than the asciidoc admonition format. The comment treatment would keep it from getting formatted, but it keeps catching my eye.

- `AMPERSAND`: HTML-style ampersand encoding (`&#` + decimal unicode codepoint + `;`); the left square bracket (`[`) is expressed as `&#91;`, and the right square bracket (`]`) is expressed as `&#93;`.
yaauie marked this conversation as resolved.
Show resolved Hide resolved
11 changes: 11 additions & 0 deletions docs/static/settings-file.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,17 @@ Values other than `disabled` are currently considered BETA, and may produce unin
| When set to `true`, quoted strings will process the following escape sequences: `\n` becomes a literal newline (ASCII 10). `\r` becomes a literal carriage return (ASCII 13). `\t` becomes a literal tab (ASCII 9). `\\` becomes a literal backslash `\`. `\"` becomes a literal double quotation mark. `\'` becomes a literal quotation mark.
| `false`

| `config.field_reference.escape_style`
a| _EXPERIMENTAL_ setting that provides a way to reference fields that contain <<formal-grammar-escape-sequences,field reference special characters>> `[` and `]`.
Copy link
Contributor

@karenzone karenzone May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "Technical preview" instead of "Experimental." Reference: https://github.com/elastic/docs#using-the-technical-preview-admonition

I tried adding/formatting this nugget in a variety of ways. So far, I don't really like any of them. Here are two of several things I tried:

Screen Shot 2022-05-20 at 4 31 18 PM

Screen Shot 2022-05-20 at 4 39 51 PM

Admonitions are supposed to handle formatting, but I haven't hit on a combination that looks good and provides adequate info for the user. Only tagging the option with "Preview" doesn't convey the risk that the option might change or go away. I can sync with @gtback next week for ideas and design intent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish the admonitions contained a link to a better place where we define what we mean and add more detail. I've never seen these used in a table, but let me know what I can do to help.


Current options are:

* `PERCENT`: URI-style `%`+`HH` hexadecimal encoding of UTF-8 bytes (`[` -> `%5B`; `]` -> `%5D`)
* `AMPERSAND`: HTML-style `&#`+`DD`+`;` encoding of decimal Unicode code-points (`[` -> `&#91;`; `]` -> `&#91;`)
yaauie marked this conversation as resolved.
Show resolved Hide resolved
* `NONE`: field names containing special characters _cannot_ be referenced.

| `NONE`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this entry case sensitive? In this topic, we already have two instances of None and one instance of none.

Copy link
Member Author

@yaauie yaauie May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. And there really is no reason for it to be upcase, so I have changed the implementation to be downcase throughout to match the other none.

Both other instances of "None" in the file should actually be "N/A" since they represent an absence of a default value instead of a default value that is the literal N+o+n+e


| `modules`
| When configured, `modules` must be in the nested YAML structure described above this table.
| None
Expand Down
8 changes: 8 additions & 0 deletions logstash-core/lib/logstash/agent.rb
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,14 @@ def initialize(settings = LogStash::SETTINGS, source_loader = nil)
# Generate / load the persistent uuid
id

field_reference_escape_style_setting = settings.get_setting('config.field_reference.escape_style')
if field_reference_escape_style_setting.set?
logger.warn(I18n.t("logstash.settings.experimental.set", canonical_name: field_reference_escape_style_setting.name))
end
field_reference_escape_style = field_reference_escape_style_setting.value
logger.debug("Setting global FieldReference escape style: #{field_reference_escape_style}")
org.logstash.FieldReference::set_escape_style(field_reference_escape_style)

# Initialize, but do not start the webserver.
@webserver = LogStash::WebServer.from_settings(@logger, self, settings)

Expand Down
1 change: 1 addition & 0 deletions logstash-core/lib/logstash/environment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ module Environment
Setting::Boolean.new("config.reload.automatic", false),
Setting::TimeValue.new("config.reload.interval", "3s"), # in seconds
Setting::Boolean.new("config.support_escapes", false),
Setting::String.new("config.field_reference.escape_style", "NONE", true, %w(NONE PERCENT AMPERSAND)),
Setting::Boolean.new("metric.collect", true),
Setting::String.new("pipeline.id", "main"),
Setting::Boolean.new("pipeline.system", false),
Expand Down
5 changes: 5 additions & 0 deletions logstash-core/lib/logstash/runner.rb
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,11 @@ class LogStash::Runner < Clamp::StrictCommand
:default => LogStash::SETTINGS.get_default("config.string"),
:attribute_name => "config.string"

option ["--field-reference-escape-style"], "STYLE",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got confused that this does not take effect when using irb e.g.
bin/logstash --field-reference-escape-style PERCENT -i irb but that's for a separate issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in d58447d by moving the setting's application from the agent (which isn't started for shell sessions) to the runner before shell sessions are invoked.

I18n.t("logstash.runner.flag.field-reference-escape-style"),
:default => LogStash::SETTINGS.get_default("config.field_reference.escape_style"),
:attribute_name => "config.field_reference.escape_style"

# Module settings
option ["--modules"], "MODULES",
I18n.t("logstash.runner.flag.modules"),
Expand Down
29 changes: 29 additions & 0 deletions logstash-core/locales/en.yml
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,31 @@ en:
"%{default_output}"
If you wish to use both defaults, please use
the empty string for the '-e' flag.
field-reference-escape-style: |+
Use the given STYLE when parsing field
references. This allows you to reference fields
whose name includes characters that are
meaningful in a field reference including square
brackets (`[` and `]`).

This feature is EXPERIMENTAL, and implementations
are subject to change.

Available escape styles are:
- `NONE`: escape sequences in field references
are not processed, which means fields that
contain special characters cannot be
referenced.
- `PERCENT`: characters may be encoded with
URI-style percent notation represeting UTF-8
bytes (`[` is `%5B`; `]` is `%5D`).
Unlike URI-encoding, literal percent characters
do not need to be escaped unless followed by a
sequence of 2 capital hexadecimal characters.
- `AMPERSAND`: characters may be encoded with
HTML-style ampersand-hash encoding notation
representing decimal unicode codepoints
(`[` is `&#91;`; `]` is `&#93;`).
modules: |+
Load Logstash modules.
Modules can be defined using multiple instances
Expand Down Expand Up @@ -431,4 +456,8 @@ en:
ambiguous: >-
Both `%{canonical_name}` and its deprecated alias `%{deprecated_alias}` have been set.
Please only set `%{canonical_name}`
experimental:
set: >-
The setting `%{canonical_name}` is EXPERIMENTAL and its implementation is subject to change
in a future release of Logstash

14 changes: 14 additions & 0 deletions logstash-core/spec/logstash/event_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,15 @@
expect { e.set('[', 'value') }.to raise_exception(::RuntimeError)
end
end

context 'with map value whose keys have FieldReference-special characters' do
let(:event) { LogStash::Event.new }
let(:value) { {"this[field]" => "okay"} }
it 'sets the value correctly' do
event.set('[some][field]', value.dup)
expect(event.get('[some][field]')).to eq(value)
end
end
end

context "timestamp" do
Expand Down Expand Up @@ -351,6 +360,11 @@
expect(e.timestamp.to_iso8601).to eq("2016-05-28T23:02:05.350Z")
end

it 'accepts maps whose keys contain FieldReference-special characters' do
e = LogStash::Event.new({"nested" => {"i][egal" => "okay"}, "n[0]" => "bene"})
expect(e.get('nested')).to eq({"i][egal" => "okay"})
expect(e.to_hash).to include("n[0]" => "bene")
end
end

context "method missing exception messages" do
Expand Down
49 changes: 43 additions & 6 deletions logstash-core/src/main/java/org/logstash/ConvertedMap.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.jruby.RubyHash;
import org.jruby.RubyString;
import org.jruby.runtime.ThreadContext;
Expand All @@ -35,15 +36,51 @@
* as key.</p>
* <p>The {@code put} method will work with any {@link String} key but is only intended for use in
* situations where {@link ConvertedMap#putInterned(String, Object)} would require manually
* interning the {@link String} key. This is due to the fact that we use our internal
* {@link FieldReference} cache to get an interned version of the given key instead of JDKs
* {@link String#intern()}, which is faster since it works from a much smaller and hotter cache
* in {@link FieldReference#CACHE} than using String interning directly.</p>
* interning the {@link String} key.
* This is due to the fact that it is based on {@link IdentityHashMap}, and we rely on the String
* intern pool to ensure identity match of equivalent strings.
* For performance, we keep a global cache of strings that have been interned for use with {@link ConvertedMap},
* and encourage interning through {@link ConvertedMap#internStringForUseAsKey(String)} to avoid
* the performance pentalty of the global string intern pool.
*/
public final class ConvertedMap extends IdentityHashMap<String, Object> {

private static final long serialVersionUID = 1L;

private static final ConcurrentHashMap<String,String> KEY_CACHE = new ConcurrentHashMap<>(100, 0.2F, 16);

/**
* Returns an equivalent interned string, possibly avoiding the
* global intern pool.
*
* @param candidate the candidate {@link String}
* @return an interned string from the global String intern pool
*/
static String internStringForUseAsKey(final String candidate) {
// TODO: replace with LRU cache and/or isolated intern pool
final String cached = KEY_CACHE.get(candidate);
if (cached != null) { return cached; }

final String interned = candidate.intern();
if (KEY_CACHE.size() <= 10_000 ) {
KEY_CACHE.put(interned, interned);
}
return interned;
}

/**
* Ensures that the provided {@code String[]} contains only
* instances that have been {@link ConvertedMap::internStringForUseAsKey},
* possibly replacing entries with equivalent interned strings.
*
* @param candidates an array of non-null strings
*/
static void internStringsForUseAsKeys(final String[] candidates) {
for (int i = 0; i < candidates.length; i++) {
candidates[i] = internStringForUseAsKey(candidates[i]);
}
}

private static final RubyHash.VisitorWithState<ConvertedMap> RUBY_HASH_VISITOR =
new RubyHash.VisitorWithState<ConvertedMap>() {
@Override
Expand Down Expand Up @@ -91,7 +128,7 @@ public static ConvertedMap newFromRubyHash(final ThreadContext context, final Ru

@Override
public Object put(final String key, final Object value) {
return super.put(FieldReference.from(key).getKey(), value);
return super.put(internStringForUseAsKey(key), value);
}

/**
Expand All @@ -118,6 +155,6 @@ public Object unconvert() {
* @return Interned String
*/
private static String convertKey(final RubyString key) {
return FieldReference.from(key).getKey();
return internStringForUseAsKey(key.asJavaString());
}
}
66 changes: 47 additions & 19 deletions logstash-core/src/main/java/org/logstash/FieldReference.java
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,10 @@
import java.util.Map;

import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.jruby.RubyString;
import org.logstash.util.EscapeHandler;

/**
* Represents a reference to another field of the event {@link Event}
Expand All @@ -44,19 +47,39 @@ public static class IllegalSyntaxException extends RuntimeException {
}
}

private static EscapeHandler ESCAPE_HANDLER = EscapeHandler.NONE;

public static void setEscapeStyle(final String escapeStyleSpec) {
final EscapeHandler newEscapeHandler;
switch(escapeStyleSpec) {
case "NONE":
newEscapeHandler = EscapeHandler.NONE;
break;
case "PERCENT":
newEscapeHandler = EscapeHandler.PERCENT;
break;
case "AMPERSAND":
newEscapeHandler = EscapeHandler.AMPERSAND;
break;
default:
throw new IllegalArgumentException(String.format("Invalid escape style: `%s`", escapeStyleSpec));
}
ESCAPE_HANDLER = newEscapeHandler;
yaauie marked this conversation as resolved.
Show resolved Hide resolved
}

/**
* This type indicates that the referenced that is the metadata of an {@link Event} found in
* {@link Event#metadata}.
* {@link Event#getMetadata()}.
*/
public static final int META_PARENT = 0;

/**
* This type indicates that the referenced data must be looked up from {@link Event#metadata}.
* This type indicates that the referenced data must be looked up from {@link Event#getMetadata()}.
*/
public static final int META_CHILD = 1;

/**
* This type indicates that the referenced data must be looked up from {@link Event#data}.
* This type indicates that the referenced data must be looked up from {@link Event#getData()}.
*/
private static final int DATA_CHILD = -1;

Expand All @@ -72,15 +95,6 @@ public static class IllegalSyntaxException extends RuntimeException {
*/
private static final StrictTokenizer TOKENIZER = new StrictTokenizer();

/**
* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
*/
public static final FieldReference TIMESTAMP_REFERENCE =
deduplicate(new FieldReference(EMPTY_STRING_ARRAY, Event.TIMESTAMP, DATA_CHILD));

private static final FieldReference METADATA_PARENT_REFERENCE =
new FieldReference(EMPTY_STRING_ARRAY, Event.METADATA, META_PARENT);

/**
* Cache of all existing {@link FieldReference} by their {@link RubyString} source.
*/
Expand All @@ -93,6 +107,11 @@ public static class IllegalSyntaxException extends RuntimeException {
private static final Map<String, FieldReference> CACHE =
new ConcurrentHashMap<>(64, 0.2F, 1);

/**
* Unique {@link FieldReference} pointing at the timestamp field in a {@link Event}.
*/
public static final FieldReference TIMESTAMP_REFERENCE = FieldReference.from(Event.TIMESTAMP);

private final String[] path;

private final String key;
Expand All @@ -106,9 +125,10 @@ public static class IllegalSyntaxException extends RuntimeException {
private final int type;

private FieldReference(final String[] path, final String key, final int type) {
this.key = key;
this.type = type;
this.key = ConvertedMap.internStringForUseAsKey(key);
ConvertedMap.internStringsForUseAsKeys(path);
this.path = path;
this.type = type;
hash = calculateHash(this.key, this.path, this.type);
}

Expand Down Expand Up @@ -219,12 +239,14 @@ private static FieldReference parseToCache(final String reference) {
}

private static FieldReference parse(final CharSequence reference) {
final List<String> path = TOKENIZER.tokenize(reference);
final List<String> path = TOKENIZER.tokenize(reference).stream()
.map(ESCAPE_HANDLER::unescape)
.collect(Collectors.toList());

final String key = path.remove(path.size() - 1).intern();
final String key = path.remove(path.size() - 1);
final boolean empty = path.isEmpty();
if (empty && key.equals(Event.METADATA)) {
return METADATA_PARENT_REFERENCE;
return new FieldReference(EMPTY_STRING_ARRAY, key, META_PARENT);
} else if (!empty && path.get(0).equals(Event.METADATA)) {
return new FieldReference(
path.subList(1, path.size()).toArray(EMPTY_STRING_ARRAY), key, META_CHILD
Expand All @@ -240,7 +262,11 @@ private static FieldReference parse(final CharSequence reference) {
**/
private static class StrictTokenizer {

public List<String> tokenize(CharSequence reference) {
/**
* @param reference a sequence of characters representing a reference to a field
* @return a list of string path fragments.
*/
public List<String> tokenize(final CharSequence reference) {
ArrayList<String> path = new ArrayList<>();
final int length = reference.length();

Expand Down Expand Up @@ -286,7 +312,7 @@ public List<String> tokenize(CharSequence reference) {

if (splitPoint < i) {
// if we have something to add, add it.
path.add(reference.subSequence(splitPoint, i).toString().intern());
path.add(reference.subSequence(splitPoint, i).toString());
}

depth--;
Expand All @@ -309,7 +335,9 @@ public List<String> tokenize(CharSequence reference) {
// if we saw no brackets, this is a top-level reference that can be emitted as-is without
// further processing
path.add(reference.toString());
path.trimToSize();
yaauie marked this conversation as resolved.
Show resolved Hide resolved
return path;

} else if (depth > 0) {
// when we hit the end-of-input while still in an open bracket, we have an invalid field reference
potentiallyAmbiguousSyntaxDetected = true;
Expand Down
Loading