Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type mappings transformation #1154

Merged

Conversation

gregschohn
Copy link
Collaborator

@gregschohn gregschohn commented Nov 20, 2024

Description

  • Category - Enhancement, New feature
  • Why these changes are required? - To allow users to more easily upgrade from an <= ES 6.8 cluster by remapping type mappings to new indices.
  • What is the old behavior before changes and new behavior after changes? New behaviors are only present if the type mapping sanitization transformer is specified at the command line of the replayer (TODO fill in the invocation string later).

These transformations are implemented via jinjava transformations that are configured with some lightweight json as described in the README. Notice that each type of request can be suppressed via feature flags, which are also an option that can be passed to the provider. That allows users to opt-in to some transformations that might have value while not being forced to adopt every type of request transformation that this package does. Users still have the ability to chain these transformations alongside any others using the existing transformation plugin framework.

To use type mapping transforms for document backfill (RfsMigrateDocuments) or TrafficReplayer, add a command line option (base64 & from a file work too w/ those analog parameters)

Replayer Arguments Description Resolved Target Index for Outgoing Traffic
--transformer-config [{"TypeMappingSanitizationTransformerProvider":{"sourceProperties":{"version":{"major":6,"minor":8}}}}] Uses default transformations to make separate target indices for each type SOURCEINDEX_SOURCETYPE
--transformer-config [{"TypeMappingSanitizationTransformerProvider":{"regexMappings":[["",".*",""]],"sourceProperties":{"version":{"major":6,"minor":8}}}}] When resolving the target index name, this drops the type component from the source requests SOURCEINDEX

To set the transformer configuration for RfsMigrateDocuments, replace --transformer-config with --doc-transformer-config.

Issues Resolved

This replayer jira task, with a large dose of scope creep to bring jinjava transforms and ones for RFS into the fold.

Is this a backport? If so, please add backport PR # and/or commits #

Testing

Only CICD checks so far, which don't yet have sufficient coverage for 6.8 and earlier sources yet. I'm trying to set that up further now.

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • [partial] New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…g w/ Jinjava.

Major things to figure out: 1) how to deal with compositing parts of jinja templates (maybe it's just snippets vended as resources and a user writes their own top-level template) and 2) how to suppress reading any part of the json body when only the headers need to be transformed to prevent parsing the json/ndjson for every request through the system (probably minor though since most setups will probably be doing full-transforms on the vast majority of data flowing).

Signed-off-by: Greg Schohn <[email protected]>
… so that they can fit w/in the transformation ecosystems.

Signed-off-by: Greg Schohn <[email protected]>
This supports preparsing templates and loading default or specific versions.  Those features are likely now in flux to be replaced with a better UX.

Signed-off-by: Greg Schohn <[email protected]>
…ts of jinjava transforms (see 'is_enabled').

These changes will reduce the efficiency of this type of transform, but w/out it, it isn't clear how we'll utilize dynamic typing.  My hope is that template logic is small relative to documents - and compute is cheap, so these are good tradeoffs to make.

Signed-off-by: Greg Schohn <[email protected]>
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.46%. Comparing base (fc62f57) to head (00968cd).
Report is 32 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1154      +/-   ##
============================================
- Coverage     80.93%   80.46%   -0.48%     
- Complexity     2995     3068      +73     
============================================
  Files           407      421      +14     
  Lines         15241    15545     +304     
  Branches       1021     1047      +26     
============================================
+ Hits          12336    12508     +172     
- Misses         2277     2391     +114     
- Partials        628      646      +18     
Flag Coverage Δ
unittests 80.46% <ø> (-0.48%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…s after the templating...

That preserve splicing not only improves efficiency, it also lets us pass messages through a template that might have non-textual data, like the ByteBufs, etc.

Signed-off-by: Greg Schohn <[email protected]>
…macro by name and invoke it since jinjava can't deal with invoking macros with dynamic names.

There are some rough tests in place that don't fully work.  The original ambition was to allow a user to specify both a route and an action for a given map in a macro but since macros only output rendered content, they're not well-suited to chain complex data (like regex matches) between macros.  As such, there will be significant design changes to the jinja routing mechanisms.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the TypeMappingsTransformation branch from e68a155 to e8dfb3f Compare November 24, 2024 04:10
Switched out Google's re2j for Java's implementation.  I didn't realize that the java implementation is closer to python regexes and that re2j (which I had thought was closer to python).
Minor tweaks to the route jinja template to make it easier to read when it's invoked.
Switched the replayer template to use the route functionality, which is considerably easier to get one's head around.
TODO -
* I'd like to make the feature flag names the same as the macro name for the route when the feature flag name is omitted.
* The test cases for the type mappings sanitization need a lot of work.
* When indices are created that DO NOT use type mappings, we should handle those more gracefully.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the TypeMappingsTransformation branch from b1f3756 to 93dc5c8 Compare November 24, 2024 20:48
…gexes and the target index can be specified with backreferences.

Notice that there's no way to discard data with regexes like there is with the type mappings (and the static type mappings do yield to regexes when no match was found).
Other changes:
The regex capture filter now has a guava cache for compiled patterns.
When static and regex mappings are null, we'll use a default (.*), (.*) -> \1_\2 mapping.
Enhanced some tests to pick up more corner cases.

Signed-off-by: Greg Schohn <[email protected]>
…igration, and regex target index patterns.

Fix regex replace to use java's regex classes instead of the builtin ones.  The builtin filter uses Google's re2j library, which doesn't do backreferences.  Now both capture and replace use custom filters that use the builtin java regex library.  Tests are in place to do some index remapping w/ backreferenced captures.
For bulk APIs, only the delete command is fully supported, but the structure for the others should be in place.
I've also worked on improving the test for creating an index and fixed numerous issues there.

It's also CRITICAL to note that constructing dictionaries in jinjava with keys that are defined via variables is NOT SUPPORTED.  See HubSpot/jinjava#379.  The workaround that I'm currently using is to construct a map as a string in json and parse it into a map - or to construct maps dynamically.  Also note that python dictionary operations like 'update' or 'delete' are NOT present since we don't have the luxury of overriding the dictionary implementation easily.  This is also something that has a significant impact on the ease of use for jinjava and to maintain compatibility to the python implementation.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the TypeMappingsTransformation branch from cc7651e to 0712eae Compare December 2, 2024 16:25
…nsformations. All unit tests pass.

Signed-off-by: Greg Schohn <[email protected]>
Cleanup a number of linting issues, including removing the old openSearch23PlusTargetTransformer code.  That was for example purposes, but the newer jinja versions are more complete and easier to express/understand.
Removing the old transfomer required a couple other changes.
1) When the TypeMappingSanitizationTransformerProvider is invoked with an empty string, that will cause transformers to be created w/ null values for all of the parameters (which means that we'll default to stripping type mappings and sending all of the docs to the same index).
2) The new type mappings transformer doesn't throw an exception for the same types of errant input that the old opensearch23... implementation did.  That caused a test to fall through the type mapping transformation with the same output as its input (missing a headers value), which in turn caused the host rewrite mapping to throw an exception.  When such an occurrence happens later in the pipeline, a TransformationException will be thrown.  However, in this test case, the exception was thrown in the preliminary handler which didn't translate exceptions during transformation into TransformationExceptions.  As long as the exception isn't a PayloadNotLoadedException exception, the code now does that translation.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn marked this pull request as ready for review December 3, 2024 06:30
…equests that don't need to be transformed for type mappings excision.

Improved some tests as well.

Signed-off-by: Greg Schohn <[email protected]>
…son to force it - just the POJO structure matters anyway.

Signed-off-by: Greg Schohn <[email protected]>
… was accessed.

When the PayloadAccessFaultingMap throws a PayloadNotLoadedException, set a flag to show that.  The Preliminary Transformation handler now checks the flag to determine if the payload needs to be parsed instead of relying upon the type of the exception that was thrown.  The exception might be wrapped or eaten any number of ways.  Jinjava does a nice job of packing all of the exception into an aggregate exception w/ line numbers, etc.  It doesn't make sense to throw that away just to get a signal across a number of different unknown boundaries.  Doing the direct connection w/ the flag makes a lot more sense.

Signed-off-by: Greg Schohn <[email protected]>
Copy link
Member

@chelma chelma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1st wave of comments, after having spent some time learning Jinja and looking through the JinjavaTransformer code. I expect I will have more comments as I go through the TypeMappingsSanitizationTransformer itself, next.

}
return value;
}

public boolean missingPaylaodWasAccessed() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: Paylaod

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

]
```

## Final Results
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be a bit more explicit about which of the example mapping directives above this is the final result for? It appears to be:

activity:
  user: any_activity
  post: any_activity

Also - it would be super-helpful to highlight what the "final results" of the other mapping would be:

activity:
  user: new_users
  post: new_posts

@@ -0,0 +1,152 @@
This transformer converts routes for various requests (see below) to indices that used
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be a bit more explicit about the range of possible outcomes from using this transformer? From our conversation yesterday, it seems like we can't use this to create multiple indices on the Target if we're putting a single index on the Source. Is that correct? Or is this only remapping the documents rather than handling index PUTs as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The templates are in place to handle multi-type mappings. I'm working on debugging through all of the bugs with that since the create index call's specification has changed over time.

Comment on lines 72 to 74
findAndReplacePreserves(incomingJson, parsedObj);
findAndReplacePreserves((Map<String, Object>) incomingJson.get(JsonKeysForHttpMessage.PAYLOAD_KEY),
(Map<String, Object>) parsedObj.get(JsonKeysForHttpMessage.PAYLOAD_KEY));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises interesting questions about pre/post processing elsewhere. Example: the user loads up a snapshot that contains an index full of image files in the Transformation Playground. What is displayed/operated on in the Playground? Do we display and operate on the binary of the images? Do we hide or simplify it it? If we hide it, how? And so on...

String.class,
Object[].class
));
jinjava.getGlobalContext().registerTag(new ThrowTag());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got it. So these filters/functions/tags would become part of our formal interface specification to the LLM as options it can use when generating its Jinja, yeah?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one unifies the 'throw' tag between jinja2 and jinjava. Without this tag, valid jinja that throws wouldn't be able to be interpreted with jinjava. So, to answer your question, in some cases we might need to lift this configuration. I hope that we don't have to though so that there's good parity between jinja and jinjava - so that we don't need to think about hosting a python runtime for these.

However, I DO think we'll definitely want to lift the jinja library files into the context so that those snippets can be leveraged and composited to create more consistent solutions. Hopefully, there's less room for error too.

true
{%- else -%}
{%- set ns = namespace(value=features) -%}
{%- set debug = namespace(log=[]) -%}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this debug log get used/printed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we'll accumulate stuff in that list and then do nothing with it. If we choose to, we can emit its value (or derivatives) in the output - which would have an impact on how the result of is_enabled would be interpreted. It's pretty much a one-time use solution - once you emit something and muddy the signal, the rest of the templating will be off.

Another option is to pass arguments into the dynamic function, which gets them back into a java context. From there, we could debug, etc. That might be something very much worth adding (just thought of it).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now have functions to log values through the Slf4j configurations for the jinjava transformer package.

import org.testcontainers.shaded.com.fasterxml.jackson.databind.ObjectMapper;

@Slf4j
class JinjavaTransformerTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments explaining how this test operates would be helpful. I find it hard to parse what was actually being tested, where, and how.

import org.junit.jupiter.api.Test;

@Slf4j
public class RouteTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same request for comments here explaining what this is testing, and how. This is requiring more mental effort to grok than ideal.

Comment on lines +69 to +70
Assertions.assertEquals(1, resultMap.size());
Assertions.assertEquals("_and more!", resultMap.get("matchedVal"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of what I mean - why are these the expected results? Part of this is that the routing macro isn't commented so we have to infer what it does via the tests, but the tests only get you so far in understanding things.

Copy link
Member

@chelma chelma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I like where this code is going. Definitely feels like a step in the right direction, though I'm sure we'll iterate on it a bunch. My biggest concern is finding ways to help make the Java/Jinja interface more understandable. It's hard to piece things together just from looking at either the Java or the Jinja; comments would go a long way, but risk becoming out of date. There might be room for more formal API between the two layers to enforce a contact (I have no clue what this would look like, just spitballing here). It's hard to understand what the Jinja side is doing without knowing everything the Java side is passing it.

Comment on lines 8 to 13
{%- set source_and_mappings = {
'request': request,
'index_mappings': index_mappings,
'regex_index_mappings': regex_index_mappings}
-%}
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place I can see the full context that will be present when this template is executed? It feels like we add bits and bobs to it in a few places (the JinjavaTransformer class, the JsonTypeMappingSanitizationTransformer class, and I guess whatever ultimately kicks off the transform). It would be great to define somewhere everything that is in the context and what its purpose is, like an API spec for the transformation templating.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can print out the full map when we start the template.
However, what's required of the map will be a the decision of the caller and it might even be deferred all the way until runtime. The map needs to correspond to the template. Both are parameters, so they can be anything. That's also why things are being added in pieces - though the lambda sent into the Jinjava transformer should be picking up most of that load.

@@ -0,0 +1,22 @@
{%- import "common/featureEnabled.j2" as fscope -%}
{%- import "common/featureEnabled.j2" as fscope -%}
{%- macro route(input, field_to_match, feature_flags, default_action, routes) -%}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comments explaining the overall function behavior as well as its fields, their expected types, etc like you would for an API spec?

{%- macro route(input, field_to_match, feature_flags, default_action, routes) -%}
{%- set ns = namespace(result=none, matched=false) -%}
{%- for pattern, action_fn, feature_name_param in routes if not ns.matched -%}
{%- set feature_name = feature_name_param | default(action_fn) -%}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was hard for me to follow what this feature_name is due to the lack of API spec comments, as requested above.

-%}
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json',
[
('(?:PUT|POST) /([^/]*)/([^/]*)/(.*)', 'rewrite_doc_request', 'rewrite_add_request_to_strip_types'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I going crazy, or is rewrite_add_request_to_strip_types not defined anywhere else in the PR? At least, CMD+F isn't finding it...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the name of the feature. If a user wanted to suppress it, that's the value that they'd reference in the feature map.

Copy link
Collaborator Author

@gregschohn gregschohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. On the better documentation, I'll loop back on that after I stabilize the last couple bugs and features that I've tossed in. In the meantime, here's some of the responses to the feedback.

true
{%- else -%}
{%- set ns = namespace(value=features) -%}
{%- set debug = namespace(log=[]) -%}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now have functions to log values through the Slf4j configurations for the jinjava transformer package.

}
return value;
}

public boolean missingPaylaodWasAccessed() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,152 @@
This transformer converts routes for various requests (see below) to indices that used
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The templates are in place to handle multi-type mappings. I'm working on debugging through all of the bugs with that since the create index call's specification has changed over time.

Comment on lines 8 to 13
{%- set source_and_mappings = {
'request': request,
'index_mappings': index_mappings,
'regex_index_mappings': regex_index_mappings}
-%}
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json',
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can print out the full map when we start the template.
However, what's required of the map will be a the decision of the caller and it might even be deferred all the way until runtime. The map needs to correspond to the template. Both are parameters, so they can be anything. That's also why things are being added in pieces - though the lambda sent into the Jinjava transformer should be picking up most of that load.

-%}
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json',
[
('(?:PUT|POST) /([^/]*)/([^/]*)/(.*)', 'rewrite_doc_request', 'rewrite_add_request_to_strip_types'),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the name of the feature. If a user wanted to suppress it, that's the value that they'd reference in the feature map.

Copy link
Member

@chelma chelma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid start, makes good progress to where we want to be. Per discussion w/ @gregschohn, there's a lot of future work to make the mechanics simpler and clearer as we iterate and improve the overall system. Approving this change despite the outstanding request for comments/docs as this is PR is blocking priority work, we expect the items I requested more comments on to change substantively with future revisions, and there's a performance cost to just sticking comments in Jinja without preprocessing to strip them out at runtime.

…d other findings along the way

* JsonAccumulator (replayer) no longer throws when numeric values are > 32 bit ranges - now we use longs and doubles to parse the string sequences into, so some tuple transformation exceptions cease.
* regex replace filters now can be configured via the jinjava config to specify what format the replacement strings should be in.  The format is controlled by a series of regex replacements on the replacement string itself - e.g. to convert \\# to $# so that python-style regexes can be used.  Notice that this was done the way that it was so that it's a easier to override and let users tweak for their own specific needs.  Things get really sticky on malformed input and I don't have an interest in fully supporting that we behave the same on malformed escape sequences.  Letting users have the option to specify the exact escaping seems like it's a hedge.
* template files (resources) are still loaded from the jarfile, but users can add more templates and override existing ones.  New templates can be passed through the json config to the provider in a simple key->template dictionary.
* VARIANTs of top-level templates are gone.  Now there's a top-level transformByType.j2 that does some tests on the incoming document and routes switches for the replayer template or the document backfill one (implementation and tests are still pending).  That simplifies contextual awareness where this transformer no longer needs to have it.  To round that change out, replayer.j2 was renamed httpRequests.j2 since that's a more accurate name now.
* log_value and log_value_and_return are bound for jinjava templates to log through Slf4j.
* There's still more work to be done in rewriteCreateIndexRequest to work w/ various versions of ES so that we can figure out when type mappings should/shouldn't be present and do the right thing.  Now that we have source properties, that's just a matter of pulling a field and writing some more template code.
* Converted all of the test indices and types to NOT include upper-case characters to avoid confusion and exercising transforms in ways that they'll never need to be tested.

Signed-off-by: Greg Schohn <[email protected]>
…-type rewrite rules from the type mappings sanitization transformer.

Signed-off-by: Greg Schohn <[email protected]>
I still need to do a sweep over the tests to make sure none of the tests are miswired

Signed-off-by: Greg Schohn <[email protected]>
…y wasn't in the target's doc.

That makes it a bit easier to write '"preserveWhenMissing": "*"' and not have to spend us much effort to be more specific.

Signed-off-by: Greg Schohn <[email protected]>
… file + logging improvements & cleanup.

Most useful logging improvement - add the URI of the source request to the end of the progess log's lines.  E.g

2024-12-09 14:19:34,024: 921, 0242acfffe120008-00000007-00000065-c88ca66f1bf2aa5a-a978504d.16, 2024-12-09T14:19:33.757503Z, 2872/2890, 200/500,500,500,500, 2866/486,486,486,486, 9/3,2,2,3, /_search/scroll
2024-12-09 14:19:41,602: 950, 0242acfffe120008-00000007-00000065-c88ca66f1bf2aa5a-a978504d.17, 2024-12-09T14:19:33.767614Z, 2862/2880, 200/404,404,404,404, 119/125,125,125,125, 10/1,1,1,1, /_search/scroll

Signed-off-by: Greg Schohn <[email protected]>
…gPayload to opt-in rather than opt-out.

HttpJsonMessageWithFaultingPayload throws by default - so it could be run even if it wasn't within a transform - like from within a LoggingHandler, which is what I was observing when I added some more logging.  The worst part was that it created other errors which then caused the message to be processed in a very different way.

Signed-off-by: Greg Schohn <[email protected]>
…ces instead of '$' ones.

Backslashes are the default behavior.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/datahandlers/http/NettyDecodedHttpRequestPreliminaryTransformHandler.java
…ard.

Notice that headers are required to contain 'host' for http/1.1, so strictly speaking, copying headers AND protocol should be safer so that we stay in compliance.

Signed-off-by: Greg Schohn <[email protected]>
Comment on lines +46 to +55
private static final LoadingCache<ReplacementAndTransform, String> replacementCache =
CacheBuilder.newBuilder().build(CacheLoader.from(rat -> {
var r = rat.replacement;
if (rat.substitutions != null) {
for (var kvp : rat.substitutions) {
r = r.replaceAll(kvp.getKey(), kvp.getValue());
}
}
return r;
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a bit of research and the caching makes a lot of sense. I'd be super-curious to see the performance difference w/ vs. w/o this caching on a heavy workload.

Comment on lines +21 to +26
// other things to try
// Assertions.assertEquals("Pattern\\$$1", makeReplacementFromKnownMatch("\\2\\$$\\1"));
// Assertions.assertEquals("$1\\$amount", "\\1$amount", replacement);
// "\\\\1$50", // -> \\1\$50
// "\\\\\\1$total$", // -> \\$1\$total\$
// "cost$1$price$2" // -> cost$1\$price$2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker - Did you mean to leave these commented out?

Map<String, Map<String, String>> indexMappingsIncoming,
List<List<String>> regexIndexMappingsIncoming)
{
var featureFlags = featureFlagsIncoming != null ? featureFlagsIncoming : Map.of();
var indexMappings = indexMappingsIncoming != null ? indexMappingsIncoming : Map.of();
// By NOT including a backreference, we're a bit more efficient, but it also lets us be agnostic to what
// types of patterns are being used.
// This regex says, match the type part and reduce it to nothing, leave the index part untouched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker

I've spent a bit trying to understand the wider implications of this default, and I'm still not groking. Can you please expand on the comments here and explore how this default fits into the bigger picture?

@NoArgsConstructor
@AllArgsConstructor
public class SourceProperties {
private String type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the type be a String or an enumeration? It feels like we want a constrained and obvious set of values for this one.

@@ -0,0 +1,19 @@
{# see https://github.com/opensearch-project/opensearch-migrations/pull/1110 for the format of these messages #}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not have more formal documentation?


{%- set type_name = parameters['_type'] -%}
{%- if type_name -%}
{%- set target_index = transidx.convert_source_index_to_target(parameters['_index'], type_name, input_map.index_mappings, input_map.regex_index_mappings) if type_name -%}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why check if the type_name is defined again?

@@ -0,0 +1,19 @@
{# see https://github.com/opensearch-project/opensearch-migrations/pull/1110 for the format of these messages #}
{%- include "typeMappings/rewriteBulkRequest.j2" -%}
{%- import "typeMappings/rewriteIndexForTarget.j2" as transidx -%}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question - I'm curious if you've given thought to how "default" transformations will be incorporated into the wider story we'll have in 6+ months from now. Like - will part of our assessment process be figuring out if the pre-canned transforms are sufficient for the Cx index vs. needing some manual/GenAI transforms? Just thinking ahead a bit, and there seems to be complexity in the UX and design we're gonna need to tackle.

Comment on lines +48 to +57
{%- macro uses_type_names(input_map) -%}
{%- set uri_flag_match = input_map.request.URI | regex_capture("[?&]include_type_name=([^&#]*)") -%}
{%- if uri_flag_match -%}
{{- uri_flag_match.group1 | lower -}}
{%- else -%}
{%- set major_version = ((input_map.properties | default({})).version | default({})).major -%}
{%- if major_version >= 7-%}
false
{%- elif major_version <= 6 -%}
true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much do you expect we'll need to embed the conditional combinatoric complexity of source version -> target version mappings into our Jinja logic longer term? I wonder if we're better off just having GenAI make these transformations as they are needed as per-migration "one-offs" rather than try to handle it all in our pre-canned transforms.

Taking it a step further... are we making our lives harder by having these hand-built, pre-canned transforms, longer-term?

"}";

var expectedString = "{\n" +
" \"index\": { \"_index\": \"network\", \"_id\": \"1\" },\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker

This was a bit surprising, assuming this is the default behavior. I'd have expected us to have the transformed index name be performance_network based on prior conversation/context. Curious to hear a bit more about the motivation of this design choice, which seems to erase a clear link to the original index and set us up for a bigger chance of collisions (multiple indices could have types with the same name, and only by prefixing the new index name with the original would you keep those documents from all ending up in the same index on the target).

}

@Test
public void testTypeMappingsWithSourcePropertiesWorks() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to understand the difference between this and the testMappingWithoutTypesAndLatestSourceInfoDoesNothing test in terms of what they're verifying. I see that we're passing the regex in here and the version is different, but the result is the same either way? Maybe I'm missing something here...

@gregschohn gregschohn force-pushed the TypeMappingsTransformation branch from 8b4df23 to c13ef4a Compare December 10, 2024 06:43
@gregschohn gregschohn merged commit 6d3b225 into opensearch-project:main Dec 10, 2024
22 checks passed
@gregschohn gregschohn deleted the TypeMappingsTransformation branch December 10, 2024 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants