Bugfix/202 transform date add date conversion #622

Conversation

stevanbz
Contributor

@stevanbz stevanbz commented Dec 8, 2022

Issue #, if available:
#202

Description of changes:

This PR creates the target index mapping for date fields according to the following rules:

  • If a date field is used in a terms aggregation/grouping, the target date field is mapped with the default date format (strict_date_optional_time||epoch_millis).
  • If a date field is used in a metric aggregation (MIN, MAX, COUNT), the target date field is likewise mapped with the default date format (strict_date_optional_time||epoch_millis).
  • Terms aggregation values for date fields are returned by the cluster as epoch milliseconds; once retrieved, they are automatically formatted into a human-readable ISO 8601 format (uuuu-MM-dd'T'HH:mm:ss.SSSZZ).
  • If metric aggregations (MAX, MIN, COUNT) are applied to a date field in a transform job, the same formatting is applied when the values are retrieved, since the cluster returns epoch milliseconds once the aggregation function is applied to the field.
  • A format can now be specified when date_histogram is used.
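
As an illustration of the epoch-millis conversion described in the rules above, here is a minimal Kotlin sketch. It is not the code from this PR: `formatEpochMillis` and `TARGET_DATE_PATTERN` are hypothetical names, and the sketch assumes the pattern is interpreted with `java.time` semantics and rendered in UTC.

```kotlin
import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Pattern quoted in the PR description; interpreting it with java.time is an assumption.
val TARGET_DATE_PATTERN = "uuuu-MM-dd'T'HH:mm:ss.SSSZZ"

// Hypothetical helper: converts an epoch-millis aggregation value to a formatted UTC string.
fun formatEpochMillis(epochMillis: Long, pattern: String = TARGET_DATE_PATTERN): String {
    val formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC)
    return formatter.format(Instant.ofEpochMilli(epochMillis))
}
```

For example, `formatEpochMillis(1661344590443L)` yields `2022-08-24T12:36:30.443+0000`, the same instant as the `key_as_string` value shown in the terms aggregation response later in this description.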

@bowenlan-amzn
Here is some contextual background:

When a transform groups by a date field as a term, the request is translated into a composite aggregation request that is executed against the cluster (on the transform side), i.e.:

"groups": [
    {
      "terms": {
        "source_field": "@timestamp",
        "target_field": "our_custom_target_date"
      }
    }
  ]...

is translated to:

 "aggregations":{
      "some_test_agg":{
         "composite":{
            "size":10,
            "sources":[
               {
                  "our_custom_target_date":{
                     "terms":{
                        "field":"@timestamp",
                        "missing_bucket":true,
                        "order":"asc"
                     }
                  }
               }
            ]
         }
      }
 }
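
The translation above can be sketched roughly as follows. This is a simplified illustration built on plain Kotlin maps, not the plugin's actual implementation: `TermsGroup`, `toCompositeSource`, and `toCompositeAggregation` are hypothetical names introduced only for this example.

```kotlin
// Hypothetical stand-in for a transform "terms" group; not the plugin's actual class.
data class TermsGroup(val sourceField: String, val targetField: String)

// Builds one entry of the composite aggregation's "sources" array for a terms group.
fun toCompositeSource(group: TermsGroup): Map<String, Any> =
    mapOf(
        group.targetField to mapOf(
            "terms" to mapOf(
                "field" to group.sourceField,
                "missing_bucket" to true,
                "order" to "asc"
            )
        )
    )

// Wraps the sources in a composite aggregation body with the given page size.
fun toCompositeAggregation(name: String, groups: List<TermsGroup>, size: Int = 10): Map<String, Any> =
    mapOf(
        name to mapOf(
            "composite" to mapOf(
                "size" to size,
                "sources" to groups.map { toCompositeSource(it) }
            )
        )
    )
```

Note that the terms source carries only the field, the missing-bucket flag, and the order; there is no place for a date format, which is the limitation discussed next.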

The first thing to notice is that a composite aggregation does not allow specifying a date format for a terms source when grouping the data into buckets. On the other hand, the cluster can return the "original" field value as a string if a regular terms grouping is used, or if an aggregation function (such as min/max) is applied:

"aggs": {
    "custom_target_field": {
      "terms": {
        "field": "@timestamp",
        "size": 10
      }
    }
  }

The cluster then returns:

"buckets" : [
        {
          "key" : 1661344590443,
          "key_as_string" : "2022-08-24T12:36:30.443Z",
          "doc_count" : 7
        }

Long story short, the cluster "flattens" the field values to epoch millis and then does the grouping. With a composite aggregation, the format cannot be specified UNLESS date_histogram is used. With a "normal" terms grouping, the cluster returns the key_as_string value by default.

This PR enables specifying a format on date_histogram:

"date_histogram": {
     "source_field": "timestamp",
     "target_field": "hour_grouping",
     "calendar_interval": "hour",
     "format": "yyyy-MM-dd"
   }

Caveat:
When defining the interval and format for a date_histogram, the user must be aware of how the two interact. The transform creates buckets, and each page of results returns an after_key, which marks the bucket where processing ended and is later used for pagination. If the format is coarser than the interval (for example, bucketing at the hour level but formatting as a date), the format flattens the bucket keys to the date level, so at some point the after_key that is returned is always the same and the cluster cannot tell where to resume.
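
The collapse described in this caveat can be reproduced outside the cluster. The sketch below is only an illustration using `java.time`; `hourlyBucketKeys` and `formatKeys` are hypothetical helpers and do not model the cluster's actual pagination logic.

```kotlin
import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Returns the start times (epoch millis) of `hours` consecutive hourly buckets.
fun hourlyBucketKeys(start: Instant, hours: Int): List<Long> {
    val first = start.truncatedTo(ChronoUnit.HOURS)
    return (0 until hours).map { first.plus(it.toLong(), ChronoUnit.HOURS).toEpochMilli() }
}

// Formats each bucket key with the given pattern in UTC.
fun formatKeys(keys: List<Long>, pattern: String): List<String> {
    val formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC)
    return keys.map { formatter.format(Instant.ofEpochMilli(it)) }
}
```

With 24 hourly buckets starting at `2022-08-24T00:00:00Z`, `formatKeys(keys, "yyyy-MM-dd")` produces the single value `2022-08-24` for every bucket, whereas a pattern that keeps the hour, such as `yyyy-MM-dd'T'HH`, keeps all 24 keys distinct.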

Besides the date format for date_histogram, the idea was to create the target index mapping with date field types as the field definitions when grouping by a date is applied.

A further improvement would be to take into account the date format used for a date field in the source index. A user can specify multiple formats, so when the transform executes we would need to figure out in which format the date is stored, and possibly use that format for storing and formatting the target index date field once the cluster returns the value.

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@stevanbz stevanbz requested a review from a team December 8, 2022 22:03
@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch 5 times, most recently from 0d50db4 to 77d4499 Compare December 9, 2022 18:18
@codecov-commenter

codecov-commenter commented Dec 9, 2022

Codecov Report

Merging #622 (e470657) into main (d921f0b) will decrease coverage by 0.02%.
The diff coverage is 74.37%.

❗ Current head e470657 differs from pull request most recent head 96af8b6. Consider uploading reports for the commit 96af8b6 to get more accurate results

@@             Coverage Diff              @@
##               main     #622      +/-   ##
============================================
- Coverage     75.90%   75.88%   -0.02%     
- Complexity     2811     2849      +38     
============================================
  Files           362      364       +2     
  Lines         16042    16176     +134     
  Branches       2301     2324      +23     
============================================
+ Hits          12176    12275      +99     
- Misses         2535     2563      +28     
- Partials       1331     1338       +7     
Impacted Files Coverage Δ
...pensearch/indexmanagement/IndexManagementPlugin.kt 89.49% <ø> (+0.07%) ⬆️
...h/indexmanagement/transform/util/TransformUtils.kt 0.00% <0.00%> (ø)
...earch/indexmanagement/transform/TransformRunner.kt 82.24% <66.66%> (+0.32%) ⬆️
.../action/preview/TransportPreviewTransformAction.kt 85.71% <72.00%> (-1.25%) ⬇️
...xmanagement/transform/TargetIndexMappingService.kt 72.34% <72.34%> (ø)
...management/common/model/dimension/DateHistogram.kt 77.55% <77.77%> (-0.48%) ⬇️
...arch/indexmanagement/transform/TransformIndexer.kt 69.09% <85.71%> (ø)
...ndexmanagement/transform/TransformSearchService.kt 71.42% <100.00%> (+1.17%) ⬆️
...indexmanagement/transform/util/TransformContext.kt 80.00% <100.00%> (+5.00%) ⬆️

... and 9 files with indirect coverage changes

@lezzago
Member

lezzago commented Apr 6, 2023

PR comments need to be addressed

}
val sourceFieldType = IndexUtils.getFieldFromMappings(dimension.sourceField, sourceIndexMapping)
// Consider only date fields as relevant for building the target index mapping
if (dimension !is DateHistogram && sourceFieldType?.get(TYPE) != null && sourceFieldType[TYPE] == "date") {
Member

Why is DateHistogram excluded here?

Contributor Author

DateHistogram is already supported by default.

The whole idea of this PR is to enable grouping on the timestamp field (which is the initial bug described here).

I listed all the changes I made in the PR here.

So basically, grouping with date_histogram is supported by default; only the format was missing, which I added in this PR (in the PR description I also mentioned what to watch out for regarding the format-interval relation when using date_histogram).

You can also see that the integration tests I added relate to dates used in a terms aggregation. So if we decide to go without this support, I would suggest a complete rework of this PR (removing terms date aggregation support by adding validation that prevents the user from grouping by date; then the question is whether we need to create the target index mapping based on date fields at all).
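
To make the condition under review concrete, here is a simplified sketch of how date mappings for the target index could be derived while skipping DateHistogram dimensions. It is not the code in TargetIndexMappingService.kt: the `Dimension`, `Terms`, and `DateHistogram` classes below are pared-down hypothetical stand-ins, and `buildTargetDateMappings` and the flat `sourceFieldTypes` map are assumptions made for the example.

```kotlin
// Hypothetical, simplified stand-ins for the plugin's dimension classes.
open class Dimension(val sourceField: String, val targetField: String)
class Terms(sourceField: String, targetField: String) : Dimension(sourceField, targetField)
class DateHistogram(sourceField: String, targetField: String) : Dimension(sourceField, targetField)

// Default date format named in the PR description.
val DEFAULT_DATE_FORMAT = "strict_date_optional_time||epoch_millis"

// sourceFieldTypes maps a source field name to its mapping type (for example "date" or "keyword").
// Only non-DateHistogram dimensions on date fields get an explicit date mapping in the target index.
fun buildTargetDateMappings(
    dimensions: List<Dimension>,
    sourceFieldTypes: Map<String, String>
): Map<String, Map<String, String>> =
    dimensions
        .filter { it !is DateHistogram && sourceFieldTypes[it.sourceField] == "date" }
        .associate { it.targetField to mapOf("type" to "date", "format" to DEFAULT_DATE_FORMAT) }
```

In this sketch a terms group on `@timestamp` gets an explicit date mapping, a terms group on a keyword field is left out, and a date_histogram group is skipped because it is handled separately.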

if (!isFieldInMappings(dimension.sourceField, sourceIndexMapping)) {
throw TransformIndexException("Missing field ${dimension.sourceField} in source index")
}
val sourceFieldType = IndexUtils.getFieldFromMappings(dimension.sourceField, sourceIndexMapping)
Member

This seems to be the right place to check and restrict the user to date_histogram grouping only on date fields.

Contributor Author

Check out this comment.
If we decide to support only date_histogram, then this PR should be reworked just to prevent the user from grouping on the timestamp field (since date_histogram is already supported; only the format field is missing, which is also added in this PR).
Pasting Ravi's response from the conversation between Praveen, Ravi, and me:
"
It was a miss during initial implementation to correctly index the date field columns.

I believe the best way to fix this would be to set the mapping types of the target fields to the mapping types of the source fields while creating the target index -

https://github.com/opensearch-project/index-management/blob/main/src/main/kotlin/org/opensearch/indexmanagement/transform/TransformIndexer.kt#L54. At the moment we rely on the cluster to determine the mapping type for target fields.

We do something similar for rollups but instead of setting individual field mapping during rollup index creation we set the mapping type based on field suffix -

https://github.com/opensearch-project/index-management/blob/main/src/main/resources/mappings/opendistro-rollup-target.json

But for transforms we cannot define the types in template since we don’t enforce suffixes.
"

Now, we just need to decide :) @bowenlan-amzn what do you think?

Member

@bowenlan-amzn bowenlan-amzn May 11, 2023

I see. Then I prefer to go with dynamic mapping for transforms and not restrict grouping on date-type fields.

@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch from 7975edd to a755e51 Compare May 11, 2023 18:25
@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch from a755e51 to 20af048 Compare May 29, 2023 10:44
stevanbz added 7 commits May 29, 2023 13:01
…g once the transform is being triggered.

Signed-off-by: Stevan Buzejic <[email protected]>
… index for the date fields when transform is executed

Signed-off-by: Stevan Buzejic <[email protected]>
…d in aggregations or as a term aggregation for defining the buckets

Signed-off-by: Stevan Buzejic <[email protected]>
…nsform preview action is triggered

Signed-off-by: Stevan Buzejic <[email protected]>
Signed-off-by: Stevan Buzejic <[email protected]>
Signed-off-by: Stevan Buzejic <[email protected]>
…ield.

Updated transform preview action to consider target index mapping when using a date field. Kept formatting of the date field in target index.

Signed-off-by: Stevan Buzejic <[email protected]>
@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch from 20af048 to 0f29a0b Compare May 29, 2023 11:14
stevanbz added 2 commits May 29, 2023 13:46
Signed-off-by: Stevan Buzejic <[email protected]>
Signed-off-by: Stevan Buzejic <[email protected]>
@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch from c39b680 to 7e43c7c Compare May 31, 2023 08:22
…form mapping json. Target date field mappings are generated after transform validation when running transform. Removed target index date field values formatting. Removed default format for date_histogram because of the rollup. Updated schema version in test.

Signed-off-by: Stevan Buzejic <[email protected]>
@stevanbz stevanbz force-pushed the bugfix/202-transform-date-add-date-conversion branch from e470657 to 96af8b6 Compare May 31, 2023 13:03
Member

@bowenlan-amzn bowenlan-amzn left a comment

Thanks!

@bowenlan-amzn bowenlan-amzn merged commit 42833b1 into opensearch-project:main May 31, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 31, 2023
* 202: Added format property when specifying the date histogram

Signed-off-by: Stevan Buzejic <[email protected]>

* 202: Added component responsible for building the target index mapping once the transform is being triggered.

Signed-off-by: Stevan Buzejic <[email protected]>

* 202: date_histogram considered in the case of the creating the target index for the date fields when transform is executed

Signed-off-by: Stevan Buzejic <[email protected]>

* 202: Enabled target index date field mappings if those fields are used in aggregations or as a term aggregation for defining the buckets

Signed-off-by: Stevan Buzejic <[email protected]>

* Updated code according to comments. Added targetIndexMapping when transform preview action is triggered

Signed-off-by: Stevan Buzejic <[email protected]>

* Updated schema versions

Signed-off-by: Stevan Buzejic <[email protected]>

* Addressed the comments

Signed-off-by: Stevan Buzejic <[email protected]>

* Refactored transform tests related with aggregation based on a date field.
Updated transform preview action to consider target index mapping when using a date field. Kept formatting of the date field in target index.

Signed-off-by: Stevan Buzejic <[email protected]>

* detekt fix

Signed-off-by: Stevan Buzejic <[email protected]>

* Added zone in IT

Signed-off-by: Stevan Buzejic <[email protected]>

* Added function for creating target index mapping that considers transform mapping json. Target date field mappings are generated after transform validation when running transform. Removed target index date field values formatting. Removed default format for date_histogram because of the rollup. Updated schema version in test.

Signed-off-by: Stevan Buzejic <[email protected]>

---------

Signed-off-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 42833b1)
bowenlan-amzn pushed a commit that referenced this pull request May 31, 2023
Signed-off-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 42833b1)

Co-authored-by: Stevan Buzejic <[email protected]>
petardz pushed a commit to petardz/index-management that referenced this pull request May 31, 2023
Signed-off-by: Stevan Buzejic <[email protected]>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 1, 2023
Signed-off-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 42833b1)
bowenlan-amzn pushed a commit that referenced this pull request Jun 1, 2023
Signed-off-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 42833b1)

Co-authored-by: Stevan Buzejic <[email protected]>
ronnaksaxena pushed a commit to ronnaksaxena/index-management that referenced this pull request Jul 19, 2023
opensearch-project#803)

Signed-off-by: Stevan Buzejic <[email protected]>
(cherry picked from commit 42833b1)

Co-authored-by: Stevan Buzejic <[email protected]>
Signed-off-by: Ronnak Saxena <[email protected]>