Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Date field format parsing for legacy query engine #3160

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

andy-k-improving
Copy link
Contributor

@andy-k-improving andy-k-improving commented Nov 20, 2024

Description

This is a MR, which aim to fix the issue described over #1545 to align the representation of date field between the legacy engine and the v2 engine.

Code changes highlight:

  • Remove the break identifier from DateFieldFormatter, the current behaviour will only parse the first date field from the resultSet, which is not a valid assumption, because OpenSearch as platform allow multiple date fields co-exist on an index, also this is not an uncommon use-case for enterprise. Hence all date fields present on any given index should be formatted regardless of the order.
  • To detect and attempt parse any unix timestamp (second || milisecond) on best-effort basis, OpenSearch as platform allow unix timestamp value to be inserted despite epoch_millis being explicitly specified on field format, in this particular case there is no clear hint from the query indicate this is numeric timestamp from OpenSearch.

Although issue is being reported #1545 under nested-query, however this bug has wider impact outside of nested-query, and the conditions to trigger are:

  • Query that unsupported by the v2 engine and being forwarded to legacy engine
  • Multiple date fields are being specified on the Index schema WITHOUT any explicit format configuration.

When this happen, the current behaviour will only parse the first date field from the resultset.

Reference doc:

Related Issues

Resolves #1545

Testing plan

# Create table
PUT http://localhost:9200/datetype_test?pretty

{
  "mappings": {
    "properties": {
      "dateAsDate": {"type": "date"},
      "longAsDate": {"type": "date"},      
      "id": { "type": "long" },
      "projects": {
        "type": "nested", 
        "properties": {
          "name": { "type": "text", "fields": {"keyword": {"type": "keyword"   }
            }
          }}}}}}

# Insert data
POST http://localhost:9200/datetype_test/_bulk?refresh

{"index":{"_id":"1"}}
{"id":3,"dateAsDate":"2023-01-01T00:00:00", "longAsDate": 1672531200000, "projects":[{"name":"SQL Spectrum querying"},{"name":"SQL security"},{"name":"OpenSearch security"}]}
{"index":{"_id":"2"}}
{"id":4,"dateAsDate":"2024-01-01T00:00:00", "longAsDate": 1704067200000, "projects":[]}
{"index":{"_id":"3"}}
{"id":6,"dateAsDate":"2025-01-01T00:00:00","longAsDate": 1735689600000, "projects":[{"name":"SQL security"},{"name":"Hello security"}]}

#Qeury, both fields should be in simple date format.
POST http://localhost:9200/_plugins/_sql

{
  "query" : "SELECT t.id, t.dateAsDate, t.longAsDate FROM datetype_test AS t, t.projects AS p"
}


Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andy Kwok <[email protected]>
Signed-off-by: Andy Kwok <[email protected]>
@andy-k-improving
Copy link
Contributor Author

@parked-toes, @dai-chen and @anasalkouz
Would you mind to have a look on this?
Thanks!

Signed-off-by: Andy Kwok <[email protected]>
@andy-k-improving
Copy link
Contributor Author

@dai-chen I have fixed the spotlessJavaCheck, and tested it offline.
Would you mind to trigger the build again?
Thanks,

Comment on lines +155 to +157
// It's possible to have date stored in second / millisecond form without explicit
// format hint.
// Parse it on a best-effort basis.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if anything in OpenSearch's DateFormatters can be reused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see there is a suitable method I can call to detect and parse it on best effort basis, if we decide the use DataFormatters, the logic will turns into following:

              try {
                parsedDate = Date.from(DateFormatters.from(
                                DateFormatter.forPattern("epoch_millis").parse(columnOriginalDate))
                        .toInstant());
              } catch (DateTimeException ex) {
                // Fallback to epoch_second
                try {
                  parsedDate = Date.from(DateFormatters.from(
                                  DateFormatter.forPattern("epoch_second").parse(columnOriginalDate))
                          .toInstant());
                } catch (DateTimeException innerEx) {
                  parsedDate =
                          DateUtils.parseDate(
                                  columnOriginalDate,
                                  FORMAT_DOT_OPENSEARCH_DASHBOARDS_SAMPLE_DATA_LOGS_EXCEPTION,
                                  FORMAT_DOT_OPENSEARCH_DASHBOARDS_SAMPLE_DATA_FLIGHTS_EXCEPTION,
                                  FORMAT_DOT_OPENSEARCH_DASHBOARDS_SAMPLE_DATA_FLIGHTS_EXCEPTION_NO_TIME,
                                  FORMAT_DOT_OPENSEARCH_DASHBOARDS_SAMPLE_DATA_ECOMMERCE_EXCEPTION,
                                  FORMAT_DOT_DATE_AND_TIME,
                                  FORMAT_DOT_DATE);
                }
                

Because I need the try-catch to perform the parsing logic fallback from millisecond, second and eventually a normal date String.
Which I don't think that is ideal from the performance perspective, as this will run against every row being resulted.

@dai-chen
Copy link
Collaborator

@andy-k-improving It seem CI is failing on IT task. I've triggered the CI again. Please check if there is still failure.

Signed-off-by: Andy Kwok <[email protected]>
@andy-k-improving
Copy link
Contributor Author

@andy-k-improving It seem CI is failing on IT task. I've triggered the CI again. Please check if there is still failure.

Thanks, indeed we have a breaking change on the date field presentation on v1 engine (The old engine).
I have updated the Integration test to reflect the same, WDYT?

Signed-off-by: Andy Kwok <[email protected]>
@andy-k-improving
Copy link
Contributor Author

andy-k-improving commented Nov 27, 2024

I have updated testIncorrectFormat the test, but instead, should I remove this use-case altogether?
Because indeed this incorrect format is accepted by OpenSearch.
So it's not really incorrect from my perspective.

@andy-k-improving
Copy link
Contributor Author

@dai-chen Would you mind to have another look on this?
Thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Inconsistent behaviour for date field in nested collection queries
2 participants