Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#639: allow metadata fields (_id) to be used as QualifiedName #142

Closed

Conversation

acarbonetto
Copy link

@acarbonetto acarbonetto commented Oct 25, 2022

Description

OpenSearch reserved fields (_id, _score, _index, etc) are not allowed to be used in SQL clauses (SELECT, WHERE, ORDER BY) because the field format starting with underscore _ is not allowed.

This ticket adds specific identifiers to the language, and opens up support for OpenSearch reserved identifiers.

As an aside, identifiers with double underscore at the start (such as __myCoolField) is acceptable as an identifier.

Example:
SELECT calcs.key, str0, _id, _score, _maxscore FROM calcs WHERE _id="5"

Result:
{ "key04", "OFFICE SUPPLIES", "5", null, null }

Example - Metadata fields not requested are not displayed:
SELECT *, _id FROM bigint WHERE _id="2"

Result (assuming bigint only has one field):
[9223372026854775807, "2"]

Issues Resolved

#639

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andrew Carbonetto [email protected]

@codecov
Copy link

codecov bot commented Oct 25, 2022

Codecov Report

Merging #142 (92096b4) into integ-poc-metadata-fields (bfad32c) will decrease coverage by 35.19%.
The diff coverage is n/a.

@@                       Coverage Diff                        @@
##             integ-poc-metadata-fields     #142       +/-   ##
================================================================
- Coverage                        97.95%   62.76%   -35.20%     
================================================================
  Files                              303       10      -293     
  Lines                             7783      658     -7125     
  Branches                           501      119      -382     
================================================================
- Hits                              7624      413     -7211     
- Misses                             158      192       +34     
- Partials                             1       53       +52     
Flag Coverage Δ
query-workbench 62.76% <ø> (?)
sql-engine ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rg/opensearch/sql/analysis/ExpressionAnalyzer.java
...ch/sql/opensearch/response/OpenSearchResponse.java
...pensearch/sql/sql/parser/AstExpressionBuilder.java
...ensearch/sql/expression/parse/ParseExpression.java
...earch/response/agg/CompositeAggregationParser.java
...ch/planner/logical/rule/MergeLimitAndRelation.java
...ensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
...l/planner/optimizer/rule/MergeFilterAndFilter.java
.../main/java/org/opensearch/sql/planner/Planner.java
...ilter/lucene/relevance/MatchPhrasePrefixQuery.java
... and 303 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was straight-forward! It does bring up questions...

How will this interact with Prometheus and other catalogs? They have different notion of what metadata fields are.

I remember we had METADATA(...)... syntax at some point -- that was a good idea.

It can work with StorageEngine instance and be catalog agnostic.

Say, StorageEngine has getMetadata method that returns a metadata descriptor -- a map from a name of metadata field to a getter function. The function returns that field for a given result row.

When a query like SELECT METADATA(_id) FROM opensearch.indexA executes, OpenSearchIndexScan will return field data as well as metadata descriptor. ProjectOperator will execute METADATA implementation and pass it _id, metadata descriptor, and current result row. METADATAwill check_id` is valid metadata field and execute the getter function on the current result row.

This will avoid keeping catalog-specific data in the core module and the parser and support different metadata for different catalogs.

I expect we can throw a SemanticCheck exception for a query like SELECT METADATA(_id) FROM prometheus.tableA.

Last point. OpenSearch seems to support custom metadata fields for a plugin. I'm curious what it would take to support them.

Comment on lines +108 to +113
if (!Float.isNaN(hit.getScore())) {
builder.put("_score", new ExprFloatValue(hit.getScore()));
}
if (!Float.isNaN(maxScore)) {
builder.put("_maxscore", new ExprLongValue(maxScore));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When _score/_maxscore could be missing?
What will be returned to user in that case? NULL? MISSING (aka empty cell)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null is what gets returned. That's what's defined in the _score field.
Which is consistent with what OpenSearch is returning... I'm assuming that's the correct way forward.

}

/**
* QualifiedName Constructor.
*/
public QualifiedName(Iterable<String> parts) {
public QualifiedName(Iterable<String> parts, Boolean isMetadataField) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to determine isMetadataField dynamically rather than through constructor parameter?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easiest to define it at constructor time... since we need to look up the expression tree.

}

public QualifiedName(Iterable<String> parts) {
this(parts, Boolean.FALSE);
}

/**
* QualifiedName Constructor.
Copy link

@GabeFernandez310 GabeFernandez310 Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment need to be moved up? Are the above not considered as constructors as well?

@@ -110,4 +123,6 @@ public List<UnresolvedExpression> getChild() {
public <R, C> R accept(AbstractNodeVisitor<R, C> nodeVisitor, C context) {
return nodeVisitor.visitQualifiedName(this, context);
}

public Boolean isMetadataField() { return Boolean.TRUE; }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this line. Doesn't this always return true? was it meant to return isMetadataField?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha. That was wrong. I've fix this now.


ImmutableMap.Builder<String, ExprValue> builder = new ImmutableMap.Builder<>();
builder.putAll(docData.tupleValue());
builder.put("_index", new ExprStringValue(hit.getIndex()));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we ever have hits from different indices?
If yes, maxScore should be gotten per each hit. If not, you can move getIndex() outside of the loop to optimize performance.

@@ -430,12 +430,13 @@ private UnresolvedExpression visitConstantFunction(String functionName,
}

private QualifiedName visitIdentifiers(List<IdentContext> identifiers) {
Boolean isMetadataField = identifiers.stream().filter(id -> id.metadataField() != null).findFirst().isPresent();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be there a mix of different types of identifiers - some of them are metadata and some of them are not?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants