Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions #9831

Merged
merged 9 commits into from
May 27, 2022

Conversation

kasiafi
Copy link
Member

@kasiafi kasiafi commented Nov 2, 2021

No description provided.

@cla-bot cla-bot bot added the cla-signed label Nov 2, 2021
@kasiafi kasiafi force-pushed the 286jsonFeature branch 6 times, most recently from fe236ff to 70e48c9 Compare November 8, 2021 18:54
@kasiafi kasiafi force-pushed the 286jsonFeature branch 2 times, most recently from 0acac64 to e1f3ccd Compare November 12, 2021 17:19
@kasiafi kasiafi changed the title Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions in grammar and AST Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions in grammar, AST and Analyzer Nov 29, 2021
@kasiafi kasiafi force-pushed the 286jsonFeature branch 5 times, most recently from 5fa3108 to 1866670 Compare November 30, 2021 13:23
@kasiafi kasiafi force-pushed the 286jsonFeature branch 7 times, most recently from a53e24a to 5b355b5 Compare December 14, 2021 14:42
@kasiafi kasiafi force-pushed the 286jsonFeature branch 3 times, most recently from 4a4b801 to a17b750 Compare December 16, 2021 07:45
@kasiafi kasiafi force-pushed the 286jsonFeature branch 4 times, most recently from b1c768f to b5d5f26 Compare December 30, 2021 09:43
this(Optional.of(location), inputExpression, inputFormat, jsonPath, pathParameters, errorBehavior);
}

public JsonExists(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the rewriter should create the JsonExists node with the same location as the original one, and we should remove the location-less constructor. Same for the other ones.

return toSlice(jsonExpression, UTF_32, errorBehavior, omitQuotes);
}

private static Slice toSlice(JsonNode json, EncodingSpecificConstants constants, long errorBehavior, boolean omitQuotes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this serialize

import static io.trino.spi.type.TypeUtils.readNativeValue;
import static io.trino.sql.analyzer.ExpressionAnalyzer.JSON_NO_PARAMETERS_ROW_TYPE;

public final class JsonParametersToArrayHelper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this class for? The name is too vague and the method below is not very clear. I'm thinking of whether it should be renamed or placed elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a javadoc, and renamed the class to ParameterUtil.

function = plannerContext.getMetadata().resolveFunction(session, QualifiedName.of(JSON_EXISTS_FUNCTION_NAME), fromTypes(argumentTypes.build()));
}
catch (TrinoException e) {
if (e.getLocation().isPresent()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When function is not registered, which shouldn't happen. Should I remove the try-catch?


// wrapper behavior, empty behavior and error behavior will be passed as arguments to function
// quotes behavior is handled by the corresponding output function
ImmutableList.Builder<Type> argumentTypes = ImmutableList.<Type>builder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call .build() here directly and declare the variable as List<Type>. No need to have an explicit variable to hold the intermediate.

Comment on lines 464 to 466
// process(node.getPath());
// types.put(PathNodeRef.of(node), BOOLEAN);
// return BOOLEAN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove or add a TODO comment.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.trino.sql.planner;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be in the "planner" package, but in the io.trino.json package.

this.invocationContext = requireNonNull(invocationContext, "invocationContext is null");
}

public List<Object> evaluate(IrJsonPath path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems backward. The fixed part should be the path (i.e., the "program" or "IR"), the functions, etc. The variable part should be the json over which the path is being evaluated.

That will allow caching resolutions and other similar decisions and attach them to the corresponding IR nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

right = (TypedValue) rightObject;
}

ResolvedOperatorAndCoercions operators = getOperators(OperatorType.valueOf(node.getOperator().name()), left.getType(), right.getType());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the operator being cached? Ideally, it should be attached to the node (via a map of node->operator) to avoid having to re-resolve it every time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operators are cached in CachingResolver.

@kasiafi kasiafi force-pushed the 286jsonFeature branch 2 times, most recently from b5341a7 to 04d22e4 Compare May 6, 2022 13:23
@kasiafi
Copy link
Member Author

kasiafi commented May 6, 2022

@martint I refactored operator caching according to our discussion. Please take a look at the last commit.

@kasiafi kasiafi force-pushed the 286jsonFeature branch 3 times, most recently from 32c0f72 to 4838a88 Compare May 19, 2022 13:09
*/
public void initializeIfNecessary(IrJsonPath path, ConnectorSession session)
{
if (evaluator != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious by looking at this why it's safe to skip re-initializing the evaluator if it's already set. Technically, and from the JsonPathInvocationContext's perspective, the path could be different. The fact that it's not, is a side effect of how the JSON constructs in SQL work (the path is constant) and how they are mapped to functions. But that knowledge is too far removed from this site.

I don't think this class is needed. Instead, the functions should initialize and hold on to the JsonPathEvaluator instance upon first invocation (i.e., wherever they are calling initializeIfNecessary)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that JsonPathInvocationContext is needed. It is passed through the instanceFactory and this way an instance of JsonPathInvocationContext persists through multiple rows processed by the JSON-funciton.

We can't skip it and just pass JsonPathEvaluator as the context object, because we can create the JsonPathEvaluator only after the function is invoked with certain arguments (including path). The context object already exists at that moment (it comes as one of the arguments to the function).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. This is the state object for the (stateful) function. However, it could be changed to be just that -- a state object, and leave all logic to be done in the caller.

kasiafi added 8 commits May 26, 2022 16:48
Let `handleSubqueries` find subqueries nested under `Expression`
nodes, linked via children not being `Expression`.
`json_extract` and `json_extract_scalar` are old JSON-processing
functions which have very limited capabilities, and are not
compliant with the spec. However, they have simple lightweight
implementation, optimized for the use case.

`json_query` and `json_value` are new spec-compliant JSON-processing
functions, which support the complete specification for JSON path.
Their implementation is much more complicated, resulting both from
the scope of the supported feature, and the "streaming" semantics.
These benchmarks are to compare both implementations applied to the
common use-case.

Benchmark json_extract vs json_query:
Benchmark                                            (depth)  Mode  Cnt     Score    Error  Units
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        1  avgt   15   813.030 ± 12.263  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        3  avgt   15   939.202 ± 94.621  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        6  avgt   15  1004.352 ± 12.945  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          1  avgt   15  1136.371 ± 17.502  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          3  avgt   15  1399.780 ± 27.967  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          6  avgt   15  1666.810 ± 29.572  ns/op

Benchmark json_extract_scalar vs json_value:
Benchmark                                                  (depth)  Mode  Cnt     Score    Error  Units
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction        1  avgt   15   644.762 ±  9.195  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction        3  avgt   15   720.244 ± 13.288  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction       10  avgt   15   928.069 ± 17.623  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction                1  avgt   15   811.158 ± 14.694  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction                3  avgt   15  1042.795 ± 40.555  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction               10  avgt   15  1663.328 ± 27.320  ns/op
private final Metadata metadata;
private final Session session;
private final TypeCoercion typeCoercion;
private final NonEvictableCache<IrPathNodeRef<IrPathNode>, NonEvictableCache<InputTypes, ResolvedOperatorAndCoercions>> operators = buildNonEvictableCache(CacheBuilder.newBuilder().maximumSize(MAX_CACHE_SIZE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not flatten the caches an use a composite key made of IrPathNode ref, left type and right type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then one node could take too much space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be ok, especially if there's a case where a single node sees many combinations of types.

It would also make the code simpler. If later we find that this is not sufficient, we can think of better caching and eviction strategies, or JIT-style codegen/execution

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache is now flattened.

@kasiafi kasiafi requested a review from martint May 27, 2022 12:08
@kasiafi kasiafi merged commit 9376c36 into trinodb:master May 27, 2022
@github-actions github-actions bot added this to the 383 milestone May 27, 2022
@kasiafi kasiafi mentioned this pull request May 27, 2022
@findepi
Copy link
Member

findepi commented May 28, 2022

Nice!
it took a while, so the more happy it landed. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Add support for json_path wildcard
4 participants