Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Loader implement java.util.stream.Collector #14

Conversation

ferdinand-swoboda
Copy link

@ferdinand-swoboda ferdinand-swoboda commented Nov 7, 2021

jOOQ v3.15 deprecated the org.jooq.RecordHandler interface which Jolo's public tech.picnic.jolo.Loader API implemented.
One way to deal with this upstream change is to make Loader implement java.util.function.Consumer directly and changing the usage advice from

 // At runtime:
 Query query = ...;
 Collection<T> result = query.fetchInto(FACTORY.newLoader()).get();

to

 // At runtime:
 Query query = ...;
 Loader<T> loader = FACTORY.newLoader();
 query.forEach(loader);
 Collection<T> result = loader.get();

However, this leads to an unnecessarily cumbersome API. Since jOOQ v3.15 also introduced org.jooq.ResultQuery#collect(Collector<? super R, A, X> collector) a more convenient & generic API is possible:

 // At runtime:
 Query query = ...;
 Collection<T> result = query.collect(FACTORY.newLoader());

Both options require downstream users to update their usage of Jolo.

This PR foregoes the first option and makes the necessary changes to implement the Collector interface for the Loader. Consequently, it's a non-goal of this PR to maintain a backwards-compatible API surface.

This PR includes a number of dependency updates (JDK 11 among others) that I suggest to extract into a separate PR and possibly a release of the OSS parent (cc @Stephan202 ). I've also addressed some upstream bug references and outdated comments along the way.

Some details to pay attention to:

  • To implement a clean java.util.stream.Collector#combiner I've created a mutable object graph data structure to separate state from entity & relation definitions, making Entity and Relation effectively immutable.
  • Linking loaded objects is done in the loader's finishing step. Callers adhering to the Collector contract will call finisher() exactly once, thus there's no need for memoizing this step.
  • Since the public API changes anyway I've removed/hidden some now unnecessary public API methods from Entity (though, I maintained the corresponding test cases) and Relation. Please see the added CHANGELOG.md for details.
  • A change worth discussing: I changed the custom relation loader interface from BiConsumer<Relation, Set<IdPair>> to Function<Record, Set<IdPair>. I failed to understand why the former was chosen and the latter makes for a cleaner implementation. @scranen would you mind chipping in?
  • The collector remains ordered, meaning that the main entity's objects are returned in encounter order (relevant for ordered input as is typical for records streamed from jOOQ's query results).

A potential future improvement is making the loader java.util.stream.Collector.Characteristics#CONCURRENT i.e. suitable for concurrent parallel collection. This is not relevant for streams directly stemming from org.jooq.ResultQuerys (which are ordered) but downstream users might opt to do some intermediate processing resulting in unordered streams in which case it might be handy.

@ferdinand-swoboda ferdinand-swoboda self-assigned this Nov 7, 2021

### Release date:

2021/11/14
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously wishful thinking

pom.xml Outdated
Comment on lines 146 to 149
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying this duplicate dependency currently causes Maven warnings which should be resolved with an OSS parent update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up until now we explicitly avoided using Guava, is this a conscious decision to drop that?

Copy link
Author

@ferdinand-swoboda ferdinand-swoboda Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guava was already used in test classes and is specified in the OSS parent so I wasn't aware of this intention.

The current Guava usage is not exposed in the public API, though.
What concerns do/did you have?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the Guava dependency in non-test code.

Comment on lines 62 to 64
for (var cell : other.entityAndIdToObject.cellSet()) {
add(cell.getRowKey(), cell.getColumnKey(), cell.getValue());
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entityAndIdToObject.putAll(other.entityAndIdToObject) would override existing cells but we want to give precedence to existing ones in case of conflicts.

Comment on lines +85 to +87
&& Objects.equals(leftSetter, relation.leftSetter)
&& Objects.equals(rightSetter, relation.rightSetter)
&& Objects.equals(relationLoader, relation.relationLoader);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function equivalence is generally undecidable so I'm not sure if these setters should be included in this equals check. WDYT?

(object, successors) -> {
validate(
successors.size() <= 1,
"N-to-1 relation between %s (%s) and %s (%s) contains conflicting tuples",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message is now slightly less helpful as it's missing examples.

Comment on lines -20 to -37
* @see <a href="https://github.com/jOOQ/jOOQ/issues/8509">jOOQ bug report</a>
*/
@Test
public void testEqualsOnFields() {
Field<Long> withSchema = DSL.field(DSL.name("PUBLIC", "FOO", "ID"), Long.class);
Field<Long> noSchema = DSL.field(DSL.name("FOO", "ID"), Long.class);
assertTrue(FOO.ID.equals(withSchema));
assertFalse(withSchema.equals(FOO.ID));
assertFalse(FOO.ID.equals(noSchema));
assertFalse(noSchema.equals(FOO.ID));

// Also documenting this oddity: the qualified name of FOO.ID is not "PUBLIC.FOO.ID", even
// though it is only considered equal to a field that has the "PUBLIC" qualifier.
assertEquals(FOO.ID.toString(), "\"PUBLIC\".\"FOO\".\"ID\"");
assertEquals(FOO.ID.getQualifiedName(), DSL.name("FOO", "ID"));
assertEquals(withSchema.getQualifiedName(), DSL.name("PUBLIC", "FOO", "ID"));
assertEquals(noSchema.getQualifiedName(), DSL.name("FOO", "ID"));
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed upstream. The following tests would now pass.
I have changed the implementation to now use Field#equals() directly.

Suggested change
* @see <a href="https://github.com/jOOQ/jOOQ/issues/8509">jOOQ bug report</a>
*/
@Test
public void testEqualsOnFields() {
Field<Long> withSchema = DSL.field(DSL.name("PUBLIC", "FOO", "ID"), Long.class);
Field<Long> noSchema = DSL.field(DSL.name("FOO", "ID"), Long.class);
assertTrue(FOO.ID.equals(withSchema));
assertFalse(withSchema.equals(FOO.ID));
assertFalse(FOO.ID.equals(noSchema));
assertFalse(noSchema.equals(FOO.ID));
// Also documenting this oddity: the qualified name of FOO.ID is not "PUBLIC.FOO.ID", even
// though it is only considered equal to a field that has the "PUBLIC" qualifier.
assertEquals(FOO.ID.toString(), "\"PUBLIC\".\"FOO\".\"ID\"");
assertEquals(FOO.ID.getQualifiedName(), DSL.name("FOO", "ID"));
assertEquals(withSchema.getQualifiedName(), DSL.name("PUBLIC", "FOO", "ID"));
assertEquals(noSchema.getQualifiedName(), DSL.name("FOO", "ID"));
}
* @see <a href="https://github.com/jOOQ/jOOQ/issues/8509">jOOQ bug report</a>
*/
@Test
public void testEqualsOnFields() {
Field<Long> withSchema = DSL.field(DSL.name("PUBLIC", "FOO", "ID"), Long.class);
Field<Long> noSchema = DSL.field(DSL.name("FOO", "ID"), Long.class);
assertFalse(FOO.ID.equals(withSchema));
assertFalse(withSchema.equals(FOO.ID));
assertTrue(FOO.ID.equals(noSchema));
assertTrue(noSchema.equals(FOO.ID));
// Also documenting this oddity: the qualified name of FOO.ID is not "PUBLIC.FOO.ID", even
// though it is only considered equal to a field that has the "PUBLIC" qualifier.
assertEquals(FOO.ID.toString(), "\"FOO\".\"ID\"");
assertEquals(FOO.ID.getQualifiedName(), DSL.name("FOO", "ID"));
assertEquals(withSchema.getQualifiedName(), DSL.name("PUBLIC", "FOO", "ID"));
assertEquals(noSchema.getQualifiedName(), DSL.name("FOO", "ID"));
}

cc @scranen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so the semantics changed but is still a bit strange. assertFalse(FOO.ID.equals(withSchema)) seems unexpected; can this cause any issues?

Copy link
Author

@ferdinand-swoboda ferdinand-swoboda Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the default schema assumed by Jooq's code generator is empty, hence FOO.ID.toString() (Field#equals is based on the String representation) returns "FOO"."ID" and the check fails.

If I declare the default schema used by jooq-code-gen to be "public"

<property>
   <key>unqualifiedSchema</key>
   <value>public</value>
</property>

then the following test passes:

    @Test
    public void testEqualsOnFields() {
        Field<Long> withSchema = DSL.field(DSL.name("PUBLIC", "FOO", "ID"), Long.class);
        Field<Long> noSchema = DSL.field(DSL.name("FOO", "ID"), Long.class);

        assertTrue(FOO.ID.equals(withSchema));
        assertTrue(withSchema.equals(FOO.ID));
        assertFalse(FOO.ID.equals(noSchema));
        assertFalse(noSchema.equals(FOO.ID));
        
        assertEquals(FOO.ID.toString(), "\"PUBLIC\".\"FOO\".\"ID\"");
        // this is now unexpected?
        assertEquals(FOO.ID.getQualifiedName(), DSL.name("FOO", "ID"));
        assertEquals(withSchema.getQualifiedName(), DSL.name("PUBLIC", "FOO", "ID"));
        assertEquals(noSchema.getQualifiedName(), DSL.name("FOO", "ID"));
    }

I think the H2 in-memory DB does not create a default public schema like Postgres does, hence the confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To conclude, fields from tables with the same name but from different schemas are considered different.
Given that schemas name-space tables, that makes sense to me.

@ferdinand-swoboda ferdinand-swoboda force-pushed the implement-collector-interface branch from 2e3868d to dd12846 Compare November 7, 2021 10:10
@ferdinand-swoboda ferdinand-swoboda marked this pull request as ready for review November 7, 2021 10:13
@@ -61,6 +60,31 @@ private Entity(Table<R> table, Class<T> type, Field<Long> primaryKey, Field<?>[]
this.fields = fields;
}

@Override
public boolean equals(@Nullable Object o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this? Two distinct Entity objects should never be considered equal IMO. You can create two different Entity objects for the same table, and use both to load the result of a single query (e.g. a self join).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two entity objects e1 and e2 for the same table and with the same result fields will Entity#load the same objects for the same record as calling e.g. e1.load(record) twice, right?

Can you provide an example that is not possible with this equals implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, sorry. You'd of course use 2 different aliases for the tables. Then it makes sense. Still not sure why we're adding a definition for equals though, just because we can?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt natural to override equals such that it is in sync with hashCode but it is indeed not necessary.
API-wise what is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would not override equals/hashCode, simply because I cannot think of a natural use case for wanting to have 2 instances of this class that should be considered equal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But this is a highly subjective of course, so also happy to be convinced otherwise)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to guard against accidental misuse since 2 entities that refer to the same table and same fields must effectively be equivalent, otherwise the corresponding SQL query would be invalid.
In that case the loader would automatically squash these entities into one.

It's indeed an unlikely scenario, though (guessing here since I haven't seen Jolo examples outside of Picnic). Adding this API characteristic would also constrain further API evolution and possibly prevent some unforeseen use cases.

Another option would be to emit a warning log if aforementioned misuse is detected during loader-building.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, there's already a check in the builder that states that "Distinct entities cannot refer to the same primary key".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave it out then. The default semantics meet the requirements for Jolo, and the most obvious form of abuse is guarded by a fail fast mechanism. If use cases do pop up, then we'll have some real requirements to work with and we can open an new PR based on those.

src/main/java/tech/picnic/jolo/Entity.java Show resolved Hide resolved
src/main/java/tech/picnic/jolo/ObjectGraph.java Outdated Show resolved Hide resolved
pom.xml Outdated
Comment on lines 146 to 149
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up until now we explicitly avoided using Guava, is this a conscious decision to drop that?

@scranen
Copy link
Contributor

scranen commented Nov 9, 2021

Custom relation loader: the BiConsumer<Record, Set<IdPair>> was a natural way to avoid nulls and not construct a container around the (optional) IdPair returned from a record. When loading 100k+ records this kind of churn can make a difference. The interface you've got now returns a Set<IdPair>; I'm not sure there are any cases where it would return more than one.

In general, we should probably check if there's any performance difference with the new code if applied to a large data set.

@ferdinand-swoboda
Copy link
Author

Custom relation loader: the BiConsumer<Record, Set<IdPair>> was a natural way to avoid nulls and not construct a container around the (optional) IdPair returned from a record. When loading 100k+ records this kind of churn can make a difference.

Just for clarification: The BiConsumer.... interface was chosen to avoid unnecessary container object allocations?
If so, I agree we should measure if and how this matters.

The interface you've got now returns a Set<IdPair>; I'm not sure there are any cases where it would return more than one.

There's actually a test covering this functionality, see the tech.picnic.jolo.LoaderTest#customRelationLoader.

In general, we should probably check if there's any performance difference with the new code if applied to a large data set.

I can add a jfrUnit test or some JMH test. Have you done sth like that on the initial Jolo version?

@scranen
Copy link
Contributor

scranen commented Nov 9, 2021

Thanks for pointing out the unit test; I forgot about the array use case.

I did try to minimise the overhead when writing the first version of Jolo, but I do not know whether I dodged any bullets with that effort; I just chose the implementation with the least overhead if there was no compelling reason to not do so.

No performance tests were written explicitly I'm afraid, just kept an eye on WMS performance when we started using it there. I guess a good performance test could be to just feed it a set of records that represents a complete carthesian product of a few largish tables.

@ferdinand-swoboda ferdinand-swoboda force-pushed the implement-collector-interface branch from 25bb5e9 to 570d02d Compare November 9, 2021 21:06
@ferdinand-swoboda
Copy link
Author

ferdinand-swoboda commented Nov 13, 2021

No performance tests were written explicitly I'm afraid, just kept an eye on WMS performance when we started using it there. I guess a good performance test could be to just feed it a set of records that represents a complete carthesian product of a few largish tables.

I added a JMH benchmark to to this PR and the master branch.
Without collector interface:
Screen Shot 2021-11-13 at 22 47 50

With collector interface:
Screen Shot 2021-11-13 at 22 54 39

Definitely worse but I'm not sure if it matters for common use cases. I'll investigate where time is wasted.

Update: Disabling Entity#hashCode and Relation#hashCode reduces the overhead a fair bit, though not completely.
Update 2: Memoizing the hash code computation achieves the same result i.e. a 30% reduction.
Screen Shot 2021-11-28 at 09 22 58

@ferdinand-swoboda
Copy link
Author

@scranen Besides the open questions I'm considering removing the factory pattern and just keeping a LoaderBuilder.
What was the reason for including the factory? Did you expect a need to parametrise the loader instantiation somehow?

@scranen
Copy link
Contributor

scranen commented Jan 6, 2022

I made the separation to create a more foolproof API. The idea wasn't completely finished because after calling LoaderFactoryBuilderImpl::build, the builder should ideally not allow any more modifications (making it effectively immutable), or alternatively LoaderFactoryBuilderImpl should be split into a LoaderFactoryBuilderImpl and an immutable LoaderFactoryImpl. The fact that it returns itself as a LoaderFactory from the build method gets rid of all those methods that you should not be calling anymore; ideally it returns an implementation that indeed does not allow you to change the entity-relation model that you configured on it.

BTW a performance reduction of almost 100% still seems on the high side to me. Did you see if the BiConsumer (mutable set vs immutable copying) had any impact?

@ferdinand-swoboda
Copy link
Author

ferdinand-swoboda commented Jan 7, 2022

BTW a performance reduction of almost 100% still seems on the high side to me. Did you see if the BiConsumer (mutable set vs immutable copying) had any impact?

I removed JUnit from the benchmark equation to get a more accurate benchmark and now things are less clear.
On master (without the collector interface):
without-collector-interface

With collector interface:
with-collector-interface

If I further revert the relation loader's type to BiConsumer<Record, Set<IdPair>> the numbers change slightly but not by much and not necessarily in the right direction.
with-collector-with-biconsumer

I doubt that the relation loader makes a big dent because in the code path it is called from the allocation can easily be optimized away (the immutable set is unpacked right after). This would explain why we see no improvement for numRecords > 100.

Unfortunately, the profiler is not particularly helpful in determining where the slowdown comes from since the collector call accounts for only 10% of the sampled CPU time. And of course, the benchmark can be inaccurate.

@ferdinand-swoboda
Copy link
Author

ferdinand-swoboda commented Feb 5, 2022

Another update on the performance difference:
I've updated the baseline based on an expected new OSS parent release, fixed some benchmark inaccuracies and replaced Guava usages in non-test code.
The resulting collector implementation can be found on this branch and the corresponding baseline here.

The benchmark output still reports a ~100% slowdown for 100 records (higher number of records are affected as well). The JFR results indicate the slowdown occurs in org.jooq.Record#into(org.jooq.Field<?>...) and org.jooq.Record#into(java.lang.Class<? extends E>).

I would appreciate if someone could run the benchmarks locally as well and have a critical look at the benchmark itself.

@ferdinand-swoboda
Copy link
Author

ferdinand-swoboda commented Feb 6, 2022

I found the issue and fixed it by lazily loading records into objects only when the collector hasn't come across the record's ID before. This optimisation was already in place but not applied to the collector implementation.

JMH results without collector implementation:
Screen Shot 2022-02-06 at 11 04 18

JMH results with collector implementation:
Screen Shot 2022-02-06 at 10 52 25

Btw, nice improvement unlocked by JDK 11 and/or jOOQ v3.15 (compare it to my earlier comment).

@@ -8,8 +8,7 @@
[![SonarCloud Maintainability][sonarcloud-badge-maintainability]][sonarcloud-measure-maintainability]
[![BCH compliance][bettercodehub-badge]][bettercodehub-results]

Short for _jOOQ Loader_. A utility library to add basic object-relation mapping
to your [jOOQ][jooq] code.
Short for _jOOQ Loader_. A utility library to add basic object-relation mapping to your [jOOQ][jooq] code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we are switching to unwrapped text? MarkDown is a lot easier to read in plaintext viewers if it's wrapped.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text is still wrapped but IntelliJ defaults to a 120 character limit instead of 80.
I don't know of anyone who doesn't have screen estate for at least 120 characters these days so that default seems sensible to me.

Let me check how other OSS projects handle that.

@Nullable private Field<?>[] resultFields;

/**
* Creates a mapping from a jOOQ table to the given class.
* Creates a mapping from a {@link Table table} to the given {@link Class class}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you want to make a code-base wide style change in the scope of a functional change. Personally I prefer the original. Note that the exact types are already clear from the signature, and that most developers will not feel like a link to the documentation of Class is very helpful. The context that it's the jOOQ table class is now missing so you'll have to do something extra (hover over the text?) to recover that information.

*/
public void load(Record record) {
private void check(Record record) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method no longer belongs here. It's keeping state about the data set being loaded inside the object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we even need resultFields? Shouldn't it always be equivalent to fields?
A mismatch IMO indicates that the loader definition doesn't match what the query returns and thus means some columns are SELECTed but unused. Following best practices that is discouraged.

public Function<ObjectGraph, List<T>> finisher() {
return objectGraph -> {
for (Relation relation : relations) {
relation.link(objectGraph.getObjectMapping(relation));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make objectGraph responsible for doing the linking, that could simplify the code a bit. You would not have to instantiate this object mapping, and instead of using extra containers to store the intermediate results you could probably just have a relation.link(left, right) method that just accepts the left and right side of the relation. I think it would probably also shave a little bit of the performance overhead of this new implementation again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make objectGraph responsible for doing the linking, that could simplify the code a bit.

My intention was to cleanly separate state from (linking) logic, hence why the objectGraph doesn't apply any linking which IMO is associated with the relation definition.
If I understand you correctly merging these concerns would not simplify the overall design. Did I understand you correctly?

An option to avoid instantiating an ObjectMapping while keeping that separation would be to add a signature similar to ObjectGraph#apply(Entity left, Entity right, BiConsumer<T, U> setter.
That looks unwieldy and some validation logic would still need to be moved inside ObjectGraph.

@ferdinand-swoboda
Copy link
Author

Closing in favour of https://github.com/ferdinand-swoboda/folo.

Thanks for all the feedback and comments. The reason for going with a new project is that I want to see this code being available to the world and I felt this was required to unblock myself and potential users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants