Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demonstrate usage of foreign arrow function #9150

Merged
merged 17 commits into from
Mar 8, 2024
Merged

Conversation

hubertp
Copy link
Collaborator

@hubertp hubertp commented Feb 23, 2024

Pull Request Description

Including arrow language in the distribution by default. Added a basic example for creating an Arrow array.
Making sure that memory layout agrees with Arrow specification (padding, continuous allocation of memory chunks).
Related to #9118.

This should unblock work on allowing serialization/deserialization to/from Parquet but I'd like to delay it to a follow up ticket as it is going to be a significant amount of specialized work.

Important notes

As requested, Arrow array can currently be only modified via a builder pattern. Once the builder is sealed, no more modifications are allowed. At least not via official API. We all know that one can bypass those via a foreign function in JS or Python.

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.

Including arrow language in the distribution by default.
Added a basic example for creating an Arrow array.
@hubertp hubertp added CI: No changelog needed Do not require a changelog entry for this PR. CI: Clean build required CI runners will be cleaned before and after this PR is built. labels Feb 23, 2024
build.sbt Outdated
@@ -3021,7 +3020,8 @@ buildEngineDistribution := {
val _ = (`engine-runner` / assembly).value
updateLibraryManifests.value
val modulesToCopy = componentModulesPaths.value.map(_.data)
val engineModules = Seq(file("runtime.jar"))
val arrow = Seq((`runtime-language-arrow`/Compile/packageBin ).value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct. Let's not assemble runtime-language-arrow, as that is unnecessary.

Arrow specification will enforce padding during serialization.
This change is mostly copying from existing Java reference
implementation.
Also while the padding is now correct, allocation of separate
`ByteDirect` buffers is not. TBC
The builder accepts an Arrow array and allows for populating the array
via the `set` method. Once the builder is `built` no more modifications
are allowed.
Arrow spec expects (informally) that memory allocated to data and
bitmaps is continous.
With this change, when allocating memory for some array, we allocate
full chunk of memory and then slice **from it** chunks for data and
bitmap, rather than allocating separate memory chunks.
TODO: add some reference counting in the future.
@hubertp hubertp marked this pull request as ready for review March 5, 2024 13:42
Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have designed, implemented and packaged the Arrow language as a separate, Enso independent entity. This PR puts Arrow Array Builder into Standard.Base. That's not right!

Arrow isn't a core concept of Enso. As such there should be no builtins related to Arrow support.

Should the Arrow language remain independent from Enso language, then the builder has to be provided by the Arrow language, exposing its functionality via InteropLibrary. Simplest idea is to require two arguments to new_arrow constructor (created after parsing new[Int8] by the Arrow language) - the first being the size and the second an isExecutable TruffleObject (for example an Enso lambda function) that will be execute(builder) where builder is another TruffleObject reacting to invokeMember("set") & co.

Or we can decide that Arrow is natural part of Enso. But then we shouldn't have a separate Arrow language JAR & co. Needless to say, I'd rather keep Enso small and have Arrow separate - unless there are inherent performance or UX limitations.

@hubertp
Copy link
Collaborator Author

hubertp commented Mar 6, 2024

We have designed, implemented and packaged the Arrow language as a separate, Enso independent entity. This PR puts Arrow Array Builder into Standard.Base. That's not right!

Despite the name, there is nothing Arrow-specific in that module. And I think it would make sense to make it generic. I will update the PR.

@hubertp hubertp requested a review from JaroslavTulach March 6, 2024 11:27
@hubertp
Copy link
Collaborator Author

hubertp commented Mar 6, 2024

@JaroslavTulach As I mentioned, there was nothing in the builder that made it arrow-specific. In fact this can serve as a generic polyglot array builder. The changes now reflect that.
But I also added a case where you can use Polyglot API to do the modifications in place. Then the sky is the limit (the potential damage as well).

foreign arrow new_arrow = """
new[Int8]

Polyglot_Array_Builder.set self index value = Polyglot_Array_Builder.set_builtin self index value
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this if people object. This is mostly for testing the functionality and it is not part of the official API

builder.set 1 0 . should_fail_with Unsupported_Argument_Types
Polyglot.write_array_element array_10 1 0
builder.at 1 . should_equal 0
v.at 1 . should_equal 0 # dangerous but allowed
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JaroslavTulach This demonstrates the potential consequences of using Polyglot API i.e. updating the builder after the fact that makes it "sealed".
/cc @jdunkerley

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

demonstrates the potential consequences of using Polyglot API

This demonstrates that the sealing only happens in the builder, but not on the actual array. If we want object coming from our arrow language to support sealing, why don't we put the sealed flag into ArrowFixedArrayInt & co.?

@radeusgd radeusgd mentioned this pull request Mar 6, 2024
5 tasks
Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I continue to have an API design architectural problem with this PR.

was nothing in the builder that made it arrow-specific. In fact this can serve as a generic polyglot array builder. The changes now reflect that.

Whatever was @jdunkerley reason for requesting the builder, it is not achieved as the usage of the new builder is completely optional and there are other means to achieve the same (and more) without the builder.

Proper API design shall start with description of the use-case. As a reviewer I am not supposed to guess the use-case. However I have to: If the use-case was encapsulation of the write operation, then it is not achieved! If the use-case is something else, then please mention it in the Pull Request Description - as currently the justification for majority of the code (I mean the bloat related to all the new builder builtins) in this PR is missing (or at last I cannot find it).

But I also added a case where you can use Polyglot API to do the modifications in place. Then the sky is the limit (the potential damage as well).

Right. So what is your use-case related to Polyglot.write_array_element? Should one be allowed to do the modifications in place from Enso language or not? The original decision when working on #5011 and related PRs was that we don't want to expose write operations in Enso.

Using builder pattern is a good way to avoid exposing write operations, however it has to be done properly - e.g. (in current architectural design) the builder has to be provided by the Arrow language.


package org.enso.interpreter.arrow.runtime;

public class RoundingUtil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the class have to be public at all? Isn't it enough to be package private?

err.enter();
Builtins builtins = EnsoContext.get(this).getBuiltins();
throw new PanicException(builtins.error().makeInvalidArrayIndex(array, index), this);
} catch (UnsupportedTypeException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnsupportedMessageException leads to makeTypeError while UnsupportedTypeException leads to panic. Isn't that strange? Shouldn't UnsupportedTypeException lead to some type error?

@@ -165,6 +168,10 @@ public Atom makeNoSuchConversion(Object target, Object that, UnresolvedConversio
return noSuchConversion.newInstance(target, that, conversion);
}

public Atom makeInvalidOperation(Object message) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Object message type is pretty generic and potentially wrong. We cannot pass java.lang.String there - that'd be an error. Recently I changed one of such signatures to Text. Other alternative is EnsoObject or TruffleObject or overloaded methods supporting also long and double.

I can see the Object argument is accepted all around, but as I say, it is technically not correct.

stdlibName = "Standard.Base.Data.Polyglot_Array_Builder.Polyglot_Array_Builder")
public abstract class PolyglotArrayBuilder implements EnsoObject {

protected boolean sealed = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protected? Better to make (package) private.


protected boolean sealed = false;

public void seal() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final.

builder.set 1 0 . should_fail_with Unsupported_Argument_Types
Polyglot.write_array_element array_10 1 0
builder.at 1 . should_equal 0
v.at 1 . should_equal 0 # dangerous but allowed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

demonstrates the potential consequences of using Polyglot API

This demonstrates that the sealing only happens in the builder, but not on the actual array. If we want object coming from our arrow language to support sealing, why don't we put the sealed flag into ArrowFixedArrayInt & co.?

v.at 8 . should_equal 127
builder.append 21 . should_fail_with Invalid_Operation
builder.set 1 21 . should_fail_with Invalid_Operation
Polyglot.write_array_element array_10 1 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me recap what we achieved so far:

  • we expose new[Int8] to public
  • we let public to create array_10 and hold it indefinitly
  • we let public to modify the array any time with Polyglot.write_array_element array_10
  • on top of that we created some fake facade called Polyglot_Array_Builder
  • the facade provides no additional functionality
  • every operation can already be done on array_10 by other means

I guess I am missing the point of the whole builder exercise. Whatever it tried to achieve (probably some encapsulation or giving the creator more rights) clearly isn't achieved. Btw. Give the Creator of an Object More Rights pattern is described at page 79 of the Practical API Design book.

test/Base_Tests/src/Semantic/Arrow_Interop_Spec.enso Outdated Show resolved Hide resolved
test/Base_Tests/src/Semantic/Arrow_Interop_Spec.enso Outdated Show resolved Hide resolved
@JaroslavTulach
Copy link
Member

JaroslavTulach commented Mar 7, 2024

Assuming @jdunkerley main requirement is the: don't let mutable arrays float thru the computation, then ...

.... Simplest idea is to ....

Change the syntax of the Arrow language to create the builder itself:

foreign arrow new_array_int8 = """
    new[Int8]

b10 = new_array_int8 10
b10.append 2 
b10.append 4
arr = b10.build # or to_array?

arr.length . should_equal 10
arr.at 0 . should_equal 2
arr.at 1 . should_equal 4
arr.at 5 . is_nothing . should_be_true # maybe?

Polyglot.write_array_element arr 6 33 # raises an exception as the arr is already immutable

if this is what we want, then it all can be implemented in the Arrow language itself without any need for so many builtins in the Enso engine. As a result of the changes the test cases that use new would need to be updated to use the builder (newly returned from the factory method).

Arrow Array is already Enso Array

Please note that one can apply all Enso Array methods on the arr produced by Arrow language:

b3 = new_array_int8 3
b3.append 1 
b3.append 2
b3.append 3
arr = b3.build 

v1 = arr.map _.to_text # yields ['1', '2', '3']
v2 = arr.fold 10 (+) # yields 16
v3 = Meta.type_of arr # yields Array

hubertp added 3 commits March 7, 2024 23:34
Avoiding the introduction of various builtins by pushing builder pattern
into allow language.
Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an architecture perspective the change is looking good now. It is located in the right place (e.g. runtime-language-arrow), it doesn't require any changes to engine/runtime. It fits perfectly the vision why Arrow language was founded as a (loosely) separated project.

I have doubts about the decision to make builder : Array. Possibly that's unnecessary leakage of mutability into the Enso world. However that may stay as it is and remain a subject of future discussions.

I believe we should improve error reporting by using better subclasses of InteropException. Given the typed array nature of Arrow buffers, I'd use UnsupportedTypeException more.

I haven't verified the change from the point of performance. I assume there will be some benchmarking work in the future.


@ExportMessage
public boolean hasArrayElements() {
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "builder" is in fact also an (Enso) Array and it can be mutated until somebody calls build on it. I'd be interested to see a use-cases for such a design decision...

test/Base_Tests/src/Semantic/Arrow_Interop_Spec.enso Outdated Show resolved Hide resolved
@hubertp hubertp requested a review from radeusgd March 8, 2024 11:39
@hubertp hubertp added the CI: Ready to merge This PR is eligible for automatic merge label Mar 8, 2024
@mergify mergify bot merged commit f80dd9f into develop Mar 8, 2024
34 of 36 checks passed
@mergify mergify bot deleted the wip/hubert/9118-arrow branch March 8, 2024 15:20
@hubertp hubertp mentioned this pull request Mar 11, 2024
2 tasks
mergify bot pushed a commit that referenced this pull request Mar 13, 2024
Follow up on #9150 - making sure that Arrow builder is not accidentally treated as an Array by disallowing reading elements.

# Important Notes
Also making sure that the length of the resulting Arrow Array is consistent with what user requested.
@hubertp hubertp mentioned this pull request Mar 15, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Clean build required CI runners will be cleaned before and after this PR is built. CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants