Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

columnar support for Arrow tables #2030

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@
"@types/node": "^20.5.0",
"@typescript-eslint/eslint-plugin": "^7.2.0",
"@typescript-eslint/parser": "^7.2.0",
"apache-arrow": "^16.0.2",
"c8": "^9.1.0",
"canvas": "^2.0.0",
"d3-geo-projection": "^4.0.0",
Expand Down
35 changes: 34 additions & 1 deletion src/options.js
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ export const reindex = Symbol("reindex");
export function valueof(data, value, type) {
const valueType = typeof value;
return valueType === "string"
? maybeTypedMap(data, field(value), type)
? columnar(data, value, type)
: valueType === "function"
? maybeTypedMap(data, value, type)
: valueType === "number" || value instanceof Date || valueType === "boolean"
Expand Down Expand Up @@ -133,6 +133,7 @@ export function keyword(input, name, allowed) {
// Promotes the specified data to an array as needed.
export function arrayify(values) {
if (values == null || values instanceof Array || values instanceof TypedArray) return values;
if (isArrowTable(values)) return arrowTableProxy(values);
switch (values.type) {
case "FeatureCollection":
return values.features;
Expand Down Expand Up @@ -575,3 +576,35 @@ export function maybeClip(clip) {
else if (clip != null) clip = keyword(clip, "clip", ["frame", "sphere"]);
return clip;
}

// Duck typing Apache Arrow tables
function isArrowTable(data) {
return typeof data?.getChild === "function" && typeof data.numRows === "number" && typeof data.slice === "function";
}

// Extract columnar data
function columnar(data, name, type) {
if (isArrowTable(data)) {
const column = maybeTypedArrayify(data.getChild(name), type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be calling data.getChild(name).toArray() here, or else maybeTypedArrayify will use the slowest iterable path rather than using the more efficient Vector.toArray (which could be zero-copy if the Arrow table only has one chunk).

if (Array.isArray(column) && String(data.schema?.fields?.find((d) => d.name === name)).endsWith("<MILLISECOND>"))
column.find((d, i) => d != null && (column[i] = new Date(d)));
Comment on lines +589 to +590
Copy link
Member

@mbostock mbostock Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment will have no effect if type is not Array, since assigning a value to a typed array will coerce back to a number.

return column;
}
return maybeTypedMap(data, field(name), type);
}

// Arrayify arrow tables. We try to avoid materializing the values, but the
// Proxy might be used by the group reducer to construct groupData.
function arrowTableProxy(data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this negative most of the benefit of using arrow? I think proxies add a lot of overhead and arrow already has fast proxies for e.g. iteration. Can you defer the conversion to the group reducer?

Copy link
Contributor Author

@Fil Fil Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it works. We still read the named channels with getChild(name), and the Symbol.iterator is also passed directly to the _Table object, so there shouldn't be any waste here. The only place where this is not great, is in the group transform, which assembles a new array of data points for each group (with the take function—that's what I think Arrow doesn't allow). I made a Proxy because I didn't want to add methods on the source _Table, but we could just slice() it defensively and not use a Proxy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. So in most cases you use columnar access anyway. Then maybe ignore my comment.

return new Proxy(data, {
get(target, prop) {
return prop === "length"
? target.numRows
: prop === "constructor" // for take/map
? Array
: typeof prop === "string" && !isNaN(prop)
? target.get(prop)
: target[prop]; // pass all other properties
}
});
}
121 changes: 121 additions & 0 deletions test/output/arrowDates.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
61 changes: 61 additions & 0 deletions test/output/arrowTest.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
61 changes: 61 additions & 0 deletions test/output/arrowTestAccessor.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading