explore usage of arrow files instead of msgpack.zst #25
now that Arrow is v1.0 and there's a nice Julia package for it

Comments
I would recommend calling the disk format Parquet, because that's the actual disk format. Arrow itself is an abstraction that can work with various disk formats, including the now-obsolete Feather, and Parquet. The implementations in R and Python make this distinction clearer. 🙁
I think they are different (arrow-on-disk vs parquet), unless https://stackoverflow.com/a/56481636 is out of date?
Ugh, this explains part of the mess I've had in roundtripping output from Arrow.jl to any of the other languages. I've had essentially no problem moving data between Python and R, but moving between either of those and Julia has been a source of frustration. But the main takeaway is the same: Arrow is primarily an in-memory format/protocol, so any talk of "Arrow files" needs to be really explicit about what's going on. Also, things without native types in other languages can wind up really messed up, e.g. not all languages have unsigned integer types. There's an obvious fix on the reader side for those languages, but I suspect there will be a few lingering compatibility issues (although that may be less of an issue for Onda).
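For concreteness, a tiny sketch of the unsigned-type situation from the Julia side (file name is arbitrary):

```julia
using Arrow

# Writing an unsigned column is fine from Julia's side; the burden falls on
# readers in languages without native unsigned integers (e.g. R), which have
# to widen or reinterpret the values on their end.
Arrow.write("counts.arrow", (n = UInt16[1, 2, 3],))
Arrow.Table("counts.arrow").n  # round-trips losslessly within Julia
```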
To be clear, whenever I refer to "Arrow files" I mean https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format. I did a super naive little experiment to make sure that basic Onda metadata was (de)serializable via Arrow.jl; you can find it here (note that I'm also playing with some of the other v0.5.0 changes there too). So switching to Arrow.jl seems definitely doable, and it seems like it'd even yield slight performance improvements.

It'd be nice if Onda's metadata was more columnar in nature: when the data is amenable, migrating from fewer columns/more nesting to more columns/less nesting is generally a win from an accessibility perspective, since it makes it easier to naively express finer-grained accesses. Unfortunately, most of the Onda metadata really is pretty "record-like", in that you often do want to read most of a row. I guess it'd be worth breaking up annotations and signals into separate columns, at least...I'll play around with it. The only roadbump I really encountered was apache/arrow-julia#75, though it's pretty workaroundable; could just go back to

EDIT: After thinking about the above, I think the fields most likely to be individually predicated against (e.g. in reductions/searches) are signal names, channel names, and annotation values. Sample rates and time spans come next, and the least likely thing to be individually predicated against is plain encoding information (which will likely just be used when sample data is loaded). In many senses, it'd be really, really nice to have annotations in their own table and to replace the current "value" field with "the dataset author can include whatever additional columns they want alongside the recording UUID and time span columns." However, I'm not sure the complexity of having multiple files is really worthwhile; I wonder if I can just store two tables alongside each other in a single Arrow file. I keep going back and forth between separate tables and just sticking with the nested struct option used in my example lol. IIRC, Arrow's struct layout technically enables similar field-access optimizations either way, but it would be nice to keep things less nested if possible...will keep playing around with it.
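For illustration, here's roughly what such an experiment looks like with Arrow.jl; the column names below are hypothetical stand-ins, not Onda's actual schema:

```julia
using Arrow, UUIDs

# "Record-like" metadata laid out as more columns/less nesting.
signals = (recording = [uuid4(), uuid4()],
           signal_name = [:eeg, :ecg],
           channel_names = [["fp1", "fp2"], ["avl"]],
           sample_rate = [256.0, 128.0])

Arrow.write("signals.arrow", signals)  # Arrow IPC file format on disk
table = Arrow.Table("signals.arrow")   # memory-mapped, lazily materialized

# Flatter layouts make finer-grained accesses easy to express naively:
table.signal_name
```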
Okay - I've implemented two alternative approaches here that I'd like input on.
Thoughts:
I'm leaning towards implementing an Approach-1-like index on top of Approach 2 and seeing how much slower/more work it is. If the result is feasible/ergonomic, then it seems to me like Approach 2 is the way to go. Otherwise it'll be a tougher call...thoughts?
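For what it's worth, one hypothetical shape such an index could take (the linked approaches' actual contents aren't reproduced in this thread, so this is only a guess at the idea, with a made-up `:recording` column):

```julia
using Tables

# Map each recording UUID to the indices of its rows in a flat table,
# so per-recording queries don't have to scan every row.
function build_recording_index(table)
    index = Dict{Any,Vector{Int}}()
    for (i, row) in enumerate(Tables.rows(table))
        push!(get!(index, Tables.getcolumn(row, :recording), Int[]), i)
    end
    return index
end
```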
Have been playing around with this all day, and now feel pretty satisfied with the implementation here 😎 I'm still a Tables.jl noob, so some of those delegations might not be ideal. I'm going to start test-driving this on Beacon-internal data later this week and will report back.
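For readers unfamiliar with Tables.jl, the "delegations" in question are presumably along these lines (a hypothetical wrapper type, not the actual implementation):

```julia
using Tables

struct Dataset{T}
    table::T
end

# Forward the Tables.jl interface to the wrapped table.
Tables.istable(::Type{<:Dataset}) = true
Tables.rowaccess(::Type{<:Dataset}) = true
Tables.rows(d::Dataset) = Tables.rows(d.table)
Tables.schema(d::Dataset) = Tables.schema(Tables.rows(d.table))
```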
Just curious, any idea why? I'm assuming that if we write using the streaming format, we can append to existing tables?
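For reference, a sketch of the streaming-format behavior, assuming a current Arrow.jl (Arrow.append postdates this discussion, so treat this as how things stand now rather than at the time):

```julia
using Arrow

# Writing to an IO defaults to the IPC streaming format (no file footer).
open("log.arrow", "w") do io
    Arrow.write(io, (x = [1, 2], y = ["a", "b"]))
end

# The streaming format can be appended to, since readers just consume
# record batches until EOF; the file format's trailing footer precludes this.
Arrow.append("log.arrow", (x = [3], y = ["c"]))

Arrow.Table("log.arrow").x  # 3-element column: 1, 2, 3
```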
Yup, making those faster was actually the goal that motivated the implementation change. The alternative approach wraps whatever row types the underlying table produces. The big difference here is that this alternative approach better respects caller/table-constructor decisions, so that the caller can use the right underlying (or overlying!) table type for their use case.
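In Tables.jl terms, I'd guess the wrapping looks something like this (hypothetical SignalRow type and field name, just to illustrate the pattern):

```julia
using Tables

# Wrap whatever row type the source table produces, delegating the
# Tables.jl row interface to the wrapped row.
struct SignalRow{R} <: Tables.AbstractRow
    row::R
end

Tables.columnnames(s::SignalRow) = Tables.columnnames(getfield(s, :row))
Tables.getcolumn(s::SignalRow, nm::Symbol) = Tables.getcolumn(getfield(s, :row), nm)
Tables.getcolumn(s::SignalRow, i::Int) = Tables.getcolumn(getfield(s, :row), i)

# Domain accessors then work regardless of the caller's chosen table type
# (signal_name is a hypothetical field, resolved via Tables.getcolumn).
signalname(s::SignalRow) = s.signal_name
```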