Onda Format v0.5.0 #28

jrevels · 2020-12-26T23:09:27Z

implements explore usage of arrow files instead of msgpack.zst #25 (switched from MessagePack to tabular metadata representation w/ Arrow files)
implements allow sample data and non-Onda-specified metadata artifacts to be addressed by URI rather than relative location #22 (provided explicit file_path column for signals table)
implements change name of "recordings.msgpack.zst" to "manifest.onda" #24 "in spirit" (*.onda.annotations.arrow and *.onda.signals.arrow files can have arbitrary file names)
implements support float32/float64 sample_type fields #26 (float32/float64 sample type support)
replaces wonky file_extension/file_options fields with new file_format column to go along with the new file_path column
minor touch ups to a couple of sections to improve clarity/provide detail.

Summarizing some of the high-level impact of these changes:

Users may now separately construct/manipulate/transfer/analyze annotations (in *.onda.annotations.arrow files) and signals (in *.onda.signals.arrow files).
Users may work with multiple files of the same type referencing the same set of recordings (e.g. I could split all of a dataset's annotations across multiple *.onda.annotations.arrow files, or write new signals to an existing recording w/o modifying existing files).
Users may now leverage rich tooling ecosystems for tabular data manipulation widely available across a variety of languages to manipulate/analyze signal/annotation metadata.
(De)serializing/reading Onda metadata can now be made quite a bit faster/mmap-able.
Users may now store sample data in decoded floating point format.
Users are no longer required to maintain a relationship between metadata and sample data locations; arbitrary URIs (e.g. S3 URIs) can be used to directly associate signals with sample data files. Relative paths may still be used when desirable.
IMHO, on top of all of the above, the new format is actually clearer/cleaner/easier to understand 😎
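To make the file_path point above concrete, here is a minimal sketch of how a reader might treat a signals table row whose file_path is either a URI or a relative path. The `file_path` and `file_format` column names come from this PR; the `resolve_file_path` helper, the example bucket and paths, the `file_format` values, and the choice to resolve relative paths against a caller-supplied directory are assumptions for illustration, not something the format specifies.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_file_path(file_path: str, base_dir: str) -> str:
    """Return file_path unchanged if it carries a URI scheme, else join it onto base_dir.

    Hypothetical helper: the base directory convention is an assumption of this sketch.
    Note that a Windows drive path such as "C:\\data" would be mistaken for a URI here.
    """
    if urlparse(file_path).scheme:
        return file_path
    return str(PurePosixPath(base_dir) / file_path)


# Illustrative rows only; real tables carry more columns than shown here.
signals = [
    {"recording": "rec-1", "file_path": "s3://example-bucket/rec-1/eeg.lpcm.zst", "file_format": "lpcm.zst"},
    {"recording": "rec-1", "file_path": "samples/rec-1/ecg.lpcm", "file_format": "lpcm"},
]

resolved = [resolve_file_path(row["file_path"], "/data/my_dataset") for row in signals]
```

The S3 URI passes through untouched, while the relative path is anchored to whatever directory the caller chooses, so metadata tables can move independently of the sample data they point to.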

I'm currently preparing a corresponding PR to Onda.jl 🙂

palday

A few small ambiguities and some formulation tweaks, but two bigger comments:

a lot of my comments along the lines of "uh, this could be confusing for people of a certain background" don't need any changes in this, but might be great candidates for something akin to an FAQ or documentation elsewhere.
what you call "interleaved" reminds me a lot of what is called "multiplexed" in EEG formats. I haven't thought hard enough about it to know if they are actually the same or whether we're abusing DSP terms (like when non math people use "group" and "set" interchangeably), but maybe @hannahilea knows better?

But yeah, I think Arrow makes everything better.

README.md

Co-authored-by: Phillip Alday <[email protected]>

palday · 2021-01-07T19:40:19Z

README.md

@@ -50,7 +50,7 @@ This document uses the term...
    - ...popular distributed analytics tools (e.g. Spark, TensorFlow).
    - ...traditional databases (e.g. PostgresSQL, Cassandra).
    - ...object-based storage systems (e.g. S3, GCP Cloud Storage).
- ...enable metadata, annotations etc. to be stored and processed separately from raw sample data artifacts without significant file system overhead.
+- ...enable metadata, annotations etc. to be stored and processed separately from raw sample data without significant communication overhead.


even better.

jrevels · 2021-01-07T19:45:38Z

> what you call "interleaved" reminds me a lot of what is called "multiplexed" in EEG formats. I haven't thought hard enough about it to know if they are actually the same or whether we're abusing DSP terms (like when non math people use "group" and "set" interchangeably), but maybe @hannahilea knows better?

I think it's basically the same IIUC. FWIW I tend to use multiplex more as a verb and interleave more as a noun (e.g. "this service multiplexes incoming streams into a single interleaved output stream"). Interleaved (e.g. vs. planar) is what's commonly used in codec terminology.

One subtle terminology difference might arise if you e.g. multiplexed incoming streams into a chunked planar format, rather than a fully interleaved format.

palday · 2021-01-07T20:20:52Z

> One subtle terminology difference might arise if you e.g. multiplexed incoming streams into a chunked planar format, rather than a fully interleaved format.

And I think some of the formats do this. 🤷
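For readers unfamiliar with the terms in this thread, the three layouts can be sketched as follows. This is only an illustration of the general codec terminology (planar, interleaved, chunked planar) using plain Python lists; the function names and the toy two-channel data are hypothetical and are not part of the Onda format or of any particular EEG format.

```python
def interleave(channels):
    """Planar -> interleaved: for each time step, emit one sample per channel."""
    return [sample for frame in zip(*channels) for sample in frame]


def deinterleave(samples, channel_count):
    """Interleaved -> planar: every channel_count-th sample belongs to the same channel."""
    return [samples[i::channel_count] for i in range(channel_count)]


def chunked_planar(channels, chunk_size):
    """Emit chunk_size consecutive samples of each channel in turn, chunk after chunk."""
    out = []
    for start in range(0, len(channels[0]), chunk_size):
        for channel in channels:
            out.extend(channel[start:start + chunk_size])
    return out


# Two channels, four samples each, stored planar (one list per channel).
planar = [[1, 2, 3, 4], [10, 20, 30, 40]]

interleaved = interleave(planar)     # [1, 10, 2, 20, 3, 30, 4, 40]
chunked = chunked_planar(planar, 2)  # [1, 2, 10, 20, 3, 4, 30, 40]
```

Both `interleaved` and `chunked` are "multiplexed" in the sense of combining several streams into one, but only the first is fully interleaved, which is the distinction drawn above.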

jrevels force-pushed the jr/v0_5_0 branch 2 times, most recently from c56c533 to 36c961c on December 26, 2020 23:18

Onda Format v0.5.0

9c0e58d

jrevels force-pushed the jr/v0_5_0 branch from 36c961c to 9c0e58d on December 28, 2020 01:18

jrevels added 2 commits December 30, 2020 12:51

signal type -> signal kind to avoid confusion with sample_type

6130ddf

clarify allowance of Arrow extension type aliases

e761985

jrevels force-pushed the jr/v0_5_0 branch from 35c2026 to e761985 on December 30, 2020 18:11

jrevels added 2 commits December 30, 2020 15:12

channel_names -> channels

3a151ca

shuffle column order

b56365d

jrevels mentioned this pull request Jan 4, 2021

refactor for Onda Format v0.5.0 beacon-biosignals/Onda.jl#59

Merged

jrevels requested a review from palday January 7, 2021 13:57

tweak outdated wording

cbcace4

palday reviewed Jan 7, 2021

jrevels and others added 4 commits January 7, 2021 10:31

Update README.md

b61603e

Co-authored-by: Phillip Alday <[email protected]>

Update README.md

9cd11f4

Co-authored-by: Phillip Alday <[email protected]>

implement Phillip suggestions

a217c44

merge

ad94ca5

palday reviewed Jan 7, 2021

jrevels added 7 commits January 20, 2021 00:14

start_nanosecond -> start, stop_nanosecond -> stop since units are now codified in column type

f68eaee

the format technically doesn't force recording time periods to be specified

1903443

recording_uuid -> recording, uuid -> id, for similar reason as _nanosecond suffix removal

30390e0

start/stop -> span to enable nicer application-layer type extensions + you almost always need both anyway

3c9b1a5

allow additional author-provided columns to Onda tables

ac38fcb

*.signals -> *.onda.signals.arrow

03a9f9d

wip

19cdcbd

jrevels merged commit 7ec8ddc into master Feb 19, 2021

jrevels deleted the jr/v0_5_0 branch February 19, 2021 21:28
