This repository has been archived by the owner on Aug 5, 2021. It is now read-only.

Onda Format v0.5.0 #28

Merged: 17 commits, Feb 19, 2021

@jrevels jrevels commented Dec 26, 2020

Summarizing some of the high-level impact of these changes:

  • Users may now separately construct/manipulate/transfer/analyze annotations (in *.onda.annotations.arrow files) and signals (in *.onda.signals.arrow files).
  • Users may work with multiple files of the same type referencing the same set of recordings (e.g. I could split all of a dataset's annotations across multiple *.onda.annotations.arrow files, or write new signals to an existing recording w/o modifying existing files).
  • Users may now leverage rich tooling ecosystems for tabular data manipulation widely available across a variety of languages to manipulate/analyze signal/annotation metadata.
  • (De)serializing/reading Onda metadata can now be made quite a bit faster/mmap-able.
  • Users may now store sample data in decoded floating point format.
  • Users are no longer required to maintain a relationship between metadata and sample data locations; arbitrary URIs (e.g. S3 URIs) can be used to directly associate signals with sample data files. Relative paths may still be used when desirable.
  • IMHO, on top of all of the above, the new format is actually clearer/cleaner/easier to understand 😎
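
To make the multi-file and URI points above concrete, here is a minimal sketch in plain Python. It is not Onda.jl code and does not read Arrow files: the rows are hypothetical dictionaries standing in for table contents, and the column names (`recording`, `id`, `kind`, `file_path`), the bucket name, and both helper functions are assumptions made for illustration only.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical rows standing in for two separate annotations files
# that reference the same set of recordings.
annotations_part_1 = [{"recording": "rec-a", "id": "ann-1"}]
annotations_part_2 = [
    {"recording": "rec-a", "id": "ann-2"},
    {"recording": "rec-b", "id": "ann-3"},
]

# Hypothetical rows standing in for one signals file. One signal points at
# an absolute URI, the other at a path relative to the signals file.
signals = [
    {"recording": "rec-a", "kind": "eeg", "file_path": "s3://example-bucket/rec-a/eeg.lpcm"},
    {"recording": "rec-b", "kind": "eeg", "file_path": "samples/rec-b/eeg.lpcm"},
]


def group_by_recording(*tables):
    """Merge rows from any number of tables, keyed by their recording ID."""
    grouped = defaultdict(list)
    for table in tables:
        for row in table:
            grouped[row["recording"]].append(row)
    return dict(grouped)


def resolve_file_path(file_path, base_dir):
    """Keep URIs that carry a scheme as-is; resolve other paths against base_dir."""
    if urlparse(file_path).scheme:
        return file_path
    return posixpath.normpath(posixpath.join(base_dir, file_path))


annotations = group_by_recording(annotations_part_1, annotations_part_2)
locations = {s["recording"]: resolve_file_path(s["file_path"], "/data/dataset") for s in signals}
```

Because every row carries its own recording ID, annotations split across files can be regrouped without touching the signals table, and each signal's sample data can live wherever its `file_path` says.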

I'm currently preparing a corresponding PR to Onda.jl 🙂

@jrevels jrevels force-pushed the jr/v0_5_0 branch 2 times, most recently from c56c533 to 36c961c, on December 26, 2020 23:18

@palday palday left a comment

A few small ambiguities and some formulation tweaks, but two bigger comments

  1. a lot of my comments along the lines of "uh, this could be confusing for people of a certain background" don't need any changes here, but might be great candidates for something akin to an FAQ or documentation elsewhere.
  2. what you call "interleaved" reminds me a lot of what is called "multiplexed" in EEG formats. I haven't thought hard enough about it to know if they are actually the same or whether we're abusing DSP terms (like when non-math people use "group" and "set" interchangeably), but maybe @hannahilea knows better?

But yeah, I think Arrow makes everything better.

@@ -50,7 +50,7 @@ This document uses the term...
 - ...popular distributed analytics tools (e.g. Spark, TensorFlow).
 - ...traditional databases (e.g. PostgresSQL, Cassandra).
 - ...object-based storage systems (e.g. S3, GCP Cloud Storage).
-- ...enable metadata, annotations etc. to be stored and processed separately from raw sample data artifacts without significant file system overhead.
+- ...enable metadata, annotations etc. to be stored and processed separately from raw sample data without significant communication overhead.

even better.

jrevels commented Jan 7, 2021

> what you call "interleaved" reminds me a lot of what is called "multiplexed" in EEG formats. I haven't thought hard enough about it to know if they are actually the same or whether we're abusing DSP terms (like when non math people use "group" and "set" interchangeably), but maybe @hannahilea knows better?

I think it's basically the same IIUC. FWIW I tend to use multiplex more as a verb and interleave more as a noun (e.g. "this service multiplexes incoming streams into a single interleaved output stream"). Interleaved (e.g. vs. planar) is what's commonly used in codec terminology.

One subtle terminology difference might arise if you e.g. multiplexed incoming streams into a chunked planar format, rather than a fully interleaved format.
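
For readers less familiar with the codec terms, the three layouts mentioned here can be sketched as follows. This is an illustrative plain-Python sketch rather than anything taken from the Onda specification; the function names and the `chunk_size` parameter are hypothetical.

```python
def interleaved(channels):
    """Frame-major order: one sample from each channel, then the next frame."""
    return [sample for frame in zip(*channels) for sample in frame]


def planar(channels):
    """Channel-major order: all samples of one channel, then the next channel."""
    return [sample for channel in channels for sample in channel]


def chunked_planar(channels, chunk_size):
    """Planar within each chunk of `chunk_size` frames, with chunks in time order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    n_frames = len(channels[0])
    out = []
    for start in range(0, n_frames, chunk_size):
        for channel in channels:
            out.extend(channel[start:start + chunk_size])
    return out


# Two channels, four frames each.
channels = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(interleaved(channels))        # [1, 10, 2, 20, 3, 30, 4, 40]
print(planar(channels))             # [1, 2, 3, 4, 10, 20, 30, 40]
print(chunked_planar(channels, 2))  # [1, 2, 10, 20, 3, 4, 30, 40]
```

In this sketch a chunk size of 1 reproduces the fully interleaved layout and a chunk size equal to the number of frames reproduces the planar one, which is why a multiplexed stream is not necessarily an interleaved one.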

palday commented Jan 7, 2021

> One subtle terminology difference might arise if you e.g. multiplexed incoming streams into a chunked planar format, rather than a fully interleaved format.

And I think some of the formats do this. 🤷

@jrevels jrevels merged commit 7ec8ddc into master Feb 19, 2021
@jrevels jrevels deleted the jr/v0_5_0 branch February 19, 2021 21:28