JAMS v2 Schema #5

Closed
justinsalamon opened this issue Nov 20, 2014 · 33 comments

@justinsalamon
Contributor

@ejhumphrey I've been snooping around the v2 schema, and wanted to discuss the series type.

  1. If it has an array of "values", why should it also have time and duration fields?
  2. Is there any way to identify what the series actually represents? i.e. how do I know if this is a time series or a frequency series?
@ejhumphrey
Collaborator

The idea is that time and duration provide the placement of the series in a
global frame, and that values are sampled uniformly over that interval. If
you wanted to time stretch a series, you would only need to change two
numbers --time and duration-- and everything else is consistent.

This idea also falls out of a philosophical argument ... if your data are
uniformly sampled in time, it makes a lot of sense to use a dense format.
If the data are not uniformly sampled in time, this information is, by
certain definitions, sparse, and a "series" representation is probably a
bad idea.

Also, as far as JSON schemas are concerned, we can't enforce that two arrays
have the same length. This isn't necessarily a deal breaker, but it causes
an interesting boundary between where and by what technology a data
structure might be validated, e.g. JSON schema versus API. We want to
define as much as possible in the JSON schema, but some things may need to
be the responsibility of the API.
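
To make the schema-versus-API boundary concrete, here is a minimal sketch in plain Python. It is not JAMS code: the function name `check_parallel_arrays` and the field names `times` and `values` are assumptions made up for illustration. It performs the equal-length check that, per the comment above, the JSON schema cannot express and that would therefore fall to the API.

```python
def check_parallel_arrays(record, keys=("times", "values")):
    """Raise ValueError unless all listed array fields have the same length."""
    lengths = {key: len(record[key]) for key in keys}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"array lengths differ: {lengths}")
    return True


# Hypothetical records with explicit, parallel arrays.
good = {"times": [0.0, 0.5, 1.0], "values": [220.0, 221.5, 219.8]}
bad = {"times": [0.0, 0.5], "values": [220.0, 221.5, 219.8]}

check_parallel_arrays(good)  # passes silently
try:
    check_parallel_arrays(bad)
except ValueError as err:
    print(err)  # reports the mismatched lengths
```

A dense series with only time, duration, and values sidesteps this check entirely, which is part of the argument made above.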

@ejhumphrey
Collaborator

Oh, and per (2), I think the semantic significance of a "series" would be
defined by its namespace, taking something like "melody.hertz" or
"pitch.cents", etc.

@justinsalamon
Contributor Author

So the timestamps are implicit O_O

In an ideal world where everyone plays by the book that might be ok, but it makes me very nervous: having the timestamps explicitly listed can be a big headache saver. For starters, you can get different interpretations of how to map start+duration to timestamps, e.g. do the first and last timestamps correspond to times start and start+duration, or does the last timestamp happen just before start+duration (i.e. n vs n+1 timestamps)? And that's still completely ignoring the whole "is the timestamp centered in the frame or does it represent the beginning of each analysis frame" issue (which admittedly isn't solved by having explicit timestamps, but would be further hidden if they were implicit). Also, in the past having the timestamps listed explicitly has been really helpful in figuring out the different annotation procedures for different melody f0 annotations (including timestamp centering, hop size, etc.). And finally, it adds an extra step to obtain the timestamps, which people will have to code if they're not using the Python API (and we can't assume everyone will be), opening the door to some evil mistakes.

Yes, if you want to time-stretch a series it requires re-computing the timestamps, but if you're time-stretching uniformly by a factor X, then you just need to multiply each timestamp by X right? In which case I don't see the big win here.
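
The ambiguity described above can be sketched in a few lines of Python. Both function names are hypothetical and neither is claimed to be what the schema intends; they only show that the same (start, duration, values) triple yields different timestamps depending on whether the last sample sits at start+duration or one hop before it.

```python
def timestamps_exclusive(start, duration, n):
    """n samples; the last one falls one hop before start + duration."""
    hop = duration / n
    return [start + i * hop for i in range(n)]


def timestamps_inclusive(start, duration, n):
    """n samples; the first is at start, the last exactly at start + duration."""
    if n == 1:
        return [start]
    hop = duration / (n - 1)
    return [start + i * hop for i in range(n)]


values = [220.0, 221.0, 222.0, 223.0, 224.0]
a = timestamps_exclusive(0.0, 1.0, len(values))
b = timestamps_inclusive(0.0, 1.0, len(values))
print(a)  # hop of duration / n
print(b)  # hop of duration / (n - 1)

# Uniform time-stretching of explicit timestamps is a single multiplication.
factor = 2.0
stretched = [t * factor for t in b]
```

With explicit timestamps the reader never has to guess which of the two conventions was used, and stretching by a constant factor stays a one-liner.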

@bmcfee
Contributor

bmcfee commented Dec 18, 2014

Hijacking this thread to talk about schema in the pandas branch, which it sounds like we're converging toward for v2. Below are miscellaneous comments and questions to chew on:

@justinsalamon
Contributor Author

2¢ here we go:

  • I think that was set by @ejhumphrey. Maybe for readability?
  • Probably a good idea. That said, we should definitely not enforce it to be a number (think v0.1.2)
  • ok I guess
  • Personally I'm not a fan of having confidence encoded as a tuple, it's kinda implicit and could give rise to inconsistencies across API implementations. What are the cons of schematizing the confidence values per field?
  • Are there any use-cases where time can/should be negative?
  • Yes, but perhaps pin that until the schema stabilizes a bit

@bmcfee
Contributor

bmcfee commented Dec 18, 2014

Personally I'm not a fan of having confidence encoded as a tuple, it's kinda implicit and could give rise to inconsistencies across API implementations. What are the cons of schematizing the confidence values per field?

I'm not a huge fan of it either. There are a few things at play here:

  1. "confidence" is necessarily ill-defined here, so there will always be some up-stream interpretation of this value that lies outside of the scope of schema. Does it even make sense to require it to be a number? Do we have concrete use-cases to help guide this decision? (FWIW, I often encode confidence in terms of quantile thresholds, which already doesn't fit the schema.)
  2. In what settings do we need different confidence values for time, duration, and value? Are there any? If so, are three values enough, or would we still need to expand?

Are there any use-cases where time can/should be negative?

I can't think of any good ones, but it does pop up (accidentally) in isophonics. If we want to accurately preserve the original data, we'd have to allow negative time/duration. Do we want this, or should we be bigger sticklers for correctness? (I vote correctness over legacy; interested parties can always go back to the original sources if need be.)

@rabitt
Contributor

rabitt commented Jan 13, 2015

https://github.com/marl/jams/blob/pandas/schema/jams_schema.json#L19 Why is duration a string here? I think we should stick to always measuring time in seconds, with the possible exception of UTC timestamp strings for file creation.

I agree that it should be a float.

https://github.com/marl/jams/blob/pandas/schema/jams_schema.json#L41 Should we define a schema item for version fields? Here, we use string|number, elsewhere it's string+pattern.

+1

https://github.com/marl/jams/blob/pandas/schema/jams_schema.json#L60 I've removed all type requirements for value fields. This might be a little dangerous, but it does allow us to pack arbitrary data keyed to (time, duration) values.

No issue here.

Related to the above: should we do the same for confidence, allowing it to expand beyond a single number to a tuple which may be interpreted upstream? This would allow a practitioner to encode confidence for value, time, and/or duration without schematizing the semantics. Not sure if we want/need to allow this behavior. Use cases would be tremendously helpful here.

I'm trying to think of examples where confidence is more than just a single number... I can imagine a case where "confidence" is an array of confidence values (e.g. from a panel of experts-type system), but eh...

https://github.com/marl/jams/blob/pandas/schema/jams_schema.json#L62 I added a requirement that times and durations be non-negative. Do we really want this? Should it be enforced everywhere? I think probably yes, but a second opinion would be good.

In the Rock Corpus dataset, for example, I've seen pickup measures be notated with negative time (though, in this case, time was "global beat number", which isn't exactly time)

https://github.com/marl/jams/blob/pandas/schema/jams_schema.json#L105 "Annotation.annotation_metadata" is a bit redundant. Can we simplify to "Annotation.metadata"?

I think this should stay as it is. The only "metadata" related fields are annotation_metadata and file_metadata. If you change one I'd change both, but it doesn't make sense to me to change file_metadata to metadata because it's ambiguous at the top level.

In general, we should audit the required fields for consistency/sanity. I had to nix a few for debugging purposes, and am open to adding them back in.

Proceeding audit :

: audit complete.

@bmcfee
Contributor

bmcfee commented Jan 13, 2015

I'm trying to think of examples where confidence is more than just a single number... I can imagine a case where "confidence" is an array of confidence values (e.g. from a panel of experts-type system), but eh...

Example 1: confidence expressed as inter-quartile ranges (25th, 75th)-percentile
Example 2: simultaneous time and value confidence measurements. Not sure where that would pop up, but it seems sensible.

In the Rock Corpus dataset, for example, I've seen pickup measures be notated with negative time (though, in this case, time was "global beat number", which isn't exactly time)

Yes, I'd like to make a very clear distinction between "time"-typed measurements and all other values. Beat numbers can be negative, no problem.
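
As a rough illustration of that distinction, the sketch below checks a single observation in plain Python. The function `validate_observation` and the record layout are assumptions for illustration, not the JAMS API: time and duration must be non-negative numbers, while value is left untyped so that something like a negative beat number passes.

```python
def validate_observation(obs):
    """Check one (time, duration, value) record; value is left untyped."""
    for key in ("time", "duration"):
        x = obs[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise TypeError(f"{key} must be a number, got {type(x).__name__}")
        if x < 0:
            raise ValueError(f"{key} must be non-negative, got {x}")
    return obs


# A pickup beat: the time in seconds is non-negative, the beat number is not.
pickup = {"time": 0.0, "duration": 0.5, "value": -1}
validate_observation(pickup)

try:
    validate_observation({"time": -0.25, "duration": 0.5, "value": 1})
except ValueError as err:
    print(err)
```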

I think this should stay as it is. The only "metadata" related fields are annotation_metadata and file_metadata. If you change one I'd change both, but it doesn't make sense to me to change file_metadata to metadata because it's ambiguous at the top level.

Good point. +1.

@bmcfee
Contributor

bmcfee commented Jan 14, 2015

Scientific wild-assed guesses follow:

What is "release"? I'd opt for a more specific name (e.g. dataset_release)

I believe this is inherited from discogs or musicbrainz schema, and corresponds to the track release, not the dataset.

What is the difference between "corpus" and "data_source"?

I think "corpus" would be something like "Isophonics" or "RWC"; ~~"data_source" would be "QMUL" or "AIST"?~~ @ejhumphrey care to clarify?

source --> activation or source_activation

+1

@justinsalamon
Contributor Author

We originally based fields in annotation_metadata on:
G. Peeters and K. Fort, "Towards A (Better) Definition Of The Description Of Annotated M.I.R. Corpora"

There, they have a field called "origin", where example values include "synthetic", "experiment", "aggregation", "crowdsourcing", "game with a purpose" and "manual". I.e., the idea was to describe how the annotation data was generated.

I think that somewhere in history the term "origin" was deemed too ambiguous and replaced with "data_source". But yeah, I think that was the original purpose of that field...

@bmcfee
Contributor

bmcfee commented Jan 14, 2015

This is starting to sound like the schema should be commented... :)

@bmcfee
Contributor

bmcfee commented Feb 2, 2015

Note: turns out json doesn't allow comments. laaaaaaaaaame!

Unrelated: I tweaked the FileMetadata.duration schema pattern yesterday to fix a couple of things:

  • The expression was allowing minutes and seconds fields > 59
  • it also had no provision for fractional seconds

Then I got to thinking... why is duration a string instead of a number measuring track duration in seconds? Can we change it to be a number?

Time strings are slightly more human-readable than raw numbers, but they add considerable complexity to anything downline that wants to actually use the values.
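
To illustrate the downstream cost, here is a hedged sketch of what a consumer has to write to get seconds out of a time string. The pattern below is hypothetical (optional hours, minutes and seconds capped at 59, optional fractional seconds); it is not the actual pattern in the JAMS schema, and `duration_to_seconds` is a name invented for this example.

```python
import re

# Hypothetical format: [H:]MM:SS[.fff], with minutes and seconds capped at 59.
# This is NOT the pattern from the JAMS schema, only an illustration.
DURATION_RE = re.compile(r"(?:(\d+):)?([0-5]?\d):([0-5]?\d(?:\.\d+)?)")


def duration_to_seconds(text):
    """Convert a time string such as '3:25.5' into a float number of seconds."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid duration string: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + float(seconds)


seconds = duration_to_seconds("3:25.5")

# With a numeric field there is nothing to parse at all.
numeric_duration = 205.5
```

Every non-Python consumer would need its own version of the parser, whereas a plain number in seconds can be used as-is.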

@urinieto
Contributor

urinieto commented Feb 2, 2015

I would also like to have a float in seconds to represent duration. In
fact, that's what I have in MSAF. No idea why it is a string.

@bmcfee
Contributor

bmcfee commented Feb 3, 2015

Another irksome thing: do we have a convention for encoding filenames in FileMetadata? Does this belong in the identifiers sandbox, or should we promote it to a first-class property?

The use-case I'm thinking of is that many datasets have standardized layouts on disk (see: SMC, beatles_iso, MSD, etc etc etc), and since jams provides no handles for storing audio data itself, we should provide some way of linking back to the file on disk.

@bmcfee
Contributor

bmcfee commented Feb 3, 2015

+3 for duration->float
+3 for adding 'content_path' to filemetadata for audio path

@ejhumphrey
Collaborator

I understand the value of something like "content_path", but this gets a
little janky if you start sharing annotations. This isn't some fundamental
piece of information about the file to which it corresponds, and will have
a bunch of drift if everyone keeps updating it.

Seems like a piece of information you'd want to sit on top of a JAMS
object, no? thoughts?

@bmcfee
Contributor

bmcfee commented Feb 3, 2015

07804f0 fixes the duration and adds content_path to the metadata.

I understand the value of something like "content_path", but this gets a
little janky if you start sharing annotations.

I don't see this as an issue, especially for annotations tied to specific corpora (eg, beatles_isophonics or cal500). Presumably, every jams file will correspond to at least some audio file (or midi file, h5 file, whatever), whereas linked identifiers are a bit more open-ended.

We previously had no provision for linking back to the content source. It could have been stuck in a sandbox along with identifiers, but we ( @rabitt and @justinsalamon ) discussed the point in the lab and agreed that source file path is a bit different from generic identifiers, and there should be a common way to access it.

I'm open to other solutions, but adding a content_path field to the metadata seemed like the simplest solution.

@ejhumphrey
Collaborator

simple solution, sure, but I don't think it's a permanent solution.
There's no way to validate that the information is still true, and
filename drift is a thing.

@bmcfee
Contributor

bmcfee commented Feb 3, 2015

There's no way to validate that the information is still true, and
filename drift is a thing.

Agreed. I suppose this is what the identifiers sandbox is for, but that's a much bigger hammer for a small, but more frequent problem.

I guess the more general concern here is: how much stuff in jams is meant to be mutable, and how should we indicate it?

@ejhumphrey
Collaborator

On Tue, Feb 3, 2015 at 6:24 PM, Brian McFee wrote:

There's no way to validate that the information is still true, and
filename drift is a thing.

Agreed. I suppose this is what the identifiers sandbox is for, but that's
a much bigger hammer for a small, but more frequent problem.

So, re: sponging local filepaths, maybe it's not the job of JAMS to do
that. Writing the filename into the JAMS object is no more stable or
reliable than telling someone to apply the same filebase to both the JAMS
file and the audio file to which it corresponds.
This is information for a dataset manager, and I don't think JAMS is that
technology.

I guess the more general concern here is: how much stuff in jams is meant
to be mutable, and how should we indicate it?

I have a strong aversion to mutability, and I'd argue the only real
modifications a JAMS object should undergo are an append or merge. Curious
to hear any counterarguments.

@bmcfee
Contributor

bmcfee commented Feb 4, 2015

I have a strong aversion to mutability, and I'd argue the only real
modifications a JAMS object should undergo are an append or merge. Curious
to hear any counterarguments.

Ideally sure. But in reality, datasets always change. Probably it's the job of a VCS to manage this kind of thing. (PSSSST: always version your data!)

So, re: sponging local filepaths, maybe it's not the job of JAMS to do
that. Writing the filename into the JAMS object is no more stable or
reliable than telling someone to apply the same filebase to both the JAMS
file and the audio file to which it corresponds.

Fair enough.

So.. would you object to using the identifiers sandbox for relative locations inside well-defined datasets? My understanding is that these can always be augmented with additional identifiers as needed, and relative file path seems like a reasonable id for most existing datasets.

@bmcfee
Contributor

bmcfee commented Feb 9, 2015

Hey folks: can we get a consensus on this content path issue? The way I see it, we have three options for this functionality:

OPTION 1: schematized relative paths

  • add a field to FileMetadata to point from the jams object to a relative path within the collection
  • example: in isophonics, you might have jam.file_metadata.content_path == 'audio/The Beatles/01_-_Please_Please_Me/01_-_I_Saw_Her_Standing_There.flac'
  • Pro: easy access. Makes intuitive sense for many existing corpora.
  • Con: rigid. would be unwieldy for data that has no unique source (ie, lives in multiple collections)

OPTION 2: sandboxed relative paths

  • Like the above, but content_path would be a sandbox variable, ie, outside of the schema.
  • example: jam.file_metadata.sandbox.content_path == 'audio/The Beatles/01_-_Please_Please_Me/01_-_I_Saw_Her_Standing_There.flac'
  • Pro: more flexible than option 1, but does basically the same thing. Fits the existing style for cross-referencing identifiers (eg, musicbrainz or echo nest).
  • Con: could become a de facto standard, rather than an official standard, for accessing content through jams. No way to enforce consistency of use across datasets.

OPTION 3: punt upstream to a proper database

  • Provide no functionality for indexing content within a jams. Instead, rely on an upstream database to provide the mapping.
  • Pro: makes our lives (as jams developers) easier
  • Con: makes usage more difficult, especially for one-off projects that lack proper database infrastructure.

Votes?
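
For readers comparing the options, here is a minimal sketch of how they might look as plain dictionaries, using the example path from the list above. The helper `resolve_audio` and the collection root are hypothetical and not part of JAMS; the point is only that options 1 and 2 carry the same information in different places, while option 3 stores nothing in the file.

```python
import os

AUDIO = "audio/The Beatles/01_-_Please_Please_Me/01_-_I_Saw_Her_Standing_There.flac"

# Option 1: content_path is a schematized FileMetadata field.
option_1 = {"file_metadata": {"content_path": AUDIO}}

# Option 2: content_path lives in a free-form sandbox, outside the schema.
option_2 = {"file_metadata": {"sandbox": {"content_path": AUDIO}}}

# Option 3: nothing is stored; an upstream database provides the mapping.
option_3 = {"file_metadata": {}}


def resolve_audio(jam, collection_root):
    """Hypothetical helper: join a collection root with the stored relative path."""
    meta = jam["file_metadata"]
    rel = meta.get("content_path") or meta.get("sandbox", {}).get("content_path")
    if rel is None:
        return None  # option 3: the caller must ask the upstream database
    return os.path.join(collection_root, rel)


path_1 = resolve_audio(option_1, "/data/isophonics")
path_2 = resolve_audio(option_2, "/data/isophonics")
```

Because the stored path is relative, the same file works wherever the collection is mounted; only the root passed to the helper changes.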

@bmcfee
Contributor

bmcfee commented Feb 9, 2015

PS: my vote is to support 2 in our conversion scripts, but encourage 3 whenever possible. The two need not be incompatible.

@justinsalamon
Contributor Author

+1 for option 2. If it does become a de-facto standard further down the line we can consider incorporating it in the schema in a future release, but yeah, not sure I'd make it a rigid part of the schema right now.

@urinieto
Contributor

Also like 2 better.

@bmcfee
Contributor

bmcfee commented Feb 13, 2015

Sounds like option 2 is the winner. @ejhumphrey : you agree?

@ejhumphrey
Collaborator

Apologies for delays, been in the woods for the last few days.

Generally on board with 2. How do we feel about
jam.file_metadata.sandbox.content_path vs jam.sandbox.content_path ?

@bmcfee
Contributor

bmcfee commented Feb 17, 2015

jam.sandbox.content_path

+1

@ejhumphrey
Collaborator

i digz

@bmcfee
Contributor

bmcfee commented Feb 25, 2015

Done and done.

Anything else remaining, or shall we close this issue?

@bmcfee
Contributor

bmcfee commented May 15, 2015

once more: can we close this one?

@ejhumphrey
Collaborator

uh, sure?

@bmcfee bmcfee modified the milestone: 0.2 May 15, 2015
@justinsalamon
Contributor Author

looks like it yeah
