JAMS v2 Schema #5
The idea is that time and duration provide the placement of the series in a This idea also falls out of a philosophical argument ... if your data are Also, as far as JSON schemas are concerned, we can't enforce two arrays On Thu, Nov 20, 2014 at 12:18 PM, justinsalamon [email protected]
|
Oh, and per (2), I think the semantic significance of a "series" would be On Thu, Nov 20, 2014 at 12:34 PM, Eric Humphrey [email protected]
|
So the timestamps are implicit O_O In an ideal world where everyone plays by the book that might be ok, but that makes me very nervous: having the timestamps explicitly listed can be a big headache saver. For starters, you can get different interpretations of how to map start+duration to timestamps - e.g. do the first and last timestamps correspond to times start and start+duration, or does the last timestamp happen just before start+duration (i.e. n vs n+1 timestamps)? And that's still completely ignoring the whole "is the timestamp centered in the frame or does it represent the beginning of each analysis frame" issue (which admittedly isn't solved by having explicit timestamps, but would be further hidden if they were implicit). Also, in the past having the timestamps listed explicitly has been really helpful in figuring out the different annotation procedures for different melody f0 annotations (including timestamp centering, hop size, etc.). And finally, it adds an extra step to obtain the timestamps which people will have to code if they're not using the Python API (and we can't assume everyone will be), opening the door to some evil mistakes. Yes, if you want to time-stretch a series it requires re-computing the timestamps, but if you're time-stretching uniformly by a factor X, then you just need to multiply each timestamp by X, right? In which case I don't see the big win here. |
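To make the n vs n+1 ambiguity concrete, here is a minimal sketch in Python. The names `implicit_timestamps`, `include_end` and `stretch` are made up for illustration and are not part of the JAMS API; the only point is that the same start, duration and number of values produce different grids under the two readings, while a uniform stretch of explicit timestamps is a single multiplication.

```python
def implicit_timestamps(start, duration, n_values, include_end):
    """Reconstruct timestamps for a series stored only as start + duration.

    Hypothetical helper, not part of the JAMS API.
    include_end=True : first and last samples sit at start and start + duration.
    include_end=False: the last sample falls one hop before start + duration.
    """
    if n_values < 1:
        raise ValueError("need at least one value")
    if n_values == 1:
        return [float(start)]
    # The two readings differ only in how many hops span the duration.
    intervals = n_values - 1 if include_end else n_values
    hop = duration / intervals
    return [start + i * hop for i in range(n_values)]


def stretch(timestamps, factor):
    """Uniform time-stretch of explicit timestamps: scale each one by factor."""
    return [t * factor for t in timestamps]


# Same series header, two different interpretations of the timing.
closed = implicit_timestamps(0.0, 1.0, 5, include_end=True)
half_open = implicit_timestamps(0.0, 1.0, 5, include_end=False)
stretched = stretch(closed, 2.0)
```

With five values over one second, the first reading gives a hop of 0.25 s and the second a hop of 0.2 s, so anyone who guesses the wrong convention ends up with every timestamp after the first one shifted.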
Hijacking this thread to talk about schema in the pandas branch, which it sounds like we're converging toward for v2. Below are miscellaneous comments and questions to chew on:
|
2¢ here we go:
|
I'm not a huge fan of it either. There are a few things at play here:
I can't think of any good ones, but it does pop up (accidentally) in isophonics. If we want to accurately preserve the original data, we'd have to allow negative time/duration. Do we want this, or should we be bigger sticklers for correctness? (I vote correctness over legacy; interested parties can always go back to the original sources if need be.) |
I agree that it should be a float.
+1
No issue here.
I'm trying to think of examples where confidence is more than just a single number... I can imagine a case where "confidence" is an array of confidence values (e.g. from a panel of experts-type system), but eh...
In the Rock Corpus dataset, for example, I've seen pickup measures be notated with negative time (though, in this case, time was "global beat number", which isn't exactly time)
I think this should stay as it is. The only "metadata" related fields are annotation_metadata and file_metadata. If you change one I'd change both, but it doesn't make sense to me to change file_metadata to metadata because it's ambiguous at the top level.
Proceeding audit :
: audit complete. |
Example 1: confidence expressed as inter-quartile ranges (25th, 75th)-percentile
Yes, I'd like to make a very clear distinction between "time"-typed measurements and all other values. Beat numbers can be negative, no problem.
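A rough sketch of that distinction, assuming the stricter "correctness over legacy" reading wins: time and duration are checked for sign, the value is left alone. `check_observation` is a hypothetical helper written for this comment, not something taken from the schema or the Python API.

```python
def check_observation(time, duration, value):
    """Reject negative time or duration; leave the value unconstrained.

    Hypothetical check, not taken from the JAMS schema.
    """
    if time < 0:
        raise ValueError(f"time must be non-negative, got {time}")
    if duration < 0:
        raise ValueError(f"duration must be non-negative, got {duration}")
    return {"time": float(time), "duration": float(duration), "value": value}


# A pickup beat numbered -1 is fine: the negative number lives in the value.
pickup = check_observation(time=0.0, duration=0.5, value=-1)
```

Under this reading, a legacy annotation with a negative onset would fail the check and would have to be fixed at conversion time, with the original source kept around for anyone who needs it.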
Good point. +1. |
Scientific wild-assed guesses follow:
I believe this is inherited from discogs or musicbrainz schema, and corresponds to the track release, not the dataset.
I think "corpus" would be something like "Isophonics" or "RWC"; ~~~"data_source" would be "QMUL" or "AIST"?~~~ @ejhumphrey care to clarify?
+1 |
We originally based fields in annotation_metadata on: There, they have a field called "origin", where example values include "synthetic", "experiment", "aggregation", "crowdsourcing", "game with a purpose" and "manual". I.e., the idea was to describe how the annotation data was generated. I think that somewhere in history the term "origin" was deemed too ambiguous and replaced with "data_source". But yeah, I think that was the original purpose of that field... |
This is starting to sound like the schema should be commented... :) |
Note: turns out JSON doesn't allow comments. laaaaaaaaaame! Unrelated: I tweaked the FileMetadata.duration schema pattern yesterday to fix a couple of things:
Then I got to thinking... why is duration a string instead of a number measuring track duration in seconds? Can we change it to be a number? Time strings are slightly more human-readable than raw numbers, but they add considerable complexity to anything downline that wants to actually use the values. |
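For a sense of the extra work a string puts on consumers, here is a minimal sketch of the conversion every downstream tool would need. The accepted formats (`SS`, `MM:SS`, `HH:MM:SS`, with optional fractional seconds) are an assumption, since the actual pattern is not quoted in this thread, and `duration_to_seconds` is a made-up name.

```python
def duration_to_seconds(text):
    """Convert a clock-style duration string to seconds.

    Assumes 'SS', 'MM:SS' or 'HH:MM:SS'; the real FileMetadata.duration
    pattern is not shown in the thread, so this is only illustrative.
    """
    parts = text.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unrecognized duration string: {text!r}")
    seconds = 0.0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


track_length = duration_to_seconds("3:45")
long_track = duration_to_seconds("1:02:03.5")
```

If the field is a plain number of seconds, none of this is needed and the value can be used directly.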
I would also like to have a float in seconds to represent duration. In On Monday, February 2, 2015, Brian McFee [email protected] wrote:
|
Another irksome thing: do we have a convention for encoding filenames in FileMetadata? Does this belong in the identifiers sandbox, or should we promote it to a first-class property? The use-case I'm thinking of is that many datasets have standardized layouts on disk (see: SMC, beatles_iso, MSD, etc etc etc), and since jams provides no handles for storing audio data itself, we should provide some way of linking back to the file on disk. |
+3 for duration->float |
I understand the value of something like "content_path", but this gets a Seems like a piece of information you'd want to sit on top of a JAMS On Tue, Feb 3, 2015 at 11:52 AM, Brian McFee [email protected]
|
07804f0 fixes the duration and adds content_path to the metadata.
I don't see this as an issue, especially for annotations tied to specific corpora (e.g., beatles_isophonics or cal500). Presumably, every jams file will correspond to at least some audio file (or midi file, h5 file, whatever), whereas linked identifiers are a bit more open-ended. We previously had no provision for linking back to the content source. It could have been stuck in a sandbox along with identifiers, but we ( @rabitt and @justinsalamon ) discussed the point in the lab and agreed that source file path is a bit different from generic identifiers, and there should be a common way to access it. I'm open to other solutions, but adding a content_path field to the metadata seemed like the simplest solution. |
simple solution, sure, but I don't think it's a permanent solution. On Tue, Feb 3, 2015 at 2:21 PM, Brian McFee [email protected]
|
Agreed. I suppose this is what the identifiers sandbox is for, but that's a much bigger hammer for a small, but more frequent problem. I guess the more general concern here is: how much stuff in jams is meant to be mutable, and how should we indicate it? |
On Tue, Feb 3, 2015 at 6:24 PM, Brian McFee [email protected]
|
Ideally sure. But in reality, datasets always change. Probably it's the job of a VCS to manage this kind of thing. (PSSSST: always version your data!)
Fair enough. So.. would you object to using the identifiers sandbox for relative locations inside well-defined datasets? My understanding is that these can always be augmented with additional identifiers as needed, and relative file path seems like a reasonable id for most existing datasets. |
Hey folks: can we get a consensus on this content path issue? The way I see it, we have three options for this functionality:
OPTION 1: schematized relative paths
OPTION 2: sandboxed relative paths
OPTION 3: punt upstream to a proper database
Votes? |
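For reference, option 2 might look something like the sketch below. The key name `relative_path`, the helper `resolve_content`, the example paths, and the placement of `identifiers` under `file_metadata` are all assumptions made for illustration; nothing in this thread fixes them.

```python
import json
import os

# A pared-down JAMS-like document; only the fields needed for the example.
jam = {
    "file_metadata": {
        "title": "Example Track",
        "duration": 225.0,
        "identifiers": {
            # Hypothetical key: path relative to the dataset root.
            "relative_path": "audio/example_track.wav",
        },
    },
}


def resolve_content(jam, dataset_root):
    """Join the sandboxed relative path onto a local dataset root.

    Returns None when the document carries no relative path, since the
    sandbox is free-form and nothing guarantees the key exists.
    """
    identifiers = jam["file_metadata"].get("identifiers", {})
    rel = identifiers.get("relative_path")
    if rel is None:
        return None
    return os.path.join(dataset_root, rel)


serialized = json.dumps(jam)
path = resolve_content(json.loads(serialized), "/data/my_dataset")
```

Because the path is relative, the same file works wherever the dataset is unpacked; the cost is that the schema cannot promise the key is present, so callers have to handle the missing case.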
PS: my vote is to support 2 in our conversion scripts, but encourage 3 whenever possible. The two need not be incompatible. |
+1 for option 2. If it does become a de-facto standard further down the line we can consider incorporating it in the schema in a future release, but yeah, not sure I'd make it a rigid part of the schema right now. |
Also like 2 better. On Mon, Feb 9, 2015 at 11:37 AM, Justin Salamon [email protected]
|
Sounds like option 2 is the winner. @ejhumphrey : you agree? |
Apologies for delays, been in the woods for the last few days. Generally on board with 2. How do we feel about On Fri, Feb 13, 2015 at 3:22 PM, Brian McFee [email protected]
|
+1 |
i digz On Mon, Feb 16, 2015 at 8:05 PM, Brian McFee [email protected]
|
Done and done. Anything else remaining, or shall we close this issue? |
once more: can we close this one? |
uh, sure? On Fri, May 15, 2015 at 3:24 PM, Brian McFee [email protected]
|
looks like it yeah |
@ejhumphrey I've been snooping around the v2 schema, and wanted to discuss the series type.