[pyjams] discussion: data frame layer? #8
Discussed the above with @ejhumphrey in the context of the next version of the schema, and we're pretty close to having something working as of 9813955. A notebook demonstrating dataframe construction in jams objects is up here. This version of the schema reduces all observations to a single simple structure. The top-level structure is essentially the same as what's in the current master branch cb01e04, except for a few details. Some issues are yet to be resolved.
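For reference, a rough sketch of what one flattened observation record might look like; the field names below (time, duration, value, confidence) are my reading of the thread, not a quotation of the schema:

# One flattened observation record (field names are assumptions, not quoted from the schema).
observation = {
    'time': 0.0,         # onset in seconds
    'duration': 2.5,     # how long the observation holds
    'value': 'C:maj',    # task-specific payload (chord label, beat marker, f0, ...)
    'confidence': 1.0,   # annotator confidence, if available
}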
@bmcfee just a small comment/question about semantics - when you say "observation", what exactly are you referring to? In the published version of jams there are 4 "atomic" data types, of which observation is just one (observation, event, range, time_series).
This version has only a single atomic data type.
I think this covers all the use cases of the current schema, no?
Yes, it does. It's pretty much what I assumed was going on, but I thought it important to clarify that by "observation" you were not referring to observation as defined in the current schema, to make the distinction explicit.
New notebook demonstrating the functionality of 4e42939. We ended up extending the pandas DataFrame object into a new type called JamsFrame. There is minimal discrepancy between the two; the new class is designed to make two things easy:
This also allows us to do some cool things, like:

>>> ref_intervals, ref_labels = ref_jam.segment[0].data.to_interval_labels()
>>> est_intervals, est_labels = est_jam.segment[0].data.to_interval_labels()
>>> mir_eval.segment.pairwise(ref_intervals, ref_labels, est_intervals, est_labels)

I think this pretty much covers everything that we basically need for sparse types. What's left to do is implement a second DataFrame class for dense types (eg melody) that automatically handles packing and unpacking of array-valued observations, and computes time samples based on the specified interval. Additionally, we will need to build some kind of a registry that maps namespaces to data types. We can perhaps simplify this by assuming JamsFrame (ie sparse) by default, unless the namespace belongs to the "dense" set. Of course, this can be manipulated at runtime if necessary, or maybe even defined in the schema somehow? Otherwise, it can just be a static object that lives in pyjams, eg,

>>> pyjams.dense_spaces = set(['melody.hz', 'mood.xy', ...])

At construction time, the

Anyone see any pitfalls to this general setup?
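A minimal sketch of what that sparse-by-default lookup could look like; dense_spaces, is_dense, and the 'chord' namespace below are illustrative assumptions, not an actual pyjams API:

# Hypothetical registry of namespaces whose annotations default to the dense encoding.
dense_spaces = set(['melody.hz', 'mood.xy'])

def is_dense(namespace, registry=None):
    """Return True if annotations in this namespace should use the dense encoding."""
    registry = dense_spaces if registry is None else registry
    return namespace in registry

print(is_dense('melody.hz'))   # True  -> dense export
print(is_dense('chord'))       # False -> sparse (default) export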
About dense types, as noted in #5: there's the question of explicitly storing time samples vs computing them from a specified interval. According to the current schema, time samples are stored explicitly in the jams file. My personal preference would be to keep it this way - even though it is less elegant than simply storing a hop size, it ensures different users will be using exactly the same time values (down to the number of decimal places), which I think is essential when treating an annotation as a reference. Also, if times are implicit there's the risk that someone will compute them differently outside of the pyjams ecosystem. More complaints about implicit timestamps can be found in #5 :)
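A tiny illustration (my own, not from the thread) of the reproducibility worry: two equally reasonable ways of rebuilding a time grid from a hop size produce slightly different floats.

import numpy as np

hop = 512 / 22050.0
n = 1000
t_mult = np.arange(n) * hop               # grid built by multiplication
t_sum = np.cumsum(np.full(n, hop)) - hop  # grid built by accumulation
print(np.max(np.abs(t_mult - t_sum)))     # small but nonzero difference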
I'm sympathetic to both views, actually. Let's look at three options:
Shall we put it to a vote?
If we vote, I vote 3. Additional comments:
of course not :) But since we're defining the spec and implementing the interface (at least, those interfaces most likely to be used by everyone), we're in a much more privileged position to dictate conventions than, say, folks using lab files.
This actually seems like an argument in favor of doing it implicitly on load, rather than serializing each timestamp observation independently. (float<->string conversion can be dicey once rounding gets involved.) However, in that particular example, I think it's the evaluation's fault for being so sensitive to sub-microsecond deviations. A stronger argument for storing timestamps and durations explicitly is that it allows for both non-uniform and incomplete coverage of dense observations. I see this as a bigger win than the efficiency loss due to often redundant encoding. As I said above, it also simplifies the logic for import/export since we would have no reason to infer any times, and that's a good thing.
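A quick example (mine, not from the thread) of the rounding point: a computed frame time survives a full-precision repr() round trip, but not a fixed-precision format.

t = 43 * 512 / 22050.0
s = '%.6f' % t                # what a fixed-precision serializer might emit
print(float(repr(t)) == t)    # True: full-precision repr round-trips
print(float(s) == t)          # False: six decimal places is already lossy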
I also think 3 is the best option, et voilà, it's working already as of 3bb193e. Sparse and dense data get identical representation as JamsFrames, so we only need one class. The only difference is in how they serialize, as displayed in cells 47 and 48 of the notebook above. The really nice thing here is that DataFrame does all the heavy lifting, and we only need to change one parameter to differentiate between dense and sparse export: this is now controlled by a class attribute
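As a rough analogy for how one switch can flip the layout (plain pandas here, not the actual JamsFrame attribute):

import pandas as pd

df = pd.DataFrame({'time':       [0.0, 0.5, 1.0],
                   'duration':   [0.5, 0.5, 0.5],
                   'value':      ['N', 'C:maj', 'G:maj'],
                   'confidence': [1.0, 1.0, 1.0]})

print(df.to_dict(orient='records'))  # sparse-style: one record per observation
print(df.to_dict(orient='list'))     # dense-style: one array per field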
Nice! Schema: since there's only one type of annotation, we could rename ObservationAnnotation to just Annotation. Or, have 2 types: DenseAnnotation and SparseAnnotation?
I think what it should be is:
I think either encoding should be considered valid for any task, but we should prefer Sparse by default except for certain namespaces (eg melody). It's probably not worth our time to enforce namespaces using one encoding vs the other at the schema level.
I added a bit of syntactic sugar and constructor methods, and it's pretty simple to use at this point (in pyjams). A simplified example notebook demonstrating how to translate a .lab chord annotation file into JAMS is up now.
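For flavor, a hypothetical sketch of that kind of conversion; the file name, column names, and the assumption of "start end label" rows are all made up here, not taken from the notebook:

import pandas as pd

# Read a whitespace-delimited .lab file with one "start end label" row per chord.
lab = pd.read_csv('chords.lab', sep=r'\s+', header=None,
                  names=['start', 'end', 'label'])

# Repack into the flattened observation layout discussed above.
frame = pd.DataFrame({'time':       lab['start'],
                      'duration':   lab['end'] - lab['start'],
                      'value':      lab['label'],
                      'confidence': 1.0})
print(frame.head())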
Just wanted to say that I finally reviewed this question with its multiple associated notebooks, and I should say I fully support the usage of pandas in jams. I would also vote (3). Moreover, the JamsFrame makes a lot of sense and seems to make things even clearer. Since I already use pandas in msaf, I am really looking forward to integrating this new jams version with msaf, so that I can provide some feedback when using this version in the wild.
Over the weekend, I wrote an import script for the SMC beat tracking dataset. Here's an ipynb showing how this all works out, now that the dust has settled. I'm pretty happy with it.
This is great.
I've been trying to work with pyjams, and keep running into the same stumbling blocks that we've discussed offline. To summarize:
After digging around a bit, I think it makes a lot of sense to use pandas DataFrames as an in-memory representation for several observation types. Rather than go into detail of how exactly this might look, I mocked up a notebook illustrating the main concepts with a chord annotation example here.
The key features are:
The way I see this playing out is that the json objects would stay pretty much the same on disk (modulo any schema changes we cook up for convenience). However, when serializing/deserializing, certain arrays get translated into pandas dataframes, which is the primary backing store in memory.
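A minimal sketch of that round trip, assuming plain time/duration/value/confidence records (this is not the actual pyjams serialization code):

import json
import pandas as pd

records = [{'time': 0.0, 'duration': 1.5, 'value': 'C:maj', 'confidence': 1.0},
           {'time': 1.5, 'duration': 1.5, 'value': 'G:maj', 'confidence': 1.0}]

df = pd.DataFrame.from_records(records)  # load: JSON array -> in-memory DataFrame
back = df.to_dict(orient='records')      # dump: DataFrame  -> JSON-ready list
print(json.dumps(back, indent=2))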
For example, a chord annotation record my_jams.chord[0].data, rather than being a list of Range objects, can instead be a single DataFrame which encapsulates all measurements associated with the annotation. The jams object still retains hierarchy and multi-annotation collections, but each annotation becomes a single object, rather than a list of more atomic types.

If people are on board with this idea, I'd like to propose adopting a couple of conventions to make life easier:

- Time values should use the timedelta[ns] data type, rather than a float. Aside from making time values easy to find, pandas also provides some nice functionality for working with time series data.
- Missing values should be stored as nan rather than hallucinating data or leaving the record empty. (I'm looking at you, confidence values.)

I realize that this adds yet another dependency and source of complexity to something that's already pretty monstrous, but I think it'll be worth it in the long run. Of course, this is all intended as a long-winded question/request for comments, so feel free to shoot it all down.
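A hedged illustration of those two conventions in plain pandas (not pyjams code):

import numpy as np
import pandas as pd

times = pd.to_timedelta([0.0, 0.5, 1.0], unit='s')  # times stored as timedelta64[ns]
conf = [0.9, np.nan, 1.0]                            # an unknown confidence stays nan

df = pd.DataFrame({'time': times, 'confidence': conf})
print(df.dtypes)                       # time: timedelta64[ns], confidence: float64
print(df['time'].dt.total_seconds())   # back to float seconds when needed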