RC #1

Conversation
I cloned unittest_metacorpus and pleyel into ~, but all tests fail with Edit:
This looks amazing! The notebook runs smoothly, and the user-facing interface seems very clean to me.
I left a number of comments. I also have 4 general questions/comments here:
- I think tests/test_skeleton.py can be removed?
- I get a lot of pandas errors about using Series.append (deprecated; should be replaced with pd.concat). I guess they might be coming from ms3; the code it complains about is this (no line number given; a concat-based replacement is sketched after this list):
  `res = res.append(pd.Series([last_val], index=['end']))`
- In the notebook, why does localkey_slices not produce "intervals" but mode_slices does? Is that correct?
- In general, in-line comments would be extremely helpful in every process_data function (and elsewhere, but mainly process_data). process_data tends to be pretty dense and do a lot of non-obvious indexing.
A lot of these questions/comments are just things to note/questions to discuss, while some are easily changed, I think. Feel free to reply, ask me to change some of these, or change them yourself. Also, feel free to push some of the comments to Issues for future changes. But before merge, I think at least the in-line comments and docs should be complete.
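For reference, a minimal sketch of the concat-based replacement; the `res`/`last_val` values below are made up, only the pattern matters:

```python
import pandas as pd

res = pd.Series([0.0, 1.5], index=["a", "b"])  # stand-in for the ms3 series
last_val = 2.0

# Deprecated pattern that triggers the warning:
# res = res.append(pd.Series([last_val], index=["end"]))

# Equivalent with pd.concat:
res = pd.concat([res, pd.Series([last_val], index=["end"])])
```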
self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)
self.level_names["processed"] = pitch_class_format
self.group2pandas = "group2dataframe_unstacked"

@staticmethod
def compute(
How do we pass these params here, within a pipeline?
May be better to include these in the Analyzer constructor? This would make compute no longer static.
Some of these are also unused (I'm aware this PR didn't add them)
> How do we pass these params here, within a pipeline?

All pipeline steps are initialized and parametrized when their sequence is being defined, e.g. `Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)])`.
> May be better to include these in the Analyzer constructor?

How do you mean?
> This would make compute no longer static.

I think we wanted to have compute static so that it can be used independently of the pipeline step object. If compute takes parameters, the parametrization will be stored within the object.
Not sure I've understood you correctly?
> Some of these are also unused (I'm aware this PR didn't add them)

Some of what?
I think you got the main idea. What I'm saying is I don't see a clean way to set these parameters (`pitch_class_format`, `normalize`, etc.) from within a Pipeline. It would be much more natural to instead pass them through to the Analyzer constructor when creating the Pipeline.
We can still keep this computation static (though I don't recall that discussion), for example by defining

def compute(self, notes):
    return PitchClassVectors.get_pcvs(notes, pitch_class_format=self.pitch_class_format, ...)

and then having a new static `get_pcvs` perform the actual computation.
Oh, I think I see this is done via `self.config` actually?
Yes exactly, just added docstrings for this. I think the solution to always have `compute()` as a static function makes for a cleaner interface, because then you know you can always call it from outside without having to look up the relevant method for each analyzer.
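A minimal sketch of that design, as I understand it (`process_data` and the parameter defaults here are illustrative assumptions, not verbatim dimcat code):

```python
class PitchClassVectors:
    def __init__(self, pitch_class_format="tpc", normalize=False):
        # Parametrization happens at construction time, i.e. when the
        # step is placed into a Pipeline.
        self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)

    def process_data(self, notes):
        # Within the pipeline, the stored config parametrizes the call.
        return self.compute(notes, **self.config)

    @staticmethod
    def compute(notes, pitch_class_format="tpc", normalize=False):
        # Static, so it can also be called directly, without constructing
        # any pipeline-step object.
        raise NotImplementedError
```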
> Some of these are also unused (I'm aware this PR didn't add them)
>
> Some of what?

`index_levels` and `fill_na` are not referenced in this compute function.
    counts = (
        df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")
    )
except KeyError:
What is this catching?
When the bigram columns of `df` are not called `0` and `1`.
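A small illustration of both cases (the toy `df` contents are made up):

```python
import pandas as pd

# Bigram columns labeled with the integers 0 and 1: the groupby works.
df = pd.DataFrame({0: ["I", "V", "I"], 1: ["V", "I", "V"]})
counts = df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")

# With differently named columns, the same call raises the KeyError
# that the except clause above catches.
df_named = df.rename(columns={0: "first", 1: "second"})
try:
    df_named.groupby([0, 1]).size()
except KeyError:
    print("bigram columns are not called 0 and 1")
```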
@@ -75,6 +82,20 @@ def get_arg_parser():
    type=check_and_create,
    help="""Output directory.""",
)
input_args.add_argument(
Is the ordering of the -g and -s meaningful? E.g. `dimcat -g Grouper1 -s Slicer1 -g Grouper2` vs `dimcat -g Grouper1 -g Grouper2 -s Slicer1`?
And should it be? Would it matter?
That's a general problem that we need to discuss and solve together. Does it even make sense to allow for arbitrary pipelines using the command line?
My suggestion would be, instead, to offer a well-defined set of commands such as `dimcat bigrams` and to define the possible pipelines for it. The various combinations of groupers and slicers could then be distributed between
- general arguments available for all commands (such as the Corpus grouper) and
- specific arguments shared between those commands where they apply.
P.S.: Not even sure argparse is actually able to capture the difference in order in your example; see the quick check below.
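A quick check supports the P.S., assuming `-g`/`-s` would be collected with `action="append"`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-g", action="append")
parser.add_argument("-s", action="append")

a = parser.parse_args("-g Grouper1 -s Slicer1 -g Grouper2".split())
b = parser.parse_args("-g Grouper1 -g Grouper2 -s Slicer1".split())

# Both orderings yield the same namespace, so the relative order of
# -g and -s on the command line is lost:
assert a == b  # Namespace(g=['Grouper1', 'Grouper2'], s=['Slicer1'])
```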
@@ -1,8 +1,13 @@
"""Class hierarchy for data types."""
I'll be honest, I didn't look at data.py very closely, but I trust you.
Done.
Right, this is fixed in the upcoming version of ms3.
I couldn't find the spot in the notebook where you see this. Had a quick look into
Yes, I agree. Will add those.
Couldn't find it now, not sure what I was talking about.
src/dimcat/slicer.py
Outdated
    continue
try:
    name = "_".join(index)
    segmented = segment_by_adjacency_groups(
This line throws a KeyError (via ms3) when some data has NaNs in its `localkey` column. For example, currently: https://github.com/DCMLab/frescobaldi_fiori_musicali/blob/5c036b3ffb510ced77050437df370d106dd98419/harmonies/12.16_Toccata_cromaticha_per_l%E2%80%99elevatione_phrygian.tsv
Tried to address all comments. Let me know if there are more docstrings or in-line comments needed.
for index in index_group:
    new_group = self.criterion(index, data)
    if new_group is None:
        continue
If an element cannot be grouped, instead of dropping it like here right now, we could also leave it in but with the old index. Or the behaviour could depend on a parameter similar to pandas' `.groupby(XY, dropna=False)`.
For example, let's say we're grouping by years but a couple of pieces don't have the year information in their metadata. Currently, those would "get lost", but we could give the choice to leave them in. The group name tuples would then have to be `(previous_group, year)` and `(previous_group, NaN)`. What do you think?
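For comparison, the pandas behaviour this alludes to (toy metadata with a hypothetical missing year):

```python
import pandas as pd

meta = pd.DataFrame({"fname": ["a", "b", "c"], "year": [1799, None, 1799]})

print(meta.groupby("year").size())                # piece "b" silently dropped
print(meta.groupby("year", dropna=False).size())  # NaN kept as its own group
```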
self.level_names = dict(grouper="corpus")

def criterion(self, index: tuple, data: Data) -> str:
    return index[0]
Yes, an index always starts with `(corpus, fname)`.
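Schematically (hypothetical index values):

```python
index = ("corelli", "op01n01a")  # (corpus, fname), possibly extended by slicers
assert index[0] == "corelli"     # what CorpusGrouper.criterion returns
```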
class YearGrouper(Grouper):
    """Groups indices based on the composition years indicated in the metadata."""

    def __init__(self, sort=True):
Yes, should we be keeping track of planned features somewhere? In the issues?
src/dimcat/grouper.py
Outdated
def criterion(self, index: tuple, data: Data) -> str:
    if self.slicer is None:
        print("Need LocalKeyGrouper")
No, you're right, because `ModeGrouper.process_data()` checks if the slicer has been applied. Removed it.
All looks good to me for now. There is one remaining comment about the 2 unused params in `PitchClassVectors.compute`, but after that (or if you have a reason to leave them there), feel free to merge.
I added all remaining comments to issues. I may go through more closely in future weeks for some optimizations or other small changes (I saw a couple of places where a list comprehension could be used instead of a for loop, I think). But this can be in a future, small PR.
Hi @apmcleod,
The interface is mostly as we discussed and is working fine for current use cases. These can be seen in `test_analyzer.py` and in the `data_reports/generate.ipynb` notebook that demonstrates unigram and bigram creation.
TODO: changelog to be updated