
RC #1

Merged
merged 35 commits into from
Jun 20, 2022
Conversation

johentsch
Member

Hi @apmcleod,

The interface is mostly as we discussed and working fine for current use cases. These can be seen in test_analyzer.py and in the data_reports/generate.ipynb notebook that demonstrates unigram and bigram creation

  • for the entire dataset
  • for corpus groups
  • piece-wise

TODO: changelog to be updated

@johentsch johentsch requested a review from apmcleod May 12, 2022 20:16
@apmcleod

apmcleod commented May 23, 2022

I cloned unittest_metacorpus and pleyel into ~, but all tests fail with ValueError: No data has been loaded. when I run pytest from the command line. Any ideas?

Edit:
Ah, I had to point pytest/conftest.py line 51 to the directory I have all_subcorpora as well. Now they are running.

@apmcleod apmcleod left a comment

This looks amazing! The notebook runs smoothly, and the user-facing interface seems very clean to me.

I left a number of comments. I also have 4 general questions/comments here:

  • I think tests/test_skeleton.py can be removed?
  • I get a lot of pandas errors about using Series.append (to be deprecated; should be concat). I guess they might be coming from ms3, the code it complains about is this (no line number given): res = res.append(pd.Series([last_val], index=['end']))
  • In the notebook, why does localkey_slices not produce "intervals" but mode_slices does? Is that correct?
  • In general, in-line comments would be extremely helpful in every process_data function (and elsewhere, but mainly process_data). process_data tends to be pretty dense and do a lot of non-obvious indexing.

Many of these are just things to note or questions to discuss, while some are easy changes, I think. Feel free to reply, ask me to change some of these, or change them yourself. Also, feel free to push some of the comments to Issues for future changes. But before merging, I think at least the in-line comments and docs should be complete.

self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)
self.level_names["processed"] = pitch_class_format
self.group2pandas = "group2dataframe_unstacked"

@staticmethod
def compute(


How do we pass these params here, within a pipeline?

May be better to include these in the Analyzer constructor? This would make compute no longer static.


Some of these are also unused (I'm aware this PR didn't add them)

Member Author

How do we pass these params here, within a pipeline?

All pipeline steps are initialized and parametrized when their sequence is being defined, e.g. Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)]).
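To make that construction concrete, here is a minimal sketch with stand-in classes; only `Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)])` is taken from the comment above, the class bodies are illustrative and not the actual dimcat implementation:

```python
class PipelineStep:
    """Stand-in base class: every step transforms data and hands it on."""

    def process_data(self, data):
        return data


class CorpusGrouper(PipelineStep):
    pass


class ChordSymbolUnigrams(PipelineStep):
    def __init__(self, once_per_group=False):
        # The parametrization is fixed when the sequence is defined,
        # so no extra parameters need to travel through the pipeline.
        self.once_per_group = once_per_group


class Pipeline(PipelineStep):
    def __init__(self, steps):
        self.steps = steps

    def process_data(self, data):
        for step in self.steps:
            data = step.process_data(data)
        return data


pipeline = Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)])
```

Each step carries its own configuration, so the pipeline itself never needs to forward parameters.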

May be better to include these in the Analyzer constructor?

How do you mean?

This would make compute no longer static.

I think we wanted to have compute static so that it can be used independently of the pipeline step object. If compute takes parameters, the parametrization will be stored within the object.

Not sure I've understood you correctly?

Member Author

Some of these are also unused (I'm aware this PR didn't add them)

Some of what?


I think you got the main idea. What I'm saying is I don't see a clean way to set these parameters (pitch_class_format, normalize, etc.) from within a Pipeline. It would be much more natural to instead pass them through to the Analyzer constructor when creating the Pipeline.

We can still keep this computation static (though I don't recall that discussion), for example by defining:

def compute(self, notes):
    return PitchClassVectors.get_pcvs(notes, pitch_class_format=self.pitch_class_format, ...)

And then having a new static get_pcvs perform the actual computation.


Oh, I think I see this is done via self.config actually?

Member Author

Yes exactly, I just added docstrings for this. I think always having compute() as a static function makes for a cleaner interface, because then you know you can always call it from outside without having to look up the relevant method for each analyzer.
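As a sketch of that design, with the parametrization stored in self.config at construction time and forwarded to the static compute() (the counting logic below is a placeholder, not the real pitch-class-vector computation):

```python
class PitchClassVectors:
    """Illustrative sketch of the config pattern, not the dimcat code."""

    def __init__(self, pitch_class_format="tpc", normalize=False):
        # The parametrization lives on the object ...
        self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)

    @staticmethod
    def compute(notes, pitch_class_format="tpc", normalize=False):
        # ... while compute() stays static and callable without a pipeline step.
        counts = {}
        for pc in notes:
            counts[pc] = counts.get(pc, 0) + 1
        if normalize:
            total = sum(counts.values())
            counts = {pc: n / total for pc, n in counts.items()}
        return counts

    def process_data(self, notes):
        # The stored config is unpacked into the static method's parameters.
        return self.compute(notes, **self.config)


print(PitchClassVectors(normalize=True).process_data([0, 0, 7, 4]))
```

Callers who want a one-off computation can use `PitchClassVectors.compute(notes, ...)` directly, without instantiating anything.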


Some of these are also unused (I'm aware this PR didn't add them)

Some of what?

index_levels and fill_na are not referenced in this compute function.

counts = (
df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")
)
except KeyError:


What is this catching?

Member Author

When the bigram columns of df are not called 0 and 1.
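A small pandas illustration of both cases (the data is made up):

```python
import pandas as pd

# Bigram pairs stored in columns named 0 and 1, as the code above expects:
df = pd.DataFrame({0: ["I", "V", "I", "V"], 1: ["V", "I", "V", "vi"]})
counts = df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")

# With differently named columns, the same groupby raises the KeyError
# that the except branch catches:
renamed = df.rename(columns={0: "antecedent", 1: "consequent"})
try:
    renamed.groupby([0, 1]).size()
except KeyError:
    pass  # this is the case the except branch handles
```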

@@ -75,6 +82,20 @@ def get_arg_parser():
type=check_and_create,
help="""Output directory.""",
)
input_args.add_argument(


Is the ordering of the -g and -s meaningful? E.g. dimcat -g Grouper1 -s Slicer1 -g Grouper2 vs dimcat -g Grouper1 -g Grouper2 -s Slicer1?

And should it be? Would it matter?

Member Author

That's a general problem that we need to discuss and solve together. Does it even make sense to allow for arbitrary pipelines using the command line?

My suggestion would be, instead, to offer a well-defined set of commands such as dimcat bigrams and define the possible pipelines for it. The various combinations of groupers and slicers could then be distributed between

  • general arguments available for all commands (such as Corpus grouper) and
  • specific arguments shared between those commands where they apply.

P.S.: Not even sure argparse is actually able to capture the difference in order in your example.
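A sketch of the well-defined-commands idea using argparse subparsers (the command and option names here are hypothetical, not dimcat's actual CLI):

```python
import argparse

parser = argparse.ArgumentParser(prog="dimcat")
subparsers = parser.add_subparsers(dest="command", required=True)

# One subcommand per well-defined pipeline, e.g. `dimcat bigrams`:
bigrams = subparsers.add_parser("bigrams", help="Create bigram tables.")
bigrams.add_argument("-g", "--grouper", action="append", default=[],
                     help="Groupers to apply (repeatable).")
bigrams.add_argument("-s", "--slicer", action="append", default=[],
                     help="Slicers to apply (repeatable).")

args = parser.parse_args(["bigrams", "-g", "CorpusGrouper", "-s", "LocalKeySlicer"])
```

Regarding the P.S.: with `action="append"`, argparse keeps the order of repeated `-g` flags among themselves, but the interleaving of `-g` and `-s` is indeed lost, since each flag appends to its own destination list.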

@@ -1,8 +1,13 @@
"""Class hierarchy for data types."""


I'll be honest, I didn't look at data.py very closely, but I trust you.

@johentsch
Member Author

I think tests/test_skeleton.py can be removed?

Done.

I get a lot of pandas errors about using Series.append (to be deprecated; should be concat). I guess they might be coming from ms3, the code it complains about is this (no line number given): res = res.append(pd.Series([last_val], index=['end']))

Right, this is fixed in the upcoming version of ms3.

In the notebook, why does localkey_slices not produce "intervals" but mode_slices does? Is that correct?

I couldn't find the spot in the notebook where you see this. Had a quick look into localkey_slices.sliced and mode_slices.sliced and both seemed to have the required intervals?

In general, in-line comments would be extremely helpful in every process_data function (and elsewhere, but mainly process_data). process_data tends to be pretty dense and do a lot of non-obvious indexing.

Yes, I agree. Will add those.

@apmcleod

I couldn't find the spot in the notebook where you see this. Had a quick look into localkey_slices.sliced and mode_slices.sliced and both seemed to have the required intervals?

Couldn't find it now, not sure what I was talking about.


@johentsch johentsch requested a review from apmcleod June 17, 2022 15:20
Member Author

@johentsch johentsch left a comment

Tried to address all comments. Let me know if there are more docstrings or in-line comments needed.


for index in index_group:
new_group = self.criterion(index, data)
if new_group is None:
continue
Member Author

If an element cannot be grouped, instead of dropping it as we do here, we could also leave it in under its old index. Or the behaviour could depend on a parameter, similar to pandas' .groupby(XY, dropna=False).

For example, let's say we're grouping by years but a couple of pieces don't have the year information in their metadata. Currently, those would "get lost" but we could give the choice to leave them in. The group name tuples would then have to be (previous_group, year), (previous_group, NaN). What do you think?
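The pandas analogue of that choice, for comparison: with dropna=False, rows lacking the grouping value are kept under a NaN group instead of getting lost (the metadata values below are made up):

```python
import pandas as pd

meta = pd.DataFrame({"fname": ["piece_a", "piece_b", "piece_c"],
                     "year": [1788, None, 1788]})

dropped = meta.groupby("year").size()             # piece_b silently disappears
kept = meta.groupby("year", dropna=False).size()  # piece_b kept under NaN
```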

self.level_names = dict(grouper="corpus")

def criterion(self, index: tuple, data: Data) -> str:
return index[0]
Member Author

Yes, an index always starts with (corpus, fname)

class YearGrouper(Grouper):
"""Groups indices based on the composition years indicated in the metadata."""

def __init__(self, sort=True):
Member Author

Yes, should we be keeping track of planned features somewhere? In the issues?


def criterion(self, index: tuple, data: Data) -> str:
if self.slicer is None:
print("Need LocalKeyGrouper")
Member Author

No you're right, because ModeGrouper.process_data() checks if the slicer has been applied. Removed it.

@apmcleod apmcleod left a comment

All looks good to me for now. There is one remaining comment about the 2 unused params in PitchClassVectors.compute, but after that (or if you have a reason to leave them there), feel free to merge.

I added all remaining comments to issues. I may go through more closely in the coming weeks for some optimizations or other small changes (I saw a couple of places where a list comprehension could be used instead of a for loop, I think). But that can be a future, small PR.

@johentsch johentsch merged commit b488f87 into main Jun 20, 2022
johentsch pushed a commit that referenced this pull request Nov 7, 2023