
RC #1

Merged
merged 35 commits into from
Jun 20, 2022
Conversation

johentsch
Member

Hi @apmcleod,

The interface is mostly as we discussed and working fine for current use cases. These can be seen in test_analyzer.py and in the data_reports/generate.ipynb notebook that demonstrates unigram and bigram creation

  • for the entire dataset
  • for corpus groups
  • piece-wise

TODO: changelog to be updated

@johentsch johentsch requested a review from apmcleod May 12, 2022 20:16
@apmcleod

apmcleod commented May 23, 2022

I cloned unittest_metacorpus and pleyel into ~, but all tests fail with ValueError: No data has been loaded. when I run pytest from the command line. Any ideas?

Edit:
Ah, I had to point pytest/conftest.py line 51 to the directory I have all_subcorpora as well. Now they are running.

@apmcleod apmcleod left a comment

This looks amazing! The notebook runs smoothly, and the user-facing interface seems very clean to me.

I left a number of comments. I also have 4 general questions/comments here:

  • I think tests/test_skeleton.py can be removed?
  • I get a lot of pandas errors about using Series.append (to be deprecated; should be concat). I guess they might be coming from ms3, the code it complains about is this (no line number given): res = res.append(pd.Series([last_val], index=['end']))
  • In the notebook, why does localkey_slices not produce "intervals" but mode_slices does? Is that correct?
  • In general, in-line comments would be extremely helpful in every process_data function (and elsewhere, but mainly process_data). process_data tends to be pretty dense and do a lot of non-obvious indexing.

Many of these are just things to note or questions to discuss, while some are easy changes, I think. Feel free to reply, ask me to change some of these, or change them yourself. Also, feel free to push some of the comments to Issues for future changes. But before merging, I think at least the in-line comments and docs should be complete.

self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)
self.level_names["processed"] = pitch_class_format
self.group2pandas = "group2dataframe_unstacked"

@staticmethod
def compute(


How do we pass these params here, within a pipeline?

May be better to include these in the Analyzer constructor? This would make compute no longer static.


Some of these are also unused (I'm aware this PR didn't add them)

Member Author

How do we pass these params here, within a pipeline?

All pipeline steps are initialized and parametrized when their sequence is being defined, e.g. Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)]).
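To make that construction concrete, here is a minimal sketch with stand-in classes; only `Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)])` is taken from the comment above, the class bodies are illustrative and not the actual dimcat implementation:

```python
class PipelineStep:
    """Stand-in base class: every step transforms data and hands it on."""

    def process_data(self, data):
        return data


class CorpusGrouper(PipelineStep):
    pass


class ChordSymbolUnigrams(PipelineStep):
    def __init__(self, once_per_group=False):
        # The parametrization is fixed when the sequence is defined,
        # so no extra parameters need to travel through the pipeline.
        self.once_per_group = once_per_group


class Pipeline(PipelineStep):
    def __init__(self, steps):
        self.steps = steps

    def process_data(self, data):
        for step in self.steps:
            data = step.process_data(data)
        return data


pipeline = Pipeline([CorpusGrouper(), ChordSymbolUnigrams(once_per_group=True)])
```

Each step carries its own configuration, so the pipeline itself never needs to forward parameters.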

May be better to include these in the Analyzer constructor?

How do you mean?

This would make compute no longer static.

I think we wanted to have compute static so that it can be used independently of the pipeline step object. If compute takes parameters, the parametrization will be stored within the object.

Not sure I've understood you correctly?

Member Author

Some of these are also unused (I'm aware this PR didn't add them)

Some of what?


I think you got the main idea. What I'm saying is I don't see a clean way to set these parameters (pitch_class_format, normalize, etc.) from within a Pipeline. It would be much more natural to instead pass them through to the Analyzer constructor when creating the Pipeline.

We can still keep this computation static (though I don't recall that discussion), for example by defining:

def compute(self, notes):
    return PitchClassVectors.get_pcvs(notes, pitch_class_format=self.pitch_class_format, ...)

And then having a new static get_pcvs perform the actual computation.


Oh, I think I see this is done via self.config actually?

Member Author

Yes exactly, I just added docstrings for this. I think always having compute() as a static function makes for a cleaner interface, because then you know you can always call it from outside without having to look up the relevant method for each analyzer.
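As a sketch of that design, with the parametrization stored in self.config at construction time and forwarded to the static compute() (the counting logic below is a placeholder, not the real pitch-class-vector computation):

```python
class PitchClassVectors:
    """Illustrative sketch of the config pattern, not the dimcat code."""

    def __init__(self, pitch_class_format="tpc", normalize=False):
        # The parametrization lives on the object ...
        self.config = dict(pitch_class_format=pitch_class_format, normalize=normalize)

    @staticmethod
    def compute(notes, pitch_class_format="tpc", normalize=False):
        # ... while compute() stays static and callable without a pipeline step.
        counts = {}
        for pc in notes:
            counts[pc] = counts.get(pc, 0) + 1
        if normalize:
            total = sum(counts.values())
            counts = {pc: n / total for pc, n in counts.items()}
        return counts

    def process_data(self, notes):
        # The stored config is unpacked into the static method's parameters.
        return self.compute(notes, **self.config)


print(PitchClassVectors(normalize=True).process_data([0, 0, 7, 4]))
```

Callers who want a one-off computation can use `PitchClassVectors.compute(notes, ...)` directly, without instantiating anything.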


Some of these are also unused (I'm aware this PR didn't add them)

Some of what?

index_levels and fill_na are not referenced in this compute function.

counts = (
df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")
)
except KeyError:


What is this catching?

Member Author

When the bigram columns of df are not called 0 and 1.
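A small pandas illustration of both cases (the data is made up):

```python
import pandas as pd

# Bigram pairs stored in columns named 0 and 1, as the code above expects:
df = pd.DataFrame({0: ["I", "V", "I", "V"], 1: ["V", "I", "V", "vi"]})
counts = df.groupby([0, 1]).size().sort_values(ascending=False).rename("count")

# With differently named columns, the same groupby raises the KeyError
# that the except branch catches:
renamed = df.rename(columns={0: "antecedent", 1: "consequent"})
try:
    renamed.groupby([0, 1]).size()
except KeyError:
    pass  # this is the case the except branch handles
```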

@@ -75,6 +82,20 @@ def get_arg_parser():
type=check_and_create,
help="""Output directory.""",
)
input_args.add_argument(


Is the ordering of the -g and -s meaningful? E.g. dimcat -g Grouper1 -s Slicer1 -g Grouper2 vs dimcat -g Grouper1 -g Grouper2 -s Slicer1?

And should it be? Would it matter?

Member Author

That's a general problem that we need to discuss and solve together. Does it even make sense to allow for arbitrary pipelines using the command line?

My suggestion would be, instead, to offer a well-defined set of commands such as dimcat bigrams and define the possible pipelines for it. The various combinations of groupers and slicers could then be distributed between

  • general arguments available for all commands (such as Corpus grouper) and
  • specific arguments shared between those commands where they apply.

P.S.: Not even sure argparse is actually able to capture the difference in order in your example.
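A sketch of the well-defined-commands idea using argparse subparsers (the command and option names here are hypothetical, not dimcat's actual CLI):

```python
import argparse

parser = argparse.ArgumentParser(prog="dimcat")
subparsers = parser.add_subparsers(dest="command", required=True)

# One subcommand per well-defined pipeline, e.g. `dimcat bigrams`:
bigrams = subparsers.add_parser("bigrams", help="Create bigram tables.")
bigrams.add_argument("-g", "--grouper", action="append", default=[],
                     help="Groupers to apply (repeatable).")
bigrams.add_argument("-s", "--slicer", action="append", default=[],
                     help="Slicers to apply (repeatable).")

args = parser.parse_args(["bigrams", "-g", "CorpusGrouper", "-s", "LocalKeySlicer"])
```

Regarding the P.S.: with `action="append"`, argparse keeps the order of repeated `-g` flags among themselves, but the interleaving of `-g` and `-s` is indeed lost, since each flag appends to its own destination list.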

@@ -1,8 +1,13 @@
"""Class hierarchy for data types."""


I'll be honest, I didn't look at data.py very closely, but I trust you.

@johentsch
Member Author

I think tests/test_skeleton.py can be removed?

Done.

I get a lot of pandas errors about using Series.append (to be deprecated; should be concat). I guess they might be coming from ms3, the code it complains about is this (no line number given): res = res.append(pd.Series([last_val], index=['end']))

Right, this is fixed in the upcoming version of ms3.

In the notebook, why does localkey_slices not produce "intervals" but mode_slices does? Is that correct?

I couldn't find the spot in the notebook where you see this. Had a quick look into localkey_slices.sliced and mode_slices.sliced and both seemed to have the required intervals?

In general, in-line comments would be extremely helpful in every process_data function (and elsewhere, but mainly process_data). process_data tends to be pretty dense and do a lot of non-obvious indexing.

Yes, I agree. Will add those.

@apmcleod

I couldn't find the spot in the notebook where you see this. Had a quick look into localkey_slices.sliced and mode_slices.sliced and both seemed to have the required intervals?

Couldn't find it now, not sure what I was talking about.


@johentsch johentsch requested a review from apmcleod June 17, 2022 15:20
Member Author

@johentsch johentsch left a comment

Tried to address all comments. Let me know if there are more docstrings or in-line comments needed.


for index in index_group:
new_group = self.criterion(index, data)
if new_group is None:
continue
Member Author

If an element cannot be grouped, instead of dropping it as we do here, we could also leave it in under its old index. Or the behaviour could depend on a parameter, similar to pandas' .groupby(XY, dropna=False).

For example, let's say we're grouping by years but a couple of pieces don't have the year information in their metadata. Currently, those would "get lost" but we could give the choice to leave them in. The group name tuples would then have to be (previous_group, year), (previous_group, NaN). What do you think?
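The pandas analogue of that choice, for comparison: with dropna=False, rows lacking the grouping value are kept under a NaN group instead of getting lost (the metadata values below are made up):

```python
import pandas as pd

meta = pd.DataFrame({"fname": ["piece_a", "piece_b", "piece_c"],
                     "year": [1788, None, 1788]})

dropped = meta.groupby("year").size()             # piece_b silently disappears
kept = meta.groupby("year", dropna=False).size()  # piece_b kept under NaN
```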

self.level_names = dict(grouper="corpus")

def criterion(self, index: tuple, data: Data) -> str:
return index[0]
Member Author

Yes, an index always starts with (corpus, fname)

class YearGrouper(Grouper):
"""Groups indices based on the composition years indicated in the metadata."""

def __init__(self, sort=True):
Member Author

Yes, should we be keeping track of planned features somewhere? In the issues?


def criterion(self, index: tuple, data: Data) -> str:
if self.slicer is None:
print("Need LocalKeyGrouper")
Member Author

No you're right, because ModeGrouper.process_data() checks if the slicer has been applied. Removed it.

@apmcleod apmcleod left a comment

All looks good to me for now. There is one remaining comment about the 2 unused params in PitchClassVectors.compute, but after that (or if you have a reason to leave them there), feel free to merge.

I added all remaining comments to issues. I may go through more closely in the coming weeks for some optimizations or other small changes (I saw a couple of places where a list comprehension could be used instead of a for loop, I think). But that can be a future, small PR.

@johentsch johentsch merged commit b488f87 into main Jun 20, 2022
johentsch pushed a commit that referenced this pull request Nov 7, 2023