Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor writing raw file #11953

Merged
merged 41 commits into from
Sep 6, 2023
Merged

Refactor writing raw file #11953

merged 41 commits into from
Sep 6, 2023

Conversation

dmalt
Copy link
Contributor

@dmalt dmalt commented Sep 4, 2023

Reference issue

Necessary for PR #11924

What does this implement/fix?

  • Purge recursion from _write_raw() and _write_raw_fid() interaction
  • Improve general code readability

Additional information

In #11924 I'm adding support for reading and writing from zip archives via using Path interface in the reading and writing functions. At the moment such change is not possible because zipfile.Paths don't support writing to several fids at once. It's a problem because currently MNE-Python writes splits recursively, such that it keeps the fid for the first file open until we're done writing all the remaining splits.

This PR attempts at fixing this recursion problem by using a simple while loop. As a bonus, no-recursion implementation is easier to understand and maintain (hopefully).

@dmalt dmalt force-pushed the refactor_write_raw branch from af1551c to 5dbd876 Compare September 4, 2023 18:15
@dmalt dmalt marked this pull request as draft September 4, 2023 18:28
@dmalt dmalt force-pushed the refactor_write_raw branch 2 times, most recently from b1acbb1 to c586bf6 Compare September 4, 2023 20:28
@dmalt dmalt force-pushed the refactor_write_raw branch from c586bf6 to f9d85a5 Compare September 4, 2023 21:14
@dmalt dmalt force-pushed the refactor_write_raw branch from 49fbaec to 1c448cd Compare September 4, 2023 21:16
@dmalt dmalt force-pushed the refactor_write_raw branch from f87fc1d to 146a34d Compare September 4, 2023 22:43
@dmalt dmalt force-pushed the refactor_write_raw branch from cdac5c7 to 0a5b437 Compare September 5, 2023 11:29
fix a newly introduced bug when we moved fpath before closing its fid
@dmalt dmalt marked this pull request as ready for review September 5, 2023 20:07
@larsoner
Copy link
Member

larsoner commented Sep 6, 2023

If you still think it's too much, please check the first commit in this PR: 5dbd876.

Hmm... I do find that somewhat more readable than the class + dataclasses version that is in the current diff. What's the advantage of the class+dataclass-based approach?

@dmalt
Copy link
Contributor Author

dmalt commented Sep 6, 2023

I did classes because I wanted to organise parameters to the _write_raw() call . Most of the arguments to _write_raw() are just passed through to _write_raw_fid() and since there's so many of them it's hard to see, which variables are modified and which are just passed. Same logic goes behind hiding start as a class state variable. _write_raw() doesn't really need to be exposed to these. Although, arguably now it's hard to see from a first glance, that start is a mutable state of _FidRawWriter, while other variables are read-only.

Also It feels like _start_writing_raw() and _write_raw_fid() conceptually are really part of the same function. E.g. there's opening fiff tag in the former and the matching closing tag in the latter.

At this point, I've looked at this piece of code for too long and I'm not sure about anything any longer:)
What do you think I should do?

@larsoner
Copy link
Member

larsoner commented Sep 6, 2023

Although, arguably now it's hard to see from a first glance, that start is a mutable state of _FidRawWriter, while other variables are read-only.

Can you make a comment saying this? I think it's an important point

Also It feels like _start_writing_raw() and _write_raw_fid() conceptually are really part of the same function

Agreed, the names are not descriptive. Like if the former writes metadata (info, annotation, etc) and the latter writes actual data we could rename them accordingly...

there's opening fiff tag in the former and the matching closing tag in the latter.

... And/or in theory we could split off some stuff in a _finish_writing_raw if it helps close the loop.

But these are half baked ideas I came up with without looking at the code in depth so they could be way off. If you want I can dig deeper in an hour or two and see if these make any sense and possibly push changes if they do (or if something else occurs to me when looking deeper).

@wmvanvliet
Copy link
Contributor

great improvement!

@dmalt
Copy link
Contributor Author

dmalt commented Sep 6, 2023

Can you make a comment saying this? I think it's an important point

Sure.

... And/or in theory we could split off some stuff in a _finish_writing_raw if it helps close the loop.

I'd rather move the outermost tags to _RawFidWriter.write() like this:

def write(...):
    begin_block(fid,  FIFF.FIFFB_MEAS)
    _start_writing_raw(...)
    is_new_split = _write_raw_fid(...)
    end_block(fid, FIFF.FIFFB_MEAS)
    return is_new_split

And rename _start_writing_raw to something like _write_raw_fid_metadata. So these begin_block and end_block outline the structure. I attempted to do it like this in the initial _FidRawWriter version.

The same way _write_raw_fid would begin with

    if info.get("maxshield", False):
        start_block(fid, FIFF.FIFFB_IAS_RAW_DATA)
    else:
        start_block(fid, FIFF.FIFFB_RAW_DATA)

(Which is now part of _start_writing_raw() for some reason)
And end with

    if info.get("maxshield", False):
        end_block(fid, FIFF.FIFFB_IAS_RAW_DATA)
    else:
        end_block(fid, FIFF.FIFFB_RAW_DATA)

Also this part of _start_writing_raw() really belongs in _write_raw_fid():

    cals = []
    for k in range(info["nchan"]):
        #
        #   Scan numbers may have been messed up
        #
        info["chs"][k]["scanno"] = k + 1  # scanno starts at 1 in FIF format
        if reset_range is True:
            info["chs"][k]["range"] = 1.0
        cals.append(info["chs"][k]["cal"] * info["chs"][k]["range"])

All it does is calculating calibrations just to pass them to _write_raw_fid().

Maybe just merging _start_writing_raw() and _write_raw_fid() together would be the simplest solution to everything.

@cbrnr cbrnr changed the title Refactror writing raw file Refactor writing raw file Sep 6, 2023
@larsoner
Copy link
Member

larsoner commented Sep 6, 2023

I'd rather move the outermost tags to _RawFidWriter.write() like this: ... _write_raw_fid_metadata

There are exactly part of the solution I came up with! Pushing now...

Maybe just merging _start_writing_raw() and _write_raw_fid() together would be the simplest solution to everything.

Yes but I don't mind keeping them a bit separate since one really is about metadata and the other about data. That plus a smaller diff is nice I think.

@dmalt
Copy link
Contributor Author

dmalt commented Sep 6, 2023

There are exactly part of the solution I came up with! Pushing now...

Ok, cool!

@larsoner
Copy link
Member

larsoner commented Sep 6, 2023

@dmalt let me know if you're happy with this version and/or push any final tweaks and I'll mark for merge when green!

@dmalt
Copy link
Contributor Author

dmalt commented Sep 6, 2023

Yes but I don't mind keeping them a bit separate since one really is about metadata and the other about data. That plus a smaller diff is nice I think.

If you move out tags and calibrations, it becomes just a couple of lines of code.

@dmalt dmalt force-pushed the refactor_write_raw branch from d3b1123 to df1b6c0 Compare September 6, 2023 13:53
@dmalt
Copy link
Contributor Author

dmalt commented Sep 6, 2023

@dmalt let me know if you're happy with this version and/or push any final tweaks and I'll mark for merge when green!

Looks great!

Just made that calibrations move. Feel free to merge when green!

UPD.
I'm an idiot. Calibrations snippet mutates info before writing it, so it can't be moved. I reverted this change.
UPD2.
Added back a modified version. I hope this time it's correct.

@dmalt dmalt force-pushed the refactor_write_raw branch from df1b6c0 to fb2dd5f Compare September 6, 2023 14:15
mne/io/base.py Outdated Show resolved Hide resolved
Comment on lines +2546 to +2561
dir_path = fpath.parent
# We have to create one extra filename here to make the for loop below happy,
# but it will raise an error if it actually gets used
split_fnames = _make_split_fnames(
fpath.name, n_splits=MAX_N_SPLITS + 1, split_naming=split_naming
)
is_next_split, prev_fname = True, None
for part_idx in range(0, MAX_N_SPLITS):
if not is_next_split:
break
bids_special_behavior = part_idx == 0 and split_naming == "bids"
if bids_special_behavior:
reserved_fname = dir_path / split_fnames[0]
logger.info(f"Reserving possible split file {reserved_fname.name}")
_check_fname(reserved_fname, overwrite)
reserved_ctx = _ReservedFilename(reserved_fname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmalt in addition to changing from while to for, I also DRY'ed the code a bit. The previous version had a first if clause that had a lot of the same code in it that the while loop had. By embedding the logic for bids_special_naming inside the loop, we can remove that redundancy. Now I think it's clearer when something special has to happen. WDYT?

Copy link
Contributor Author

@dmalt dmalt Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm not the biggest fan of the DRY principle. I feel like it definitely must be applied when there are >= 3 repetitions, but when something is repeated 2 times, it's sometimes better to resist the urge and just repeat yourself. For me here DRY doesn't improve readability because the control flow gets more complex. Here I guess it's a matter of taste, so I'll be happy with whichever option you choose. Just saying it for the sake of discussion.

Copy link
Member

@larsoner larsoner Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DRY doesn't improve readability because the control flow gets more complex.

It is true that there are now two if statements instead of one. But in the old code the conditional variable values that end up needing to be accounted for (in the mental model and executed in code) were more plentiful: part_idx, prev_fname, and is_next_split could all be modified inside that first conditional, so all of their values were essentially conditional on (what is now known as) bids_special_behavior, even though setting those values was accomplished with a single if statement.

In the new version of the code that uses a unified loop, this isn't the case: part_idx always starts at zero and increases in one place (range); prev_fname starts as None and gets set to the current_fname or whatever at the end of each iteration of the loop; is_next_split only gets overwritten one place (inside the loop) instead of two. The only special/conditional thing that happens is that for BIDS there might be a renaming, which leads to two "if" statements, but to me the variable behavior and otherwise equivalence of the behaviors under the two naming schemes is more readily apparent. But I guess this could be based on a different emphasis / mental model for how similar these two naming schemes are similar rather than how they differ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of prev_fname more as of explanatory variable. It could be plugged directly into _write_raw_data(), e.g. as keyword arguments to be more explicit. I see the point with regards to part_idx and is_next_split but I feel like with duplication you just read the code from top to bottom and in dried version there's more mental context to keep. Especially with the with start..., reserved_ctx: line for which you need to go back and to reassure yourself, that it's triggered only for the 0-th iteration and you're not actually reserving the filename every time. For me the file reserving behaviour is the hardest part to understand and an extra conditional on it doesn't help.

But I guess this could be based on a different emphasis / mental model for how similar these two naming schemes are similar rather than how they differ...

I agree. And readability has a lot of subjectiveness to it in general. I've just started exploring this 'DRY being not always good' idea and I was curious where you stand on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'DRY being not always good' idea and I was curious where you stand on it.

Agreed it's not always good, just one of the many things to weigh when deciding how to implement something!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, both cyclomatic and cognitive complexity give a little lower scores for the duplicated version: 7 vs 8 and 9 vs 11 correspondingly.

@larsoner
Copy link
Member

larsoner commented Sep 6, 2023

Pushed a tiny commit to fix the Windows error (I was creating an absurdly large test file!) and marking for merge when green, thanks @dmalt !

@larsoner larsoner enabled auto-merge (squash) September 6, 2023 18:45
@larsoner larsoner merged commit 3dd8899 into mne-tools:main Sep 6, 2023
22 checks passed
snwnde pushed a commit to snwnde/mne-python that referenced this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants