Recompute method in computed tables #917

Open · lfrank opened this issue Apr 5, 2024 · 5 comments
Labels: enhancement (New feature or request), infrastructure (Unix, MySQL, etc. settings/issues impacting users)

Comments

lfrank (Contributor) commented Apr 5, 2024

One of our main goals is to be able to share complete pipelines with results. Sharing of the various intermediate computed data outputs is one way to do this, but for data that can be computed relatively quickly, we could also enable remote users to recompute on the fly.

One possible solution would be to add a recompute method to each dj.Computed table that would regenerate the NWB file (using the same name) if it was not present locally. This would be quite a lot of work, and it would also require careful thought about what to do when upstream computed results are not available. However, if we could get this to work, we could share a much smaller subset of results when we publish papers, which would likely help.

samuelbray32 (Collaborator) commented:

Thoughts on a potential structure:

  • Recompute is essentially recalling the make function outside of a DataJoint transaction and avoiding any insert statements therein.
    • We could add a recompute=False argument to make functions and change their insert statements to only run when recompute is False.
  • Files would only need to be recomputed when accessed. This could be a new fallback in fetch_nwb that calls Table.make(key, recompute=True) when the analysis NWB can't be obtained another way.
    • The benefit is that this would recursively propagate recompute up to other missing intermediate tables needed by the original recompute call, since the upstream data should be accessed through fetch_nwb.

I'm sure there are some edge cases in things like spikesorting that I'm not thinking of, but this might handle a lot without too much change to the code.
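
A minimal sketch of that idea, assuming a hypothetical table and helpers (UpstreamTable, run_analysis, and the fetch_nwb fallback shown here are illustrative placeholders, not the actual Spyglass API):

import datajoint as dj


class ExampleComputed(dj.Computed):
    # Hypothetical computed table; a real table would be bound to a schema.
    definition = """
    -> UpstreamTable
    ---
    analysis_file_name: varchar(64)
    """

    def make(self, key, recompute=False):
        # Upstream data is accessed through fetch_nwb, so a missing upstream
        # file would trigger its own recompute recursively.
        upstream_nwb = (UpstreamTable & key).fetch_nwb()
        file_name = run_analysis(upstream_nwb, key)  # regenerates the same file name

        if not recompute:
            # Normal populate path: insert inside the populate transaction.
            self.insert1(dict(key, analysis_file_name=file_name))
        # With recompute=True, the file is rewritten under its original name
        # and no insert statements run.


def fetch_nwb_with_fallback(table, key):
    """Hypothetical fetch_nwb fallback: recompute the file if it is missing."""
    try:
        return (table & key).fetch_nwb()
    except FileNotFoundError:
        table().make(key, recompute=True)
        return (table & key).fetch_nwb()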

edeno added the enhancement and infrastructure labels on Apr 19, 2024
CBroz1 (Member) commented Aug 28, 2024

We previously added, then removed, logging of file size and creation time to the AnalysisNwbfileLog table. I put together the following script to look at per-table means and identify good recompute candidates.

Script
from pathlib import Path

import pandas as pd
from datajoint.utils import to_camel_case
from hurry.filesize import size  # REQUIRES: pip install hurry.filesize

from spyglass.common.common_nwbfile import AnalysisNwbfileLog

DATA_PATH = Path("data.pkl")
DATA_BCK = Path("data_bck.pkl")


class LA:
    def __init__(self, fetch=False):
        self.data = (
            AnalysisNwbfileLog().fetch(format="frame")
            if not DATA_BCK.exists() or fetch
            else pd.read_pickle(DATA_BCK)
        )
        self.data.to_pickle(DATA_BCK)
        self._grouped = None

    def load_from_backup(self):
        self.data = pd.read_pickle(DATA_BCK)
        self.reformat()
        self._grouped = None

    def drop_if_cols(self, data, cols):
        cols = [col for col in cols if col in data.columns]
        if not cols:  # no columns to drop
            return data
        data = data.drop(columns=cols)  # axis is implied by columns=
        return data

    def reformat(self):

        def to_tbl_name(full_name):
            if "." not in full_name:
                return full_name, full_name
            schema, table = full_name.replace("`", "").split(".")
            return schema, to_camel_case(table)

        self.data["full_table_name"] = self.data["table"]
        self.data["schema"] = self.data["full_table_name"].apply(
            lambda x: to_tbl_name(x)[0] if x is not None else None
        )
        self.data["table"] = self.data["full_table_name"].apply(
            lambda x: to_tbl_name(x)[1] if x is not None else None
        )

        self.data = self.drop_if_cols(
            self.data, ["analysis_file_name", "full_table_name"]
        )

        numeric_cols = ["time_delta", "file_size", "accessed"]
        self.data[numeric_cols] = self.data[numeric_cols].apply(pd.to_numeric)

        self.data.to_pickle(DATA_PATH)

    @property
    def grouped(self):
        if self._grouped is not None:
            return self._grouped
        self.reformat()
        grouped = self.drop_if_cols(
            self.data, ["dj_user", "timestamp"]
        ).groupby(["schema", "table"])

        mean_df = grouped.mean()
        mean_df = mean_df[mean_df["time_delta"].notnull()]
        sorted_df = mean_df.sort_values("file_size", ascending=False)

        def sec_to_min(sec):
            minutes = round(sec / 60, 2)  # avoid shadowing built-in min
            return f"{minutes} min"

        def adj_accessed(accessed):
            """Adjust accessed count to be more readable, fix indexing."""
            return round(accessed + 1, 2)

        sorted_df["time_delta"] = sorted_df["time_delta"].apply(sec_to_min)
        sorted_df["file_size"] = sorted_df["file_size"].apply(size)
        sorted_df["accessed"] = sorted_df["accessed"].apply(adj_accessed)

        self._grouped = sorted_df
        return self._grouped


if __name__ == "__main__":
    la = LA(fetch=True)
    print(la.grouped)

The results are below. These are per-table means of time to create, file size, and number of times accessed (including file creation), ordered by file size.

                                                      time_delta file_size  accessed
schema                          table
spikesorting_v1_recording       SpikeSortingRecording   9.85 min        6G      1.00
spikesorting_v1_metric_curation MetricCuration           8.3 min      888M      3.28
lfp_v1                          LFPV1                  12.41 min      108M     21.60
lfp_band_v1                     LFPBandV1               0.19 min       92M     24.39
spikesorting_curation           Waveforms               0.52 min       28M      1.00
position_linearization_v1       LinearizedPositionV1    0.17 min       26M     20.21
position_v1_trodes_position     TrodesPosV1             0.11 min       23M     19.69
position_v1_dlc_pose_estimation DLCPoseEstimation       0.12 min       14M      7.58
position_v1_dlc_position        DLCSmoothInterp         0.88 min        8M      4.91
spikesorting_v1_sorting         SpikeSorting            5.18 min        7M      7.51
spikesorting_v1_curation        CurationV1              0.09 min        6M      5.03
position_v1_dlc_centroid        DLCCentroid              0.1 min        4M      2.34
position_v1_dlc_selection       DLCPosV1                0.13 min        3M      4.50
decoding_waveform_features      UnitWaveformFeatures    6.31 min        2M      6.43
spikesorting_curation           CuratedSpikeSorting     0.08 min      878K      6.60
position_v1_dlc_orient          DLCOrientation          0.07 min      876K      2.02
spikesorting_curation           QualityMetrics          0.67 min      468K      1.00

All conclusions assume we have a representative sample, which may not be the case. I'll also assume we want to target seldom re-accessed files for deletion and on-demand recompute.

The following tables produced files that were seldom re-accessed:
v1 SpikeSortingRecording, v0 Waveforms, and v0 QualityMetrics. Only two of the 700+ SpikeSortingRecording files were re-accessed, once and twice respectively.

  • If we're willing to tolerate a 10-minute recompute time, focusing on SpikeSortingRecording will let us clear out an average of 6 GB per file across 700+ cases (roughly 4 TB total).
  • If we want to keep recompute time under 1 minute, Waveforms is a better candidate, but would only save 200 MB in each case (72 GB total, 1.5% of the former case).

@edeno - Are both these operations deterministic?

samuelbray32 (Collaborator) commented:

I would also put 'spikesorting_recording.__spike_sorting_recording' (the v0 version) on the priority list. It didn't show up in the logging since it's not stored in an AnalysisNwbfile, but it should have roughly the same size and access rates as spikesorting_v1_recording.

CBroz1 (Member) commented Nov 19, 2024

Updates

I have a working version of a hasher in #1093. Ideally, we could regenerate some test files and start compiling a list of the files that match. Unfortunately, none of my randomly selected files from SpikeSortingV1 have matched so far, due to small differences in saved dependency versions or larger differences like mismatched data. My working theory is that changing the dependencies in my recompute environment will resolve these differences, but this requires more testing.

Small differences from my test files include...

  • Changes in git hash of spyglass version saved as source script
  • Changes in hdmf/pynwb version saved as part of object names
  • Changes to docstrings for nwb objects

By censoring these values before hashing, I cut down on mismatches (see remove_version in the code below), but the pynwb version also impacts other things, like whether or not the data type is saved, or object reference names (HERD vs. ExternalResources). There were also datasets that appeared to be off by 1e-13 microvolts across the dataset, and places where typos in Spyglass have since been corrected.

File Compare Tool

See files in /stelmo/nwb/analysis/ vs /stelmo/cbroz/temp_rcp/

"""
Usage:
> old = "/stelmo/nwb/analysis/example/example_RAND.nwb"
> new = "/stelmo/cbroz/temp_rcp/example/example_RAND.nwb"
> comp = NwbfileComparator(old,new)
> comp.name_mismatch # see names missing in one or the other
> comp.obj_mismatch # see differing objects
> comp.comp_obj('optional_obj_name') # see diffs in scalar data
"""

import atexit
import re
import warnings
from difflib import SequenceMatcher
from hashlib import md5
from pathlib import Path
from pprint import pprint
from typing import Any, Union

import datajoint as dj
import h5py
import numpy as np
from datajoint.logging import logger as dj_logger

warnings.filterwarnings("ignore", module="hdmf")
warnings.filterwarnings("ignore", module="pynwb")

schema = dj.schema("cbroz_temp")

dj_logger.setLevel("INFO")

DEFAULT_BATCH_SIZE = 4095


class NwbfileComparator:
    def __init__(
        self,
        old: Union[str, Path],
        new: Union[str, Path],
        batch_size: int = DEFAULT_BATCH_SIZE,
        verbose: bool = True,
    ):
        """Compares NWB files by pairwise hashing objects.

        Parameters
        ----------
        old : Union[str, Path]
            Path to the original NWB file.
        new : Union[str, Path]
            Path to the recomputed NWB file.
        batch_size : int, optional
            Limit of data to hash for large datasets, by default 4095.
        verbose : bool, optional
            Display progress bar, by default True.
        """
        if not Path(old).exists():
            raise FileNotFoundError(f"File not found: {old}")
        if not Path(new).exists():
            raise FileNotFoundError(f"File not found: {new}")

        self.old = h5py.File(old, "r")
        self.new = h5py.File(new, "r")
        atexit.register(self.cleanup)

        self.batch_size = batch_size
        self.verbose = verbose
        self.name_mismatch = []
        self.hash_mismatch = []
        self.all_old, self.all_new = [], []
        self.obj_mismatch = dict()

        self.status = "which"

        _ = self.compare_files()

        self._obj_mismatch_iter = iter(self.obj_mismatch.items())
        self.comps = zip(self.all_old, self.all_new)

        if not self.obj_mismatch:  # Only close if no mismatches
            self.cleanup()
        atexit.unregister(self.cleanup)

    def remove_version(self, content):
        version_pattern = (
            r"\d+\.\d+\.\d+"  # Major.Minor.Patch
            + r"(?:-alpha|-beta|a\d+)?"  # Optional alpha or beta, -alpha
            + r"(?:\.dev\d{2})?"  # Optional dev build, .dev01
            + r"(?:\+[a-z0-9]{9})?"  # Optional commit hash, +abcdefghi
            + r"(?:\.d\d{8})?"  # Optional date, dYYYYMMDD
        )
        no_ver = re.sub(version_pattern, "VERSION", content)
        docstring_pattern = r'"doc":"(.*?)"'
        ret = re.sub(docstring_pattern, '"doc":"DOCSTRING"', no_ver)
        return ret

    @property
    def mismatches_diff(self):
        ret = []
        for k, (old, new) in self.obj_mismatch.items():
            ret.append(f"Object: {k}")
            if getattr(old, "shape", None) == ():
                old_str = self.remove_version(str(old[()]))
                new_str = self.remove_version(str(new[()]))
                ret.append(self.diff_strings(old_str, new_str, context=15))
            elif isinstance(old, h5py.Dataset):
                ret.append(str(old[:5]))
                ret.append(str(new[:5]))
            ret.append(" ")
        return "\n".join([r for r in ret if r is not None])

    def cleanup(self):
        self.old.close()
        self.new.close()

    def compare_files(self):
        old_items = self.collect_names(self.old)
        new_items = self.collect_names(self.new)

        all_names = set(old_items.keys()) | set(new_items.keys())

        for name in all_names:
            if name not in old_items:
                self.name_mismatch.append({"name": name, "missing_from": "old"})
                continue
            if name not in new_items:
                self.name_mismatch.append({"name": name, "missing_from": "new"})
                continue
            self.status = "old"
            old_hash = self.compute_hash(name, old_items[name])
            self.status = "new"
            new_hash = self.compute_hash(name, new_items[name])

            if old_hash != new_hash:
                self.hash_mismatch.append({"name": name})
                self.obj_mismatch[name] = (old_items[name], new_items[name])

    def collect_names(self, file):
        """Collects all object names in the file."""

        def collect_items(name, obj):
            name = self.remove_version(name)
            if name in items_to_process:
                raise ValueError(f"Duplicate key: {name}")
            items_to_process.update({name: obj})

        items_to_process = dict()
        file.visititems(collect_items)
        return items_to_process

    def serialize_attr_value(self, value: Any):
        """Serializes an attribute value into bytes for hashing.

        Setting all numpy array types to string avoids false positives.

        Parameters
        ----------
        value : Any
            Attribute value.

        Returns
        -------
        bytes
            Serialized bytes of the attribute value.
        """
        if isinstance(value, np.ndarray):
            return value.astype(str).tobytes()  # must be 'astype(str)'
        elif isinstance(value, (str, int, float)):
            return self.remove_version(str(value)).encode()
        return self.remove_version(repr(value)).encode()

    def hash_dataset(self, dataset: h5py.Dataset):
        hashed = md5(self.hash_shape_dtype(dataset))

        if dataset.shape == ():
            hashed.update(self.serialize_attr_value(dataset[()]))
            return hashed.hexdigest().encode()

        # WARNING: only head of data
        size = min(dataset.shape[0], self.batch_size * 5)
        # size = dataset.shape[0]
        start = 0
        padding = len(str(size))

        while start < size:
            pad_start = f"{round(start,padding-2):0{padding}}"
            print(f"\rData: {dataset.name}: {pad_start}/{size}", end="")
            end = min(start + self.batch_size, size)
            hashed.update(self.serialize_attr_value(dataset[start:end]))
            start = end

        print()
        return hashed.hexdigest().encode()

    def hash_shape_dtype(self, obj: Union[h5py.Dataset, np.ndarray]) -> bytes:
        if not hasattr(obj, "shape") or not hasattr(obj, "dtype"):
            return "".encode()
        return str(obj.shape).encode() + str(obj.dtype).encode()

    def compute_hash(self, name, obj) -> str:
        hashed = md5(name.encode())

        for attr_key in sorted(obj.attrs):
            attr_value = obj.attrs[attr_key]
            hashed = self.uhash(hashed, self.hash_shape_dtype(attr_value))
            hashed = self.uhash(hashed, attr_key.encode())
            hashed = self.uhash(hashed, self.serialize_attr_value(attr_value))

        if isinstance(obj, h5py.Dataset):
            hashed = self.uhash(hashed, self.hash_dataset(obj))
        elif isinstance(obj, h5py.SoftLink):
            hashed = self.uhash(hashed, obj.path.encode())
        elif isinstance(obj, h5py.Group):
            for k, v in obj.items():
                hashed = self.uhash(hashed, self.remove_version(k).encode())
                hashed = self.uhash(hashed, self.serialize_attr_value(v))
        else:
            raise TypeError(
                f"Unknown object type: {type(obj)}\n"
                + "Please report this an issue on GitHub."
            )

        return hashed.hexdigest()

    def uhash(self, hash, value):
        hash.update(value)
        if self.status == "old":
            self.all_old.append(value)
        elif self.status == "new":
            self.all_new.append(value)
        return hash

    def comp_obj(self, obj=None):
        """Show string diffs for given object. 

        If obj=None, iterate over differing objects.
        """
        if obj is not None and obj in self.obj_mismatch:
            key = obj
            old, new = self.obj_mismatch[key]
        else:
            try:
                key, (old, new) = next(self._obj_mismatch_iter)
            except StopIteration:
                return None
        print(f"Object: {key}")
        if getattr(old, "shape", None) == ():
            old = self.remove_version(str(old[()]))
            new = self.remove_version(str(new[()]))
            pprint(self.diff_strings(old, new, context=15))
        return old, new

    def diff_strings(self, a: str, b: str, context=30) -> str:
        """Highlight differences between two strings with surrounding context."""
        a = str(a)
        b = str(b)

        matcher = SequenceMatcher(None, a, b)
        diffs = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                diffs.append(
                    f"...{a[max(0, i1-context):i2+context]}"
                    + f" -> {b[max(0, j1-context):j2+context]}..."
                )
        return "\n".join(diffs)

Questions

From a detail perspective, should two files have the same hash if they differ in these ways?

  1. Saved version
  2. NWB docstrings
  3. Spyglass typos - how do we maintain records of these cases?
  4. Data - what is the appropriate amount of rounding for electrical series? other data?

Each change we make to the input during hashing adds processing time when hashing the recompute product. This is especially true for rounding big datasets.

Bigger picture, what is our process for a mismatch on recompute? Hypothetically, a paper has been submitted using a downstream analysis, and a reviewer asks for summary stats that would require a recomputed file, whose new hash does not match.

  1. Redo: We delete everything downstream of now-unknown provenance, updating all figures with (probably) minor differences in data.
     • Pro: The updated version becomes the new 'official' version
     • Con: Lots of work to update downstream analyses and figures late in the submission process
  2. Asterisk: We accept that intermediate files cannot be kept in perpetuity and use the replicated version as 'close enough'.
     • Pro: Reduced recompute burden
     • Con: Potential concerns about replicability
  3. Resolve: We fully document the conda environment prior to deletion and (either always or only in the case of mismatches) spin up that environment to replicate the file (see the sketch after this list).
     • Pro: Hopefully, full replicability
     • Con:
       • Long wait times for recomputing files
       • Potentially not feasible for under-documented existing files, removing our ability to delete them
       • A lot of overhead infrastructure/dev time to implement env spin-up systematically
Certainly, we can default to one approach and take another on a case-by-case basis.
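
For the 'resolve' route, a minimal sketch of capturing the environment before file deletion (the helper and storage location are hypothetical, not an existing Spyglass feature):

import subprocess
from pathlib import Path


def export_conda_env(env_name: str, out_dir: str) -> Path:
    """Record `conda env export` output so the environment can be rebuilt
    later to replicate a deleted file on a hash mismatch."""
    out_path = Path(out_dir) / f"{env_name}_environment.yml"
    result = subprocess.run(
        ["conda", "env", "export", "-n", env_name, "--no-builds"],
        capture_output=True,
        text=True,
        check=True,
    )
    out_path.write_text(result.stdout)
    return out_path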

Next steps

Next, I'll make an effort to reverse-engineer the required environment from existing files to test our capacity to take the 'resolve' route for existing files.

CBroz1 (Member) commented Dec 4, 2024

Replication

Environment

I've had trouble replicating files without a complete record of the original
conda environment.

The files provide (a) a pynwb version, and (b) a spyglass git hash.
Because pynwb always pins hdmf and h5py, we can use this version to set
those dependencies. Setting spyglass is trickier, as the replication feature
is targeted for a future release. Backporting is possible, but a large tech
burden.

Unfortunately, most of the existing files use pynwb==2.6.0-alpha, for which
there is no record of dependency pins. I've made a best guess based on
one user's environment, but there may be variations, as the pins were changed
in the subsequent release. For replicating, I'll use the versions in the table
below with an up-to-date Spyglass.

File counts by version, and dependency pins
# Existing files by Spyglass and PyNWB versions
spy_ver      0.1  0.4.0  0.4.3  0.5.0  0.5.2  0.5.4  Total
pynwb_ver
2.5.0          0    204      0      0      0      0    204
2.6.0-alpha   42      0    658      1    220      0    921
2.7.0          0      0      0      0     96    171    267
Total         42    204    658      1    316    171   1392

# PyNWB version pins, with *best guess
pynwb version   hdmf       h5py
+------------+ +--------+ +--------+
2.5.0          3.9.0      3.8.0
2.6.0-alpha*   3.11.0     3.10.0
2.6.0          3.12.2     3.10.0
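
For reference, the same pins as a Python mapping (a hypothetical constant that an environment-building helper could consume; the 2.6.0-alpha entry remains a best guess):

# Dependency pins by pynwb version, taken from the table above.
PYNWB_DEP_PINS = {
    "2.5.0": {"hdmf": "3.9.0", "h5py": "3.8.0"},
    "2.6.0-alpha": {"hdmf": "3.11.0", "h5py": "3.10.0"},  # *best guess
    "2.6.0": {"hdmf": "3.12.2", "h5py": "3.10.0"},
}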

Censoring

In an effort to minimize mismatches, I've added the option to censor version
numbers and docstrings from both object names and scalar datasets prior to
hashing. This prevents arbitrary docstring changes from causing a mismatch, but
also treats these fields as strings, rather than nested dictionaries, to speed
up the process. There have been some cases where ExternalResources or Group
objects have had different structures across old and replicated files.

Mismatches

Missing Objects

A few old files are missing general/source_script objects. While the contents
are censored, a missing object still impacts the hash. We could

  1. Ignore: Decide that this field should not be hashed, adding to a list of
    exceptions that becomes hard to maintain.
  2. Retrofit: Add a default value to all existing files where it is missing. This
    would require a careful review of when items were updated, and of the dependency
    versions to use when replicating. A rough sketch of this kind of file surgery is below.
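
A minimal h5py sketch of the 'retrofit' option for a missing general/source_script (the placeholder value and the file_name attribute are assumptions to be confirmed against files that do have the object):

import h5py


def retrofit_source_script(nwb_path, placeholder="unknown"):
    """Hypothetical file surgery: add a placeholder source_script so old and
    replicated files hash the same set of objects."""
    with h5py.File(nwb_path, "r+") as f:
        general = f.require_group("general")
        if "source_script" not in general:
            ds = general.create_dataset("source_script", data=placeholder)
            ds.attrs["file_name"] = ""  # assumed required by the NWB schema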

Hash mismatches

~80% of generated files have a hash mismatch on at least one object.

  • specifications/core/VERSION/nwb.icephys
  • specifications/core/VERSION/nwb.ophys
  • general (missing source_script)
  • specifications/hdmf-experimental/VERSION/resources
  • acquisition/ProcessedElectricalSeries/data

Some of these are actual datasets, while others are groups or scalar datasets
that are initially read as strings. Scalar datasets can be further unpacked
into nested dicts/lists using eval, which presents a security risk and should
be handled carefully. While we can be confident in data we generate, we may
need to adjust the process for imported data.

def unpack_scalar(obj):
    # str() of the scalar gives a bytes repr (b'{"key": null, ...}').
    # JSON literals null/false are quoted so eval doesn't fail on them,
    # [1:] drops the leading 'b', the first eval recovers the inner string,
    # and the second eval parses it into a Python dict.
    return eval(
        eval(
            str(obj[()])
            .replace("null", '"null"')
            .replace("false", '"false"')[1:]
        )
    )
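
If the scalar content is JSON (as it appears to be for the spec strings), a safer alternative could avoid eval entirely. A minimal sketch, assuming the dataset stores UTF-8 JSON bytes:

import json


def unpack_scalar_json(obj):
    """Hypothetical eval-free variant: parse the scalar dataset as JSON."""
    raw = obj[()]
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)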

Icephys

Old nwb.icephys objects are missing value key/value pairs for
'bias_current' in group datasets. As above, we can either attempt to ignore
these or retrofit existing files. Using SC7920240910_B1PER4NSHE.nwb as an
example:

old['groups'][2]['datasets'][0].get('value') == None
new['groups'][2]['datasets'][0].get('value') == 0.0

Ophys

nwb.ophys objects have an additional group with links to an ImagingPlane
object. See j1620210710_QLQKV5ZTOQ.nwb as an example. Again, we can find
a way to ignore or retrofit these objects.

resources

Even controlling for hdmf version, resources objects have mismatching groups
in ~50% of files, especially for those generated with pynwb==2.5.0. Using
Lewis20240222_S3M69NGSAK.nwb as an example, the old file has groups for
entity_keys missing in the new, and the new file has a group for resources
missing in the old. The objects key also differs in structure, with the
old file having 5 datatypes versus 3 in the new.

ProcessedElectricalSeries

Many ProcessedElectricalSeries datasets have mismatching hashes, but will
pass a np.isclose test, with an average difference of less than 1e-16. With
the old data in hand, we could be more confident in the replication, but
adjusting the hash to account for small differences would require significant
increases to the hash time for large datasets.
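
A minimal sketch of quantizing before hashing (the decimals value is arbitrary here; as noted, this adds hash time for large datasets):

import numpy as np


def quantized_bytes(data: np.ndarray, decimals: int = 10) -> bytes:
    """Hypothetical helper: round float data before serializing for hashing,
    so differences below the chosen precision do not change the hash."""
    if np.issubdtype(data.dtype, np.floating):
        data = np.round(data, decimals=decimals)
    return data.astype(str).tobytes()  # mirrors serialize_attr_value above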

Questions

Should the hasher take the time to...

  • Ignore missing general/source_script objects?
  • Unpack scalar datasets into nested dictionaries? (Security risk)
  • Round ProcessedElectricalSeries datasets to a certain precision?

Should we adjust the existing files before hashing to account for the updated
specifications? Are we then tying ourselves to the current spec, and would we need to run a file-surgery overhaul whenever the spec is updated?

  • source_script
  • icephys value
  • ophys ImagingPlane

A growing list of exceptions also increases the risk of a false negative, if future file specs introduce a meaningful difference in some region we've decided to ignore.
