Recompute method in computed tables #917
Thoughts on a potential structure:
I'm sure there are some edge cases in things like spikesorting that I'm not thinking of, but this might handle a lot without too much change to the code.
We previously added, then removed, logging of file size and creation time to the `AnalysisNwbfileLog` table.

Script

```python
from pathlib import Path
import pandas as pd
from datajoint.utils import to_camel_case
from hurry.filesize import size # REQUIRES: pip install hurry.filesize
from spyglass.common.common_nwbfile import AnalysisNwbfileLog
DATA_PATH = Path("data.pkl")
DATA_BCK = Path("data_bck.pkl")
class LA:
def __init__(self, fetch=False):
self.data = (
AnalysisNwbfileLog().fetch(format="frame")
if not DATA_BCK.exists() or fetch
else pd.read_pickle(DATA_BCK)
)
self.data.to_pickle(DATA_BCK)
self._grouped = None
def load_from_backup(self):
self.data = pd.read_pickle(DATA_BCK)
self.reformat()
self._grouped = None
def drop_if_cols(self, data, cols):
cols = [col for col in cols if col in data.columns]
if not cols: # no columns to drop
return data
data = data.drop(columns=cols, axis=1)
return data
def reformat(self):
def to_tbl_name(full_name):
if "." not in full_name:
return full_name, full_name
schema, table = full_name.replace("`", "").split(".")
return schema, to_camel_case(table)
self.data["full_table_name"] = self.data["table"]
self.data["schema"] = self.data["full_table_name"].apply(
lambda x: to_tbl_name(x)[0] if x is not None else None
)
self.data["table"] = self.data["full_table_name"].apply(
lambda x: to_tbl_name(x)[1] if x is not None else None
)
self.data = self.drop_if_cols(
self.data, ["analysis_file_name", "full_table_name"]
)
numeric_cols = ["time_delta", "file_size", "accessed"]
self.data[numeric_cols] = self.data[numeric_cols].apply(pd.to_numeric)
self.data.to_pickle(DATA_PATH)
@property
def grouped(self):
if self._grouped is not None:
return self._grouped
self.reformat()
grouped = self.drop_if_cols(
self.data, ["dj_user", "timestamp"]
).groupby(["schema", "table"])
mean_df = grouped.mean()
mean_df = mean_df[mean_df["time_delta"].notnull()]
sorted_df = mean_df.sort_values("file_size", ascending=False)
def sec_to_min(sec):
min = round(sec / 60, 2)
return f"{min} min"
def adj_accessed(accessed):
"""Adjust accessed count to be more readable, fix indexing."""
return round(accessed + 1, 2)
sorted_df["time_delta"] = sorted_df["time_delta"].apply(sec_to_min)
sorted_df["file_size"] = sorted_df["file_size"].apply(size)
sorted_df["accessed"] = sorted_df["accessed"].apply(adj_accessed)
self._grouped = sorted_df
return self._grouped
if __name__ == "__main__":
la = LA(fetch=True)
    print(la.grouped)
```

With the following results: means of time to create, file size, and number of times accessed (including file creation), ordered by file size.
All conclusions will assume we have a representative sample, which may not be the case. I'll also assume we want to recompute files that are seldom re-accessed. The following tables produced files that were seldom re-accessed:
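For reference, a list like that can be pulled from the grouped frame above with a cutoff on the mean access count (the threshold of 2 here is an assumption, not the one used for the original list):

```python
la = LA()
# Mean 'accessed' includes file creation, so values near 1 mean "rarely re-accessed"
seldom = la.grouped[la.grouped["accessed"] < 2]
print(seldom.index.tolist())  # (schema, table) pairs
```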
@edeno - Are both these operations deterministic?
I would also put '...
Updates

I have a working version of a hasher in #1093. Ideally, we could regenerate some test files and start compiling a list of the files that match. Unfortunately, none of my randomly selected files from ... have matched so far. Small differences from my test files include...
By censoring these values before hashing, I cut down on mismatches (see ...).

File Compare Tool

See files in ...

```python
"""
Usage:
> old = "/stelmo/nwb/analysis/example/example_RAND.nwb"
> new = "/stelmo/cbroz/temp_rcp/example/example_RAND.nwb"
> comp = NwbfileComparator(old,new)
> comp.name_mismatch # see names missing in one or the other
> comp.obj_mismatch # see differing objects
> comp.comp_obj('optional_obj_name') # see diffs in scalar data
"""
import atexit
import re
import warnings
from difflib import SequenceMatcher
from hashlib import md5
from pathlib import Path
from pprint import pprint
from typing import Any, Union
import datajoint as dj
import h5py
import numpy as np
from datajoint.logging import logger as dj_logger
warnings.filterwarnings("ignore", module="hdmf")
warnings.filterwarnings("ignore", module="pynwb")
schema = dj.schema("cbroz_temp")
dj_logger.setLevel("INFO")
DEFAULT_BATCH_SIZE = 4095
class NwbfileComparator:
def __init__(
self,
old: Union[str, Path],
new: Union[str, Path],
batch_size: int = DEFAULT_BATCH_SIZE,
verbose: bool = True,
):
"""Compares NWB files by pairwise hashing objects.
Parameters
----------
        old : Union[str, Path]
            Path to the old NWB file.
        new : Union[str, Path]
            Path to the new NWB file.
batch_size : int, optional
Limit of data to hash for large datasets, by default 4095.
verbose : bool, optional
Display progress bar, by default True.
"""
if not Path(old).exists():
raise FileNotFoundError(f"File not found: {old}")
if not Path(new).exists():
raise FileNotFoundError(f"File not found: {new}")
self.old = h5py.File(old, "r")
self.new = h5py.File(new, "r")
atexit.register(self.cleanup)
self.batch_size = batch_size
self.verbose = verbose
self.name_mismatch = []
self.hash_mismatch = []
self.all_old, self.all_new = [], []
self.obj_mismatch = dict()
self.status = "which"
_ = self.compare_files()
self._obj_mismatch_iter = iter(self.obj_mismatch.items())
self.comps = zip(self.all_old, self.all_new)
if not self.obj_mismatch: # Only close if no mismatches
self.cleanup()
atexit.unregister(self.cleanup)
def remove_version(self, content):
version_pattern = (
r"\d+\.\d+\.\d+" # Major.Minor.Patch
+ r"(?:-alpha|-beta|a\d+)?" # Optional alpha or beta, -alpha
+ r"(?:\.dev\d{2})?" # Optional dev build, .dev01
+ r"(?:\+[a-z0-9]{9})?" # Optional commit hash, +abcdefghi
+ r"(?:\.d\d{8})?" # Optional date, dYYYYMMDD
)
no_ver = re.sub(version_pattern, "VERSION", content)
docstring_pattern = r'"doc":"(.*?)"'
ret = re.sub(docstring_pattern, '"doc":"DOCSTRING"', no_ver)
return ret
@property
def mismatches_diff(self):
ret = []
for k, (old, new) in self.obj_mismatch.items():
ret.append(f"Object: {k}")
if getattr(old, "shape", None) == ():
old_str = self.remove_version(str(old[()]))
new_str = self.remove_version(str(new[()]))
ret.append(self.diff_strings(old_str, new_str, context=15))
elif isinstance(old, h5py.Dataset):
ret.append(str(old[:5]))
ret.append(str(new[:5]))
ret.append(" ")
return "\n".join([r for r in ret if r is not None])
def cleanup(self):
self.old.close()
self.new.close()
def compare_files(self):
old_items = self.collect_names(self.old)
new_items = self.collect_names(self.new)
all_names = set(old_items.keys()) | set(new_items.keys())
for name in all_names:
if name not in old_items:
self.name_mismatch.append({"name": name, "missing_from": "old"})
continue
if name not in new_items:
self.name_mismatch.append({"name": name, "missing_from": "new"})
continue
self.status = "old"
old_hash = self.compute_hash(name, old_items[name])
self.status = "new"
new_hash = self.compute_hash(name, new_items[name])
if old_hash != new_hash:
self.hash_mismatch.append({"name": name})
self.obj_mismatch[name] = (old_items[name], new_items[name])
def collect_names(self, file):
"""Collects all object names in the file."""
def collect_items(name, obj):
name = self.remove_version(name)
if name in items_to_process:
raise ValueError(f"Duplicate key: {name}")
items_to_process.update({name: obj})
items_to_process = dict()
file.visititems(collect_items)
return items_to_process
def serialize_attr_value(self, value: Any):
"""Serializes an attribute value into bytes for hashing.
Setting all numpy array types to string avoids false positives.
Parameters
----------
value : Any
Attribute value.
Returns
-------
bytes
Serialized bytes of the attribute value.
"""
if isinstance(value, np.ndarray):
return value.astype(str).tobytes() # must be 'astype(str)'
elif isinstance(value, (str, int, float)):
return self.remove_version(str(value)).encode()
return self.remove_version(repr(value)).encode()
def hash_dataset(self, dataset: h5py.Dataset):
hashed = md5(self.hash_shape_dtype(dataset))
if dataset.shape == ():
hashed.update(self.serialize_attr_value(dataset[()]))
return hashed.hexdigest().encode()
# WARNING: only head of data
size = min(dataset.shape[0], self.batch_size * 5)
# size = dataset.shape[0]
start = 0
padding = len(str(size))
while start < size:
pad_start = f"{round(start,padding-2):0{padding}}"
print(f"\rData: {dataset.name}: {pad_start}/{size}", end="")
end = min(start + self.batch_size, size)
hashed.update(self.serialize_attr_value(dataset[start:end]))
start = end
print()
return hashed.hexdigest().encode()
    def hash_shape_dtype(self, obj: Union[h5py.Dataset, np.ndarray]) -> bytes:
if not hasattr(obj, "shape") or not hasattr(obj, "dtype"):
return "".encode()
return str(obj.shape).encode() + str(obj.dtype).encode()
def compute_hash(self, name, obj) -> str:
hashed = md5(name.encode())
for attr_key in sorted(obj.attrs):
attr_value = obj.attrs[attr_key]
hashed = self.uhash(hashed, self.hash_shape_dtype(attr_value))
hashed = self.uhash(hashed, attr_key.encode())
hashed = self.uhash(hashed, self.serialize_attr_value(attr_value))
if isinstance(obj, h5py.Dataset):
hashed = self.uhash(hashed, self.hash_dataset(obj))
elif isinstance(obj, h5py.SoftLink):
hashed = self.uhash(hashed, obj.path.encode())
elif isinstance(obj, h5py.Group):
for k, v in obj.items():
hashed = self.uhash(hashed, self.remove_version(k).encode())
hashed = self.uhash(hashed, self.serialize_attr_value(v))
else:
raise TypeError(
f"Unknown object type: {type(obj)}\n"
+ "Please report this an issue on GitHub."
)
return hashed.hexdigest()
def uhash(self, hash, value):
hash.update(value)
if self.status == "old":
self.all_old.append(value)
elif self.status == "new":
self.all_new.append(value)
return hash
def comp_obj(self, obj=None):
"""Show string diffs for given object.
If obj=None, iterate over differing objects.
"""
if obj is not None and obj in self.obj_mismatch:
key = obj
old, new = self.obj_mismatch[key]
else:
try:
key, (old, new) = next(self._obj_mismatch_iter)
except StopIteration:
return None
print(f"Object: {key}")
if getattr(old, "shape", None) == ():
old = self.remove_version(str(old[()]))
new = self.remove_version(str(new[()]))
pprint(self.diff_strings(old, new, context=15))
return old, new
def diff_strings(self, a: str, b: str, context=30) -> str:
"""Highlight differences between two strings with surrounding context."""
a = str(a)
b = str(b)
matcher = SequenceMatcher(None, a, b)
diffs = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
if tag != "equal":
diffs.append(
f"...{a[max(0, i1-context):i2+context]}"
+ f" -> {b[max(0, j1-context):j2+context]}..."
)
return "\n".join(diffs) QuestionsFrom a detail perspective, should a two files have the same hash if they differ in these ways?
Each case where we change the input during hashing adds processing time when hashing the recompute product; this is especially true for rounding big datasets.

Bigger picture, what is our process for a mismatch on recompute? Hypothetically, a paper has been submitted using a downstream analysis, and a reviewer asks for summary stats that would require a recomputed file whose new hash does not match.
Certainly, we can default to one approach and take another on a case-by-case basis.

Next steps

Next, I'll make an effort to reverse-engineer the required environment from existing files to test our capacity to take the 'resolve' route for existing files.
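As a hedged starting point for that reverse-engineering, the cached spec versions inside a file can be listed with h5py. This assumes the files were written with cached specs under /specifications, and it only recovers NWB namespace versions, not full package pins; the helper name and example path are illustrative.

```python
from pathlib import Path

import h5py

def cached_spec_versions(nwb_path):
    """Return {namespace: [versions]} from an NWB file's cached specs, if any."""
    versions = {}
    with h5py.File(Path(nwb_path), "r") as f:
        specs = f.get("specifications")
        if specs is None:  # file written without cached specs
            return versions
        for namespace, group in specs.items():
            versions[namespace] = sorted(group.keys())
    return versions

# e.g. cached_spec_versions("/stelmo/nwb/analysis/example/example_RAND.nwb")
```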
Replication

Environment

I've had trouble replicating files without a complete record of the original environment. The files provide (a) a ... Unfortunately, most of the existing files use ...

File counts by version, and dependency pins:
Censoring

In an effort to minimize mismatches, I've added the option to censor version ...

Mismatches

Missing Objects

A few old files are missing ...
Hash mismatches

~80% of generated files have a hash mismatch on at least one object.
Some of these are actual datasets, while others are groups or scalar datasets, which I unpack with:

```python
def unpack_scalar(obj):
return eval(
eval(
str(obj[()])
.replace("null", '"null"')
.replace("false", '"false"')[1:]
)
    )
```

Icephys

Old vs. new:

```python
old['groups'][2]['datasets'][0].get('value') == None
new['groups'][2]['datasets'][0].get('value') == 0.0
```
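For context, a hedged sketch of how values like those above can be pulled out for inspection, reusing the comparator and unpack_scalar defined earlier (assumes at least one mismatch, and that the mismatched object is a JSON-like scalar):

```python
comp = NwbfileComparator(
    "/stelmo/nwb/analysis/example/example_RAND.nwb",    # old, from the docstring above
    "/stelmo/cbroz/temp_rcp/example/example_RAND.nwb",  # new
)
name = next(iter(comp.obj_mismatch))       # or pick a specific object name
old_obj, new_obj = comp.obj_mismatch[name]
old, new = unpack_scalar(old_obj), unpack_scalar(new_obj)
```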
Ophys

resources: Even controlling for hdmf version, ...

ProcessedElectricalSeries

Many ...

Questions

Should the hasher take the time to...
Should we adjust the existing files before hashing to account for the updated ...?
A growing list of exceptions also increases the risk of a false negative, if future file specs introduce a meaningful difference in some region we've decided to ignore.
One of our main goals is to be able to share complete pipelines with results. Sharing of the various intermediate computed data outputs is one way to do this, but for data that can be computed relatively quickly, we could also enable remote users to recompute on the fly.
One possible solution would be to add a recompute method to each dj.Computed table that would regenerate the NWB file (using the same name) if it is not present locally. This would be quite a lot of work and would also require careful thought about what to do when upstream computed results are not available, but if we could get this to work, we could share a much smaller subset of results when we publish papers, which would likely help.
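A minimal sketch of what such a method might look like, as an illustration only; the mixin and helper names below are hypothetical, and the real design would still need to handle missing upstream results, environment pinning, and hash verification:

```python
class RecomputeMixin:
    """Hypothetical mixin for dj.Computed tables whose make() writes an analysis NWB file."""

    def fetch_or_recompute(self, key):
        """Return the analysis file path for `key`, regenerating the file if absent locally."""
        # Assumes the table records the analysis file name for each entry
        file_name = (self & key).fetch1("analysis_file_name")
        local_path = self._local_path(file_name)  # hypothetical path resolver
        if not local_path.exists():
            # Re-run the same logic as make(), writing to the same file name so
            # downstream entries (and any stored hashes) still point at a valid file
            self.recompute(key, file_name=file_name)
        return local_path

    def recompute(self, key, file_name):
        """Regenerate the file for `key`; table-specific logic goes here."""
        raise NotImplementedError
```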