Feature/file reporter #78
base: main
Conversation
Thanks - this looks promising, but we'll have to tweak it slightly to make it as effective as possible.
If anything is unclear, either ping me here or I can push some aspects as I have them in my head to your branch if you prefer.
src/nnbench/reporter/file.py
Outdated
```python
from nnbench.types import BenchmarkRecord


class Parser:
```
Just for clarification: we're not actually parsing things when we are reading files; that is the job of the respective modules (`json`|`yaml`|`toml`).
src/nnbench/reporter/file.py
Outdated
```python
# Register custom parsers here
parsers = {"json": JsonParser, "yaml": YamlParser}
```
I like the idea of making it a module-wide default, but we should take care that
a) the variable is private (i.e. has a leading underscore) to prevent accidental export, and
b) we should make the value structure of the map as easy as possible.
To address b), I would start by making it a `tuple[ser, de]`, where `ser` is a `Callable[[IO, dict[str, Any]], None]`, i.e., a function taking a file descriptor in write mode and a record and writing it to a file, and `de` is a `Callable[[IO], dict[str, Any]]`, a function taking a file descriptor in read mode and returning the loaded record.
You can then register the SerDe factories based on whether the necessary packages are installed (json is available out of the box, yaml and toml are not).
Should I use a class-based approach to register the SerDe factories (like I did in the Parser class)?
One other option is to use a simple register method which takes three arguments (the ser and de functions and a file_type).
Also, I think the class-based approach is more concise for defining the ser and de methods on the user side.
I'm happy with either, though the optional import part (i.e., erroring if `toml`/`yaml` are not installed) will be a bit easier in the class case.
For now, I think the quickest way is a functional approach, though. Maybe like this:
```python
_file_loaders: dict[str, tuple[Any, Any]] = {}


def yaml_load(fp: IO, options=None):
    try:
        import yaml
    except ImportError:
        raise ModuleNotFoundError("`pyyaml` is not installed")
    # takes no options, but the slot is useful for passing options to file loaders.
    obj = yaml.safe_load(fp)
    return BenchmarkRecord(context=obj["context"], benchmarks=obj["benchmarks"])


def yaml_save(record: BenchmarkRecord, fp: IO, options=None) -> None:
    try:
        import yaml
    except ImportError:
        raise ModuleNotFoundError("`pyyaml` is not installed")
    yaml.safe_dump(record, fp, **(options or {}))


_file_loaders["yaml"] = (yaml_save, yaml_load)
```
With the option of defining e.g. a `register_file_io(ser, de)` later to do the dict insertion if we want.
src/nnbench/reporter/file.py
Outdated
```python
if not self.dir:
    raise BaseException("Directory is not initialized")
file_path = os.path.join(self.dir, file_name)
file_type = file_name.split(".")[1]
try:
    with open(file_path) as file:
        data = file.read()
        parsed_data = parse_records(data, file_type)
        return parsed_data
except FileNotFoundError:
    raise ValueError(f"Could not read the file: {file_path}")
```
This needs the restructured file loading dict I talked about earlier, but in essence, all that should happen here is the file being loaded with `open` (like you did), and then calling the deserializer on the opened file.
In particular, no error handling for `open()` should be necessary, since those errors are informative enough for the user on their own.
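In other words, the read path could shrink to something like the following sketch (class shape, `_loaders` attribute, and names are assumptions for illustration):

```python
import os
from typing import Any, Callable


class FileReporter:
    def __init__(self, dir: str, loaders: dict[str, tuple[Callable, Callable]]):
        self.dir = dir
        # extension -> (ser, de), per the module-level map discussed above.
        self._loaders = loaders

    def read(self, file_name: str) -> dict[str, Any]:
        # os.path.splitext is more robust than file_name.split(".")[1]
        # for names containing more than one dot.
        file_type = os.path.splitext(file_name)[1].lstrip(".")
        _, de = self._loaders[file_type]
        # No try/except around open(): FileNotFoundError is informative as-is.
        with open(os.path.join(self.dir, file_name)) as fp:
            return de(fp)
```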
src/nnbench/reporter/file.py
Outdated
```python
if not self.dir:
    raise BaseException("Directory is not initialized")

file_path = os.path.join(self.dir, file_name)
if not os.path.exists(file_path):  # Create the file
    with open(file_path, "w") as file:
        file.write("")
try:
    parsed_records = self.read(file_name)
    file_type = file_name.split(".")[1]
    new_records = append_record_to_records(parsed_records, record, file_type)
    with open(file_path, "w") as file:
        file.write(new_records)
except FileNotFoundError:
    raise ValueError(f"Could not read the file: {file_path}")
```
Same here, just load the serializer from the dict and call it on the opened file.
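The write path would then mirror the read path; a minimal sketch under the same assumed names:

```python
import os
from typing import Any, Callable


class FileReporter:
    def __init__(self, dir: str, loaders: dict[str, tuple[Callable, Callable]]):
        self.dir = dir
        self._loaders = loaders

    def write(self, record: dict[str, Any], file_name: str) -> None:
        file_type = os.path.splitext(file_name)[1].lstrip(".")
        ser, _ = self._loaders[file_type]
        # open(..., "w") creates the file if it does not exist, so no
        # separate existence check or empty-file pre-creation is needed.
        with open(os.path.join(self.dir, file_name), "w") as fp:
            ser(record, fp)
```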
src/nnbench/reporter/file.py
Outdated
```python
def finalize(self) -> None:
    del self.dir
```
This does not do what you think it does - it just stages the `self.dir` variable for garbage collection.
To safely remove the directory (which you might not want to do anyway; we could add a flag in the constructor for that?), you should call `shutil.rmtree(self.dir, ignore_errors=True)`. (Though you might want to check existence first and set `ignore_errors` to `False`.)
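Put together, a `finalize` along those lines could look like this sketch; the `delete_on_finalize` constructor flag is the assumption floated above, not an agreed API:

```python
import os
import shutil


class FileReporter:
    def __init__(self, dir: str, delete_on_finalize: bool = False):
        self.dir = dir
        self.delete_on_finalize = delete_on_finalize

    def finalize(self) -> None:
        if self.delete_on_finalize and os.path.exists(self.dir):
            # Existence was checked above, so surface any real removal errors.
            shutil.rmtree(self.dir, ignore_errors=False)
```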
…ded changes to read and write methods of base `BenchmarkReporter`.
I've changed the value of the …
Thanks! To get you up to speed on the current situation: we (I) have found that a file reporter is necessary as a building block for the duckDB reporter implementation that I started in #75. For this reason, I implemented a file reporter myself in that PR, which we will go ahead with ASAP. That is not meant as a knock on you - you did a great job here, and a lot of the logic you brought in here also made it into #75; we just had to speed things up a bit to consolidate the interface. If you are interested, I'd like to welcome you to help improve on it (especially the file driver registration/deregistration hooks you gave here come to mind)! I'm going to open some tickets related to improving that file IO solution, which I can assign to you if you want - just let me know by commenting on the respective issue.
Closes #74
Here, I have written a generic FileReporter according to your recommendations mentioned in PR #74.
To specialize reading and writing files in any particular format, we can use the Parser class to define read and write methods.
I have defined JSONParser and YAMLFileParser as samples in `file.py` for handling read and write operations on json and yaml files. Since the FileReporter read and write functions use the Parser class, it automatically adjusts the read and write methods based on the file extension we are writing to.
Registering a Parser
Once a parser is registered with a particular extension, we can use that parser implicitly based on the extension of the `file_name` we want to write to. Do let me know your views and recommendations on these changes.