-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ✨ add check_raw_data
function
#824
base: main
Are you sure you want to change the base?
Conversation
Raises: | ||
MismatchedRawDataError: If the data do not match the properties. | ||
""" | ||
resource = Resource.from_descriptor(resource_properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When frictionless
validates a resource it checks both the resource properties and the table schema, including any constraints on the fields. It locates the data file to compare with the schema by following the path
property of the resource. So here I'm constructing a resource that has the resource properties loaded from datapackage.json
, except that the path is pointed to the raw data file. This doesn't affect the original resource properties.
resource.format = get_extension(data_path) | ||
resource.scheme = "file" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are needed for being able to open the file.
resource.basepath = str(path_sprout_root()) | ||
resource.path = str(data_path.relative_to(path_sprout_root())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm doing this because path
cannot be an absolute path. Since it has to be made into a relative path, it made sense to make it relative to sprout root. To ensure that frictionless
can find the data file, I'm setting sprout root to be the basepath
of the resource. This has the added advantage of working with the tests (for which I am infinitely grateful).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A wondrously complex function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pulled this out from verify_properties_are_well_formed.py
@fixture | ||
def sprout_root(tmp_path, monkeypatch) -> Path: | ||
ROOT = tmp_path / "root" | ||
monkeypatch.setenv("SPROUT_ROOT", str(ROOT)) | ||
return ROOT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting this so everything can happen inside temporary dirs
"""Should accept a data file that conforms to schema constraints.""" | ||
resource_properties["schema"]["fields"] += [ | ||
{"name": "id", "type": "integer"}, | ||
{"name": "name", "type": "string", "constraints": {"maxLength": 10}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've discovered that frictionless
expects constraint names in camel case (hence maxLength
), so we'll have to have a think about what to do with our class attribute names.
resource.basepath = str(path_sprout_root()) | ||
resource.path = str(data_path.relative_to(path_sprout_root())) | ||
|
||
report = resource.validate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only going to make a small comment here, but looking through the frictionless source code (which, I think personally, isn't well designed), it isn't clear how they do the validation (I had to dig for finally finding how resource.validate()
works).
My instinct is that we may have to either 1) write our own checks because it gives me anxiety looking through and trying to understand their code, or 2) look for other verify/validate packages available. My guess, given this is a very common need, there will be better designed tools out there for these purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tagging the relevant issue: #826
@martonvago @lwjohnst86 should I hold off on reviewing this until #826 is closed? |
@signekb Yeah, I think so! Maybe I'll move it back to in progress |
Description
This PR adds a
check_raw_data
function. This is a combination ofverify_raw_data
andvalidate_raw_data
becausefrictionless
has one function for these. We can still split the function into two based on the error codes, but I wanted to put it up for discussion before making it more complicated.Feel free to suggest better names for
check_raw_data
and the error it throws (if we're keeping them).Note that I added
resource_properties: dict
as an argument instead of reading it from file insidecheck_raw_data
. I'll put up another PR for loading the resource properties fromdatapackage.json
, because this seemed like a separate task and raised its own questions.Closes #800, #801
This PR needs an in-depth review.
Checklist