Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ✨ add check_raw_data function #824

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

martonvago
Copy link
Contributor

@martonvago martonvago commented Oct 29, 2024

Description

This PR adds a check_raw_data function. This is a combination of verify_raw_data and validate_raw_data because frictionless has one function for these. We can still split the function into two based on the error codes, but I wanted to put it up for discussion before making it more complicated.

Feel free to suggest better names for check_raw_data and the error it throws (if we're keeping them).

Note that I added resource_properties: dict as an argument instead of reading it from file inside check_raw_data. I'll put up another PR for loading the resource properties from datapackage.json, because this seemed like a separate task and raised its own questions.

Closes #800, #801

This PR needs an in-depth review.

Checklist

  • Added or updated tests
  • Tests passed locally
  • Linted and formatted code
  • Build passed locally
  • Updated documentation

@martonvago martonvago self-assigned this Oct 29, 2024
Raises:
MismatchedRawDataError: If the data do not match the properties.
"""
resource = Resource.from_descriptor(resource_properties)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When frictionless validates a resource it checks both the resource properties and the table schema, including any constraints on the fields. It locates the data file to compare with the schema by following the path property of the resource. So here I'm constructing a resource that has the resource properties loaded from datapackage.json, except that the path is pointed to the raw data file. This doesn't affect the original resource properties.

Comment on lines +25 to +26
resource.format = get_extension(data_path)
resource.scheme = "file"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are needed for being able to open the file.

Comment on lines +27 to +28
resource.basepath = str(path_sprout_root())
resource.path = str(data_path.relative_to(path_sprout_root()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doing this because path cannot be an absolute path. Since it has to be made into a relative path, it made sense to make it relative to sprout root. To ensure that frictionless can find the data file, I'm setting sprout root to be the basepath of the resource. This has the added advantage of working with the tests (for which I am infinitely grateful).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A wondrously complex function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pulled this out from verify_properties_are_well_formed.py

Comment on lines +11 to +15
@fixture
def sprout_root(tmp_path, monkeypatch) -> Path:
ROOT = tmp_path / "root"
monkeypatch.setenv("SPROUT_ROOT", str(ROOT))
return ROOT
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting this so everything can happen inside temporary dirs

@martonvago martonvago marked this pull request as ready for review October 29, 2024 11:07
@martonvago martonvago requested a review from a team as a code owner October 29, 2024 11:07
"""Should accept a data file that conforms to schema constraints."""
resource_properties["schema"]["fields"] += [
{"name": "id", "type": "integer"},
{"name": "name", "type": "string", "constraints": {"maxLength": 10}},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've discovered that frictionless expects constraint names in camel case (hence maxLength), so we'll have to have a think about what to do with our class attribute names.

resource.basepath = str(path_sprout_root())
resource.path = str(data_path.relative_to(path_sprout_root()))

report = resource.validate()
Copy link
Member

@lwjohnst86 lwjohnst86 Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only going to make a small comment here, but looking through the frictionless source code (which, I think personally, isn't well designed), it isn't clear how they do the validation (I had to dig for finally finding how resource.validate() works).

My instinct is that we may have to either 1) write our own checks because it gives me anxiety looking through and trying to understand their code, or 2) look for other verify/validate packages available. My guess, given this is a very common need, there will be better designed tools out there for these purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tagging the relevant issue: #826

@signekb
Copy link
Member

signekb commented Oct 31, 2024

@martonvago @lwjohnst86 should I hold off on reviewing this until #826 is closed?

@martonvago
Copy link
Contributor Author

martonvago commented Oct 31, 2024

@signekb Yeah, I think so! Maybe I'll move it back to in progress

@signekb signekb marked this pull request as draft November 12, 2024 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Create and test verify_raw_data function (part of write_resource_data_to_raw)
3 participants