[DO NOT MERGE] Hypermodern Cookiecutter #570

Open
wants to merge 86 commits into
base: master

@bosd (Collaborator) commented Nov 29, 2024

This PR is just to indicate that I'm working on using a cookiecutter template for this module.

It comes with a lot of features: linting, coverage, and documentation builders.

Refactor the `get_sample_files` function to use `os.walk` instead of
`os.listdir`. This allows the function to find files in subdirectories
within the "compare" directory, ensuring that all relevant test files
are included.
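The difference between the two calls can be sketched as follows. This is a minimal illustration, not the code from the commit: the signature of `get_sample_files` and the `extension` parameter are assumptions made for the example.

```python
import os
from typing import List


def get_sample_files(root: str, extension: str = ".pdf") -> List[str]:
    """Collect files under ``root``, including those in subdirectories.

    The signature is hypothetical; only the use of os.walk comes from
    the commit message.
    """
    matches = []
    # os.walk yields (dirpath, dirnames, filenames) for root and for every
    # directory below it, whereas os.listdir only sees the top level.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extension):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling it on a directory that contains `a.pdf` and `sub/b.pdf` returns both paths, where a plain `os.listdir` loop would only report `a.pdf`.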
Refactor the `test_custom_invoices` test function to use `os.walk` for iterating through files in the "custom" directory.

This change ensures that the test function can correctly locate and process all test files, including those in subdirectories.

Use `pytest.raises` to assert that an `AssertionError` is raised when creating an `InvoiceTemplate` with an invalid language code.

This ensures that the test correctly checks for the exception even when the code is run with optimizations (`python -O`).

Refactor the `ocrmypdf` availability check to use `ocrmypdf_available()` instead of `have_ocrmypdf()`.

This change ensures consistency with the new availability check function and improves code readability.

Update the expected warning message in `test_ordered_load_broken_json` to match the actual output.

This change ensures that the test correctly verifies the warning message generated when a broken JSON file is loaded.

This commit refactors the command-line interface (CLI) to use the `click` library instead of `argparse`.

The `click` library provides a more concise and readable way to define command-line options and arguments. It also offers features like automatic help generation and type validation, improving the user experience.

This change removes the dependency on `argparse` and modernizes the CLI implementation.

This commit removes the dependency on the `importlib.resources` module and instead uses a relative path to access the templates directory.

The `importlib.resources` module was introduced in Python 3.7, so removing this dependency makes the code compatible with older versions of Python.

Additionally, this commit includes the following changes:

- Add type hints to function parameters and return values.
- Update docstrings to conform to Google style guidelines.
- Refactor code for clarity and consistency.
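A relative-path lookup of the kind described above might look like the sketch below. The helper name `templates_dir` and the assumption that the directory is called `templates` and sits next to the calling module are illustrative only and are not taken from the commit.

```python
from pathlib import Path


def templates_dir(module_file: str) -> Path:
    """Return the ``templates`` directory that sits beside ``module_file``.

    Hypothetical helper: the directory name and layout are assumptions.
    """
    # Resolve to an absolute path first so the result does not depend on
    # the current working directory of the process.
    return Path(module_file).resolve().parent / "templates"
```

Inside the package this would typically be called as `templates_dir(__file__)`. One trade-off worth keeping in mind: a `__file__`-based path assumes the package is installed as regular files on disk, which is not the case when it is imported from a zip archive.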
This commit refactors the `InvoiceTemplate` class and adds several optimizations to improve performance and maintainability.

The following changes were made:

- **Refactor `__init__`:** Use `super()` without arguments for calling the superclass initializer.
- **Refactor `matches_input`:** Improve the docstring and logic for checking keyword matches.
- **Optimize `parse_number`:** Add an early exit condition for simple numbers and handle locale-specific thousands separators.
- **Refactor `coerce_type`:** Improve the docstring and raise `AssertionError` directly for unknown types.
- **Refactor `extract`:** Improve the docstring and add a "Raises" section.
- **Add type hints:** Add type hints to function parameters and return values.
- **Update docstrings:** Update docstrings to conform to Google style guidelines.
- **General cleanup:** Remove unnecessary comments and improve code readability.
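The `parse_number` item above (early exit plus thousands-separator handling) could be sketched roughly like this. It is a standalone function written for illustration; the real method lives on `InvoiceTemplate`, and the `decimal_separator` parameter is an assumption about how the locale is passed in.

```python
def parse_number(value: str, decimal_separator: str = ".") -> float:
    """Parse a number string that may contain thousands separators.

    Sketch only: assumes "." and "," are the only separators in play.
    """
    # Early exit: a string of plain digits needs no separator handling.
    if value.isdecimal():
        return float(value)
    # Whichever of "." and "," is not the decimal separator is treated as
    # the thousands separator; spaces are dropped as well.
    thousands_separator = "," if decimal_separator == "." else "."
    cleaned = value.replace(thousands_separator, "").replace(" ", "")
    # Normalize the decimal separator to "." before converting.
    return float(cleaned.replace(decimal_separator, "."))
```

With these assumptions, `parse_number("1,234.56")` and `parse_number("1.234,56", decimal_separator=",")` both yield the same value.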
This commit updates the `pdfminer_wrapper` module and adds type hints to the `to_text` function.

The following changes were made:

- Removed unnecessary encoding-related code that is no longer needed in Python 3.
- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability, maintainability, and compatibility with modern Python versions.

This commit updates the `pdfplumber` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability and maintainability and make its usage easier to understand.

This commit updates the `pdftotext` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Added a module-level docstring.
- Raise `FileNotFoundError` if the PDF file is not found.

These changes improve the code's readability, maintainability, and error handling.

This commit updates the `tesseract` input module and adds type hints to the `to_text` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Raise `FileNotFoundError` if the image file is not found.
- Check for the `tesseract` executable using `shutil.which`.

These changes improve the code's readability, maintainability, and error handling.
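The two checks listed above (missing input file, missing executable) can be sketched together as below. The helper name `check_ocr_inputs` and the choice of `EnvironmentError` for a missing executable are assumptions for the example; only `FileNotFoundError` and `shutil.which` come from the commit message.

```python
import os
import shutil


def check_ocr_inputs(path: str, executable: str = "tesseract") -> str:
    """Validate the input file and locate the OCR executable.

    Args:
        path: Path to the image or PDF to process.
        executable: Name of the command-line tool to look up.

    Returns:
        The full path of the executable.

    Raises:
        FileNotFoundError: If ``path`` is not an existing file.
        EnvironmentError: If ``executable`` is not found on PATH.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    # shutil.which returns the full path of the executable, or None.
    found = shutil.which(executable)
    if found is None:
        raise EnvironmentError(f"{executable} is not installed or not on PATH")
    return found
```

Checking the file first means a typo in the path is reported as such, rather than being hidden behind a later failure of the external tool.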
This commit updates the `to_csv` output module and adds type hints to the `write_to_file` function.

The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability and maintainability and make its usage easier to understand.
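For readers unfamiliar with the conventions mentioned, a type-hinted function with a Google-style docstring looks roughly like the sketch below. The parameters of `write_to_file` shown here are assumptions for illustration and may differ from the actual module.

```python
import csv
from typing import Any, Dict, List


def write_to_file(data: List[Dict[str, Any]], path: str) -> None:
    """Write extracted invoice fields to a CSV file.

    Args:
        data: One dictionary per invoice; the keys of the first entry
            become the column headers (all rows are assumed to share them).
        path: Destination file path.
    """
    if not data:
        return
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
```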
This commit refactors the Google Vision input module (`gvision.py`) to improve its structure, error handling, and compatibility with other input modules.

The following changes were made:

- Moved the import of `google.cloud.vision` inside the `to_text` function to prevent import errors when other input modules are used.
- Added a check (`have_google_cloud`) to verify whether the `google.cloud.vision` module is available before attempting to use it.
- Improved the error message to guide users on installing the necessary dependency if it's missing.
- Updated the docstring to reflect the dependency on `google-cloud-vision`.
- Removed the unused `language` parameter for consistency with other input modules.
- Added type hints for improved readability and maintainability.

These changes make the Google Vision input module more robust and user-friendly while ensuring compatibility with the rest of the invoice2data project.

This commit updates the `to_xml` output module and adds type hints to its functions.

The following changes were made:

- Added type hints to all function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Refactored code for clarity and consistency.
- Removed unnecessary logging and simplified the defusedxml availability check.
- Updated the module-level docstring.

These changes improve the code's readability, maintainability, and type safety.

This commit fixes the gvision unit tests that were failing due to incorrect mocking and assertions.

The following changes were made:

- Corrected the mocking of `get_blob` in both test cases to accurately simulate the behavior of Google Cloud Storage.
- Added `side_effect` to `mock_bucket.get_blob` in `test_to_text` to return different values on consecutive calls.
- Simplified the mocking of `get_blob` in `test_to_text_existing_result` to return the result blob directly.
- Ensured that `to_text` is called with the correct `path` argument in both test cases.
- Used `assert_any_call` in `test_to_text_existing_result` to check that `get_blob` was called with the expected argument without requiring it to be the only call.

These corrections ensure that the unit tests accurately test the functionality of the gvision input module and pass reliably.
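The two mocking techniques named above, a list-valued `side_effect` and `assert_any_call`, can be shown in isolation. `fetch_result` is a hypothetical stand-in for the code under test, not a function from the gvision module.

```python
from unittest import mock


def fetch_result(bucket, result_name: str):
    """Hypothetical helper: look for a result blob, then check once more."""
    blob = bucket.get_blob(result_name)
    if blob is None:
        # The real module would trigger the OCR request at this point.
        blob = bucket.get_blob(result_name)
    return blob


result_blob = mock.Mock(name="result_blob")
mock_bucket = mock.Mock()
# A list side_effect returns one item per call: None first, then the blob.
mock_bucket.get_blob.side_effect = [None, result_blob]

assert fetch_result(mock_bucket, "invoice.json") is result_blob
assert mock_bucket.get_blob.call_count == 2
# assert_any_call passes if at least one call used these arguments,
# regardless of how many other calls were made.
mock_bucket.get_blob.assert_any_call("invoice.json")
```

`assert_called_once_with` would fail here because `get_blob` is called twice, which is why the looser `assert_any_call` fits this case.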
This commit moves the imports of `google.cloud.storage` and `google.cloud.vision` inside the `to_text` function in the `gvision` input module.

This change ensures that these modules are only imported when the Google Cloud Vision API is actually used, preventing unnecessary imports and potential import errors when other input methods are used.

This approach aligns with the structure of other input modules, such as `ocrmypdf`, where the module-specific libraries are only imported when the function is called.
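A deferred import combined with an availability check might look like the following sketch. The helper `module_available` and the body of `to_text` are assumptions for illustration; the actual check in the module is named `have_google_cloud` and may be implemented differently.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if ``name`` looks importable in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # For dotted names find_spec imports the parent packages, so a
        # missing parent raises instead of returning None.
        return False


def to_text(path: str) -> str:
    """Hypothetical entry point that needs google-cloud-vision."""
    if not module_available("google.cloud.vision"):
        raise ImportError(
            "google-cloud-vision is required: pip install google-cloud-vision"
        )
    # Deferred import: only executed when this input method is selected,
    # so other input modules work without the dependency installed.
    from google.cloud import vision  # noqa: F401

    raise NotImplementedError("OCR call omitted from this sketch")
```

Because nothing from `google.cloud` is imported at module level, importing the package itself no longer fails on systems where the optional dependency is missing.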
This commit refactors the ocrmypdf input module to improve its structure, error handling, and documentation.

The following changes were made:

- Renamed the `have_ocrmypdf` function to `ocrmypdf_available` and improved its implementation.
- Added type hints to function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Moved the import of `ocrmypdf` inside the `pre_process_pdf` function to prevent unnecessary imports.
- Improved logging and error handling.
- Added a module-level docstring.

These changes enhance the readability, maintainability, and robustness of the ocrmypdf input module.