[DO NOT MERGE] Hypermodern Cookiecutter #570

bosd · 2024-11-29T20:49:42Z

This PR is just to indicate that I'm working on using a cookiecutter template for this module.

It comes with a lot of features: linting, coverage, and documentation builders.

Refactor the `get_sample_files` function to use `os.walk` instead of `os.listdir`. This allows the function to find files in subdirectories within the "compare" directory, ensuring that all relevant test files are included.
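A minimal sketch of the recursive lookup described above. The signature and the `extensions` filter are assumptions for illustration; the project's actual `get_sample_files` may differ.

```python
import os
from typing import List, Tuple


def get_sample_files(root: str, extensions: Tuple[str, ...] = (".pdf",)) -> List[str]:
    """Return sorted paths of matching files under root, including subdirectories."""
    matches = []
    # os.walk yields (dirpath, dirnames, filenames) for root and every nested
    # directory, whereas os.listdir only returns the entries directly in root.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Hypothetical filter: compare extensions case-insensitively.
            if name.lower().endswith(extensions):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Sorting the result keeps the test order stable across platforms, since `os.walk` does not guarantee a particular ordering of filenames.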

Refactor the `test_custom_invoices` test function to use `os.walk` for iterating through files in the "custom" directory. This change ensures that the test function can correctly locate and process all test files, including those in subdirectories.

Use `pytest.raises` to assert that an `AssertionError` is raised when creating an `InvoiceTemplate` with an invalid language code. This ensures that the test correctly checks for the exception even when the code is run with optimizations (`python -O`).

Refactor the `ocrmypdf` availability check to use `ocrmypdf_available()` instead of `have_ocrmypdf()`. This change ensures consistency with the new availability check function and improves code readability.

Update the expected warning message in `test_ordered_load_broken_json` to match the actual output. This change ensures that the test correctly verifies the warning message generated when a broken JSON file is loaded.

This commit refactors the command-line interface (CLI) to use the `click` library instead of `argparse`. The `click` library provides a more concise and readable way to define command-line options and arguments. It also offers features like automatic help generation and type validation, improving the user experience. This change removes the dependency on `argparse` and modernizes the CLI implementation.

This commit removes the dependency on the `importlib.resources` module and instead uses a relative path to access the templates directory. The `importlib.resources` module was introduced in Python 3.7, so removing this dependency makes the code compatible with older versions of Python. Additionally, this commit includes the following changes:

- Add type hints to function parameters and return values.
- Update docstrings to conform to Google style guidelines.
- Refactor code for clarity and consistency.

This commit refactors the `InvoiceTemplate` class and adds several optimizations to improve performance and maintainability. The following changes were made:

- **Refactor `__init__`:** Use `super()` without arguments for calling the superclass initializer.
- **Refactor `matches_input`:** Improve the docstring and logic for checking keyword matches.
- **Optimize `parse_number`:** Add an early exit condition for simple numbers and handle locale-specific thousands separators.
- **Refactor `coerce_type`:** Improve the docstring and raise `AssertionError` directly for unknown types.
- **Refactor `extract`:** Improve the docstring and add a "Raises" section.
- **Add type hints:** Add type hints to function parameters and return values.
- **Update docstrings:** Update docstrings to conform to Google style guidelines.
- **General cleanup:** Remove unnecessary comments and improve code readability.
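To illustrate the `parse_number` item above, here is a minimal sketch of an early exit for plain integers followed by separator handling. The standalone signature and the `decimal_separator` parameter are assumptions for illustration, not the actual `InvoiceTemplate.parse_number` method.

```python
def parse_number(value: str, decimal_separator: str = ".") -> float:
    """Parse a number string that may contain thousands separators."""
    value = value.strip()
    # Early exit: a plain run of ASCII digits needs no separator handling.
    if value.isascii() and value.isdigit():
        return float(value)
    # Assumption: the thousands separator is whichever of "," and "."
    # is not used as the decimal separator.
    thousands_separator = "," if decimal_separator == "." else "."
    cleaned = value.replace(thousands_separator, "").replace(" ", "")
    # Normalize the decimal separator to the "." that float() expects.
    cleaned = cleaned.replace(decimal_separator, ".")
    return float(cleaned)
```

For example, `parse_number("1.234,56", decimal_separator=",")` and `parse_number("1,234.56")` both yield `1234.56` under these assumptions.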

This commit updates the `pdfminer_wrapper` module and adds type hints to the `to_text` function. The following changes were made:

- Removed unnecessary encoding-related code that is no longer needed in Python 3.
- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability, maintainability, and compatibility with modern Python versions.

This commit updates the `pdfplumber` input module and adds type hints to the `to_text` function. The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability and maintainability, and make its usage easier to understand.

This commit updates the `pdftotext` input module and adds type hints to the `to_text` function. The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Added a module-level docstring.
- Raise `FileNotFoundError` if the PDF file is not found.

These changes improve the code's readability, maintainability, and error handling.

This commit updates the `tesseract` input module and adds type hints to the `to_text` function. The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines, including a "Raises" section.
- Raise `FileNotFoundError` if the image file is not found.
- Check for the `tesseract` executable using `shutil.which`.

These changes improve the code's readability, maintainability, and error handling.
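The two checks listed above could look roughly like the sketch below. The helper name `check_prerequisites` is hypothetical, and raising `OSError` for a missing executable is an assumption; the commit message only confirms the `FileNotFoundError` for a missing image and the use of `shutil.which`.

```python
import os
import shutil


def check_prerequisites(path: str, executable: str = "tesseract") -> str:
    """Validate the input file and locate the OCR executable.

    Raises:
        FileNotFoundError: If the input file does not exist.
        OSError: If the executable is not on PATH (assumed behavior).
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    # shutil.which returns the full path to the executable, or None if
    # it cannot be found on PATH.
    resolved = shutil.which(executable)
    if resolved is None:
        raise OSError(f"{executable} executable not found on PATH")
    return resolved
```

Checking up front gives the caller a clear error before any subprocess is started.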

This commit updates the `to_csv` output module and adds type hints to the `write_to_file` function. The following changes were made:

- Added type hints to the function parameters and return value.
- Updated the docstring to conform to Google style guidelines.
- Added a module-level docstring.

These changes improve the code's readability and maintainability, and make its usage easier to understand.

This commit refactors the Google Vision input module (`gvision.py`) to improve its structure, error handling, and compatibility with other input modules. The following changes were made:

- Moved the import of `google.cloud.vision` inside the `to_text` function to prevent import errors when other input modules are used.
- Added a check (`have_google_cloud`) to verify if the `google.cloud.vision` module is available before attempting to use it.
- Improved the error message to guide users on installing the necessary dependency if it's missing.
- Updated the docstring to reflect the dependency on `google-cloud-vision`.
- Removed the unused `language` parameter for consistency with other input modules.
- Added type hints for improved readability and maintainability.

These changes make the Google Vision input module more robust and user-friendly while ensuring compatibility with the rest of the invoice2data project.

This commit updates the `to_xml` output module and adds type hints to its functions. The following changes were made:

- Added type hints to all function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Refactored code for clarity and consistency.
- Removed unnecessary logging and simplified the defusedxml availability check.
- Updated the module-level docstring.

These changes improve the code's readability, maintainability, and type safety.

This commit fixes the gvision unit tests that were failing due to incorrect mocking and assertions. The following changes were made:

- Corrected the mocking of `get_blob` in both test cases to accurately simulate the behavior of Google Cloud Storage.
- Added `side_effect` to `mock_bucket.get_blob` in `test_to_text` to return different values on consecutive calls.
- Simplified the mocking of `get_blob` in `test_to_text_existing_result` to return the result blob directly.
- Ensured that `to_text` is called with the correct `path` argument in both test cases.
- Used `assert_any_call` in `test_to_text_existing_result` to check if `get_blob` is called with the expected argument without enforcing it as the only call.

These corrections ensure that the unit tests accurately test the functionality of the gvision input module and pass reliably.
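The mocking pattern described above can be sketched with `unittest.mock` alone. The blob name and the payload below are hypothetical; only `side_effect`, `get_blob`, and `assert_any_call` come from the commit message.

```python
from unittest import mock

mock_bucket = mock.Mock()
result_blob = mock.Mock()
# Hypothetical payload standing in for a stored OCR result.
result_blob.download_as_string.return_value = b'{"responses": []}'

# A list side_effect returns one item per call: the first lookup finds
# no cached result, the second returns the result blob.
mock_bucket.get_blob.side_effect = [None, result_blob]

first = mock_bucket.get_blob("results/invoice.json")
second = mock_bucket.get_blob("results/invoice.json")

assert first is None
assert second is result_blob
# assert_any_call passes if at least one call used these arguments,
# without requiring it to be the only or the most recent call.
mock_bucket.get_blob.assert_any_call("results/invoice.json")
```

`assert_called_once_with` would fail here because `get_blob` is called twice, which is why `assert_any_call` is the looser and more suitable check.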

This commit moves the imports of `google.cloud.storage` and `google.cloud.vision` inside the `to_text` function in the `gvision` input module. This change ensures that these modules are only imported when the Google Cloud Vision API is actually used, preventing unnecessary imports and potential import errors when other input methods are used. This approach aligns with the structure of other input modules, such as `ocrmypdf`, where the module-specific libraries are only imported when the function is called.
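A minimal sketch of the deferred-import pattern described above. The error text and the placeholder body are assumptions; the real `to_text` performs the actual OCR request.

```python
def to_text(path: str) -> str:
    """Extract text from a file via Google Cloud Vision (sketch only)."""
    try:
        # Importing here, not at module level, means that importing this
        # module never fails when the optional dependencies are missing.
        from google.cloud import storage, vision  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "The gvision input module requires the google-cloud-vision "
            "and google-cloud-storage packages."
        ) from exc
    # Placeholder: the actual upload and OCR calls are omitted here.
    raise NotImplementedError(f"OCR request for {path} omitted in this sketch")
```

Users who only rely on other input modules never trigger the import, so they do not need the Google Cloud packages installed.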


This commit refactors the ocrmypdf input module to improve its structure, error handling, and documentation. The following changes were made:

- Refactored the `have_ocrmypdf` function to `ocrmypdf_available` and improved its implementation.
- Added type hints to function parameters and return values.
- Updated docstrings to conform to Google style guidelines.
- Moved the import of `ocrmypdf` inside the `pre_process_pdf` function to prevent unnecessary imports.
- Improved logging and error handling.
- Added a module-level docstring.

These changes enhance the readability, maintainability, and robustness of the ocrmypdf input module.

bosd added 30 commits November 16, 2024 20:14

Use modern cookiecutter template

da30718

Fixup

e09c196

Fixup

5790dad

Fixup

98cdae7

Fixup

5a746ee

Fixup

3d11ad2

Fixup

fbc51cf

Fixup

f0f09ca

Fixup

58372ff

Fixup

8af5128

Fixup

bacea4b

Update fixup toml file

3ff4c2e

Update lockfile

c86b375

Fixup pyproject.toml

035b709

Update Lockfile

a1c20e0

pre-commit, Ruff formatting

0655883

update md files

67208e8

Fix B006

0e9d023

Fix D205

5a6a527

Fix N806 change variable tmp_folder to lowercase

7091c93

Fix E503 line too long

fc2d3d6

Fix docstring

84f4aff

Fix B007

d7709ca

Fix Docstring

10d0703

Fix Docstring

86d58b4

Fix N806 change variable tmp_folder to lowercase

47901a5

Fix b007

b776deb

Rename var InvoiceTempl to invoicetempl

d16c12a

Rename var OPTIONS_TEST to options_test

e85c2ae

Rename argument Loader to loader

c2ea7e7

bosd added 30 commits November 17, 2024 08:35

Fixup! output_to_xml logging

abe9d5e

Fixup Loader

3c15728

Update pre-commit config

889a999

Update noxfile

8caa4bf

Update defusedxml dependency in toml file, and doclint test

f32ff4d

Test: Refactor get_sample_files to use os.walk

f6c599a

Refactor the `get_sample_files` function to use `os.walk` instead of `os.listdir`. This allows the function to find files in subdirectories within the "compare" directory, ensuring that all relevant test files are included.

Test: Refactor test_cli

9988677

Test: Refactor ocrmypdf availability check

4d6fe93

Refactor the `ocrmypdf` availability check to use `ocrmypdf_available()` instead of `have_ocrmypdf()`. This change ensures consistency with the new availability check function and improves code readability.

Test: Fix warning message in test_ordered_load_broken_json

d5fe9b9

Update the expected warning message in `test_ordered_load_broken_json` to match the actual output. This change ensures that the test correctly verifies the warning message generated when a broken JSON file is loaded.

Documentation: Add Docstrings

8170bdb

Documentation: Add docstring

9c70c7b

Test: Improve ocrmypdf fallback test. Pretty error on list index out of range

65f6e67

Fixup __main__.py ocrmypdf fallback

2e7a698

Noxfile uncomment ocrmypdf

6944bd3

Update Lockfile

21f6a8c
