Fix invalid evaluation doctype deduction #3071

Closed

@micmarty-deepsense micmarty-deepsense commented May 21, 2024

There was a bug in evaluation.py that caused the extensions of certain files to be detected incorrectly.
Evaluation files are expected to have two extensions, e.g. foobar.pdf.json, because they were partitioned first. The code broke when more than 3 dots were present in the file name.

  • adjust doctype extraction for:
    • TextExtractionMetricsCalculator
    • TableStructureMetricsCalculator
    • ElementTypeMetricsCalculator
  • unit test
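
To illustrate the kind of failure described above, here is a minimal sketch, not the code from this PR. Both `naive_doctype` and `deduce_doctype` are hypothetical names; the actual helper used in the diff is `get_document_type`, whose implementation is not shown here. The sketch assumes the doctype is always the extension immediately before the trailing output extension.

```python
from pathlib import Path


def naive_doctype(path: Path) -> str:
    # Hypothetical fragile approach: assumes the name has exactly the
    # shape <name>.<doctype>.json, so extra dots shift the result.
    return path.name.split(".")[1]


def deduce_doctype(path: Path) -> str:
    # Hypothetical robust approach: drop the trailing output extension
    # (e.g. ".json") via .stem, then read the extension of what remains.
    # Dots earlier in the file name are ignored.
    return Path(path.stem).suffix.lstrip(".")


if __name__ == "__main__":
    for name in ("foobar.pdf.json", "report.v1.2.final.pdf.json"):
        p = Path(name)
        print(f"{name}: naive={naive_doctype(p)!r} robust={deduce_doctype(p)!r}")
```

Working from the right-hand side of the name keeps the result stable no matter how many dots the original document name contains.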

@micmarty-deepsense micmarty-deepsense self-assigned this May 21, 2024
@micmarty-deepsense micmarty-deepsense changed the title Fix invalid evaluation doctype deduction [WIP] Fix invalid evaluation doctype deduction May 21, 2024
@micmarty-deepsense micmarty-deepsense force-pushed the fix/invalid-evaluation-doctype-deduction branch from 5ef51a4 to 6384058 Compare May 22, 2024 08:06
document_paths: Optional[list[str | Path]] = None,
ground_truth_paths: Optional[list[str | Path]] = None,
relative_document_paths: Optional[list[str | Path]] = None,
relative_ground_truth_paths: Optional[list[str | Path]] = None,

I just wanted to reduce confusion by being more explicit about what those arguments actually are: paths relative to self.documents_dir or self.ground_truths_dir.

Note that we may want to refactor the logic/design of this class at some point and make it use full paths instead (only if we don't like the design of providing the root dir and discovering files). I don't have a strong preference; it can stay as it is.
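
For readers unfamiliar with the "root dir plus relative paths" design mentioned above, a rough sketch follows. It is an assumption-laden illustration, not the class from this PR: `resolve_paths` and its parameters are hypothetical, and the real calculators may discover or filter files differently.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_paths(
    root_dir: Path,
    relative_paths: Optional[list[Union[str, Path]]] = None,
) -> list[Path]:
    # Hypothetical helper: when no explicit list is given, discover
    # every file under the root directory.
    if relative_paths is None:
        return sorted(p for p in root_dir.rglob("*") if p.is_file())
    # Otherwise interpret each entry as relative to the root directory.
    return [root_dir / Path(p) for p in relative_paths]
```

Naming the arguments `relative_*` makes it clear at the call site that each entry is joined onto the root directory rather than used as a full path.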

connector = doc.parts[0] if len(doc.parts) > 1 else None
def _process_document(self, path: Path) -> list:
    filename = path.stem
    doctype = get_document_type(path)

Relevant change: this sets the doctype in ElementTypeMetricsCalculator.

@micmarty-deepsense micmarty-deepsense marked this pull request as ready for review May 23, 2024 08:36
@micmarty-deepsense micmarty-deepsense changed the title [WIP] Fix invalid evaluation doctype deduction Fix invalid evaluation doctype deduction May 24, 2024
CHANGELOG.md (outdated, resolved)