Fix invalid evaluation doctype deduction #3071
document_paths: Optional[list[str | Path]] = None,
ground_truth_paths: Optional[list[str | Path]] = None,
relative_document_paths: Optional[list[str | Path]] = None,
relative_ground_truth_paths: Optional[list[str | Path]] = None,
I renamed these arguments to reduce confusion by being more explicit about what they actually are: paths relative to self.documents_dir or self.ground_truths_dir.
Note that we may want to refactor the logic/design of this class at some point and make it use full paths instead, but only if we decide we don't like the design of providing the root dir and discovering files. I have no strong preference; it can stay as it is.
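To illustrate the design discussed in this comment, here is a minimal sketch of how relative paths might be resolved against a root directory, with file discovery as the fallback. The helper name `resolve_relative_paths` and its parameters are hypothetical and are not taken from this PR; the actual class may behave differently.

```python
from pathlib import Path
from typing import List, Optional, Union


def resolve_relative_paths(
    root_dir: Union[str, Path],
    relative_paths: Optional[List[Union[str, Path]]] = None,
) -> List[Path]:
    """Hypothetical helper: join relative paths onto a root directory.

    If no relative paths are given, discover every file under the root
    (this fallback is an assumption about the "discovering files" design).
    """
    root = Path(root_dir)
    if relative_paths is None:
        # Discovery mode: walk the root and collect all regular files.
        return sorted(p for p in root.rglob("*") if p.is_file())
    # Explicit mode: each entry is interpreted relative to the root.
    return [root / Path(p) for p in relative_paths]
```

With this shape, a caller passes only the parts below the root, for example `resolve_relative_paths("docs", ["s3/foobar.pdf.json"])`, which makes it clear why the `relative_` prefix on the argument names is helpful.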
unstructured/metrics/evaluate.py
connector = doc.parts[0] if len(doc.parts) > 1 else None
def _process_document(self, path: Path) -> list:
    filename = path.stem
    doctype = get_document_type(path)
Relevant change: setting the doctype in ElementTypeMetricsCalculator.
…thub.com/Unstructured-IO/unstructured into fix/invalid-evaluation-doctype-deduction
There was a bug in evaluate.py that caused the extensions of certain files to be detected incorrectly.
Evaluation files are expected to have two extensions, e.g. foobar.pdf.json, because they were partitioned first. The code broke when the file name contained more than 3 dots.
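As a rough illustration of the failure mode described above, the sketch below deduces the document type from the second-to-last extension rather than from a fixed position in the dot-separated name. The function `deduce_doctype` is hypothetical and is not the `get_document_type` implementation from this PR; it only shows one way to stay robust when the base name itself contains dots.

```python
from pathlib import Path


def deduce_doctype(path: Path) -> str:
    """Hypothetical helper: return the original document type of an evaluation file.

    Assumes evaluation files are named "<name>.<doctype>.<output-ext>",
    e.g. "foobar.pdf.json", where <name> may itself contain dots.
    """
    suffixes = path.suffixes  # e.g. [".pdf", ".json"]
    if len(suffixes) >= 2:
        # Count from the end so extra dots in the base name are ignored.
        return suffixes[-2].lstrip(".")
    # Only one extension (or none): fall back to the plain suffix.
    return path.suffix.lstrip(".")


# Indexing from the front breaks as soon as the base name contains dots:
naive = "report.v1.2.pdf.json".split(".")[1]
robust = deduce_doctype(Path("report.v1.2.pdf.json"))
```

Here `naive` picks up part of the base name, while `robust` still resolves to the extension that precedes the final one.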