Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update parser with bytesio interface #32

Merged
merged 3 commits into from
Aug 14, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions docling/backend/docling_parse_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -150,10 +150,11 @@ def __init__(self, path_or_stream: Union[BytesIO, Path]):
super().__init__(path_or_stream)
self._pdoc = pdfium.PdfDocument(path_or_stream)
# Parsing cells with docling_parser call
if isinstance(path_or_stream, BytesIO):
raise NotImplemented("This backend does not support byte streams yet.")
parser = pdf_parser()
self._parser_doc = parser.find_cells(str(path_or_stream))
if isinstance(path_or_stream, BytesIO):
self._parser_doc = parser.find_cells_from_bytesio(path_or_stream)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a check that the file exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not here, this is too internal.
the check if the file exists should be in the DocumentConversionInput. It is already in place, https://github.com/DS4SD/docling/blob/main/docling/datamodel/document.py#L99

self._parser_doc = parser.find_cells(str(path_or_stream))

def page_count(self) -> int:
return len(self._parser_doc["pages"])
Expand Down
103 changes: 78 additions & 25 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ pydantic-settings = "^2.3.0"
huggingface_hub = ">=0.23,<1"
requests = "^2.32.3"
easyocr = { version = "^1.7", optional = true }
docling-parse = "^0.0.1"
docling-parse = "^0.2.0"
certifi = ">=2024.7.4"

[tool.poetry.group.dev.dependencies]
Expand Down
Loading