Support unstructured encoder #1672
Force-pushed from ed97db2 to b7516d4
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1672      +/-   ##
==========================================
- Coverage   80.33%   79.93%   -0.41%
==========================================
  Files          95      120      +25
  Lines        6602     8452    +1850
==========================================
+ Hits         5304     6756    +1452
- Misses       1298     1696     +398

View full report in Codecov by Sentry.
Force-pushed from b7516d4 to 19ab738
Great PR! :)
    unstructure_kwargs: t.Dict[str, t.Any] = dc.field(default_factory=dict)

    def __post_init__(self):
        self.encoder = create_encoder(self.unstructure_kwargs)
What do these do? Convert a .pdf file to bytes and back?
Here I want to keep all of the source data for each element (page_num and text_type, e.g. title, text, table...).
If we don't want the bytes, we can convert the elements into a list of dicts, or merge them directly into a single text.
Do you have any suggestions? @blythed @kartik4949
- Add a tag to choose the output: element, json, or text
- Add a merge func?
- Only keep the source element
Yes, keeping all of that information is good.
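For illustration, the three output options floated above (element, json, text) could be sketched roughly as follows. This is only a sketch, not the PR's implementation: the `Element` class and the `convert` function are hypothetical stand-ins, with field names borrowed from the page_num and text_type mentioned in the comment.

```python
import dataclasses as dc
import typing as t


@dc.dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""

    text: str
    text_type: str  # e.g. "title", "text", "table"
    page_num: int


def convert(elements: t.Sequence[Element], output: str = "element"):
    """Return the elements in the form selected by the `output` tag."""
    if output == "element":
        # Keep the source elements untouched, metadata included.
        return list(elements)
    if output == "json":
        # One plain dict per element; page_num and text_type survive.
        return [dc.asdict(e) for e in elements]
    if output == "text":
        # Merge everything into a single string; metadata is dropped.
        return "\n\n".join(e.text for e in elements)
    raise ValueError(f"unknown output tag: {output!r}")
```

Keeping element as the default preserves all of the source information, and the json and text forms can always be derived from it later, whereas the reverse is not possible.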
def create_decoder(unstructure_kwargs):
    def decoder(b: bytes):
        if isinstance(b, str):
b: bytes? How can it be a str?
If we use encoder(uri=file_path), the uri (a str) will be passed in here.
What do you think about banning the uri=xxx form?
Our downloader processes the uri itself, so it is currently not very compatible with unstructured.
I see..
Done
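One way the restriction discussed in this thread could look is a decoder that accepts only bytes and fails loudly on a str uri. This is a sketch under assumptions, not the code that was actually pushed: `create_bytes_only_decoder` and its `parse` argument are hypothetical names, and `parse` stands in for whatever function turns a binary file object into elements.

```python
import io
import typing as t


def create_bytes_only_decoder(parse: t.Callable[[t.BinaryIO], t.Any]):
    """Build a decoder that only accepts raw bytes.

    `parse` is a hypothetical callable that reads a binary file object
    and returns the parsed result.
    """

    def decoder(b: bytes):
        if isinstance(b, str):
            # A str here would be an unresolved uri, which this sketch rejects
            # instead of silently treating it as a file path.
            raise TypeError("expected bytes, got str; uri=... input is not supported")
        return parse(io.BytesIO(b))

    return decoder
```

Raising early keeps the `b: bytes` annotation honest: the type check becomes a guard rather than a second code path.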
Force-pushed from 84f5a68 to 42cc873
Force-pushed from 8c6b293 to 3fe34a5