Support unstructured encoder #1672

jieguangzhou · 2024-01-15T09:42:10Z

Description

Related Issues

Checklist

Is this code covered by new or existing unit tests or integration tests?
Did you run make unit-testing and make integration-testing successfully?
Do new classes, functions, methods and parameters all have docstrings?
Were existing docstrings updated, if necessary?
Was external documentation updated, if necessary?

Additional Notes or Comments

codecov-commenter · 2024-01-15T09:51:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (34830a7) 80.33% compared to head (3fe34a5) 79.93%.
Report is 1390 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1672      +/-   ##
==========================================
- Coverage   80.33%   79.93%   -0.41%     
==========================================
  Files          95      120      +25     
  Lines        6602     8452    +1850     
==========================================
+ Hits         5304     6756    +1452     
- Misses       1298     1696     +398


superduperdb/ext/unstructured/encoder.py

kartik4949

Great PR! :)

superduperdb/ext/unstructured/encoder.py

blythed · 2024-01-15T18:06:39Z

superduperdb/ext/unstructured/encoder.py

+    unstructure_kwargs: t.Dict[str, t.Any] = dc.field(default_factory=dict)
+
+    def __post_init__(self):
+        self.encoder = create_encoder(self.unstructure_kwargs)


What do these do? Convert a .pdf file to bytes and back?

Here I want to keep all of the source data for each element: page_num and text_type (title, text, table, ...).
If we don't want the bytes, we can convert the elements into a list of dicts, or merge them directly into a single text.

Do you have any suggestions? @blythed @kartik4949

Add a tag to choose the output format: element / json / text

Add a merge function?

Only keep the source elements

Yes, keeping all of that information is good.
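
The options raised in this thread (keep the source elements, convert them to a list of dicts, or merge them into one text) can be sketched roughly as below. This is only an illustration, not the code in this PR: `Element`, `elements_to_dicts`, and `merge_elements` are hypothetical names, and the dataclass is a stand-in for whatever element objects the parser really returns. Its fields just mirror the metadata mentioned above (page number and text type).

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""

    text: str
    text_type: str  # e.g. "title", "text", "table"
    page_num: Optional[int] = None


def elements_to_dicts(elements: List[Element]) -> List[Dict[str, Any]]:
    """'json' option: keep every field, but as plain dicts."""
    return [asdict(e) for e in elements]


def merge_elements(elements: List[Element], sep: str = "\n\n") -> str:
    """'text' option: drop the metadata and join the non-empty texts."""
    return sep.join(e.text for e in elements if e.text)


elements = [
    Element("Quarterly report", "title", 1),
    Element("Revenue grew in Q3.", "text", 1),
]
as_dicts = elements_to_dicts(elements)
as_text = merge_elements(elements)
```

Keeping the elements as the stored form and deriving the dict or text views on demand loses nothing, which seems to be the direction the reply above agrees with.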

kartik4949 · 2024-01-15T18:09:23Z

superduperdb/ext/unstructured/encoder.py

+
+def create_decoder(unstructure_kwargs):
+    def decoder(b: bytes):
+        if isinstance(b, str):


b: bytes?
How can it be a str?

If we use encoder(uri=file_path), the uri (a str) is passed in here.

What do you think about banning the uri=xxx form?

Because our downloader processes the uri, it is currently not very compatible with unstructured.
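
One way to handle the `bytes`-versus-`str` question raised above is to normalize the input before parsing. The sketch below is an assumption about how that could look, not the PR's implementation: `make_path_or_bytes_decoder` and `parse_bytes` are hypothetical names, and a `str` is assumed to be a local file path (as in `encoder(uri=file_path)`).

```python
from pathlib import Path
from typing import Any, Callable, Union


def make_path_or_bytes_decoder(
    parse_bytes: Callable[[bytes], Any],
) -> Callable[[Union[bytes, str]], Any]:
    """Build a decoder that accepts raw bytes or a local file path.

    `parse_bytes` is a hypothetical callable that turns raw bytes
    into parsed content; the caller supplies it.
    """

    def decoder(b: Union[bytes, str]) -> Any:
        if isinstance(b, str):
            # Assumption: a str is a path on the local filesystem.
            b = Path(b).read_bytes()
        return parse_bytes(b)

    return decoder
```

With this shape the type hint is honest about both inputs. Banning `uri=xxx` instead, as suggested above, would let the signature stay `b: bytes`.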

jieguangzhou force-pushed the feat/pdf-encoder branch 2 times, most recently from ed97db2 to b7516d4 Compare January 15, 2024 09:43

jieguangzhou force-pushed the feat/pdf-encoder branch from b7516d4 to 19ab738 Compare January 15, 2024 15:43

jieguangzhou requested review from blythed and kartik4949 January 15, 2024 15:44

kartik4949 reviewed Jan 15, 2024


kartik4949 suggested changes Jan 15, 2024


blythed reviewed Jan 15, 2024


kartik4949 reviewed Jan 15, 2024


jieguangzhou force-pushed the feat/pdf-encoder branch 2 times, most recently from 84f5a68 to 42cc873 Compare January 16, 2024 07:39

jieguangzhou requested a review from kartik4949 January 16, 2024 12:06

jieguangzhou force-pushed the feat/pdf-encoder branch from 8c6b293 to 3fe34a5 Compare January 17, 2024 07:11

Support unstructured encoder

3fe34a5

kartik4949 approved these changes Jan 17, 2024


jieguangzhou merged commit ded416e into superduper-io:main Jan 17, 2024
2 checks passed


jieguangzhou commented Jan 15, 2024

codecov-commenter commented Jan 15, 2024

kartik4949 left a comment

blythed Jan 15, 2024

jieguangzhou Jan 16, 2024

blythed Jan 16, 2024

kartik4949 Jan 15, 2024

jieguangzhou Jan 16, 2024

kartik4949 Jan 16, 2024

jieguangzhou Jan 17, 2024
