Support unstructured encoder #1672

Merged
merged 1 commit into from
Jan 17, 2024

jieguangzhou
Collaborator

Description

Related Issues

Checklist

  • Is this code covered by new or existing unit tests or integration tests?
  • Did you run make unit-testing and make integration-testing successfully?
  • Do new classes, functions, methods and parameters all have docstrings?
  • Were existing docstrings updated, if necessary?
  • Was external documentation updated, if necessary?

Additional Notes or Comments

@jieguangzhou jieguangzhou force-pushed the feat/pdf-encoder branch 2 times, most recently from ed97db2 to b7516d4 Compare January 15, 2024 09:43
@codecov-commenter
codecov-commenter commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (34830a7) 80.33% compared to head (3fe34a5) 79.93%.
Report is 1390 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1672      +/-   ##
==========================================
- Coverage   80.33%   79.93%   -0.41%     
==========================================
  Files          95      120      +25     
  Lines        6602     8452    +1850     
==========================================
+ Hits         5304     6756    +1452     
- Misses       1298     1696     +398     

Contributor

@kartik4949 kartik4949 left a comment

Great PR! :)

superduperdb/ext/unstructured/encoder.py (resolved)
superduperdb/ext/unstructured/encoder.py (outdated, resolved)
    unstructure_kwargs: t.Dict[str, t.Any] = dc.field(default_factory=dict)

    def __post_init__(self):
        self.encoder = create_encoder(self.unstructure_kwargs)
Collaborator

What do these do? Convert a .pdf file to bytes and back?

Collaborator Author

@jieguangzhou jieguangzhou Jan 16, 2024

Here I want to keep all of the source data for each element (page_num, text_type such as title, text, table, ...).
If we don't want the bytes, we can convert the elements into a list of dicts, or merge them directly into a single text.

Do you have any suggestions? @blythed @kartik4949

  1. Add a tag to choose the output: element / json / text
  2. Add a merge function?
  3. Only keep the source elements

Collaborator

Yes, keeping all of that information is good.
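
To make the three options in this thread concrete, here is a minimal sketch of how a single output tag could select between keeping the source elements, converting them to a list of dicts, or merging them into one text. It is not the code from this PR: the `Element` dataclass, its field names, and the functions `to_dicts`, `merge_text`, and `render` are all hypothetical stand-ins chosen for illustration.

```python
import dataclasses as dc
import typing as t


@dc.dataclass
class Element:
    """Hypothetical stand-in for one parsed document element."""

    text: str
    text_type: str  # e.g. "title", "text", "table"
    page_num: int


def to_dicts(elements: t.Sequence[Element]) -> t.List[t.Dict[str, t.Any]]:
    """Convert each element into a plain dict, keeping all metadata."""
    return [dc.asdict(e) for e in elements]


def merge_text(elements: t.Sequence[Element], sep: str = "\n\n") -> str:
    """Merge the text of all non-empty elements into a single string."""
    return sep.join(e.text for e in elements if e.text)


def render(elements: t.Sequence[Element], output: str = "element") -> t.Any:
    """Select the output form with a tag: 'element', 'json', or 'text'."""
    if output == "element":
        return list(elements)  # keep the source elements untouched
    if output == "json":
        return to_dicts(elements)
    if output == "text":
        return merge_text(elements)
    raise ValueError(f"unknown output tag: {output!r}")
```

Defaulting to `"element"` matches the preference stated above for keeping all of the source information; the lossy `"text"` form is something the caller has to ask for.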


def create_decoder(unstructure_kwargs):
    def decoder(b: bytes):
        if isinstance(b, str):
Contributor

b is annotated as bytes. How can it be a str?

Collaborator Author

If we use encoder(uri=file_path), the uri (a str) is passed in here.

What do you think about banning the uri=xxx form?

Our downloader is what processes URIs, and it is currently not very compatible with unstructured.

Contributor

I see..

Collaborator Author

Done
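
For readers following this thread, one way a bytes-only decoder could look after dropping the uri=xxx form is sketched below. This is an assumption about the shape of the change, not the merged code: `partition_fn` is a hypothetical injected callable standing in for whatever parsing call the encoder wraps, and the error message is illustrative.

```python
import io
import typing as t


def create_decoder(
    unstructure_kwargs: t.Dict[str, t.Any],
    partition_fn: t.Callable[..., t.Any],  # hypothetical: the parsing call to wrap
) -> t.Callable[[bytes], t.Any]:
    """Build a decoder that accepts raw file bytes only."""

    def decoder(b: bytes) -> t.Any:
        # A str here would mean a uri was passed through instead of file
        # content, so fail loudly rather than guess how to load it.
        if isinstance(b, str):
            raise TypeError("decoder expects bytes; uri=... input is not supported")
        # Wrap the bytes in a file-like object for the parsing call.
        return partition_fn(file=io.BytesIO(b), **unstructure_kwargs)

    return decoder
```

Rejecting `str` explicitly keeps the annotation `b: bytes` honest and leaves URI handling entirely to the downloader.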

@jieguangzhou jieguangzhou force-pushed the feat/pdf-encoder branch 2 times, most recently from 84f5a68 to 42cc873 Compare January 16, 2024 07:39
@jieguangzhou jieguangzhou merged commit ded416e into superduper-io:main Jan 17, 2024
2 checks passed
4 participants