Potential issue with eparse mode #6

Open
luca-simonetti opened this issue Apr 18, 2024 · 2 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@luca-simonetti

luca-simonetti commented Apr 18, 2024

I don't see how modes other than eparse and unstructured (namely digest and table-digest) could ever work with the code below.
Was the "digest" check supposed to be an elif of the eparse condition?

if eparse_mode == "eparse":
    text = str(table.iloc[:eparse_max_rows, :eparse_max_cols])
    if "digest" in eparse_mode:
        digest = get_table_digest(
            df_serialize_table(table),
            table_name=table_name,
        )
        if eparse_mode == "table-digest":
            text = (
                f"{table_name} is a spreadsheet table. This is "
                f"the head of the table:\n{table.head(eparse_max_rows)}\n"
                f"Summary: {digest}."
            )
        else:
            text = digest
elif eparse_mode == "unstructured":
    text = lxml.html.document_fromstring(html_text).text_content()
table = Table(text=text, metadata=metadata)
elements.append(table)
return elements
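
For illustration, here is a minimal sketch of the flat dispatch the question seems to be asking about, where every mode gets its own branch. It assumes that in the contrib module the "digest" check sits inside the `eparse_mode == "eparse"` branch, which is why the digest modes look unreachable. `build_text` and all of its parameters are hypothetical stand-ins (plain strings and a callable instead of the DataFrame, `get_table_digest`, and lxml calls), not part of eparse.

```python
from typing import Callable


def build_text(
    eparse_mode: str,
    table_name: str,
    table_repr: str,   # stand-in for str(table.iloc[...])
    table_head: str,   # stand-in for table.head(eparse_max_rows)
    get_digest: Callable[[], str],  # stand-in for the get_table_digest(...) call
    plain_text: str,   # stand-in for the lxml text_content() result
) -> str:
    """Pick the text for one table, with each mode as its own branch."""
    if eparse_mode == "eparse":
        return table_repr
    if eparse_mode == "unstructured":
        return plain_text
    if eparse_mode in ("digest", "table-digest"):
        # The digest is only computed when a digest mode is requested.
        digest = get_digest()
        if eparse_mode == "table-digest":
            return (
                f"{table_name} is a spreadsheet table. This is "
                f"the head of the table:\n{table_head}\n"
                f"Summary: {digest}."
            )
        return digest
    # Fail loudly instead of leaving the text undefined for an unknown mode.
    raise ValueError(f"unknown eparse_mode: {eparse_mode!r}")
```

With this shape, `build_text("digest", ...)` and `build_text("table-digest", ...)` each reach their own branch, and an unrecognized mode raises instead of falling through.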

@ChrisPappalardo
Owner

Hi, thanks for opening this issue and for your interest in eparse. The contrib function you're referencing needs to be reworked, as it was written for an old version of unstructured. In the meantime, you can either write your own control loop to identify and process xlsx files, or change lines 105 and 113 to use `in` instead of `==`. I'll rework this contrib module in a future release and fix these things as part of the update to the latest unstructured.
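
To make the `in` versus `==` suggestion concrete, here is a small sketch of which checks would pass for a given mode string. It rests on two assumptions that the thread does not confirm: that lines 105 and 113 are the `eparse_mode == "eparse"` and `eparse_mode == "table-digest"` comparisons from the snippet, and that the digest checks are nested under the eparse check. `branches_taken` and the combined mode string `"eparse-table-digest"` are hypothetical, used only for illustration.

```python
def branches_taken(eparse_mode: str, substring: bool) -> list:
    """Return which checks from the snippet would pass for a mode string.

    substring=False mimics the `==` comparisons; substring=True mimics
    replacing those two comparisons with `in`.
    """
    def matches(name: str) -> bool:
        return (name in eparse_mode) if substring else (eparse_mode == name)

    taken = []
    if matches("eparse"):                      # assumed to be line 105
        taken.append("eparse")
        if "digest" in eparse_mode:            # already a substring check
            taken.append("digest")
            if matches("table-digest"):        # assumed to be line 113
                taken.append("table-digest")
    elif eparse_mode == "unstructured":
        taken.append("unstructured")
    return taken


# With `==`, nothing under the eparse branch can also see a digest mode.
print(branches_taken("eparse", substring=False))
print(branches_taken("table-digest", substring=False))
# With `in`, a combined (hypothetical) mode string passes every check.
print(branches_taken("eparse-table-digest", substring=True))
```

Under these assumptions, exact comparison makes the nested digest branches unreachable, while substring matching lets one combined mode string select several behaviors at once.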

@luca-simonetti
Author

Thanks for your reply. I tried changing your code by putting the ifs in the right order (or at least what I think is the right order), but I think I actually need the vanilla eparse mode, so I'll wait for the next release.

ChrisPappalardo added the bug and good first issue labels on May 22, 2024