fix: disable table_as_cells output by default #3093
Merged
This PR changes the output of table elements: by default, a table element's `metadata.table_as_cells` is now `None`. The data is populated only when the environment variable `EXTRACT_TABLE_AS_CELLS` is set to `true`.

`table_as_cells` was originally designed for evaluating table extraction performance. The format itself is not as readable as the `table_as_html` metadata for human or RAG consumption, so this data is not needed by default.

Since this output is meant for evaluation, this PR uses an environment variable to control whether it is present in the partitioned results. This approach avoids adding parameters to the `partition` function call. Adding a new parameter to the `partition` interface would increase its complexity and add maintenance cost, since the parameter would have to be passed down a long chain of function calls to where it is needed.

Test
Run the same code snippet on main and on this PR. On the main branch, `table_cells` contains cell-structured data, but on this branch it is a list of `None`.
However, if we first set the variable in the terminal:
export EXTRACT_TABLE_AS_CELLS=true
and then run the same code again with this PR, `table_cells` contains actual data, the same as on the main branch.
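
The environment-variable gate described above can be illustrated with a minimal sketch. This is not the code in this PR: `table_as_cells_enabled` and `build_table_metadata` are hypothetical helpers, and treating the value case-insensitively is an assumption. Only the variable name `EXTRACT_TABLE_AS_CELLS` and the two metadata field names come from the description.

```python
import os

# Variable name taken from the PR description.
ENV_VAR = "EXTRACT_TABLE_AS_CELLS"


def table_as_cells_enabled(environ=None):
    """Return True only when the variable is set to "true".

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "false").strip().lower() == "true"


def build_table_metadata(html, cells, environ=None):
    """Hypothetical helper: always keep the HTML, keep cells only when enabled."""
    return {
        "table_as_html": html,
        "table_as_cells": cells if table_as_cells_enabled(environ) else None,
    }


if __name__ == "__main__":
    sample_cells = [{"row": 0, "col": 0, "content": "Header"}]
    # With the variable unset, table_as_cells is None.
    print(build_table_metadata("<table></table>", sample_cells, environ={}))
    # With the variable set to "true", the cell data is kept.
    print(build_table_metadata("<table></table>", sample_cells,
                               environ={ENV_VAR: "true"}))
```

Reading the flag at the point where the metadata is built is what lets the `partition` call signature stay unchanged: no intermediate function needs to know about the option.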