Commit

Updated tests
Signed-off-by: Maxim Lysak <[email protected]>
Maxim Lysak committed Oct 3, 2024
1 parent 5e5fd93 commit 44dcf83
Showing 22 changed files with 48 additions and 33 deletions.
8 changes: 4 additions & 4 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -38,7 +38,7 @@ torchvision = [
python = "^3.10"
pydantic = "^2.0.0"
docling-core = "^1.6.2"
-docling-ibm-models = {git = "https://github.com/DS4SD/docling-ibm-models.git", rev = "118d38a296ce8bab2150f0b23ce5087867f4e379" } # "dev/ahn_ts_migrate"}
+docling-ibm-models = {git = "https://github.com/DS4SD/docling-ibm-models.git", rev = "e92c3cef733d138da4d9e57f55750143b68c0f02" } # "dev/ahn_ts_migrate"}
deepsearch-glm = "^0.21.1"
filetype = "^1.2.0"
pypdfium2 = "^4.30.0"
1 change: 1 addition & 0 deletions tests/data/2203.01017v2.doctags.txt
@@ -1,4 +1,5 @@
<document>
<subtitle-level-1><location><page_1><loc_16><loc_85><loc_82><loc_87></location>TableFormer: Table Structure Understanding with Transformers.</subtitle-level-1>
<paragraph><location><page_1><loc_23><loc_78><loc_74><loc_82></location>Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research</paragraph>
<paragraph><location><page_1><loc_34><loc_77><loc_62><loc_78></location>{ ahn,nli,mly,taa } @zurich.ibm.com</paragraph>
<subtitle-level-1><location><page_1><loc_24><loc_71><loc_31><loc_73></location>Abstract</subtitle-level-1>
2 changes: 1 addition & 1 deletion tests/data/2203.01017v2.json

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions tests/data/2203.01017v2.md
@@ -1,3 +1,5 @@
## TableFormer: Table Structure Understanding with Transformers.

Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, Peter Staar IBM Research

{ ahn,nli,mly,taa } @zurich.ibm.com
2 changes: 1 addition & 1 deletion tests/data/2203.01017v2.pages.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions tests/data/2206.01062.doctags.txt
@@ -1,4 +1,5 @@
<document>
<subtitle-level-1><location><page_1><loc_17><loc_85><loc_83><loc_89></location>DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis</subtitle-level-1>
<paragraph><location><page_1><loc_15><loc_77><loc_32><loc_83></location>Birgit Pfitzmann IBM Research Rueschlikon, Switzerland [email protected]</paragraph>
<paragraph><location><page_1><loc_42><loc_77><loc_58><loc_83></location>Christoph Auer IBM Research Rueschlikon, Switzerland [email protected]</paragraph>
<paragraph><location><page_1><loc_68><loc_77><loc_85><loc_83></location>Michele Dolfi IBM Research Rueschlikon, Switzerland [email protected]</paragraph>
2 changes: 1 addition & 1 deletion tests/data/2206.01062.json

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions tests/data/2206.01062.md
@@ -1,3 +1,5 @@
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Birgit Pfitzmann IBM Research Rueschlikon, Switzerland [email protected]

Christoph Auer IBM Research Rueschlikon, Switzerland [email protected]
2 changes: 1 addition & 1 deletion tests/data/2206.01062.pages.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions tests/data/2305.03393v1.doctags.txt
@@ -1,4 +1,5 @@
<document>
<subtitle-level-1><location><page_1><loc_22><loc_81><loc_79><loc_85></location>Optimized Table Tokenization for Table Structure Recognition</subtitle-level-1>
<paragraph><location><page_1><loc_23><loc_74><loc_78><loc_79></location>Maksym Lysak [0000 - 0002 - 3723 - $^{6960]}$, Ahmed Nassar[0000 - 0002 - 9468 - $^{0822]}$, Nikolaos Livathinos [0000 - 0001 - 8513 - $^{3491]}$, Christoph Auer[0000 - 0001 - 5761 - $^{0422]}$, and Peter Staar [0000 - 0002 - 8088 - 0823]</paragraph>
<paragraph><location><page_1><loc_36><loc_70><loc_64><loc_73></location>IBM Research {mly,ahn,nli,cau,taa}@zurich.ibm.com</paragraph>
<paragraph><location><page_1><loc_27><loc_41><loc_74><loc_66></location>Abstract. Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table-structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML, LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. Popular table structure data-sets will be published in OTSL format to the community.</paragraph>
2 changes: 1 addition & 1 deletion tests/data/2305.03393v1.json

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions tests/data/2305.03393v1.md
@@ -1,3 +1,5 @@
## Optimized Table Tokenization for Table Structure Recognition

Maksym Lysak [0000 - 0002 - 3723 - $^{6960]}$, Ahmed Nassar[0000 - 0002 - 9468 - $^{0822]}$, Nikolaos Livathinos [0000 - 0001 - 8513 - $^{3491]}$, Christoph Auer[0000 - 0001 - 5761 - $^{0422]}$, and Peter Staar [0000 - 0002 - 8088 - 0823]

IBM Research {mly,ahn,nli,cau,taa}@zurich.ibm.com
2 changes: 1 addition & 1 deletion tests/data/2305.03393v1.pages.json

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions tests/data/redp5110.doctags.txt
@@ -1,12 +1,13 @@
<document>
<paragraph><location><page_1><loc_6><loc_59><loc_35><loc_63></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_1><loc_6><loc_52><loc_33><loc_56></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_1><loc_6><loc_45><loc_32><loc_49></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_1><loc_6><loc_3><loc_27><loc_5></location>ibm.com /redbooks</paragraph>
<paragraph><location><page_1><loc_47><loc_94><loc_68><loc_96></location>Front cover</paragraph>
<figure>
<location><page_1><loc_84><loc_93><loc_96><loc_97></location>
</figure>
<subtitle-level-1><location><page_1><loc_6><loc_79><loc_96><loc_89></location>Row and Column Access Control Support in IBM DB2 for i</subtitle-level-1>
<paragraph><location><page_1><loc_6><loc_59><loc_35><loc_63></location>Implement roles and separation of duties</paragraph>
<paragraph><location><page_1><loc_6><loc_52><loc_33><loc_56></location>Leverage row permissions on the database</paragraph>
<paragraph><location><page_1><loc_6><loc_45><loc_32><loc_49></location>Protect columns by defining column masks</paragraph>
<paragraph><location><page_1><loc_6><loc_3><loc_27><loc_5></location>ibm.com /redbooks</paragraph>
<paragraph><location><page_1><loc_81><loc_12><loc_95><loc_27></location>Jim Bainbridge Hernando Bedoya Rob Bestgen Mike Cain Dan Cruikshank Jim Denton Doug Mack Tom McKinley Kent Milligan</paragraph>
<figure>
<location><page_1><loc_51><loc_2><loc_95><loc_10></location>
2 changes: 1 addition & 1 deletion tests/data/redp5110.json

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions tests/data/redp5110.md
@@ -1,3 +1,10 @@
Front cover


<!-- image -->

## Row and Column Access Control Support in IBM DB2 for i

Implement roles and separation of duties

Leverage row permissions on the database
@@ -6,11 +13,6 @@ Protect columns by defining column masks

ibm.com /redbooks

Front cover


<!-- image -->

Jim Bainbridge Hernando Bedoya Rob Bestgen Mike Cain Dan Cruikshank Jim Denton Doug Mack Tom McKinley Kent Milligan


2 changes: 1 addition & 1 deletion tests/data/redp5110.pages.json

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions tests/data/redp5695.doctags.txt
@@ -1,4 +1,9 @@
<document>
<paragraph><location><page_1><loc_47><loc_96><loc_68><loc_99></location>Front cover</paragraph>
<figure>
<location><page_1><loc_67><loc_90><loc_93><loc_96></location>
</figure>
<subtitle-level-1><location><page_1><loc_7><loc_75><loc_88><loc_86></location>IBM Cloud Pak for Data on IBM Z</subtitle-level-1>
<paragraph><location><page_1><loc_7><loc_60><loc_20><loc_62></location>Jasmeet Bhatia</paragraph>
<paragraph><location><page_1><loc_7><loc_57><loc_20><loc_59></location>Ravi Gummadi</paragraph>
<paragraph><location><page_1><loc_7><loc_51><loc_21><loc_52></location>Srirama Sharma</paragraph>
@@ -14,10 +19,6 @@
<figure>
<location><page_1><loc_7><loc_3><loc_21><loc_8></location>
</figure>
<paragraph><location><page_1><loc_47><loc_96><loc_68><loc_99></location>Front cover</paragraph>
<figure>
<location><page_1><loc_67><loc_90><loc_93><loc_96></location>
</figure>
<figure>
<location><page_1><loc_24><loc_13><loc_99><loc_62></location>
</figure>
2 changes: 1 addition & 1 deletion tests/data/redp5695.json

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions tests/data/redp5695.md
@@ -1,3 +1,10 @@
Front cover


<!-- image -->

## IBM Cloud Pak for Data on IBM Z

Jasmeet Bhatia

Ravi Gummadi
@@ -14,11 +21,6 @@ Srirama Sharma
<!-- image -->


<!-- image -->

Front cover


<!-- image -->


2 changes: 1 addition & 1 deletion tests/data/redp5695.pages.json

Large diffs are not rendered by default.
