ENH: Add PDF with Arab text (#13)
Add habibi.pdf from py-pdf/pypdf#1111
Add sample pdf with alternate CMap structure
dkg authored Jul 17, 2022
1 parent 3176390 commit 200644f
Showing 5 changed files with 64 additions and 0 deletions.
37 changes: 37 additions & 0 deletions 015-arabic/README.md
@@ -0,0 +1,37 @@
# Arabic script for testing text extraction

`habibi.pdf` was generated using weasyprint 54.1-3 on Debian unstable in July 2022, using the following command:

```bash
weasyprint habibi.html habibi.pdf
```

See also https://github.com/py-pdf/PyPDF2/issues/1111

# CMap Structure

`habibi-oneline-cmap.pdf` is the same file, except that the `beginbfchar` stanza of the `ToUnicode` CMap separates its `<srcString> <dstString>` pairs with ASCII spaces rather than newlines. That is, where `habibi.pdf` contains:

```
6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar
```

`habibi-oneline-cmap.pdf` contains:

```
6 beginbfchar
<0003> <> <03f2> <> <0392> <> <03f4> <> <02f4> <> <03a3> <062d064e0628064a0628064a0020>
endbfchar
```

Otherwise the two files are identical.

I believe text extraction should behave the same way on both files.
From what I understand of the PDF specification, they are syntactically equivalent, because any run of white-space characters (space or newline) acts as the same token separator.
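
To illustrate the equivalence, here is a minimal sketch of a whitespace-agnostic `bfchar` parser. It is not how any particular library parses CMaps. The names `parse_bfchar`, `MULTILINE` and `ONELINE` are made up for this example, and the parser assumes well-formed hex strings with an even number of digits and no whitespace inside the angle brackets.

```python
import re

# Hypothetical helper patterns for this sketch only.
# \s* between the two hex strings accepts spaces, tabs, or newlines alike.
PAIR_RE = re.compile(r"<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>")
BLOCK_RE = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Map each source code to its Unicode string (dstString read as UTF-16BE)."""
    mapping = {}
    for block in BLOCK_RE.findall(cmap_text):
        for src, dst in PAIR_RE.findall(block):
            # An empty dstString (<>) decodes to the empty string.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


# The two stanzas quoted above, copied verbatim.
MULTILINE = """6 beginbfchar
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar"""

ONELINE = """6 beginbfchar
<0003> <> <03f2> <> <0392> <> <03f4> <> <02f4> <> <03a3> <062d064e0628064a0628064a0020>
endbfchar"""

if __name__ == "__main__":
    # Both layouts should yield the same mapping.
    print(parse_bfchar(MULTILINE) == parse_bfchar(ONELINE))
```

Under these assumptions both stanzas produce the same six-entry mapping, and code `0x03a3` maps to the Arabic word from `habibi.html` followed by a space.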
Binary file added 015-arabic/habibi-oneline-cmap.pdf
Binary file not shown.
9 changes: 9 additions & 0 deletions 015-arabic/habibi.html
@@ -0,0 +1,9 @@
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<title>habibi</title>
</head>
<body>
<div>حَبيبي habibi</div>
</body>
</html>
Binary file added 015-arabic/habibi.pdf
Binary file not shown.
18 changes: 18 additions & 0 deletions files.json
@@ -152,6 +152,24 @@
"pages": 4,
"images": 0,
"forms": 0
},
{
"path": "015-arabic/habibi.pdf",
"producer": "weasyprint-54.1",
"creation_date": "2022-07-17T13:00:00",
"encrypted": false,
"pages": 1,
"images": 0,
"forms": 0
},
{
"path": "015-arabic/habibi-oneline-cmap.pdf",
"producer": "weasyprint-54.1+manual",
"creation_date": "2022-07-17T13:00:00",
"encrypted": false,
"pages": 1,
"images": 0,
"forms": 0
}
]
}
