From a65b410ea4a24fde30cb194c1a808cd117ed16bf Mon Sep 17 00:00:00 2001
From: Martin Thoma
Date: Sun, 24 Apr 2022 16:29:40 +0200
Subject: [PATCH] DOC: More details on text parsing issues (#815)

---
 docs/user/extract-text.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/user/extract-text.md b/docs/user/extract-text.md
index bf6c5fa73f..3a6cdda378 100644
--- a/docs/user/extract-text.md
+++ b/docs/user/extract-text.md
@@ -21,7 +21,14 @@ clear answer what the expected result should look like:
 3. **Outlines**: Should outlines be extracted at all?
 4. **Formatting**: If text is **bold** or *italic*, should it be included in the
    output?
-5. **Captions**: Should image and table captions be included?
+5. **Tables**: Should the text extraction skip tables? Should it extract just the
+   text? Should the borders be shown in some Markdown-like way or should the
+   structure be present e.g. as an HTML table? How would you deal with merged
+   cells?
+6. **Captions**: Should image and table captions be included?
+7. **Ligatures**: The Unicode symbol [U+FB00](https://www.compart.com/de/unicode/U+FB00)
+   is a single symbol ﬀ for two lowercase letters 'f'. Should that be parsed as
+   the Unicode symbol 'ﬀ' or as two ASCII symbols 'ff'?
 
 Then there are issues where most people would agree on the correct output, but
 the way PDF stores information just makes it hard to achieve that: