chore: Change PDF text extraction logics #10420

filzrev · 2024-11-23T08:42:02Z

This PR is intended to reduce diffs when running snapshot tests.

What's changed in this PR

1. Use ContentOrderTextExtractor.GetText instead of the Text property

The Text property does not return newline characters,
which causes snapshot diffs between lines.

The ContentOrderTextExtractor.GetText API returns human-readable text content instead.

2. Normalize ligature chars

When extracting text from a PDF file,
some characters are combined into ligature characters,
so this PR replaces those characters manually.

Normally they would be replaced by the string.Normalize method, but it does not work as expected in Globalization Invariant Mode.
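
As a rough illustration of the manual replacement described above, here is a minimal sketch. It is not the code from this PR: the `LigatureNormalizer` class, the `Expand` method, and the exact set of ligatures handled are assumptions made for the example. It covers only the five Latin f-ligatures (U+FB00 to U+FB04) and uses plain string replacement, so it does not rely on string.Normalize or on any culture data.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not taken from the PR.
public static class LigatureNormalizer
{
    // Latin f-ligatures from the Unicode "Alphabetic Presentation Forms" block.
    // The set actually handled by the PR may differ.
    private static readonly (string Ligature, string Expanded)[] Map =
    {
        ("\uFB00", "ff"),
        ("\uFB01", "fi"),
        ("\uFB02", "fl"),
        ("\uFB03", "ffi"),
        ("\uFB04", "ffl"),
    };

    // Replaces each known ligature code point with its plain-letter sequence.
    public static string Expand(string text)
    {
        var builder = new StringBuilder(text);
        foreach (var (ligature, expanded) in Map)
        {
            builder.Replace(ligature, expanded);
        }
        return builder.ToString();
    }
}
```

For example, `LigatureNormalizer.Expand("con\uFB01g")` returns `"config"`. An explicit table like this only handles the code points it lists, which is usually acceptable for keeping snapshot text stable.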

filzrev and others added 3 commits November 23, 2024 17:35

chore: change PDF text extraction logics

c2df7e0

test(snapshot): update snapshots c2df7e0

8798a75

Merge branch 'main' into chore-change-pdf-text-extraction-logics

3bd9837

yufeih approved these changes Nov 26, 2024

filzrev and others added 2 commits November 27, 2024 10:45

Merge remote-tracking branch 'upstream/main' into chore-change-pdf-text-extraction-logics

a124734

test(snapshot): update snapshots a124734

d2ccb04

yufeih merged commit 8704b50 into dotnet:main Nov 27, 2024
1 check passed