When I run the PDF tests, I get output that looks like this:
Textractor returns the contents of pdf documents
Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
expected: "text",
got: "text\t\r \302\240 \t\r \302\240" (using ==)
My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?
In my use case I never care about the document formatting; I only want strings separated by spaces, with a limited subset of punctuation (i.e. periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
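To make the indexing use case concrete, here is a minimal sketch of the kind of normalization described above. It assumes Ruby 1.9 or later and a valid UTF-8 string; `IndexText.normalize` is a hypothetical name for illustration, not something textractor provides.

```ruby
# Hypothetical helper for illustration only; not part of textractor.
module IndexText
  # Keep letters, digits, whitespace, periods, and commas.
  # Everything else becomes a space so adjacent words stay separated.
  UNWANTED = /[^[:alnum:][:space:].,]/

  # [[:space:]] is Unicode-aware in Ruby 1.9+, so it also matches
  # the non-breaking space (U+00A0) that pdftotext can emit.
  WHITESPACE_RUN = /[[:space:]]+/

  def self.normalize(text)
    cleaned = text.gsub(UNWANTED, " ")
    collapsed = cleaned.gsub(WHITESPACE_RUN, " ")
    # Only ASCII spaces remain at this point, so String#strip is enough.
    collapsed.strip
  end
end

puts IndexText.normalize("Hello,\u00A0world!\t\r Foo-bar.")
```

Replacing unwanted characters with a space rather than deleting them is a design choice: it avoids gluing together words that were separated only by punctuation such as a hyphen.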
I'm going to have to think about this some more. It looks like all of the extractors are calling String#strip to remove trailing whitespace. Those characters appear to just represent whitespace of some type, and if that is the case I'm fine with coming up with a replacement for String#strip that grabs these characters too.
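For reference, the bytes `\302\240` in the failing output are the UTF-8 encoding of a non-breaking space (U+00A0), which String#strip does not remove. One possible shape for such a replacement is sketched below. It assumes Ruby 1.9 or later and that the extractor output is a valid UTF-8 string; `WhitespaceStrip` is a hypothetical name, not existing textractor code.

```ruby
# Hypothetical replacement for String#strip; not part of textractor.
module WhitespaceStrip
  # Leading or trailing runs of whitespace. The POSIX class [[:space:]]
  # is Unicode-aware in Ruby 1.9+, so it matches U+00A0 as well as
  # tabs, carriage returns, newlines, and plain spaces.
  EDGE_WHITESPACE = /\A[[:space:]]+|[[:space:]]+\z/

  def self.strip(text)
    text.gsub(EDGE_WHITESPACE, "")
  end
end

# The string reported in the failing spec, written with a Unicode escape.
raw = "text\t\r \u00A0 \t\r \u00A0"
puts WhitespaceStrip.strip(raw).inspect
```

Interior whitespace is left untouched, so this behaves like String#strip apart from the wider definition of whitespace. If the extractor output arrives tagged as binary rather than UTF-8, the pattern would not match the two-byte sequence, and the encoding would need to be set first.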