Strip formatting #1

bamnet · 2011-07-26T02:50:29Z

When I run the PDF tests I get output that looks like this:

Textractor returns the contents of pdf documents
Failure/Error: Textractor.text_from_path(fixture_path("document.pdf")).should == 'text'
expected: "text",
got: "text\t\r \302\240 \t\r \302\240" (using ==)

My pdftotext version must handle formatting characters differently from yours. Do you think this is something textractor should handle?

In my use case I never care about document formatting. I only want strings separated by spaces, with a limited subset of punctuation (i.e. periods and commas), for use in indexing documents. I don't mind handling this in each application, but I'd be glad to write it into textractor if you think there's value in that.
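For illustration, the application-side cleanup described above might look roughly like the sketch below. The method name `normalize_for_indexing` is hypothetical and not part of textractor, and the sketch assumes the extracted text is a valid UTF-8 string on Ruby 1.9 or later.

```ruby
# Hypothetical helper, not part of textractor's API.
# Assumes `text` is a valid UTF-8 string (Ruby 1.9+).
def normalize_for_indexing(text)
  text
    .gsub(/[^[:alnum:][:space:].,]/, " ") # keep letters, digits, whitespace, periods, commas
    .gsub(/[[:space:]]+/, " ")            # collapse whitespace runs (incl. U+00A0) to one space
    .strip                                # only ASCII spaces remain at the edges by now
end

puts normalize_for_indexing("Hello,\u00A0world!  (draft)\n")
```

Replacing unwanted punctuation with a space rather than deleting it keeps words on either side from being fused together, which matters for indexing.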

mguterl · 2011-07-29T15:05:27Z

I'm going to have to think about this some more. It looks like all of the extractors call String#strip to remove leading and trailing whitespace. Those characters appear to be whitespace of some kind (\302\240 is the UTF-8 encoding of a non-breaking space), and if that is the case I'm fine with coming up with a replacement for String#strip that grabs these characters too.
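A minimal sketch of what such a String#strip replacement could look like, assuming Ruby 1.9 or later and UTF-8 input. The module and method names are hypothetical and not taken from textractor. It relies on the POSIX bracket class `[[:space:]]`, which matches Unicode whitespace such as U+00A0 in UTF-8 strings, whereas String#strip at the time only removed ASCII whitespace and null characters.

```ruby
# Hypothetical helper, not part of textractor's API.
# Assumes the input is a valid UTF-8 string (Ruby 1.9+); on binary-encoded
# strings the \302\240 bytes would not be recognised as a single character.
module UnicodeStrip
  # Leading or trailing runs of Unicode whitespace, including U+00A0.
  EDGE_WHITESPACE = /\A[[:space:]]+|[[:space:]]+\z/

  def self.strip(text)
    text.gsub(EDGE_WHITESPACE, "")
  end
end

puts UnicodeStrip.strip("text\t\r \u00A0 \t\r \u00A0").inspect
```

Unlike a full normalization pass, this leaves interior whitespace untouched, so it stays a drop-in substitute wherever the extractors currently call String#strip.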
