Input data preprocessing to remove noise #55

lfoppiano · 2021-04-23T02:11:17Z

I just found the following problem, although since the data is extracted from a PDF I'm not sure this is the right place to fix it.

The following DOI: 10.1063/1.1905789͔ comes out with a nasty 9͔ ...

Although I think this is not glutton lookup's responsibility, a small pre-processing step that strips such junk characters could be nice anyway.

Update: I've checked, and since we look up by DOI directly from LMDB, the matching is rather strict (we already lowercase).
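
To illustrate the kind of small pre-processing meant here, a minimal sketch in Python (not the actual glutton or GROBID code; `clean_doi` is a hypothetical helper name). It assumes the noise consists of Unicode combining marks, such as the U+0354 attached to the trailing 9 above, or of control/format characters, and that removing them before the strict LMDB lookup is acceptable:

```python
import unicodedata


def clean_doi(raw: str) -> str:
    """Hypothetical helper: strip PDF-extraction noise from a DOI string."""
    kept = [
        ch
        for ch in raw
        # Drop combining marks (categories M*) and control/format/unassigned
        # characters (categories C*); keep everything else untouched.
        if unicodedata.category(ch)[0] not in ("M", "C")
    ]
    # Trim surrounding whitespace and lowercase, matching the lookup's
    # existing lowercasing.
    return "".join(kept).strip().lower()


if __name__ == "__main__":
    noisy = "10.1063/1.1905789\u0354"  # the DOI from this issue
    print(clean_doi(noisy))
```

This only removes characters by Unicode category; it does not validate the DOI syntax, so a malformed DOI would still reach the lookup and simply fail to match.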

lfoppiano added the enhancement label Apr 23, 2021

lfoppiano mentioned this issue Apr 23, 2021

improving methods that clean doi for consolidation kermitt2/grobid#750

Merged

kermitt2 mentioned this issue Sep 5, 2021

Sanity check for field request #60

Open
