Can't read Data from img #91

Closed
itlaosiji opened this issue Nov 6, 2017 · 2 comments
Labels
invalid Not a real issue on the library

Comments

itlaosiji commented Nov 6, 2017

I can't read the verification code from this website:
http://www.miitbeian.gov.cn/getVerifyCode?73
The sample image you provide works fine, but this one doesn't. I need your help, please.

You can save this website's verification code as XXX.JPG.

itlaosiji changed the title from "Can't read Data from png" to "Can't read Data from img" Nov 6, 2017
thiagoalessio (Owner) commented

hehehe, trying to break captchas, my friend ;D
Using this repo's issues for that is a bit out of scope, but here it goes:

Given the original picture:

[image: original]

Cleaning

You'll need ImageMagick (or something similar) to clean up the noise in the picture before sending it to Tesseract.
You can do that by playing around with some filters (a sequence of modulate, contrast-stretch and gaussian-blur) to minimize, or even get rid of, the thin strokes that compromise text recognition. For example, this command:

$ convert -colorspace gray -modulate 120 -contrast-stretch 10%x80% -modulate 140 -gaussian-blur 1 -contrast-stretch 5%x50% +repage -negate -gaussian-blur 4 -negate -modulate 130 original.jpeg clean.jpeg

would give you the following image:

[image: clean]

Recognizing

Now pass the clean image to tesseract:

echo (new TesseractOCR('clean.jpeg'))->run();
// outputs 655V,3A

There is an undesired comma (,) in the output because the cleaning wasn't 100% perfect.
But since you know that this particular captcha consists only of digits and uppercase letters, you can give that hint to Tesseract, which makes the recognition more effective:

echo (new TesseractOCR('clean.jpeg'))->whitelist(range(0, 9), range('A', 'Z'))->run();
// outputs 655V3A
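
To make the hint above a bit more concrete, here is a minimal sketch in plain PHP of the character set that those two range() calls describe, plus a sanity check you could run on whatever Tesseract returns. The helper name containsOnly is hypothetical and not part of the library, and the sketch assumes the allowed set is exactly 0-9 and A-Z as described above.

```php
<?php
// The same two ranges passed to whitelist() above, flattened into one string:
// "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
$allowed = implode('', array_merge(range(0, 9), range('A', 'Z')));

// Hypothetical helper (not part of the library): returns true when $text is
// non-empty and every character in it belongs to $allowed.
function containsOnly(string $text, string $allowed): bool
{
    // strspn() counts how many leading characters of $text are in $allowed;
    // if that count equals the full length, nothing unexpected slipped in.
    return $text !== '' && strspn($text, $allowed) === strlen($text);
}

var_dump(containsOnly('655V3A', $allowed));  // bool(true)
var_dump(containsOnly('655V,3A', $allowed)); // bool(false), the comma is not allowed
```

A check like this only tells you that the output has the expected shape, not that the characters were recognized correctly.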

And there you have it ... But I have to tell you, it will not work every time. So make sure you collect a large number of captchas from this source, build the best cleaning sequence of filters you can, and prepare your code to keep trying new captchas until it succeeds.

itlaosiji (Author) commented

Thank you very much.

@thiagoalessio thiagoalessio added the invalid Not a real issue on the library label Feb 18, 2020
Repository owner locked as off-topic and limited conversation to collaborators Feb 18, 2020