switching to submodules #15

alexcohn · 2019-07-03T09:57:22Z

In my fork, I am using submodules for tesseract 4.0, for leptronica, and for libpng.

I have also cleaned up the build scripts, and fixed the long-awaiting tests for leptronica integration.

nativeWriteBitmap() changes the source pix data. The fix is in test code, because changing the C++ beavior may break some code that relies on it, or call pixEndianByteSwap() to return to original byte order after write. rmtheis/tess-two#228 (comment)

cherry-pick readfile.cpp that always produces 32-bit pix from renard314 source: https://github.com/renard314/tess-two/blob/bb8b785d79810cbc97dbe5147a75c90714b2abca/tess-two/jni/com_googlecode_leptonica_android/readfile.cpp

see discussion at rmtheis/tess-two#264

default quality (0 is translated to 75) produces too much noise, use 85 is much better

…n tighten the test macth threshold

Robyer · 2019-07-03T11:06:47Z

@alexcohn Hi, thanks for the PR and your work on debugging and fixing some of the old bugs! :)

I don't want to merge this as one whole PR, it would be better to separate it into smaller more specific changes. Also I see some issues with some of the changes, so we can discuss them first.

Robyer · 2019-07-03T11:13:13Z

I think that by using submodule for tesseract you missed some changes that were made to the sources here. Namely:

alexcohn · 2019-07-03T11:28:18Z

If I understand correctly, the 21bb9dc has already been fixed upstream and we only need to point to 4.1.0 or even to some fork of tesseract.

It's super easy to switch a submodule to a private fork and back; it's much harder to maintain copies of few big repos, if you ever expect to move your reference vis a vis the peer repo.

I believe that this how Tesseract4Android became a clone, not a fork of tess-two. And now I see that tesseract itself is struggling to be up-to-date with the new Android build system. In my opinion, submodules maintenance is easier in the long run.

Robyer

We should only add the relevant changes into code, not replace the whole file from different project.

When replacing the file, there are these issues:

changes includes block
reverts various code improvements, like return values, NULL->nullptr, etc. This should be applied to new added code too.
breaks file formatting

Also I'm not sure if renand's version is best solution, as he (if I understand correctly the code) brings performance drop by iterating over the pix twice by calling the pixEndianByteSwap afterwards. I think it might be rewritten in a way that the endianness is changed in place during the single for cycle.

Robyer · 2019-07-03T11:35:32Z

@alexcohn Yes, I'm not against submodules per se, I'm rather mentioning the problem current PR brings. I think submodule's code can be changed in place by having some local branch. I'm not sure if the changes must be pushed to separate repo clone or if they stay in this repository (that would be better). I haven't used submodules in the past, that's why I created this repo by copying sources, but I agree it would be good to use them (if we solve the custom code changes).

And I would wait before upgrading to 4.1.0 (or 5.0.0, I think they raised the version already) and keep using stable versions for this library. That's why previously I just cherry-picked the single commit.

Robyer

Why do you want to add tesseract_exe if this is Android only library?

Robyer

I read your comments in Leptonica issues and I understood them that problem with PNG (with <4 channels) and JPEG (which always have <4 channels), is the same - that we should ignore the 4th byte. And I expect they are represented in the Pix format same way (same byte order, etc.), aren't they? So why there is different processing for JPEG and for PNG? Why don't we ignore the 4th byte same way?

Also I just realize that your code works correctly only with spp == 3, not with spp < 4, because you always ignore only 4th byte, right? So to handle spp == 2 correctly, the code must be modified.

So perhaps for the comparison we should first extract the all channels into separate variables and then compare the variables itself (based on how many channels there are, and ignore the others). That way you would also remove the format condition (as that wouldn't matter anymore).

Or am I missing something?

alexcohn · 2019-07-03T11:53:10Z

Why do you want to add tesseract_exe if this is Android only library?

Actually, this was my motivation to begin this adventure. It all began when I came across this post and then I understood that the cleanest way to build the executable (for Android) today is to build upon your project. As you can see, the change is trivial, and it is compatile with Android Sudio, and works with the latest NDK smoothly.

On the way, I found few other issues (namely, the Java tests), that I decided to fix as well.

I believe that having an easy way to produce an up-to-date command-line executable makes tests much easier to maintain.

alexcohn · 2019-07-03T11:56:40Z

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Robyer

I think we should expose the quality argument to Java code, to make it explicit for tests (or other usages) what quality will be used.

alexcohn · 2019-07-03T11:59:10Z

Working with submodules and cherry-picking some changes from upstream requires a writable branch in the repo. On GitHub, it's easily achieved by forking the primary repo.

alexcohn · 2019-07-03T12:03:25Z

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

Robyer

Idea of app module was to have some very simple example project so people would know how to use the library (in issues can be seen that few users struggled with that), but I haven't got time for that.

Robyer

Ignoring whole .idea/ folder isn't good idea as we lose some project specific settings that should be shared, like code formatting or run configurations, etc. You see that there are already specified some .idea/xxx files to be ignored, so I'd rather add what files should be explicitly ignored.

Robyer · 2019-07-03T12:58:38Z

Actually, this was my motivation to begin this adventure. (...)

On the way, I found few other issues (namely, the Java tests), that I decided to fix as well.

I believe that having an easy way to produce an up-to-date command-line executable makes tests much easier to maintain.

Sorry, I still don't fully understand. You wanted to compile own tesseract.exe for windows and it was easier to do from this project than from the original tesseract repository? Or did you just needed it for testing the fixes in this library?

I mean do you think someone else will benefit from this ability of compiling exe? Normally users would download Windows binary from https://github.com/UB-Mannheim/tesseract/wiki , and if it's developer who want to edit tesseract source, then they definitely building of the project itself. So I still don't see real need for having this build target here (otherwise we would have to add also linux or mac building too).

alexcohn · 2019-07-03T12:58:40Z

It woud be great to add a simple demo app, but until it's there, an empty app only causes confusion. For me, it did ;)

Robyer · 2019-07-03T13:01:27Z

Working with submodules and cherry-picking some changes from upstream requires a writable branch in the repo. On GitHub, it's easily achieved by forking the primary repo.

Ok, so I should create fork of Tesseract and create some branch with my changes, then we can use that branch as source for the submodule in this library? I'll do that later then.

alexcohn · 2019-07-03T13:03:24Z

tessetact_exe target produces an Android binary named tesseract in Tesseract4Android/tesseract4android/.externalNativeBuild/cmake/debug/arm64-v8a/tesseract directory (and all the other flavours). I could not use tesseract target now because that would conflict with the shared library target, which builds libtesseract.so

alexcohn · 2019-07-03T13:06:02Z

Ok, so I should create fork of Tesseract and create some branch with my changes, then we can use that branch as source for the submodule in this library? I'll do that later then.

Exactly. That's what I did for leptonica. But in the end, I don't need my fork now, because the necessary changes to Java tests have worked around the alpha channel issue. Though, the change there is still worth merging, but it's not a showstopper anymore.

alexcohn · 2019-07-03T13:12:28Z

Whether .idea should be in VCS, is an unsettled holiwar. In this particular case, I see that all configurations are actually fully recovered by Android Studio, but there is also no strong reasons to remove them from Git.

Robyer · 2019-07-03T13:13:19Z

tessetact_exe target produces an Android binary named tesseract in Tesseract4Android/tesseract4android/.externalNativeBuild/cmake/debug/arm64-v8a/tesseract directory (and all the other flavours). I could not use tesseract target now because that would conflict with the shared library target, which builds libtesseract.so

Aha, so only the name with _exe is misleading as it doesn't produce any tesseract.exe file, but normal binary for Android? I see that you use src/src/api/tesseractmain.cpp there, so you are compiling tesseract as console application that could be used from Android or are you just compiling static library of tesseract to be used in some other library/application?

In case of static library, do you need to include that main file to compile it? Also if it's just static binary (not console app), I would use tesseract_static name instead of tesseract_exe.

If it's console app that you use via some Android Terminal, then I finally understand :-)

alexcohn · 2019-07-03T13:14:47Z

I haven't studied the details of 2-spp Pix, so I cannot tell what would the right fix there,

Robyer · 2019-07-03T13:17:20Z

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Aren't you just hiding compression in this particular case, because this is black/white image? If we would have some normal photograph, would your special treatment work correctly too?

Robyer · 2019-07-03T13:25:52Z

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

That method is for saving any Pix into file, right? That could have multiple use-cases for users. For example they preprocess the image they have in Pix, and then they want to save the processed image for some later use. Basically any usage of the file when they don't want to convert it back to Bitmap first, but rather directly save to file.

alexcohn · 2019-07-03T15:18:10Z

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Aren't you just hiding compression in this particular case, because this is black/white image? If we would have some normal photograph, would your special treatment work correctly too?

The scope of this comparison method is for testing. The change may still show difference if some other image is chosen, but no, an arbitrary photograph will not behave worse than black/white synthetic image.

We don't have a greneral-purpose comparison method, but now, with ssp and inputFormat exposed, users of Java API can decide what accuracy they should expect,

Robyer · 2019-07-03T15:44:10Z

We don't have a greneral-purpose comparison method, but now, with ssp and inputFormat exposed, users of Java API can decide what accuracy they should expect,

Yes, but my point is: Do we really need the special handling of JPEG? Can't we handle comparison of PNGs (with ssp<4) and JPEGs same way with same block of code?

alexcohn · 2019-07-12T06:43:57Z

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

That method is for saving any Pix into file, right? That could have multiple use-cases for users. For example they preprocess the image they have in Pix, and then they want to save the processed image for some later use. Basically any usage of the file when they don't want to convert it back to Bitmap first, but rather directly save to file.

IMHO, using Jpeg as intermediate format should be discouraged. Jpeg is very good for final result (i.e. an image that is displayed to a human), but not if that file must be loaded into Pix again: at any quality, new artifacts will be introduced, and such artifacts are known to interfere with the computer vision algorithms.

alexcohn · 2019-07-12T09:02:52Z

I cannot understand which 1 change request is pending. I'd like to move forwar with this pull request if possible.

alexcohn · 2019-07-16T09:05:01Z

If it's console app that you use via some Android Terminal, then I finally understand :-)

You nailed it!

Robyer · 2019-07-17T08:34:55Z

I was trying to better understand and improve some of the changes as I don't like some parts of them, as I commented.

I got stuck a bit on the submodules and I was thinking whether is it really improvement / easier to use them or not, because updating sources is done rarely (I'd like to use only stable versions of given libraries) and having to "maintain" own fork just for any change that I need to do to the code (currently 0-2 changes, but it might be different in future). Also not being able to use only part of the submodule (again, only with custom fork where that changes would be done and maintained). And most I dislike the need to maintain the custom fork "forever", otherwise history commits of this this repository would break (e.g. when my custom changes are included in official Tesseract and I'd like to remove custom fork - I can't because previous version of this library won't compile anymore).

So for me it's really simpler to once in a while checkout the libraries sources manually and update them in this repo. And with submodules I see only more problems and no things that make my life easier. Not mentioning that it brings more problems to users of this repo too, as they need to know how to init/update submodules before they can compile this library.

So for now I won't use them.

I was looking at the fix for readFile/writeFile and related changes and I don't think this is best solution. I'm doing more investigation into the issues in order to try to fix the real issue in code and not just the tests itself.

I still don't understand why there are tests using Bitmap.Config.RGB_565 when everywhere else in the code we expect working with Bitmap.Config.ARGB_8888 only. I think this causes some of the problems, but I'm not sure if special support for RGB_565 should be added or we should support only ARGB_8888.

I'll commit some changes when I have more time.

alexcohn · 2019-07-17T10:03:17Z

It's completely up to you whether to use submodules in your repository or not. I like it better this way, because I often prefer to be at the head of the master branch, or choose other branches when relevant. But to me, the advantage of keeping the commit history for each source is the most important.

alexcohn · 2019-07-17T10:06:24Z

For the artificial B/W bitmap, RGB565 does not introduce extra trouble. But in more realistic scenarios, this could be an extra source of rounding problems.

alexcohn · 2019-07-17T10:37:14Z

Feel free to close this Pull Request and continue with you path. I have found a way to integrate tesseract 4 into tess-two, and the leptonica tests have been fixed there.

Robyer · 2019-07-17T19:32:50Z

I finished my changes and fixes. I included only your modified imageformat commit, but for other fixes I chosen different approach. I didn't included the executable commit, because when I tried that, I couldn't build the tesseract library itself afterwards... But I don't think any user of this library needs that app anyway. Also I updated Tesseract to 4.1.0.

alexcohn added 13 commits June 25, 2019 22:09

using submodule for tesseract lib

9f18e05

using submodule for leptonica

8225d98

using submodule for libpng

53eb780

remove redundant test case

7452b95

fix the writeBitmap tests

02e0d37

nativeWriteBitmap() changes the source pix data. The fix is in test code, because changing the C++ beavior may break some code that relies on it, or call pixEndianByteSwap() to return to original byte order after write. rmtheis/tess-two#228 (comment)

fixing rmtheis/tess-two#87

12edb77

cherry-pick readfile.cpp that always produces 32-bit pix from renard314 source: https://github.com/renard314/tess-two/blob/bb8b785d79810cbc97dbe5147a75c90714b2abca/tess-two/jni/com_googlecode_leptonica_android/readfile.cpp

pass imageFormat and spp to Java, to fix read/write tests

dac172a

use imageFormat and spp to perform more accurate image compare methods

48d4352

see discussion at rmtheis/tess-two#264

fixing the tests that write Jpeg.

cb181a0

default quality (0 is translated to 75) produces too much noise, use 85 is much better

when Jpeg compression inherent data loss is taken into account, we ca…

66255b0

…n tighten the test macth threshold

remove unused app and junit tests

e49d2db

upgrade android build tools and clean up build.gradle, ignore .idea

355e309

update ref to leptonica source

2bc29d1

Robyer requested changes Jul 3, 2019

View reviewed changes

also build the command-line executable

da9e4ee

Robyer reviewed Jul 3, 2019

View reviewed changes

Robyer requested changes Jul 3, 2019

View reviewed changes

Robyer approved these changes Jul 3, 2019

View reviewed changes

Robyer requested changes Jul 3, 2019

View reviewed changes

Robyer reviewed Jul 3, 2019

View reviewed changes

Robyer requested changes Jul 3, 2019

View reviewed changes

Robyer closed this Jul 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

switching to submodules #15

switching to submodules #15

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

Robyer commented Jul 3, 2019 •

edited

Loading

Robyer left a comment

Robyer left a comment

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

Robyer left a comment

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019 •

edited

Loading

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 12, 2019

alexcohn commented Jul 12, 2019

alexcohn commented Jul 16, 2019

Robyer commented Jul 17, 2019 •

edited

Loading

alexcohn commented Jul 17, 2019

alexcohn commented Jul 17, 2019

alexcohn commented Jul 17, 2019

Robyer commented Jul 17, 2019

switching to submodules #15

switching to submodules #15

Conversation

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

Choose a reason for hiding this comment

Robyer commented Jul 3, 2019 • edited Loading

Robyer left a comment

Choose a reason for hiding this comment

Robyer left a comment

Choose a reason for hiding this comment

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

Choose a reason for hiding this comment

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer left a comment

Choose a reason for hiding this comment

Robyer left a comment

Choose a reason for hiding this comment

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019 • edited Loading

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 3, 2019

Robyer commented Jul 3, 2019

alexcohn commented Jul 12, 2019

alexcohn commented Jul 12, 2019

alexcohn commented Jul 16, 2019

Robyer commented Jul 17, 2019 • edited Loading

alexcohn commented Jul 17, 2019

alexcohn commented Jul 17, 2019

alexcohn commented Jul 17, 2019

Robyer commented Jul 17, 2019

Robyer commented Jul 3, 2019 •

edited

Loading

Robyer commented Jul 3, 2019 •

edited

Loading

Robyer commented Jul 17, 2019 •

edited

Loading