Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switching to submodules #15

Closed
wants to merge 14 commits into from
Closed

Conversation

alexcohn
Copy link

@alexcohn alexcohn commented Jul 3, 2019

In my fork, I am using submodules for tesseract 4.0, for leptronica, and for libpng.

I have also cleaned up the build scripts, and fixed the long-awaiting tests for leptronica integration.

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

@alexcohn Hi, thanks for the PR and your work on debugging and fixing some of the old bugs! :)

I don't want to merge this as one whole PR, it would be better to separate it into smaller more specific changes. Also I see some issues with some of the changes, so we can discuss them first.

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

  1. I think that by using submodule for tesseract you missed some changes that were made to the sources here. Namely:

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

If I understand correctly, the 21bb9dc has already been fixed upstream and we only need to point to 4.1.0 or even to some fork of tesseract.

It's super easy to switch a submodule to a private fork and back; it's much harder to maintain copies of few big repos, if you ever expect to move your reference vis a vis the peer repo.

I believe that this how Tesseract4Android became a clone, not a fork of tess-two. And now I see that tesseract itself is struggling to be up-to-date with the new Android build system. In my opinion, submodules maintenance is easier in the long run.

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only add the relevant changes into code, not replace the whole file from different project.

When replacing the file, there are these issues:

  • changes includes block
  • reverts various code improvements, like return values, NULL->nullptr, etc. This should be applied to new added code too.
  • breaks file formatting

Also I'm not sure if renand's version is best solution, as he (if I understand correctly the code) brings performance drop by iterating over the pix twice by calling the pixEndianByteSwap afterwards. I think it might be rewritten in a way that the endianness is changed in place during the single for cycle.

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

@alexcohn Yes, I'm not against submodules per se, I'm rather mentioning the problem current PR brings. I think submodule's code can be changed in place by having some local branch. I'm not sure if the changes must be pushed to separate repo clone or if they stay in this repository (that would be better). I haven't used submodules in the past, that's why I created this repo by copying sources, but I agree it would be good to use them (if we solve the custom code changes).

And I would wait before upgrading to 4.1.0 (or 5.0.0, I think they raised the version already) and keep using stable versions for this library. That's why previously I just cherry-picked the single commit.

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to add tesseract_exe if this is Android only library?

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read your comments in Leptonica issues and I understood them that problem with PNG (with <4 channels) and JPEG (which always have <4 channels), is the same - that we should ignore the 4th byte. And I expect they are represented in the Pix format same way (same byte order, etc.), aren't they? So why there is different processing for JPEG and for PNG? Why don't we ignore the 4th byte same way?

Also I just realize that your code works correctly only with spp == 3, not with spp < 4, because you always ignore only 4th byte, right? So to handle spp == 2 correctly, the code must be modified.

So perhaps for the comparison we should first extract the all channels into separate variables and then compare the variables itself (based on how many channels there are, and ignore the others). That way you would also remove the format condition (as that wouldn't matter anymore).

Or am I missing something?

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Why do you want to add tesseract_exe if this is Android only library?

Actually, this was my motivation to begin this adventure. It all began when I came across this post and then I understood that the cleanest way to build the executable (for Android) today is to build upon your project. As you can see, the change is trivial, and it is compatile with Android Sudio, and works with the latest NDK smoothly.

On the way, I found few other issues (namely, the Java tests), that I decided to fix as well.

I believe that having an easy way to produce an up-to-date command-line executable makes tests much easier to maintain.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expose the quality argument to Java code, to make it explicit for tests (or other usages) what quality will be used.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Working with submodules and cherry-picking some changes from upstream requires a writable branch in the repo. On GitHub, it's easily achieved by forking the primary repo.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea of app module was to have some very simple example project so people would know how to use the library (in issues can be seen that few users struggled with that), but I haven't got time for that.

Copy link
Member

@Robyer Robyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring whole .idea/ folder isn't good idea as we lose some project specific settings that should be shared, like code formatting or run configurations, etc. You see that there are already specified some .idea/xxx files to be ignored, so I'd rather add what files should be explicitly ignored.

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

Actually, this was my motivation to begin this adventure. (...)

On the way, I found few other issues (namely, the Java tests), that I decided to fix as well.

I believe that having an easy way to produce an up-to-date command-line executable makes tests much easier to maintain.

Sorry, I still don't fully understand. You wanted to compile own tesseract.exe for windows and it was easier to do from this project than from the original tesseract repository? Or did you just needed it for testing the fixes in this library?

I mean do you think someone else will benefit from this ability of compiling exe? Normally users would download Windows binary from https://github.com/UB-Mannheim/tesseract/wiki , and if it's developer who want to edit tesseract source, then they definitely building of the project itself. So I still don't see real need for having this build target here (otherwise we would have to add also linux or mac building too).

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

It woud be great to add a simple demo app, but until it's there, an empty app only causes confusion. For me, it did ;)

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

Working with submodules and cherry-picking some changes from upstream requires a writable branch in the repo. On GitHub, it's easily achieved by forking the primary repo.

Ok, so I should create fork of Tesseract and create some branch with my changes, then we can use that branch as source for the submodule in this library? I'll do that later then.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

tessetact_exe target produces an Android binary named tesseract in Tesseract4Android/tesseract4android/.externalNativeBuild/cmake/debug/arm64-v8a/tesseract directory (and all the other flavours). I could not use tesseract target now because that would conflict with the shared library target, which builds libtesseract.so

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Ok, so I should create fork of Tesseract and create some branch with my changes, then we can use that branch as source for the submodule in this library? I'll do that later then.

Exactly. That's what I did for leptonica. But in the end, I don't need my fork now, because the necessary changes to Java tests have worked around the alpha channel issue. Though, the change there is still worth merging, but it's not a showstopper anymore.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Whether .idea should be in VCS, is an unsettled holiwar. In this particular case, I see that all configurations are actually fully recovered by Android Studio, but there is also no strong reasons to remove them from Git.

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

tessetact_exe target produces an Android binary named tesseract in Tesseract4Android/tesseract4android/.externalNativeBuild/cmake/debug/arm64-v8a/tesseract directory (and all the other flavours). I could not use tesseract target now because that would conflict with the shared library target, which builds libtesseract.so

Aha, so only the name with _exe is misleading as it doesn't produce any tesseract.exe file, but normal binary for Android? I see that you use src/src/api/tesseractmain.cpp there, so you are compiling tesseract as console application that could be used from Android or are you just compiling static library of tesseract to be used in some other library/application?

In case of static library, do you need to include that main file to compile it? Also if it's just static binary (not console app), I would use tesseract_static name instead of tesseract_exe.

If it's console app that you use via some Android Terminal, then I finally understand :-)

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

I haven't studied the details of 2-spp Pix, so I cannot tell what would the right fix there,

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Aren't you just hiding compression in this particular case, because this is black/white image? If we would have some normal photograph, would your special treatment work correctly too?

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

That method is for saving any Pix into file, right? That could have multiple use-cases for users. For example they preprocess the image they have in Pix, and then they want to save the processed image for some later use. Basically any usage of the file when they don't want to convert it back to Bitmap first, but rather directly save to file.

@alexcohn
Copy link
Author

alexcohn commented Jul 3, 2019

Jpeg needs special treatment because the inherent noise of lossy compression required relaxed criteria for image comparison. If you ignore the least significant bit, and use 85 quality, then the two images actually match perfectly.

Aren't you just hiding compression in this particular case, because this is black/white image? If we would have some normal photograph, would your special treatment work correctly too?

The scope of this comparison method is for testing. The change may still show difference if some other image is chosen, but no, an arbitrary photograph will not behave worse than black/white synthetic image.

We don't have a greneral-purpose comparison method, but now, with ssp and inputFormat exposed, users of Java API can decide what accuracy they should expect,

@Robyer
Copy link
Member

Robyer commented Jul 3, 2019

We don't have a greneral-purpose comparison method, but now, with ssp and inputFormat exposed, users of Java API can decide what accuracy they should expect,

Yes, but my point is: Do we really need the special handling of JPEG? Can't we handle comparison of PNGs (with ssp<4) and JPEGs same way with same block of code?

@alexcohn
Copy link
Author

I almost agree with exposing the quality argument on the Java side. The only reservatin is that actually this is a marginal use case. What is the scenario in tesseract that you would save a Pix as a Jpeg file?

That method is for saving any Pix into file, right? That could have multiple use-cases for users. For example they preprocess the image they have in Pix, and then they want to save the processed image for some later use. Basically any usage of the file when they don't want to convert it back to Bitmap first, but rather directly save to file.

IMHO, using Jpeg as intermediate format should be discouraged. Jpeg is very good for final result (i.e. an image that is displayed to a human), but not if that file must be loaded into Pix again: at any quality, new artifacts will be introduced, and such artifacts are known to interfere with the computer vision algorithms.

@alexcohn
Copy link
Author

I cannot understand which 1 change request is pending. I'd like to move forwar with this pull request if possible.

@alexcohn
Copy link
Author

If it's console app that you use via some Android Terminal, then I finally understand :-)

You nailed it!

@Robyer
Copy link
Member

Robyer commented Jul 17, 2019

I was trying to better understand and improve some of the changes as I don't like some parts of them, as I commented.

  1. I got stuck a bit on the submodules and I was thinking whether is it really improvement / easier to use them or not, because updating sources is done rarely (I'd like to use only stable versions of given libraries) and having to "maintain" own fork just for any change that I need to do to the code (currently 0-2 changes, but it might be different in future). Also not being able to use only part of the submodule (again, only with custom fork where that changes would be done and maintained). And most I dislike the need to maintain the custom fork "forever", otherwise history commits of this this repository would break (e.g. when my custom changes are included in official Tesseract and I'd like to remove custom fork - I can't because previous version of this library won't compile anymore).

    So for me it's really simpler to once in a while checkout the libraries sources manually and update them in this repo. And with submodules I see only more problems and no things that make my life easier. Not mentioning that it brings more problems to users of this repo too, as they need to know how to init/update submodules before they can compile this library.

    So for now I won't use them.

  1. I was looking at the fix for readFile/writeFile and related changes and I don't think this is best solution. I'm doing more investigation into the issues in order to try to fix the real issue in code and not just the tests itself.

    I still don't understand why there are tests using Bitmap.Config.RGB_565 when everywhere else in the code we expect working with Bitmap.Config.ARGB_8888 only. I think this causes some of the problems, but I'm not sure if special support for RGB_565 should be added or we should support only ARGB_8888.

I'll commit some changes when I have more time.

@alexcohn
Copy link
Author

It's completely up to you whether to use submodules in your repository or not. I like it better this way, because I often prefer to be at the head of the master branch, or choose other branches when relevant. But to me, the advantage of keeping the commit history for each source is the most important.

@alexcohn
Copy link
Author

For the artificial B/W bitmap, RGB565 does not introduce extra trouble. But in more realistic scenarios, this could be an extra source of rounding problems.

@alexcohn
Copy link
Author

Feel free to close this Pull Request and continue with you path. I have found a way to integrate tesseract 4 into tess-two, and the leptonica tests have been fixed there.

@Robyer
Copy link
Member

Robyer commented Jul 17, 2019

I finished my changes and fixes. I included only your modified imageformat commit, but for other fixes I chosen different approach. I didn't included the executable commit, because when I tried that, I couldn't build the tesseract library itself afterwards... But I don't think any user of this library needs that app anyway. Also I updated Tesseract to 4.1.0.

@Robyer Robyer closed this Jul 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants