Missing languages #140

Closed
marian-code opened this issue Jul 30, 2022 · 22 comments · Fixed by #148

@marian-code

Hey,

This is closely related to #119, concerning OCR language selection. The documentation is a bit misleading: it states that you only have to install the corresponding language package.

Also if you want to use specific language settings please install the corresponding tesseract packages

But this holds true only for the hardcoded languages in the lists:

https://github.com/R0Wi/workflow_ocr/blob/bba5551a22b4a74a0ae23812449a9eea73b7fed0/lib/OcrProcessors/OcrMyPdfBasedProcessor.php#L39-L49

https://github.com/R0Wi/workflow_ocr/blob/bba5551a22b4a74a0ae23812449a9eea73b7fed0/src/components/WorkflowOcr.vue#L47-L57

Could you please consider creating these lists dynamically based on the installed languages, or at least including all languages available for Tesseract by default? This limitation brings down an otherwise perfect tool :)

@R0Wi
Contributor

R0Wi commented Jul 30, 2022

Hi, you're right, we already discussed that feature in #135 and #119. This is definitely a feature we want to add, but I can't give you a time horizon for this.

Something we have to check first is how we could get all installed languages for OcrMyPdf in a system-independent manner. @bahnwaerter some ideas for a command?

@R0Wi R0Wi added the enhancement New feature or request label Jul 30, 2022
@marian-code
Author

Thanks for the fast reply. Is there any way I could change the installed app files and add the required languages manually in the meantime? I tried to alter the javascript files but the changes do not show in Nextcloud.

Something we have to check first is how we could get all installed languages for OcrMyPdf in a system-independent manner. @bahnwaerter some ideas for a command?

Tesseract does that nicely: tesseract --list-langs

@R0Wi
Contributor

R0Wi commented Jul 30, 2022

I tried to alter the javascript files but the changes do not show in Nextcloud.

Changing only the .vue files is not enough; you'll have to recompile, for example via make, so clone the whole repo first. If you tell me the missing languages, I could add them if you want.

Tesseract does that nicely: tesseract --list-langs

Nice hint, will try this, thanks 👍

@marian-code
Author

If you could add the Slovak language in the meantime, that would be very nice :)

@R0Wi
Contributor

R0Wi commented Aug 1, 2022

@marian-code I added slk as an option. Please use https://github.com/R0Wi/workflow_ocr/suites/7609856015/artifacts/315720821 in the meantime and let me know if this works for the moment.

@marian-code
Author

Thanks, everything works fine. I only had to bump the version because Nextcloud kept updating the app from the store.

@bahnwaerter
Collaborator

Something we have to check first is how we could get all installed languages for OcrMyPdf in a system-independent manner. @bahnwaerter some ideas for a command?

I have nothing to add to @marian-code's suggestion using tesseract --list-langs as a system-independent command. This would also work great if we want to provide a generic docker container for various hardware architectures in the future.

@melnyksergii

@R0Wi Hello, could you add the Ukrainian language? Thank you.

@R0Wi R0Wi self-assigned this Sep 3, 2022
@R0Wi
Contributor

R0Wi commented Sep 3, 2022

@marian-code @melnyksergii @bahnwaerter I just implemented a first beta version in which the app lists only the Tesseract languages that are installed on the backend. Would love to get some feedback.

Artifact can be found at https://github.com/R0Wi/workflow_ocr/suites/8119756582/artifacts/351406245. Please download and unpack (attention: packed twice...) the files into your NC apps folder. After that, please set the version inside appinfo/info.xml to 1.24.5 so that migrations are applied correctly.

Will try to write some tests these days; then this could be released in the near future.

@marian-code
Author

Well, I tried it today and it shows "List is empty" in the language selection dropdown menu.

Tesseract shows this output:

tesseract --list-langs
List of available languages (4):
ces
eng
osd
slk
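
For reference, output like the above can be reduced to a plain list of language codes by skipping the header line and the osd entry. The following is only a minimal JavaScript sketch of that idea; parseTesseractLangs is a hypothetical helper name and is not part of workflow_ocr.

```javascript
// Hypothetical helper (not part of workflow_ocr): extracts language codes
// from the text printed by `tesseract --list-langs`.
function parseTesseractLangs(output) {
  return output
    .split('\n')
    .slice(1)                                          // skip the header line
    .map((line) => line.trim())                        // tolerate stray whitespace or \r
    .filter((line) => line !== '' && line !== 'osd');  // drop blanks and the OSD entry
}

const sample = 'List of available languages (4):\nces\neng\nosd\nslk';
console.log(parseTesseractLangs(sample)); // [ 'ces', 'eng', 'slk' ]
```

Because the result is built with Array methods only, it is always a dense array, whatever position osd has in the listing.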

@R0Wi
Contributor

R0Wi commented Sep 10, 2022

Thanks for testing. Could you check the server logs? Lowering the loglevel (for example to debug) might also help.

Will try to check it inside a fresh NC installation these days...

@marian-code
Author

marian-code commented Sep 12, 2022

Well, there is nothing in the logs, even with the debug level setting. I see in the code that a log entry should be emitted on error. Maybe the command runs OK but something else fails? There is no log entry for a successful run of the command. I am not familiar with PHP; can I experiment a bit by adding some lines of code? As far as I know it is interpreted, so changes should take effect right away?

public function getInstalledLanguages() : array {
  $commandStr = 'tesseract --list-langs';
  $this->command->setCommand($commandStr);
  
  $success = $this->command->execute();
  $errorOutput = $this->command->getError();
  $stdErr = $this->command->getStdErr();
  $exitCode = $this->command->getExitCode();
  
  if (!$success) {
	  throw new Exception('The command ' . $commandStr .' exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);
  }
  
  if ($stdErr !== '' || $errorOutput !== '') {
	  $this->logger->warning('Tesseract list languages succeeded with warning(s): {stdErr}, {errorOutput}', [
		  'stdErr' => $stdErr,
		  'errorOutput' => $errorOutput
	  ]);
  }
  
  $installedLangsStr = $this->command->getOutput();
  
  if (!$installedLangsStr) {
	  throw new Exception('The command ' . $commandStr .' did not produce any output');
  }
  
  $lines = explode("\n", $installedLangsStr);
  return Chain::create($lines)
	  ->slice(1) // Skip tesseract header line
	  ->filter(function ($line) {
		  return $line !== 'osd'; // Also skip "osd" (OSD is not a language)
	  })
	  ->array;
}

@R0Wi
Contributor

R0Wi commented Sep 12, 2022

@marian-code you're right, you can edit the file directly in your NC instance and changes will take effect immediately. I'd suggest editing the mentioned code as follows:

public function getInstalledLanguages() : array {
  $this->logger->debug('getInstalledLanguages');
  $commandStr = 'tesseract --list-langs';
  $this->command->setCommand($commandStr);

  $this->logger->debug('executing command: ' . $commandStr);  

  $success = $this->command->execute();
  $errorOutput = $this->command->getError();
  $stdErr = $this->command->getStdErr();
  $exitCode = $this->command->getExitCode();
  
  if (!$success) {
	  throw new Exception('The command ' . $commandStr .' exited abnormally with exit-code ' . $exitCode . '. Message: ' . $errorOutput . ' ' . $stdErr);
  }
  
  if ($stdErr !== '' || $errorOutput !== '') {
	  $this->logger->warning('Tesseract list languages succeeded with warning(s): {stdErr}, {errorOutput}', [
		  'stdErr' => $stdErr,
		  'errorOutput' => $errorOutput
	  ]);
  }
  
  $installedLangsStr = $this->command->getOutput();

  $this->logger->debug('got installed language string: ' . $installedLangsStr);    

  if (!$installedLangsStr) {
	  throw new Exception('The command ' . $commandStr .' did not produce any output');
  }
  
  $lines = explode("\n", $installedLangsStr);
  return Chain::create($lines)
	  ->slice(1) // Skip tesseract header line
	  ->filter(function ($line) {
		  return $line !== 'osd'; // Also skip "osd" (OSD is not a language)
	  })
	  ->array;
}

With the NC loglevel set to debug, that should give you some additional output. I'd be interested to see the logs after you have changed the code. Thanks!

@marian-code
Author

Here are the two relevant lines from log. It seems to be working fine. So the problem must be elsewhere.

{"user":"admin",
"app":"workflow_ocr",
"method":"GET",
"url":"/index.php/apps/workflow_ocr/ocrBackendInfo/installedLangs",
"message":"executing command: tesseract --list-langs",
"version":"24.0.5.1",
"data":{"app":"workflow_ocr"}}

{"user":"admin",
"app":"workflow_ocr",
"method":"GET",
"url":"/index.php/apps/workflow_ocr/ocrBackendInfo/installedLangs",
"message":"got installed language string: List of available languages (4):\nces\neng\nosd\nslk",
"version":"24.0.5.1",
"data":{"app":"workflow_ocr"}}

@R0Wi
Contributor

R0Wi commented Sep 12, 2022

You're right that's all looking good so far. Could you try to open the dev-tools of your browser and see if there are any errors reported?

@marian-code
Author

I think I've got it. Dev-tools console shows this error:

TypeError: t.includes is not a function
    at workflow_ocr-main.js?v=5b6fe176-0:2:666355
    at Array.filter (<anonymous>)
    at o.beforeMount (workflow_ocr-main.js?v=5b6fe176-0:2:666342)
Ht @ vue.runtime.esm.js:1897

Which points to here:

beforeMount: async function() {
      const t = await async function() {
          const t = (0,
          n.generateUrl)("/apps/workflow_ocr/ocrBackendInfo/installedLangs");
          return (await o.default.get(t)).data
      }();
      this.availableLanguages = e.filter((e=>t.includes(e.langCode)))
  },

Apparently t is neither a string nor an array, so the call to t.includes fails here.
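
A plain object would produce exactly this kind of error, since includes exists on arrays and strings but not on objects. The sketch below illustrates that with an assumed payload shape (a JSON object with numeric string keys) and a hypothetical guard, toLanguageList, that accepts either shape. It is an illustration only, not the app's actual code, and the supported list is made up.

```javascript
// Assumed payload shape: a JSON object with numeric string keys instead of an array.
const payload = JSON.parse('{"0":"ces","1":"eng","3":"slk"}');

console.log(Array.isArray(payload));  // false
console.log(typeof payload.includes); // undefined, so payload.includes(...) would throw a TypeError

// Hypothetical guard: accept either an array or an object of values.
function toLanguageList(data) {
  if (Array.isArray(data)) return data;
  return data && typeof data === 'object' ? Object.values(data) : [];
}

// Made-up list standing in for the languages the frontend knows about.
const supported = [{ langCode: 'eng' }, { langCode: 'deu' }, { langCode: 'slk' }];
const installed = toLanguageList(payload);
const available = supported.filter((lang) => installed.includes(lang.langCode));

console.log(available.map((lang) => lang.langCode)); // [ 'eng', 'slk' ]
```

A guard like this only hides the symptom on the client; whether the server should return an array in the first place is a separate question.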

@R0Wi
Contributor

R0Wi commented Sep 13, 2022

Thanks for digging into that @marian-code! So it really seems to be a frontend error then. I'll try to reproduce it (even though I'm wondering why it worked on my dev build).

Could you do me a favor and also send me a screenshot of the Network tab in your Dev-Tools? There should be a request going to http(s)://.../apps/workflow_ocr/ocrBackendInfo/installedLangs. I'd really like to see the response sent by the server. Maybe it's empty or something else went wrong.

@marian-code
Author

marian-code commented Sep 13, 2022

Thanks for all the fast replies. Here is what I found. I am only vaguely familiar with web development, so I hope I understood correctly and this is what you wanted. If anything more is needed, I am happy to help.

Response

The response I am seeing is:

{"0":"ces","1":"eng","3":"slk"}

This is in line with what I would expect, since I have installed these 3 languages.

Headers

# General
Request URL: https://.../apps/workflow_ocr/ocrBackendInfo/installedLangs
Request Method: GET
Status Code: 200 
Remote Address: 188.167.103.32:443
Referrer Policy: no-referrer

# Response headers
cache-control: no-cache, no-store, must-revalidate
content-encoding: gzip
content-length: 47
content-security-policy: default-src 'none';base-uri 'none';manifest-src 'self';frame-ancestors 'none'
content-type: application/json; charset=utf-8
date: Tue, 13 Sep 2022 10:37:08 GMT
expires: Thu, 19 Nov 1981 08:52:00 GMT
feature-policy: autoplay 'none';camera 'none';fullscreen 'none';geolocation 'none';microphone 'none';payment 'none'
pragma: no-cache
referrer-policy: no-referrer
server: Apache/2.4.41 (Ubuntu)
strict-transport-security: max-age=15552000; includeSubDomains
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-permitted-cross-domain-policies: none
x-request-id: ekYHdb1fH7S2GT79Gwnc
x-robots-tag: none
x-xss-protection: 1; mode=block

# Request headers
:authority: ...
:method: GET
:path: /apps/workflow_ocr/ocrBackendInfo/installedLangs
:scheme: https
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: sk-SK,sk;q=0.9,en-US;q=0.8,en;q=0.7,cs;q=0.6
cookie: oc_sessionPassphrase=...; __Host-nc_sameSiteCookielax=true; __Host-nc_sameSiteCookiestrict=true; i18next=sk-SK; nc_username=...; oc8t9v13ol02=r79s30ltae91ma8lujras8m4hj; nc_token=...; nc_session_id=...
dnt: 1
requesttoken: Nd6ACO5YS0xhR77stUaznEZLvC0KVqsnOtX92Xi+3jw=:QInWft8JPiooctCtxHHg7w4H1n5tG/hLceSIkzzdk10=
sec-ch-ua: "Chromium";v="104", " Not A;Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-origin
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.115 Safari/537.36

@R0Wi
Contributor

R0Wi commented Sep 13, 2022

Response

The response I am seeing is:

{"0":"ces","1":"eng","3":"slk"}

That's exactly the line I needed. Looks like the server-side serialization is a bit buggy. According to https://stackoverflow.com/a/11722121, PHP will serialize an array as a JSON object instead of a JSON array under this condition:

If the array keys in your PHP array are not consecutive numbers

I think this did not happen in my test cases because the keys of the languages array were consecutive in my installation.

Could you please try to patch the following line:

https://github.com/R0Wi/workflow_ocr/blob/f1b0edce0143b8bb5db3a7ba17ca78a2a52df8a9/lib/Controller/OcrBackendInfoController.php#L50

to

return array_values($this->ocrBackendInfoService->getInstalledLanguages());

?

Thanks again for your participation!
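
To make the key gap easier to see, here is a small JavaScript model of the behavior described above. It assumes that the filter step keeps the original numeric keys (as PHP's array_filter does) and that the JSON encoder emits a list only when the keys run 0..n-1. All function names are hypothetical stand-ins; this is not the app's code and not PHP itself.

```javascript
// Model a PHP array as ordered [key, value] pairs so keys survive filtering.
function phpLikeFilter(pairs, predicate) {
  return pairs.filter(([, value]) => predicate(value));
}

// Mimic the relevant encoding rule: a JSON list only if keys are exactly 0..n-1.
function phpLikeJsonEncode(pairs) {
  const isList = pairs.every(([key], index) => key === index);
  return isList
    ? JSON.stringify(pairs.map(([, value]) => value))
    : JSON.stringify(Object.fromEntries(pairs));
}

// Stand-in for array_values: drop the old keys and renumber from zero.
function phpLikeArrayValues(pairs) {
  return pairs.map(([, value], index) => [index, value]);
}

// Languages after the header line has been skipped; "osd" sits in the middle.
const afterSlice = [[0, 'ces'], [1, 'eng'], [2, 'osd'], [3, 'slk']];
const filtered = phpLikeFilter(afterSlice, (lang) => lang !== 'osd');

console.log(phpLikeJsonEncode(filtered));                     // {"0":"ces","1":"eng","3":"slk"}
console.log(phpLikeJsonEncode(phpLikeArrayValues(filtered))); // ["ces","eng","slk"]
```

In this model an installation where osd sorts last (or is absent) leaves no gap, which would be consistent with the problem not showing up on every system.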

@marian-code
Author

Well, that is certainly interesting serialization :D
But after the patch everything works as it should. Thanks for debugging this so quickly.

@R0Wi
Contributor

R0Wi commented Sep 13, 2022

Well, that is certainly interesting serialization :D

It is 😄 Since I'm also not a "real" PHP developer, I think I missed this detail. Nevertheless, I'm glad to hear that things are working correctly now. I will integrate the changes into the code and also add additional tests. This issue will be closed once the code is merged.

Thanks again (hopefully for the last time 😸 )

@marian-code
Author

Yeah, no problem. Glad to help.

R0Wi added a commit that referenced this issue Sep 19, 2022
Get installed tesseract languages from backend

Signed-off-by: Robin Windey <[email protected]>
@R0Wi R0Wi mentioned this issue Sep 19, 2022
@R0Wi R0Wi linked a pull request Sep 19, 2022 that will close this issue
R0Wi added a commit that referenced this issue Sep 24, 2022
Get installed tesseract languages from backend

Signed-off-by: Robin Windey <[email protected]>
@R0Wi R0Wi closed this as completed in #148 Sep 24, 2022
R0Wi added a commit that referenced this issue Sep 24, 2022
* Implement #140

Get installed tesseract languages from backend

Signed-off-by: Robin Windey <[email protected]>

* Fix OcrBackendInfoServiceTest for #140

Signed-off-by: Robin Windey <[email protected]>

* Introduce specific CommandException

Signed-off-by: Robin Windey <[email protected]>

Signed-off-by: Robin Windey <[email protected]>