worker.recognize fails when run before previous job finishes #875

DR7777 · 2024-01-20T20:12:17Z

Hello!

I have just set up the OCR function and it seems to work if I use this (old deprecated):
const result = await Tesseract.recognize(url, "eng");

But does not work if I use this (new):
const result = await worker.recognize(url);

When I try to run it on a document the "new one" just randomly stops at certain pages / images and doesn't finish without throwing a bug.

See full code below.
As I understood from the migration guide this is the only line I had to change?!

import Tesseract from "tesseract.js";

type OcrOnImagesOptions = {
  onProgress?: (progress: { current: number; total: number }) => void;
  onStart?: (progress: { current: 0; total: number }) => void;
};

const OcrOnImages = async (
  urls: string[],
  options?: OcrOnImagesOptions 
): Promise<
  Record<
    string,
    { text: string; boxes: Array<{ text: string; box: Tesseract.Box }> }
  >
> => {
  options.onStart && options.onStart({ current: 0, total: urls.length });
  const progress = { total: urls.length, current: 0 };

  const worker = await Tesseract.createWorker("eng");

  const promises = urls.map(async (url) => {
    try {
      // THIS DOES NOT WORK
      const result = await worker.recognize(url);
      console.log("result", result);

      // THIS WORKS PERFECTLY
      // const result = await Tesseract.recognize(url, "eng");
      // console.log("result1", result);

      progress.current += 1;
      options.onProgress && options.onProgress(progress);

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null;
    }
  });

  const results = await Promise.all(promises);

  return results.reduce((acc, data, index) => {
    return { ...acc, [index + 1]: data };
  }, {});
};

export default OcrOnImages;

Thanks so much for maintaining this library!

The text was updated successfully, but these errors were encountered:

Balearica · 2024-01-22T04:54:11Z

Thanks for reporting. I was able to replicate. A minimal reproducible example is below.

const worker = await Tesseract.createWorker("eng");

for (let i=0; i<5; i++) {
  worker.recognize("https://raw.githubusercontent.com/naptha/tesseract.js/master/benchmarks/data/meditations.jpg").then(ret => {
    console.log(ret.data.text);
  })
}

This appears to occur when calling worker.recognize multiple times without waiting for the previous recognition jobs to finish. In my example code above, changing worker.recognize to await worker.recognize waits until worker.recognize call completes before running the next one, which resolves the issue. Therefore, the quick fix to your problem would be to always wait until worker.recognize is done running before running worker.recognize again.

Regarding why this would not happen using Tesseract.recognize--that function creates a new worker every time it is called, so the same worker never gets two jobs. However, this will cause other issues. In addition to making a new worker for every job being inefficient, running Tesseract.recognize in parallel can lead to an uncontrollably large number of workers being spawned which can cause crashes due to resource constraints.

If you want to run jobs in parallel, the best way to handle this is using schedulers. Schedulers allow for using a defined number of workers in parallel to process jobs. Schedulers are explained here, and an example using schedulers is here.

DR7777 · 2024-01-22T11:15:46Z

Thanks for getting back so quick!

Actually I was awaiting the result. I just tried again like this and unfortunately I just keep waiting and waiting for the result of the second image...

const worker = await Tesseract.createWorker("eng");

  const promises = urls.map(async (url) => {
    try {
      const result = await worker.recognize(url);
      console.log("result", result);

      progress.current += 1;
      options.onProgress && options.onProgress(progress);

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null; 
    }
  });

  const results = await Promise.all(promises);

I also tried this method and it works fine and is quite fast aswell. But I assume starting a new worker and terminating it is quite inefficient? Even if I terminate it every time?

On my machine it works quite fast though:

  const promises = urls.map(async (url) => {
    try {
      const worker = await Tesseract.createWorker("eng");
      const result = await worker.recognize(url);
      console.log("result", result);

      progress.current += 1;
      options.onProgress && options.onProgress(progress);

      await worker.terminate();

      return {
        text: result.data.text,
        boxes: result.data.words.map((word) => ({
          text: word.text,
          box: word.bbox,
        })),
      };
    } catch (error) {
      console.error("Error processing URL:", url, error);
      return null; 
    }
  });

  const results = await Promise.all(promises);

Balearica · 2024-01-22T18:43:24Z

Actually I was awaiting the result.

When you use map to call an async function, map does not wait for one iteration to finish running before starting the next. Therefore, your code is still running worker.recognize without waiting for the previous job to complete. This concept be observed in the following example.

const wait1Second = () => new Promise((resolve) => {
  setTimeout(() => {
    console.log('1 second has passed');
    resolve(true);
  }, 1000);
});

// console.timeEnd is run after ~5 seconds
console.time();
for (let i = 0; i < 5; i++) {
  await wait1Second();
}
console.timeEnd();

// console.timeEnd runs almost immediately
console.time();
[1, 2, 3, 4, 5].map(async () => {
  await wait1Second();
});
console.timeEnd();

I also tried this method and it works fine and is quite fast aswell.

The reason why this is fast is because your application is running recognition on different workers in parallel--if you have 5 URLs then it is making 5 workers that run at the same time. As noted in my comment above, this is a problematic way to implement parallel processing because (among other things) the number of Tesseract.js workers it creates is undefined--you can end up creating 12 workers at the same time and crashing your application.

If you want to run multiple workers at the same time--which is the most efficient way to do things--you should implement a scheduler using the resources I linked in my previous comment.

Balearica · 2024-02-13T05:09:52Z

The root cause of this bug is that Tesseract.js workers only store one promise for each action at a time. This can be seen in the code linked below. If there is an unresolved promise with action recognize, and a new promise is created with action recognize, the new promise will overwrite the old promise.

https://github.com/naptha/tesseract.js/blob/master/src/createWorker.js#L51-L71

This causes the first recognize job that finishes to resolve the last promise created with action recognize. The remaining jobs still run but do not return anything to the user, as the promise they are trying to resolve has already been resolved.

This should be easily fixable by making the identifier for all promises unique, which can be achieved by appending jobId. I will do this in the next version of Tesseract.js. However, even after this is fixed, I still do not recommend sending multiple recognize jobs to the same worker at the same time. As described above, using a scheduler is the most performant way to handle executing multiple jobs in parallel. Additionally, given that jobs have side effects and there is no queue system built into Tesseract.js workers, it may still be possible for sending a bunch of jobs to a single worker at the same time to cause issues.

Balearica · 2024-02-13T05:38:23Z

The fix described above has been implemented in the master branch and will be included in the next patch release (v5.0.5).

DR7777 · 2024-03-20T17:50:47Z

Thanks a lot for implementing it that quick!
Appreciate it!

I am using schedulers now.

Balearica changed the title ~~Recognise stops when migrated to v5~~ worker.recognize fails when run before previous job finishes Jan 22, 2024

Balearica added a commit that referenced this issue Feb 13, 2024

Made promises have unique identifier per #875

15c62cb

Balearica mentioned this issue Feb 13, 2024

Made promises have unique identifier per #875 #888

Merged

Balearica closed this as completed in #888 Feb 13, 2024

Balearica added a commit that referenced this issue Feb 13, 2024

Made promises have unique identifier per #875 (#888)

c0479a1

Balearica mentioned this issue Apr 27, 2024

Execution worker.recognize repeatedly causes "Out of Memory" error in JSFiddle #920

Closed

PetersonGuo mentioned this issue Jun 12, 2024

[Snyk] Upgrade tesseract.js from 5.0.3 to 5.1.0 PetersonGuo/AssistedReader#17

Merged

Balearica mentioned this issue Jun 20, 2024

Combine worker and scheduler interfaces #933

Open

ramzimalhas mentioned this issue Aug 21, 2024

[Snyk] Upgrade tesseract.js from 5.0.4 to 5.1.0 ramzimalhas/CyberChef#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

worker.recognize fails when run before previous job finishes #875

worker.recognize fails when run before previous job finishes #875

DR7777 commented Jan 20, 2024 •

edited

Loading

Balearica commented Jan 22, 2024 •

edited

Loading

DR7777 commented Jan 22, 2024

Balearica commented Jan 22, 2024

Balearica commented Feb 13, 2024

Balearica commented Feb 13, 2024

DR7777 commented Mar 20, 2024

worker.recognize fails when run before previous job finishes #875

worker.recognize fails when run before previous job finishes #875

Comments

DR7777 commented Jan 20, 2024 • edited Loading

Balearica commented Jan 22, 2024 • edited Loading

DR7777 commented Jan 22, 2024

Balearica commented Jan 22, 2024

Balearica commented Feb 13, 2024

Balearica commented Feb 13, 2024

DR7777 commented Mar 20, 2024

DR7777 commented Jan 20, 2024 •

edited

Loading

Balearica commented Jan 22, 2024 •

edited

Loading