Convert shell scripts to single Python script #48

Merged
merged 20 commits into development from pedropombeiro/convert-to-python3
Oct 11, 2024

pedropombeiro
Collaborator

@pedropombeiro pedropombeiro commented Sep 24, 2024

This PR converts the scripts to a single Python 3 script (scanner.py), making the code more readable and maintainable. The log output was also improved for readability. Improvements are welcome, as I'm not well versed in Python.

NOTES:

  • The code should be in a good enough condition to use daily, but there might be bugs, especially in the scripts that I don't use (trigger*.py).

  • There have been recent changes that aren't yet incorporated in the Python script.

  • The remove_blank.sh script hasn't been converted fully yet, as that involves parsing the output of some command line utilities. UPDATE: fixed

  • Later on, we might want to remove the remaining shell scripts by changing the contents of /opt/brother/scanner/brscan-skey/brscan-skey.config to refer directly to the Python scripts:

    IMAGE="python3  /opt/brother/scanner/brscan-skey/script/scantoimage.py"
    OCR="python3  /opt/brother/scanner/brscan-skey/script/scantoocr.py"
    EMAIL="python3  /opt/brother/scanner/brscan-skey/script/scantoemail.py"
    FILE="python3  /opt/brother/scanner/brscan-skey/script/scantofile.py"

Related to #42

@pedropombeiro pedropombeiro force-pushed the pedropombeiro/convert-to-python3 branch 5 times, most recently from 59ece31 to 3b031d4 Compare September 25, 2024 11:34
@pedropombeiro pedropombeiro changed the title Draft: Convert shell scripts to single Python script Convert shell scripts to single Python script Sep 25, 2024
@pedropombeiro pedropombeiro force-pushed the pedropombeiro/convert-to-python3 branch 2 times, most recently from 99c9a0a to 9f3d47e Compare September 25, 2024 13:54
@pedropombeiro pedropombeiro self-assigned this Sep 25, 2024
@pedropombeiro pedropombeiro force-pushed the pedropombeiro/convert-to-python3 branch 4 times, most recently from 87cb872 to b84fabc Compare September 25, 2024 16:56
@PhilippMundhenk
Owner

WOW! Just wow! This is awesome, thank you!! I will need some time to take a look at this, though. I'll have to push this to the weekend (hopefully). Very sorry about that.

@pedropombeiro
Collaborator Author

WOW! Just wow! This is awesome, thank you!! I will need some time to take a look at this, though. I'll have to push this to the weekend (hopefully). Very sorry about that.

No worries @PhilippMundhenk, I already have it running on my NAS as my daily driver by overriding the scripts, so I'm in no rush 🙂

@pedropombeiro pedropombeiro force-pushed the pedropombeiro/convert-to-python3 branch 14 times, most recently from bff36c1 to 9ee9978 Compare October 1, 2024 20:34
@pedropombeiro
Collaborator Author

The latest version should have fixed the scan order:

image

I'll now debug the scanner disconnection (no idea what could be causing this). Do you know if this is only happening in this branch - not in master?

@pedropombeiro
Collaborator Author

Turns out that the duplicate logs were caused by a lingering batch I had in /tmp. Maybe we should clean incomplete batches on container startup (at least if the back pages are there but no .scan_pid file).

@pedropombeiro
Collaborator Author

I tried a few more times and this time I didn't see any disconnections 🤷

Can you please try the latest image? Normally, OCR should be the only problem left.

@pedropombeiro pedropombeiro force-pushed the pedropombeiro/convert-to-python3 branch from 9b36b70 to 665fa8c Compare October 7, 2024 20:46
@pedropombeiro
Collaborator Author

OCR should also be fixed.

@PhilippMundhenk
Owner

PhilippMundhenk commented Oct 8, 2024

Thank you so much for the super fast response! I don't manage to be this fast.

  • I can confirm that ordering is correct now!
  • For OCR, current behavior (thus allowing backward compatibility) is to trigger the OCR call after the conversion of either front or front and rear pages on scantofile or scantoemail (i.e., any scan), if OCR variables are set. scantoimage and scantoocr are currently undefined. I only took a quick look, but don't quite understand how OCR is enabled on the OCR key, but not the other keys.
  • I also tested with OCR key, but OCR also does not seem to work, maybe due to this error:
    - Scanning rear to latest batch 2024-10-08-19-04-12
      rear side: Found front-side batch: 2024-10-08-19-04-12
      rear side: ERROR: scan_pid file {path} not found.
      Analyzing 4 pages in /tmp/2024-10-08-19-04-12/2024-10-08-19-04-12.pdf with threshold 0.3%
    
  • I also do not observe any more hanging.
  • Regarding leftover batches, I think it would make sense not to clean on container start, but rather on scan start, if we can be sure there is no conversion process running. We should still allow the case where two front pages are scanned quickly in a row, though, e.g., by checking the age of .scan_pid.
  • I (accidentally) noticed that aborting a started scan leads to some issues. Logs (scanning stopped at scanner before first page being pulled in, second scan started before second line):
    Scanning page 1
      front side: Waiting for 2 minutes before starting file conversion for 2024-10-08-18-57-25
      front side: converting to PDF for 2024-10-08-18-57-25...
      DEBUG: Executing command: ['gm', 'convert', '/tmp/2024-10-08-18-57-25/2024-10-08-18-57-25-front-page0001.pnm', '/tmp/2024-10-08-18-57-25/2024-10-08-18-57-25.pdf'], kwargs={'check': True}
      DEBUG: Moving /tmp/2024-10-08-18-57-25/2024-10-08-18-57-25.pdf to /scans/2024-10-08-18-57-25.pdf
      INFO: SSH environment variables not set, skipping inotify trigger.
      INFO: TELEGRAM_TOKEN or TELEGRAM_CHATID environment variables not set, skipping Telegram trigger.
    scanimage: sane_read: Error during device I/O
    Scanned page 1. (scanner status = 9)
    Batch terminated, 1 page scanned
    object address  : 0x7aa11b385780
    object refcount : 3
    object type     : 0x6037ff263140
    object type name: CalledProcessError
    object repr     : CalledProcessError(9, ['scanimage', '-l', '0', '-t', '0', '-x', '215', '-y', '297', '--format=pnm', '--resolution=300', '--batch=/tmp/2024-10-08-18-58-20/2024-10-08-18-58-20-front-page%04d.pnm'])
    lost sys.stderr
    - Scanning front to batch /tmp/2024-10-08-19-00-41/2024-10-08-19-00-41-front-page%04d.pnm
      DEBUG: Executing command: ['scanimage', '-l', '0', '-t', '0', '-x', '215', '-y', '297', '--format=pnm', '--mode=True Gray', '--resolution=300', '--batch=/tmp/2024-10-08-19-00-41/2024-10-08-19-00-41-front-page%04d.pnm'], kwargs={'check': True}
    scanimage: rounded value of br-x from 215 to 211.881
    

I have not tested FTPS, inotify, Telegram
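
The scan-start cleanup discussed above could look roughly like the sketch below. It is an illustration only: `find_stale_batches`, the `max_age_s` default, and the fallback to the directory's own modification time when `.scan_pid` is missing are assumptions, not code from this PR. Batch directories are assumed to follow the timestamp naming seen in the logs.

```python
import re
import time
from pathlib import Path

# Assumption: batch directories are named like 2024-10-08-19-04-12, as in the logs.
BATCH_NAME = re.compile(r"\d{4}(-\d{2}){5}")


def find_stale_batches(tmp_dir, max_age_s=600, now=None):
    """Return batch directories that look abandoned (hypothetical helper).

    A batch counts as stale when its .scan_pid file is older than max_age_s.
    If .scan_pid is missing, the age of the directory itself is used instead,
    so that a batch that is still being scanned is not removed.
    """
    now = time.time() if now is None else now
    stale = []
    for batch in sorted(Path(tmp_dir).iterdir()):
        if not batch.is_dir() or not BATCH_NAME.fullmatch(batch.name):
            continue  # ignore anything in /tmp that is not a scan batch
        pid_file = batch / ".scan_pid"
        marker = pid_file if pid_file.exists() else batch
        if now - marker.stat().st_mtime > max_age_s:
            stale.append(batch)
    return stale
```

A caller would run this at the start of a front-side scan and delete the returned directories (for example with `shutil.rmtree`). A recent `.scan_pid` keeps the batch alive, which covers the case of two front scans started in quick succession.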

@pedropombeiro
Collaborator Author

I also tested with OCR key, but OCR also does not seem to work, maybe due to this error:

@PhilippMundhenk that's strange, because OCR key is mapped to scan_front(log, device, ["--mode=True Gray"]), and the logs you mention show a rear scan 🤔

I also do not observe any more hanging.

🎉

I (accidentally) noticed that aborting a started scan leads to some issues. Logs (scanning stopped at scanner before first page being pulled in, second scan started before second line):

I've tested the scenario locally and pushed a fix for it, which also logs what happened.

@PhilippMundhenk
Owner

PhilippMundhenk commented Oct 9, 2024

OCR: Oh ok, misunderstanding. I thought I could add rear pages and it would still run through OCR. I started with scantoocr and then ran scantoemail for rear pages. I think we should just add OCR to front and front and rear scans on the scantofile and scantoemail buttons, if OCR variables are set. This would be backward compatible behavior.
I tested only front pages with scantoocr (my button calls it "Text") now and I don't see any difference in behavior. No OCR is being triggered.

- Scanning front to batch /tmp/2024-10-09-18-22-57/2024-10-09-18-22-57-front-page%04d.pnm
  DEBUG: Executing command: ['scanimage', '-l', '0', '-t', '0', '-x', '215', '-y', '297', '--format=pnm', '--mode=True Gray', '--resolution=300', '--batch=/tmp/2024-10-09-18-22-57/2024-10-09-18-22-57-front-page%04d.pnm'], kwargs={'check': True}
scanimage: rounded value of br-x from 215 to 211.881
scanimage: rounded value of br-y from 297 to 296.973
Scanning infinity pages, incrementing by 1, numbering from 1
Scanning page 1
Scanned page 1. (scanner status = 5)
Scanning page 2
Scanned page 2. (scanner status = 5)
Scanning page 3
scanimage: sane_start: Document feeder out of documents
Batch terminated, 2 pages scanned
  front side: INFO: Waiting to start conversion process for 2024-10-09-18-22-57 in process with PID 98
  front side: Waiting for 2 minutes before starting file conversion for 2024-10-09-18-22-57
  front side: converting to PDF for 2024-10-09-18-22-57...
  DEBUG: Executing command: ['gm', 'convert', '/tmp/2024-10-09-18-22-57/2024-10-09-18-22-57-front-page0001.pnm', '/tmp/2024-10-09-18-22-57/2024-10-09-18-22-57-front-page0002.pnm', '/tmp/2024-10-09-18-22-57/2024-10-09-18-22-57.pdf'], kwargs={'check': True}
  DEBUG: Moving /tmp/2024-10-09-18-22-57/2024-10-09-18-22-57.pdf to /scans/2024-10-09-18-22-57.pdf
  INFO: SSH environment variables not set, skipping inotify trigger.
  INFO: TELEGRAM_TOKEN or TELEGRAM_CHATID environment variables not set, skipping Telegram trigger.

Aborts: I noticed now that it seems to be the scan command that is hanging. There is probably little we can do about that, at least not now. Same behavior in master today. Never had issues with this though, so maybe that is not even a practical situation.

@pedropombeiro
Collaborator Author

I think we should just add OCR to front and front and rear scans on the scantofile and scantoemail buttons, if OCR variables are set. This would be backward compatible behavior.

I'm not sure I follow. Right now we're calling OCR for any document that has finished scanning - regardless of which buttons were used. This seems logical, no?

I tested only front pages with scantoocr (my button calls it "Text") now and I don't see any difference in behavior. No OCR is being triggered

@PhilippMundhenk I don't see the following lines on your log. Were they present?

    print(f"  {side} side: Conversion and post-processing for finished.")
    print("-----------------------------------")

@PhilippMundhenk
Owner

Yes, that would be the ideal behavior, indeed.

Nope, never saw those lines...

@pedropombeiro
Collaborator Author

Nope, never saw those lines...

@PhilippMundhenk I wonder if the OCR variables are present at that point. I tested in the REPL inside the container and the check seems to work as expected. Would you be able to add some prints to your script to see what is happening?
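
For reference, a check of this kind can be sketched as below. The variable names `OCR_SERVER`, `OCR_PORT` and `OCR_PATH` come from this thread. The helper `ocr_settings`, its return shape and its log message are assumptions for illustration and may differ from what scanner.py actually does.

```python
import os


def ocr_settings(env=None):
    """Return the OCR settings as a dict, or None if any variable is unset.

    Hypothetical helper: reports which variables are missing instead of
    failing, so debug output works even when a value is None.
    """
    env = os.environ if env is None else env
    names = ("OCR_SERVER", "OCR_PORT", "OCR_PATH")
    values = {name: env.get(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        print(f"  INFO: OCR disabled, missing variables: {', '.join(missing)}")
        return None
    return values
```

Printing through a helper like this avoids a `TypeError` from concatenating a string with `None` when one of the variables is not set.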

@PhilippMundhenk
Owner

Ok, so the issue is not within OCR, it is within the telegram notification. We never reach OCR:

    print("notifying...")
    notify(log, output_pdf_file, f"{job_name}.pdf ({side}) scanned")
    print("notified")

    print("cleaning...")
    clean_job_files(log, side, job_name)
    print("cleaned")

    print("OCRing...")
    # Check for OCR environment variables
    ocr_server = os.getenv("OCR_SERVER")
    ocr_port = os.getenv("OCR_PORT")
    ocr_path = os.getenv("OCR_PATH")

    print("OCR_SERVER: " + ocr_server)
    print("OCR_PORT: " + ocr_port)
    print("OCR_PATH: " + ocr_path)

log:

- Scanning rear to latest batch 2024-10-09-19-51-12
  rear side: Found front-side batch: 2024-10-09-19-51-12
  rear side: Read pid from /tmp/2024-10-09-19-51-12/.scan_pid, killing front processing job 69
  DEBUG: Executing command: ['scanimage', '-l', '0', '-t', '0', '-x', '215', '-y', '297', '--format=pnm', '--resolution=300', '--batch=/tmp/2024-10-09-19-51-12/2024-10-09-19-51-12-back-page%04d.pnm'], kwargs={'check': True}
scanimage: rounded value of br-x from 215 to 211.881
scanimage: rounded value of br-y from 297 to 296.973
Scanning infinity pages, incrementing by 1, numbering from 1
Scanning page 1
Scanned page 1. (scanner status = 5)
Scanning page 2
scanimage: sane_start: Document feeder out of documents
Batch terminated, 1 page scanned
  rear side: INFO: number of pages scanned: 1
  rear side: DEBUG: renamed 2024-10-09-19-51-12-front-page0001.pnm to index001-1-2024-10-09-19-51-12-front-page0001.pnm
  rear side: DEBUG: renamed 2024-10-09-19-51-12-back-page0001.pnm to index001-2-2024-10-09-19-51-12-back-page0001.pnm
  rear side: INFO: number of pages scanned: 1
  rear side: DEBUG: renamed 2024-10-09-19-51-12-front-page0001.pnm to index001-1-2024-10-09-19-51-12-front-page0001.pnm
  rear side: DEBUG: renamed 2024-10-09-19-51-12-back-page0001.pnm to index001-2-2024-10-09-19-51-12-back-page0001.pnm
  rear side: converting to PDF for 2024-10-09-19-51-12...
  DEBUG: Executing command: ['gm', 'convert', '/tmp/2024-10-09-19-51-12/index001-1-2024-10-09-19-51-12-front-page0001.pnm', '/tmp/2024-10-09-19-51-12/index001-2-2024-10-09-19-51-12-back-page0001.pnm', '/tmp/2024-10-09-19-51-12/2024-10-09-19-51-12.pdf'], kwargs={'check': True}
  Analyzing 2 pages in /tmp/2024-10-09-19-51-12/2024-10-09-19-51-12.pdf with threshold 0.3%
    Page 1: delete (ink coverage: 0.01%)
    Page 2: delete (ink coverage: 0.01%)
  DEBUG: Executing command: ['/usr/bin/pdftk', '/tmp/2024-10-09-19-51-12/2024-10-09-19-51-12.pdf', 'cat', 'output', '/tmp/2024-10-09-19-51-12/2024-10-09-19-51-12_noblank.pdf'], kwargs={'check': True}
  Removed 2 blank pages and saved as /tmp/2024-10-09-19-51-12/2024-10-09-19-51-12.pdf
  DEBUG: Moving /tmp/2024-10-09-19-51-12/2024-10-09-19-51-12.pdf to /scans/2024-10-09-19-51-12.pdf
notifying...
  INFO: SSH environment variables not set, skipping inotify trigger.
  INFO: TELEGRAM_TOKEN or TELEGRAM_CHATID environment variables not set, skipping Telegram trigger.

@PhilippMundhenk
Owner

Well, yeah, makes sense. We actually exit(1) there, rather than returning :)

made sure scan doesn't exit if telegram not found
@pedropombeiro
Collaborator Author

Well, yeah, makes sense. We actually exit(1) there, rather than returning :)

😄 yeah, that would do it!
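
The difference between exiting and returning can be illustrated with a minimal sketch. The environment variable names and the log message are taken from the logs above. `notify_telegram`, the injected `send` callable and `post_process` are hypothetical stand-ins, not the functions in scanner.py.

```python
import os


def notify_telegram(message, send=None):
    """Send a Telegram notification if configured; return True if sent.

    `send` is a hypothetical callable (token, chat_id, message) standing in
    for the real HTTP request.
    """
    token = os.getenv("TELEGRAM_TOKEN")
    chat_id = os.getenv("TELEGRAM_CHATID")
    if not token or not chat_id:
        print("  INFO: TELEGRAM_TOKEN or TELEGRAM_CHATID environment "
              "variables not set, skipping Telegram trigger.")
        # Return instead of calling exit(1): a missing optional integration
        # must not stop the remaining post-processing steps.
        return False
    if send is not None:
        send(token, chat_id, message)
    return True


def post_process(job_name):
    """Run the post-scan steps in order and report which ones were reached."""
    steps = []
    notify_telegram(f"{job_name}.pdf scanned")
    steps.append("notified")
    steps.append("cleaned")  # placeholder for cleaning up job files
    steps.append("ocr")      # placeholder for the OCR call
    return steps
```

With an `exit(1)` in the skip branch, the process would end inside `notify_telegram` and the cleanup and OCR steps would never run, which matches the log ending right after the Telegram message.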

@PhilippMundhenk
Owner

OCR: Something still off, but I'm looking into it.

Empty pages: Are you sure that removal of empty pages works? I just scanned a bunch of empties, it also logs that the pages have been removed, but I receive a PDF with two empty pages.

@pedropombeiro
Collaborator Author

Empty pages: Are you sure that removal of empty pages works? I just scanned a bunch of empties, it also logs that the pages have been removed, but I receive a PDF with two empty pages.

Is it a document consisting of all empties? If so, I've noticed that the file comes out as original, but I don't think that's necessarily bad, since that's not a usual scenario, and at least it allows you to see what is wrong with the document, instead of receiving a PDF with zero pages. I've tried scanning pages with some empty backs, and it worked as expected.
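
The behavior described here, dropping blank pages but keeping the original when every page is blank, can be sketched as a small selection function. `pages_to_keep` is a hypothetical name, and the exact comparison against the threshold is an assumption. The 0.3% default mirrors the value shown in the logs.

```python
def pages_to_keep(ink_coverage, threshold=0.3):
    """Return the 1-based page numbers to keep.

    ink_coverage holds one ink-coverage percentage per page. Pages below
    the threshold are treated as blank. If every page is blank, all pages
    are kept so the result is never a zero-page PDF.
    """
    keep = [
        page
        for page, coverage in enumerate(ink_coverage, start=1)
        if coverage >= threshold
    ]
    if not keep:
        # All pages blank: fall back to the unmodified document.
        return list(range(1, len(ink_coverage) + 1))
    return keep
```

For the two-page scan in the log above (0.01% coverage on both pages) this returns both pages, while a document with only some blank backs loses just those pages.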

@PhilippMundhenk
Owner

Ah ok, yes, indeed, special case. I was too lazy to put paper in and just scanned a bunch of nothing. We can leave it like that.

OCR: Works!!

FTP: It tries to upload, although no variable set. I changed the checking condition.

Owner

@PhilippMundhenk PhilippMundhenk left a comment

lgtm

Owner

Ah, just now I said in #53 that 1) there is not an issue, but here it is. If we remove the shell files altogether, we will also need to make sure the web interface is directly calling the python scripts...

Owner

We probably don't need any of the php cleanups, once #32 is in...

Owner

Wow! This is really neat! Much easier to handle and for users to adapt

Owner

I feel we could make it even simpler, by getting logging and some formalities out of the way, but this is really nitpicking. Let's keep some todos for the future ;)

@PhilippMundhenk
Owner

PhilippMundhenk commented Oct 9, 2024

Oh, more of a "note-to-self": One thing I noticed in OCR is that the files are now massive: 33MB for a 14MB input. They used to come out much smaller. Not sure why that is...

I will take a look at that another day. But this can be merged into dev, so that we can get started on the web UI integration...

@PhilippMundhenk PhilippMundhenk merged commit 0cb8ebe into development Oct 11, 2024
1 check passed
@pedropombeiro pedropombeiro deleted the pedropombeiro/convert-to-python3 branch October 11, 2024 16:30