Training an existing Tesseract model ara #406

Open
Abdlrhman00 opened this issue Nov 19, 2024 · 6 comments

Abdlrhman00 commented Nov 19, 2024

I am working on training the Tesseract OCR model for the Arabic language (ara) using a custom dataset focused on political content. The workflow involves generating .gt.txt and .tif files for the dataset and running the make training command. However, I am encountering issues during the training process that result in unexpected errors or incomplete logs.
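For reference, the make invocation behind the log below was along these lines; the variable values are reconstructed from the log, so the exact command may have differed:

# fine-tune the existing politcsar model on the custom line images
make training \
  MODEL_NAME=politcsar-1 \
  START_MODEL=politcsar \
  TESSDATA=/content/tesstrain/data/tessdata \
  GROUND_TRUTH_DIR=/content/ocr_training_data \
  MAX_ITERATIONS=10000 \
  LEARNING_RATE=0.0001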

log file:

You are using make version: 4.3
combine_tessdata -u /content/tesstrain/data/tessdata/politcsar.traineddata data/politcsar/politcsar-1
Extracting tessdata components from /content/tesstrain/data/tessdata/politcsar.traineddata
Wrote data/politcsar/politcsar-1.lstm
Wrote data/politcsar/politcsar-1.lstm-unicharset
Wrote data/politcsar/politcsar-1.lstm-recoder
Wrote data/politcsar/politcsar-1.version
Version:5.5.0
17:lstm:size=11594707, offset=192
21:lstm-unicharset:size=6055, offset=11594899
22:lstm-recoder:size=796, offset=11600954
23:version:size=5, offset=11601750
unicharset_extractor --output_unicharset "data/politcsar-1/my.unicharset" --norm_mode 3 "data/politcsar-1/all-gt"
Extracting unicharset from plain text file data/politcsar-1/all-gt
Wrote unicharset file data/politcsar-1/my.unicharset
merge_unicharsets data/politcsar/politcsar-1.lstm-unicharset data/politcsar-1/my.unicharset "data/politcsar-1/unicharset"
Loaded unicharset of size 88 from file data/politcsar/politcsar-1.lstm-unicharset
Loaded unicharset of size 49 from file data/politcsar-1/my.unicharset
Wrote unicharset file data/politcsar-1/unicharset.
python3 shuffle.py 0 "data/politcsar-1/all-lstmf"
python3 generate_eval_train.py data/politcsar-1/all-lstmf 0.90
combine_lang_model \
  --input_unicharset data/politcsar-1/unicharset \
  --script_dir data/langdata \
  --numbers data/politcsar-1/politcsar-1.numbers \
  --puncs data/politcsar-1/politcsar-1.punc \
  --words data/politcsar-1/politcsar-1.wordlist \
  --output_dir data \
  --pass_through_recoder --lang_is_rtl \
  --lang politcsar-1
Failed to read data from: data/politcsar-1/politcsar-1.wordlist
Failed to read data from: data/politcsar-1/politcsar-1.punc
Failed to read data from: data/politcsar-1/politcsar-1.numbers
Loaded unicharset of size 88 from file data/politcsar-1/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Warning: properties incomplete for index 16 = َ
Warning: properties incomplete for index 20 = ُ
Warning: properties incomplete for index 44 = ٍ
Warning: properties incomplete for index 48 = ّ
Warning: properties incomplete for index 65 = ِ
Warning: properties incomplete for index 66 = ْ
Warning: properties incomplete for index 69 = ً
Warning: properties incomplete for index 71 = ٌ
Config file is optional, continuing...
Failed to read data from: data/langdata/politcsar-1/politcsar-1.config
Created data/politcsar-1/politcsar-1.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/politcsar-1/politcsar-1.traineddata \
  --old_traineddata /content/tesstrain/data/tessdata/politcsar.traineddata \
  --continue_from data/politcsar/politcsar-1.lstm \
  --learning_rate 0.0001 \
  --model_output data/politcsar-1/checkpoints/politcsar-1 \
  --train_listfile data/politcsar-1/list.train \
  --eval_listfile data/politcsar-1/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/politcsar-1/training.log
Loaded file data/politcsar/politcsar-1.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 88 to 88!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx512:512, 1247232
  Fc88:88, 45144
Total weights = 1449208
Previous null char=2 mapped to 2
Continuing from data/politcsar/politcsar-1.lstm
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, mean rms=4.622%, delta=17.570%, BCER train=86.816%, BWER train=99.375%, skip ratio=0.000%, New best BCER = 86.816 wrote checkpoint.
2 Percent improvement time=200, best error was 100 @ 0
At iteration 200/200/200, mean rms=4.560%, delta=16.773%, BCER train=86.189%, BWER train=99.000%, skip ratio=0.000%, New best BCER = 86.189 wrote checkpoint.
2 Percent improvement time=300, best error was 100 @ 0
At iteration 300/300/300, mean rms=4.591%, delta=17.240%, BCER train=85.928%, BWER train=98.917%, skip ratio=0.000%, New best BCER = 85.928 wrote checkpoint.
2 Percent improvement time=400, best error was 100 @ 0
At iteration 400/400/400, mean rms=4.595%, delta=17.338%, BCER train=85.912%, BWER train=98.781%, skip ratio=0.000%, New best BCER = 85.912 wrote checkpoint.
2 Percent improvement time=500, best error was 100 @ 0
At iteration 500/500/500, mean rms=4.596%, delta=17.363%, BCER train=85.763%, BWER train=98.725%, skip ratio=0.000%, New best BCER = 85.763 wrote checkpoint.
Encoding of string failed! Failure bytes: 63 70 74 6c 65 73 6c 61 6b 55 67 68 33 79 79 41 79 79 67 75 65 53 61 6c 61 64 6e 79 70 61 79 70 6c 65
Can't encode transcription: 'cptleslakUgh3yyAyygueSaladnypayple' in language ''
2 Percent improvement time=600, best error was 100 @ 0
At iteration 600/600/601, mean rms=4.544%, delta=16.674%, BCER train=85.098%, BWER train=98.771%, skip ratio=0.167%, New best BCER = 85.098 wrote checkpoint.
2 Percent improvement time=600, best error was 86.816 @ 100
At iteration 700/700/701, mean rms=4.489%, delta=15.945%, BCER train=84.498%, BWER train=98.625%, skip ratio=0.143%, New best BCER = 84.498 wrote checkpoint.
2 Percent improvement time=400, best error was 85.912 @ 400
At iteration 800/800/801, mean rms=4.420%, delta=14.973%, BCER train=83.832%, BWER train=98.578%, skip ratio=0.125%, New best BCER = 83.832 wrote checkpoint.
2 Percent improvement time=400, best error was 85.763 @ 500
At iteration 900/900/901, mean rms=4.356%, delta=14.056%, BCER train=83.139%, BWER train=98.556%, skip ratio=0.111%, New best BCER = 83.139 wrote checkpoint.
2 Percent improvement time=300, best error was 84.498 @ 700
At iteration 1000/1000/1001, mean rms=4.316%, delta=13.352%, BCER train=82.203%, BWER train=98.550%, skip ratio=0.100%, New best BCER = 82.203 wrote checkpoint.
2 Percent improvement time=200, best error was 83.139 @ 900
At iteration 1100/1100/1101, mean rms=4.261%, delta=12.403%, BCER train=80.822%, BWER train=98.500%, skip ratio=0.100%, New best BCER = 80.822 wrote checkpoint.
2 Percent improvement time=200, best error was 82.203 @ 1000
At iteration 1200/1200/1201, mean rms=4.225%, delta=11.756%, BCER train=79.625%, BWER train=98.538%, skip ratio=0.100%, New best BCER = 79.625 wrote checkpoint.
2 Percent improvement time=200, best error was 80.822 @ 1100
At iteration 1300/1300/1301, mean rms=4.174%, delta=10.925%, BCER train=78.460%, BWER train=98.538%, skip ratio=0.100%, New best BCER = 78.460 wrote checkpoint.
2 Percent improvement time=200, best error was 79.625 @ 1200
At iteration 1400/1400/1401, mean rms=4.128%, delta=10.138%, BCER train=77.000%, BWER train=98.600%, skip ratio=0.100%, New best BCER = 77.000 wrote checkpoint.
2 Percent improvement time=200, best error was 78.46 @ 1300
At iteration 1500/1500/1501, mean rms=4.084%, delta=9.379%, BCER train=75.531%, BWER train=98.663%, skip ratio=0.100%, New best BCER = 75.531 wrote checkpoint.
2 Percent improvement time=200, best error was 77 @ 1400
At iteration 1600/1600/1601, mean rms=4.076%, delta=9.136%, BCER train=74.397%, BWER train=98.675%, skip ratio=0.000%, New best BCER = 74.397 wrote checkpoint.
2 Percent improvement time=200, best error was 75.531 @ 1500
At iteration 1700/1700/1701, mean rms=4.078%, delta=9.052%, BCER train=73.193%, BWER train=98.800%, skip ratio=0.000%, New best BCER = 73.193 wrote checkpoint.
2 Percent improvement time=200, best error was 74.397 @ 1600
At iteration 1800/1800/1801, mean rms=4.108%, delta=9.334%, BCER train=72.116%, BWER train=98.825%, skip ratio=0.000%, New best BCER = 72.116 wrote best model:data/politcsar-1/checkpoints/politcsar-1_72.116_1800_1800.checkpoint wrote checkpoint.
2 Percent improvement time=300, best error was 74.397 @ 1600
At iteration 1900/1900/1901, mean rms=4.138%, delta=9.698%, BCER train=71.278%, BWER train=98.863%, skip ratio=0.000%, New best BCER = 71.278 wrote checkpoint.
2 Percent improvement time=300, best error was 73.193 @ 1700
At iteration 2000/2000/2001, mean rms=4.169%, delta=10.148%, BCER train=70.623%, BWER train=98.905%, skip ratio=0.000%, New best BCER = 70.623 wrote checkpoint.
2 Percent improvement time=300, best error was 72.116 @ 1800
At iteration 2100/2100/2101, mean rms=4.193%, delta=10.556%, BCER train=69.676%, BWER train=98.943%, skip ratio=0.000%, New best BCER = 69.676 wrote best model:data/politcsar-1/checkpoints/politcsar-1_69.676_2100_2100.checkpoint wrote checkpoint.
2 Percent improvement time=200, best error was 70.623 @ 2000
At iteration 2200/2200/2201, mean rms=4.207%, delta=10.778%, BCER train=68.613%, BWER train=98.955%, skip ratio=0.000%, New best BCER = 68.613 wrote checkpoint.
2 Percent improvement time=300, best error was 70.623 @ 2000
At iteration 2300/2300/2301, mean rms=4.217%, delta=10.919%, BCER train=67.767%, BWER train=98.930%, skip ratio=0.000%, New best BCER = 67.767 wrote checkpoint.
2 Percent improvement time=300, best error was 69.676 @ 2100
At iteration 2400/2400/2401, mean rms=4.226%, delta=11.085%, BCER train=66.860%, BWER train=98.855%, skip ratio=0.000%, New best BCER = 66.860 wrote best model:data/politcsar-1/checkpoints/politcsar-1_66.860_2400_2400.checkpoint wrote checkpoint.
2 Percent improvement time=300, best error was 68.613 @ 2200
At iteration 2500/2500/2501, mean rms=4.237%, delta=11.271%, BCER train=65.895%, BWER train=98.655%, skip ratio=0.000%, New best BCER = 65.895 wrote checkpoint.
2 Percent improvement time=300, best error was 67.767 @ 2300
At iteration 2600/2600/2601, mean rms=4.238%, delta=11.391%, BCER train=64.969%, BWER train=98.593%, skip ratio=0.000%, New best BCER = 64.969 wrote checkpoint.
2 Percent improvement time=300, best error was 66.86 @ 2400
At iteration 2700/2700/2701, mean rms=4.246%, delta=11.543%, BCER train=64.146%, BWER train=98.443%, skip ratio=0.000%, New best BCER = 64.146 wrote best model:data/politcsar-1/checkpoints/politcsar-1_64.146_2700_2700.checkpoint wrote checkpoint.
2 Percent improvement time=300, best error was 65.895 @ 2500
At iteration 2800/2800/2801, mean rms=4.253%, delta=11.705%, BCER train=63.243%, BWER train=98.480%, skip ratio=0.000%, New best BCER = 63.243 wrote checkpoint.
2 Percent improvement time=300, best error was 64.969 @ 2600
At iteration 2900/2900/2901, mean rms=4.258%, delta=11.826%, BCER train=62.443%, BWER train=98.330%, skip ratio=0.000%, New best BCER = 62.443 wrote checkpoint.
2 Percent improvement time=400, best error was 64.969 @ 2600
At iteration 3000/3000/3001, mean rms=4.239%, delta=11.659%, BCER train=62.216%, BWER train=98.238%, skip ratio=0.000%, New best BCER = 62.216 wrote checkpoint.
2 Percent improvement time=400, best error was 64.146 @ 2700
At iteration 3100/3100/3101, mean rms=4.229%, delta=11.657%, BCER train=61.718%, BWER train=98.038%, skip ratio=0.000%, New best BCER = 61.718 wrote best model:data/politcsar-1/checkpoints/politcsar-1_61.718_3100_3100.checkpoint wrote checkpoint.
2 Percent improvement time=500, best error was 64.146 @ 2700
At iteration 3200/3200/3201, mean rms=4.222%, delta=11.635%, BCER train=61.589%, BWER train=97.975%, skip ratio=0.000%, New best BCER = 61.589 wrote checkpoint.
2 Percent improvement time=500, best error was 63.243 @ 2800
At iteration 3300/3300/3301, mean rms=4.221%, delta=11.745%, BCER train=60.819%, BWER train=97.875%, skip ratio=0.000%, New best BCER = 60.819 wrote checkpoint.
2 Percent improvement time=500, best error was 62.443 @ 2900
At iteration 3400/3400/3401, mean rms=4.219%, delta=11.792%, BCER train=60.257%, BWER train=97.825%, skip ratio=0.000%, New best BCER = 60.257 wrote checkpoint.
2 Percent improvement time=500, best error was 62.216 @ 3000
At iteration 3500/3500/3501, mean rms=4.217%, delta=11.818%, BCER train=59.833%, BWER train=97.838%, skip ratio=0.000%, New best BCER = 59.833 wrote checkpoint.
2 Percent improvement time=500, best error was 61.718 @ 3100
At iteration 3600/3600/3601, mean rms=4.201%, delta=11.625%, BCER train=59.687%, BWER train=97.763%, skip ratio=0.000%, New best BCER = 59.687 wrote best model:data/politcsar-1/checkpoints/politcsar-1_59.687_3600_3600.checkpoint wrote checkpoint.
2 Percent improvement time=500, best error was 61.589 @ 3200
At iteration 3700/3700/3701, mean rms=4.187%, delta=11.508%, BCER train=59.545%, BWER train=97.775%, skip ratio=0.000%, New best BCER = 59.545 wrote checkpoint.
2 Percent improvement time=600, best error was 61.589 @ 3200
At iteration 3800/3800/3801, mean rms=4.180%, delta=11.517%, BCER train=59.353%, BWER train=97.575%, skip ratio=0.000%, New best BCER = 59.353 wrote checkpoint.
2 Percent improvement time=700, best error was 61.589 @ 3200
At iteration 3900/3900/3901, mean rms=4.173%, delta=11.480%, BCER train=59.040%, BWER train=97.525%, skip ratio=0.000%, New best BCER = 59.040 wrote checkpoint.
2 Percent improvement time=700, best error was 60.819 @ 3300
At iteration 4000/4000/4001, mean rms=4.186%, delta=11.688%, BCER train=58.362%, BWER train=97.425%, skip ratio=0.000%, New best BCER = 58.362 wrote checkpoint.
2 Percent improvement time=700, best error was 60.257 @ 3400
At iteration 4100/4100/4101, mean rms=4.177%, delta=11.630%, BCER train=58.222%, BWER train=97.425%, skip ratio=0.000%, New best BCER = 58.222 wrote checkpoint.
2 Percent improvement time=800, best error was 60.257 @ 3400
At iteration 4200/4200/4201, mean rms=4.171%, delta=11.652%, BCER train=57.968%, BWER train=97.225%, skip ratio=0.000%, New best BCER = 57.968 wrote checkpoint.
At iteration 4300/4300/4301, mean rms=4.155%, delta=11.520%, BCER train=57.977%, BWER train=97.213%, skip ratio=0.000%, New worst BCER = 57.977 wrote checkpoint.
At iteration 4400/4400/4401, mean rms=4.142%, delta=11.414%, BCER train=58.271%, BWER train=97.188%, skip ratio=0.000%, New worst BCER = 58.271 wrote checkpoint.
At iteration 4500/4500/4501, mean rms=4.129%, delta=11.318%, BCER train=58.310%, BWER train=97.163%, skip ratio=0.000%, New worst BCER = 58.310 wrote checkpoint.
At iteration 4600/4600/4601, mean rms=4.132%, delta=11.438%, BCER train=58.047%, BWER train=97.025%, skip ratio=0.000%, New worst BCER = 58.047 wrote checkpoint.
2 Percent improvement time=1100, best error was 59.687 @ 3600
At iteration 4700/4700/4701, mean rms=4.135%, delta=11.579%, BCER train=57.567%, BWER train=96.863%, skip ratio=0.000%, New best BCER = 57.567 wrote best model:data/politcsar-1/checkpoints/politcsar-1_57.567_4700_4700.checkpoint wrote checkpoint.
2 Percent improvement time=1100, best error was 59.545 @ 3700
At iteration 4800/4800/4801, mean rms=4.117%, delta=11.375%, BCER train=57.456%, BWER train=96.850%, skip ratio=0.000%, New best BCER = 57.456 wrote checkpoint.
2 Percent improvement time=1100, best error was 59.353 @ 3800
At iteration 4900/4900/4901, mean rms=4.117%, delta=11.415%, BCER train=57.317%, BWER train=96.825%, skip ratio=0.000%, New best BCER = 57.317 wrote checkpoint.
At iteration 5000/5000/5001, mean rms=4.096%, delta=11.192%, BCER train=57.385%, BWER train=96.875%, skip ratio=0.000%, New worst BCER = 57.385 wrote checkpoint.
2 Percent improvement time=1300, best error was 59.353 @ 3800
At iteration 5100/5100/5101, mean rms=4.093%, delta=11.198%, BCER train=57.282%, BWER train=96.825%, skip ratio=0.000%, New best BCER = 57.282 wrote checkpoint.
2 Percent improvement time=1300, best error was 59.04 @ 3900
At iteration 5200/5200/5201, mean rms=4.087%, delta=11.186%, BCER train=56.956%, BWER train=96.863%, skip ratio=0.000%, New best BCER = 56.956 wrote checkpoint.
At iteration 5300/5300/5301, mean rms=4.087%, delta=11.209%, BCER train=56.998%, BWER train=96.875%, skip ratio=0.000%, New worst BCER = 56.998 wrote checkpoint.
2 Percent improvement time=1500, best error was 59.04 @ 3900
At iteration 5400/5400/5401, mean rms=4.084%, delta=11.203%, BCER train=56.718%, BWER train=96.650%, skip ratio=0.000%, New best BCER = 56.718 wrote checkpoint.
At iteration 5500/5500/5501, mean rms=4.079%, delta=11.204%, BCER train=56.883%, BWER train=96.608%, skip ratio=0.000%, New worst BCER = 56.883 wrote checkpoint.
At iteration 5600/5600/5601, mean rms=4.077%, delta=11.202%, BCER train=57.073%, BWER train=96.758%, skip ratio=0.000%, New worst BCER = 57.073 wrote checkpoint.
At iteration 5700/5700/5701, mean rms=4.075%, delta=11.159%, BCER train=57.147%, BWER train=96.846%, skip ratio=0.000%, New worst BCER = 57.147 wrote checkpoint.
At iteration 5800/5800/5801, mean rms=4.078%, delta=11.231%, BCER train=56.944%, BWER train=96.883%, skip ratio=0.000%, New worst BCER = 56.944 wrote checkpoint.
At iteration 5900/5900/5901, mean rms=4.067%, delta=11.132%, BCER train=56.950%, BWER train=96.858%, skip ratio=0.000%, New worst BCER = 56.950 wrote checkpoint.
2 Percent improvement time=2100, best error was 59.04 @ 3900
At iteration 6000/6000/6001, mean rms=4.071%, delta=11.248%, BCER train=56.439%, BWER train=96.719%, skip ratio=0.000%, New best BCER = 56.439 wrote checkpoint.
2 Percent improvement time=2200, best error was 59.04 @ 3900
At iteration 6100/6100/6101, mean rms=4.062%, delta=11.188%, BCER train=56.386%, BWER train=96.702%, skip ratio=0.000%, New best BCER = 56.386 wrote checkpoint.
2 Percent improvement time=2200, best error was 58.362 @ 4000
At iteration 6200/6200/6201, mean rms=4.052%, delta=11.070%, BCER train=56.329%, BWER train=96.652%, skip ratio=0.000%, New best BCER = 56.329 wrote checkpoint.
2 Percent improvement time=2200, best error was 58.222 @ 4100
At iteration 6300/6300/6301, mean rms=4.045%, delta=11.027%, BCER train=56.219%, BWER train=96.640%, skip ratio=0.000%, New best BCER = 56.219 wrote checkpoint.
2 Percent improvement time=2300, best error was 58.222 @ 4100
At iteration 6400/6400/6401, mean rms=4.042%, delta=11.059%, BCER train=56.152%, BWER train=96.752%, skip ratio=0.000%, New best BCER = 56.152 wrote checkpoint.
2 Percent improvement time=2400, best error was 58.222 @ 4100
At iteration 6500/6500/6501, mean rms=4.041%, delta=10.997%, BCER train=56.102%, BWER train=96.669%, skip ratio=0.000%, New best BCER = 56.102 wrote checkpoint.
2 Percent improvement time=2500, best error was 58.222 @ 4100
At iteration 6600/6600/6601, mean rms=4.040%, delta=11.056%, BCER train=56.031%, BWER train=96.457%, skip ratio=0.000%, New best BCER = 56.031 wrote checkpoint.
At iteration 6700/6700/6701, mean rms=4.026%, delta=10.939%, BCER train=56.150%, BWER train=96.319%, skip ratio=0.000%, New worst BCER = 56.150 wrote checkpoint.
At iteration 6800/6800/6801, mean rms=4.025%, delta=10.936%, BCER train=56.436%, BWER train=96.132%, skip ratio=0.000%, New worst BCER = 56.436 wrote checkpoint.
At iteration 6900/6900/6901, mean rms=4.028%, delta=11.053%, BCER train=56.436%, BWER train=96.219%, skip ratio=0.000%, New worst BCER = 56.436 wrote checkpoint.
At iteration 7000/7000/7001, mean rms=4.032%, delta=11.141%, BCER train=56.624%, BWER train=96.246%, skip ratio=0.000%, New worst BCER = 56.624 wrote checkpoint.
At iteration 7100/7100/7101, mean rms=4.034%, delta=11.152%, BCER train=56.971%, BWER train=96.263%, skip ratio=0.000%, New worst BCER = 56.971 wrote checkpoint.
At iteration 7200/7200/7201, mean rms=4.037%, delta=11.213%, BCER train=57.245%, BWER train=96.213%, skip ratio=0.000%, New worst BCER = 57.245 wrote checkpoint.
At iteration 7300/7300/7301, mean rms=4.046%, delta=11.330%, BCER train=57.098%, BWER train=96.025%, skip ratio=0.000%, New worst BCER = 57.098 wrote checkpoint.
At iteration 7400/7400/7401, mean rms=4.046%, delta=11.389%, BCER train=56.897%, BWER train=96.025%, skip ratio=0.000%, New worst BCER = 56.897 wrote checkpoint.
At iteration 7500/7500/7501, mean rms=4.050%, delta=11.573%, BCER train=56.273%, BWER train=96.000%, skip ratio=0.000%, New worst BCER = 56.273 wrote checkpoint.
2 Percent improvement time=3400, best error was 57.968 @ 4200
At iteration 7600/7600/7601, mean rms=4.046%, delta=11.590%, BCER train=55.753%, BWER train=95.913%, skip ratio=0.000%, New best BCER = 55.753 wrote best model:data/politcsar-1/checkpoints/politcsar-1_55.753_7600_7600.checkpoint wrote checkpoint.
2 Percent improvement time=3500, best error was 57.968 @ 4200
At iteration 7700/7700/7701, mean rms=4.049%, delta=11.634%, BCER train=55.590%, BWER train=95.875%, skip ratio=0.000%, New best BCER = 55.590 wrote checkpoint.
At iteration 7800/7800/7801, mean rms=4.039%, delta=11.527%, BCER train=55.735%, BWER train=95.938%, skip ratio=0.000%, New worst BCER = 55.735 wrote checkpoint.
2 Percent improvement time=3100, best error was 57.456 @ 4800
At iteration 7900/7900/7901, mean rms=4.040%, delta=11.601%, BCER train=55.366%, BWER train=95.838%, skip ratio=0.000%, New best BCER = 55.366 wrote checkpoint.
2 Percent improvement time=3100, best error was 57.317 @ 4900
At iteration 8000/8000/8001, mean rms=4.034%, delta=11.577%, BCER train=55.312%, BWER train=95.850%, skip ratio=0.000%, New best BCER = 55.312 wrote checkpoint.
2 Percent improvement time=2900, best error was 56.956 @ 5200
At iteration 8100/8100/8101, mean rms=4.035%, delta=11.673%, BCER train=54.902%, BWER train=95.888%, skip ratio=0.000%, New best BCER = 54.902 wrote checkpoint.
2 Percent improvement time=3000, best error was 56.956 @ 5200
At iteration 8200/8200/8201, mean rms=4.030%, delta=11.635%, BCER train=54.878%, BWER train=95.900%, skip ratio=0.000%, New best BCER = 54.878 wrote checkpoint.
2 Percent improvement time=3100, best error was 56.956 @ 5200
At iteration 8300/8300/8301, mean rms=4.022%, delta=11.623%, BCER train=54.860%, BWER train=96.063%, skip ratio=0.000%, New best BCER = 54.860 wrote checkpoint.
At iteration 8400/8400/8401, mean rms=4.016%, delta=11.568%, BCER train=54.955%, BWER train=96.050%, skip ratio=0.000%, New worst BCER = 54.955 wrote checkpoint.
At iteration 8500/8500/8501, mean rms=4.012%, delta=11.523%, BCER train=55.176%, BWER train=96.063%, skip ratio=0.000%, New worst BCER = 55.176 wrote checkpoint.
At iteration 8600/8600/8601, mean rms=4.011%, delta=11.515%, BCER train=55.371%, BWER train=96.300%, skip ratio=0.000%, New worst BCER = 55.371 wrote checkpoint.
At iteration 8700/8700/8701, mean rms=4.010%, delta=11.554%, BCER train=55.567%, BWER train=96.500%, skip ratio=0.000%, New worst BCER = 55.567 wrote checkpoint.
At iteration 8800/8800/8801, mean rms=4.017%, delta=11.676%, BCER train=55.241%, BWER train=96.525%, skip ratio=0.000%, New worst BCER = 55.241 wrote checkpoint.
At iteration 8900/8900/8901, mean rms=4.010%, delta=11.535%, BCER train=55.442%, BWER train=96.513%, skip ratio=0.000%, New worst BCER = 55.442 wrote checkpoint.
At iteration 9000/9000/9001, mean rms=4.009%, delta=11.507%, BCER train=55.161%, BWER train=96.425%, skip ratio=0.000%, New worst BCER = 55.161 wrote checkpoint.
At iteration 9100/9100/9101, mean rms=4.001%, delta=11.401%, BCER train=54.947%, BWER train=96.275%, skip ratio=0.000%, New worst BCER = 54.947 wrote checkpoint.
2 Percent improvement time=3800, best error was 56.718 @ 5400
At iteration 9200/9200/9201, mean rms=4.002%, delta=11.462%, BCER train=54.652%, BWER train=96.250%, skip ratio=0.000%, New best BCER = 54.652 wrote checkpoint.
2 Percent improvement time=3900, best error was 56.718 @ 5400
At iteration 9300/9300/9301, mean rms=3.999%, delta=11.436%, BCER train=54.544%, BWER train=96.263%, skip ratio=0.000%, New best BCER = 54.544 wrote checkpoint.
At iteration 9400/9400/9401, mean rms=4.004%, delta=11.454%, BCER train=54.635%, BWER train=96.200%, skip ratio=0.000%, New worst BCER = 54.635 wrote checkpoint.
2 Percent improvement time=4100, best error was 56.718 @ 5400
At iteration 9500/9500/9501, mean rms=3.995%, delta=11.376%, BCER train=54.500%, BWER train=96.238%, skip ratio=0.000%, New best BCER = 54.500 wrote checkpoint.
Encoding of string failed! Failure bytes: 63 70 74 6c 65 73 6c 61 6b 55 67 68 33 79 79 41 79 79 67 75 65 53 61 6c 61 64 6e 79 70 61 79 70 6c 65
Can't encode transcription: 'cptleslakUgh3yyAyygueSaladnypayple' in language ''
2 Percent improvement time=3400, best error was 56.329 @ 6200
At iteration 9600/9600/9602, mean rms=3.989%, delta=11.350%, BCER train=54.226%, BWER train=95.925%, skip ratio=0.100%, New best BCER = 54.226 wrote checkpoint.
2 Percent improvement time=2100, best error was 55.753 @ 7600
At iteration 9700/9700/9702, mean rms=3.984%, delta=11.341%, BCER train=53.628%, BWER train=95.700%, skip ratio=0.100%, New best BCER = 53.628 wrote best model:data/politcsar-1/checkpoints/politcsar-1_53.628_9700_9700.checkpoint wrote checkpoint.
2 Percent improvement time=1900, best error was 55.366 @ 7900
At iteration 9800/9800/9802, mean rms=3.982%, delta=11.387%, BCER train=53.364%, BWER train=95.638%, skip ratio=0.100%, New best BCER = 53.364 wrote checkpoint.
2 Percent improvement time=1900, best error was 55.312 @ 8000
At iteration 9900/9900/9902, mean rms=3.975%, delta=11.401%, BCER train=53.065%, BWER train=95.550%, skip ratio=0.100%, New best BCER = 53.065 wrote checkpoint.
2 Percent improvement time=2000, best error was 55.312 @ 8000
At iteration 10000/10000/10002, mean rms=3.971%, delta=11.454%, BCER train=52.917%, BWER train=95.463%, skip ratio=0.100%, New best BCER = 52.917 wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 52.917

This result is after more than 20,000 iterations.

I would be more than happy if someone could help me with this.

Abdlrhman00 (Author):

Also, here is a GitHub repo containing all the files used to generate the dataset:
https://github.com/Abdlrhman00/OCR_train_ara

stweil (Collaborator) commented Nov 20, 2024

I see no critical errors in your training, but it did not run long enough. Increase --max_iterations or use --epochs with a higher value.

And fix or remove the GT line with 'cptleslakUgh3yyAyygueSaladnypayple'.
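A quick way to locate the offending ground-truth file, assuming the .gt.txt files live under /content/ocr_training_data (adjust the path if not):

# list every ground-truth file containing the broken transcription
grep -rl "cptleslakUgh3yyAyygueSaladnypayple" /content/ocr_training_data

The matching .gt.txt can then be corrected, or deleted together with its .tif, and the corresponding .lstmf regenerated.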

Abdlrhman00 (Author):

Thank you for your feedback and suggestions. I appreciate your guidance!

I’ve noticed that while there is some progress in BCER and BWER, the BWER in particular is improving very slowly. Considering the results after 20,000 iterations, it seems like achieving good accuracy will be quite challenging at this rate.

Do you think adjusting the dataset (e.g., increasing its size or refining the ground truth) or tweaking training parameters like learning_rate or --max_iterations could help accelerate improvement?
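As far as I understand, both parameters can be passed straight through tesstrain's make, e.g. (values purely illustrative, not recommendations):

make training \
  MODEL_NAME=politcsar-1 \
  START_MODEL=politcsar \
  MAX_ITERATIONS=100000 \
  LEARNING_RATE=0.0001

EPOCHS can be set instead of MAX_ITERATIONS to express the budget in passes over the training lines.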

Looking forward to your thoughts! Thanks again for your time and help.

Abdlrhman00 (Author) commented Nov 20, 2024

Can you check this:

File /content/ocr_training_data/133.lstmf line 0 :
Mean rms=2.847%, delta=6.185%, train=29.879%(83.13%), skip ratio=0%
Iteration 46338: GROUND  TRUTH : هرشابم نم هعنمو هدلاو ضرتعاف ضرالا ثرحل مدق
Iteration 46338: ALIGNED TRUTH : هرشب منموا ضرتا ضرالا ثرحل مدقق
Iteration 46338: BEST OCR TEXT : هرشابم نم هنمولاو ضرتاف ضرالا يلل هال
File /content/ocr_training_data/45.lstmf line 0 :
Mean rms=2.847%, delta=6.188%, train=29.886%(83.093%), skip ratio=0%
Iteration 46339: GROUND  TRUTH : فوقوملا نا هيلوالا تايرحتلا تتبثاو هتزوحب تناك يتلا
Iteration 46339: ALIGNED TRUTH : فوقمناالايرتتتبثهزوحب تناك يتلا
Iteration 46339: BEST OCR TEXT : فوقمانا هيال تايتل تثا هتزوحب ت يلا
File /content/ocr_training_data/353.lstmf line 0 :
Mean rms=2.847%, delta=6.185%, train=29.891%(83.093%), skip ratio=0%
Iteration 46340: GROUND  TRUTH : هتاذ ردصملا راشاو هصتخملا هماعلا هباينلا فارشا تحت
Iteration 46340: ALIGNED TRUTH : هتذ دصماوهصخمهلا نلا فارشا تحت
Iteration 46340: BEST OCR TEXT : هت ردصملاراشوهصخملهاعلا هينلا هاشلا نمح
File /content/ocr_training_data/563.lstmf line 0 :
Mean rms=2.847%, delta=6.19%, train=29.903%(83.118%), skip ratio=0%
Iteration 46341: GROUND  TRUTH : حمست ال هيسفنلاو هيعامتجالا اهفورظ نا هفيضم هيلام
Iteration 46341: ALIGNED TRUTH : حمس هف هيمالا فورظ نا هفيضم هيلام
Iteration 46341: BEST OCR TEXT : حمستا هيسفلاو يعمجاا اهفورظ نا نم ها
File /content/ocr_training_data/882.lstmf line 0 :
Mean rms=2.847%, delta=6.19%, train=29.908%(83.105%), skip ratio=0%
Iteration 46342: GROUND  TRUTH : يمسملا بالا صخش يف هدحاو هرسا نم دارفا
Iteration 46342: ALIGNED TRUTH : يمسم بااصخش يهحاو هرسا نم داارفا
Iteration 46342: BEST OCR TEXT : يمسم بالاصخش يفهدحاو رام هاا
File /content/ocr_training_data/917.lstmf line 0 :
Mean rms=2.848%, delta=6.197%, train=29.904%(83.118%), skip ratio=0%
Iteration 46343: GROUND  TRUTH : هثلاثلا يف باش هذخاؤمب ءاثالثلا سما لوا هديدجلاب
Iteration 46343: ALIGNED TRUTH : هثا يبا هذخؤماثا سما لوا هديدجلاب
Iteration 46343: BEST OCR TEXT : هثثلا يف با هذخاؤمهاثاثا سما يالا هاملا
File /content/ocr_training_data/806.lstmf line 0 :
Mean rms=2.848%, delta=6.197%, train=29.909%(83.118%), skip ratio=0%
Iteration 46344: GROUND  TRUTH : تاعيقوت مضت هيعامج يرخاو هيدرفلا تاياكشلا نم تارشعلا
Iteration 46344: ALIGNED TRUTH : تايق ضت عم يرخودرااكشلا نم تارشعلا
Iteration 46344: BEST OCR TEXT : تاعيت ضت عمج يرخودرفااكشلان ن هاقا
File /content/ocr_training_data/364.lstmf line 0 :
Mean rms=2.848%, delta=6.198%, train=29.911%(83.118%), skip ratio=0%
Iteration 46345: GROUND  TRUTH : قح يف هرداصلاو هءاربلاب هيفانئتسالا ماكحالا هماعلا هباينلا
Iteration 46345: ALIGNED TRUTH : قح يفردص هءاربلانسامحالا هماعلا هباينلا
Iteration 46345: BEST OCR TEXT : قح فرداصلاو هءرلايفانئتسالا محالا هملا الا
File /content/ocr_training_data/300.lstmf line 0 :
Mean rms=2.848%, delta=6.195%, train=29.906%(83.118%), skip ratio=0%
Iteration 46346: GROUND  TRUTH : فاضاو اهيلع سنجلا هسراممو اهسبالم نم اهديرجت يلا
Iteration 46346: ALIGNED TRUTH : فاضا العسنجلهراماسبلمنم اهديرجت ييلا
Iteration 46346: BEST OCR TEXT : فاضاو الع سنجلا هسراماهبلم نما دم يالا
File /content/ocr_training_data/680.lstmf line 0 :
Mean rms=2.848%, delta=6.195%, train=29.905%(83.118%), skip ratio=0%
Iteration 46347: GROUND  TRUTH : اضرم يناعيو هنس رمعلا نم غلبي عباس روز
Iteration 46347: ALIGNED TRUTH : اضرميا سرعلنم غلبي عباسس ررز
Iteration 46347: BEST OCR TEXT : اضرم ينايو هنسرعلا نم لبي هام يفم
File /content/ocr_training_data/890.lstmf line 0 :
Mean rms=2.848%, delta=6.198%, train=29.924%(83.118%), skip ratio=0%
Iteration 46348: GROUND  TRUTH : قيقحتلاب هفلكملا ثحبلا هقرف ليحت نا رظتنملا نمو
Iteration 46348: ALIGNED TRUTH : قيقاب لاثحبقفيت نا رظتنملا نمو
Iteration 46348: BEST OCR TEXT : قيقلب لاثبلهقف ليت نا ررحملا نمم
File /content/ocr_training_data/688.lstmf line 0 :
Mean rms=2.848%, delta=6.196%, train=29.927%(83.118%), skip ratio=0%
Iteration 46349: GROUND  TRUTH : وضع ،« ينادجولا يفطصم تادافا يلا عامتسالاب تبلاط
Iteration 46349: ALIGNED TRUTH : وضع ،نجو يفطصم تاديلامتسالاب تببلاط
Iteration 46349: BEST OCR TEXT : وضع ، ندجوا فطصم تاد ي امتسالاب يملا
File /content/ocr_training_data/800.lstmf line 0 :
Mean rms=2.847%, delta=6.191%, train=29.933%(83.13%), skip ratio=0%
Iteration 46350: GROUND  TRUTH : هيناثلا هقطنملا يلا امهملسيو امهفقوي نا لبق اهيلع
Iteration 46350: ALIGNED TRUTH : هيا هطنما امملس اهقوي نا لبق اهيلع
Iteration 46350: BEST OCR TEXT : هينالا هطنم لا امهملسومهقوي نا نيحف يل
File /content/ocr_training_data/203.lstmf line 0 :
Mean rms=2.847%, delta=6.192%, train=29.935%(83.143%), skip ratio=0%
Iteration 46351: GROUND  TRUTH : ناريوديا هعامج سيئر تارب نا قبس هتاذ ميلقالاب
Iteration 46351: ALIGNED TRUTH : ناريويهعمج سيئتباقبس هتاذ ملاقالاب
Iteration 46351: BEST OCR TEXT : ناريويهعامج سيئ ترب نا قبس مي هاملا
File /content/ocr_training_data/579.lstmf line 0 :
Mean rms=2.847%, delta=6.194%, train=29.919%(83.118%), skip ratio=0%
Iteration 46352: GROUND  TRUTH : افرحنم يضاملا نينثالا حابص ريداكا نما هيالو حلاصم
Iteration 46352: ALIGNED TRUTH : افحن يضالاثالابصرياك نما هيالو ححلاصصم
Iteration 46352: BEST OCR TEXT : افحن يضالا يثالا حابص ريداك نما هببلا هدو
File /content/ocr_training_data/339.lstmf line 0 :
Mean rms=2.848%, delta=6.198%, train=29.926%(83.105%), skip ratio=0%
Iteration 46353: GROUND  TRUTH : صاخشا هثالث فاقيا نع ترفسا قشلا نيع هينمالا
Iteration 46353: ALIGNED TRUTH : صاخشا لثفياع تفس قشلا نييعع هينمالا
Iteration 46353: BEST OCR TEXT : صاخشا هلثفقي نع تفس قشلا نامم ينلا
File /content/ocr_training_data/461.lstmf line 0 :
Mean rms=2.847%, delta=6.195%, train=29.922%(83.068%), skip ratio=0%
Iteration 46354: GROUND  TRUTH : رداغيو مهنم صلختي نا دعب هتحارل ليبسلا وه
Iteration 46354: ALIGNED TRUTH : ردغيمهنم صخ ند حارل ليببسسلا ووهه
Iteration 46354: BEST OCR TEXT : رداغيومهنم صختي نا د حال لرلا نام
File /content/ocr_training_data/440.lstmf line 0 :
Mean rms=2.847%, delta=6.196%, train=29.931%(83.08%), skip ratio=0%
Iteration 46355: GROUND  TRUTH : يتلا هالابماللا ماما جاجحلا نا اهتاذ رداصملا تدازو
Iteration 46355: ALIGNED TRUTH : يت هالابماللا مااجحا اذ رداصملا تدازو
Iteration 46355: BEST OCR TEXT : يتلا هلابماللا ماما جاجحا ا اهتذ راصلا يم
File /content/ocr_training_data/388.lstmf line 0 :
Mean rms=2.848%, delta=6.196%, train=29.919%(83.068%), skip ratio=0%
Iteration 46356: GROUND  TRUTH : اهلوانت كوكيص هبجو هيف ببستملا نوكي نا حجري
Iteration 46356: ALIGNED TRUTH : اهانتكوكيصهجو فببسمانوكي نا حجري
Iteration 46356: BEST OCR TEXT : اهانت كوكيصهبجو يفببسملا وكي ا ننب
File /content/ocr_training_data/166.lstmf line 0 :
Mean rms=2.847%, delta=6.192%, train=29.899%(83.068%), skip ratio=0%
Iteration 46357: GROUND  TRUTH : نيينيرملا هعطاقمب يضاملا سيمخلا رهظ هطيبال يحب نطقي
Iteration 46357: ALIGNED TRUTH : نييلاطقبيام سخل رهظ هطيبال يحب نطقي
Iteration 46357: BEST OCR TEXT : نييلا هطاقمب ياملا سخا رهظ هطبال يبب هيات
File /content/ocr_training_data/55.lstmf line 0 :
Mean rms=2.847%, delta=6.186%, train=29.909%(83.068%), skip ratio=0%
Iteration 46358: GROUND  TRUTH : هجاو يذلا هكرشلل يرادالا سلجملا وضعو يعامجلا سلجملا
Iteration 46358: ALIGNED TRUTH : هجاوي هرلل رالسجملوضعو يعامجلا سلجملا
Iteration 46358: BEST OCR TEXT : هجاويل هكرلليردالا سلجماوضعو يعاجلا هاملا
File /content/ocr_training_data/801.lstmf line 0 :
Mean rms=2.847%, delta=6.184%, train=29.912%(83.093%), skip ratio=0%
Iteration 46359: GROUND  TRUTH : هيعامتجالاو هينهملا هايحلا يف ءالزنلا جامدا يلع دعاست
Iteration 46359: ALIGNED TRUTH : هيعتلاونهايح  ءلاجامدا يلع دععاست
Iteration 46359: BEST OCR TEXT : هيعمتالاوينه يحلا ف ءانلا جمدا ياا نمم
File /content/ocr_training_data/532.lstmf line 0 :
Mean rms=2.848%, delta=6.19%, train=29.917%(83.105%), skip ratio=0%
Iteration 46360: GROUND  TRUTH : هصاخلا تايلالا نم هعومجم ءافتخاب ئجوف هنا هرعشا
Iteration 46360: ALIGNED TRUTH : هصاخلتالا وممءتخب ئجوف هنا هرعشا
Iteration 46360: BEST OCR TEXT : هصاخلتيلا م هوممءافتخب نجوف هاا هيتا
File /content/ocr_training_data/325.lstmf line 0 :
Mean rms=2.848%, delta=6.191%, train=29.923%(83.105%), skip ratio=0%
Iteration 46361: GROUND  TRUTH : ردحتيو لمعلا نع لطاع وهو هيف هبتشملا نا
Iteration 46361: ALIGNED TRUTH : ردحتو معا ع طع وهو هيف هبتشملا ننا
Iteration 46361: BEST OCR TEXT : ردحتو لمعل نع طاع هو هيف هيالا نفا
File /content/ocr_training_data/681.lstmf line 0 :
Mean rms=2.847%, delta=6.19%, train=29.912%(83.08%), skip ratio=0%
Iteration 46362: GROUND  TRUTH : نهر هتعضو يتلا هيئاضقلا هطرشلا حلاصمل هميلست متو
Iteration 46362: ALIGNED TRUTH : نهر تعض لااضق هطشا حاصمل هميلسست ممتو
Iteration 46362: BEST OCR TEXT : نهر تعض تلا هيئاق هطرشا حلاصمل هيام يا
File /content/ocr_training_data/73.lstmf line 0 :
Mean rms=2.847%, delta=6.188%, train=29.9%(83.08%), skip ratio=0%
Iteration 46363: GROUND  TRUTH : يدمحملا يحلا عبسلا نيع نماب هينمالا حلاصملا تنكمت
Iteration 46363: ALIGNED TRUTH : يدمحليلاعبلاني ناينمالا حلاصملا تنكممت
Iteration 46363: BEST OCR TEXT : يدمحملايل عبسلا نيع نماهينالا عاصلا هحم
File /content/ocr_training_data/481.lstmf line 0 :
Mean rms=2.847%, delta=6.184%, train=29.911%(83.093%), skip ratio=0%
Iteration 46364: GROUND  TRUTH : فرتعا رومزاب يكلملا كردلا رصانع فرط نم هلاقتعا
Iteration 46364: ALIGNED TRUTH : فرع ومب كا رلارصنع فرط نم هللاقتعا
Iteration 46364: BEST OCR TEXT : فرتع وماب يك ردلارصانع فرط نم هالا
File /content/ocr_training_data/852.lstmf line 0 :
Mean rms=2.846%, delta=6.177%, train=29.903%(83.068%), skip ratio=0%
Iteration 46365: GROUND  TRUTH : اسايكا تزجح هشيتفت دعبو هبحاص لقتعيل هناكم يلع
Iteration 46365: ALIGNED TRUTH : اساكا تجح هشتدع هحاص قتعيل هناكم يلع
Iteration 46365: BEST OCR TEXT : اساكا تزجح هشتدعو هحاص قعيل هام نلت
File /content/ocr_training_data/120.lstmf line 0 :
Mean rms=2.845%, delta=6.174%, train=29.896%(83.068%), skip ratio=0%
Iteration 46366: GROUND  TRUTH : هياهن اهعضو مت هلاكولا عم هدقاعتملا جاجحلا هثعب
Iteration 46366: ALIGNED TRUTH : هين ضوملاكلمداعتملا جاجحلا هثثعب
Iteration 46366: BEST OCR TEXT : هياهن اعضو م لاكلعمهدقاعتملا ححلا همب
File /content/ocr_training_data/365.lstmf line 0 :
Mean rms=2.845%, delta=6.17%, train=29.898%(83.068%), skip ratio=0%
Iteration 46367: GROUND  TRUTH : هب هبتشملا نا ينطولا نمالل هماعلا هيريدملل غالب
Iteration 46367: ALIGNED TRUTH : هببشا  ط للمعلا هيريدملل غالب
Iteration 46367: BEST OCR TEXT : هببتشلا نا طول ناللمعلا هاملا ها
File /content/ocr_training_data/556.lstmf line 0 :
Mean rms=2.845%, delta=6.171%, train=29.898%(83.055%), skip ratio=0%
Iteration 46368: GROUND  TRUTH : رئاجسلل عئاب يلع هئادتعا بقع روجهم قدنف ءاضفب
Iteration 46368: ALIGNED TRUTH : رئاسلل عيهدابق وجهم قدنف ءااضفب
Iteration 46368: BEST OCR TEXT : رئاجسلل عاب يع هئدتا بق روجهم ديف هنع
File /content/ocr_training_data/70.lstmf line 0 :
Mean rms=2.845%, delta=6.17%, train=29.905%(83.043%), skip ratio=0%
Iteration 46369: GROUND  TRUTH : بابسا نع فشكلاب ليفكلا هدحو هرفينخ نمال هعباتلا
Iteration 46369: ALIGNED TRUTH : بابسنع فشكبف هحوهفنخ نمال هعباتلا
Iteration 46369: BEST OCR TEXT : بابسانع فشكابلفا هحوهرفنخ هال هيلا
File /content/ocr_training_data/637.lstmf line 0 :
Mean rms=2.845%, delta=6.17%, train=29.907%(83.055%), skip ratio=0%
Iteration 46370: GROUND  TRUTH : امب ءافتكالاو هيئاضقلا هطباضلا رضحم داعبتسا يلا هيمارلا
Iteration 46370: ALIGNED TRUTH : امءافاوئاضقهبلارضح داتسا يلا هيمارلا
Iteration 46370: BEST OCR TEXT : امبءاااوئاضقلهبلارضح داتسا يلا هيالا
File /content/ocr_training_data/811.lstmf line 0 :
Mean rms=2.844%, delta=6.161%, train=29.914%(83.055%), skip ratio=0%
Iteration 46371: GROUND  TRUTH : مهتليل ءاضقو هكلاهتمو هئيدر تالفاح نتم يلع اولقني
Iteration 46371: ALIGNED TRUTH : مهت ءضوههموئ الاح نتم يلع ااولقنقني
Iteration 46371: BEST OCR TEXT : مهتيل ءضقوهاهموهئي تالفاح نتم يع ينق
File /content/ocr_training_data/384.lstmf line 0 :
Mean rms=2.844%, delta=6.16%, train=29.894%(83.043%), skip ratio=0%
Iteration 46372: GROUND  TRUTH : هطرشلل همهملا دنستل تاردخملا نم همهم تايمك لمحت
Iteration 46372: ALIGNED TRUTH : هطرلل لاس تارخلن همهم تايمك لمححت
Iteration 46372: BEST OCR TEXT : هطرلل مملانستل تارخلا نم همهم تم ها
File /content/ocr_training_data/103.lstmf line 0 :
Mean rms=2.843%, delta=6.159%, train=29.882%(83.018%), skip ratio=0%
Iteration 46373: GROUND  TRUTH : نيذلا نيباصملا ناف رداصم بسحو دمحا نبال يدلبلا
Iteration 46373: ALIGNED TRUTH : نيذنيبصملن ردصم بسحو دمحاا نبال يدلببلا
Iteration 46373: BEST OCR TEXT : نيذلنياصمل ن ردصم بسحو مححال نال ييلا
File /content/ocr_training_data/168.lstmf line 0 :
Mean rms=2.843%, delta=6.156%, train=29.879%(83.005%), skip ratio=0%
Iteration 46374: GROUND  TRUTH : مهم ددع يلع ضبقلا ءاقلاب لجع ام هيراجتلا
Iteration 46374: ALIGNED TRUTH : مهم دد يلع ضباقل لجع اام هيراجتلا
Iteration 46374: BEST OCR TEXT : مهمددع يلع ضبقلا ااب لجع يا هالا
File /content/ocr_training_data/144.lstmf line 0 :
Mean rms=2.843%, delta=6.157%, train=29.875%(83.005%), skip ratio=0%
Iteration 46375: GROUND  TRUTH : طخلا نمؤت يتلا هلفاحلا باكر دحا قح يف
Iteration 46375: ALIGNED TRUTH : طخلا نمؤتا هحلا باك دحا قح يفف
Iteration 46375: BEST OCR TEXT : طخلا نمؤتل هفاحلا بك دحا همحم يفم
File /content/ocr_training_data/355.lstmf line 0 :
Mean rms=2.843%, delta=6.159%, train=29.881%(82.993%), skip ratio=0%
Iteration 46376: GROUND  TRUTH : مهرد لباقم بيرهتلا براوق وحن اريشلا هلقنب فرتعا
Iteration 46376: ALIGNED TRUTH : مهردلق برهارق ح ارشلا هلقنب فررتعا
Iteration 46376: BEST OCR TEXT : مهرد لق بيرهلا رق ح ارشلا نم هنفا
File /content/ocr_training_data/251.lstmf line 0 :
Mean rms=2.843%, delta=6.161%, train=29.878%(82.993%), skip ratio=0%
Iteration 46377: GROUND  TRUTH : دعب مهتملا ناكو هحلاص يف تسيلو هدض هجح
Iteration 46377: ALIGNED TRUTH : دع مملاك حاص ي تسيلو هدض هجحح
Iteration 46377: BEST OCR TEXT : دع مملااكو هحاص يف تسلو هام هحم
File /content/ocr_training_data/851.lstmf line 0 :
Mean rms=2.843%, delta=6.162%, train=29.889%(83.005%), skip ratio=0%
Iteration 46378: GROUND  TRUTH : مهتملا هعامجلا هسائرل رفوالا حشرملاو ناميلس نبا ميلقال
Iteration 46378: ALIGNED TRUTH : مهملعج ائ رو شرملا ناملس نببا ميلقاال
Iteration 46378: BEST OCR TEXT : مهتمل عمجا هان ولا حشرلا لي نا يالا
File /content/ocr_training_data/408.lstmf line 0 :
Mean rms=2.843%, delta=6.167%, train=29.893%(83.018%), skip ratio=0%
Iteration 46379: GROUND  TRUTH : اهلالغتساو اهب دارفنالا هل ينستيل عقاولا رمالا ماما
Iteration 46379: ALIGNED TRUTH : اهغتااهفنا تل عاولا رمالا ماما
Iteration 46379: BEST OCR TEXT : اهاغتاو اه ارفنااهل تيل عقاولا هالا يا
File /content/ocr_training_data/857.lstmf line 0 :
Mean rms=2.843%, delta=6.164%, train=29.892%(83.018%), skip ratio=0%
Iteration 46380: GROUND  TRUTH : هيمك زوحتي لامشلا نم امداق اصخش نا ديفت
Iteration 46380: ALIGNED TRUTH : هيمك وحلشاممدا صخش نا دييففت
Iteration 46380: BEST OCR TEXT : هيمك وحتيلامشلا م مدا صخش نا هلن
File /content/ocr_training_data/485.lstmf line 0 :
Mean rms=2.843%, delta=6.169%, train=29.905%(83.03%), skip ratio=0%
Iteration 46381: GROUND  TRUTH : هيعمجو هضايرلاو هبيبشلا هرازوو هلادع ييمالعا هيعمج نيب
Iteration 46381: ALIGNED TRUTH : هيعجوهضيو شلهروودعييمالعا هيعج نييب
Iteration 46381: BEST OCR TEXT : هيعمجوهضايراو هيشارازوو هلع يبالعا همم نحم
File /content/ocr_training_data/509.lstmf line 0 :
Mean rms=2.843%, delta=6.167%, train=29.899%(83.055%), skip ratio=0%
Iteration 46382: GROUND  TRUTH : اناك يتلا كابترالا هلاح يلا هبتنا اهب يطرش
Iteration 46382: ALIGNED TRUTH : ااي كرلا يا بتنا اهبب يطرش
Iteration 46382: BEST OCR TEXT : اايل راهلاح يلا بتنا هاب يفف
File /content/ocr_training_data/202.lstmf line 0 :
Mean rms=2.843%, delta=6.163%, train=29.903%(83.055%), skip ratio=0%
Iteration 46383: GROUND  TRUTH : هداعسلا رويدب ههوبشم هقش همهادم نم يضاملا ءاثالثلا
Iteration 46383: ALIGNED TRUTH : هدس يدههشهشهام نم يضاملا ءاثالثلا
Iteration 46383: BEST OCR TEXT : هدعسلا رودب ههبشهقش ههدم نم هيالا هينلا
File /content/ocr_training_data/959.lstmf line 0 :
Mean rms=2.843%, delta=6.163%, train=29.907%(83.043%), skip ratio=0%
Iteration 46384: GROUND  TRUTH : لك يلا هبوسنملا تاماهتالا يف عيمجلا قاطنتسا دصق
Iteration 46384: ALIGNED TRUTH : لك يبونلااتافعيمجلا قاطنتسا دصق
Iteration 46384: BEST OCR TEXT : لك يل بونلا اماهتالا فعيمجلا عالا همم
File /content/ocr_training_data/892.lstmf line 0 :
Mean rms=2.843%, delta=6.168%, train=29.91%(83.055%), skip ratio=0%
Iteration 46385: GROUND  TRUTH : هنع ثحبت اهلعجيل اهب لوخدلل دادعتسالاو اهسبالم عزنب
Iteration 46385: ALIGNED TRUTH : هنعثحت لعل لودللدادسالاو اهسبالم عززنب
Iteration 46385: BEST OCR TEXT : هنع ثحت اهلجيل اهلودللدادتسالاو الا هاب
File /content/ocr_training_data/831.lstmf line 0 :
Mean rms=2.844%, delta=6.17%, train=29.901%(83.068%), skip ratio=0%
Iteration 46386: GROUND  TRUTH : هيئاضقلا هطرشلا تلقتعاو هطرشلل امهميلست لبق امهورصاحو لزنملا
Iteration 46386: ALIGNED TRUTH : هياضلاطشاتقع هطرلل املس لامهوصاحو لزنملا
Iteration 46386: BEST OCR TEXT : هيئاضقلاهطشاتقتعا هطلل اهملست لب اهورصاحو هاتلا
File /content/ocr_training_data/729.lstmf line 0 :
Mean rms=2.843%, delta=6.165%, train=29.906%(83.093%), skip ratio=0%
Iteration 46387: GROUND  TRUTH : هياكشلا تفاضاو هيدمحملاب هلاكو عم مسومل جحلاب هصاخ
Iteration 46387: ALIGNED TRUTH : هيكلتاضا دمللا ممسومل جحلاب هصاخخ
Iteration 46387: BEST OCR TEXT : هياكلا تفاضاو يدحملاب اكو م مسومل ححلا همم
File /content/ocr_training_data/372.lstmf line 0 :
Mean rms=2.844%, delta=6.166%, train=29.904%(83.105%), skip ratio=0%
Iteration 46388: GROUND  TRUTH : قيقحتلل نمالا زكرم يلا هدايتقاو هلاقتعا مت ثيح
Iteration 46388: ALIGNED TRUTH : قيقلل لا كرميدااو هلاتعا ممتت ثيحح
Iteration 46388: BEST OCR TEXT : قيقلل ملا كرميداتاو هاقتا هيم يحم
File /content/ocr_training_data/672.lstmf line 0 :
Mean rms=2.844%, delta=6.169%, train=29.9%(83.13%), skip ratio=0%
Iteration 46389: GROUND  TRUTH : ادح وروصي يللا ددهيت ساف يف هودش رافش
Iteration 46389: ALIGNED TRUTH : ادحوصيللا دد سف يف ههودش ررافشش
Iteration 46389: BEST OCR TEXT : ادحووصي يللا دده سف يف ييات هيا
File /content/ocr_training_data/715.lstmf line 0 :
Mean rms=2.844%, delta=6.174%, train=29.897%(83.118%), skip ratio=0%
Iteration 46390: GROUND  TRUTH : ادوقع اومربا مهنا اهنومضم يف اودكا جاجحلا عم
Iteration 46390: ALIGNED TRUTH : ادقوم مهنمضميف ادكا جاجحلا ععم
Iteration 46390: BEST OCR TEXT : ادوق اوبا مه هنمضم يف اودكا هاملا نم
File /content/ocr_training_data/371.lstmf line 0 :
Mean rms=2.845%, delta=6.176%, train=29.892%(83.093%), skip ratio=0%
Iteration 46391: GROUND  TRUTH : اسبلتم ساف نما هيالوب هيئاضقلا هطرشلا رصانع لبق
Iteration 46391: ALIGNED TRUTH : اسبمسفمياهيئقا طرشلا رصانع لبق
Iteration 46391: BEST OCR TEXT : اسبم سفنما هيابهيئاضقلا طرشلا ياع يل
File /content/ocr_training_data/603.lstmf line 0 :
Mean rms=2.845%, delta=6.18%, train=29.894%(83.118%), skip ratio=0%
Iteration 46392: GROUND  TRUTH : ريفوتب تلخا نا دعب همكحملا يلا تاياكشب اومدقت
Iteration 46392: ALIGNED TRUTH : ريفوبتخا  د مكحل لا تاياكشب اوومدقت
Iteration 46392: BEST OCR TEXT : ريفوبتلخا ن دعب مكحملا يلااا هنيام ها
File /content/ocr_training_data/358.lstmf line 0 :
Mean rms=2.845%, delta=6.182%, train=29.88%(83.105%), skip ratio=0%
Iteration 46393: GROUND  TRUTH : اءزج نال عازنلا عوضوم ضرالا يف لاغتشالا نم
Iteration 46393: ALIGNED TRUTH : اءجنازعوضو ضرالايف لاغتشالا نم
Iteration 46393: BEST OCR TEXT : اءزج نال عازنعضو ضراايف لاشلا نم
File /content/ocr_training_data/39.lstmf line 0 :
Mean rms=2.844%, delta=6.177%, train=29.875%(83.068%), skip ratio=0%
Iteration 46394: GROUND  TRUTH : يف هعرصم ينيتس لجر يقلرونب يديسب هثداح يف
Iteration 46394: ALIGNED TRUTH : يف هرصمينسلجيلربيديسب هثداح يفف
Iteration 46394: BEST OCR TEXT : يف هعرصم ينس لجيقلنب يديسب هلاف يفف
File /content/ocr_training_data/642.lstmf line 0 :
Mean rms=2.844%, delta=6.177%, train=29.865%(83.03%), skip ratio=0%
Iteration 46395: GROUND  TRUTH : تلصاوتو تاردخملا نم نط فصنو نانطا هثالث رمالا
Iteration 46395: ALIGNED TRUTH : تلصاووتارخم منط فصنو ننطا هثالث ررمالا
Iteration 46395: BEST OCR TEXT : تلصاوتو تاردخمل نم نط فصنو ننطا هاث يالا
File /content/ocr_training_data/261.lstmf line 0 :
Mean rms=2.844%, delta=6.179%, train=29.847%(82.98%), skip ratio=0%
Iteration 46396: GROUND  TRUTH : وباصلا ب لمعلا فقوب يضاقلا يضاملا رياربف رهش
Iteration 46396: ALIGNED TRUTH : وباصا بلع فقوبيضق يضاملا رياربف رهرهش
Iteration 46396: BEST OCR TEXT : وباصا ب لمعل فقوب يضاقا يضاملا يف هير
File /content/ocr_training_data/764.lstmf line 0 :
Mean rms=2.843%, delta=6.179%, train=29.823%(82.955%), skip ratio=0%
Iteration 46397: GROUND  TRUTH : يف هعوقو هلاح يف ثدحلا لفطلل يلضفلا هحلصملا
Iteration 46397: ALIGNED TRUTH : يف عق هحي ثحافطلل يلضضفلا هحلصمملا
Iteration 46397: BEST OCR TEXT : يف عوقو هحي ثحلا لفطلل يااقلا هيلا
File /content/ocr_training_data/544.lstmf line 0 :
Mean rms=2.843%, delta=6.178%, train=29.839%(82.955%), skip ratio=0%
Iteration 46398: GROUND  TRUTH : هدعاسمب هيحضلا لزنم ماحتقا دعب تئجوف ذا اهنع
Iteration 46398: ALIGNED TRUTH : هدعسهحضالناح ب تئجوف ذا اهننع
Iteration 46398: BEST OCR TEXT : هدعس هيحضلالنم اقا عب تجوف يف ينم
File /content/ocr_training_data/624.lstmf line 0 :
Mean rms=2.844%, delta=6.184%, train=29.849%(82.98%), skip ratio=0%
Iteration 46399: GROUND  TRUTH : اهعاضخال يناسغلا يفشتسمب تاومالا عدوتسم يلا هتثج لقنتو
Iteration 46399: ALIGNED TRUTH : اهعضخايسلافشتسمتوا عتسم يلا هتثج لقنتو
Iteration 46399: BEST OCR TEXT : اهعضخايسلايفشتسم توملا عتسم يلا ها يت
File /content/ocr_training_data/249.lstmf line 0 :
Mean rms=2.844%, delta=6.185%, train=29.853%(82.968%), skip ratio=0%
At iteration 46364/46400/46400, mean rms=2.844%, delta=6.185%, BCER train=29.853%, BWER train=82.968%, skip ratio=0.000%, New worst BCER = 29.853 wrote checkpoint.
Iteration 46400: GROUND  TRUTH : هدحاو هراجيس ذخا لواحو هنم برتقاف رئاجسلا عيبل
Iteration 46400: ALIGNED TRUTH : هدحواي الاوهن برتقاف رئاجسلا عيبلل
Iteration 46400: BEST OCR TEXT : هدحاورايس ذالواحو هنم برتق ررالا هاملا
File /content/ocr_training_data/78.lstmf line 0 :
Mean rms=2.844%, delta=6.188%, train=29.855%(82.955%), skip ratio=0%
Iteration 46401: GROUND  TRUTH : فاضاو رصعلا هالص يتح هيف يقب يذلا كلاهلا
Iteration 46401: ALIGNED TRUTH : فاضاورع هص يت يفقب يذلا كلاهلا
Iteration 46401: BEST OCR TEXT : فاضاورعا هص يتح يف قب يانلا يلا
File /content/ocr_training_data/942.lstmf line 0 :
Mean rms=2.844%, delta=6.186%, train=29.844%(82.955%), skip ratio=0%
Iteration 46402: GROUND  TRUTH : اهلاسرا دصق اهنم تانيع ذخا عم ،« كوكيص
Iteration 46402: ALIGNED TRUTH : اهساصقه ان خا عم ،«« كووكيص
Iteration 46402: BEST OCR TEXT : اهسرا دصقاه ان خا عم هعب تت
File /content/ocr_training_data/181.lstmf line 0 :
Mean rms=2.844%, delta=6.184%, train=29.856%(82.943%), skip ratio=0%
Iteration 46403: GROUND  TRUTH : ببست وباصلا ب لمعلا فقوب يضاقلاو يضاملا رياربف
Iteration 46403: ALIGNED TRUTH : ببس ال  مافقب يضاقلاو يضاملا ريارببف
Iteration 46403: BEST OCR TEXT : ببس اصل ب لملا فقوب يضاقاو ياملا هاو

This is after 46,000 iterations; the values have stayed almost the same for the last 5,000 iterations.

Abdlrhman00 (Author) commented Nov 22, 2024

Could this cause a problem?

You are using make version: 4.3
combine_lang_model \
  --input_unicharset data/politcs-ar/unicharset \
  --script_dir data/langdata \
  --numbers data/politcs-ar/politcs-ar.numbers \
  --puncs data/politcs-ar/politcs-ar.punc \
  --words data/politcs-ar/politcs-ar.wordlist \
  --output_dir data \
  --pass_through_recoder --lang_is_rtl \
  --lang politcs-ar
Failed to read data from: data/politcs-ar/politcs-ar.wordlist
Failed to read data from: data/politcs-ar/politcs-ar.punc
Failed to read data from: data/politcs-ar/politcs-ar.numbers
Loaded unicharset of size 87 from file data/politcs-ar/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Inherited.unicharset
Warning: properties incomplete for index 16 = َ
Warning: properties incomplete for index 20 = ُ
Warning: properties incomplete for index 44 = ٍ
Warning: properties incomplete for index 48 = ّ
Warning: properties incomplete for index 65 = ِ
Warning: properties incomplete for index 66 = ْ
Warning: properties incomplete for index 69 = ً
Warning: properties incomplete for index 71 = ٌ
Config file is optional, continuing...
Failed to read data from: data/langdata/politcs-ar/politcs-ar.config
Created data/politcs-ar/politcs-ar.traineddata
lstmtraining \
  --debug_interval -1 \
  --traineddata data/politcs-ar/politcs-ar.traineddata \
  --old_traineddata /content/tesstrain/tessdata/ara.traineddata \
  --continue_from data/ara/politcs-ar.lstm \
  --learning_rate 0.0001 \
  --model_output data/politcs-ar/checkpoints/politcs-ar \
  --train_listfile data/politcs-ar/list.train \
  --eval_listfile data/politcs-ar/list.eval \
  --max_iterations 40000 \
  --target_error_rate 0.01 \
2>&1 | tee -a data/politcs-ar/training.log
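As far as I understand, the three "Failed to read data from" lines refer to the optional wordlist/punctuation/number files that combine_lang_model only uses to build dictionary dawgs in the starter traineddata; they do not change the LSTM weights. If wanted, they could presumably be supplied along these lines, using the Arabic files from the tesseract-ocr/langdata_lstm repository as a starting point (paths are assumptions):

# copy the stock Arabic dictionary files to where combine_lang_model looks for them
cp langdata_lstm/ara/ara.wordlist data/politcs-ar/politcs-ar.wordlist
cp langdata_lstm/ara/ara.punc data/politcs-ar/politcs-ar.punc
cp langdata_lstm/ara/ara.numbers data/politcs-ar/politcs-ar.numbers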

M3ssman (Contributor) commented Nov 26, 2024

How do you read in the additional information such as the word list (and the like) whose absence causes the failures reported in your log?

Please note that for training the official Arabic model they certainly used several dozen fonts and at least 100,000 lines of text. You use only three(?) fonts and only about 16K lines of text to generate your input.

Why do you limit your input to 8 words for each line?

Further, despite the learning progress tesstrain reports, and depending on your final target scenario, it might be worthwhile to evaluate the resulting model afterwards on data it has never seen during training. At least this is how we do it in the context of mass digitization of Arabic/Hebrew/Farsi prints.
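A minimal sketch of such an evaluation with lstmeval, assuming a list of completely held-out .lstmf lines in data/politcs-ar/list.heldout (a hypothetical file) and the rolling checkpoint written by lstmtraining:

lstmeval \
  --model data/politcs-ar/checkpoints/politcs-ar_checkpoint \
  --traineddata data/politcs-ar/politcs-ar.traineddata \
  --eval_listfile data/politcs-ar/list.heldout

This reports BCER/BWER on those held-out lines only, which is a better indicator of real-world accuracy than the training error.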

To provide some more background: I'm from a German institution called FID MENA, and we are always looking for collaborations on this topic. In the past we've tried to fine-tune the official Tesseract standard model for Arabic, not with a synthetic dataset like yours, but with snippets generated from retro-digitized materials originating from real prints from MENAlib.
