
Problems reading the det_data_lesson_demo dataset in the ten-day training camp #5029

Closed
WZMIAOMIAO opened this issue Dec 23, 2021 · 9 comments

@WZMIAOMIAO (Contributor) commented Dec 23, 2021

  • System Environment: AI Studio (V100 16G)
  • Paddle version: paddlepaddle-gpu==2.2.1.post101
  • PaddleOCR: git clone https://gitee.com/paddlepaddle/PaddleOCR
  • Command: !python tools/train.py -c configs/det/det_mv3_db.yml
  • Complete Error Message:

Excerpt of the error log:

[2021/12/23 12:41:26] root INFO: train dataloader has 94 iters
[2021/12/23 12:41:26] root INFO: valid dataloader has 250 iters
[2021/12/23 12:41:26] root INFO: During the training process, after the 0th iteration, an evaluation is run every 500 iterations
[2021/12/23 12:41:26] root INFO: Initialize indexs of datasets:['/home/aistudio/work/data/det_data_lesson_demo/train.txt']
[2021/12/23 12:41:54] root INFO: epoch: [1/100], iter: 10, lr: 0.000027, loss: 9.582685, loss_shrink_maps: 4.681584, loss_threshold_maps: 3.961636, loss_binary_maps: 0.939466, reader_cost: 1.81348 s, batch_cost: 2.77511 s, samples: 88, ips: 3.17105
[2021/12/23 12:41:58] root ERROR: When parsing line mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg	[{"transcription": "\u6d53\u7f29\u9664\u81ed\u6db2", "points": [[473.55, 99.64], [456.18, 41.82], [778.73, 39.82], [777.73, 105.82]]}, {"transcription": "1000ml", "points": [[476.27, 158.73], [477.27, 129.09], [618.55, 124.09], [618.55, 158.73]]}, {"transcription": "\u62b510\u74f6", "points": [[647.55, 121.64], [652.55, 165.64], [771.09, 166.64], [773.09, 121.64]]}, {"transcription": "\u9001", "points": [[691.82, 347.45], [690.82, 437.36], [768.55, 426.36], [777.0, 345.36]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[94.0, 289.0], [94.0, 305.73], [164.73, 305.73], [164.73, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[242.55, 290.0], [242.55, 303.27], [317.45, 303.27], [316.45, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[650.55, 476.36], [651.55, 485.82], [694.91, 486.82], [695.91, 477.36]]}, {"transcription": "Disiuf", "points": [[48.36, 325.55], [46.55, 359.09], [154.55, 363.45], [156.55, 330.55]]}, {"transcription": "spray", "points": [[61.45, 362.73], [61.45, 378.91], [123.27, 377.0], [121.27, 361.91]]}, {"transcription": "spray", "points": [[211.73, 360.27], [214.73, 377.73], [272.64, 377.73], [269.64, 360.27]]}, {"transcription": "Disiufectaut", "points": [[198.73, 324.55], [199.73, 361.82], [387.0, 357.82], [390.0, 328.55]]}, {"transcription": "\u5ba0\u7269\u9664\u81ed\u6db2", "points": [[271.64, 379.64], [272.64, 400.0], [369.82, 399.0], [371.82, 381.64]]}, {"transcription": "\u5ba0\u7269", "points": [[125.73, 379.64], [122.73, 399.27], [153.82, 401.27], [154.82, 381.64]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[63.82, 380.82], [67.82, 397.55], [116.73, 399.55], [114.73, 382.82]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[216.91, 382.55], [214.91, 401.27], [267.27, 399.27], [269.27, 383.55]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[63.18, 422.09], [61.18, 429.36], [104.82, 429.36], [105.82, 422.09]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[211.73, 421.09], [211.73, 429.82], [256.64, 430.82], [254.64, 421.09]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[63.18, 442.91], [61.18, 454.55], [144.36, 453.55], [143.36, 442.91]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[216.18, 444.64], [213.18, 457.27], [294.91, 454.27], [296.91, 445.64]]}, {"transcription": "\u51c0\u542b\u91cf", "points": [[703.8, 615.47], [705.8, 621.8], [721.4, 620.8], [723.4, 615.47]]}, {"transcription": "500ML", "points": [[704.2, 622.4], [702.2, 626.67], [720.2, 627.67], [720.2, 622.4]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[661.07, 546.13], [661.07, 553.53], [689.93, 553.53], [691.93, 547.13]]}, {"transcription": "\u5ba0\u7269\u795b\u5473\u55b7\u96fe", "points": [[640.8, 537.13], [642.8, 526.87], [713.2, 526.87], [710.2, 537.13]]}, {"transcription": "Healthy", "points": [[62.27, 408.93], [64.27, 419.6], [101.07, 417.6], [103.07, 408.93]]}, {"transcription": "Antiscptic", "points": [[104.27, 408.53], [104.27, 420.67], [152.33, 420.67], [154.33, 408.53]]}, {"transcription": "###", "points": [[213.13, 406.93], [215.13, 417.0], [249.33, 417.0], [249.33, 406.93]]}, {"transcription": "Antiscptic&DeodorantForPet", "points": [[253.13, 417.0], [253.6, 408.47], [408.2, 407.93], [401.27, 418.07]]}, {"transcription": "Healthy", "points": [[224.2, 406.4], [224.2, 407.93], [221.67, 406.93], [223.67, 407.4]]}, {"transcription": 
"DEODORANT", "points": [[627.0, 505.07], [627.0, 496.47], [724.53, 495.47], [724.53, 505.07]]}, {"transcription": "SPRAY", "points": [[651.47, 519.87], [651.47, 509.2], [698.0, 508.2], [702.0, 518.87]]}, {"transcription": "\u4e70\u4e00\u9001\u4e00", "points": [[27.07, 790.8], [12.93, 645.2], [484.67, 631.93], [450.8, 786.27]]}, {"transcription": "###", "points": [[123.93, 700.2], [121.93, 700.73], [124.47, 700.73], [124.47, 700.2]]}, {"transcription": "\u5206\u89e3\u81ed\u5473", "points": [[515.0, 793.07], [514.0, 710.53], [786.0, 700.53], [793.0, 784.07]]}, {"transcription": "\u9001\u9664\u81ed\u55b7\u96fe500ml", "points": [[522.2, 698.0], [522.2, 666.47], [786.87, 662.47], [788.87, 697.0]]}]
, error happened with msg: Traceback (most recent call last):
  File "/home/aistudio/work/PaddleOCR/ppocr/data/simple_dataset.py", line 119, in __getitem__
    outs = transform(data, self.ops)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/__init__.py", line 43, in transform
    data = op(data)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/make_border_map.py", line 60, in __call__
    self.draw_border_map(text_polys[i], canvas, mask=mask)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/make_border_map.py", line 81, in draw_border_map
    padded_polygon = np.array(padding.Execute(distance)[0])
IndexError: list index out of range
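
For context, the traceback ends in make_border_map.py, where pyclipper's polygon offset (padding.Execute) returns an empty list for a degenerate box; the near-zero-area, self-intersecting "Healthy" box in the annotation above is the likely trigger. Below is a minimal sketch for flagging such boxes in a SimpleDataSet label file (the function names and the min_area threshold are illustrative, not part of PaddleOCR):

import json


def polygon_area(points):
    # Shoelace formula; a near-zero area marks a degenerate box.
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def find_degenerate_boxes(label_path, min_area=1.0):
    # Each label line is "<image path>\t<JSON list of boxes>".
    with open(label_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            img_name, anno = line.rstrip("\n").split("\t", 1)
            for box in json.loads(anno):
                if polygon_area(box["points"]) < min_area:
                    print(f"{img_name}: suspicious box {box['points']}")


find_degenerate_boxes("/home/aistudio/work/data/det_data_lesson_demo/train.txt")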

Training and validation also produce many corrupt-image errors:

Corrupt JPEG data: bad Huffman code
Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
Corrupt JPEG data: premature end of data segment

Training still runs, but this doesn't feel very beginner-friendly (maybe the instructor buried these pitfalls on purpose, hahaha).
Here is the det_mv3_db.yml file:

Global:
  use_gpu: True
  epoch_num: 100
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/db_mv3/
  save_epoch_step: 1200
  # evaluation is run every 500 iterations
  eval_batch_step: [0, 500]
  cal_metric_during_train: False
  pretrained_model: ./pretrain_models/MobileNetV3_large_x0_5_pretrained
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./output/det_db/predicts_db.txt

Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: 'L2'
    factor: 0.000005

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /home/aistudio/work/data/det_data_lesson_demo/
    label_file_list:
      - /home/aistudio/work/data/det_data_lesson_demo/train.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [640, 640]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 8
    num_workers: 4
    use_shared_memory: False

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /home/aistudio/work/data/det_data_lesson_demo/
    label_file_list:
      - /home/aistudio/work/data/det_data_lesson_demo/eval.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          image_shape: [736, 1280]
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 2
    use_shared_memory: False
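
As an aside, any of the values above can be overridden from the command line with PaddleOCR's -o flag instead of editing the YAML, e.g. to point training at a cleaned label file (the train_clean.txt path below is hypothetical):

!python tools/train.py -c configs/det/det_mv3_db.yml \
    -o Train.dataset.label_file_list="['/home/aistudio/work/data/det_data_lesson_demo/train_clean.txt']"
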
@LDOUBLEV (Collaborator)

A few individual images in the training set are problematic; they affect neither overall training nor the final accuracy.

@WZMIAOMIAO (Contributor, Author)

OK, thanks for the reply.

@chenshenghao

Has anyone run into eval being extremely slow? Mine gets stuck at 19% for a very long time:
eval model:: 0%| | 0/250 [00:00<?, ?it/s]Corrupt JPEG data: bad Huffman code
eval model:: 18%|█████▎ | 46/250 [00:17<01:02, 3.25it/s]Traceback (most recent call last):
File "tools/train.py", line 148, in

eval model:: 19%|█████▍ | 47/250 [00:20<01:28, 2.29it/s]

@WZMIAOMIAO (Contributor, Author) commented Dec 23, 2021

@chenshenghao

Mine takes about 1 min 13 s; I don't see the extreme slowness you're describing.

eval model::  95%|██████████████████████████▋ | 238/250 [01:02<00:10,  1.12it/s]Corrupt JPEG data: premature end of data segment
eval model:: 100%|████████████████████████████| 250/250 [01:13<00:00,  1.09it/s]
[2021/12/23 13:12:55] root INFO: cur metric, precision: 0.5475378787878787, recall: 0.48409243134628266, hmean: 0.5138642019196588, fps: 17.20648334652921
[2021/12/23 13:12:55] root INFO: save best model is to ./output/db_mv3/best_accuracy

@livingbody (Contributor)

But I still really, really want to weed the bad samples out.

@WZMIAOMIAO (Contributor, Author) commented Dec 23, 2021

I've found all of the problematic images in the training and validation sets. If you don't want to see the error messages, delete the annotation lines corresponding to these images.

# Training-set images that decode to None
train_none_imgs = ["mtwi/train/TB1Zj7Un4rI8KJjy0FpXXb5hVXa_!!1-item_pic.gif.jpg",
                   "mtwi/train/TB2MVFMjjqhSKJjSspnXXc79XXa_!!2822611227.gif.jpg",
                   "mtwi/train/TB20eQfjqagSKJjy0FgXXcRqFXa_!!480667565.jpg.jpg",
                   "mtwi/train/TB26jrZgPnD8KJjSspbXXbbEXXa_!!3173720736.jpg.jpg",
                   "mtwi/train/TB2AzHIhlTH8KJjy0FiXXcRsXXa_!!2691187853.gif.jpg",
                   "mtwi/train/TB24dJ4jCYH8KJjSspdXXcRgVXa_!!2426498448.jpg.jpg",
                   "mtwi/train/TB2Ob1Ve_J_SKJjSZPiXXb3LpXa_!!789520595.gif.jpg",
                   "mtwi/train/TB2gwyxj46I8KJjy0FgXXXXzVXa_!!3401535694.gif.jpg",
                   "mtwi/train/TB17ggQcJHO8KJjSZFtXXchfXXa_!!1-item_pic.gif.jpg"]
# Corrupt training-set images
train_corrupt_imgs = ["xfun/train/zh_train_43.jpg",  # Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
                      "xfun/train/zh_train_144.jpg"]  # Corrupt JPEG data: bad Huffman code

# Validation-set images that decode to None
eval_non_imgs = ["mtwi/eval/TB1KB4MLXXXXXblXpXXunYpLFXX.jpg",
                 "mtwi/eval/TB25PgKirsTMeJjSszgXXacpFXa-1106900306.jpg.jpg"]
# Corrupt validation-set images
eval_corrupt_imgs = ["xfun/val/zh_val_42.jpg"]  # Corrupt JPEG data: premature end of data segment

There are also two annotations that may be problematic; these can be deleted as well:

mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg
mtwi/train/TB1oe8CLXXXXXc4XFXXunYpLFXX.jpg

Otherwise training raises the same IndexError ("list index out of range" in make_border_map) shown above.

@learning-and-learning1651880 commented Dec 27, 2021

Hi, could I ask how you managed to batch-detect these problematic files?

@WZMIAOMIAO (Contributor, Author)

@learning-and-learning1651880
Here is how to find the images that decode to None:

import os
import numpy as np
import cv2


def official_read_img(img_path):
    # Mimic PaddleOCR's DecodeImage transform: read the raw bytes, then
    # decode with cv2.imdecode, which returns None (no exception) on failure.
    with open(img_path, 'rb') as f:
        img = f.read()
    img = np.frombuffer(img, dtype='uint8')
    img = cv2.imdecode(img, 1)
    if img is None:
        print(f"find None image: {img_path}")


def main():
    img_dir = "./det_data_lesson_demo"
    txt_path = "./det_data_lesson_demo/train.txt"
    with open(txt_path, "r", encoding="utf-8") as f:
        for i in f.readlines():
            if len(i.strip()) > 0:
                img_name = i.split("\t")[0]
                official_read_img(os.path.join(img_dir, img_name))


if __name__ == '__main__':
    main()

As for the corrupt images: OpenCV doesn't raise an exception for them, so they can't be caught directly. At the time I tracked them down one by one by bisecting the dataset.
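
A possible way to automate that (not from this thread): force a full decode with Pillow, and check for the JPEG EOI marker (FF D9), which "premature end of data segment" files usually lack. A rough sketch, assuming Pillow is installed; it may still miss corruptions that libjpeg only emits warnings for:

from PIL import Image


def check_jpeg(img_path):
    # A well-formed JPEG ends with the EOI marker FF D9 (ignoring trailing padding).
    with open(img_path, "rb") as f:
        data = f.read()
    if not data.rstrip(b"\0").endswith(b"\xff\xd9"):
        print(f"missing EOI marker (truncated?): {img_path}")
    # Force a full decode; Pillow raises on some structural errors
    # that OpenCV silently tolerates.
    try:
        with Image.open(img_path) as img:
            img.load()
    except Exception as e:
        print(f"decode failed: {img_path} ({e})")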

@EighteenSprings commented Dec 29, 2021

Thanks for the tip above. I deleted them one by one and indeed all the error messages are gone. Sharing my rough deletion code:

import os

root_dir = "./train_data/det_data_lesson_demo"
train_file = "train.txt"
eval_file = "eval.txt"

# Training-set images that decode to None
train_none_imgs = ["mtwi/train/TB1Zj7Un4rI8KJjy0FpXXb5hVXa_!!1-item_pic.gif.jpg",
                   "mtwi/train/TB2MVFMjjqhSKJjSspnXXc79XXa_!!2822611227.gif.jpg",
                   "mtwi/train/TB20eQfjqagSKJjy0FgXXcRqFXa_!!480667565.jpg.jpg",
                   "mtwi/train/TB26jrZgPnD8KJjSspbXXbbEXXa_!!3173720736.jpg.jpg",
                   "mtwi/train/TB2AzHIhlTH8KJjy0FiXXcRsXXa_!!2691187853.gif.jpg",
                   "mtwi/train/TB24dJ4jCYH8KJjSspdXXcRgVXa_!!2426498448.jpg.jpg",
                   "mtwi/train/TB2Ob1Ve_J_SKJjSZPiXXb3LpXa_!!789520595.gif.jpg",
                   "mtwi/train/TB2gwyxj46I8KJjy0FgXXXXzVXa_!!3401535694.gif.jpg",
                   "mtwi/train/TB17ggQcJHO8KJjSZFtXXchfXXa_!!1-item_pic.gif.jpg"]
# Corrupt training-set images
train_corrupt_imgs = ["xfun/train/zh_train_43.jpg",  # Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
                      "xfun/train/zh_train_144.jpg"]  # Corrupt JPEG data: bad Huffman code

# Validation-set images that decode to None
eval_non_imgs = ["mtwi/eval/TB1KB4MLXXXXXblXpXXunYpLFXX.jpg",
                 "mtwi/eval/TB25PgKirsTMeJjSszgXXacpFXa-1106900306.jpg.jpg"]
# Corrupt validation-set images
eval_corrupt_imgs = ["xfun/val/zh_val_42.jpg"]  # Corrupt JPEG data: premature end of data segment

# Images whose annotations are problematic
label_mismatch = ["mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg",
                  "mtwi/train/TB1oe8CLXXXXXc4XFXXunYpLFXX.jpg"]

def dataset_filter(root_dir, file_list):
    # os.remove returns None, so testing its return value always takes the
    # failure branch; use try/except to detect failed deletions instead.
    not_deleted = []
    for filename in file_list:
        file_pth = os.path.join(root_dir, filename)
        if os.path.exists(file_pth):
            try:
                os.remove(file_pth)
                print(f"{file_pth} has been removed")
            except OSError:
                print(f"{file_pth} cannot be removed")
                not_deleted.append(filename)
        else:
            print(f"{file_pth} does not exist")
    return not_deleted

def remove_from_file(filename, filter_list):
    with open(filename, 'r') as f:
        lines = f.readlines()
    new_lines = []
    for line in lines:
        flag = True
        for img in filter_list:
            if img in line:
                flag = False
                print(f"Delete {img} from {filename}")
                break
        if flag:
            new_lines.append(line)
    with open(filename, 'w') as f:
        f.writelines(new_lines)

def main():
    # remove img
    dataset_filter(root_dir, train_none_imgs)
    dataset_filter(root_dir, train_corrupt_imgs)
    dataset_filter(root_dir, eval_non_imgs)
    dataset_filter(root_dir, eval_corrupt_imgs)
    dataset_filter(root_dir, label_mismatch)

    # remove from label
    remove_from_file(os.path.join(root_dir, "train.txt"), train_none_imgs+train_corrupt_imgs+label_mismatch)
    remove_from_file(os.path.join(root_dir, "eval.txt"), eval_non_imgs+eval_corrupt_imgs)


if __name__ == "__main__":
    main()
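
One caveat: remove_from_file rewrites train.txt and eval.txt in place, and dataset_filter deletes the images outright, so it is worth backing up both label files (and the listed images) first in case the original dataset needs restoring.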
