Problem reading the det_data_lesson_demo dataset in the ten-day training camp #5029
Comments
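A few images in the training set are faulty. This does not affect training overall, nor the final accuracy. |
OK, thanks for the reply. |
Has anyone run into eval being extremely slow? It hangs at 19% for a very long time: eval model:: 19%|█████▍ | 47/250 [00:20<01:28, 2.29it/s] |
Mine takes about 1 minute 13 seconds; I don't see the slowness you describe.
|
But I still really, really want to get rid of these errors. |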
I tracked down every problematic image in the training and validation sets. If you don't want to see the error messages, delete the annotation entries for these images.
# Images in the training set that decode to None
train_none_imgs = ["mtwi/train/TB1Zj7Un4rI8KJjy0FpXXb5hVXa_!!1-item_pic.gif.jpg",
                   "mtwi/train/TB2MVFMjjqhSKJjSspnXXc79XXa_!!2822611227.gif.jpg",
                   "mtwi/train/TB20eQfjqagSKJjy0FgXXcRqFXa_!!480667565.jpg.jpg",
                   "mtwi/train/TB26jrZgPnD8KJjSspbXXbbEXXa_!!3173720736.jpg.jpg",
                   "mtwi/train/TB2AzHIhlTH8KJjy0FiXXcRsXXa_!!2691187853.gif.jpg",
                   "mtwi/train/TB24dJ4jCYH8KJjSspdXXcRgVXa_!!2426498448.jpg.jpg",
                   "mtwi/train/TB2Ob1Ve_J_SKJjSZPiXXb3LpXa_!!789520595.gif.jpg",
                   "mtwi/train/TB2gwyxj46I8KJjy0FgXXXXzVXa_!!3401535694.gif.jpg",
                   "mtwi/train/TB17ggQcJHO8KJjSZFtXXchfXXa_!!1-item_pic.gif.jpg"]
# Corrupt images in the training set
train_corrupt_imgs = ["xfun/train/zh_train_43.jpg",   # Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
                      "xfun/train/zh_train_144.jpg"]  # Corrupt JPEG data: bad Huffman code
# Images in the validation set that decode to None
eval_non_imgs = ["mtwi/eval/TB1KB4MLXXXXXblXpXXunYpLFXX.jpg",
                 "mtwi/eval/TB25PgKirsTMeJjSszgXXacpFXa-1106900306.jpg.jpg"]
# Corrupt images in the validation set
eval_corrupt_imgs = ["xfun/val/zh_val_42.jpg"]  # Corrupt JPEG data: premature end of data segment
In addition, there are two labels whose annotations may be wrong; they can be deleted too:
Otherwise training reports an error:
|
Hi, could you tell me how you batch-detected these problematic files? |
@learning-and-learning1651880
import os

import cv2
import numpy as np


def official_read_img(img_path):
    # Read the raw bytes and decode them with OpenCV
    with open(img_path, 'rb') as f:
        img = f.read()
    img = np.frombuffer(img, dtype='uint8')
    img = cv2.imdecode(img, 1)
    # cv2.imdecode returns None when the data cannot be decoded
    if img is None:
        print(f"find None image: {img_path}")


def main():
    img_dir = "./det_data_lesson_demo"
    txt_path = "./det_data_lesson_demo/train.txt"
    with open(txt_path, "r", encoding="utf-8") as f:
        for i in f.readlines():
            if len(i.strip()) > 0:
                # The image path is the first tab-separated field of each line
                img_name = i.split("\t")[0]
                official_read_img(os.path.join(img_dir, img_name))


if __name__ == '__main__':
    main()
As for the corrupt images: OpenCV does not raise an Exception for them, so they cannot be caught directly. At the time I tracked them down by bisection, ruling images out by hand. |
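One way to avoid the manual bisection described above might be to capture what the decoder writes to standard error. The "Corrupt JPEG data: ..." messages quoted in this thread are warnings printed by the underlying JPEG library rather than Python exceptions, so redirecting file descriptor 2 around each decode call can expose them. The sketch below uses only the standard library; `capture_fd_stderr` and `looks_corrupt` are hypothetical helper names, and whether the warnings actually reach file descriptor 2 in a given OpenCV build is an assumption to verify locally.

```python
import os
import sys
import tempfile


def capture_fd_stderr(func, *args, **kwargs):
    """Run func and return (result, text written to file descriptor 2)."""
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep a handle on the real stderr
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)  # fd 2 now points at the temp file
        try:
            result = func(*args, **kwargs)
        finally:
            sys.stderr.flush()
            os.dup2(saved_fd, 2)  # restore the real stderr
            os.close(saved_fd)
        tmp.seek(0)
        captured = tmp.read().decode("utf-8", errors="replace")
    return result, captured


def looks_corrupt(decode, img_path):
    """Return True if decode(img_path) emits a 'Corrupt JPEG data' warning.

    decode is any callable that decodes one image; it is an assumption
    that its warnings are written to file descriptor 2.
    """
    _, messages = capture_fd_stderr(decode, img_path)
    return "Corrupt JPEG data" in messages
```

To use it with the script above, pass a function that reads and decodes one image (for example a variant of `official_read_img`) as `decode`, and call `looks_corrupt` for every path listed in train.txt and eval.txt.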
Thanks for the tip above. I deleted them one by one and there are indeed no more error messages. Sharing my rough deletion code:
import os

root_dir = "./train_data/det_data_lesson_demo"
train_file = "train.txt"
eval_file = "eval.txt"
# Images in the training set that decode to None
train_none_imgs = ["mtwi/train/TB1Zj7Un4rI8KJjy0FpXXb5hVXa_!!1-item_pic.gif.jpg",
                   "mtwi/train/TB2MVFMjjqhSKJjSspnXXc79XXa_!!2822611227.gif.jpg",
                   "mtwi/train/TB20eQfjqagSKJjy0FgXXcRqFXa_!!480667565.jpg.jpg",
                   "mtwi/train/TB26jrZgPnD8KJjSspbXXbbEXXa_!!3173720736.jpg.jpg",
                   "mtwi/train/TB2AzHIhlTH8KJjy0FiXXcRsXXa_!!2691187853.gif.jpg",
                   "mtwi/train/TB24dJ4jCYH8KJjSspdXXcRgVXa_!!2426498448.jpg.jpg",
                   "mtwi/train/TB2Ob1Ve_J_SKJjSZPiXXb3LpXa_!!789520595.gif.jpg",
                   "mtwi/train/TB2gwyxj46I8KJjy0FgXXXXzVXa_!!3401535694.gif.jpg",
                   "mtwi/train/TB17ggQcJHO8KJjSZFtXXchfXXa_!!1-item_pic.gif.jpg"]
# Corrupt images in the training set
train_corrupt_imgs = ["xfun/train/zh_train_43.jpg",   # Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
                      "xfun/train/zh_train_144.jpg"]  # Corrupt JPEG data: bad Huffman code
# Images in the validation set that decode to None
eval_non_imgs = ["mtwi/eval/TB1KB4MLXXXXXblXpXXunYpLFXX.jpg",
                 "mtwi/eval/TB25PgKirsTMeJjSszgXXacpFXa-1106900306.jpg.jpg"]
# Corrupt images in the validation set
eval_corrupt_imgs = ["xfun/val/zh_val_42.jpg"]  # Corrupt JPEG data: premature end of data segment
# Images whose labels may be wrong
label_mismatch = ["mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg",
                  "mtwi/train/TB1oe8CLXXXXXc4XFXXunYpLFXX.jpg"]
def dataset_filter(root_dir, file_list):
    # Delete each listed image; return the ones that could not be removed
    not_delete = []
    for filename in file_list:
        file_pth = os.path.join(root_dir, filename)
        if os.path.exists(file_pth):
            # os.remove returns None, so failure is detected via the exception
            try:
                os.remove(file_pth)
                print(f"{file_pth} has been removed")
            except OSError:
                print(f"{file_pth} cannot be removed")
                not_delete.append(filename)
        else:
            print(f"{file_pth} does not exist")
    print("Done correctly!")
    return not_delete
def remove_from_file(filename, filter_list):
    # Rewrite the label file, dropping every line that mentions a filtered image
    with open(filename, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    new_lines = []
    for line in lines:
        flag = True
        for img in filter_list:
            if img in line:
                flag = False
                print(f"Delete {img} from {filename}")
                break
        if flag:
            new_lines.append(line)
    with open(filename, 'w', encoding='utf-8') as f:
        f.writelines(new_lines)
def main():
    # remove img
    dataset_filter(root_dir, train_none_imgs)
    dataset_filter(root_dir, train_corrupt_imgs)
    dataset_filter(root_dir, eval_non_imgs)
    dataset_filter(root_dir, eval_corrupt_imgs)
    dataset_filter(root_dir, label_mismatch)
    # remove from label
    remove_from_file(os.path.join(root_dir, "train.txt"), train_none_imgs+train_corrupt_imgs+label_mismatch)
    remove_from_file(os.path.join(root_dir, "eval.txt"), eval_non_imgs+eval_corrupt_imgs)


if __name__ == "__main__":
    main()
|
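A less destructive variant of the cleanup above could leave the images and the original label file untouched and write a filtered copy of the label file instead. The sketch below assumes the tab-separated label format used by the detection script earlier in the thread (image path in the first field); `filter_label_file` and the output file name are hypothetical.

```python
def filter_label_file(src_path, dst_path, bad_images):
    """Copy src_path to dst_path, skipping lines whose image path is listed.

    Only the first tab-separated field is compared, and it must match
    exactly. Returns the number of dropped lines.
    """
    bad = set(bad_images)
    dropped = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            img_name = line.split("\t", 1)[0].strip()
            if img_name in bad:
                dropped += 1
                continue
            dst.write(line)
    return dropped
```

For example, `filter_label_file("train.txt", "train_clean.txt", train_none_imgs + train_corrupt_imgs)` would produce a cleaned list, and the training config would then need to point at the new file. Matching the whole path field avoids accidentally dropping a line whose path merely contains a listed name as a substring.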
aistudio(v100 16g)
paddlepaddle-gpu==2.2.1.post101
git clone https://gitee.com/paddlepaddle/PaddleOCR
!python tools/train.py -c configs/det/det_mv3_db.yml
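An excerpt of the error messages:
Many corrupt-image errors also appear during training and evaluation:
Training still runs normally, but this feels unfriendly to beginners (not sure whether the instructor planted this pitfall on purpose, hahaha).
Below are the contents of the det_mv3_db.yml file: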