Questions about errors during fine-tuning #2660

Open
yhuangece7 opened this issue Nov 26, 2024 · 0 comments

yhuangece7 commented Nov 26, 2024

Hello everyone,

I am fine-tuning the 20220506_u2pp_conformer_exp_wenetspeech model on Windows, using its .pt file. My dataset consists of WAV files and their corresponding SRT files. I have processed the data into four files (spk2utt, text, utt2spk, and wav.scp), which I intend to use as the training input.

In the fine-tuning process:

  1. My program calls the init_dataset_and_dataloader function in wenet/utils/train_utils.py to initialize the dataset and dataloader.
  2. init_dataset_and_dataloader in turn calls the init_dataset function from wenet/utils/init_dataset.py as follows:
    train_dataset = init_dataset(
        configs.get('dataset', 'asr'),
        args.data_type,
        args.train_data_dir,
        tokenizer,
        configs['dataset_conf'],
        partition=True,
        split='train'
    )
  3. In init_dataset.py, init_dataset calls init_asr_dataset, which in turn calls Dataset from dataset.py as follows:
    dataset = Dataset(data_type, data_list_file, tokenizer, conf, partition)
    The error occurs while the following line is processed:
    dataset = dataset.map(processor.parse_json)
    The error message is:
    error: "Missing required keys in elem: ('<path>/<to>\\wav.scp', StreamWrapper<<_io.TextIOWrapper name='<path>/<to>\\wav.scp' mode='rt' encoding='utf-8'>>)"
    
    As a result, the train_dataset returned in step 2 is empty (the reported Train dataset size is 0).
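
A possible explanation, not confirmed in this thread: the element in the error message is the wav.scp file itself, which suggests that the Kaldi-style file was passed directly as the data list. parse_json, as its name implies, appears to expect one JSON object per line instead. The sketch below merges wav.scp and text into such a JSON-lines file. The key names 'key', 'wav', and 'txt' are an assumption about what parse_json requires, and the helper names read_kaldi_table and make_raw_list are hypothetical. The WeNet recipes appear to ship a comparable script, tools/make_raw_list.py, which would be preferable if it is available.

```python
import json


def read_kaldi_table(path):
    """Read a Kaldi-style 'key value' file into a dict mapping key -> rest of line."""
    table = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split only on the first whitespace run, so values
            # (paths or transcripts) may themselves contain spaces.
            parts = line.split(maxsplit=1)
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def make_raw_list(wav_scp, text, out_path):
    """Write one JSON object per utterance and return the number written.

    Assumption: the required keys are 'key', 'wav', and 'txt'.
    """
    wavs = read_kaldi_table(wav_scp)
    txts = read_kaldi_table(text)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for key, wav in wavs.items():
            if key not in txts:
                continue  # skip utterances that have no transcript
            record = {"key": key, "wav": wav, "txt": txts[key]}
            # ensure_ascii=False keeps Chinese transcripts readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Under these assumptions, the resulting file (for example data.list) rather than wav.scp would be the path given as args.train_data_dir, with args.data_type set to raw.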

Here are my questions:

  1. Based on the steps above, are there any obvious mistakes in my fine-tuning data setup, function usage, or data format?
  2. Are there any fine-tuning examples I can refer to? I don't need the original data, but I would appreciate guidance on the configuration and workflow.

Thank you for your help!
