Commit

added sampling for pretraining data
VahidooX committed Nov 19, 2023
1 parent ae8741f commit aa1a85d
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions scripts/speechlm_sft/sampling_pretraining_data.py
@@ -303,6 +303,8 @@
dataset_handler = open(dataset_path, 'r', encoding='utf-8')
dataset_handlers[dataset] = dataset_handler

print("Blend to be used:", DATA_BLEND)

with open(OUTPUT_FILE, 'w', encoding='utf-8') as outf:
    datasets = list(DATA_BLEND.keys())
    weights = list(DATA_BLEND.values())
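
The hunk is cut off right after the dataset names and weights are split out of DATA_BLEND, so the rest of the sampling loop is not visible here. As a minimal sketch of how such a name/weight pair is commonly used, and not necessarily what this script does, the example below draws dataset names in proportion to their weights with `random.choices`. The blend contents, the `sample_dataset_names` helper, and the seed are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical blend: dataset name -> sampling weight (not taken from the commit).
DATA_BLEND = {"dataset_a": 0.7, "dataset_b": 0.3}


def sample_dataset_names(blend, num_samples, seed=0):
    """Draw dataset names with probability proportional to their weights."""
    rng = random.Random(seed)  # local RNG so the sketch is reproducible
    datasets = list(blend.keys())
    weights = list(blend.values())
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(datasets, weights=weights, k=num_samples)


print("Blend to be used:", DATA_BLEND)
picks = sample_dataset_names(DATA_BLEND, 1000)
counts = {name: picks.count(name) for name in DATA_BLEND}
print(counts)
```

With enough draws, the per-dataset counts should approach the blend ratios. Printing the blend before sampling, as the commit does, makes it easy to check a run's log against the intended mix.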
