insights on noise in got dataset and fine-tuning issues #234

ep0p · 2024-12-04T10:15:10Z

After re-reading the GOT paper, I’d like more insight into how noise and document quality were handled during training. For example, was there any analysis of what percentage of the PDF documents from Common Crawl were distorted or noisy?

In my experiments, adding noise to 40% of the documents during fine-tuning still results in hallucinations at inference time. Should I increase that proportion, or would training from scratch be a better approach?
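
To make the setup concrete, here is a minimal sketch of what noising a fixed fraction of the fine-tuning documents could look like. It assumes grayscale page images stored as nested lists of 0-255 integers and uses Gaussian pixel noise. The function names, `noise_fraction`, `sigma`, and `seed` are hypothetical and not taken from the GOT codebase or paper.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of a grayscale image (list of rows of 0-255 ints)."""
    return [
        # Perturb each pixel, then clamp back into the valid 0-255 range.
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in image
    ]


def noise_fraction_of_dataset(images, noise_fraction=0.4, sigma=25.0, seed=0):
    """Apply noise to a random `noise_fraction` of the images.

    Returns the new list of images and the set of indices that were noised.
    Images that are not selected are returned unchanged (same objects).
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    n_noisy = round(len(images) * noise_fraction)
    noisy_idx = set(rng.sample(range(len(images)), n_noisy))
    augmented = [
        add_gaussian_noise(img, sigma, rng) if i in noisy_idx else img
        for i, img in enumerate(images)
    ]
    return augmented, noisy_idx


if __name__ == "__main__":
    # Ten dummy 4x4 mid-gray "pages" stand in for real document images.
    pages = [[[128] * 4 for _ in range(4)] for _ in range(10)]
    augmented, noisy_idx = noise_fraction_of_dataset(pages, noise_fraction=0.4)
    print(f"noised {len(noisy_idx)} of {len(pages)} pages:", sorted(noisy_idx))
```

Raising `noise_fraction` here corresponds to the "increase that proportion" option. Real scan degradation (blur, skew, compression artifacts) would need a different noise function than the Gaussian one assumed above.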
