Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regression tests for training #730

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

2015aroras
Copy link
Collaborator

@2015aroras 2015aroras commented Oct 7, 2024

Issue: If we make a backward incompatible change or a regression, we don't have a mechanism to catch it. Also, if we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.

Fix: Add training regression tests that run 2 steps of training and compare the model activations against an already-prepared set of model activations from beaker. The saved model activations can be updated by changing the flags passed to the tests.

The added tests also run on CPU, but this required making minor changes to OLMo training code. Autocast works differently on CPU, so the model activations are different for CPU compared to GPU.

The saved model-activations are about 26Mb total, which is a reasonable increase to repo size...

@2015aroras 2015aroras changed the title Shanea/add training regression tests Add regression tests for training Oct 7, 2024
@2015aroras 2015aroras marked this pull request as ready for review October 7, 2024 22:09
@2015aroras 2015aroras requested review from dirkgr and epwalsh October 7, 2024 22:09
@epwalsh
Copy link
Member

epwalsh commented Oct 8, 2024

I'm in favor of adding trainer tests but I have a couple concerns about this approach, the foremost being that I think comparing activations is way too brittle and strict. It would be enough to just check that loss is decreasing over a few steps.

we start running jobs on a new platform we don't have an easy way to tell if the training will be identical to existing platforms.

I wouldn't assume training would or even should be identical across platforms.

The saved model-activations are about 26Mb total, which is a reasonable increase to repo size...

I think this is a lot actually especially considering new versions of these may have to committed over time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants