Callback PR Rev 3 #615

Merged · 45 commits · May 28, 2020
Commits
b16d356
Rebase off of master; add new working prototype of loss callback
blisc May 6, 2020
8024454
first working hack of computing uncomputed tensors
blisc May 7, 2020
879fcfc
style
blisc May 7, 2020
ddbf472
add a reference to Actions into TrainingState, remove deprecated func…
blisc May 12, 2020
912d83d
add decorators; add all events
blisc May 12, 2020
2e4eb18
style
blisc May 12, 2020
35d6b7d
more style
blisc May 12, 2020
4f6e1f7
initial refactor
blisc May 16, 2020
3c7b89e
adding checkpoint callback
blisc May 16, 2020
cf41850
enable fetching via NmTensor and string; add WandBCallback, Tensorboa…
blisc May 18, 2020
b1df99d
style
blisc May 18, 2020
d62f021
merge with master
blisc May 18, 2020
fa6553f
DDP bug fix
blisc May 18, 2020
ba84c80
clean up of checkpoint
blisc May 20, 2020
fc3ce62
update an4
blisc May 20, 2020
3a53e5e
merge with master
blisc May 20, 2020
e5b8258
style
blisc May 20, 2020
5361003
undo commenting
blisc May 20, 2020
9fc00d7
update
blisc May 20, 2020
d806e7e
wip
blisc May 21, 2020
f1c8aa8
more logging
blisc May 21, 2020
7608d45
remove debugging statements
blisc May 21, 2020
9f13e9d
style and merge
blisc May 21, 2020
5fe64fb
update new warning format with rank
blisc May 21, 2020
01dd179
add explicit rank marker
blisc May 21, 2020
f85f968
Merge branch 'warningformat_bug_fix' into U_callbacks_4
blisc May 21, 2020
46800cf
Merge remote-tracking branch 'nvidia/master' into U_callbacks_4
blisc May 21, 2020
c6ece47
docstrings and more
blisc May 21, 2020
3c3bee9
style
blisc May 21, 2020
dba4536
callback docstrings
blisc May 21, 2020
d95b2d4
style
blisc May 21, 2020
7bb53cd
add deprecation warnings
blisc May 22, 2020
d615efa
changelog
blisc May 22, 2020
21f4cf1
rename oldwandbcallback
blisc May 22, 2020
1c99f54
test
blisc May 22, 2020
b976ec0
style
blisc May 22, 2020
6ec04aa
first commit of changes
blisc May 27, 2020
7009bee
some fixes
blisc May 27, 2020
9f4566b
style
blisc May 27, 2020
307f550
move nmtensor_registry
blisc May 28, 2020
31fc556
update tests
blisc May 28, 2020
b9e4441
clean code for comments
blisc May 28, 2020
c036084
add back str_to_opt_level
blisc May 28, 2020
1e429af
split callbacks into two files; update error messages
blisc May 28, 2020
fdae1f3
add deprecated callbacks files
blisc May 28, 2020
CHANGELOG.md (9 changes: 5 additions & 4 deletions)
@@ -83,6 +83,7 @@ To release a new version, please update the changelog as followed:

### Changed
- Syncs across workers at each step to check for NaN or inf loss. Terminates all workers if stop\_on\_nan\_loss is set (as before), lets Apex deal with it if apex.amp optimization level is O1 or higher, and skips the step across workers otherwise. ([PR #637](https://github.com/NVIDIA/NeMo/pull/637)) - @redoctopus
- Updated the callback system. Old callbacks will be deprecated in version 0.12. ([PR #615](https://github.com/NVIDIA/NeMo/pull/615)) - @blisc

### Dependencies Update

@@ -123,7 +124,7 @@ files, along with unit tests, examples and tutorials
([PR #375](https://github.com/NVIDIA/NeMo/pull/375)) - @titu1994

### Changed
- Refactoring of `nemo_nlp` collections:
([PR #368](https://github.com/NVIDIA/NeMo/pull/368)) - @VahidooX, @yzhang123, @ekmb
- renaming and restructuring of files, folder, and functions in `nemo_nlp`
- losses cleaned up. LossAggregatorNM moved to nemo/backends/pytorch/common/losses
@@ -138,7 +139,7 @@ files, along with unit tests, examples and tutorials
([PR #284](https://github.com/NVIDIA/NeMo/pull/284)) - @stasbel
- NeMo is not longer using pep8 code style rules. Code style rules are now enforced with `isort` and `black` incorporated into CI checks.
([PR #286](https://github.com/NVIDIA/NeMo/pull/286)) - @stasbel
- Major cleanup of Neural Module constructors (init), aiming at increasing the framework robustness: cleanup of NeuralModule initialization logic, refactor of trainer/actions (getting rid of local_params), fixes of several examples and unit tests, extraction and storing of intial parameters (init_params).
([PR #309](https://github.com/NVIDIA/NeMo/pull/309)) - @tkornuta-nvidia
- Updated nemo's use of the logging library. from nemo import logging is now the reccomended way of using the nemo logger. neural_factory.logger and all other instances of logger are now deprecated and planned for removal in the next version. Please see PR 267 for complete change information.
([PR #267](https://github.com/NVIDIA/NeMo/pull/267), [PR #283](https://github.com/NVIDIA/NeMo/pull/283), [PR #305](https://github.com/NVIDIA/NeMo/pull/305), [PR #311](https://github.com/NVIDIA/NeMo/pull/311)) - @blisc
@@ -147,7 +148,7 @@ files, along with unit tests, examples and tutorials

- Added TRADE (dialogue state tracking model) on MultiWOZ dataset
([PR #322](https://github.com/NVIDIA/NeMo/pull/322)) - @chiphuyen, @VahidooX
- Question answering:
([PR #390](https://github.com/NVIDIA/NeMo/pull/390)) - @yzhang123
- Changed question answering task to use Roberta and Albert as alternative backends to Bert
- Added inference mode that does not require ground truth labels
@@ -158,7 +159,7 @@ files, along with unit tests, examples and tutorials
### Deprecated

### Fixed
- Critical fix of the training action on CPU
([PR #308](https://github.com/NVIDIA/NeMo/pull/309)) - @tkornuta-nvidia
- Fixed issue in Tacotron 2 prenet
([PR #444](https://github.com/NVIDIA/NeMo/pull/444)) - @blisc
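The changelog entry above is the one this PR adds: the callback system has been reworked, and the pre-existing callback classes are slated for deprecation in 0.12. As a rough, minimal sketch of the older NeMo 0.x style being phased out (the replacement class names introduced by this PR are not visible in this excerpt, so none are shown; `loss`, `nf`, and all parameter values below are placeholders):

```python
# Minimal sketch of the pre-0.11 callback usage that this PR deprecates.
# Assumes `nf` is a NeuralModuleFactory and `loss` is the NmTensor returned
# by a loss module, as in create_dags() in the next file; values are placeholders.
import nemo
from nemo.utils import logging

train_callback = nemo.core.SimpleLossLoggerCallback(
    tensors=[loss],
    print_func=lambda x: logging.info(f"Training loss: {x[0].item():.4f}"),
)
checkpointer_callback = nemo.core.CheckpointCallback(
    folder="./checkpoints", step_freq=1000,
)

nf.train(
    tensors_to_optimize=[loss],
    callbacks=[train_callback, checkpointer_callback],
    optimizer="novograd",
    optimization_params={"num_epochs": 100, "lr": 0.02},
)
```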
examples/asr/jasper_an4.py (117 changes: 59 additions & 58 deletions)
@@ -17,64 +17,68 @@
process_evaluation_epoch,
word_error_rate,
)
from nemo.core import NeuralGraph
from nemo.utils import logging
from nemo.utils.lr_policies import CosineAnnealing


def create_dags(model_config_file, vocab, args, nf):

# Create a data_layer for training.
data_layer = nemo_asr.AudioToTextDataLayer.import_from_config(
model_config_file,
"AudioToTextDataLayer_train",
overwrite_params={"manifest_filepath": args.train_dataset, "batch_size": args.batch_size},
)
with NeuralGraph() as g0:
# Create a data_layer for training.
data_layer = nemo_asr.AudioToTextDataLayer.import_from_config(
model_config_file,
"AudioToTextDataLayer_train",
overwrite_params={"manifest_filepath": args.train_dataset, "batch_size": args.batch_size},
)

num_samples = len(data_layer)
steps_per_epoch = math.ceil(num_samples / (data_layer.batch_size * args.iter_per_step * nf.world_size))
total_steps = steps_per_epoch * args.num_epochs
logging.info("Train samples=", num_samples, "num_steps=", total_steps)
num_samples = len(data_layer)
steps_per_epoch = math.ceil(num_samples / (data_layer.batch_size * args.iter_per_step * nf.world_size))
total_steps = steps_per_epoch * args.num_epochs
logging.info("Train samples=", num_samples, "num_steps=", total_steps)

# Create a data_layer for evaluation.
data_layer_eval = nemo_asr.AudioToTextDataLayer.import_from_config(
model_config_file, "AudioToTextDataLayer_eval", overwrite_params={"manifest_filepath": args.eval_datasets},
)
# Create a data_layer for evaluation.
data_layer_eval = nemo_asr.AudioToTextDataLayer.import_from_config(
model_config_file, "AudioToTextDataLayer_eval", overwrite_params={"manifest_filepath": args.eval_datasets},
)

num_samples = len(data_layer_eval)
logging.info(f"Eval samples={num_samples}")
num_samples = len(data_layer_eval)
logging.info(f"Eval samples={num_samples}")

# Instantiate data processor.
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor.import_from_config(
model_config_file, "AudioToMelSpectrogramPreprocessor"
)
# Instantiate data processor.
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor.import_from_config(
model_config_file, "AudioToMelSpectrogramPreprocessor"
)

# Instantiate JASPER encoder-decoder modules.
jasper_encoder = nemo_asr.JasperEncoder.import_from_config(model_config_file, "JasperEncoder")
jasper_decoder = nemo_asr.JasperDecoderForCTC.import_from_config(
model_config_file, "JasperDecoderForCTC", overwrite_params={"num_classes": len(vocab)}
)
# Instantiate JASPER encoder-decoder modules.
jasper_encoder = nemo_asr.JasperEncoder.import_from_config(model_config_file, "JasperEncoder")
jasper_decoder = nemo_asr.JasperDecoderForCTC.import_from_config(
model_config_file, "JasperDecoderForCTC", overwrite_params={"num_classes": len(vocab)}
)

# Instantiate losses.
ctc_loss = nemo_asr.CTCLossNM(num_classes=len(vocab))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

# Create a training graph.
audio, audio_len, transcript, transcript_len = data_layer()
processed, processed_len = data_preprocessor(input_signal=audio, length=audio_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed, length=processed_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)
loss = ctc_loss(log_probs=log_probs, targets=transcript, input_length=encoded_len, target_length=transcript_len,)

# Create an evaluation graph.
audio_e, audio_len_e, transcript_e, transcript_len_e = data_layer_eval()
processed_e, processed_len_e = data_preprocessor(input_signal=audio_e, length=audio_len_e)
encoded_e, encoded_len_e = jasper_encoder(audio_signal=processed_e, length=processed_len_e)
log_probs_e = jasper_decoder(encoder_output=encoded_e)
predictions_e = greedy_decoder(log_probs=log_probs_e)
loss_e = ctc_loss(
log_probs=log_probs_e, targets=transcript_e, input_length=encoded_len_e, target_length=transcript_len_e,
)
# Instantiate losses.
ctc_loss = nemo_asr.CTCLossNM(num_classes=len(vocab))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

# Create a training graph.
audio, audio_len, transcript, transcript_len = data_layer()
processed, processed_len = data_preprocessor(input_signal=audio, length=audio_len)
encoded, encoded_len = jasper_encoder(audio_signal=processed, length=processed_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)
loss = ctc_loss(
log_probs=log_probs, targets=transcript, input_length=encoded_len, target_length=transcript_len,
)

# Create an evaluation graph.
audio_e, audio_len_e, transcript_e, transcript_len_e = data_layer_eval()
processed_e, processed_len_e = data_preprocessor(input_signal=audio_e, length=audio_len_e)
encoded_e, encoded_len_e = jasper_encoder(audio_signal=processed_e, length=processed_len_e)
log_probs_e = jasper_decoder(encoder_output=encoded_e)
predictions_e = greedy_decoder(log_probs=log_probs_e)
loss_e = ctc_loss(
log_probs=log_probs_e, targets=transcript_e, input_length=encoded_len_e, target_length=transcript_len_e,
)
logging.info("Num of params in encoder: {0}".format(jasper_encoder.num_weights))

# Callbacks to print info to console and Tensorboard.
@@ -99,14 +103,7 @@ def create_dags(model_config_file, vocab, args, nf):
callbacks = [train_callback, checkpointer_callback, eval_callback]

# Return entities required by the actual training.
return (
loss,
eval_tensors,
callbacks,
total_steps,
log_probs_e,
encoded_len_e,
)
return (loss, eval_tensors, callbacks, total_steps, log_probs_e, encoded_len_e, g0)


def main():
@@ -166,7 +163,7 @@ def main():
# Get vocabulary.
vocab = jasper_params['labels']

(loss, eval_tensors, callbacks, total_steps, log_probs_e, encoded_len_e,) = create_dags(
(loss, eval_tensors, callbacks, total_steps, log_probs_e, encoded_len_e, g0) = create_dags(
args.model_config, vocab, args, nf
)

@@ -232,13 +229,17 @@ def main():
folder=checkpoint_dir, step_freq=args.checkpoint_save_freq, force_load=True,
)

# Distributed Data Parallel changes the underlying class so we need
# to reinstantiate Encoder and Decoder
args.num_epochs += 10
previous_step_count = total_steps
loss, eval_tensors, callbacks, total_steps, _, _ = create_dags(args.model_config, vocab, args, nf)

# Distributed Data Parallel and amp changes the underlying class so we need to reinstantiate modules
# Clear the module registery
nemo.utils.app_state.AppState().modules.clear()
# Delete old graph and make a new one
del g0
nf.reset_trainer()
loss, eval_tensors, callbacks, total_steps, _, _, new_g = create_dags(args.model_config, vocab, args, nf)

nf.train(
tensors_to_optimize=[loss],
callbacks=callbacks,
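For readability, the restart sequence scattered through the tail of this diff is pulled together below. It is a condensed paraphrase of the changed lines, not new behavior: because DDP and AMP replace the underlying module classes, the example clears the module registry, deletes the old `NeuralGraph`, resets the trainer, and rebuilds the DAGs before training again (optimizer name and parameter values are placeholders):

```python
# Condensed restart pattern from the jasper_an4.py changes above. DDP/AMP wrap
# the underlying module classes, so the graph and module registry are discarded
# and everything is rebuilt with create_dags() before the second training run.
import nemo

# First run; create_dags() now also returns the NeuralGraph handle g0.
loss, eval_tensors, callbacks, total_steps, log_probs_e, encoded_len_e, g0 = create_dags(
    args.model_config, vocab, args, nf
)
nf.train(
    tensors_to_optimize=[loss],
    callbacks=callbacks,
    optimizer="novograd",  # placeholder
    optimization_params={"num_epochs": args.num_epochs, "lr": args.lr},  # placeholders
)

# Continue training with freshly built modules and a new graph.
args.num_epochs += 10
previous_step_count = total_steps

nemo.utils.app_state.AppState().modules.clear()  # clear the module registry
del g0                                           # drop the old graph
nf.reset_trainer()                               # reset trainer/optimizer state

loss, eval_tensors, callbacks, total_steps, _, _, g0 = create_dags(args.model_config, vocab, args, nf)
nf.train(
    tensors_to_optimize=[loss],
    callbacks=callbacks,
    optimizer="novograd",  # placeholder
    optimization_params={"num_epochs": args.num_epochs, "lr": args.lr},  # placeholders
)
```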