This repository contains the training code for my 3rd place solution in the Kaggle competition IceCube - Neutrinos in Deep Ice.
I may add improvements and new ideas in the future, drawn from other participants' solutions that I found insightful.
The goal of the IceCube challenge is to predict a neutrino particle's direction from sensor data collected by the IceCube neutrino detector located in Antarctica.
Problem statement: given a sequence of detections, each containing the detection coordinates, detection time, and light intensity, predict the two angles (azimuth and zenith) of the incoming neutrino that caused the event.
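For reference, the two target angles and a 3D unit direction vector are related by the standard spherical-coordinate convention. The helpers below are just an illustration of that mapping, not code from this repository:

```python
import numpy as np

def angles_to_unit_vector(azimuth, zenith):
    """Map azimuth/zenith (radians) to a unit direction vector (x, y, z)
    using the usual spherical-coordinate convention."""
    return np.array([
        np.cos(azimuth) * np.sin(zenith),
        np.sin(azimuth) * np.sin(zenith),
        np.cos(zenith),
    ])

def unit_vector_to_angles(v):
    """Inverse mapping; azimuth is wrapped into [0, 2*pi)."""
    azimuth = np.arctan2(v[1], v[0]) % (2 * np.pi)
    zenith = np.arccos(np.clip(v[2], -1.0, 1.0))
    return azimuth, zenith
```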
My final solution for the challenge used the following components:
- Self Attention Encoder
- Two angle classification heads
- 3D vector regression head - trained with Von Mises-Fisher Loss
- Gradient Boosting Based ensembler to combine the predictions
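As a rough illustration of this layout (the module names, sizes, and the exact encoder block here are assumptions; the real model lives in attention_model_training/attn_model.py):

```python
import torch.nn as nn

class IceCubeDirectionModel(nn.Module):
    """Minimal sketch: self-attention encoder, average pooling, a shared neck,
    two angle-classification heads, and a 3D-vector regression head."""

    def __init__(self, d_model=512, n_heads=8, n_layers=18, neck=3072, n_bins=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.neck = nn.Sequential(nn.Linear(d_model, neck), nn.GELU())
        self.azimuth_head = nn.Linear(neck, n_bins)  # classification over azimuth bins
        self.zenith_head = nn.Linear(neck, n_bins)   # classification over zenith bins
        self.vector_head = nn.Linear(neck, 3)        # 3D direction vector (vMF loss)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len, d_model) embedded pulse features
        h = self.encoder(x, src_key_padding_mask=padding_mask)
        h = h.mean(dim=1)                            # average pooling over the sequence
        h = self.neck(h)
        return self.azimuth_head(h), self.zenith_head(h), self.vector_head(h)
```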
Download the data from the competition page.
You can place the data anywhere you like; by default the training code looks for the environment variable ICECUBE_DATA_DIR and otherwise checks ./data.
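A minimal sketch of that lookup (the helper name is hypothetical; the actual logic sits in the training configuration):

```python
import os
from pathlib import Path

def resolve_data_dir() -> Path:
    """Prefer the ICECUBE_DATA_DIR environment variable, else fall back to ./data."""
    return Path(os.environ.get("ICECUBE_DATA_DIR", "./data"))
```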
Run the data preparation scripts in the data_preparation directory:
- angle_bins.py
- geometry_with_transparency.py
- split_meta.py
Hyperparameters in attention_model_training/config.py are for the final model, which was trained for 5 days on an RTX 4080.
Original Model Parameters:
- Embedding size - 512
- Heads - 8
- Layers - 18 (a deeper network proved better for IceCube)
- Average Pooling
- Neck Size - 3072
- 128 bins for angle classification
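Expressed as a config object, these settings might look roughly like the following (the field names are illustrative; the real values are in attention_model_training/config.py):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Illustrative field names only; see attention_model_training/config.py
    embedding_size: int = 512
    n_heads: int = 8
    n_layers: int = 18
    pooling: str = "mean"       # average pooling over the sequence
    neck_size: int = 3072
    n_angle_bins: int = 128     # bins per classification head
```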
Other details:
- Loss - Locally Smoothed Cross Entropy
- LR Schedule - OneCycle (two-phase cosine)
- Max LR - 5e-5
- FP16 training (BF16 was unstable)
- Switch to FP32 towards the end of training
- Effective batch size - 384 (32 x 12)
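The locally smoothed cross entropy is not spelled out above, so here is one plausible reading of it, assuming the one-hot angle-bin target is spread over neighbouring bins with a Gaussian kernel (wrapped circularly, which suits azimuth); the actual loss in this repository may differ in its details:

```python
import torch
import torch.nn.functional as F

def locally_smoothed_ce(logits, target_bin, sigma=1.0, circular=True):
    """Cross entropy against a target that smears the one-hot label over
    nearby angle bins with a Gaussian kernel of width `sigma` (in bins)."""
    n_bins = logits.size(-1)
    bins = torch.arange(n_bins, device=logits.device, dtype=torch.float32)
    dist = (bins.unsqueeze(0) - target_bin.unsqueeze(1).float()).abs()
    if circular:
        dist = torch.minimum(dist, n_bins - dist)  # wrap around for azimuth
    target = torch.exp(-0.5 * (dist / sigma) ** 2)
    target = target / target.sum(dim=1, keepdim=True)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=1).mean()
```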
You can reduce the model size to get a model that reaches a score of ~1.00 in 3 hours on an RTX 4080.
Small model details:
- Embedding size - 64
- Heads - 2
- Layers - 9
- Neck Size - 1024
- Max LR - 2e-4
- Batch size - 384 (accumulation steps - 1)
Model training groups samples by sequence length, which gives a training speed gain.
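The idea is to batch events of similar length together so that padding (and therefore wasted attention compute) is minimised. A minimal sketch with a PyTorch-style sampler (the class name and chunking strategy are assumptions, not the repo's actual sampler):

```python
import numpy as np
from torch.utils.data import Sampler

class LengthGroupedSampler(Sampler):
    """Shuffle events, then sort each chunk by pulse-sequence length so that
    consecutive batches contain similarly sized events."""

    def __init__(self, lengths, batch_size, chunk_batches=64, seed=0):
        self.lengths = np.asarray(lengths)
        self.batch_size = batch_size
        self.chunk = batch_size * chunk_batches
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        order = self.rng.permutation(len(self.lengths))
        for start in range(0, len(order), self.chunk):
            chunk = order[start:start + self.chunk]
            chunk = chunk[np.argsort(self.lengths[chunk])]  # group similar lengths
            yield from chunk.tolist()

    def __len__(self):
        return len(self.lengths)
```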
For very long sequences, fine-tune the trained model on lengths up to 3072 (see FINETUNE_LONG_SEQ in the config file).
For efficiency, I dump 100 batches of encoder predictions before training the stack model. The stack model is a simple MLP trained for 3D vector regression with a von Mises-Fisher loss.
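The von Mises-Fisher loss for a 3D direction can be written so that the norm of the predicted vector plays the role of the concentration kappa. The sketch below follows that standard formulation (similar in spirit to GraphNet's loss) and is not the exact code used here:

```python
import torch
import torch.nn.functional as F

def vmf_3d_nll(pred, true_dir, eps=1e-8):
    """Von Mises-Fisher negative log-likelihood in 3D, constants dropped.
    `pred` is an unnormalised 3-vector whose norm is the concentration kappa;
    `true_dir` is the unit vector of the true neutrino direction."""
    kappa = pred.norm(dim=-1).clamp(min=eps)
    cos_theta = F.cosine_similarity(pred, true_dir, dim=-1)
    # -log C_3(kappa) = log sinh(kappa) - log(kappa) + const
    log_norm = kappa + torch.log1p(-torch.exp(-2.0 * kappa)) - torch.log(kappa)
    return (log_norm - kappa * cos_theta).mean()
```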
Run inference with inference/inference.py.
Set the cached prediction paths in stack_model_training/config.py.
Train the stack model with stack_model_training/train.py.
- Using Flash Attention - Flash attention was still in its early stages when I started writing the code. For some reason the default PyTorch flash attention was not working, even though the original flash-attention repo worked. If you're having trouble installing flash-attention, you can revert to default PyTorch attention by making the necessary changes in attention_model_training/attn_model.py (see the sketch below).
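A minimal sketch of such a fallback, assuming the model currently calls flash_attn_func with tensors shaped (batch, seq_len, heads, head_dim); the wrapper name is hypothetical and the layout must match what attn_model.py actually uses:

```python
import torch.nn.functional as F

def attention(q, k, v, use_flash=True):
    """q, k, v: (batch, seq_len, n_heads, head_dim). Uses flash-attn when
    available, otherwise falls back to PyTorch scaled_dot_product_attention."""
    if use_flash:
        from flash_attn import flash_attn_func
        return flash_attn_func(q, k, v)
    # SDPA expects (batch, n_heads, seq_len, head_dim)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)
```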
- Hyperparameter sensitivity:
  - LR Schedule - Fairly sensitive to changing the LR
  - Classifier bins - Going above 128 bins causes some instability
  - Dropout - Always results in worse performance
  - Weight decay - Can be changed without much effect
  - Network size
    - Embedding/Head - Below 32 didn't work well
    - Layers - 9 layers gave a big improvement over 6
- Mixed precision instability - I saw some instability in training after the validation metric reached 0.992; this has not been investigated further due to lack of time. I just switch to FP32 (the code supports automatic switching after a set epoch number).
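A minimal sketch of that kind of switch, assuming a standard torch.cuda.amp training loop (the epoch threshold and the loss-returning model call are illustrative, not the repo's actual code):

```python
import torch

FP32_SWITCH_EPOCH = 8  # illustrative; the repo reads the switch point from its config

def train_epoch(model, loader, optimizer, scaler, epoch):
    """Run under FP16 autocast until FP32_SWITCH_EPOCH, then in plain FP32."""
    use_amp = epoch < FP32_SWITCH_EPOCH
    for batch, target in loader:
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast("cuda", dtype=torch.float16, enabled=use_amp):
            loss = model(batch, target)  # assumes the model returns its loss
        if use_amp:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
```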
- The attention model code was heavily adapted from Andrej Karpathy's NanoGPT.
- Many insights about data processing and loss functions were taken from GraphNet.
- Kaggle discussion posts and baselines were extremely helpful.