TPU Graph

Code style: black

Overview

This repo contains the code used for the 3rd place solution of the Google - Fast or Slow? Predict AI Model Runtime competition on Kaggle.

The final submission was a combination of different networks trained at different stages during the development of this repo. For completeness, all submissions have separate branches; however, the state of the repo during development was sometimes quite messy. The code of the submissions all follows the same pattern but uses slightly different network architectures, conventions, etc. Experiments show that the main branch achieves the same or better results for the layout collections and is therefore recommended. Note that the Kendall's Tau score is a very noisy evaluation metric: training the same network in the same way can lead to a difference of 0.1 in the score, so one has to be careful when comparing different networks and should always train them multiple times.

The main branch can only be used for the layout collection. The tile collection has its own branch and does not need any special preparation.

Installation

This package was developed and tested with Python 3.11. You might want to install pytorch and torch_scatter manually with pre-built wheels matching your CUDA version. Otherwise, the installation is, in theory, as easy as

pip install -e .

Dependencies

This repo depends on pytorch which should be installed with GPU support.
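A quick sanity check that the GPU build is actually picked up might look like the following (a minimal sketch; the version attributes are the usual ones exposed by these packages):

import torch
import torch_scatter

print(torch.__version__, torch.version.cuda)  # torch version and the CUDA version it was built against
print(torch.cuda.is_available())              # should be True for GPU training
print(torch_scatter.__version__)              # should import without errors against the installed torch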

Usage

The package uses the data of the competition and has two scripts in the scripts folder. The first one, add_features.py, extracts additional features from the data and adds some derived features. In detail, it:

  • Extracts additional features from the protocol buffer files
  • Adds positional encodings using RWPE (random-walk positional encodings) with the asymmetric and symmetric adjacency matrices
  • Log-transforms features with a large dynamic range
  • Creates new features from the 30-dimensional features in the form x % 128 and (x // 128) / 10, where x is the original feature; this is done because of the register size of the TPU (see the sketch after this list)
  • Adds a virtual output node to the graph that connects all nodes with outputs
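As an illustration of the log and modulo/division transforms above, here is a minimal sketch (the feature matrix is a placeholder, not the repo's actual data loading):

import numpy as np

# Hypothetical node-feature matrix (num_nodes x num_features); the real
# features come from the competition npz files.
x = np.random.default_rng(0).integers(0, 2048, size=(5, 4)).astype(np.float64)

log_x = np.log1p(x)          # tame features with a large dynamic range
rem_x = x % 128              # remainder w.r.t. the 128-wide TPU registers
quot_x = (x // 128) / 10.0   # scaled quotient

features = np.concatenate([x, log_x, rem_x, quot_x], axis=-1)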

It requires pointers to directories containing the protocol buffer files and the npz files. You can have a look at the full signature with python scripts/add_features.py --help. Note that computing the RWPE can take a while and use a lot of memory for the larger graphs.
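For reference, a minimal sketch of a random-walk positional encoding on a tiny graph (not the repo's implementation; the repeated matrix products over the full adjacency are what make this memory-hungry for large graphs):

import numpy as np

# Hypothetical adjacency matrix of a tiny directed graph
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=np.float64)

def rwpe(adj, k=4):
    # Random-walk matrix: row-normalised adjacency
    deg = adj.sum(axis=1, keepdims=True)
    rw = adj / np.maximum(deg, 1.0)
    # RWPE: diagonals of the first k powers of the random-walk matrix
    p = np.eye(adj.shape[0])
    enc = []
    for _ in range(k):
        p = p @ rw
        enc.append(np.diag(p))
    return np.stack(enc, axis=1)  # shape (num_nodes, k)

pe_asym = rwpe(A)                  # asymmetric adjacency
pe_sym = rwpe(np.maximum(A, A.T))  # symmetrised adjacency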

The second script, train.py, trains a network on the data. You can have a look at the signature with python scripts/train.py --help. The script requires a path to a directory containing the npz files generated with add_features.py. After every epoch, the validation and test sets are evaluated and the results are saved along with the model. If multiple GPUs are available, one can specify the number of GPUs to use. Note that the backend and master port for distributed training are hard-coded in the script and might need to be changed.
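To illustrate what the hard-coded backend and master port refer to, a minimal sketch of a typical torch.distributed setup is shown below (the values and the setup function are placeholders, not necessarily what train.py uses):

import os
import torch.distributed as dist

# Placeholder values; train.py hard-codes its own choices.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

def setup(rank, world_size):
    # One process per GPU; "nccl" is the usual backend for CUDA tensors.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)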

Development

Hooks

Pre-commit hooks come in any color you'd like

pre-commit install -c .hooks/.pre-commit-config.yaml

TODO

  • Clean up, better docs
