
Commit 6ad3264: L1=2048 L2=31
linrock committed Sep 8, 2023
1 parent: b25d68f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/nnue/nnue_architecture.h
```diff
@@ -44,7 +44,7 @@ constexpr IndexType LayerStacks = 8;
 
 struct Network
 {
-  static constexpr int FC_0_OUTPUTS = 15;
+  static constexpr int FC_0_OUTPUTS = 31;
   static constexpr int FC_1_OUTPUTS = 32;
 
   Layers::AffineTransformSparseInput<TransformedFeatureDimensions, FC_0_OUTPUTS + 1> fc_0;
```
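For context, here is a minimal sketch (not part of the patch) of what this one-line change does to the size of the first dense layer, assuming fc_0 holds a TransformedFeatureDimensions × (FC_0_OUTPUTS + 1) weight matrix per bucket, as the declaration above suggests, with LayerStacks = 8 buckets from the hunk context:

```cpp
// Illustration only: rough fc_0 weight counts before and after this commit,
// assuming one TransformedFeatureDimensions x (FC_0_OUTPUTS + 1) matrix per bucket.
#include <cstdio>

int main() {
    constexpr int TransformedFeatureDimensions = 2048; // L1 on this branch
    constexpr int LayerStacks = 8;                     // buckets, from the hunk context
    constexpr int OldWidth = 15 + 1;                   // FC_0_OUTPUTS + 1 before
    constexpr int NewWidth = 31 + 1;                   // FC_0_OUTPUTS + 1 after

    std::printf("fc_0 weights per bucket: %d -> %d\n",
                TransformedFeatureDimensions * OldWidth,   // 32768
                TransformedFeatureDimensions * NewWidth);  // 65536
    std::printf("fc_0 weights across %d buckets: %d -> %d\n", LayerStacks,
                LayerStacks * TransformedFeatureDimensions * OldWidth,
                LayerStacks * TransformedFeatureDimensions * NewWidth);
    return 0;
}
```

In rough terms the change doubles the work done in fc_0 per evaluation, which is what the speed discussion in the comments below is about.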

6 comments on commit 6ad3264

@mstembera

@linrock I think making these layers match closely in size makes a lot of sense. More here: official-stockfish#3927 (comment)
Maybe if FC_0_OUTPUTS is now 31, TransformedFeatureDimensions should be made smaller than 2048 to keep overall speed high?
An alternate balanced config (w/o a bottleneck) would be FC_0_OUTPUTS = 15 and FC_1_OUTPUTS = 16, which may gain speed w/o much eval degradation. In any case, thanks for all the tests you are doing!
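To put rough numbers on the balance point, here is a sketch comparing the per-bucket dense-layer weight counts of the two configs discussed here. The fc_1 input width of 2 × FC_0_OUTPUTS is an assumption about how fc_0's outputs are fed forward, not something shown in this diff:

```cpp
// Illustration only: rough dense-layer weight counts per bucket for the configs
// discussed in this thread. The fc_1 input width (2 * FC_0_OUTPUTS) is assumed.
#include <cstdio>

struct Config { const char* name; int l1, fc0_out, fc1_out; };

int main() {
    const Config configs[] = {
        {"this commit (L1=2048, 31/32)", 2048, 31, 32},
        {"alternate   (L1=2048, 15/16)", 2048, 15, 16},
    };
    for (const Config& c : configs) {
        int fc0_weights = c.l1 * (c.fc0_out + 1);       // e.g. 2048 * 32 = 65536
        int fc1_weights = (2 * c.fc0_out) * c.fc1_out;  // assumed input width
        std::printf("%s: fc_0 = %6d weights, fc_1 = %5d weights\n",
                    c.name, fc0_weights, fc1_weights);
    }
    // Either way fc_0 dominates by well over an order of magnitude, which is why
    // shrinking L1 (TransformedFeatureDimensions) is the main lever for speed.
    return 0;
}
```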

@linrock

> Maybe if FC_0_OUTPUTS is now 31, TransformedFeatureDimensions should be made smaller than 2048 to keep overall speed high?

It's possible this would work. I'll certainly consider reducing L1 with L2 = 31. It takes quite some time to test out new architectures at master level, so I'll need to come up with a better way to compare architectures without the full training process, maybe just comparing nets after a one- or two-stage training.

> An alternate balanced config (w/o a bottleneck) would be FC_0_OUTPUTS = 15 and FC_1_OUTPUTS = 16, which may gain speed w/o much eval degradation. In any case, thanks for all the tests you are doing!

This may work as well; it will need more testing later. Thanks for the suggestions, and of course, it's been fun!

@mstembera

@linrock Can I ask what the primary time-limiting factor in training nets is? Is it the speed of the GPU or the CPU, or something else? I am considering getting some new hardware. Thanks.

@linrock

The speed of the GPU is the primary time-limiting factor for training nets, assuming there are enough CPU cores for data loading not to be the bottleneck, and the disk is fast enough for IO not to be a bottleneck. The faster the GPU, the more CPU cores are needed to avoid data-loading bottlenecks. NVMe disks are likely good enough to avoid IO bottlenecks with single-GPU training on the fastest GPUs. You'll also need at least several hundred GB of disk space for current master-level datasets.

Having more CPU cores for local eval (default: 25k nodes per move) is also helpful to ensure the training is on track.

Let me know what hardware you're considering and I can estimate the training speed and where hardware resource bottlenecks may be.

@mstembera commented on 6ad3264, Sep 16, 2023


Thank you. I have an 18C/36T i9-7980XE CPU and a Samsung 960 PRO 3500 MB/s NVMe drive.
I would probably just add a 4090 GPU. I could even add an A6000 GPU if the extra 48GB of GPU memory were a benefit over 24GB. Do you estimate the speed by the number of Tensor cores or the number of CUDA cores?

@linrock

A 4090 GPU would be best, and probably ~60% faster than an A6000 for training NNUE. GPU memory usage is relatively low for training L1-2048, maybe ~3GB. I don't see VRAM being an issue for training NNUE, as these are relatively small networks in terms of AI training workloads, so increasing GPU memory from 24GB to 48GB will have no benefit here. I believe the number of CUDA cores has a higher correlation with NNUE training speed than the number of Tensor cores.

The 18C/36T CPU and Samsung 960 NVMe sound more than enough to avoid CPU/disk speed bottlenecks while also having enough cores for reasonable local Elo results. On a 4090, I'd estimate 700+ epoch trainings for L1-2048 to take 2-3 days each at around 40 it/s. So a master-level L1-2048 net could be trained from scratch in a week or two, with a training method similar to the current master's.
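As a rough way to reproduce estimates like this for other hardware (iterations per epoch depend on the trainer's configured epoch size and batch size, which aren't given in this thread):

    training days ≈ (epochs × iterations per epoch) / (it/s × 86,400 s/day)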
