Commit 6ad3264: 1 changed file with 1 addition and 1 deletion.
@linrock I think making these layers match closely in size makes a lot of sense. More here: official-stockfish#3927 (comment).
Maybe if FC_0_OUTPUTS is now 31, TransformedFeatureDimensions should be made smaller than 2048 to keep overall speed high?
An alternate balanced config (without a bottleneck) would be FC_0_OUTPUTS = 15 and FC_1_OUTPUTS = 16, which may gain speed without much eval degradation. In any case, thanks for all the tests you are doing!
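For reference, the sizes under discussion correspond to compile-time constants kept in Stockfish's src/nnue/nnue_architecture.h. The sketch below is illustrative only: the identifier names come from the comments in this thread, the FC_0_OUTPUTS = 31 value is the proposal being debated (not necessarily this commit's diff), and FC_1_OUTPUTS = 32 is an assumed value for master at the time, not something stated here.

```cpp
// Illustrative sketch of the layer sizes debated above, written as the kind
// of compile-time constants Stockfish keeps in src/nnue/nnue_architecture.h.
#include <cstdio>

constexpr int TransformedFeatureDimensions = 2048;  // "L1": feature-transformer width per perspective
constexpr int FC_0_OUTPUTS = 31;                    // proposal: bring FC_0 close to FC_1 in size
constexpr int FC_1_OUTPUTS = 32;                    // assumed master value; not stated in the thread
// Alternate balanced config mentioned above (no bottleneck):
//   FC_0_OUTPUTS = 15, FC_1_OUTPUTS = 16

int main() {
    std::printf("L1 = %d, FC_0 = %d, FC_1 = %d\n",
                TransformedFeatureDimensions, FC_0_OUTPUTS, FC_1_OUTPUTS);
}
```

The trade-off raised above is that widening FC_0 adds work in the smaller layers, so shrinking TransformedFeatureDimensions below 2048 is one way to keep overall inference speed high.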
It's possible this would work. I'll certainly consider reducing L1 with L2 = 31. It takes quite some time to test out new architectures at master level, so I'll need to come up with a better way to compare architectures without the full training process, maybe just comparing nets after a one- or two-stage training.
The alternate config may work as well; it will need more testing later. Thanks for the suggestions, and of course, it's been fun!
@linrock Can I ask what the primary time-limiting factor in training nets is? Is it the speed of the GPU, the CPU, or something else? I am considering getting some new hardware. Thanks.
The speed of the GPU is the primary time-limiting factor for training nets, assuming there are enough CPU cores that data loading is not the bottleneck and the disk is fast enough that IO is not a bottleneck. The faster the GPU, the more CPU cores are needed to avoid data-loading bottlenecks. NVMe disks are likely good enough to avoid IO bottlenecks with single-GPU training on the fastest GPUs. You also need at least several hundred GB of disk space for current master-level datasets.
Having more CPU cores for local eval (default: 25k nodes per move) is also helpful to confirm the training is on track.
Let me know what hardware you're considering and I can estimate the training speed and where hardware resource bottlenecks may be.
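As a rough way to see why CPU cores matter for data loading, the sketch below estimates the stream of positions a GPU consumes per second. The batch size of 16384 and the per-core decode rate are assumptions for illustration only; the 40 it/s figure is the estimate given later in this thread for a 4090 at L1-2048.

```cpp
// Back-of-the-envelope data-loading check (all numbers are assumptions, see above):
// the loader must supply batch_size * iterations_per_second positions/sec,
// otherwise the GPU stalls waiting for data.
#include <cstdio>

int main() {
    const double batch_size        = 16384;   // assumed training batch size
    const double gpu_iters_per_sec = 40;      // ~4090 at L1-2048, estimate from later in the thread
    const double positions_per_sec = batch_size * gpu_iters_per_sec;  // ~655k positions/sec

    // Hypothetical per-core decode/shuffle rate for the data loader.
    const double positions_per_core_per_sec = 100000;

    std::printf("GPU consumes ~%.0f positions/sec -> roughly %.1f loader cores needed\n",
                positions_per_sec, positions_per_sec / positions_per_core_per_sec);
}
```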
Thank you. I have an 18C/36T i9-7980XE CPU and a Samsung 960 PRO NVMe drive (3500 MB/s).
I would probably just add a 4090 GPU. I could even add an A6000 GPU if the extra 48 GB of GPU memory would be a benefit over 24 GB. Do you estimate the speed by the number of Tensor cores or the number of CUDA cores?
A 4090 GPU would be best, and probably ~60% faster than an A6000 for training NNUE. GPU memory usage is relatively low for training L1-2048, maybe ~3 GB. I don't see VRAM being an issue for training NNUE, as these are relatively small networks by AI-training standards, so increasing GPU memory from 24 GB to 48 GB will have no benefit here. I believe the number of CUDA cores correlates more strongly with NNUE training speed than the number of Tensor cores.
18C/36T and a Samsung 960 NVMe sound more than enough to avoid CPU/disk speed bottlenecks while also leaving enough cores for reasonable local Elo results. On a 4090, I'd estimate 700+ epoch trainings for L1-2048 to take 2-3 days each at around 40 it/s. So a master-level L1-2048 net could be trained from scratch in a week or two, with a training method similar to the current master's.
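To sanity-check that estimate, the arithmetic below assumes 100M positions per epoch and a batch size of 16384; neither number is stated in this thread, they are assumed defaults for illustration.

```cpp
// Rough reconstruction of the training-time estimate above.
// 100M positions/epoch and batch size 16384 are assumptions, not
// numbers given in this thread.
#include <cstdio>

int main() {
    const double positions_per_epoch = 100e6;
    const double batch_size          = 16384;
    const double iters_per_sec       = 40;    // 4090 estimate from the comment above
    const double epochs              = 700;

    const double steps_per_epoch   = positions_per_epoch / batch_size;   // ~6100 steps
    const double seconds_per_epoch = steps_per_epoch / iters_per_sec;    // ~150 s
    const double days              = epochs * seconds_per_epoch / 86400; // ~1.2 days of pure GPU time

    std::printf("~%.1f days of pure training time for %g epochs\n", days, epochs);
}
```

That pure-compute figure of roughly 1.2 days is a lower bound; validation runs, checkpointing, and any data-loading stalls push it toward the quoted 2-3 days, and "700+" epochs can mean more than 700.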