Description
Training a ~6 GB dataset with LightGBM and n_jobs=70 does not result in a proportional reduction in training time. Despite using a machine with 72 cores and setting a high n_jobs value, training time remains unexpectedly high.
Environment
OS: Linux 6.1.0-27-cloud-amd64 Debian
CPU:
- Architecture: x86_64
- CPU(s): 72
- Model Name: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
- Cores: 72 (1 thread per core)
- Flags: AVX, AVX2, AVX512, FMA, etc.
- Cache: 288 MB L2, 16 MB L3
- NUMA Node(s): 1
Memory:
total used free shared buff/cache available
Mem: 491Gi 81Gi 399Gi 1.1Mi 15Gi 410Gi
Swap: 79Gi 84Mi 79Gi
Storage:
Filesystem Size Used Avail Use% Mounted on
udev 246G 0 246G 0% /dev
tmpfs 50G 1.8M 50G 1% /run
/dev/sda1 197G 104G 86G 55% /
tmpfs 246G 0 246G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda15 124M 12M 113M 10% /boot/efi
tmpfs 50G 0 50G 0% /run/user/10476
tmpfs 50G 0 50G 0% /run/user/90289
tmpfs 50G 0 50G 0% /run/user/1003
VM Type: Custom VM in a cloud environment.
LightGBM Setup
Version: 3.2.1 (conda build py38h709712a_0)
Parameters: n_estimators=325, num_leaves=512, colsample_bytree=0.2, min_data_in_leaf=80, max_depth=22, learning_rate=0.09, objective="binary", n_jobs=70, boost_from_average=True, max_bin=200, bagging_fraction=0.999, lambda_l1=0.29, lambda_l2=0.165 (a reproduction sketch using these parameters follows the Dataset description below)
Dataset:
Size: ~6GB
Characteristics: Binary classification problem, categorical and numerical features, preprocessed and balanced.
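The issue does not include the actual training code, so the following is only a minimal, hypothetical reproduction sketch: it assumes the scikit-learn interface of the LightGBM Python package, and `X`/`y` are placeholders standing in for the real ~6 GB preprocessed dataset.

```python
# Hypothetical reproduction sketch -- the real code and data are not shown in the issue.
# Assumes the scikit-learn API of the LightGBM Python package; X and y are placeholders
# for the actual ~6 GB preprocessed, balanced binary-classification dataset.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.random((1_000_000, 100)).astype(np.float32)   # placeholder feature matrix
y = rng.integers(0, 2, size=1_000_000)                # placeholder binary target

model = lgb.LGBMClassifier(
    n_estimators=325,
    num_leaves=512,
    colsample_bytree=0.2,
    min_data_in_leaf=80,
    max_depth=22,
    learning_rate=0.09,
    objective="binary",
    n_jobs=70,
    boost_from_average=True,
    max_bin=200,
    bagging_fraction=0.999,
    lambda_l1=0.29,
    lambda_l2=0.165,
)
model.fit(X, y)
```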
Performance Issues
Current Performance:
Training time with n_jobs=32: ~25 minutes
Training time with n_jobs=70: ~23 minutes
Expected Performance:
Substantial reduction in training time when utilizing 70 cores, ideally below 10 minutes.
Bottleneck Symptoms:
Minimal reduction in training time with increased cores (n_jobs).
CPU utilization remains low, with individual threads not fully utilized.
System Metrics During Training
CPU Utilization:
Average utilization: ~40%
Peak utilization: ~55%
Core-specific activity: most cores show low activity levels (<30%); see the monitoring sketch after this section.
Memory Usage:
Utilized during training: ~81Gi
Free memory: ~399Gi
Swap usage: ~84Mi
Disk I/O:
Read: ~50MB/s
Write: ~30MB/s
I/O wait time: ~2%
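The per-core numbers above were sampled while training ran; one simple way to collect them (a sketch assuming the psutil package is available, roughly equivalent to watching `mpstat -P ALL`) is:

```python
# Sketch: sample per-core CPU utilization while training runs in another process.
# Assumes psutil is installed; this is just one way to collect the numbers reported above.
import psutil

for _ in range(60):                                          # sample for ~1 minute
    per_core = psutil.cpu_percent(interval=1, percpu=True)   # one value per logical CPU
    busy = sum(1 for p in per_core if p > 50)
    avg = sum(per_core) / len(per_core)
    print(f"busy cores (>50%): {busy}/{len(per_core)}, average utilization: {avg:.0f}%")
```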
Request for Support
Explanation of why n_jobs scaling is not improving training time.
Suggestions for configurations to fully utilize 70 cores for LightGBM training.
Recommendations for debugging and monitoring specific to LightGBM threading or system-level bottlenecks (a thread-scaling timing sketch follows this list).
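A simple thread-scaling benchmark along these lines could make the scaling behaviour concrete. This is only a sketch: it assumes the Python package and that `X`/`y` hold the already-loaded training data, which is not included in the issue.

```python
# Sketch of a thread-scaling benchmark; X and y stand in for the real training data.
# Each timing includes Dataset construction, so treat the numbers as relative, not absolute.
import time
import lightgbm as lgb

for num_threads in (1, 2, 4, 8, 16, 32, 70):
    params = {
        "objective": "binary",
        "num_leaves": 512,
        "max_bin": 200,
        "num_threads": num_threads,
        "verbosity": -1,
    }
    train_set = lgb.Dataset(X, label=y)                 # rebuilt each round to keep runs independent
    start = time.perf_counter()
    lgb.train(params, train_set, num_boost_round=50)    # short run, just for timing
    print(f"num_threads={num_threads}: {time.perf_counter() - start:.1f}s")
```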
jameslamb changed the title from "Bug Report: No Improvement in Training Time with more Cores on LightGBM" to "No Improvement in Training Time with more Cores on LightGBM" on Nov 25, 2024.
You haven't provided enough information yet for us to help you with this report.
can you provide a minimal, reproducible example (docs on that) showing the exact code you're running and how you installed LightGBM?
You haven't even told us whether you're using the Python package, R package, CLI, etc.
You haven't told us anything about the shape and content of the dataset, other than its total size in memory. CPU utilization is heavily dependent on the shape of the input data (e.g. number of rows and columns) and the distribution of the features (e.g. cardinality of categorical values).
are there any other processes running on the system?
If you're trying to devote all cores to LightGBM training, they'll be competing with any other work happening on the system.
have I understood correctly that you're using LightGBM 3.2.1?
if so, please try updating to the latest version (v4.5.0) and tell us if that changes the results. There have been hundreds of bug fixes and improvements in the 4+ years of development between those two versions.
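For example, one way to confirm which version is actually being used before and after upgrading (e.g. `pip install --upgrade lightgbm`, or the conda equivalent) is:

```python
# Confirm the installed LightGBM version before re-running the training script.
import lightgbm
print(lightgbm.__version__)   # expect 4.5.0 (or newer) after upgrading
```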