Issues: intelligent-machine-learning/dlrover

Issues list

Support HTTP for master-worker communication. (labels: enhancement, todo)
#1366 opened Nov 29, 2024 by BalaBalaYi Backlog
Will flashcheckpoint support fully parallel save in megatron core 0.7+? (labels: question)
#1363 opened Nov 28, 2024 by leondada
Adapting dlrover to new accelerator types and implementing a script similar to Nvidia_gpu.py (labels: question)
#1338 opened Nov 15, 2024 by lulu-0126
Add balance loss in atorch moe example (labels: Hacktoberfest, todo)
#1300 opened Oct 18, 2024 by skydoorkai
How does dlrover make sure all the nodes in one job are in one switch? (labels: question)
#1298 opened Oct 17, 2024 by gangxie112
Enhance/Replace k8s python client. (labels: Hacktoberfest, wip)
#1291 opened Oct 12, 2024 by BalaBalaYi v0.4.0
add xpu monitor for dlrover (labels: Hacktoberfest, todo)
#1290 opened Oct 12, 2024 by majieyue
Can you create a dlrover arm64 image for Ascend NPU? (labels: question)
#1248 opened Aug 22, 2024 by xmarker
Question: How does DLRover integrate with Llama Factory? (labels: question)
#1244 opened Aug 21, 2024 by hetingyou
xpu timer python package (labels: todo)
#1159 opened Jun 17, 2024 by zxyyzx
Use Gang Scheduling in ElasticJob of DLRover. (labels: todo)
#1075 opened Apr 14, 2024 by workingloong Backlog
The job stops restarting workers and exits if the traceback is a code bug. (labels: enhancement, question, todo)
#1068 opened Apr 8, 2024 by workingloong Backlog