The GPU capacity of hosts and kubectl describe node does not match
#344
I also created an issue on the NVIDIA developer forum: https://devtalk.nvidia.com/default/topic/1023105/general/nvidia-smi-and-dev-nvidia-does-not-match/ |
This means one of the cards was lost at runtime; dmesg should show some information about it.
|
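Right now kubelet detects the number of GPUs in a fairly crude way: it simply counts the device files matching /dev/nvidia[0-9]+. Detecting the number of GPUs based on their runtime state will have to wait until v1.8 or v1.9.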
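
To illustrate why this count can disagree with what the driver reports, here is a minimal sketch of the file-matching approach described above. It is written in Python rather than kubelet's Go and is not kubelet's actual code; the function names `count_nvidia_devices` and `count_gpus_in_dev` are hypothetical.

```python
import os
import re

# Pattern taken from the comment above: "nvidia" followed by one or more digits.
# Entries such as nvidiactl or nvidia-uvm are deliberately not counted.
NVIDIA_DEVICE_RE = re.compile(r"nvidia[0-9]+")


def count_nvidia_devices(names):
    """Count directory entries whose full name looks like nvidia0, nvidia1, ..."""
    return sum(1 for name in names if NVIDIA_DEVICE_RE.fullmatch(name))


def count_gpus_in_dev(dev_dir="/dev"):
    """Count GPU device files under dev_dir; returns 0 if it cannot be read."""
    try:
        return count_nvidia_devices(os.listdir(dev_dir))
    except OSError:
        return 0
```

Because this only looks at file names, a card that fails at runtime still has its device file and is still counted, which would explain a mismatch with the number of GPUs the driver actually sees.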
|
Thanks @pineking, dmesg does indeed contain initialization-failure logs:
[1903878.128627] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)
[1903878.128678] NVRM: rm_init_adapter failed for device bearing minor number 6
Have you run into a similar problem? |
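
For checking many hosts, the log lines quoted above can be scanned automatically. The following is a rough sketch that assumes the message format is exactly as it appears in this thread; `failed_minor_numbers` is a hypothetical helper name, and mapping minor number N to /dev/nvidiaN is an assumption that should be verified on the host.

```python
import re

# Matches the NVRM message quoted in this thread and captures the minor number.
# Assumption: the driver's wording is exactly as shown in the issue.
MINOR_RE = re.compile(
    r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)"
)


def failed_minor_numbers(dmesg_text):
    """Return the sorted, de-duplicated minor numbers of GPUs that failed to initialize."""
    return sorted({int(m.group(1)) for m in MINOR_RE.finditer(dmesg_text)})


sample = (
    "[1903878.128627] NVRM: RmInitAdapter failed! (0x26:0xffff:1096)\n"
    "[1903878.128678] NVRM: rm_init_adapter failed for device bearing minor number 6\n"
)
failed = failed_minor_numbers(sample)
```

In practice the input text would come from the output of `dmesg` on the affected node rather than from a hard-coded sample.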
We ran into this before, and a reboot fixed it. We have not seen any lost cards recently. |