[Mellanox] Disable SSD NCQ on Mellanox platforms #17567
@StormLiangMS, @liushilongbuaa PR #17567 conflicts with the MS internal repo.
@saiarcot895 Could you please help review? Thanks!
How about the SN2700 A1?
/azpw ms_conflict
@volodymyrsamotiy you are missing SN5600 as well as SN2700-A1.
Please go over the list of systems. BTW, this means you will have a conflict with 202205, so please add the missing systems and create a backport PR for 202205.
@liat-grozovik are we good to go?
472b63f to 49e301e
@liat-grozovik can we move forward with this PR and handle 5600 and 2700-A1 with another PR?
@yxieca, there is no need to handle 5600 and 2700-A1 in another PR; this PR has already been updated and includes all the changes.
@volodymyrsamotiy what is keeping this PR in draft mode? @liat-grozovik any other blocking issues?
@liat-grozovik Can we unblock this PR now?
Removed the label for the 202205 branch, as another PR has been opened against 202205: #17662
@liat-grozovik I think this has been addressed, so please check again.
@volodymyrsamotiy PR conflicts with the 202311 branch.
@volodymyrsamotiy can you help create a 202311 PR?
- Why I did it
Based on some research, some products might experience occasional I/O failures in the communication between the CPU and the SSD because of NCQ. There seems to be a problem between some kernel versions and some SATA controllers.
Syslog error message examples:
Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors have already disabled NCQ on their platforms in SONiC due to a similar issue:
[Arista] Disable ATA NCQ for a few products sonic-net#13739
[Arista] Disable SSD NCQ on DCS-7050CX3-32S sonic-net#13964
There are also discussions on Debian/Ubuntu forums about similar issues, where disabling NCQ was suggested:
https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous
- How I did it
Add a kernel parameter to tell libata to disable NCQ
- How to verify it
Use the fio tool:
fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
Cherry-pick PR to 202305: #17960
ADO: 25853968
Why I did it
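Based on some research, some products might experience occasional I/O failures in the communication between the CPU and the SSD because of NCQ.
There seems to be a problem between some kernel versions and some SATA controllers.
Syslog error message examples:
Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors have already disabled NCQ on their platforms in SONiC due to a similar issue:
[Arista] Disable ATA NCQ for a few products #13739
[Arista] Disable SSD NCQ on DCS-7050CX3-32S #13964
There are also discussions on Debian/Ubuntu forums about similar issues, where disabling NCQ was suggested:
https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous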
Work item tracking
How I did it
Add a kernel parameter to tell libata to disable NCQ
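The exact parameter string is not quoted in this description. The kernel's documented libata option for this is `libata.force=`, whose comma-separated value list accepts `noncq`, optionally prefixed with a port ID such as `1.00:`. Treat that value as an assumption about what the change adds, not a quote from the diff. A minimal sketch for checking a kernel command line for it, where `has_noncq` is a hypothetical helper name:

```shell
#!/bin/sh
# has_noncq: succeed if the given kernel command line carries a
# libata.force= parameter whose value list includes "noncq"
# (bare, or port-scoped like "1.00:noncq").
# Assumption: the platform disables NCQ via libata.force; the PR text
# above does not quote the exact value.
has_noncq() {
    # $1: kernel command line string, e.g. the contents of /proc/cmdline
    printf '%s\n' "$1" \
        | tr ' ' '\n' \
        | grep -Eq '^libata\.force=([^,]*,)*([0-9.]+:)?noncq(,.*)?$'
}
```

On a booted switch this could be used as `has_noncq "$(cat /proc/cmdline)" && echo "NCQ disabled via libata.force"`. Assuming the SSD is `sda`, `cat /sys/block/sda/device/queue_depth` reporting 1 would be a second hint that NCQ is off.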
How to verify it
Use the fio tool:
fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
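The command as written has no job name or target file, which fio normally expects before it will start a job. The sketch below shows one way to complete it and then check the kernel log for the failure signatures listed in the description. The job name `ncq-test`, the `--size=1G` value, the `/host/fio-ncq-test` path, and the helper `count_ncq_errors` are all assumptions for illustration and are not part of the PR.

```shell
#!/bin/sh
# count_ncq_errors: read log text on stdin and print the number of lines
# that report a failed NCQ command (READ or WRITE FPDMA QUEUED).
count_ncq_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
    grep -cE 'failed command: (READ|WRITE) FPDMA QUEUED' || true
}

# Example run, left commented out so sourcing this file has no side effects.
# --name, --filename and --size are assumed values; point --filename at a
# path that lives on the SSD under test and remove the file afterwards.
# fio --name=ncq-test --filename=/host/fio-ncq-test --size=1G \
#     --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 \
#     --runtime=120 --numjobs=4
# dmesg | count_ncq_errors
```

A count of 0 after the run with NCQ disabled, compared with a non-zero count with NCQ enabled, would be consistent with the behaviour this PR describes.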
Test results with NCQ enabled:
Test results with NCQ disabled: