Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HttpStress, HTTP/3: content mismatch in POST Duplex Slow #76183

Closed
antonfirsov opened this issue Sep 26, 2022 · 12 comments
Closed

HttpStress, HTTP/3: content mismatch in POST Duplex Slow #76183

antonfirsov opened this issue Sep 26, 2022 · 12 comments
Labels
area-System.Net.Http bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.) tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Milestone

Comments

@antonfirsov
Copy link
Member

The following issue has appeared two times in the CI stress runs against main in the POST Duplex Slow operation, first noticed on 09/13. At this point I cannot exclude the possibility that the issue also affects the release/7.0 code.

https://dev.azure.com/dnceng-public/public/_build/results?buildId=14692&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=634

 System.Exception: Expected response content "X4Fzl4GA06N7jZWOoUNPatAQ8KM4YRNVQAQDvZ5f1VN5asSKJr5djlietpqRhBbYjsgZ8JSEOiN5V5pGr2lhMkleRlJ6jexptpLiBtB0OqE1Od8ozaZfQtCz627NzMVkkXBgj75yrkJWVkGocV694IP0cMc8lo01KUGqSS4sUbnfZfTLmJkUGMfS8A3MUwC3kIB1UQMAQEeTUjQxtGMS3VlxoohHOhNXOkvHUSrMxZ41iBxOk6TDakzEWDCgi4liq65ko9TQPzUURjFdqIWIE9welgkGFBdsnLBb6DZad2obFVmnyZpPYTUg", got "X4Fzl4GA06N7jZWOoUNPatAQ8KM4YRNVQAQDvZ5f1VN5asSKJr5djlietpqRhBbYjsgZ8JSEOiN5V5pGr2lhMkleRlJ6jexptpLiBtB0OqE1Od8ozaZfQtCz627NzMVkkXBgj75yrJWVkGocV694IP0cMc8lo01KUGqSS4sUbnfZfTLmJkUGMfS8A3MUwC3kIB1UQMAQEeTUjQxtGMS3VlxoohHOhNXOkvHUSrMxZ41iBxOk6TDakzEWDCgi4liq65ko9TQPzUURjFdqIWIE9welgkGFBdsnLBb6DZad2obFVmnyZpPYTUg".
client_1  |  Diverging at index 137. server checksum does not match client value.
client_1  |    at HttpStress.ClientOperations.ValidateContent(String expectedContent, String actualContent, String details) in /app/ClientOperations.cs:line 494
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_8>d.MoveNext() in /app/ClientOperations.cs:line 384
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 224
client_1  | 
client_1  | 	 8: POST Duplex Slow          Fail: 1	Timestamps: 14:43:07.6682381, Duration: 00:00:00.0612812, Cancelled: False
client_1  | 	    TOTAL                     Fail: 1

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 26, 2022
@ghost
Copy link

ghost commented Sep 26, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

The following issue has appeared two times in the CI stress runs against main in the POST Duplex Slow operation, first noticed on 09/13. At this point I cannot exclude the possibility that the issue also affects the release/7.0 code.

https://dev.azure.com/dnceng-public/public/_build/results?buildId=14692&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=1451f5f3-0108-5a08-5b92-e984b2a85bbd&l=634

 System.Exception: Expected response content "X4Fzl4GA06N7jZWOoUNPatAQ8KM4YRNVQAQDvZ5f1VN5asSKJr5djlietpqRhBbYjsgZ8JSEOiN5V5pGr2lhMkleRlJ6jexptpLiBtB0OqE1Od8ozaZfQtCz627NzMVkkXBgj75yrkJWVkGocV694IP0cMc8lo01KUGqSS4sUbnfZfTLmJkUGMfS8A3MUwC3kIB1UQMAQEeTUjQxtGMS3VlxoohHOhNXOkvHUSrMxZ41iBxOk6TDakzEWDCgi4liq65ko9TQPzUURjFdqIWIE9welgkGFBdsnLBb6DZad2obFVmnyZpPYTUg", got "X4Fzl4GA06N7jZWOoUNPatAQ8KM4YRNVQAQDvZ5f1VN5asSKJr5djlietpqRhBbYjsgZ8JSEOiN5V5pGr2lhMkleRlJ6jexptpLiBtB0OqE1Od8ozaZfQtCz627NzMVkkXBgj75yrJWVkGocV694IP0cMc8lo01KUGqSS4sUbnfZfTLmJkUGMfS8A3MUwC3kIB1UQMAQEeTUjQxtGMS3VlxoohHOhNXOkvHUSrMxZ41iBxOk6TDakzEWDCgi4liq65ko9TQPzUURjFdqIWIE9welgkGFBdsnLBb6DZad2obFVmnyZpPYTUg".
client_1  |  Diverging at index 137. server checksum does not match client value.
client_1  |    at HttpStress.ClientOperations.ValidateContent(String expectedContent, String actualContent, String details) in /app/ClientOperations.cs:line 494
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_8>d.MoveNext() in /app/ClientOperations.cs:line 384
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 224
client_1  | 
client_1  | 	 8: POST Duplex Slow          Fail: 1	Timestamps: 14:43:07.6682381, Duration: 00:00:00.0612812, Cancelled: False
client_1  | 	    TOTAL                     Fail: 1

Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Http

Milestone: -

@karelz karelz added this to the 8.0.0 milestone Sep 27, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 27, 2022
@karelz karelz added bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.) untriaged New issue has not been triaged by the area owner and removed untriaged New issue has not been triaged by the area owner labels Sep 27, 2022
@karelz
Copy link
Member

karelz commented Sep 27, 2022

Triage: To our best knowledge, it should be addressed in main and latest 7.0 by copying buffer in HTTP/3. We should not see it.
It is low frequency, so let's keep it in 8.0 - once we understand the impact there, we can consider servicing if it is impactful.

@ManickaP
Copy link
Member

The content is diverging at 137 position of 312, so in the middle. And it's diverging in one character. Also the test is sending the data one byte at a time, so one of the bytes got lost somewhere. This might be a bug in our code as well as in msquic. We should try to reproduce locally and get logs.

@ManickaP
Copy link
Member

Another hit on Windows in https://dev.azure.com/dnceng-public/public/_build/results?buildId=254920&view=logs&j=0bc77094-9fcd-5c38-f6e4-27d2ae131589&t=4fd93899-d50b-5dc1-f9eb-66ab1b883c22:

httpstress-client-1  | Failure Type 1/1:
httpstress-client-1  | System.Exception: Expected response content "VIGhAOYBM7psBozVtNFKJzX9QQvf5xVCSeeF6tfVxSd6k3eySIX6nWJWl7GkSs5OccPXAjPzijMqMQONlxDnQPPvK8Uzdv35MEU7d7oitXO49T9KityVfIyCDUC1FQMmfHT8wHke8Uts1Llrz9HkwYQ3FHu9unTTBbOsz0NQ9KMAP5BsiL159bq7qPsfNajK0QgQwtrymqbcBFe962RYlOJZlG1a7i3Ya19yPlvoU3II0EwaljZK9AZc66ffrHq4m2196fkK6c4i51ZWKRpO8iPPSXsI6FxCCEWyCmU5yEEW6rW5X8LQSSv4dbhZnSEqerVrVbKYwkQuUsJb1mkyuXkkJRKb5vHVzNsELeawnnTi11ZlH6X1iGw77lY3GiS1Tuu6z4GMCnA732iIsYJzNFnjRl7Z6Wg4KSF66TlCqBou6o343FBiPTdD7AeExgFf75mbnbBOm69BQsdch2WrLyfqp1zwpWF9w8n12SHR4mvIKRmEX1XuAOK9PvmlcFZGrzFH5baFl5mY3MWNFkOnzKzXodb6gVopDfRKBU9vRH94bl62POWrrmGW0hTR0UZFGxLOvBBkbSMhBuw3XYgHzxyROuKU1hvwitoeQi7DDlFOQY", got "VIGhAOYBM7psBozVtNFKJzX9QQvf5xVCSeeF6tfVxSd6k3eySIX6nWJWl7GkSs5OccPXAjPzijMqMQONlxDnQPPvK8Uzdv35MEU7d7oitXO49T9KityVfIyCDUC1FQMmfHT8wHke8Uts1Llrz9HkwYQ3FHu9unTTBbOsz0NQ9KMAP5BsiL159bq7qPs;NajK0QgQwtrymqbcBFe962RYlOJZlG1a7i3Ya19yPlvoU3II0EwaljZK9AZc66ffrHq4m2196fkK6c4i51ZWKRpO8iPPSXsI6FxCCEWyCmU5yEEW6rW5X8LQSSv4dbhZnSEqerVrVbKYwkQuUsJb1mkyuXkkJRKb5vHVzNsELeawnnTi11ZlH6X1iGw77lY3GiS1Tuu6z4GMCnA732iIsYJzNFnjRl7Z6Wg4KSF66TlCqBou6o343FBiPTdD7AeExgFf75mbnbBOm69BQsdch2WrLyfqp1zwpWF9w8n12SHR4mvIKRmEX1XuAOK9PvmlcFZGrzFH5baFl5mY3MWNFkOnzKzXodb6gVopDfRKBU9vRH94bl62POWrrmGW0hTR0UZFGxLOvBBkbSMhBuw3XYgHzxyROuKU1hvwitoeQi7DDlFOQY".
httpstress-client-1  |  Diverging at index 187. server checksum does not match client value.
httpstress-client-1  |    at HttpStress.ClientOperations.ValidateContent(String expectedContent, String actualContent, String details) in C:\app\ClientOperations.cs:line 494
httpstress-client-1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_8>d.MoveNext() in C:\app\ClientOperations.cs:line 377
httpstress-client-1  | --- End of stack trace from previous location ---
httpstress-client-1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in C:\app\StressClient.cs:line 204
httpstress-client-1  | 
httpstress-client-1  | 	 8: POST Duplex Slow          Fail: 1	Timestamps: 19:39:37.3417219, Duration: 00:00:00.0690585, Cancelled: False
httpstress-client-1  | 	    TOTAL                     Fail: 1

In this case, we're not missing any bytes, but ; is received instead of f.

@antonfirsov
Copy link
Member Author

I see a very similar failure in 5: POST, I assume the root cause is the same.

System.Exception: Expected response content "rt4jEAlVco7skxdOOP3EE1isBsYbSxAbrUl5wKvBP99qSXdN1JwMmlTpACXMc92feV1GCXS3Uro4srPdFAZ80op50wuknZYF6UCwVMBTrsruwOwJmt6SZueqgwFGciGvIZyOfT7sz5G7wu9czbxtS6fBpMGZ6ur9cwbMQYeJlESTeXtmmdG1kMSCtmYICXl4E6U0ImBRxd7EeBXISTnTdpPFTPTezL6DSOFX2tT0Sp36FEYhFGqjkIQOhFWsJXScXulFwzGYzn0omI0ErlOjUNTAxqqmNbhuNQZKSIkFzIjxHccOEH0X8i89RbI68TBO56RMjflZ08jCeVsKESY5b9meIPtrze7W6LrBBFfYVQWzzvVvjxsgv5nwJm4KmuaM1D5Jdh9ynzZAdmfMty81lRa0XvrOBLmbqAylqCH1ottvWcIX7hQomZaac4lipkmRCyKSotWhg6DrYeUvcJHTuyoHZO9lRZB37RjYIpjDBi7gLi1xap5WTIXE93tQklG6u9k6vwdma3IBq8vqMBBbTfZouVPEwITPS0lCyv1eCIvQ0LN0823YZecEZwCQXOqHasipNisR9XOBk9nY7O6tFuqoOWc1ib6oSdq7cXdPAcO2FRUQ8dI8Hw2f2uRq6qiYQai7JgwalhicWnYw2Cs4IyWWaB4h5nJJuqcfjM6RhiNZDQ8FMVnuyeFR3CW8c", got "9t4jEAlVco7skxdOOP3EE1isBsYbSxAbrUl5wKvBP99qSXdN1JwMmlTpACXMc92feV1GCXS3Uro4srPdFAZ80op50wuknZYF6UCwVMBTrsruwOwJmt6SZueqgwFGciGvIZyOfT7sz5G7wu9czbxtS6fBpMGZ6ur9cwbMQYeJlESTeXtmmdG1kMSCtmYICXl4E6U0ImBRxd7EeBXISTnTdpPFTPTezL6DSOFX2tT0Sp36FEYhFGqjkIQOhFWsJXScXulFwzGYzn0omI0ErlOjUNTAxqqmNbhuNQZKSIkFzIjxHccOEH0X8i89RbI68TBO56RMjflZ08jCeVsKESY5b9meIPtrze7W6LrBBFfYVQWzzvVvjxsgv5nwJm4KmuaM1D5Jdh9ynzZAdmfMty81lRa0XvrOBLmbqAylqCH1ottvWcIX7hQomZaac4lipkmRCyKSotWhg6DrYeUvcJHTuyoHZO9lRZB37RjYIpjDBi7gLi1xap5WTIXE93tQklG6u9k6vwdma3IBq8vqMBBbTfZouVPEwITPS0lCyv1eCIvQ0LN0823YZecEZwCQXOqHasipNisR9XOBk9nY7O6tFuqoOWc1ib6oSdq7cXdPAcO2FRUQ8dI8Hw2f2uRq6qiYQai7JgwalhicWnYw2Cs4IyWWaB4h5nJJuqcfjM6RhiNZDQ8FMVnuyeFR3CW8c".
httpstress-client-1  |  Diverging at index 0. server checksum mismatch

@ManickaP ManickaP removed their assignment May 25, 2023
@karelz
Copy link
Member

karelz commented Jul 18, 2023

Likely related to #87559

Not a regression against 7.0 - puntable if needed.

@karelz
Copy link
Member

karelz commented Aug 14, 2023

Not much has changed, moving to 9.0 due to limited remaining time. We should continue working on it to have a fix handy for backport if needed.

@karelz karelz modified the milestones: 8.0.0, 9.0.0 Aug 14, 2023
@karelz karelz assigned ManickaP and unassigned CarnaViire Aug 15, 2023
@ManickaP
Copy link
Member

Tentatively closing as this might be fixed by #90253. I ran H/3 stress for 3+ hours on my slow VM, which was reproducing this problem from time to time.
If this shows up again, we can re-open.
This is the same root cause as #87559.

@ManickaP
Copy link
Member

See #42211 (comment)

@ManickaP ManickaP reopened this Oct 20, 2023
@ManickaP ManickaP removed their assignment Oct 20, 2023
@antonfirsov
Copy link
Member Author

antonfirsov commented Dec 7, 2023

A new failure type appeared on main that is most likely caused by the same root cause. (I'm not 100% sure, but I don't find it worthy to open a new issue at this point.)

System.Exception: Expected response content "--test_boundary
Content-Type: text/plain; charset=utf-8

hGntRf0SNTEJYUK0em1CH08Jmf2Y89eIVoTGOchHthgOlzS4VtulpYRh7U6mcRJDpvnk04b15jIZJjXnCy2EEsSwMm8uVueSOTZpBh3RIsf5GsNOwIaeO4ywSqOEwhVA4qAqAicUubwMDNgTOkxDK03JUcMKJoUaMU0PhWDEg5jm8cj59uqCjIq4nrMXLHB67LziMHlrWjWEGGBstGBOyIr3ewcqAgsPudfmd0js9zfzmbnWfUd46m8oUv3Pfvgqr6KdbOFsXIFydsH396Ag2nJ7VqIhxKDDIp8gH9uGFkXnX9Bbe0GYk0n3mC8l6XvC51gnkalzCtRHGWZMeg3w4eTI14t419AhpKiWSNr3EGMqYVmGcl6PTy8QYOlfhhx2jjSvn3tGo0inIcjWHcRG2kErBIx4bCMXIJ7o3rTZrpaTGL8veFN4Fh0yVbjV6pfcdAUyVvSQ9sOD9laqA2QLoIHbJ8JS7xS2Ayp5q9PjnxuNJmok0sL4PJoLttfNc0ICPL3BvU00chYqNtHWJYREcOdr9TCxT26meixyjcoVsKShkxyzb833Y4Z1hB2SI2WZalCfvJmgjvLHwv6CVCLKy3Ea8wGYaMDqns7ylI4IMNJciI4l4px6hBaWXbWy2E6NiVT3o8RC83ybjbrqsUTkP68bIbQ9m5F7HwtDo1xwOkxeDzmwkg6XlLxcMrlChHP3nqKnhe1rWXLe9UOJoUO5yhM
--test_boundary--
", got "--test_boundary
Content-Type: text/plain; charset=utf-8

�GntRf0SNTEJYUK0em1CH08Jmf2Y89eIVoTGOchHthgOlzS4VtulpYRh7U6mcRJDpvnk04b15jIZJjXnCy2EEsSwMm8uVueSOTZpBh3RIsf5GsNOwIaeO4ywSqOEwhVA4qAqAicUubwMDNgTOkxDK03JUcMKJoUaMU0PhWDEg5jm8cj59uqCjIq4nrMXLHB67LziMHlrWjWEGGBstGBOyIr3ewcqAgsPudfmd0js9zfzmbnWfUd46m8oUv3Pfvgqr6KdbOFsXIFydsH396Ag2nJ7VqIhxKDDIp8gH9uGFkXnX9Bbe0GYk0n3mC8l6XvC51gnkalzCtRHGWZMeg3w4eTI14t419AhpKiWSNr3EGMqYVmGcl6PTy8QYOlfhhx2jjSvn3tGo0inIcjWHcRG2kErBIx4bCMXIJ7o3rTZrpaTGL8veFN4Fh0yVbjV6pfcdAUyVvSQ9sOD9laqA2QLoIHbJ8JS7xS2Ayp5q9PjnxuNJmok0sL4PJoLttfNc0ICPL3BvU00chYqNtHWJYREcOdr9TCxT26meixyjcoVsKShkxyzb833Y4Z1hB2SI2WZalCfvJmgjvLHwv6CVCLKy3Ea8wGYaMDqns7ylI4IMNJciI4l4px6hBaWXbWy2E6NiVT3o8RC83ybjbrqsUTkP68bIbQ9m5F7HwtDo1xwOkxeDzmwkg6XlLxcMrlChHP3nqKnhe1rWXLe9UOJoUO5yhM
--test_boundary--
".
 Diverging at index 60. server checksum mismatch
   at HttpStress.ClientOperations.ValidateContent(String expectedContent, String actualContent, String details) in /app/ClientOperations.cs:line 494
   at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_6>d.MoveNext() in /app/ClientOperations.cs:line 342
--- End of stack trace from previous location ---
   at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 204

     6: POST Multipart Data       Fail: 1   Timestamps: 13:50:33.9887448, Duration: 00:00:00.0070846, Cancelled: False
        TOTAL                     Fail: 1

https://dev.azure.com/dnceng-public/public/_build/results?buildId=480929&view=logs&j=22fea640-1099-5f32-ec5d-316ad83f359a&t=a923b061-8445-5c16-cd6e-848990587cb0
https://dev.azure.com/dnceng-public/public/_build/results?buildId=487816&view=logs&j=22fea640-1099-5f32-ec5d-316ad83f359a&t=a923b061-8445-5c16-cd6e-848990587cb0

@antonfirsov
Copy link
Member Author

@rzikm @CarnaViire @ManickaP there is a drastical increase in hits of this issue in recent runs, most likely as a result of switching to DEBUG msquic in #100023. The good news is that it now should be easier to reproduce, since every Linux CI run hits it 200-300 times. I would love to see this prioritized.

@ManickaP
Copy link
Member

ManickaP commented Apr 2, 2024

This is pending on microsoft/msquic#4215. Let's wait a bit if it gets in soon, before we decide to switch to one of the released branches of msquic.

@karelz karelz added the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label Jun 26, 2024
@rzikm rzikm closed this as completed Jul 10, 2024
@rzikm rzikm reopened this Jul 11, 2024
@rzikm rzikm closed this as completed Oct 14, 2024
@rzikm rzikm removed their assignment Oct 14, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Nov 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Http bug tenet-reliability Reliability/stability related issue (stress, load problems, etc.) tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Projects
None yet
Development

No branches or pull requests

5 participants