Decoding unicode string fails while running unit tests. #23466
Comments
Sorry for the delay - I will look at this soon.
Thanks for the detailed issue, and great that you found a solution! We are switching to not reading from a socket soon as part of another low-level architecture change, so this won't be an issue long term, but I'll add in your solution for the time being until that changes.
Here is another version which gives up after trying to read 3 additional bytes.
Thank you!
We are about to release our development environment, which exposes this issue to a broader set of developers. Is there anything you can tell us about the timing of the patch being included in a future update? Thanks again!
Hi! Here is the issue about it, which we will keep updating: #23279. I will pull in @karthiknadig since he is making this change and probably has a better sense of the timeline. Thanks!
Hello eleanorjboyd! #23279 appears to be the longer-term solution. I was hoping that in the meantime the patch I posted on June 13 could be included as an interim fix. Is that a possibility? Thank you!
Hi! Closing this, as #23279 is closed.
Type: Bug
Behaviour
Steps to reproduce:
You should see this error.
I believe the underlying issue is unicode characters in test function names, i.e. the circles in:
When running a large set of unit tests, each test name is passed to the pytest invocation on the client machine as a command-line argument. If there are enough tests to cross a 1024-byte boundary (or maybe it's 3000), there is a chance that PipeManager in socket_manager.py will perform a read that does not terminate on a Unicode code point boundary.
For example, suppose a multi-byte character such as ᐤ is encoded as two bytes, A and B. It is then possible that a 1024-byte read ends with only A, and that B arrives in a subsequent read. Decoding the buffer in this state results in the error above:
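The split described above can be reproduced in a few lines of plain Python (a standalone illustration, not the extension's actual code). ᐤ actually encodes to three bytes in UTF-8, which makes the demonstration easy; the standard library's incremental decoder also shows one way to cope with partial reads:

```python
import codecs

# "ᐤ" (U+1424) encodes to three bytes in UTF-8, so a fixed-size read can
# end partway through the character. A plain .decode() on that prefix fails.
data = "testᐤname".encode("utf-8")
head, tail = data[:5], data[5:]  # cut inside the 3-byte sequence

try:
    head.decode("utf-8")
except UnicodeDecodeError as exc:
    print("plain decode failed:", exc.reason)  # unexpected end of data

# An incremental decoder buffers the incomplete trailing bytes instead of
# raising, and emits the character once the rest of it arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(head) + decoder.decode(tail)
print(text)  # testᐤname
```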
I was able to confirm the above theory by patching socket_manager as follows:
If I were to open a PR for this, I would like some feedback on the approach. It relies on exception handling, which isn't ideal; reading one byte at a time seems inefficient; and a genuine UTF-8 encoding error could put this into an infinite loop. Are any of those issues severe enough to warrant improvements? (I think the last one is, for sure.)
Diagnostic data
Output for Python in the Output panel (View → Output, change the drop-down in the upper-right of the Output panel to Python)
Extension version: 2024.6.0
VS Code version: Code 1.89.1 (dc96b837cf6bb4af9cd736aa3af08cf8279f7685, 2024-05-07T05:13:33.891Z)
OS version: Windows_NT x64 10.0.22621
Modes:
Remote OS version: Linux x64 5.4.0-182-generic
python.languageServer setting: Default
User Settings
Installed Extensions
System Info
canvas_oop_rasterization: enabled_on
direct_rendering_display_compositor: disabled_off_ok
gpu_compositing: enabled
multiple_raster_threads: enabled_on
opengl: enabled_on
rasterization: enabled
raw_draw: disabled_off_ok
skia_graphite: disabled_off
video_decode: enabled
video_encode: enabled
vulkan: disabled_off
webgl: enabled
webgl2: enabled
webgpu: enabled
A/B Experiments