Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: remove flakiness on macOS test #56971

Closed
wants to merge 2 commits into from

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Feb 9, 2025

Locally I could reproduce this on macOS M4 machine using tools/test.py --repeat 10000 test/parallel/test-net-write-fully-async-hex-string.js after this change I couldn't reproduce it.

Ref: https://github.com/nodejs/node/actions/runs/13220154720/job/36904132463?pr=56970

@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Feb 9, 2025
Copy link

codecov bot commented Feb 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.12%. Comparing base (b181535) to head (a712b4f).
Report is 27 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #56971      +/-   ##
==========================================
- Coverage   89.13%   89.12%   -0.01%     
==========================================
  Files         665      665              
  Lines      193165   193179      +14     
  Branches    37191    37201      +10     
==========================================
- Hits       172181   172177       -4     
- Misses      13729    13739      +10     
- Partials     7255     7263       +8     

see 41 files with indirect coverage changes

@lpinca
Copy link
Member

lpinca commented Feb 9, 2025

I don't know if this invalidates the original intention of the test but it seems to also fix the issue on my machine. Can you do the same also for test-net-write-fully-async-buffer.js (see #52959 (comment))?

This might also help with #54918 because before this change when the test timed out the issue was the same:

$ lldb -p 53757
(lldb) process attach --pid 53757
Process 53757 stopped
* thread #1, name = 'MainThread', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff80d888aaa libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7ff80d888aaa <+10>: jae    0x7ff80d888ab4 ; <+20>
    0x7ff80d888aac <+12>: movq   %rax, %rdi
    0x7ff80d888aaf <+15>: jmp    0x7ff80d886662 ; cerror_nocancel
    0x7ff80d888ab4 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64-apple-macosx-.
(lldb) bt
* thread #1, name = 'MainThread', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff80d888aaa libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff80d8c77a8 libsystem_pthread.dylib`_pthread_cond_wait + 1193
    frame #2: 0x00000001067b37a3 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001058ccfdb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007fba92404cb8, mutex=0x00007fba92404c48) at node_mutex.h:175:5 [opt]
    frame #4: 0x00000001058ccfc1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007fba92404cb8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x00000001058ccfc1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007fba92404c48) at node_platform.cc:641:20 [opt]
    frame #6: 0x00000001058ccfab node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007fba92404c48) at node_platform.cc:215:25 [opt]
    frame #7: 0x00000001058ccfab node`node::NodePlatform::DrainTasks(this=0x0000600000c98000, isolate=<unavailable>) at node_platform.cc:468:33 [opt]
    frame #8: 0x0000000105732a60 node`node::SpinEventLoopInternal(env=0x00007fba95833600) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x0000000105891e33 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007fba95833600) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x0000000105891d99 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x00000001057f9d46 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1489:24 [opt]
    frame #12: 0x00000001057f9ba2 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1496:27 [opt]
    frame #13: 0x00007ff80d5382cd dyld`start + 1805

@lpinca
Copy link
Member

lpinca commented Feb 9, 2025

It is already known that the main thread gets locked by the GC thread during DrainTasks (see #56827) so I think we should not do this as it is only a manual way to work around #54918, not a fix. I don't think it is feasible or correct to do this for all tests that fail due to #56827.

If the flakiness is unbearable I think it is better to revert #56365.

lpinca
lpinca previously requested changes Feb 9, 2025
Copy link
Member

@lpinca lpinca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anonrig
Copy link
Member Author

anonrig commented Feb 9, 2025

See #56971 (comment)

The goal is to make sure the drain is called on GC, and this doesn't break it and actually make sure that we have a major GC call. I don't think this changes the test, and regardless of fixing the flakiness, it actually makes it more correct IMHO. Since it is not the type of GC we validate but the side-effect it causes.

Can we find a middle ground and land this PR as it is? @lpinca I don't want to make it a flaky test and make it non-flaky upon the fix of the deadlock bug. I'd like to make the test more stable regardless of our internal bugs.

@lpinca
Copy link
Member

lpinca commented Feb 9, 2025

I've dismissed my "request changes". For consistency the same change should be applied to test-net-write-fully-async-buffer.js. I can't comment on the correctness as I don't know why type: minor was used.

This test is currently one of the best to reproduce #54918. I'll have to revert this change locally to see if #54918 persists when a fix is found.

@anonrig
Copy link
Member Author

anonrig commented Feb 9, 2025

For consistency the same change should be applied to test-net-write-fully-async-buffer.js.

Sounds like a valid request to me: #56981

@anonrig anonrig requested a review from jasnell February 9, 2025 19:31
@lpinca
Copy link
Member

lpinca commented Feb 9, 2025

Why not adding another commit here or changing both in the same commit? In this way the discussion about the change is in the same place for both tests.

@anonrig anonrig requested a review from lpinca February 9, 2025 19:36
@anonrig anonrig added the commit-queue-rebase Add this label to allow the Commit Queue to land a PR in several commits. label Feb 9, 2025
@lpinca
Copy link
Member

lpinca commented Feb 9, 2025

gc({ type: 'minor' }) was added in f5ce6b1.

cc: @tniessen

@anonrig anonrig added the review wanted PRs that need reviews. label Feb 9, 2025
@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@anonrig
Copy link
Member Author

anonrig commented Feb 11, 2025

@nodejs/tsc couldn't find any reviewers for this. would you mind reviewing to NOT wait for 4 more days?

@anonrig anonrig added author ready PRs that have at least one approval, no pending requests for changes, and a CI started. labels Feb 11, 2025
@marco-ippolito marco-ippolito added the commit-queue Add this label to land a pull request using GitHub Actions. label Feb 11, 2025
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Feb 11, 2025
@nodejs-github-bot
Copy link
Collaborator

Landed in 5dafb48...66549b4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. commit-queue-rebase Add this label to allow the Commit Queue to land a PR in several commits. needs-ci PRs that need a full CI run. review wanted PRs that need reviews. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants