Switch pytest traceback to native
#1075
Conversation
Similar to my Dask-CUDA comment, here are results for conda-python-tests
matrix:
| Run / CUDA Version | 11.8.0 | 12.0.1 | 12.5.1 |
|---|---|---|---|
| This change run #1 | 07:08 | 07:23 | 08:04 |
| This change run #2 | 07:45 | 05:46 | 07:24 |
| This change run #3 | 06:11 | 05:46 | 06:52 |
| This change run #4 | 07:11 | 05:55 | 09:39 |
| This change run #5 | 07:18 | 06:22 | 07:52 |
| PR #1070 | 06:54 | 05:43 | 07:16 |
| PR #1069 | 07:10 | 05:50 | 08:01 |
| PR #1067 | 07:28 | 05:44 | 08:26 |
| PR #1078 run #1 | 07:16 | 05:56 | 07:53 |
| PR #1078 run #2 | 07:28 | 06:02 | 07:33 |
According to the results above, there's not only no benefit to the change; it actually looks slower. Based on those results alone, I'd say this is not a good fit here.
Do people have other thoughts? I've "requested changes" to block merging for now.
AFAICT there are no xfailed tests in this repo, so I don't think the results are meaningful. These results seem well within the noise threshold for this repository's test suite, so I would suggest rerunning a few times. You looked at #1070 and #1069, but consider the one before that, #1067: in that case the times are 7:28, 8:26, and 5:44.
So if the results are not meaningful because there are no xfailed tests, yet they seem worse, what is your suggestion we do? So far, I'm not seeing any potential benefits here.
So are you saying those results are better or worse? If I sum this PR's results [1] and #1067's [2], this is still doing worse:
I forgot to link my Dask-CUDA comment previously, and for completeness I'm also looking for an answer to the following:
I agree the number of samples is low and rerunning is the obvious next step; I'll have to wait and see the results before I'm convinced. But if the results aren't convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to add this change.
I was trying to add an extra column to the table that you had. Based on the numbers in your table, it looked like you were looking at the time for the "Python tests" step.
IMHO we aren't losing useful debugging information. I don't find the long tracebacks particularly helpful or readable relative to the default. That was also the opinion expressed by a few others when discussing this change. Of course, not everyone will agree on something like that so I think it makes more sense to make the decision based on the concrete objective improvements of timing.
My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests. We can't guarantee that ucx-py (or the other repos) will never use xfail markers, and as soon as they do we'll be paying a heavier price than we should by not having this setting. The data from the repositories that do currently use xfails is convincing IMHO. If you disagree and think that the results above are not meaningful, then feel free to either rerun to gather more data or close these PRs until there are xfailed tests that could be used to gather data (note that there is also some sort of GH outage that's been happening this morning that could impact numbers; if we want good numbers we probably should run locally in a controlled environment and not in CI). Realistically, though, I think we will forget to do that and we're better off merging now.
This is how the information representation is going to change with `--tb=native`: we basically stop letting pytest do all the pretty-printing, coloring, and printing of the entire source file, and instead use the native Python traceback.
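To make the difference concrete, here is a minimal sketch of what the "native" style is: just the interpreter's own traceback, with none of pytest's source-context decoration. The file path and contents are illustrative, not from this PR.

```shell
# Write a small script that raises, then show Python's own (native) traceback.
cat > /tmp/tb_demo.py <<'EOF'
def f():
    raise ValueError("boom")

f()
EOF
# The last line of the native traceback is the exception itself.
python3 /tmp/tb_demo.py 2>&1 | tail -n 1   # prints: ValueError: boom
```

With `--tb=native`, pytest emits this standard-library format instead of its long, source-annotated style.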
Thanks @galipremsagar, but I actually saw those screenshots previously in rapidsai/cudf#16851 (comment). However, that doesn't confirm whether we lose information or not. As per the docs, the regular default is `--tb=auto`.
I've updated the runtimes for the most recent job in #1075 (review), and it still seems to take longer:
Sorry, I missed this comment yesterday. Yes, your interpretation is correct. When you say you're not convinced any of these numbers are a signal, that's fair, but what would the signal then be? I currently see no signals that show a benefit from this change; if there are any I'm missing, could you please point them out?
I would normally tend to agree, at least for regular synchronous Python libraries. With async Python you normally have almost no useful information, and it's often very hard to reproduce failures anywhere but CI. Because of that, I don't think we have the luxury of discarding any possible hint of what may be happening, so I'm against discarding anything, especially since there's no clear answer as to what exactly we might be missing; so far it looks to me more like a gut feeling.
I would have agreed with that, but this change has demonstrated no concrete timing improvements, so at this point it seems we're just discussing the expected result rather than what we are actually observing. If there are any clear signs here or in Dask-CUDA that this has indeed improved performance, please point them out and I'll be happy that we improved runtime, but so far there have been no signs of that, and I would argue it looks like it has made things worse.
Well, if there were conclusive signs showing that the results were at least "as good", and that we're indeed not losing any potentially useful information (or simply hints, which is usually the best we can get for async), I would agree with merging. Right now what I'm hearing is "there's a promise it will be better, and a feeling that the discarded information is not useful", while what I observe is "this is looking worse, and I don't know if anything that will be discarded will be missed". To me personally, as the primary person who has to deal with debugging in UCX-Py/Dask-CUDA, the results here look like a loss rather than a benefit. If anybody has the time to run those tests in a controlled environment and show that the results are indeed beneficial, I'd be happy to change my opinion.
Also, when you say "there are measurable benefits when there are xfailed tests", I'm assuming this requires a large number of xfailed tests to benefit, is that right? If not, it should be pretty easy to add a couple of xfailed tests and clearly demonstrate the runtime benefit on a test set the size of UCX-Py/Dask-CUDA.
@pentschev I agree that this change will not have any improvement in runtimes due to the lack of xfailed/xpassed tests in `ucx-py`.
As you can see, there is no change in pytest times. As for printing the stack trace, the screenshots above show what changes.
I do think you need a nontrivial number of xfails to see a difference, yes. I am pretty much 100% sure that there is no negative performance impact from this change, but I do agree that debugging async code (and hard-to-reproduce bugs) can make more informative tracebacks much more important. The native traceback does not discard any frames, it simply removes the pretty-printing and the extra context (normal traces just show the relevant code line, the pytest trace includes many extra lines of context with pretty-printing). If that extra information being displayed in CI is helpful to debug then I think it's fine to close this PR since it sounds like you also don't anticipate xfailing much.
Sorry I let this fall behind. Based on #1075 (comment) I'm less worried about the time; however, I'm still not 100% convinced this will not remove useful information from the errors that eventually pop up in CI. With that said, I'm OK merging this, but I'll keep an eye on whether it removes too much information, and if I notice it does I will most likely want to revert it.
/merge
@pentschev requested a review on Slack from packaging-codeowners. After reading the discussion above, it seems like Peter has decided to try it. My gut feeling is that without measurable benefit seen in benchmarks, the only other potential benefit would be keeping uniformity wrt this flag on other RAPIDS repos. I'm fine with Peter's "try it and see" approach, especially since he is the primary person who will be observing the results here.
Description
In cudf & cuml we have observed a ~10% and ~20% (respectively) speedup of pytest suite execution by switching the pytest traceback to `--native`. This PR makes a similar change to the `ucx-py` repo.

xref: rapidsai/cudf#16851
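The change itself amounts to a single pytest flag. A minimal configuration sketch (hypothetical; ucx-py's actual setting may live in `setup.cfg`, `pyproject.toml`, or directly in the CI pytest invocation):

```ini
; Hypothetical pytest configuration sketch; the real file and section may differ.
[tool:pytest]
addopts = --tb=native
```

Equivalently, the flag can be passed on the command line as `pytest --tb=native`.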