
Switch pytest traceback to native #1075

Merged
5 commits merged into rapidsai:branch-0.41 on Nov 4, 2024

Conversation

galipremsagar
Contributor

Description

In cudf and cuml we have observed roughly a 10% and 20% speedup, respectively, in pytest suite execution by switching the pytest traceback to --tb=native:

currently:

102474 passed, 2117 skipped, 902 xfailed in 892.16s (0:14:52)

--tb=short:

102474 passed, 2117 skipped, 902 xfailed in 898.99s (0:14:58)

--tb=no:

102474 passed, 2117 skipped, 902 xfailed in 815.98s (0:13:35)

--tb=native:

102474 passed, 2117 skipped, 902 xfailed in 820.92s (0:13:40)

This PR makes a similar change to the ucx-py repo.

xref: rapidsai/cudf#16851
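
The change itself is essentially a one-flag pytest configuration tweak; in a repo it would typically land in the pytest configuration file or the CI invocation. Purely as an illustration, here is a minimal conftest.py sketch that applies the same setting programmatically. It assumes pytest's internal option name tbstyle (how the --tb flag is stored) and is not the actual diff in this PR:

# conftest.py (illustrative sketch only, not the change made by this PR)
def pytest_configure(config):
    # --tb is stored under the option name "tbstyle"; only override it when
    # the user has not already chosen a style on the command line.
    if config.option.tbstyle == "auto":
        config.option.tbstyle = "native"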

@pentschev pentschev left a comment
Member

Similar to my Dask-CUDA comment, here are results for conda-python-tests matrix:

Run/[CUDA Version] 11.8.0 12.0.1 12.5.1
This change run #1 07:08 07:23 08:04
This change run #2 07:45 05:46 07:24
This change run #3 06:11 05:46 06:52
This change run #4 07:11 05:55 09:39
This change run #5 07:18 06:22 07:52
PR #1070 06:54 05:43 07:16
PR #1069 07:10 05:50 08:01
PR #1067 07:28 05:44 08:26
PR #1078 run #1 07:16 05:56 07:53
PR #1078 run #2 07:28 06:02 07:33

According to the results above, not only is there no benefit from the change, it actually looks slower. Based on those results alone, I'd say this is not a good fit here.

Do people have other thoughts? I've "requested changes" to block merging for now.

@vyasr
Contributor

vyasr commented Sep 25, 2024

AFAICT there are no xfailed tests in this repo, so I don't think the results are meaningful. I would suggest rerunning a few times. These results seem well within the noise threshold for this repository's test suite. You looked at #1070 and #1069, but consider the one before that #1067. In that case the times are 7:28, 8:26, and 5:44.

@pentschev
Member

AFAICT there are no xfailed tests in this repo, so I don't think the results are meaningful. I would suggest rerunning a few times. These results seem well within the noise threshold for this repository's test suite.

So if the results are not meaningful because there are no xfailed tests, yet they seem worse, what do you suggest we do? So far, I'm not seeing any potential benefits here.

You looked at #1070 and #1069, but consider the one before that #1067. In that case the times are 7:28, 8:26, and 5:44.

So are you saying those results are better or worse? If I sum this PR's results [1] and #1067's [2], this PR is still doing worse:

In [1]: 7 * 60 + 8 + 7 * 60 + 23 + 8 * 60 + 4
Out[1]: 1355

In [2]: 7 * 60 + 28 + 8 * 60 + 26 + 5 * 60 + 44
Out[2]: 1298

I forgot to link my Dask-CUDA comment previously, and for completeness I'm also looking for an answer to the following:

Do you have more information on exactly how this compares to --tb=auto? From the docs:

pytest --tb=auto    # (default) 'long' tracebacks for the first and last
                     # entry, but 'short' style for the other entries
pytest --tb=long    # exhaustive, informative traceback formatting
pytest --tb=short   # shorter traceback format

It seems that with auto we get an "exhaustive" traceback (not really sure what that means); is that the case with native too?

I agree the number of samples is low and that rerunning is the obvious next step; I'll have to wait and see the results before I'm convinced. But if the results aren't convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

@vyasr
Contributor

vyasr commented Sep 25, 2024

So are you saying those results are better or worse?

I was trying to add an extra column to the table that you had. Based on the numbers in your table, it looked like you were looking at the time for the "Python tests" step in the conda-python-tests job, so for instance this 11.8 job that shows 7:08 is the top-left entry in your table, right? Assuming that is correct, the corresponding job in #1067 is this one, which has a run time of 7:28. The sum is probably worse because the 12.0.1 job seems to have run much faster. I'm just not convinced that any of these numbers are signal; they all seem within the noise threshold to me.

if the results aren't convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

IMHO we aren't losing useful debugging information. I don't find the long tracebacks particularly helpful or readable relative to the default. That was also the opinion expressed by a few others when discussing this change. Of course, not everyone will agree on something like that so I think it makes more sense to make the decision based on the concrete objective improvements of timing.

So if the results are not meaningful because there are no xfailed tests, yet they seem worse, what do you suggest we do?

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests. We can't guarantee that ucx-py (or the other repos) will never use xfail markers, and as soon as they do we'll be paying a heavier price than we should by not having this setting. The data from the repositories that do currently use xfails is convincing IMHO. If you disagree and think that the results above are not meaningful, then feel free to either rerun to gather more data or close these PRs until there are xfailed tests that could be used to gather data (note that there is also some sort of GH outage that's been happening this morning that could impact numbers; if we want good numbers we probably should run locally in a controlled environment and not in CI). Realistically, though, I think we will forget to do that and we're better off merging now.

@galipremsagar
Contributor Author

if the results aren't convincing and there's still potential for us to lose useful debugging information, then I don't really see a reason to make this change.

This is how the information representation is going to change with native traceback:

With pytest traceback:
[screenshots of the default pytest traceback output, 2024-09-19]

With native traceback:
[screenshot of the --tb=native traceback output, 2024-09-19]

We basically stop letting pytest do all the pretty printing, coloring, and printing of the entire source file, and use the native traceback instead.
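
A quick way to reproduce the comparison locally, using a hypothetical throwaway test file:

# test_tb_demo.py -- a deliberately failing test, only for comparing traceback styles
def _inner():
    raise ValueError("boom")

def test_traceback_style():
    _inner()

Running pytest test_tb_demo.py (the default, --tb=auto) prints each frame with surrounding source lines and pretty-printing, whereas pytest test_tb_demo.py --tb=native prints the same frames in the standard-library "Traceback (most recent call last):" format, one File/line entry per frame.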

@pentschev
Member

We basically stop letting pytest do all the pretty printing, coloring, and printing of the entire source file, and use the native traceback instead.

Thanks @galipremsagar, but I actually saw those screenshots previously in rapidsai/cudf#16851 (comment). However, that doesn't confirm whether or not we lose information. As per the docs, the regular default is auto, which is a combination of long ("exhaustive") and short, and while I don't much mind losing the pretty printing, I do mind if we lose frames, or information from frames, by switching to native. Do you know whether it's documented anywhere that the contents are exactly the same, or maybe even more complete, with native than with auto?

@pentschev
Member

I've updated the runtimes for the most recent job in #1075 (review), and it still seems to take longer:

In [1]: (7 + 7 + 6) * 60 + 45 + 24 + 54
Out[1]: 1323

@pentschev
Member

I was trying to add an extra column to the table that you had. Based on the numbers in your table, it looked like you were looking at the time for the "Python tests" step in the conda-python-tests job, so for instance this 11.8 job that shows 7:08 is the top-left entry in your table, right? Assuming that is correct, the corresponding job in #1067 is this one, which has a run time of 7:28. The sum is probably worse because the 12.0.1 job seems to have run much faster. I'm just not convinced that any of these numbers are signal; they all seem within the noise threshold to me.

Sorry, I missed this comment yesterday. Yes, your interpretation is correct. When you say you're not convinced any of these numbers are a signal, that's fair, but what would the signal be then? I currently see no signal showing a benefit from this change; if there's any I'm missing, can you please point it out?

IMHO we aren't losing useful debugging information. I don't find the long tracebacks particularly helpful or readable relative to the default. That was also the opinion expressed by a few others when discussing this change.

I would normally tend to agree, at least for regular synchronous Python libraries. With async Python you normally get almost no useful information, and it's often very hard to reproduce the failures anywhere but CI. Because of that, I don't think we have the luxury of discarding any possible hint of what may be happening, so I'm against discarding anything, especially since there's no clear answer to what exactly we might be missing; so far it looks to me more like a gut feeling.

Of course, not everyone will agree on something like that so I think it makes more sense to make the decision based on the concrete objective improvements of timing.

I would have agreed with that, but this change has demonstrated no concrete timing improvements, so at this point it seems we're discussing the expected result rather than what we're actually observing. If there are any clear signs here or in Dask-CUDA that this has indeed improved performance, please point them out and I'll be happy that we improved the runtime; so far there have been no signs of that, and I would argue it looks like it has made things worse.

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests. We can't guarantee that ucx-py (or the other repos) will never use xfail markers, and as soon as they do we'll be paying a heavier price than we should by not having this setting. The data from the repositories that do currently use xfails is convincing IMHO. If you disagree and think that the results above are not meaningful, then feel free to either rerun to gather more data or close these PRs until there are xfailed tests that could be used to gather data (note that there is also some sort of GH outage that's been happening this morning that could impact numbers; if we want good numbers we probably should run locally in a controlled environment and not in CI). Realistically, though, I think we will forget to do that and we're better off merging now.

Well, if there were conclusive signs showing that the results were at least "as good", and that we're indeed not losing any potentially useful information (or simply hints, which is usually the best we can get for async), I would agree with merging. Right now what I'm hearing is "there's a promise it will be better, and a feeling that the information discarded is not useful", and what I observe is "this is looking worse, and I don't know whether anything that will be discarded will be missed". To me personally, as the primary person who has to deal with debugging in UCX-Py/Dask-CUDA, the results here look less like a benefit and more like a loss.

If anybody has the time to run those tests in a controlled environment and show that the results are indeed beneficial, I'd be happy to change my opinion.

@pentschev
Member

My suggestion is to merge because we know that there are measurable benefits when there are xfailed tests.

Also, when you say "there are measurable benefits when there are xfailed tests", I'm assuming this requires a large number of xfailed tests to show a benefit; is that right? If not, it should be pretty easy to add a couple of xfailed tests and clearly demonstrate the runtime benefit on a test set the size of UCX-Py/Dask-CUDA.
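
For reference, an xfailed test of the kind being discussed would look like the sketch below; the test and its reason are hypothetical, and the open question is how many of these a suite needs before the traceback style measurably affects runtime:

import pytest

@pytest.mark.xfail(reason="hypothetical known bug, kept in the suite on purpose")
def test_known_issue():
    # pytest still runs this, records the failure traceback, and reports it as xfailed
    assert 1 + 1 == 3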

@galipremsagar galipremsagar requested a review from a team as a code owner September 26, 2024 21:35
@galipremsagar
Contributor Author

galipremsagar commented Sep 26, 2024

@pentschev I agree that this change will not bring any improvement in runtimes due to the lack of xfail/xpass tests in ucx-py. However, this PR doesn't make the runtime worse either. The times in the table above are for the entire CI run, which also includes conda package installs, git checkouts, etc. that this PR doesn't influence. The following are the exact pytest execution times we need to look at:

Run/[CUDA Version] 11.8.0 12.0.1 12.5.1
This change run #1 02:23mins & 31.30s 02:08mins & 28.87s 02:40mins & 41.07s
This change run #2 02:29mins & 38.65s 02:11mins & 28.32s 02:24mins & 34.88s
PR #1070 02:22mins & 32.99s 02:09mins & 27.10s 02:24mins & 32.53s
PR #1069 02:27mins & 32.41s 02:12mins & 27.71s 02:38mins & 43.01s
PR #1067 02:31mins & 30.62s 02:10mins & 25.69s 02:26mins & 34.47s
PR #1078 run #1 02:23mins & 39.44s 02:11mins & 27.80s 02:30mins & 35.74s
PR #1078 run #2 02:25mins & 35.45s 02:12mins & 28.10s 02:32mins & 36.72s

As you can see, there is no change in the pytest times.

As for printing the stack trace, --tb=native will just print the errors with standard Python library formatting and the full stack trace. There is no official documentation showing the difference beyond local testing, and from local testing I see the traceback in #1075 (comment). If you think that, due to the nature of async execution, this will be a loss of information, I'm happy to close this and the dask-cuda PRs.

@vyasr
Contributor

vyasr commented Sep 27, 2024

I do think you need a nontrivial number of xfails to see a difference, yes. I am pretty much 100% sure that there is no negative performance impact from this change, but I do agree that debugging async code (and hard-to-reproduce bugs) can make more informative tracebacks much more important. The native traceback does not discard any frames; it simply removes the pretty-printing and the extra context (a native trace shows just the relevant code line, while the pytest trace includes many extra lines of context with pretty-printing). If that extra information being displayed in CI is helpful for debugging, then I think it's fine to close this PR, since it sounds like you also don't anticipate xfailing much.

@pentschev pentschev changed the base branch from branch-0.40 to branch-0.41 October 28, 2024 19:32
@pentschev
Member

Sorry I let this fall behind. Based on #1075 (comment) I'm less worried about the time; however, I'm still not 100% convinced this will not remove useful information from the errors that eventually pop up in CI. With that said, I'm OK with merging this, but I'll keep an eye on whether it removes too much information, and if I notice it does I will most likely want to revert it.

@pentschev
Member

/merge

@msarahan msarahan left a comment

@pentschev requested a review on Slack from packaging-codeowners. After reading the discussion above, it seems like Peter has decided to try it. My gut feeling is that without measurable benefit seen in benchmarks, the only other potential benefit would be keeping uniformity wrt this flag on other RAPIDS repos. I'm fine with Peter's "try it and see" approach, especially since he is the primary person who will be observing the results here.

@rapids-bot rapids-bot bot merged commit 07db627 into rapidsai:branch-0.41 Nov 4, 2024
39 checks passed