Add --range option to ruff format #9733
Was "byte offset" an option, or is the LSP not able to provide that? |
I considered it but decided against it because I want to provide users with a safe interface without thinking about encodings. Byte offsets would mean that we, in addition to out of bound indices, would also need to error on inputs that fall between character boundaries. |
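To illustrate the concern about byte offsets that fall between character boundaries, here is a minimal Python sketch. The helper name `is_char_boundary` is hypothetical and this is not how Ruff validates input; it only assumes the source is UTF-8 encoded.

```python
def is_char_boundary(source: str, byte_offset: int) -> bool:
    """Return True if byte_offset lands on a UTF-8 character boundary.

    Hypothetical helper for illustration only; assumes UTF-8 source text.
    """
    encoded = source.encode("utf-8")
    if not 0 <= byte_offset <= len(encoded):
        raise IndexError(f"byte offset {byte_offset} is out of bounds")
    if byte_offset == len(encoded):
        return True
    # UTF-8 continuation bytes always have the bit pattern 0b10xxxxxx.
    return (encoded[byte_offset] & 0xC0) != 0x80


source = 'name = "é"'
# "é" occupies two bytes (offsets 8 and 9), so offset 9 splits the character.
print(is_char_boundary(source, 8))  # True
print(is_char_boundary(source, 9))  # False
```

With codepoint offsets the second case cannot occur, which leaves out-of-bounds indices as the only invalid input.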
I would still bias toward byte offsets I think. Or perhaps even better, provide a way for a user to enter either byte offsets or character offsets. Worrying about encoding is a good point, but I imagine the vast majority of all Python source files are UTF-8. (Is it possible for Python source files to be something other than UTF-8? I'm actually not sure.) Separately from that, could we use the word "codepoint" instead of "character" here? The former has a more concrete and unambiguous definition. The downside I suppose is that "character" is probably a more accessible term. Still, using the word "codepoint" will be an extra clear sign-post that the input is not byte offsets. |
@BurntSushi what's your reasoning for biasing towards byte offsets? The only reason we use byte offsets internally is because they're convenient (and fast) to slice strings. I don't see performance being a key motivator for this use case. Lexing, parsing, and all the IO are so dominant that the offset conversion won't matter. Allowing different formats is intriguing, but I think I would then allow either row:col or code point offsets. |
Ah no, it's definitely not about performance. I suppose it's more about what is easy for users to actually get. Users are unlikely to be counting characters or bytes to figure out the inputs to these flags. So they'll probably get them from elsewhere. And I feel like usually what you get are byte offsets, especially when dealing with files on Unix operating systems. For example, grep has a This is also why I suggested offering multiple ways to provide the range. Even if you switched to byte offsets, in the case of a user with char offsets, converting to byte offsets will be pretty annoying. Similarly, if you have byte offsets but need to provide char offsets, that's annoying too. For the same reason, I'd also advocate supporting line/column inputs too (grep provides them as well). I'm not sure what the typical use case for these flags is though, and I'm sure that would have an impact on what we accept. |
@BurntSushi the tooling support is an interesting consideration, although I don't really know what the use cases are for using range formatting over the CLI other than from an editor integration. And even there, I would advocate using the LSP instead that supports UTF32, UTF16, and UTF8 offsets. For today, I don't think I want to support multiple encodings because we aren't aware of any use case. However, it would be nice if the design supported different encodings:
I could see us doing both to allow the most flexibility, but I think it's something we can defer until we know of actual use cases needing a different encoding (and they can't use the LSP). |
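As a rough illustration of why the encoding matters, the Python sketch below computes the offset of the same source position in UTF-8, UTF-16, and UTF-32 units. The function name `offsets_for_prefix` is hypothetical, and the sketch says nothing about how an LSP server negotiates its position encoding.

```python
def offsets_for_prefix(prefix: str) -> dict[str, int]:
    """Offset of the position right after `prefix`, in each encoding's units.

    Hypothetical helper for illustration only.
    """
    return {
        "utf-8": len(prefix.encode("utf-8")),            # bytes
        "utf-16": len(prefix.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(prefix),                           # code points
    }


# The position directly after the emoji in: x = "🐍"
print(offsets_for_prefix('x = "🐍'))
# {'utf-8': 9, 'utf-16': 7, 'utf-32': 6}
```

The three values agree only for pure ASCII text, so any offset-based flag has to state which unit it expects.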
Yeah if you don't anticipate this being used by users directly and instead only with editor integrations, then I absolutely defer to you and whatever is most convenient in that context. It might be worth calling that out in the docs too. |
If we don't expect users to call this directly we should hide it from the CLI help menus |
I'm not convinced that hiding options solves the problem. It is a public API as soon as we add it, even if undocumented. That's why I prefer documenting the behaviour even if I would prefer not having to expose it at all. @BurntSushi it's not that I'm not anticipating other use cases. It's just that I want to focus on the use case at hand. What's important to me is that the design allows us to support other potential use cases in the future, without having to redesign all options. That's why your feedback is very valuable and we should explore alternative options more if you aren't convinced that the ones that I outlined are sufficient. |
Yeah I'd rather it be in
Probably "sufficient" isn't the right word. It's not that codepoint offsets won't work. They will. It's a sound approach. I'm mostly just making the argument that, in my experience, byte offsets tend to be easier to come by. If we're just looking for a path forward here that doesn't require potentially redesigning everything, then @charliermarsh's idea seems okay. To be clear, I agree with you that these flags will probably only really be used by editor integrations and not by end users directly. So in that sense, being more flexible in what we accept is maybe not so important. If you asked me what my ideal design was and we had a good reason to believe these flags might be used outside of editor integrations, then I think I'd add one flag called |
For reference,
|
I'll reply in more depth but one thing to consider is that powershell's |
I'm leaning toward changing the input to
The downside of this is that it may require users to convert from a byte offset to line/column number if they want to use this feature in an automated way. I think I'm fine with this as a compromise for now because some tools provide line/column number output and converting a byte offset to a line number can be done using The only remaining question is if it should be single or multiple arguments. I think I'll go with a single argument because supporting a custom DSL where you specify the range type is more awkward with multiple options. |
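For tooling that only has a byte offset, the conversion to a line and column can be sketched as follows. This is an illustrative Python helper with a hypothetical name, not the tool referred to above; it assumes UTF-8 input, `\n` line endings, and 1-based columns counted in code points.

```python
def byte_offset_to_line_column(source: bytes, offset: int) -> tuple[int, int]:
    """Convert a byte offset into a 1-based (line, column) pair.

    Hypothetical helper. Assumes UTF-8 source and "\n" line endings.
    Raises UnicodeDecodeError if the offset splits a multi-byte character.
    """
    if not 0 <= offset <= len(source):
        raise ValueError(f"byte offset {offset} is out of bounds")
    prefix = source[:offset]
    line = prefix.count(b"\n") + 1
    # rfind returns -1 when there is no newline, which makes line_start 0.
    line_start = prefix.rfind(b"\n") + 1
    column = len(prefix[line_start:].decode("utf-8")) + 1
    return line, column


source = "a = 1\nbé = 2\n".encode("utf-8")
print(byte_offset_to_line_column(source, 9))  # (2, 3)
```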
Title changed from "--range-start and --range-end options to ruff format" to "--range option to ruff format"
CodSpeed Performance Report
Merging #9733 will improve performance by 47.49%. |
Co-authored-by: T-256 <[email protected]>
What’s the scoop with the benchmarks? |
It's most likely that I need to rebase my changes |
Summary
Original proposal (superseded): This PR adds the --range-start=<CHAR_OFFSET> and --range-end=<CHAR_OFFSET> options to the format command.
This PR adds the new --range=<start>-<end> option to the format command, where <start> and <end> are specified as line:column (1-based).
The new option allows users to format only a selected range rather than the entire document. The main use case is to enable range formatting in IDEs.
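The shape of the new argument can be sketched with a small parser. This Python sketch is illustrative only and is not Ruff's implementation; the names `Position` and `parse_range` are hypothetical, and it handles just the full line:column-line:column form described above.

```python
from typing import NamedTuple


class Position(NamedTuple):
    line: int    # 1-based
    column: int  # 1-based


def parse_position(text: str) -> Position:
    """Parse a single "line:column" pair (hypothetical helper)."""
    line_text, sep, column_text = text.partition(":")
    if not sep:
        raise ValueError(f"expected line:column, got {text!r}")
    line, column = int(line_text), int(column_text)
    if line < 1 or column < 1:
        raise ValueError("line and column are 1-based")
    return Position(line, column)


def parse_range(value: str) -> tuple[Position, Position]:
    """Parse "<start>-<end>" where both ends are line:column pairs."""
    start_text, sep, end_text = value.partition("-")
    if not sep:
        raise ValueError(f"expected <start>-<end>, got {value!r}")
    start, end = parse_position(start_text), parse_position(end_text)
    if end < start:
        raise ValueError("range end must not precede range start")
    return start, end


print(parse_range("3:1-5:10"))
# (Position(line=3, column=1), Position(line=5, column=10))
```

Under these assumptions, an invocation along the lines of `ruff format --range=3:1-5:10 example.py` would request formatting from line 3, column 1 through line 5, column 10; the file name here is a placeholder.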
Closes #7233
Design Decisions
The range is specified in character offsets. The alternatives I considered are:
- line numbers, similar to Black, but being able to specify the range exactly can help the formatter narrow the range better
- line:column. This would be more consistent with our --output-format=json, where we output row and column numbers, and can be easier to determine. However, this is mainly a feature for editors or when integrating Ruff into other tooling, where computing a character offset shouldn't be a concern. The main downside of line:column is that it is a more complicated value.
I went with two options instead of one to avoid the need for a custom syntax like 4-5 that users need to figure out.
--range option. TLDR: It gives us a way to define our own DSL to support byte and codepoint offsets in the future.
Limitations
The current implementation doesn't support notebooks because it's unclear if the range is relative to the notebook content or the raw notebook.
I decided to not support notebooks for now because the main use case, range formatting in VS Code, doesn't require notebook support because it only formats the closest cell.
Test Plan