Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change format of cell uris so that it is harder to guess #141717

Closed
jrieken opened this issue Jan 28, 2022 · 47 comments · Fixed by #141726
Closed

Change format of cell uris so that it is harder to guess #141717

jrieken opened this issue Jan 28, 2022 · 47 comments · Fixed by #141726
Assignees
Labels
api-proposal debt Code quality issues insiders-released Patch has been released in VS Code Insiders notebook
Milestone

Comments

@jrieken
Copy link
Member

jrieken commented Jan 28, 2022

Notebooks use the the CellUri to create a uri for a specific cell of a notebook. The rules are

  • take notebook uri
  • replace scheme with vscode-notebook-cell
  • append a padded fragment of the cell handled (padded for lexical-sort of uris)
  • append the notebook's scheme unless it's file

Samples:

Cell 1 of notebook file:///some/notebook.ipynb is vscode-notebook-cell:///some/notebook.ipynb#ch0000001
Cell 21 of notebook memfs:///some/notebook.ipynb is vscode-notebook-cell:///some/notebook.ipynb#ch0000021memfs

This format is an implementation details and the problem is that it is so obvious that folks starting to program against it (instead of using API, e.g for discovering a notebook for a cell etc). To prevent this we should make the format a little less obvious, e.g. by encoding the fragment a little.

@jrieken jrieken self-assigned this Jan 28, 2022
@jrieken
Copy link
Member Author

jrieken commented Jan 28, 2022

/cc @rebornix

@jrieken
Copy link
Member Author

jrieken commented Jan 28, 2022

/cc @weinand I believe debugging does or did make assumptions about the cell uri format

@weinand
Copy link
Contributor

weinand commented Jan 28, 2022

@jrieken in the notebook debugging samples the debugging functionality did not make assumptions about cell uri format.

But since VS Code persists breakpoints independent from debuggers via their resource URI, it is possible that a persisted cell breakpoint "gets broken" when the cells of a notebook are rearranged and then the notebook cell URIs get new sequential numbers in their URI fragment.

For this problem the notebook debugging samples tried to "fix" the persisted breakpoints when they detected that cell rearrangements happened.

If cell URI would be robust, this ugly workaround would not be necessary.

@roblourens is my assessment still true for the current notebook debugging approach?

@jrieken
Copy link
Member Author

jrieken commented Jan 28, 2022

If cell URI would be robust, this ugly workaround would not be necessary.

That's currently not possible because cells don't have a stable identifier. So, it's not really part of this effort

@roblourens
Copy link
Member

I can't think of anything in debugging that makes assumptions about the format. Changing the format will break anything that is stored by cellURI, like persisted breakpoints. So if we change it we might consider a simple one-time migrationwith the update

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

So if we change it we might consider a simple one-time migrationwith the update

I am not sure how CellUri is used but a relatively simple adoption would be to let CellUri#parse support the old and new variant for a while. @roblourens Can you check with your usage and see if that would work?

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

For this problem the notebook debugging samples tried to "fix" the persisted breakpoints when they detected that cell rearrangements happened.

I do understand the desire for stable URIs but this specific case is also a good sample for why the format should be obfuscated. This specific "massage" isn't working since roughly a year more (due to ce02917). We don't want to invite people to make assumptions...

Stable cell uris/identity is likely another topic and while I believe this is coming (has come?) to jupyter it isn't defined for other notebooks. There is also a specific issue with displaying cell uris in the workbench: we don't have the concept for pluggable resource sorters and therefore cell uris are crafted so that they sort lexicographically well. This makes the problems and references view look nice...

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

/cc @rchiodo I found something that looks like an assumption about the format of the fragment. What's that for and can that work without these assumptions?

https://github.com/microsoft/vscode-jupyter/blob/62356af10732e2bbf1c520e9cea0ad8420ab3f27/src/client/datascience/notebook/intellisense/logReplayService.ts#L152

That was the only usage I could find but I am curious if you know about more such places?

@rchiodo
Copy link
Contributor

rchiodo commented Feb 1, 2022

Yes actually this is used here:
https://github.com/microsoft/lsp-notebook-concat/blob/d9fae2e518e2e32a8ebc97e723e2f3120161566a/src/notebookConcatDocument.ts#L231

We use this in pylance for computing where a notebook cell is in a document when it is created. I think @dbaeumer's new LSP messages will eliminate this need though (as the cells positions are contained in his new messages).

@rchiodo
Copy link
Contributor

rchiodo commented Feb 1, 2022

The first location you mentioned is a debug tool we use for replaying customer's logs. We'd need some way to pull apart the cell URI for that to continue to work because we don't have the actual notebook originally associated with the URI.

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

We'd need some way to pull apart the cell URI for that to continue to work

Can you explain what pulling apart uris means?

@rchiodo
Copy link
Contributor

rchiodo commented Feb 1, 2022

Take the cell index out of the fragment. We need that (during the log replay) to determine what cell a 'textDocument/didChange' event to apply an edit to.

The log replace service is used to try and repro intellisense problems. Customer sends a log of all the LSP requests and a notebook that they started with.

We open the notebook and for every textDocument/didChange event we figure out which cell the edit is for and reapply it. Since the notebook URI doesn't match what we have locally, we just scrape the fragment portion to compute the cell index. Seems like it should work to replace the path part of the URI with our local file and then just search on the cell that matches instead. But that would also presume that the rest of the URI would be consistent across machines.

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

Since the notebook URI doesn't match what we have locally, we just scrape the fragment portion to compute the cell index. Seems like it should work to replace the path part of the URI with our local file and then just search on the cell that matches instead. But that would also presume that the rest of the URI would be consistent across machines.

Ok, thanks for clarifying. I believe a better approach is checking for the vscode-notebook-cell-scheme and simply use the whole fragment without making assumptions about its form. Roughly like so

if(step.textDocument.uri.scheme === 'vscode-notebook-scheme') { // check fixed scheme
	const replaced =  this.activeNotebook
              .cellAt(0)
              .document.uri.with({ fragment: step.textDocument.uri.fragement }); // use with, no string-replace
	step.textDocument.uri = replaced
}

@jrieken
Copy link
Member Author

jrieken commented Feb 1, 2022

I think @dbaeumer's new LSP messages will eliminate this need though (as the cells positions are contained in his new messages).

Do you have an ETA for adopting that?

@rchiodo
Copy link
Contributor

rchiodo commented Feb 1, 2022

Do you have an ETA for adopting that?

The pylance team would also need to switch to adopting it. Probably this month some time.

@dbaeumer
Copy link
Member

dbaeumer commented Feb 2, 2022

@rchiodo has the new LSP messages makes it possible to not rely on the cell URL format.

@jrieken jrieken modified the milestones: February 2022, March 2022 Feb 3, 2022
@jrieken
Copy link
Member Author

jrieken commented Feb 28, 2022

@rchiodo Did that adoption happen?

@rchiodo
Copy link
Contributor

rchiodo commented Feb 28, 2022

No it did not.

@rchiodo
Copy link
Contributor

rchiodo commented Feb 28, 2022

I believe @msfterictraut is in the process of making an experimental version of pylance to adopt this new API.

@debonte
Copy link
Contributor

debonte commented May 17, 2022

@jrieken The changes for the experiment were checked in today. Prerelease builds with experiment support will ship tomorrow. Current plan is to start the experiment on Monday.

@jrieken
Copy link
Member Author

jrieken commented May 18, 2022

👯 Thanks for the update

@jrieken jrieken modified the milestones: May 2022, June 2022 May 20, 2022
@jrieken jrieken modified the milestones: June 2022, July 2022 Jun 20, 2022
@jrieken
Copy link
Member Author

jrieken commented Jul 1, 2022

@debonte I wanna finally make this change happen in July? Any objections? @rchiodo Same question

@debonte
Copy link
Contributor

debonte commented Jul 1, 2022

@jrieken, our experiment is still running. Also, the Interactive Window is currently still using the old (non-LSP) approach due to microsoft/vscode-jupyter#10630.

@rchiodo
Copy link
Contributor

rchiodo commented Jul 1, 2022

Post that being fixed we'd still make assumptions about notebook URIs. We need this in order to determine if the TextDocument for the input box in the interactive window is tied to the notebook for the interactive window or not. @jrieken I'm not sure if you're changes will make this impossible or are they only dealing with the fragment portion?

@rchiodo
Copy link
Contributor

rchiodo commented Jul 1, 2022

The 'logReplayService' above isn't really necessary. It's just a debug tool.

@jrieken
Copy link
Member Author

jrieken commented Jul 4, 2022

or are they only dealing with the fragment portion?

@rchiodo Only the fragment portion. The new format is more obscure but has the same properties: sorts by cell orders and encodes the cell index.

@debonte Assuming we are going ahead with this change. What's the impact on your end? What will break and what intermediate change can you make?

@debonte
Copy link
Contributor

debonte commented Jul 5, 2022

@debonte Assuming we are going ahead with this change. What's the impact on your end? What will break and what intermediate change can you make?

Because we depend on the lsp-notebook-concat code that Rich referenced above, I believe this would cause the ordering of cells to be wrong for anyone using Pylance's concat doc-based notebook support. Depending on how the cells are misordered, this could cause most language server features to misbehave. This would impact any user that is not in the LSP notebooks experiment treatment group or who is using Interactive Window.

The Jupyter team owns this code and would be best able to say how it might be adjusted to support both the old and new formats.

FYI, Rich is OOF this week.

@jrieken
Copy link
Member Author

jrieken commented Jul 6, 2022

I believe this would cause the ordering of cells to be wrong for anyone using Pylance's concat doc-based notebook support.

The new format makes the same ordering guarantees, e.g when sorting by the toString()'ed uri of the cell-documents you get the same order as before. It is just harder to extract the actual (numeric) value from the URI

@debonte
Copy link
Contributor

debonte commented Jul 6, 2022

The lsp-notebook-concat code we are currently using gets the cell fragment as a number like this:

parseInt(cellUri.fragment.substring(2) || '0')

My understanding is that this won't work properly with the new format. That was my point.

@jrieken
Copy link
Member Author

jrieken commented Jul 7, 2022

Yeah, parsing the fragment into a number won't work anymore but looking at how the parse result is actually used gives me confidence. The value is kept in NotebookSpan#fragment and the only usage of the that field seems to be a simple comparison. That will continue to work with the fragment as-is (because cell-order is encoded in the fragment part).

Let's wait for @rchiodo to comment but I am fairly confident that a small change in lsp-notebook-concat will do the job.

(edit)

I didn't bother to create a fork but I believe this change will do it

diff --git a/src/notebookConcatDocument.ts b/src/notebookConcatDocument.ts
index f3e8cec..9aee8ab 100644
--- a/src/notebookConcatDocument.ts
+++ b/src/notebookConcatDocument.ts
@@ -26,7 +26,7 @@ import { createPosition, createRange } from './helper';
 
 type NotebookSpan = {
     uri: vscodeUri.URI;
-    fragment: number;
+    fragment: string;
     inRealCell: boolean;
     startOffset: number;
     endOffset: number;
@@ -233,7 +233,7 @@ export class NotebookConcatDocument implements ITextDocument {
 
         // Compute 'fragment' portion of URI. It's the tentative cell index
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
 
         // That fragment determines order in the list (if we're not forcing append)
         const insertIndex = forceAppend ? this._spans.length : this.computeInsertionIndex(fragment);
@@ -659,7 +659,7 @@ export class NotebookConcatDocument implements ITextDocument {
     ): NotebookSpan {
         // Compute fragment based on cell uri
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
         return {
             fragment,
             uri: cellUri,
@@ -676,7 +676,7 @@ export class NotebookConcatDocument implements ITextDocument {
     private createTypeIgnoreSpan(cellUri: vscodeUri.URI, offset: number, realOffset: number): NotebookSpan {
         // Compute fragment based on cell uri
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
         return {
             fragment,
             uri: cellUri,
@@ -700,7 +700,7 @@ export class NotebookConcatDocument implements ITextDocument {
             }
             return [
                 {
-                    fragment: -1,
+                    fragment: '',
                     uri: cellUri,
                     inRealCell: false,
                     startOffset: 0,
@@ -715,7 +715,7 @@ export class NotebookConcatDocument implements ITextDocument {
 
         return [
             {
-                fragment: -1,
+                fragment: '',
                 uri: cellUri,
                 inRealCell: false,
                 startOffset: 0,
@@ -877,13 +877,13 @@ export class NotebookConcatDocument implements ITextDocument {
         return result as protocol.Range;
     }
 
-    private computeInsertionIndex(fragment: number): number {
+    private computeInsertionIndex(fragment: string): number {
         // Remember if last cell is already the input box
         const inputBoxPresent = this._spans[this._spans.length - 1]?.uri?.scheme === InteractiveInputScheme;
         const totalLength = inputBoxPresent ? this._spans.length - 1 : this._spans.length;
 
         // Find index based on fragment
-        const index = fragment == -1 ? this._spans.length : this._spans.findIndex((c) => c.fragment > fragment);
+        const index = fragment === '' ? this._spans.length : this._spans.findIndex((c) => c.fragment > fragment);
         return index < 0 ? totalLength : index;
     }
 

@rchiodo
Copy link
Contributor

rchiodo commented Jul 11, 2022

I think you're saying a string comparison of the fragments will order them correctly? Maybe it should be more explicit that it's a string compare (instead of using '<').

@rchiodo
Copy link
Contributor

rchiodo commented Jul 11, 2022

I was also thinking the lsp concat stuff could be refactored to use the new messages that Dirk created. We were trying to build the order ourselves but I believe that isn't necessary anymore?

@jrieken
Copy link
Member Author

jrieken commented Jul 12, 2022

Maybe it should be more explicit that it's a string compare (instead of using '<').

Unsure - the comparison operator might be best suited here because it faster and more narrow-spec'd than localeCompare (I believe). I also see this as a temporary fix, e.g using the LSP concat is the way to go but the question is what do you do assuming we make the fragment-change at the end of the week?

The patch I have suggested is a good stepping-stone to prevent breakage but it isn't an alternative to LSP. It just less fragile: today we assume the fragment (1) parses to a number and (2) the numbers are comparable and with my patch we assume (1) the fragment is comparable. That's better but not the optimum, using LSP is the optimum but I haven't seen that happening in the last months and TBH I want to make progress

@rchiodo
Copy link
Contributor

rchiodo commented Jul 12, 2022

using LSP is the optimum

Shortly this will be the case. That's the work that Erik has been doing. It's already shipped but it's just behind an experiment flag at the moment. Once that experiment is at 100%, this middleware code won't be needed anymore.

Given that, I think your temporary fix would be good. Although it requires the fragment change, right? So it needs to go in at the same time? Oh maybe it doesn't. Fragment today would compare the same I think.

@jrieken
Copy link
Member Author

jrieken commented Jul 13, 2022

Oh maybe it doesn't. Fragment today would compare the same I think.

Yes, the change can happen today because comparing the fragment is already behaving like that

@rchiodo
Copy link
Contributor

rchiodo commented Jul 14, 2022

Okay the fragment change made it through all the layers for jupyter. Our build tonight will have the new fragment change.

@vscodenpa vscodenpa added unreleased Patch has not yet been released in VS Code Insiders insiders-released Patch has been released in VS Code Insiders and removed unreleased Patch has not yet been released in VS Code Insiders labels Jul 19, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Sep 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-proposal debt Code quality issues insiders-released Patch has been released in VS Code Insiders notebook
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants