Change format of cell uris so that it is harder to guess #141717

jrieken · 2022-01-28T09:03:57Z

Notebooks use the the CellUri to create a uri for a specific cell of a notebook. The rules are

take notebook uri
replace scheme with vscode-notebook-cell
append a padded fragment of the cell handled (padded for lexical-sort of uris)
append the notebook's scheme unless it's file

Samples:

Cell 1 of notebook file:///some/notebook.ipynb is vscode-notebook-cell:///some/notebook.ipynb#ch0000001
Cell 21 of notebook memfs:///some/notebook.ipynb is vscode-notebook-cell:///some/notebook.ipynb#ch0000021memfs

This format is an implementation details and the problem is that it is so obvious that folks starting to program against it (instead of using API, e.g for discovering a notebook for a cell etc). To prevent this we should make the format a little less obvious, e.g. by encoding the fragment a little.

The text was updated successfully, but these errors were encountered:

jrieken · 2022-01-28T09:04:08Z

/cc @rebornix

jrieken · 2022-01-28T13:02:30Z

/cc @weinand I believe debugging does or did make assumptions about the cell uri format

weinand · 2022-01-28T14:18:28Z

@jrieken in the notebook debugging samples the debugging functionality did not make assumptions about cell uri format.

But since VS Code persists breakpoints independent from debuggers via their resource URI, it is possible that a persisted cell breakpoint "gets broken" when the cells of a notebook are rearranged and then the notebook cell URIs get new sequential numbers in their URI fragment.

For this problem the notebook debugging samples tried to "fix" the persisted breakpoints when they detected that cell rearrangements happened.

If cell URI would be robust, this ugly workaround would not be necessary.

@roblourens is my assessment still true for the current notebook debugging approach?

jrieken · 2022-01-28T17:30:49Z

If cell URI would be robust, this ugly workaround would not be necessary.

That's currently not possible because cells don't have a stable identifier. So, it's not really part of this effort

roblourens · 2022-01-28T19:15:53Z

I can't think of anything in debugging that makes assumptions about the format. Changing the format will break anything that is stored by cellURI, like persisted breakpoints. So if we change it we might consider a simple one-time migrationwith the update

jrieken · 2022-02-01T08:18:33Z

So if we change it we might consider a simple one-time migrationwith the update

I am not sure how CellUri is used but a relatively simple adoption would be to let CellUri#parse support the old and new variant for a while. @roblourens Can you check with your usage and see if that would work?

jrieken · 2022-02-01T08:25:55Z

For this problem the notebook debugging samples tried to "fix" the persisted breakpoints when they detected that cell rearrangements happened.

I do understand the desire for stable URIs but this specific case is also a good sample for why the format should be obfuscated. This specific "massage" isn't working since roughly a year more (due to ce02917). We don't want to invite people to make assumptions...

Stable cell uris/identity is likely another topic and while I believe this is coming (has come?) to jupyter it isn't defined for other notebooks. There is also a specific issue with displaying cell uris in the workbench: we don't have the concept for pluggable resource sorters and therefore cell uris are crafted so that they sort lexicographically well. This makes the problems and references view look nice...

jrieken · 2022-02-01T08:58:30Z

/cc @rchiodo I found something that looks like an assumption about the format of the fragment. What's that for and can that work without these assumptions?

https://github.com/microsoft/vscode-jupyter/blob/62356af10732e2bbf1c520e9cea0ad8420ab3f27/src/client/datascience/notebook/intellisense/logReplayService.ts#L152

That was the only usage I could find but I am curious if you know about more such places?

rchiodo · 2022-02-01T17:34:58Z

Yes actually this is used here:
https://github.com/microsoft/lsp-notebook-concat/blob/d9fae2e518e2e32a8ebc97e723e2f3120161566a/src/notebookConcatDocument.ts#L231

We use this in pylance for computing where a notebook cell is in a document when it is created. I think @dbaeumer's new LSP messages will eliminate this need though (as the cells positions are contained in his new messages).

rchiodo · 2022-02-01T17:37:08Z

The first location you mentioned is a debug tool we use for replaying customer's logs. We'd need some way to pull apart the cell URI for that to continue to work because we don't have the actual notebook originally associated with the URI.

jrieken · 2022-02-01T17:39:44Z

We'd need some way to pull apart the cell URI for that to continue to work

Can you explain what pulling apart uris means?

rchiodo · 2022-02-01T17:44:58Z

Take the cell index out of the fragment. We need that (during the log replay) to determine what cell a 'textDocument/didChange' event to apply an edit to.

The log replace service is used to try and repro intellisense problems. Customer sends a log of all the LSP requests and a notebook that they started with.

We open the notebook and for every textDocument/didChange event we figure out which cell the edit is for and reapply it. Since the notebook URI doesn't match what we have locally, we just scrape the fragment portion to compute the cell index. Seems like it should work to replace the path part of the URI with our local file and then just search on the cell that matches instead. But that would also presume that the rest of the URI would be consistent across machines.

jrieken · 2022-02-01T17:56:09Z

Since the notebook URI doesn't match what we have locally, we just scrape the fragment portion to compute the cell index. Seems like it should work to replace the path part of the URI with our local file and then just search on the cell that matches instead. But that would also presume that the rest of the URI would be consistent across machines.

Ok, thanks for clarifying. I believe a better approach is checking for the vscode-notebook-cell-scheme and simply use the whole fragment without making assumptions about its form. Roughly like so

if(step.textDocument.uri.scheme === 'vscode-notebook-scheme') { // check fixed scheme
	const replaced =  this.activeNotebook
              .cellAt(0)
              .document.uri.with({ fragment: step.textDocument.uri.fragement }); // use with, no string-replace
	step.textDocument.uri = replaced
}

jrieken · 2022-02-01T17:57:12Z

I think @dbaeumer's new LSP messages will eliminate this need though (as the cells positions are contained in his new messages).

Do you have an ETA for adopting that?

rchiodo · 2022-02-01T18:05:21Z

Do you have an ETA for adopting that?

The pylance team would also need to switch to adopting it. Probably this month some time.

dbaeumer · 2022-02-02T15:31:08Z

@rchiodo has the new LSP messages makes it possible to not rely on the cell URL format.

jrieken · 2022-02-28T13:35:34Z

@rchiodo Did that adoption happen?

rchiodo · 2022-02-28T16:45:43Z

No it did not.

rchiodo · 2022-02-28T16:47:05Z

I believe @msfterictraut is in the process of making an experimental version of pylance to adopt this new API.

debonte · 2022-05-17T22:19:42Z

@jrieken The changes for the experiment were checked in today. Prerelease builds with experiment support will ship tomorrow. Current plan is to start the experiment on Monday.

jrieken · 2022-05-18T06:57:55Z

👯 Thanks for the update

jrieken · 2022-07-01T13:45:00Z

@debonte I wanna finally make this change happen in July? Any objections? @rchiodo Same question

debonte · 2022-07-01T16:20:18Z

@jrieken, our experiment is still running. Also, the Interactive Window is currently still using the old (non-LSP) approach due to microsoft/vscode-jupyter#10630.

rchiodo · 2022-07-01T16:31:06Z

Post that being fixed we'd still make assumptions about notebook URIs. We need this in order to determine if the TextDocument for the input box in the interactive window is tied to the notebook for the interactive window or not. @jrieken I'm not sure if you're changes will make this impossible or are they only dealing with the fragment portion?

rchiodo · 2022-07-01T16:31:44Z

The 'logReplayService' above isn't really necessary. It's just a debug tool.

jrieken · 2022-07-04T07:02:05Z

or are they only dealing with the fragment portion?

@rchiodo Only the fragment portion. The new format is more obscure but has the same properties: sorts by cell orders and encodes the cell index.

@debonte Assuming we are going ahead with this change. What's the impact on your end? What will break and what intermediate change can you make?

debonte · 2022-07-05T22:45:39Z

@debonte Assuming we are going ahead with this change. What's the impact on your end? What will break and what intermediate change can you make?

Because we depend on the lsp-notebook-concat code that Rich referenced above, I believe this would cause the ordering of cells to be wrong for anyone using Pylance's concat doc-based notebook support. Depending on how the cells are misordered, this could cause most language server features to misbehave. This would impact any user that is not in the LSP notebooks experiment treatment group or who is using Interactive Window.

The Jupyter team owns this code and would be best able to say how it might be adjusted to support both the old and new formats.

FYI, Rich is OOF this week.

jrieken · 2022-07-06T15:01:28Z

I believe this would cause the ordering of cells to be wrong for anyone using Pylance's concat doc-based notebook support.

The new format makes the same ordering guarantees, e.g when sorting by the toString()'ed uri of the cell-documents you get the same order as before. It is just harder to extract the actual (numeric) value from the URI

debonte · 2022-07-06T19:31:35Z

The lsp-notebook-concat code we are currently using gets the cell fragment as a number like this:

parseInt(cellUri.fragment.substring(2) || '0')

My understanding is that this won't work properly with the new format. That was my point.

jrieken · 2022-07-07T10:25:43Z

Yeah, parsing the fragment into a number won't work anymore but looking at how the parse result is actually used gives me confidence. The value is kept in NotebookSpan#fragment and the only usage of the that field seems to be a simple comparison. That will continue to work with the fragment as-is (because cell-order is encoded in the fragment part).

Let's wait for @rchiodo to comment but I am fairly confident that a small change in lsp-notebook-concat will do the job.

(edit)

I didn't bother to create a fork but I believe this change will do it

diff --git a/src/notebookConcatDocument.ts b/src/notebookConcatDocument.ts
index f3e8cec..9aee8ab 100644
--- a/src/notebookConcatDocument.ts
+++ b/src/notebookConcatDocument.ts
@@ -26,7 +26,7 @@ import { createPosition, createRange } from './helper';
 
 type NotebookSpan = {
     uri: vscodeUri.URI;
-    fragment: number;
+    fragment: string;
     inRealCell: boolean;
     startOffset: number;
     endOffset: number;
@@ -233,7 +233,7 @@ export class NotebookConcatDocument implements ITextDocument {
 
         // Compute 'fragment' portion of URI. It's the tentative cell index
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
 
         // That fragment determines order in the list (if we're not forcing append)
         const insertIndex = forceAppend ? this._spans.length : this.computeInsertionIndex(fragment);
@@ -659,7 +659,7 @@ export class NotebookConcatDocument implements ITextDocument {
     ): NotebookSpan {
         // Compute fragment based on cell uri
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
         return {
             fragment,
             uri: cellUri,
@@ -676,7 +676,7 @@ export class NotebookConcatDocument implements ITextDocument {
     private createTypeIgnoreSpan(cellUri: vscodeUri.URI, offset: number, realOffset: number): NotebookSpan {
         // Compute fragment based on cell uri
         const fragment =
-            cellUri.scheme === InteractiveInputScheme ? -1 : parseInt(cellUri.fragment.substring(2) || '0');
+            cellUri.scheme === InteractiveInputScheme ? '' : cellUri.fragment;
         return {
             fragment,
             uri: cellUri,
@@ -700,7 +700,7 @@ export class NotebookConcatDocument implements ITextDocument {
             }
             return [
                 {
-                    fragment: -1,
+                    fragment: '',
                     uri: cellUri,
                     inRealCell: false,
                     startOffset: 0,
@@ -715,7 +715,7 @@ export class NotebookConcatDocument implements ITextDocument {
 
         return [
             {
-                fragment: -1,
+                fragment: '',
                 uri: cellUri,
                 inRealCell: false,
                 startOffset: 0,
@@ -877,13 +877,13 @@ export class NotebookConcatDocument implements ITextDocument {
         return result as protocol.Range;
     }
 
-    private computeInsertionIndex(fragment: number): number {
+    private computeInsertionIndex(fragment: string): number {
         // Remember if last cell is already the input box
         const inputBoxPresent = this._spans[this._spans.length - 1]?.uri?.scheme === InteractiveInputScheme;
         const totalLength = inputBoxPresent ? this._spans.length - 1 : this._spans.length;
 
         // Find index based on fragment
-        const index = fragment == -1 ? this._spans.length : this._spans.findIndex((c) => c.fragment > fragment);
+        const index = fragment === '' ? this._spans.length : this._spans.findIndex((c) => c.fragment > fragment);
         return index < 0 ? totalLength : index;
     }

rchiodo · 2022-07-11T16:02:36Z

I think you're saying a string comparison of the fragments will order them correctly? Maybe it should be more explicit that it's a string compare (instead of using '<').

rchiodo · 2022-07-11T16:04:33Z

I was also thinking the lsp concat stuff could be refactored to use the new messages that Dirk created. We were trying to build the order ourselves but I believe that isn't necessary anymore?

jrieken · 2022-07-12T09:37:17Z

Maybe it should be more explicit that it's a string compare (instead of using '<').

Unsure - the comparison operator might be best suited here because it faster and more narrow-spec'd than localeCompare (I believe). I also see this as a temporary fix, e.g using the LSP concat is the way to go but the question is what do you do assuming we make the fragment-change at the end of the week?

The patch I have suggested is a good stepping-stone to prevent breakage but it isn't an alternative to LSP. It just less fragile: today we assume the fragment (1) parses to a number and (2) the numbers are comparable and with my patch we assume (1) the fragment is comparable. That's better but not the optimum, using LSP is the optimum but I haven't seen that happening in the last months and TBH I want to make progress

rchiodo · 2022-07-12T16:36:05Z

using LSP is the optimum

Shortly this will be the case. That's the work that Erik has been doing. It's already shipped but it's just behind an experiment flag at the moment. Once that experiment is at 100%, this middleware code won't be needed anymore.

Given that, I think your temporary fix would be good. Although it requires the fragment change, right? So it needs to go in at the same time? Oh maybe it doesn't. Fragment today would compare the same I think.

jrieken · 2022-07-13T07:24:19Z

Oh maybe it doesn't. Fragment today would compare the same I think.

Yes, the change can happen today because comparing the fragment is already behaving like that

rchiodo · 2022-07-14T17:24:49Z

Okay the fragment change made it through all the layers for jupyter. Our build tonight will have the new fragment change.

jrieken self-assigned this Jan 28, 2022

jrieken added this to the February 2022 milestone Jan 28, 2022

jrieken added debt Code quality issues notebook notebook-api and removed notebook-api labels Jan 28, 2022

vscode-triage-bot assigned rebornix Jan 28, 2022

jrieken removed the notebook-triage label Jan 28, 2022

jrieken mentioned this issue Jan 28, 2022

change the format of CellUri#generate and #parse #141726

Merged

jrieken modified the milestones: February 2022, March 2022 Feb 3, 2022

jrieken modified the milestones: May 2022, June 2022 May 20, 2022

jrieken modified the milestones: June 2022, July 2022 Jun 20, 2022

jrieken added the api-proposal label Jul 13, 2022

jrieken closed this as completed in #141726 Jul 19, 2022

vscodenpa added unreleased Patch has not yet been released in VS Code Insiders insiders-released Patch has been released in VS Code Insiders and removed unreleased Patch has not yet been released in VS Code Insiders labels Jul 19, 2022

github-actions bot locked and limited conversation to collaborators Sep 2, 2022

Change format of cell uris so that it is harder to guess #141717

Change format of cell uris so that it is harder to guess #141717

Comments

jrieken commented Jan 28, 2022 • edited Loading

jrieken commented Jan 28, 2022

jrieken commented Jan 28, 2022

weinand commented Jan 28, 2022

jrieken commented Jan 28, 2022

roblourens commented Jan 28, 2022

jrieken commented Feb 1, 2022

jrieken commented Feb 1, 2022 • edited Loading

jrieken commented Feb 1, 2022

rchiodo commented Feb 1, 2022

rchiodo commented Feb 1, 2022

jrieken commented Feb 1, 2022

rchiodo commented Feb 1, 2022

jrieken commented Feb 1, 2022 • edited Loading

jrieken commented Feb 1, 2022

rchiodo commented Feb 1, 2022

dbaeumer commented Feb 2, 2022

jrieken commented Feb 28, 2022

rchiodo commented Feb 28, 2022

rchiodo commented Feb 28, 2022

debonte commented May 17, 2022

jrieken commented May 18, 2022

jrieken commented Jul 1, 2022

debonte commented Jul 1, 2022

rchiodo commented Jul 1, 2022

rchiodo commented Jul 1, 2022

jrieken commented Jul 4, 2022 • edited Loading

debonte commented Jul 5, 2022

jrieken commented Jul 6, 2022

debonte commented Jul 6, 2022

jrieken commented Jul 7, 2022 • edited Loading

rchiodo commented Jul 11, 2022

rchiodo commented Jul 11, 2022

jrieken commented Jul 12, 2022 • edited Loading

rchiodo commented Jul 12, 2022 • edited Loading

jrieken commented Jul 13, 2022

rchiodo commented Jul 14, 2022

jrieken commented Jan 28, 2022 •

edited

Loading

jrieken commented Feb 1, 2022 •

edited

Loading

jrieken commented Feb 1, 2022 •

edited

Loading

jrieken commented Jul 4, 2022 •

edited

Loading

jrieken commented Jul 7, 2022 •

edited

Loading

jrieken commented Jul 12, 2022 •

edited

Loading

rchiodo commented Jul 12, 2022 •

edited

Loading