
Where did the adjective/noun corpus come from? #216

Closed
seibert opened this issue Mar 18, 2021 · 24 comments · Fixed by #217

@seibert

seibert commented Mar 18, 2021

While figuring out how the new cell id fields work in format 4.5, I noticed that they are currently being generated from a word corpus saved in these two files:

- nbformat/corpus/adjectives.txt
- nbformat/corpus/nouns.txt

Does anyone know where these word lists came from? There are some entries in there that are going to produce surprising, problematic, or possibly offensive combinations. I don't know how many employers run random word scans on attachments, but someone can quite easily email a notebook to a colleague and be unaware that the JSON could contain strings like:

- special-marijuana
- jewish-holocaust
- naked-librarian
- [I'll stop there]

I would strongly vote for one of:

@MSeal
Contributor

MSeal commented Mar 18, 2021

😬 Those are pretty bad word combinations. Thank you for raising this.

The choice of the list / pattern was made through jupyter/enhancement-proposals#62. I'd be in favor of any of those paths forward, though per the JEP the UUID pattern as a default was vetoed in the original proposal, so the second suggestion would require more buy-in to adjust.

A lot of people reviewed the change, but I'll ping a few of them to hopefully get their input on suggested routes to improve: @willingc @choldgraf @minrk @jhamrick

@choldgraf
Contributor

Ooof, we definitely need to change those lists.

I think my preference would be something auto-generated but with some kind of QC to ensure there's nothing offensive or inappropriate in there. Surely this must be something that other projects have needed and created?

Otherwise, I'd be fine going with a pre-created curated list.

Agreed that making them UUIDs would require more discussion, and I feel like this is a decent-enough problem that we should try to get a fix sooner than later.

@choldgraf
Contributor

I'd be +1 on using the Docker names-generator list. That codebase is Apache 2.0 licensed, which I think means we could re-use it with attribution in a straightforward way, and in a way that wouldn't run afoul of the nbformat license restrictions?

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@choldgraf
Contributor

@bollwyvl I can't remember where it happened, but I think there was discussion in the JEP about keeping it human-readable because people may want to use these names as references. (E.g., in Jupyter Book someone might want to say "include the output from cell ntbk/hocus-pocus over here".)

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@MSeal
Contributor

MSeal commented Mar 19, 2021

> right, if they speak English...

As a default behavior, using English is fine IMO. The spec allows any application to use whatever it wants beyond the defaults, but the rest of the fields and Jupyter interfaces are English-first as default behavior as well.

If we went with hashes, I'd feel we were back in the conversation about UUIDs and would rather use the UUID standard. But it was clear from the JEP that enough voices didn't want difficult-to-remember cell ids as a default behavior, and I think we should respect that decision for now.

> it would also have to include the full text of the Apache 2 license.

Yep.

> I think my preference would be something auto-generated but with some kind of QC to ensure there's nothing offensive or inappropriate in there. Surely this must be something that other projects have needed and created?

I'll try to look into more options over the weekend as well. We can probably also do a manual pass or two on the current list and at least reduce the poor-word-combination space if we don't find a good BSD or more openly licensed list.

@betatim
Member

betatim commented Mar 19, 2021

Three ideas for sources of words:

  1. Google n-grams. Pros: a nearly infinite source of words. Cons: needs curating to remove offensive words; licensing is not clear to me.
  2. The humanhash word list. Pros: simple words, not offensive, "unlicense" license. Cons: too short? https://github.com/zacharyvoase/humanhash/blob/e6bd570b23bcf66a8928b9dd7f297773dedcf264/humanhash.py#L12
  3. The Up-Goer Five list. Pros: simple words, not offensive. Cons: licensing unclear to me; words can be very short.

@seibert
Author

seibert commented Mar 19, 2021

One question I have is: what is the required size of the random ID space? The current system has 1366 adjectives and 3714 nouns, for a total of ~5.1 million combinations. How low can that go and still be acceptable?

It seems like the goal is to be able to randomly generate cell IDs without having to check for collisions with every other cell already in the notebook. The probability of an accidental collision depends on the size of the ID space and the number of cells we expect to see in a given notebook. That puts us into birthday paradox territory, where the probabilities (assuming I have my approximations correct) can become surprisingly large:

| # of unique IDs in space | # of cells in the notebook | probability of random collision |
| --- | --- | --- |
| 5.07e6 | 200 | 0.39% |
| 5.07e6 | 300 | 0.88% |
| 5.07e6 | 1000 | 9.3% |

Given that, I'm not even sure 5 million combinations is sufficient, which would rule out the Docker adjective-name list (which has about 25,000 combinations).

The humanhash word list looks promising if we use four of its 256 words, which gives 256**4 = 2**32 combinations, cutting collision probabilities to (maybe?) negligible levels:

| # of unique IDs in space | # of cells in the notebook | probability of random collision |
| --- | --- | --- |
| 4.29e9 | 200 | 0.0004% |
| 4.29e9 | 300 | 0.001% |
| 4.29e9 | 1000 | 0.01% |

Not sure if kilo-speaker-ten-mike is too unwieldy, but it is at least memorable.

(Alternatively, a much smaller ID space could be used if we can rely on tools to keep a list of already used IDs in the notebook and explicitly avoid collisions. I assume that is not desirable, though.)
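For reference, these tables follow from the standard birthday-paradox approximation p ≈ 1 − exp(−k(k−1)/(2n)) for k cells drawn from a space of n ids; a quick back-of-the-envelope script (illustrative only, not part of nbformat) reproduces the numbers:

```python
import math

def collision_probability(n_ids: int, n_cells: int) -> float:
    """Approximate probability that at least two of n_cells randomly
    drawn ids collide, given a space of n_ids (birthday paradox)."""
    return 1 - math.exp(-n_cells * (n_cells - 1) / (2 * n_ids))

# Word-pair space (1366 adjectives x 3714 nouns) vs. four humanhash words (2**32)
for n in (1366 * 3714, 2**32):
    for k in (200, 300, 1000):
        print(f"{n:.2e} ids, {k:4d} cells -> {collision_probability(n, k):.4%}")
```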

@MSeal
Contributor

MSeal commented Mar 22, 2021

@seibert The library retries id assignment if it detects a conflict at the moment, because the cost of doing so isn't high. The main idea with the default state-space size is to avoid collisions happening often and triggering a warning to the client.

That being said, a very small list might be difficult to differentiate even with multiple word-pairs: e.g., kilo-speaker-kilo vs speaker-kilo-speaker isn't ideal, but it's also not offensive to anyone.

Another option is for us to do a stricter word filter on the existing lists -- something like https://www.cs.cmu.edu/~biglou/resources/ (thanks @choldgraf for looking around at the options therein).
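For illustration, the retry-on-conflict assignment described above might look roughly like this (a hypothetical sketch; the names and warning text are not nbformat's actual API):

```python
import warnings

def assign_unique_id(existing_ids: set, generate_id, max_attempts: int = 100) -> str:
    """Draw random ids until one doesn't collide with the ids already
    present in the notebook; warn the client if we keep colliding."""
    for _ in range(max_attempts):
        candidate = generate_id()
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            return candidate
    warnings.warn("Could not generate a non-colliding cell id")
    return generate_id()
```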

@choldgraf
Contributor

FWIW, it seems like filtering out any adjectives / nouns in our list that are also in the CMU list would result in removing about 50 adjectives and 100 nouns:


Code in case it's helpful:

```python
import pandas as pd

# Current nbformat word lists
url_adj = "https://raw.githubusercontent.com/jupyter/nbformat/master/nbformat/corpus/adjectives.txt"
url_noun = "https://raw.githubusercontent.com/jupyter/nbformat/master/nbformat/corpus/nouns.txt"
# CMU "bad words" list
url_bad = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"

# Each file is one word per line; squeeze each single-column frame to a Series
adj = pd.read_csv(url_adj, header=None).squeeze()
noun = pd.read_csv(url_noun, header=None).squeeze()
bad = pd.read_csv(url_bad, header=None).squeeze()

# Keep only the words that do not appear in the bad-words list
adj_ok = adj[~adj.isin(bad.values)]
noun_ok = noun[~noun.isin(bad.values)]

print(f"Original adjectives: {adj.shape}")
print(f"Adjectives after filter: {adj_ok.shape}")
print(f"Original nouns: {noun.shape}")
print(f"Nouns after filter: {noun_ok.shape}")
```

@MSeal
Contributor

MSeal commented Mar 29, 2021

Let's go with the changes @choldgraf proposed here this week, as it's a clear improvement that prevents the immediately identified problems, and I don't see any controversy in the change beyond the fact that we may need further filtering later on.

@blois

blois commented Mar 31, 2021

Looking through the entries that remain after filtering with the CMU bad-words list, I still see a number of possible combinations that could be interpreted as inappropriate. The noun list contains a number of nationalities / localities, and random adjectives are being applied to them.

Colab uses a random sequence of 12 characters chosen from the 64-character alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_, with no additional checking for appropriateness. In years of use I cannot remember any inappropriate results, and these ids are shown in the URL bar.

If human readability is a goal, then perhaps allow the id to be edited?

Here's a comparison of generated IDs: https://gist.github.com/blois/1f6644fa74cbba5c3be5e998c09278e9#file-cell-ids-ipynb
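For reference, a Colab-style generator along the lines described above is only a few lines (an illustrative sketch, not Colab's actual code):

```python
import secrets
import string

# 26 + 26 + 10 + 2 = 64 characters, as described above
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def random_cell_id(length: int = 12) -> str:
    """Random id over a 64-character alphabet: 64**12 = 2**72 possibilities."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```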

@jasongrout
Member

I'm +1 on having random character strings. It's impossible to tell whether a combination of English words will become offensive based on current news topics, etc., and it just seems much better to avoid the issue altogether.

@westurner
Contributor

westurner commented Mar 31, 2021 via email

@choldgraf
Contributor

choldgraf commented Mar 31, 2021

Could I suggest that we implement the suggestion @MSeal gave here (#216 (comment)) and, for now, just remove the subset of potentially problematic words described above, as a clear improvement over what we have now? If people would like to discuss generating randomized strings, I think we should do that in another issue as a broader, potential long-term improvement.

I think random character strings are interesting but will require more community discussion and buy-in before implementing; in the meantime we should address the immediate issue of a semi-high likelihood of problematic two-word combinations.

@blois

blois commented Mar 31, 2021

Examples of nouns in the list: africa, european, asian, chinese, france, germany, spanish, african, and many others. And random adjectives are being picked to prefix them with, among them homeless, hungry, fatty, awful, crude, attractive, offensive. This is just from a quick glance.

I think this wordlist approach should be removed with the utmost urgency.

@MSeal
Contributor

MSeal commented Mar 31, 2021

> I think this wordlist approach should be removed with the utmost urgency.

The original proposal had UUIDs, but there was resistance in the community calls around the topic; I think hashes are just the same UUID pattern with a smaller state space. See Options B & C from the JEP. The root pushback was that a UUID (a) felt like too large an identifier by default and (b) was not human-differentiable at a glance when put in URLs or read in the raw JSON. I'm unsure how we get quick consensus on backing out of that decision to change to a hash format as the default. This wouldn't change the spec in any way, just the library's fall-through behavior.

Heroku, what3words, and a few other applications have word lists they use in combination, but they clearly put more thought into their arrangement / listing. That part's on me for not researching well enough during the JEP.

Would releasing the filtered list immediately, to improve the situation while we discuss a more permanent change, be helpful / productive here?

@MSeal
Contributor

MSeal commented Mar 31, 2021

> I'm unsure how we get quick consensus on backing out of that decision to change to a hash format as the default.

If we do go that route, we may have to just ask for forgiveness.

@choldgraf
Contributor

choldgraf commented Apr 1, 2021

The main thing I want to avoid is the feeling that we've casually over-ruled a participatory decision made with input from many (or that we take a long time going through a similar process to over-rule it while this problem is still at large).

That said, I think this particular decision ("should the default behavior be to save things as words or hashes") doesn't deviate in fundamental ways from the JEP; it is more like an implementation detail, as any library is free to choose its own naming strategy.

Or, put another way, the JEP has this language:

> If corpus is not desired, use Options B or C.

I'd argue that it sounds like "corpus is not desired" in this case, so I'm fine going with Options B or C and saying that this is still squarely "in scope" for the JEP.

If we do this, maybe we can open an issue about bringing back human-readable names, because I do think it's nice, but it's something that should be done after a more thoughtful process.

@MSeal
Contributor

MSeal commented Apr 1, 2021

I posted a PR with Option C implemented and added some cross-messaging in relevant places so there's some visibility.

Looking at the names remaining post-filtering, enough issues still pop up that I feel we need to move away from name-based ids until a better list is identified for that purpose. @choldgraf, if you'd like to make the issue about adding that back in the future, please go ahead and open it.

@jasongrout
Member

> The main thing I want to avoid is the feeling that we've casually over-ruled a participatory decision made with input from many (or that we take a long time going through a similar process to over-rule it while this problem is still at large).

+1

> That said, I think this particular decision ("should the default behavior be to save things as words or hashes") doesn't deviate in fundamental ways from the JEP; it is more like an implementation detail, as any library is free to choose its own naming strategy.

+1

> If we do this, maybe we can open an issue about bringing back human-readable names, because I do think it's nice, but it's something that should be done after a more thoughtful process.

I think this is the best way forward. The situation is bad enough that an emergency fix to move to random ids is warranted, with an issue opened as you say.

@MSeal
Contributor

MSeal commented Apr 2, 2021

5.1.3 is released with 8-character hash-based ids.
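For reference, an 8-character hash-based id can be obtained by truncating the hex form of a random UUID; a minimal sketch of that general approach (see PR #217 for the actual implementation):

```python
import uuid

def generate_cell_id() -> str:
    """First 8 hex digits of a random UUID: a 16**8 (~4.3e9) id space."""
    return uuid.uuid4().hex[:8]
```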
