
Where did the adjective/noun corpus come from? #216

Closed
seibert opened this issue Mar 18, 2021 · 24 comments · Fixed by #217

@seibert

seibert commented Mar 18, 2021

While figuring out how the new cell id fields work in format 4.5, I noticed that they are currently being generated from a word corpus saved in these two files:

- nbformat/corpus/adjectives.txt
- nbformat/corpus/nouns.txt

Does anyone know where these word lists came from? There are some entries in there that are going to produce surprising, problematic, or possibly offensive combinations. I don't know how many employers run random word scans on attachments, but someone can quite easily email a notebook to a colleague and be unaware that the JSON could contain strings like:

- special-marijuana
- jewish-holocaust
- naked-librarian
- [I'll stop there]

I would strongly vote for one of:

@MSeal
Contributor

MSeal commented Mar 18, 2021

😬 Those are pretty bad word combinations. Thank you for raising this.

The choice of the list / pattern was made through jupyter/enhancement-proposals#62. I'd be in favor of any of those paths forward, though per the JEP the UUID pattern as a default was vetoed in the original proposal, so the second suggestion would require more buy-in to adjust.

A lot of people reviewed the change, but I'll ping a few of them to hopefully get their input on suggested routes to improve: @willingc @choldgraf @minrk @jhamrick

@choldgraf
Contributor

Ooof, we definitely need to change those lists.

I think my preference would be something auto-generated but with some kind of QC to ensure there's nothing offensive or inappropriate in there. Surely this must be something that other projects have needed and created?

Otherwise, I'd be fine going with a pre-created curated list.

Agreed that making them UUIDs would require more discussion, and I feel like this is a decent-enough problem that we should try to get a fix sooner than later.

@choldgraf
Contributor

I'd be +1 on using the Docker names-generator list. That codebase is Apache 2.0 licensed, which I think means we could re-use it with attribution in a straightforward way, and in a way that wouldn't run afoul of the nbformat license restrictions?

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@choldgraf
Contributor

@bollwyvl I can't remember where it happened, but I think there was discussion in the JEP about keeping it human-readable because people may want to use these names as references. (E.g., in Jupyter Book someone might want to say "include the output from cell ntbk/hocus-pocus over here".)

@bollwyvl
Contributor

bollwyvl commented Mar 19, 2021 via email

@MSeal
Contributor

MSeal commented Mar 19, 2021

> right, if they speak English...

As a default behavior, using English is fine IMO. The spec allows any application to use whatever it wants beyond the defaults, but the rest of the fields and Jupyter interfaces are English-first as default behavior as well.

If we went with hashes, I'd feel we were back in the conversation about UUIDs and would rather use the UUID standard. But it was clear from the JEP that enough voices didn't want difficult-to-remember cell ids as a default behavior, and I think we should respect that decision for now.

> it would also have to include the full text of the Apache 2 license.

Yep.

> I think my preference would be something auto-generated but with some kind of QC to ensure there's nothing offensive or inappropriate in there. Surely this must be something that other projects have needed and created?

I'll try to look into more options over the weekend as well. We can probably also do a manual pass or two on the current list and at least reduce the poor-word-combination space if we don't find a good BSD or more openly licensed list.

@betatim
Member

betatim commented Mar 19, 2021

Three ideas for sources of words:

  1. Google n-grams. Pros: a nearly infinite source of words. Cons: needs curating to remove offensive words; licensing is not clear to me.
  2. The humanhash word list. Pros: simple words, not offensive, "unlicense" license. Cons: too short? https://github.com/zacharyvoase/humanhash/blob/e6bd570b23bcf66a8928b9dd7f297773dedcf264/humanhash.py#L12
  3. The Up-Goer Five list. Pros: simple words, not offensive. Cons: licensing unclear to me; words can be very short.

@seibert
Author

seibert commented Mar 19, 2021

One question I have is: what is the required size of the random ID space? The current system has 1366 adjectives and 3714 nouns, for a total of ~5.1 million combinations. How low can that go and still be acceptable?

It seems like the goal is to be able to randomly generate cell IDs without having to check for collisions with every other cell already in the notebook. The probability of an accidental collision depends on the size of the ID space and the number of cells we expect to see in a given notebook. That puts us into birthday paradox territory, where the probabilities (assuming I have my approximations correct) can become surprisingly large:

| # of unique IDs in space | # of cells in the notebook | probability of random collision |
| --- | --- | --- |
| 5.07e6 | 200 | 0.39% |
| 5.07e6 | 300 | 0.88% |
| 5.07e6 | 1000 | 9.3% |

Given that, I'm not even sure 5 million combinations is sufficient, which would rule out the Docker adjective-name list (which has about 25,000 combinations).

The humanhash word list looks promising if we use four of its 256 words, which gives 256**4 = 2**32 combinations, cutting collision probabilities to (maybe?) negligible levels:

| # of unique IDs in space | # of cells in the notebook | probability of random collision |
| --- | --- | --- |
| 4.29e9 | 200 | 0.0004% |
| 4.29e9 | 300 | 0.001% |
| 4.29e9 | 1000 | 0.01% |

Not sure if kilo-speaker-ten-mike is too unwieldy, but it is at least memorable.

(Alternatively, a much smaller ID space could be used if we can rely on tools to keep a list of already used IDs in the notebook and explicitly avoid collisions. I assume that is not desirable, though.)
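For reference, these tables follow from the standard birthday-paradox approximation p ≈ 1 − exp(−k(k−1)/(2n)) for k cells drawn from a space of n ids; a quick back-of-the-envelope script (illustrative only, not part of nbformat) reproduces the numbers:

```python
import math

def collision_probability(n_ids: int, n_cells: int) -> float:
    """Approximate probability that at least two of n_cells randomly
    drawn ids collide, given a space of n_ids (birthday paradox)."""
    return 1 - math.exp(-n_cells * (n_cells - 1) / (2 * n_ids))

# Word-pair space (1366 adjectives x 3714 nouns) vs. four humanhash words (2**32)
for n in (1366 * 3714, 2**32):
    for k in (200, 300, 1000):
        print(f"{n:.2e} ids, {k:4d} cells -> {collision_probability(n, k):.4%}")
```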

@MSeal
Contributor

MSeal commented Mar 22, 2021

@seibert The library retries id assignment if it detects a conflict at the moment, because the cost of doing so isn't high. The main idea with the default state-space size is to avoid collisions happening often and triggering a warning to the client.

That being said, a very small list might be difficult to differentiate even with multiple word-pairs: e.g., kilo-speaker-kilo vs speaker-kilo-speaker isn't ideal, but it's also not offensive to anyone.

Another option is for us to do a stricter word filter on the existing lists -- something like https://www.cs.cmu.edu/~biglou/resources/ (thanks @choldgraf for looking around at the options therein).
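For illustration, the retry-on-conflict assignment described above might look roughly like this (a hypothetical sketch; the names and warning text are not nbformat's actual API):

```python
import warnings

def assign_unique_id(existing_ids: set, generate_id, max_attempts: int = 100) -> str:
    """Draw random ids until one doesn't collide with the ids already
    present in the notebook; warn the client if we keep colliding."""
    for _ in range(max_attempts):
        candidate = generate_id()
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            return candidate
    warnings.warn("Could not generate a non-colliding cell id")
    return generate_id()
```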

@choldgraf
Contributor

FWIW, it seems like filtering out any adjectives / nouns in our list that are also in the CMU list would result in removing about 50 adjectives and 100 nouns:


Code in case it's helpful:

```python
import pandas as pd

# Current nbformat word lists
url_adj = "https://raw.githubusercontent.com/jupyter/nbformat/master/nbformat/corpus/adjectives.txt"
url_noun = "https://raw.githubusercontent.com/jupyter/nbformat/master/nbformat/corpus/nouns.txt"
# CMU "bad words" list
url_bad = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"

# Each file is one word per line; squeeze each single-column frame to a Series
adj = pd.read_csv(url_adj, header=None).squeeze()
noun = pd.read_csv(url_noun, header=None).squeeze()
bad = pd.read_csv(url_bad, header=None).squeeze()

# Keep only the words that do not appear in the bad-words list
adj_ok = adj[~adj.isin(bad.values)]
noun_ok = noun[~noun.isin(bad.values)]

print(f"Original adjectives: {adj.shape}")
print(f"Adjectives after filter: {adj_ok.shape}")
print(f"Original nouns: {noun.shape}")
print(f"Nouns after filter: {noun_ok.shape}")
```

@MSeal
Contributor

MSeal commented Mar 29, 2021

Let's go with the changes @choldgraf proposed here this week, as it's a clear improvement that prevents the immediately identified problems, and I don't see any controversy in the change beyond the fact that we may need further filtering later on.

@blois

blois commented Mar 31, 2021

Looking through the entries that remain after filtering with the CMU bad-words list, I still see a number of possible combinations that could be interpreted as inappropriate. The noun list contains a number of nationalities / localities, and random adjectives are being applied to them.

Colab uses a random sequence of 12 characters chosen from the 64-character alphabet ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_, with no additional checking for appropriateness. In years of use I cannot remember any inappropriate results, and these ids are shown in the URL bar.

If human readability is a goal, then perhaps allow the id to be edited?

Here's a comparison of generated IDs: https://gist.github.com/blois/1f6644fa74cbba5c3be5e998c09278e9#file-cell-ids-ipynb
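For reference, a Colab-style generator along the lines described above is only a few lines (an illustrative sketch, not Colab's actual code):

```python
import secrets
import string

# 26 + 26 + 10 + 2 = 64 characters, as described above
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def random_cell_id(length: int = 12) -> str:
    """Random id over a 64-character alphabet: 64**12 = 2**72 possibilities."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```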

@jasongrout
Member

I'm +1 on having random character strings. It's impossible to tell whether a combination of English words will become offensive based on current news topics, etc., and it just seems much better to avoid the issue altogether.

@westurner
Contributor

westurner commented Mar 31, 2021 via email

@choldgraf
Contributor

choldgraf commented Mar 31, 2021

Could I suggest that we implement the suggestion @MSeal gave here (#216 (comment)) and, for now, just remove the subset of potentially problematic words described above, as a clear improvement over what we have now? If people would like to discuss generating randomized strings, I think we should do that in another issue as a broader, potential long-term improvement.

I think random character strings are interesting but will require more community discussion and buy-in before implementing; in the meantime we should address the immediate issue of a semi-high likelihood of problematic two-word combinations.

@blois

blois commented Mar 31, 2021

Examples of nouns in the list: africa, european, asian, chinese, france, germany, spanish, african, and many others. And random adjectives are being picked to prefix them with, among them homeless, hungry, fatty, awful, crude, attractive, offensive. This is just from a quick glance.

I think this wordlist approach should be removed with the utmost urgency.

@MSeal
Contributor

MSeal commented Mar 31, 2021

> I think this wordlist approach should be removed with the utmost urgency.

The original proposal had UUIDs, but there was resistance in the community calls around the topic; I think hashes are just the same UUID pattern with a smaller state space. See Options B & C from the JEP. The root pushback was that a UUID (a) felt like too large an identifier by default and (b) was not human-differentiable at a glance when put in URLs or read in the raw JSON. I'm unsure how we get quick consensus on backing out of that decision to change to a hash format as the default. This wouldn't change the spec in any way, just the library's fall-through behavior.

Heroku, what3words, and a few other applications have word lists they use in combination, but they clearly put more thought into their arrangement / listing. That part's on me for not researching well enough during the JEP.

Would releasing the filtered list immediately, to improve the situation while we discuss a more permanent change, be helpful / productive here?

@MSeal
Contributor

MSeal commented Mar 31, 2021

> I'm unsure how we get quick consensus on backing out of that decision to change to a hash format as the default.

If we do go that route, we may have to just ask for forgiveness.

@choldgraf
Contributor

choldgraf commented Apr 1, 2021

The main thing I want to avoid is the feeling that we've casually over-ruled a participatory decision made with input from many (or that we take a long time going through a similar process to over-rule it while this problem is still at large).

That said, I think this particular decision ("should the default behavior be to save things as words or hashes") doesn't deviate in fundamental ways from the JEP; it is more like an implementation detail, as any library is free to choose its own naming strategy.

Or, put another way, the JEP has this language:

> If corpus is not desired, use Options B or C.

I'd argue that it sounds like "corpus is not desired" in this case, so I'm fine going with Options B or C and saying that this is still squarely "in scope" for the JEP.

If we do this, maybe we can open an issue about bringing back human-readable names, because I do think it's nice, but it's something that should be done after a more thoughtful process.

@MSeal
Contributor

MSeal commented Apr 1, 2021

I posted a PR with Option C implemented and added some cross-messaging in relevant places so there's some visibility.

Looking at the names remaining post-filtering, enough issues still pop up that I feel we need to move away from name-based ids until a better list is identified for that purpose. @choldgraf, if you'd like to make the issue about adding that back in the future, please go ahead and open it.

@jasongrout
Member

> The main thing I want to avoid is the feeling that we've casually over-ruled a participatory decision made with input from many (or that we take a long time going through a similar process to over-rule it while this problem is still at large).

+1

> That said, I think this particular decision ("should the default behavior be to save things as words or hashes") doesn't deviate in fundamental ways from the JEP; it is more like an implementation detail, as any library is free to choose its own naming strategy.

+1

> If we do this, maybe we can open an issue about bringing back human-readable names, because I do think it's nice, but it's something that should be done after a more thoughtful process.

I think this is the best way forward. The situation is bad enough that an emergency fix to move to random ids is warranted, with an issue opened as you say.

@MSeal
Contributor

MSeal commented Apr 2, 2021

5.1.3 is released with 8-character hash-based ids.
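For reference, an 8-character hash-based id can be obtained by truncating the hex form of a random UUID; a minimal sketch of that general approach (see PR #217 for the actual implementation):

```python
import uuid

def generate_cell_id() -> str:
    """First 8 hex digits of a random UUID: a 16**8 (~4.3e9) id space."""
    return uuid.uuid4().hex[:8]
```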
