-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Where did the adjective/noun corpus come from? #216
Comments
😬 those are pretty bad word combinations. Thank you for raising. The choice was made for the list / pattern through: jupyter/enhancement-proposals#62. I'd be in favor of any of those paths forward, though from the JEP the UUID pattern as a default was veto'd in the original proposal so the second suggestion would require more buy-in to adjust. There were a lot of people on the review of the change but I'll ping a few of those to hopefully give their input on suggested routes to improve: @willingc @choldgraf @minrk @jhamrick |
Ooof we definitely need to change those lists. I think my preference would be something auto-generated but with some kind of QC to ensure there's nothing offensive or inappropriate in there. Surely this must be something that other projects have needed and created? Otherwise, I'd be fine going with a pre-created curated list. Agreed that making them UUIDs would require more discussion, and I feel like this is a decent-enough problem that we should try to get a fix sooner than later. |
I'd be +1 on using the Docker names generator list. That codebase is apache 2.0 licensed which I think means that we could re-use it w/ attribution in a straightforward way, and in a way that wouldn't deviate from the nbformat license restrictions? |
it would also have to include the full text of the apache 2 license.
|
it seems like something based loosely on builtin hashlib would work, e.g.
given no ids whatsoever
- id 0: a random number, or a hash of the content of the cell
- id 1: hash of id 0
- id 2: hash of id 1
it wouldn't be a full merkle dag/tree/trie, but at least would be _very_
unlikely to generate collisions, and would use existing well-understood
mechanism, and would not require adding a large string corpus and/or
license... are we really anticipating people will ever read these? if they
do ever hit dom IDs (shudder), it is important they don't start with a
number, but otherwise...
|
@bollwyvl i can't remember where it happened but i think there was discussion in the jep about keeping it human readable because people may work to use these names as references. (Eg in jupyter book someone's might want to say "include the output from cell ntbk/hocus-pocus over here") |
right, if they speak english...
|
As a default behavior using English is fine imo. The spec allows any application to use whatever they want beyond the defaults, but the rest of the fields and Jupyter interfaces are English first as default behavior as well. If we went with hashes I'd feel we're back into the conversation about UUIDs and would rather use the UUID standard, but it was clear from the JEP that enough voices didn't want difficult to remember cell ids as a default behavior and I think we should respect that decision for now.
Yep.
I'll try to look into more options over the weekend as well. We can probably also do a manual pass or two on the current list and at least reduce the poor word combination space if we don't find a good BSD or more open licensed list. |
Three ideas for sources of words:
|
One question I have is what is the requirement for the size of the random ID space? The current system has 1366 adjectives and 3714 nouns, for a total of ~5.1 million combinations. How low can that go and still be acceptable? It seems like the goal is to be able to randomly generate cell IDs without having to go check for collisions with every other cell in the notebook already. The probability of an accidental collision depends on the size of the ID space and the number of cells we expect to see in a given notebook. That puts us into Birthday Paradox territory, where the probabilities (assuming I have my approximations correct) can become surprisingly large:
Given that, I'm not even sure 5 million combinations is sufficient, which would rule out the Docker adjective-name list (which has about 25,000 combinations). The humanhash word list looks promising if we use four of them, which gives 2**32 combinations, cutting collision probabilities to (maybe?) negligible levels:
Not sure if (Alternatively, a much smaller ID space could be used if we can rely on tools to keep a list of already used IDs in the notebook and explicitly avoid collisions. I assume that is not desirable, though.) |
@seibert The library retries id assignment if it detects a conflict atm because the cost isn't high to do so. The main idea with the default state space size is to avoid it happening often and triggering a warning to the client. That being said a very small list might be difficult to differentiate even with multiple word-pairs: e.g. Another option is for us to do a stricter word filter on the existing lists -- something like https://www.cs.cmu.edu/~biglou/resources/ (thanks @choldgraf for looking around at options there in). |
FWIW, it seems like filtering out any adjectives / nouns in our list that are also in the CMU list would result in removing about 50 adjectives and 100 nouns: Code in case it's helpful
|
Let's go with the changes @choldgraf proposed here this week, as it's a clear improvement to prevent immediately identified problems and I don't see any controversy in the change beyond that we may need further filtering later on. |
Looking through the entries in the CMU bad words list I still see a number of possible combinations that could be interpreted as inappropriate. The list contains a number of nationalities / localities and is applying random adjectives to them. Colab uses a random sequence of 12 characters chosen from the 64 characters - If human readable is a goal then perhaps allow this to be edited? Here's a comparison of generated IDs- https://gist.github.com/blois/1f6644fa74cbba5c3be5e998c09278e9#file-cell-ids-ipynb |
I'm +1 on having random character strings. It's impossible to tell if a combination of english words will become offensive based on current news topics, etc, and it just seems much better to avoid the issue altogether. |
Might I suggest base58 (or base < 58) if there is any need to visually
compare or read cell id hashes without a central registry aloud in order to
avoid lexical ambiguities?
Edit: https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58
|
Could I suggest that we implement the suggestion that @MSeal gave here: #216 (comment) and for now just remove the subset of potentially problematic words described above as a clear improvement on what we have now. If people would like to discuss generating randomized strings, I think we should discuss in another issue as a broader / potential long-term improvement. I think that random character strings are interesting but will require more community discussion and buy-in before implementing, and in the meantime we should improve the immediate issue of "semi-high likelihood of problematic two-word combinations" |
Examples of nouns in the list- I think this wordlist approach should be removed with utmost urgency. |
The original proposal had UUIDs, but there was community resistance in the community calls around the topic. I think hashes are just the same UUID pattern with smaller state space. See options B & C from the JEP. The root push back was that it A) felt to be too large of an identifier by default and B) not human differentiable at a glance when put in URLs / read in the raw JSON. I'm unsure how we get quick consensus on backing out on this decision to change to a hash format as the default. This wouldn't change the spec in anyway, just the library's fall-through behavior. Heroku, What-three-words, and a few other applications have word lists they use in combination but they clearly had more thought put into their arrangement / listing. That parts on me for not researching well enough in the JEP. Would the filtered list as an immediate release to improve the situation while we discuss a more permanent change be helpful / productive here? |
If we do go that route we may have to just ask for forgiveness. |
The main thing I want to avoid is feeling that we've casually over-ruled a participatory decision that has been made with input from many (or, that we take a long time going through a similar process to over-rule it while this problem is still at-large). That said, I think that this particular decision ("should the default behavior be to save things as words or hashes") doesn't deviate in fundamental ways from the JEP, it is more like an implementation detail as any library is free to choose their own naming strategy. Or put another way, in the JEP we have this language:
I'd argue that it sounds like "corpus is not desired" in this case, and so I'm fine going with Options B or C and saying that this is still squarely "in-scope" for the JEP. If we do this, maybe we can open an issue about bringing back human-readable names because I do think it's nice, but something that should be done after a more thoughtful process. |
I posted a PR with Option C implemented and added some cross messaging in relevant places so there's some visibility. Looking at the remaining names post-filtering there's still enough issues that pop up I feel we need to move away from name based ids until a better list is identified for that purpose. @choldgraf if you'd like to make the issue to add that back in for the future please go ahead and open that. |
+1
+1
I think this is the best way forward. I think the situation is bad enough that an emergency fix to move to random ids is warranted, and an issue opened up as you say. |
5.1.3 is released with 8 character hash-based ids |
While figuring out how the new cell
id
fields work in format 4.5, I noticed that they are currently being generated from a word corpus saved in these two files:Does anyone know where these word lists came from? There are some entries in there that are going to produce surprising, problematic, or possibly offensive combinations. I don't know how many employers do random word scans on attachments, but someone can quite easily email a notebook to a colleague, and be unaware that the JSON could contain strings like:
special-marijuana
jewish-holocaust
naked-librarian
I would strongly vote for one of:
The text was updated successfully, but these errors were encountered: