-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
checks: Remove duplicate items from lists #1369
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good finds, thank you for the contribution.
i like your sense of humour, but i will indeed go through and remove duplicates in other lists as part of this.
Well, my thinking in leaving them was: Where the lists come from different external sources (as is the case in all of those left-over duplicates), how do you choose which one keeps the term, and which one doesn't? (And then either way, one list is no longer in sync with its source.) |
But if you want the lists of remaining, cross-list duplicates, they are:
"between a rock and a hard place",
"conspicuous by its absence",
"crystal clear",
"easier said than done",
"gilding the lily",
"horns of a dilemma",
"last but not least",
"moment of truth",
"par for the course",
"thick as thieves",
["blend", ["blend together"]],
["Challah", ["Challah bread"]],
["collaborate", ["collaborate together"]],
["combine", ["combine together"]],
["cooperation", ["mutual cooperation"]],
["fundamentals", ["basic fundamentals"]],
["gift", ["free gift"]],
["innovation", ["new innovation"]],
["merge", ["merge together"]],
["mix", ["mix together"]],
["mutual respect", ["mutual respect for each other"]],
["plans", ["future plans"]],
["potable water", ["potable drinking water"]],
["refer", ["refer back"]],
["retreat", ["retreat back"]],
["surrounded", ["surrounded on all sides"]],
["veteran", ["former veteran"]],
["visible", ["visible to the eye"]], |
i am currently waiting to accrue the funds to obtain the sources myself (my university library does not have them unfortunately), which would make things easier, but you're right. i was considering aggregating the lists so they run by type of check rather than by source, and using comments to keep track, but that might get messy. i'll come up with something. |
That's certainly an option. I mentioned in #1334 that in testing my implementation of reverse-existence, I'd factored the lists of terms out of the check functions so that I could access them externally, specifically in the REPL when I was testing/debugging/benchmarking interactively. That's useful enough that I'd suggest it as an across-the-board change for all of the checking functions — instead of holding their lists internally, move them out to module-level variables which can be accessed from other code. So, where now the "template" is: def some_check_fn(args):
terms = [
"foo",
"bar",
"baz",
]
return some_tool_function(text, terms, blah, blah) My suggestion would be to go with: SOME_CHECK_TERMS = [
"foo",
"bar",
"baz",
]
def some_check_fn(args):
terms = SOME_CHECK_TERMS
return some_tool_function(text, terms, blah, blah) The major advantage there is that you can use But it would also permit combining multiple lists together1 and de-duplicating them, e.g.: FIRST_SOURCE_TERMS = [...]
SECOND_SOURCE_TERMS = [...]
def some_check_fn():
terms = list(set(FIRST_SOURCE_TERMS + SECOND_SOURCE_TERMS))
return some_tool_function(text, terms, blah, blah) Do that, it doesn't matter if the source lists contain redundancies, they'll be dropped in building the final terms list. Notes
|
i like that solution, actually. it could simplify testing, too. i think the goal should eventually be to refactor checks to merely return a list of regex-enabled sequences along with their options, and run the existence check once so it can search the source for everything in as few passes as possible, like what you did with |
While reading through one of the
proselint/checks/*/misc.py
files, I happened to notice a duplicate item on its list of matching terms.So I ran a quick script to look for duplicate items in all of the
misc.py
files, spotted a couple more that way, and removed those as well.Redundancies NOT fixed
Both
cliches/misc.py
andredundancy/misc.py
contain several more duplicates, by virtue of the same term appearing on multiple different lists of cliches/redundancies — those, I didn't touch. This is only about duplicates on the same list.(Plus, the irony of having redundant terms in the redundancies check is inherently entertaining. Enough to justify leaving those alone anyway.)