New reverse-existence implementation #1370
Also, factor out word lists into module-level variables, for access from elsewhere.
Searches for all (3+-character) words using a simple regexp, then walks the `finditer()` of non-overlapping matches, does some sanity-checking on the candidate string (no digits), and queues an error unless it appears on the list of permitted words.
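A minimal sketch of the walk described above, not the code from this PR. The regexp is the one quoted in the PR description; the function name `find_unlisted_words`, the tuple it returns, and the choice to skip (rather than flag) candidates containing digits are assumptions made for illustration.

```python
import re

# Pattern quoted in the PR description: three or more characters, starting and
# ending with a word character, with apostrophes and hyphens allowed inside.
WORD_RE = re.compile(r"\w[\w'-]+\w")


def find_unlisted_words(text, permitted):
    """Return (start, end, word) for each candidate word not in `permitted`."""
    errors = []
    for match in WORD_RE.finditer(text):  # non-overlapping matches, in order
        word = match.group(0)
        # Sanity check (assumed behavior): ignore candidates containing digits.
        if any(ch.isdigit() for ch in word):
            continue
        # Assumed case-insensitive lookup against a lowercase word set.
        if word.lower() not in permitted:
            errors.append((match.start(), match.end(), word))
    return errors


result = find_unlisted_words("The cat sat on 3rd street", {"the", "cat", "sat"})
```

Here "on" is too short to match the pattern and "3rd" is dropped by the digit check, so only "street" is reported.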
other than this, it looks good. this could be used later as a model for refactoring the way `existence_check` itself works, actually.
proselint/tools.py
if ignore_case:
    permitted = set([word.lower() for word in list])
    allowed_word = functools.partial(
        _case_insensitive_allowed_word, permitted)
else:
    permitted = set(list)
    allowed_word = functools.partial(
        _case_sensitive_allowed_word, permitted
    )
if my previous comment is followed, this can also be simplified.
-if ignore_case:
-    permitted = set([word.lower() for word in list])
-    allowed_word = functools.partial(
-        _case_insensitive_allowed_word, permitted)
-else:
-    permitted = set(list)
-    allowed_word = functools.partial(
-        _case_sensitive_allowed_word, permitted
-    )
+permitted = set([word.lower() for word in list] if ignore_case else list)
+allowed_word = functools.partial(
+    _allowed_word, permitted, ignore_case=ignore_case
+)
Oh, actually I missed your set-conversion tweak... hm, that feels a bit "fun" to parse (for humans), compared to:
if ignore_case:
    permitted = set([word.lower() for word in list])
else:
    permitted = set(list)
I'm open to it, just not sure it's necessary to make future code readers work that hard. Thoughts?
i have never really considered the readability of inline if statements for Python programmers, i just know ternaries are a popular construct in other languages i use, like Rust and TypeScript. additionally, ruff has a rule (derived from flake8-simplify) that prefers ternaries, so i would imagine they're not uncommon in Python as well. without any kind of data on how readable people find them, i can only go by convention.
I tend to reserve them for situations where there's no other choice, like in list comprehensions. (Which can be downright squirrely to read under any circumstances; see my last commit that adds some indentation to try and make my whopper at least semi-readable.) I guess it all comes down to personal comfort; I know people who dislike code like this:
if something:
    return one thing
return the other
...and always want the explicit `else:` in there. And then OTOH there are linters that will flag the explicit `else:` as unnecessary (which of course it is, technically).
¯\_(ツ)_/¯
in some cases i might even use a ternary on the return for the example you just gave, actually. however, we won't do that here. i think we should use a ternary for the highlighted block here, and then when i sort out the refactor i'll see what ruff thinks.
The fact that it baaarely fits in 80 columns makes me indifferent to it, so sure. Done.
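For reference, a rough sketch of the ternary form discussed in this thread. The body and signature of `_allowed_word` are not shown anywhere in the conversation, so the version below is hypothetical, as is the `make_allowed_word` wrapper; the parameter is called `words` here to avoid shadowing the `list` builtin.

```python
import functools


def _allowed_word(permitted, word, ignore_case=True):
    """Hypothetical unified helper: is `word` on the permitted list?"""
    candidate = word.lower() if ignore_case else word
    return candidate in permitted


def make_allowed_word(words, ignore_case=True):
    # Ternary form of the set construction suggested in the review.
    permitted = set([w.lower() for w in words] if ignore_case else words)
    # Bind the word set and the case flag once; callers pass only the word.
    return functools.partial(_allowed_word, permitted, ignore_case=ignore_case)


loose = make_allowed_word(["Apple", "pear"])
strict = make_allowed_word(["Apple", "pear"], ignore_case=False)
```

With these definitions `loose("APPLE")` is true, while `strict("apple")` is false because the case-sensitive set still holds the original spelling.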
Oh, one thing I didn't address from the previous PR is your comment about the size of the word lists in the code. I suppose we could move them to a JSON file, and To retain compatibility with zipfile packaging and the like, it'd probably be advisable to use
I just sketched out a quick attempt at offloading the top1000 word list into a JSON file. Uncompressed,
i appreciate your continued efforts to go the extra mile. since it's really just a list, wouldn't a csv or similar be more efficient than json?
since
@Nytelife26 Unfortunately, the modern interface to It's just a matter of adding a dependency (I can never remember the syntax..):
Hmm, turning the list into a That's for a file that looks like this, which I hate (but I don't have to read it, Python does):
The JSON file looks like this, technically I could squeeze it tighter by telling it to lose the spaces between each item:
["a", "able", "about", "above", "accept", "across", "act", "actually", "add", "admit", "afraid", "after", "afternoon", "again", "against", "age", "ago", "agree", "ah", "ahead", "air", "all", "allow",
Going with CSV also means dealing with
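As a small illustration of the "lose the spaces" idea: `json.dumps` accepts a `separators` argument, and passing `(",", ":")` drops the space after each comma. The five-word list below is just a stand-in for the real word list.

```python
import json

# Stand-in for the real word list.
words = ["a", "able", "about", "above", "accept"]

default = json.dumps(words)                         # items joined with ", "
compact = json.dumps(words, separators=(",", ":"))  # items joined with ","

# One byte saved per separator, i.e. len(words) - 1 in total.
saved = len(default) - len(compact)
```

The saving is one byte per item, so on a list of about a thousand words it amounts to roughly a kilobyte before compression.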
I was doubly wrong, with poetry it's apparently:
importlib-resources = { version = "^6.0", python = "<3.9" }
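A sketch of how code might pair with that conditional dependency, assuming the goal is the `files()` API (in the standard library's `importlib.resources` since Python 3.9, and provided by the `importlib_resources` backport before that). The helper name `read_package_text` is made up for this example, and the stdlib `json` package stands in for a real data-carrying package.

```python
import sys

if sys.version_info >= (3, 9):
    from importlib.resources import files
else:
    # Backport; only installed on older Pythons via the marker above.
    from importlib_resources import files


def read_package_text(package, resource):
    """Read a text resource shipped inside `package` (hypothetical helper)."""
    with files(package).joinpath(resource).open("r", encoding="utf-8") as fh:
        return fh.read()


# Demo against a stdlib package so the sketch runs without extra files.
text = read_package_text("json", "__init__.py")
```

Going through `files()` instead of building a filesystem path from `__file__` is what keeps this working when the package is imported from a zip archive.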
using newlines as field delimiters instead of commas would improve readability and make diffs easier, without changing file size, if that helps. you could argue with the validity of this, because it's comma separated values not newline separated values, but i would argue it makes sense to imagine them as records of one "words" field.
Hmm, I suppose that'd make
sort of. it still produces a list of lists (on my system at least), which should be trivial to flatten, but it's worth noting.
Yeah, I ended up having to do this nonsense which is... whatever. I hate
with files(proselint).joinpath(_CSV_PATH).open('r') as data:
    reader = csv.reader(data)
    wordlist = list()
    for row in reader:
        wordlist.extend(row)
EDIT: (You should've seen me trying to write the files, I kept ending up with output that looked like this...)
...And then cursing. A lot.
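One possible alternative to the explicit loop, offered only as a sketch: `itertools.chain.from_iterable` flattens the one-field rows that `csv.reader` yields. An in-memory `io.StringIO` with made-up contents stands in for the packaged file here.

```python
import csv
import io
from itertools import chain

# Stand-in for the packaged file: one word per line (made-up contents).
data = io.StringIO("a\nable\nabout\n")

rows = csv.reader(data)                     # yields ['a'], ['able'], ['about']
wordlist = list(chain.from_iterable(rows))  # flatten the single-field rows
```

This produces the same flat list as the `extend()` loop; whether it reads better is a matter of taste.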
@Nytelife26
it would be nice if python had a flatten function, like pretty much every other language i use, that's for sure. also, having just tested the build system locally, and reviewing the configuration, all files in
i'm ready to merge this if you have no further comments, suggestions or changes to make.
@Nytelife26 All looks good to me, thanks for all your help with this!
This PR formally submits the implementation of `proselint.tools.reverse_existence_check` which I describe in #1334 (comment). It still builds on the great work @vqle23 did in #1351 (in fact, it builds on the branch from that PR, rebased to current `main` and with my changes added as followup commits.)
A few things have changed from what I described in that issue comment:
- `proselint.tools` is still named `reverse_existence_check()`, I renamed the actual checkers to `restricted.top1000` and `restricted.elementary`, as I felt that was a less unwieldy title/category name that doesn't sacrifice any accuracy/descriptiveness.
- `re.compile(r"\w[\w'-]+\w")`
- `allowed_word` helper function (case-sensitive and not) are now defined statically in the `tools.py` file. The proper one is selected and bound with a `functools.partial()` at the start of each `reverse_existence_check()` call.
(...And I just noticed, I left some type annotations on the arguments to the helpers, that I'll probably have to take out to appease older Pythons. Bother.)
Fixes: #1334