str.isspace should use the Unicode White_Space property #62436

abalkin · 2013-06-17T01:06:31Z

BPO	18236
Nosy	@malemburg, @loewis, @terryjreedy, @abalkin, @vstinner, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka, @animalize, @gnprice
PRs	bpo-18236: Adjust str.isspace to use Unicode's White_Space property. #16254
Files	42973dfea391.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2013-06-17.01:06:30.764>
labels = ['interpreter-core', 'type-bug', '3.9', 'expert-unicode']
title = 'str.isspace should use the Unicode White_Space property'
updated_at = <Date 2019-09-18.06:02:16.653>
user = 'https://github.com/abalkin'

bugs.python.org fields:

activity = <Date 2019-09-18.06:02:16.653>
actor = 'Greg Price'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core', 'Unicode']
creation = <Date 2013-06-17.01:06:30.764>
creator = 'belopolsky'
dependencies = []
files = ['30680']
hgrepos = ['201']
issue_num = 18236
keywords = ['patch']
message_count = 18.0
messages = ['191303', '191612', '191647', '191648', '191649', '191650', '191652', '191687', '191689', '191706', '191739', '191746', '221919', '230183', '349213', '349214', '349233', '350414']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'terry.reedy', 'belopolsky', 'vstinner', 'ezio.melotti', 'r.david.murray', 'martin.panter', 'serhiy.storchaka', 'malin', 'Greg Price']
pr_nums = ['16254']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue18236'
versions = ['Python 3.9']

abalkin · 2013-06-17T01:06:30Z

ASCII information separators, hex codes 1C through 1F are classified as space:

>>> all(c.isspace() for c in '\N{FS}\N{GS}\N{RS}\N{US}')
True

but int()/float() do not accept strings with leading or trailing separators:

>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably because corresponding bytes values are not classified as whitespace:

>>> any(c.encode().isspace() for c in '\N{FS}\N{GS}\N{RS}\N{US}')
False

terryjreedy · 2013-06-21T21:54:51Z

You stated facts: what is your proposal?

The fact that unicode calls characters 'space' does not make then whitespace as commonly understood, or as defined by C, or even as defined by the Unicode database. Unicode apparently has a WSpace property. According to the table in
https://en.wikipedia.org/wiki/Whitespace_%28computer_science%29
1C - 1F are not included by that definition either. For ascii chars, that table matches the C definition, with \r included.

So I think your implied proposal to treat them as whitespace (in strings but not bytes) should be rejected as invalid. For 3.x, the manual should specify that it follows the C definition of 'whitespace' (\r included) for bytes and the extended unicode definition for strings.

>>> int('3\r')
3
>>> int('3\u00a0')
3
>>> int('3\u2000')
3
>>> int(b'3\r')
3
>>> int(b'3\u00a0')
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    int(b'3\u00a0')
ValueError: invalid literal for int() with base 10: '3\\u00a0'

abalkin · 2013-06-22T16:33:59Z

You stated facts: what is your proposal?

There is a bug somewhere. We cannot simultaneously have

>>> '\N{RS}'.isspace()
True

and not accept '\N{RS}' as whitespace when parsing numbers.

I believe int(x) should be equivalent to int(x.strip()). This is not the case now:

>>> '123\N{RS}'.strip()
'123'
>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

The reason I did not clearly state my proposal is because I am not sure whether bytes.isspace or str.isspace is correct, but I don't see any justification for having them defined differently in the ASCII range.

terryjreedy · 2013-06-22T17:15:06Z

I see your point now. Since RS is not whitespace by any definition I knew of previously, why is RS.isspace True?

Apparent answer: Doc says '''Return true if there are only whitespace characters in the string and there is at least one character, false otherwise. Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”.''' I suspect this is a more expansive definition than WSpace chars, which seems to be the one used by int(), but you could check the int code.

Bytes docs says: "Whenever a bytes or bytearray method needs to interpret the bytes as characters (e.g. the is...() methods, split(), strip()), the ASCII character set is assumed (text strings use Unicode semantics)."

This says to me that str.isxxx and bytes.isxxx should match on ascii chars and not otherwise. That would happen is the bytes methods check for all ascii and decoded to unicode and used str method. Since they do not match, bytes must do something different.

I think there is one definite bug: the discrepancy between str.isspace and bytes.isspace. There is possibly another bug: the discrepancy between 'whitespace' for str.isspace and int/float. After pinning down the details, I think you should ask how to resolve these on py-dev, and which versions to patch.

abalkin · 2013-06-22T17:55:10Z

It looks like str.isspace() is incorrect. The proper definition of unicode whitespace seems to include 26 characters:

# ================================================

0009..000D ; White_Space # Cc [5] <control-0009>..<control-000D>
0020 ; White_Space # Zs SPACE
0085 ; White_Space # Cc <control-0085>
00A0 ; White_Space # Zs NO-BREAK SPACE
1680 ; White_Space # Zs OGHAM SPACE MARK
180E ; White_Space # Zs MONGOLIAN VOWEL SEPARATOR
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
2028 ; White_Space # Zl LINE SEPARATOR
2029 ; White_Space # Zp PARAGRAPH SEPARATOR
202F ; White_Space # Zs NARROW NO-BREAK SPACE
205F ; White_Space # Zs MEDIUM MATHEMATICAL SPACE
3000 ; White_Space # Zs IDEOGRAPHIC SPACE

# Total code points: 26

http://www.unicode.org/Public/UNIDATA/PropList.txt

Python's str.isspace() uses the following definition: "Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”."

Information separators are swept in because they have bidirectional property "B":

>>> unicodedata.bidirectional('\N{RS}')
'B'

See also bpo-10587.

abalkin · 2013-06-22T18:19:51Z

I did a little more investigation and it looks like information separators have been included in whitespace since unicode type was first implemented in Python:

guido 11967 Fri Mar 10 22:52:46 2000 guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 guido 11967 Fri Mar 10 22:52:46 2000 guido 11967 Fri Mar 10 22:52:46 2000 guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 +0000: guido 11967 Fri Mar 10 22:52:46 2000 guido 11967 Fri Mar 10 22:52:46 2000 +0000: +0000: /* Returns 1 for Unicode characters having the type 'WS', 'B' or 'S',
0 otherwise. */
+0000:
+0000: int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
+0000: {
switch (ch) {
case 0x0009: /* HORIZONTAL TABULATION */
case 0x000A: /* LINE FEED */
case 0x000B: /* VERTICAL TABULATION */
case 0x000C: /* FORM FEED */
case 0x000D: /* CARRIAGE RETURN */
case 0x001C: /* FILE SEPARATOR */
case 0x001D: /* GROUP SEPARATOR */
case 0x001E: /* RECORD SEPARATOR */
case 0x001F: /* UNIT SEPARATOR */
case 0x0020: /* SPACE */
case 0x1680: /* OGHAM SPACE MARK */
case 0x2000: /* EN QUAD */
case 0x2001: /* EM QUAD */
case 0x2002: /* EN SPACE */
case 0x2003: /* EM SPACE */
case 0x2004: /* THREE-PER-EM SPACE */
case 0x2005: /* FOUR-PER-EM SPACE */
case 0x2006: /* SIX-PER-EM SPACE */
case 0x2007: /* FIGURE SPACE */
case 0x2008: /* PUNCTUATION SPACE */
case 0x2009: /* THIN SPACE */
case 0x200A: /* HAIR SPACE */
case 0x2028: /* LINE SEPARATOR */
case 0x202F: /* NARROW NO-BREAK SPACE */
case 0x3000: /* IDEOGRAPHIC SPACE */
return 1;
default:
return 0;
}
+0000: }

(hg blame -u -d -n -r 11967 Objects/unicodectype.c)

abalkin · 2013-06-22T18:35:27Z

Martin v. Löwis wrote at bpo-13391 (msg147634):

I do think that _PyUnicode_IsWhitespace should use the White_Space
property (from PropList.txt). I'm not quite sure how they computed
that property (or whether it's manually curated). Since that's a
behavioral change, it can only go into 3.3.

I am adding Martin and Ezio to the "nosy."

loewis · 2013-06-23T07:59:48Z

I stand by that comment: IsWhiteSpace should use the Unicode White_Space property. Since FS/GS/RS/US are not in the White_Space property, it's correct that the int conversion fails. It's incorrect that .isspace() gives true.

There are really several bugs here:

.isspace doesn't use the White_List property
int conversion ultimately uses Py_ISSPACE, which conceptually could deviate from the Unicode properties (as it is byte-based). This is not really an issue, since they indeed match.

I propose to fix this by parsing PropList.txt, and generating _PyUnicode_IsWhitespace based on the White_Space property. For efficiency, it should also generate a fast-lookup array for the ASCII case, or just use _Py_ctype_table (with a comment that this table needs to match PropList White_Space). _Py_ascii_whitespace should go.

Contributions are welcome.

malemburg · 2013-06-23T10:54:21Z

I agree with Martin.

At the time Unicode was added to Python, there was no single Unicode property for white space, so I had to deduce this from the other available properties.

Now that we have a white space property in Unicode, we should start using it. Fortunately, the difference in Python's set of white space chars and the ones having the Unicode white space property are minimal.

abalkin · 2013-06-23T15:15:05Z

I have updated the title to focus this issue on the behavior of str.isspace(). I'll pick up remaining int/float issues in bpo-10581.

abalkin · 2013-06-23T23:34:30Z

I would like someone review this change:

https://bitbucket.org/alexander_belopolsky/cpython/commits/92c187025d0a8a989d9f81f2cb4c96f4eecb81cb?at=issue-18236

The patch can go in without this optimization, but I think this is the right first step towards removing _Py_ascii_whitespace.

I don't think there is a need to generate ASCII optimization in makeunicodedata. While technically Unicode stability policy only guarantees that White_Space property will not be removed from code point s that have it, I think there is little chance that they will ever add White_Space property to another code point in the ASCII range and if they do, I am not sure Python will have to follow.

loewis · 2013-06-24T05:52:28Z

-1 on that patch. It's using trickery to implement the test, and it's not clear that it is actually as efficient as the previous version. The previous version was explicitly modified to use a table lookup for performance reasons.

I'd be fine with not generating PyUnicode_IsWhiteSpace at all, but instead hand-coding it. I suspect that we might want to use more of PropList at some point, so an effort to parse it might not be wasted.

abalkin · 2014-06-29T23:48:58Z

For future reference, the code discussed above is in the following portion of the patch:

-#define Py_UNICODE_ISSPACE(ch) \
-    ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))
+#define Py_UNICODE_ISSPACE(ch)                                          \
+    ((ch) == ' ' ||                                                     \
+     ((ch) < 128U ? (ch) - 0x9U < 5U : _PyUnicode_IsWhitespace(ch)))

vadmium · 2014-10-28T22:40:06Z

As uncovered in bpo-12855, str.splitlines() currently considers the FS, GS and RS (1C–1E), but not the US (1F), to be line breaks. It might be surprising if these are no longer considered white space but are still considered line breaks.

gnprice · 2019-08-08T04:51:19Z

I've gone and made a patch for this change:
gnprice@7dab9d879

Most of the work happens in the script Tools/unicode/makeunicode.py , and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. So I'd like to try to merge those improvements first.

I've filed bpo-37760 for those preparatory changes, and posted several PRs (GH-15128, #59334, #59335) as bite-sized pieces. These PRs can go in in any order.

Please take a look! Reviews appreciated.

animalize · 2019-08-08T05:16:40Z

Greg, could you try this code after your patch?

>>> import re
>>> re.match(r'\s', '\x1e')
<re.Match object; span=(0, 1), match='\x1e'>  # <- before patch

gnprice · 2019-08-08T13:21:56Z

Good question! With the patch:

>> import re
>> re.match(r'\s', '\x1e')
>>

In other words, the definition of the regexp r'\s' follows along. Good to know.

gnprice · 2019-08-25T00:33:01Z

I've gone and made a patch for this change

Update:

The preparatory changes in bpo-37760 are now almost all merged; random.sample() docs unclear on k < len(population) #59470 is the one piece remaining, and I'd be grateful for a review.

It's a generally straightforward and boring change that converts the main data structures of makeunicodedata.py from using length-18 tuples as records to using a dataclass, which I think makes subsequent changes that add features to that script much easier both to write and to review.

I have a slightly updated version of the fix itself, which differs mainly by adding a test: gnprice@9b3bf6739 Comments welcome there too.

larryhastings · 2023-09-18T16:18:38Z

I've done a ridiculous amount of research into this in the last week or two. At long last, I've written a script to produce a canonical answer. It parses the (XML version of the) Unicode database, producing a list of all whitespace and line-breaking whitespace characters. It then compares those lists to the list of characters the Python str and bytes objects consider to be whitespace and line-breaking whitespace characters.

I found three places where there's disagreement. So far this issue has discussed two of those places:

the str object considers U+001C through U+001F to be whitespace, and it shouldn't, and
the str object considers U+001C through U+001E to be line-breaking whitespace, and it shouldn't.

Here's the third, which I don't think so far has been mentioned here:

the bytes object doesn't consider '\v' (vertical tab) and '\f' (form feed) to be whitespace, and it should.

I've attached my script here:
parse_ucd_xml.zip
(Github doesn't permit attaching raw Python scripts, so I've attached it in a zip file.)

serhiy-storchaka · 2023-09-18T16:44:09Z

How to attach a file?

malemburg · 2023-09-18T16:54:42Z

As mentioned a longer while ago (in 2013, some ten years ago), I believe that we should switch Python to stick to the Unicode definitions of whitespace and line breaks.

The question is: How can we do this in a way which doesn't break too much code out there ?

Larry's investigation shows that the "separator" code points are causing trouble. I wonder how often these are used in real-life data.

Looking at the docs, \v and \f were added to the list of line breaks in 3.2 without causing major turmoil. It's likely that the even less commonly used separator codes won't cause much trouble either, so perhaps we could consider the fix a "bug fix".

These are the official line break code points in Unicode: https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt
And these the official white space code points: https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt (first section of the file)

larryhastings · 2023-09-19T00:38:45Z

How to attach a file?

"Attach files by dragging & dropping, selecting or pasting them." Drag & drop worked for me.

Larry's investigation shows that the "separator" code points are causing trouble.

I wouldn't characterize it like that. I don't know if they're causing actual trouble for anybody, because I believe nobody uses those characters in real life. As far as I know, the only problem they cause is to make Python's Unicode object disobey the Unicode standard, in a way I believe is harmless for all practical purposes.

But I also believe that fixing the Unicode object to bring it into spec will also be harmless, and we'll all sleep that much better at night ;-)

Looking at the docs, \v and \f were added to the list of line breaks in 3.2 without causing major turmoil.

Do you mean, added to the list of line breaks for the str object? They still aren't in the list of line breaks for the bytes object; it only recognizes '\r' and '\n' as line breaks.

By the way, in the intervening years since previous correspondents posted proposed patches, all those patch links have died. Bitbucket is long out of the Mercurial game, and clicking on gnprice's patch links go to 404 pages.

larryhastings · 2023-09-19T11:32:17Z

By the way, I did find those .txt files, but I couldn't understand them. My script instead reads the XML version of the Unicode database, which has every annotation for every character. As long as you have a reasonable XML library it's dead easy to deal with. Whitespace is noted as WSpace, either "Y" or "N". Line-breaking is noted as lb, which has many different values; you need the Unicode Line-Breaking Algorithm document to decode those.

ericvsmith · 2023-09-21T08:11:41Z

I think the three issues that @larryhastings identified should be fixed. I think they're bugs, but I don't think they should be backported.

abalkin added the type-bug An unexpected behavior, bug, or error label Jun 17, 2013

terryjreedy added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jun 21, 2013

terryjreedy added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Jun 22, 2013

abalkin self-assigned this Jun 23, 2013

abalkin changed the title ~~int() and float() do not accept strings with trailing separators~~ str.isspace should use the Unicode White_Space property Jun 23, 2013

abalkin added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Jun 23, 2013

abalkin removed their assignment Jun 29, 2014

gnprice added the 3.9 only security fixes label Aug 8, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

str.isspace should use the Unicode White_Space property #62436

str.isspace should use the Unicode White_Space property #62436

abalkin commented Jun 17, 2013

abalkin commented Jun 17, 2013

terryjreedy commented Jun 21, 2013

abalkin commented Jun 22, 2013

terryjreedy commented Jun 22, 2013

abalkin commented Jun 22, 2013

abalkin commented Jun 22, 2013

abalkin commented Jun 22, 2013

loewis mannequin commented Jun 23, 2013

malemburg commented Jun 23, 2013

abalkin commented Jun 23, 2013

abalkin commented Jun 23, 2013

loewis mannequin commented Jun 24, 2013

abalkin commented Jun 29, 2014

vadmium commented Oct 28, 2014

gnprice commented Aug 8, 2019

animalize mannequin commented Aug 8, 2019

gnprice commented Aug 8, 2019

gnprice commented Aug 25, 2019

larryhastings commented Sep 18, 2023

serhiy-storchaka commented Sep 18, 2023

malemburg commented Sep 18, 2023 •

edited

Loading

larryhastings commented Sep 19, 2023

larryhastings commented Sep 19, 2023

ericvsmith commented Sep 21, 2023

str.isspace should use the Unicode White_Space property #62436

str.isspace should use the Unicode White_Space property #62436

Comments

abalkin commented Jun 17, 2013

abalkin commented Jun 17, 2013

terryjreedy commented Jun 21, 2013

abalkin commented Jun 22, 2013

terryjreedy commented Jun 22, 2013

abalkin commented Jun 22, 2013

abalkin commented Jun 22, 2013

abalkin commented Jun 22, 2013

loewis mannequin commented Jun 23, 2013

malemburg commented Jun 23, 2013

abalkin commented Jun 23, 2013

abalkin commented Jun 23, 2013

loewis mannequin commented Jun 24, 2013

abalkin commented Jun 29, 2014

vadmium commented Oct 28, 2014

gnprice commented Aug 8, 2019

animalize mannequin commented Aug 8, 2019

gnprice commented Aug 8, 2019

gnprice commented Aug 25, 2019

larryhastings commented Sep 18, 2023

serhiy-storchaka commented Sep 18, 2023

malemburg commented Sep 18, 2023 • edited Loading

larryhastings commented Sep 19, 2023

larryhastings commented Sep 19, 2023

ericvsmith commented Sep 21, 2023

malemburg commented Sep 18, 2023 •

edited

Loading