Per the title, `apoc.text.clean` appears to strip out everything that is not `a-z0-9` (albeit allowing for some normalisation of diacritics).
Unfortunately this means that Cyrillic/Chinese/Japanese characters are stripped from input strings, as they are not included in the whitelist.
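I have not dug into the APOC source, so this is only a guess, but the observed behaviour looks equivalent to something like the following sketch (the normalisation step and pattern are assumptions on my part, not the actual implementation):

```java
import java.text.Normalizer;

public class CleanSketch {
    // Assumed equivalent of the current behaviour: decompose diacritics,
    // lowercase, then whitelist only a-z0-9. Anything outside that range,
    // including every CJK and Cyrillic code point, is discarded.
    static String clean(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.toLowerCase().replaceAll("[^a-z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Héllo, Wörld!")); // "helloworld"
        System.out.println(clean("桃山區"));          // "" -- everything is stripped
    }
}
```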
Examples of current behaviour:
- Chinese characters: `RETURN apoc.text.clean("桃山區")` -> returns an empty string `""`
- Cyrillic characters: `RETURN apoc.text.clean("А Б В Г Д Е Ж Ѕ Ꙁ И І К Л М Н О П Р С Т ОУ Ф Х Ѡ Ц Ч Ш Щ Ъ ЪІ Ь Ѣ Ꙗ Ѥ Ю Ѫ Ѭ Ѧ Ѩ Ѯ Ѱ Ѳ Ѵ Ҁ")` -> returns an empty string `""`
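As a possible interim workaround for anyone hitting this, blacklisting punctuation and whitespace directly should be doable with `apoc.text.regreplace`, since its pattern is handed to Java's Unicode-aware regex engine (assuming that function is available in your APOC version; the pattern is just one illustrative choice):

```cypher
// Remove Unicode punctuation (\p{P}) and whitespace (\s) instead of
// whitelisting a-z0-9; CJK and Cyrillic letters pass through untouched.
RETURN apoc.text.regreplace("桃山區, тест!", "[\\p{P}\\s]+", "")
// -> "桃山區тест"
```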
Expected behaviour:
In both of these cases I would expect only the whitespace and punctuation to be removed; instead, everything is removed.
Consequence:
The consequence is that `apoc.text.clean` is not usable for content containing a significant amount of non-ASCII text.
My personal use-case involves only a small amount of non-ASCII text, so this isn't particularly critical for me yet, but it is something I will need to face eventually.
Desired change:
I see a couple of options:
- Whitelisting the relevant Unicode character ranges, expanding upon the current implementation.
- Switching to blacklisting punctuation/whitespace.

Either of these would, ideally, be configurable, for example to allow whitespace to be kept; a sketch of the blacklist approach follows below.
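A minimal sketch of what the blacklist option could look like, with the kind of configuration knob mentioned above (the method name, flag, and patterns are my own illustration, not a proposed final API):

```java
import java.util.regex.Pattern;

public class BlacklistClean {
    // \p{P} matches Unicode punctuation and \s matches whitespace;
    // both classes are supported natively by java.util.regex.
    private static final Pattern PUNCT = Pattern.compile("\\p{P}+");
    private static final Pattern PUNCT_AND_SPACE = Pattern.compile("[\\p{P}\\s]+");

    static String clean(String text, boolean keepWhitespace) {
        Pattern blacklist = keepWhitespace ? PUNCT : PUNCT_AND_SPACE;
        return blacklist.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("桃山區, привет!", false)); // "桃山區привет"
        System.out.println(clean("桃山區, привет!", true));  // "桃山區 привет"
    }
}
```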
It would be fairly simple/straightforward to get the blacklisting of punctuation/whitespace going, but if expanding the existing `a-z0-9` whitelist is chosen, then my understanding from the JavaScript world is that deciding what is punctuation versus what is a symbol isn't necessarily straightforward (see here for JavaScript unicode issues). Perhaps Java is better-equipped?
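For what it's worth, Java's regex engine does expose Unicode general categories and binary properties directly, which seems to sidestep most of the classification problem that bites JavaScript (a quick check of the relevant classes):

```java
public class UnicodeProperties {
    public static void main(String[] args) {
        // General category L (letter) covers CJK and Cyrillic alike.
        System.out.println("桃".matches("\\p{L}"));            // true
        System.out.println("Ж".matches("\\p{L}"));            // true
        // \p{IsAlphabetic} is the Unicode binary Alphabetic property.
        System.out.println("區".matches("\\p{IsAlphabetic}")); // true
        // Punctuation (P) and symbols (S) are distinct categories.
        System.out.println("、".matches("\\p{P}"));            // true (ideographic comma)
        System.out.println("¥".matches("\\p{S}"));             // true (currency symbol)
    }
}
```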
Normalising non-ASCII characters isn't something I am familiar with, as I do not know the languages involved, but my impression is that this is an additional layer of sophistication; it is better to have something basic and functional in place that can be improved upon later.
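That said, a basic diacritic fold that keeps non-Latin letters intact is already possible with the standard library, which could serve as that basic-but-functional starting point (again only a sketch, not a claim about how APOC currently does it):

```java
import java.text.Normalizer;

public class DiacriticFold {
    // Decompose accented characters (NFD), then drop only the combining
    // marks (\p{M}); base letters, including CJK and Cyrillic, survive.
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("crème brûlée")); // "creme brulee"
        System.out.println(fold("桃山區 Ёж"));      // "桃山區 Еж" -- Ё loses its diaeresis
    }
}
```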