apoc.text.clean strips utf8 characters (incl. cyrillic/chinese/japanese symbols) #744

MysterAitch · 2018-02-20T08:57:52Z

Per the title, apoc.text.clean appears to strip out everything that is not a-z0-9 (albeit allowing for some normalisation of diacritics).

Unfortunately this means that the Cyrillic/Chinese/Japanese character-sets are stripped from input strings as they are not included within the whitelist.

Examples of current behaviour:

Chinese characters: RETURN apoc.text.clean("桃山區") -> Returns an empty string ""
Cyrillic characters: RETURN apoc.text.clean("А Б В Г Д Е Ж Ѕ Ꙁ И І К Л М Н О П Р С Т ОУ Ф Х Ѡ Ц Ч Ш Щ Ъ ЪІ Ь Ѣ Ꙗ Ѥ Ю Ѫ Ѭ Ѧ Ѩ Ѯ Ѱ Ѳ Ѵ Ҁ") -> Returns an empty string ""

Expected behaviour:

In both of these cases I would expect only the whitespace and punctuation to be removed; instead, everything is removed.

Consequence:

The consequence of this is that apoc.text.clean is not usable for content which has a significant number of non-ascii symbols within it.

My personal use-case has only a small amount of non-ascii text, so it isn't critical for me yet, but it is something I will need to face eventually.

Desired change:

I see a couple of options:

Whitelisting the relevant unicode character ranges to expand upon the current implementation,
Switching to blacklisting punctuation/whitespace.

Either of these would, ideally, be configurable to allow whitespace for example.

It would be fairly simple to get the blacklisting of punctuation/whitespace going. If expanding the existing a-z0-9 whitelist is chosen instead, my understanding from the JavaScript world is that deciding what is punctuation versus a symbol isn't necessarily straightforward (see here for JavaScript unicode issues). Perhaps Java is better-equipped?
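On the Java question, here is a minimal sketch of the blacklisting option. It assumes that the Unicode general categories exposed by `java.util.regex` (`\p{P}` punctuation, `\p{S}` symbols, `\p{Z}` separators) are an acceptable definition of what to strip. The class name `CleanSketch`, the `clean` method and the `keepWhitespace` flag are hypothetical and are not part of APOC; the diacritic normalisation mentioned above is left out.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the APOC implementation.
class CleanSketch {
    // Blacklist: Unicode punctuation, symbols, separators, plus ASCII whitespace (\s).
    private static final Pattern STRIP_ALL =
            Pattern.compile("[\\p{P}\\p{S}\\p{Z}\\s]+");

    // Same blacklist without the whitespace/separator classes.
    private static final Pattern STRIP_KEEP_WHITESPACE =
            Pattern.compile("[\\p{P}\\p{S}]+");

    // keepWhitespace is an assumed option illustrating the "configurable" idea.
    static String clean(String text, boolean keepWhitespace) {
        if (text == null) {
            return null;
        }
        Pattern pattern = keepWhitespace ? STRIP_KEEP_WHITESPACE : STRIP_ALL;
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("桃山區", false));        // letters are kept
        System.out.println(clean("А Б В", false));         // spaces removed
        System.out.println(clean("Hello, World!", true));  // punctuation removed, space kept
    }
}
```

Because the pattern removes categories rather than listing allowed letters, Chinese, Japanese and Cyrillic letters pass through without any per-script ranges having to be maintained.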

Normalising non-ascii characters isn't something I am familiar with, as I do not know the languages involved. My impression is that it is an extra layer of sophistication, and that it is better to have something basic and functional that can be improved upon later.

AngeloBusato added the Larus label Jul 23, 2018

AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 27, 2018

fixes neo4j-contrib#744 - add new regexp for clean function

3398c1f

AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 28, 2018

fixes neo4j-contrib#744 - add new regexp for clean function

392a74d

AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 28, 2018

fixes neo4j-contrib#744 - apoc.text.clean strips utf8 characters (inc…

157c99e

…l. cyrillic/chinese/japanese symbols)

jexp closed this as completed in b7337c7 Aug 6, 2018

jexp pushed a commit that referenced this issue Aug 8, 2018

fixes #744 - apoc.text.clean strips utf8 characters (incl. cyrillic/c…

bdd9646

…hinese/japanese symbols) (#869)
