apoc.text.clean strips utf8 characters (incl. cyrillic/chinese/japanese symbols) #744

Closed
MysterAitch opened this issue Feb 20, 2018 · 0 comments

Per the title, apoc.text.clean appears to strip out everything that is not a-z0-9 (albeit allowing for some normalisation of diacritics).

Unfortunately this means that the Cyrillic/Chinese/Japanese character-sets are stripped from input strings as they are not included within the whitelist.

Examples of current behaviour:

  • Chinese characters: RETURN apoc.text.clean("桃山區") -> Returns an empty string ""
  • Cyrillic characters: RETURN apoc.text.clean("А Б В Г Д Е Ж Ѕ Ꙁ И І К Л М Н О П Р С Т ОУ Ф Х Ѡ Ц Ч Ш Щ Ъ ЪІ Ь Ѣ Ꙗ Ѥ Ю Ѫ Ѭ Ѧ Ѩ Ѯ Ѱ Ѳ Ѵ Ҁ") -> Returns an empty string ""
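For illustration, the behaviour reported above can be modelled roughly as follows. This is not the APOC source, only a sketch of an ASCII-only whitelist with diacritic folding; the class name `AsciiCleanSketch`, the method name, and the lower-casing step are my assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical model of the reported behaviour; NOT the actual APOC implementation.
final class AsciiCleanSketch {
    static String clean(String input) {
        // Decompose accented letters into base letter + combining mark (NFD).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop the combining marks, so an accented "e" folds to a plain "e".
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: lower-case first, then keep only ASCII letters and digits.
        return withoutMarks.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "");
    }
}
```

With this model, accented Latin input survives in folded form, while Chinese or Cyrillic input is reduced to an empty string because none of its characters fall inside `a-z0-9`.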

Expected behaviour:

In both of these cases I would expect only the whitespace and punctuation to be removed; instead, everything is removed.

Consequence:

The consequence of this is that apoc.text.clean is not usable for content which has a significant number of non-ascii symbols within it.

My personal use-case has only a small amount of non-ascii text so it isn't particularly critical for me yet, but long-term it is something that I will need to face eventually.

Desired change:

I see a couple of options:

  1. Whitelisting the relevant unicode character ranges to expand upon the current implementation,
  2. Switching to blacklisting punctuation/whitespace.

Ideally, either option would be configurable, for example to allow whitespace to be kept.

It would be fairly straightforward to get the blacklisting of punctuation/whitespace going. If expanding the existing a-z0-9 whitelist is chosen instead, my understanding from the JavaScript world is that deciding what is punctuation versus symbols isn't necessarily straightforward (see here for JavaScript unicode issues). Perhaps Java is better-equipped?
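On the Java question: `java.util.regex` does support Unicode general categories such as `\p{L}` (letters), `\p{N}` (numbers), `\p{P}` (punctuation) and `\p{Z}` (separators), so both options can be sketched fairly compactly. The following is only a sketch under my own assumptions, not a proposed patch; the class name `UnicodeCleanSketch`, the method names, and the `keepWhitespace` flag are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the two options discussed above; names are illustrative only.
final class UnicodeCleanSketch {
    // Option 1: whitelist letters (\p{L}) and numbers (\p{N}) from any script.
    private static final Pattern NOT_LETTER_OR_NUMBER =
            Pattern.compile("[^\\p{L}\\p{N}]+");
    // Same whitelist, but whitespace matched by \s is also kept.
    private static final Pattern NOT_LETTER_NUMBER_OR_SPACE =
            Pattern.compile("[^\\p{L}\\p{N}\\s]+");
    // Option 2: blacklist punctuation (\p{P}), separators (\p{Z}) and whitespace (\s).
    private static final Pattern PUNCTUATION_OR_SPACE =
            Pattern.compile("[\\p{P}\\p{Z}\\s]+");

    static String cleanWhitelist(String input, boolean keepWhitespace) {
        Pattern toRemove = keepWhitespace ? NOT_LETTER_NUMBER_OR_SPACE : NOT_LETTER_OR_NUMBER;
        return toRemove.matcher(input).replaceAll("");
    }

    static String cleanBlacklist(String input) {
        return PUNCTUATION_OR_SPACE.matcher(input).replaceAll("");
    }
}
```

Under either method, "桃山區" comes back unchanged and "А Б В" becomes "АБВ". The two options do differ on symbols: `+` is in the math-symbol category rather than punctuation, so `cleanWhitelist("a+b", false)` gives "ab" while `cleanBlacklist("a+b")` leaves "a+b" untouched. That is the punctuation-versus-symbols ambiguity mentioned above, and it would need a deliberate decision whichever option is chosen.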

Normalising non-ASCII characters isn't something I am familiar with, as I do not know the languages, but my impression is that this is an extra layer of sophistication, and it is better to have something basic and functional that can be improved upon.

AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 27, 2018
AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 28, 2018
AngeloBusato added a commit to larusba/neo4j-apoc-procedures that referenced this issue Jul 28, 2018
jexp closed this as completed in b7337c7 Aug 6, 2018
jexp pushed a commit that referenced this issue Aug 8, 2018