OPTIONS_FILE assumes ISO 8859-1 encoding #239

Closed
omerhj opened this issue Mar 15, 2024 · 1 comment

omerhj commented Mar 15, 2024

I've been using Corb2 successfully to update large numbers of documents. Usually, I declare external variables in the transform.xqy script and define values for these variables in an options file that I pass to Corb using the -DOPTIONS_FILE parameter.

I've now hit a snag: Corb loads the options file using the java.util.Properties.load(InputStream) method. To my surprise, this causes the contents of the properties file to be decoded as ISO 8859-1, even though the system encoding (the LANG environment variable) is en_US.UTF-8. I'll now need to run a repair job to fix the resulting encoding errors.

I believe the best way to prevent this issue is to wrap the InputStream parameter in a java.io.InputStreamReader object and pass that to Properties.load(Reader) instead. That should use the system default charset for the properties file, which is almost always what you want.
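To illustrate the difference, here is a minimal sketch (not CoRB's actual code) that loads the same UTF-8 bytes once through load(InputStream) and once through an InputStreamReader. The class name, method names, and the `title` key are made up for this example. The charset is passed explicitly so the result is reproducible; the one-argument InputStreamReader constructor would use Charset.defaultCharset() instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; not part of CoRB.
class OptionsEncodingDemo {

    // Properties.load(InputStream) always decodes the bytes as ISO 8859-1.
    static Properties loadFromStream(byte[] fileBytes) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(fileBytes)) {
            props.load(in);
        }
        return props;
    }

    // Properties.load(Reader) decodes with whatever charset the Reader was given.
    static Properties loadFromReader(byte[] fileBytes, Charset charset) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), charset)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Made-up options-file content ("title=Cafe" with an accented e), stored as UTF-8.
        byte[] utf8Bytes = "title=Caf\u00e9\n".getBytes(StandardCharsets.UTF_8);

        // The two UTF-8 bytes of the accented character come back as two separate characters.
        System.out.println("load(InputStream): " + loadFromStream(utf8Bytes).getProperty("title"));

        // Decoded with the charset the file was actually written in.
        System.out.println("load(Reader):      "
                + loadFromReader(utf8Bytes, StandardCharsets.UTF_8).getProperty("title"));
    }
}
```

Values that contain only ASCII characters come out the same either way, which may be why the problem can go unnoticed for a long time.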

I use Java 8 in my production jobs, but Java 21 will (according to its JavaDoc) still use ISO 8859-1 for property files read from an InputStream.

CoRB version: marklogic-corb-2.5.4
OS: CentOS Linux release 7.6.1810 (Core)
JVM: OpenJDK Runtime Environment (build 1.8.0_191-b12)

Let me know if you'd like me to create a pull request.

@hansenmc
Member

Thank you for reporting the issue. Sorry to hear about the trouble it caused.

It seems that you are right: InputStreamReader would be a better choice than passing the InputStream directly to load, as it at least provides a means of specifying the encoding.

Whether to use the system encoding or not for various files is always tricky. Might look to use system encoding unless an option is specified to set something different (that way you can load UTF-8 options files on a Windows machine with cp1252 as system encoding).
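A rough sketch of that fallback idea, assuming a new option for the options-file encoding. The name `OPTIONS-FILE-ENCODING`, the class, and its methods are hypothetical; no such option exists in CoRB as far as this thread shows. If the option is unset or blank, the system default charset is used.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical loader; not CoRB's actual implementation.
class OptionsFileLoader {

    // Hypothetical option name, used here for illustration only.
    static final String ENCODING_OPTION = "OPTIONS-FILE-ENCODING";

    // Use the configured charset if one is given, otherwise the system default.
    static Charset resolveCharset(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws an unchecked exception if the charset name is invalid or unsupported.
        return Charset.forName(configured.trim());
    }

    static Properties load(Path optionsFile, String configuredEncoding) throws IOException {
        Properties props = new Properties();
        Charset charset = resolveCharset(configuredEncoding);
        try (Reader reader = new InputStreamReader(Files.newInputStream(optionsFile), charset)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // e.g. java -DOPTIONS-FILE-ENCODING=UTF-8 OptionsFileLoader my.properties
        Properties props = load(Paths.get(args[0]), System.getProperty(ENCODING_OPTION));
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With this shape, a UTF-8 options file could be loaded on a Windows machine whose default is cp1252 by setting the option, while existing setups that rely on the system encoding would keep working without changes.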
