reading / converting from files in different encodings #7834
Comments
Mostly a dup of #1792. We unfortunately lack a Latin1String, but this can be done manually by using |
We should probably add |
Why should |
Ok, fair enough. Always happy to add fewer things to Base :) |
Please don't add |
That is definitely why I want it in a package. It would be nice if we could make things work with different encodings in a package though. As long a |
@ivarne Are you really sure it would be useful to support non-Unicode encodings as separate |
The only reason why OP's code works at all is that there is an exact embedding of Latin-1 codepoints in the U+00:U+FF code point range in Unicode. So Latin-1 is privileged in a way that other encodings are not. (This doesn't necessarily mean we should special-case it in a String type though.) |
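To make the embedding point concrete, here is a minimal sketch, assuming a recent Julia 1.x (where `String` accepts a vector of `Char`), not the Julia version this thread was written against. The helper name `latin1_to_string` is hypothetical and not part of Base. It relies only on the fact stated above: each Latin-1 byte value is also its Unicode code point.

```julia
# Hypothetical helper (not in Base): decode Latin-1 bytes into a UTF-8 String.
# Each Latin-1 byte value equals its Unicode code point, so Char(byte) is
# already the right character. This would NOT hold for other 8-bit encodings.
latin1_to_string(bytes::AbstractVector{UInt8}) = String(map(Char, bytes))

bytes = UInt8[0x63, 0x61, 0x66, 0xe9]   # "café" encoded as Latin-1
s = latin1_to_string(bytes)
```

Any other single-byte encoding would need an explicit byte-to-code-point table instead of the plain `Char` conversion.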
💯 |
@nalimilan I'm not sure how useful it will be, but Base does not seem like the right place for it. I'm more doodling possible interfaces than anything else. We currently have a few charsets supported
I would still argue that the conversion code could be written in Julia, and then you will need some types to tag things correctly. Then you will need a Char->Unichar mapping, and it seems right to let the user decide whether he needs a copy (in UTF8/16/32) of the whole string or not. |
The intent of the However you can implement |
Better keep the data in a
I don't think creating a copy is a real problem. In virtually all cases, you don't need to keep two copies of the strings: you just read a small chunk of a text (e.g. one line), convert it to Unicode, and reuse the buffer to read another chunk. The overhead is not that big, and people with big databases will use Unicode anyway. |
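A rough sketch of that chunk-at-a-time pattern, again assuming a recent Julia 1.x and Latin-1 input. `each_latin1_line` is a hypothetical name, and `eachline` allocates a fresh string per line rather than literally reusing one buffer, so this only approximates the approach described above.

```julia
# Hypothetical helper: decode a Latin-1 stream one line at a time, so only
# the current line exists in both its raw and its converted form.
function each_latin1_line(f, io::IO)
    for raw in eachline(io)                      # raw still holds undecoded bytes
        f(String(map(Char, codeunits(raw))))     # Latin-1 byte == code point
    end
end

# "café\nnaïve\n" encoded as Latin-1
io = IOBuffer(UInt8[0x63, 0x61, 0x66, 0xe9, 0x0a, 0x6e, 0x61, 0xef, 0x76, 0x65, 0x0a])
lines = String[]
each_latin1_line(io) do line
    push!(lines, line)
end
```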
Closing as dup of #1792, feel free to continue the discussion there. |
I've got a file in latin1 and I've been trying to work out the best way to bring it in and work with it.
The most straightforward approach doesn't quite work:
So right now I take the long road - reading in each individual character and then converting them to a string.
Would it be possible / reasonable to add a keyword arg to `readall` (and company) that would do the same sort of conversion as `string` behind the scenes?