Creating files with UTF-8 filenames broken on Windows #5037
Comments
is the offending PR "revertible"? |
You mean, by users, or by the official upstream release? For users it is fairly easy to comment out the new behavior and by that revert back to the old behavior. |
I feel like with such a large change, I don't want to go against "upstream's wishes". So for me it is between:
I would rather not roll back to 1.14.3 but it seems like maybe we must if this will take a while to resolve? |
For the h5py packages on PyPI, I'm holding off upgrading until we hear what's likely to happen with this. But I found this because someone specifically asked for a newer HDF5, so if it seems like it's going to take a while, I might just skip the test and release with Windows unicode support as a known issue. 🤷 |
Hi all, it might be best to proceed with a patch to revert the problematic changes if wishing to upgrade past 1.14.3. I believe there's still the possibility of a 1.14.6 release, but if we do that we'll want to look into fixing the unicode support for all the cases that we can rather than reverting these changes and going back to the previous (also problematic) state. Due to that, it may take longer than desired to get a fix for this out. |
I don't think there really is a better option for Windows than the previous state, assuming you're not going to add a whole new set of file APIs using The only other option I can see is to flip the order, so it tries to treat the input as UTF-8 (converting it & passing to |
Thinking about it, I guess you could make it a compile time option whether filenames are treated as UTF-8 on Windows, allowing you to use any possible file name, or as code page values, giving a more familiar experience for Windows programmers. |
I'm having a hard time understanding this suggestion. If this is the recommended way, as a packager, I would ask for your help to give users clear instructions:
asking downstream packages to patch means that we will each choose a different "solution" but it will push the problems onto end users. I'm not too sure what the details of accelerating a 1.14.6 release with your best recommendations would be, but it would help ensure that documentation found on the web is consistent.
IMO these kinds of "compile time switches" are really not useful to many end users. End users are often unaware of the compile time settings. Package managers are, but then we have to create complicated webs to enable users to "choose which compile time features they have". |
In my humble opinion, this issue has a much more severe impact on users than what one would imagine, at first. Having garbled file names, or being unable to create files at all, will come as a severe limitation. I think this would justify either an additional runtime open/create option, to choose a behavior, or a dedicated API as suggested by @takluyver above. Just my two cents... |
The original idea suggested by myself was simply for us to revert #4172 and create a new 1.14.6 release for h5py's usage. However, it was brought up that if we are going to do a new release then it should have a proper fix for unicode support on Windows by taking one of the approaches mentioned by @takluyver in #5037. Since we want to address both the new issue we've caused for h5py, as well as the issue we were originally trying to fix in #4172, a new release may take a while, which is why I suggested possibly proceeding with a patch to revert the offending PR for h5py to be able to release with the previous behavior. I'd then say that for 1.14.3 and 1.14.4, the expected behavior would be the behavior prior to #4172, where "HDF5 attempted to convert file paths passed to open() and remove() to UTF-16 in order to handle Unicode file paths.". For 1.14.5, we unfortunately didn't catch that this broke h5py, but this means that the expected behavior would be that the library will "try and interpret a file name as ASCII first, and only if that fails, interpret as UTF8 (via the conversion to UTF16 for MS Windows file open methods)". For a possible 1.14.6 release, the expected behavior is currently unknown since we still need to discuss internally how we want to approach fixing this issue. |
Understood. |
I think it's a fallacy that you need to solve both at the same time. A 1.14.6 release with the imperfect-but-working previous state is much more desirable than an indeterminate period where an as-yet-unknown solution is cooked up. Please consider going back to what worked (at least as measured by the predominant usage patterns prior to 1.14.4), and then take however much time you need for a unified solution that solves the issues from #4172 without breaking the rest. |
This is creating quite the pile-up in conda-forge and elsewhere. Basically anyone with users on Windows is stuck on 1.14.3. In an effort to still enable 1.14.4, conda-forge is now doubling our CI matrix across ~130 packages just so that packages that cannot move on are still supported, while still trying to keep up-to-date where possible (moving on to 1.14.5 is not even on the radar at the moment, given this situation). If this only gets solved - as currently milestoned - in 2.0, then we're in for a world of pain, because it'll mix the wish to use anything post-1.14.3 with having to rebuild all relevant libraries for 2.0, likely drawing in a huge amount of other work to adapt to the new major version, which would keep this whole situation from being resolved until months after the 2.0 release (which might still slip, and notwithstanding that whatever unified implementation might still contain new and exciting bugs). This is just not sustainable. There's a reason why the regression rules in the kernel demand reverting regressions immediately - trying to layer fixes on top has simply shown again and again to not be a sustainable model (plus, if it goes on long enough, there'll be users that depend on both the new and the old behaviour, further restricting available avenues to fix things). @derobins, given that your PR #4172 caused this, could you please comment here? |
Describe the bug
Up to 1.14.3, HDF5 functions taking filenames expected UTF-8 filenames on Windows, and internally converted this to UTF-16 to pass to Windows Unicode (wide character) APIs. This allows you to create or open any filename Windows can represent, so long as you pass it as UTF-8, and h5py is relying on this behaviour to handle non-ASCII filenames consistently.
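As a rough illustration of that conversion step, here is a sketch in Python rather than HDF5's actual C code. The helper name utf8_path_to_wide is hypothetical. It decodes the UTF-8 bytes and re-encodes them as NUL-terminated UTF-16, which is the form the Windows wide-character file APIs take:

```python
def utf8_path_to_wide(path: bytes) -> bytes:
    """Hypothetical helper: UTF-8 path bytes -> NUL-terminated UTF-16-LE bytes."""
    # Raises UnicodeDecodeError if the input is not valid UTF-8.
    text = path.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


name = "\u201a.h5".encode("utf-8")  # U+201A followed by ".h5"
wide = utf8_path_to_wide(name)
print(wide)
```

Because the input is always treated as UTF-8, there is exactly one possible result for any given byte string.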
In #4172, this was changed to try the code-page dependent API first and fall back to the unicode API, although this is subtly different to what the user on the forum had requested:
The problem is that in most of the common Windows code pages, almost any UTF-8 string is valid, but representing the wrong characters. E.g.:
That's using codepage 1252, which is the default in Western Europe & many English-speaking areas.
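A small Python sketch of this point follows; the helper names are made up for illustration. It misreads UTF-8 bytes as code page 1252, then counts how many two-byte UTF-8 sequences decode that way without error. Python's cp1252 codec leaves five byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). Windows itself reportedly still maps these to control characters, so treat the count as a lower bound rather than an exact model of Windows.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as code page 1252."""
    return text.encode("utf-8").decode("cp1252")


def fraction_valid_as_cp1252(first: int = 0x80, last: int = 0x7FF) -> float:
    """Share of code points in [first, last] whose UTF-8 bytes decode as cp1252."""
    ok = 0
    for cp in range(first, last + 1):
        try:
            misread_as_cp1252(chr(cp))
            ok += 1
        except UnicodeDecodeError:
            pass  # one of the bytes is undefined in Python's cp1252 table
    return ok / (last - first + 1)


print(misread_as_cp1252("é"))
print(f"{fraction_valid_as_cp1252():.1%} of two-byte UTF-8 sequences decode as cp1252")
```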
And this is precisely what we see in h5py's tests. A file is created through HDF5 with the character U+201A "Single Low-9 Quotation Mark" (which confusingly looks like a regular comma, but isn't). This is encoded as 3 bytes in UTF-8:
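A minimal Python illustration of those bytes, standing in for the h5py test itself, which is not reproduced here:

```python
char = "\u201a"                   # SINGLE LOW-9 QUOTATION MARK
utf8 = char.encode("utf-8")
print(utf8)                       # b'\xe2\x80\x9a'
print(len(utf8))                  # 3

# The same three bytes as a code page 1252 API would read them.
misread = utf8.decode("cp1252")
print([f"U+{ord(c):04X}" for c in misread])  # ['U+00E2', 'U+20AC', 'U+0161']
```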
And the filename that gets created contains â€š, which is those same 3 bytes interpreted as codepage 1252.
Expected behavior
The previous behaviour was, from our perspective, fine: H5Fopen & H5Fcreate simply took UTF-8 strings on Windows. So long as you know that and know how to convert encodings, you can do everything with that, with no ambiguity.
However, this may be confusing for Windows programmers who assume that a char * filename parameter will be interpreted as their current code page, since that's how Windows APIs work. You could document that this is different from Windows APIs and leave it alone.
If you want to allow a char * path to mean either UTF-8 or the current code page, there is unavoidable ambiguity: you cannot know if the bytes '\xc3\x9f' are UTF-8 (ß) or code page 1252 (ÃŸ). Favouring the code page in this scenario, as the current version does, makes UTF-8 essentially unusable. Favouring UTF-8 will mostly work: it's rare to hit on a UTF-8 sequence (except the ASCII part) by chance, but not impossible.
The alternative most palatable to Windows developers would probably be to make parallel versions of H5Fopen & H5Fcreate which take wide character strings, and pass these through to the Windows Unicode file APIs. But that would need to be plumbed right through the stack of VFDs and VOLs, which is presumably why you didn't go for that already.
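The ambiguity between the two orderings can be sketched in Python. This is a simplified model, not HDF5 code: the function names are hypothetical, code page 1252 is assumed, and each policy is reduced to a pure decoding choice. The real library decides by calling file APIs, which this does not attempt to reproduce.

```python
CODE_PAGE = "cp1252"  # assumption: the Western European default


def code_page_first(path: bytes) -> str:
    """Model of the current order: code page first, UTF-8 as the fallback."""
    try:
        return path.decode(CODE_PAGE)
    except UnicodeDecodeError:
        return path.decode("utf-8")


def utf8_first(path: bytes) -> str:
    """Model of the flipped order: UTF-8 first, code page as the fallback."""
    try:
        return path.decode("utf-8")
    except UnicodeDecodeError:
        return path.decode(CODE_PAGE)


samples = {
    "UTF-8 caller, name ß.h5": "ß.h5".encode("utf-8"),
    "cp1252 caller, name ß.h5": "ß.h5".encode(CODE_PAGE),
    "cp1252 caller, name ÃŸ.h5": "ÃŸ.h5".encode(CODE_PAGE),
}
for label, raw in samples.items():
    print(f"{label}: code page first -> {code_page_first(raw)}, "
          f"UTF-8 first -> {utf8_first(raw)}")
```

With these inputs, the code-page-first order garbles the UTF-8 caller's name, while the UTF-8-first order only goes wrong for the contrived ÃŸ.h5 name, whose code page bytes happen to form valid UTF-8.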
Platform (please complete the following information)
Additional context
h5py/h5py#2520