Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating files with UTF-8 filenames broken on Windows #5037

Open
takluyver opened this issue Oct 30, 2024 · 13 comments
Open

Creating files with UTF-8 filenames broken on Windows #5037

takluyver opened this issue Oct 30, 2024 · 13 comments
Assignees
Labels
Component - C Library Core C library issues (usually in the src directory) Priority - 0. Blocker ⛔ This MUST be merged for the release to happen Type - Bug / Bugfix Please report security issues to [email protected] instead of creating an issue on GitHub
Milestone

Comments

@takluyver
Copy link
Contributor

Describe the bug

Up to 1.14.3, HDF5 functions taking filenames expected UTF-8 filenames on Windows, and internally converted this to UTF-16 to pass to Windows Unicode (wide character) APIs. This allows you to create or open any filename Windows can represent, so long as you pass it as UTF-8, and h5py is relying on this behaviour to handle non-ASCII filenames consistently.

In #4172, this was changed to try the code-page dependent API first and fall back to the unicode API, although this is subtly different to what the user on the forum had requested:

try MultiByteToWideChar (with CP_UTF8) first, and then rather than failing completely, fallback to simply calling _open rather than _wopen.

The problem is that in most of the common Windows code pages, almost any UTF-8 string is valid, but representing the wrong characters. E.g.:

>>> "spent €3 in the café".encode("utf-8").decode("cp1252")
'spent €3 in the café'

That's using codepage 1252, which is the default in Western Europe & many English-speaking areas.

And this is precisely what we see in h5py's tests. A file is created through HDF5 with the character U+201A "Single Low-9 Quotation Mark" (which confusingly looks like a regular comma, but isn't). This is encoded as 3 bytes in UTF-8:

>>> chr(0x201a).encode('utf-8')
b'\xe2\x80\x9a'

And the filename that gets created contains ‚, which is those same 3 bytes interpreted as codepage 1252.

Expected behavior

The previous behaviour was, from our perspective, fine: H5Fopen & H5Fcreate simply took UTF-8 strings on Windows. So long as you know that and know how to convert encodings, you can do everything with that, with no ambiguity.

However, this may be confusing for Windows programmers who assume that a char * filename parameter will be interpreted as their current code page, since that's how Windows APIs work. You could document that this is different from Windows APIs and leave it alone.

If you want to allow a char * path to mean either UTF-8 or the current code page, there is unavoidable ambiguity: you cannot know if the bytes '\xc3\x9f' are UTF-8 (ß) or code page 1252 (ß). Favouring the code page in this scenario, as the current version does, makes UTF-8 essentially unusable. Favouring UTF-8 will mostly work: it's rare to hit on a UTF-8 sequence (except the ASCII part) by chance, but not impossible.

The alternative most palatable to Windows developers would probably be to make parallel versions of H5Fopen & H5Fcreate which take wide character strings, and pass these through to the Windows Unicode file APIs. But that would need to be plumbed right through the stack of VFDs and VOLs, which is presumably why you didn't go for that already.

Platform (please complete the following information)

  • HDF5 version: 1.14.5
  • OS and version: Windows
  • Compiler and version: MSVC 19.41.34123.0
  • Build system (e.g. CMake, Autotools) and version: CMake 3.30.5
  • Any configure options you specified: (I can figure the options out if needed, but I don't think they're relevant.)
  • MPI library and version (parallel HDF5): none

Additional context

h5py/h5py#2520

@derobins derobins added this to the 2.0.0 milestone Oct 30, 2024
@derobins derobins self-assigned this Oct 30, 2024
@derobins derobins added Priority - 0. Blocker ⛔ This MUST be merged for the release to happen Component - C Library Core C library issues (usually in the src directory) Type - Bug / Bugfix Please report security issues to [email protected] instead of creating an issue on GitHub labels Nov 4, 2024
@hmaarrfk
Copy link

hmaarrfk commented Nov 5, 2024

is the offending PR "revertible"?
#4172

@emmenlau
Copy link

emmenlau commented Nov 5, 2024

is the offending PR "revertible"? #4172

You mean, by users, or by the official upstream release? For users it is fairly easy to comment out the new behavior and by that revert back to the old behavior.

@hmaarrfk
Copy link

hmaarrfk commented Nov 5, 2024

i feel like with such a large change I don't want to go against "upstream's wishes".

So for me it is between:

  • Roll back conda-forge to 1.14.3
  • Create a strange version of HDF5 with undocumented behavior at conda-ofrge

I would rather not roll back to 1.14.3 but it seems like maybe we must if this will take a while to resolve?

@takluyver
Copy link
Contributor Author

For the h5py packages on PyPI, I'm holding off upgrading until we hear what's likely to happen with this. But I found this because someone specifically asked for a newer HDF5, so if it seems like it's going to take a while, I might just skip the test and release with Windows unicode support as a known issue. 🤷

@jhendersonHDF
Copy link
Collaborator

Hi all,

it might be best to proceed with a patch to revert the problematic changes if wishing to upgrade past 1.14.3. I believe there's still the possibility of a 1.14.6 release, but if we do that we'll want to look into fixing the unicode support for all the cases that we can rather than reverting these changes and going back to the previous (also problematic) state. Due to that, it may take longer than desired to get a fix for this out.

@takluyver
Copy link
Contributor Author

takluyver commented Nov 6, 2024

we'll want to look into fixing the unicode support for all the cases that we can rather than reverting these changes and going back to the previous (also problematic) state

I don't think there really is a better option for Windows than the previous state, assuming you're not going to add a whole new set of file APIs using wchar_t *, or a parameter to switch how char * filenames are handled.

The only other option I can see is to flip the order, so it tries to treat the input as UTF-8 (converting it & passing to _wopen) first and falls back to the code page (open). That's better in that 99% of the time it will work for filenames using the codepage, but also worse because of the ambiguity: certain specific names in the code page would be treated as UTF-8 and create an unexpected filename.

@takluyver
Copy link
Contributor Author

Thinking about it, I guess you could make it a compile time option whether filenames are treated as UTF-8 on Windows, allowing you to use any possible file name, or as code page values, giving a more familiar experience for Windows programmers.

@hmaarrfk
Copy link

hmaarrfk commented Nov 7, 2024

it might be best to proceed with a patch to revert the problematic changes if wishing to upgrade past 1.14.3. I believe there's still the possibility of a 1.14.6 release,

I'm having a hard time to understand this suggestion. If this is the recommend way, as a packager, I would ask for your help to give users clear instructions:

  • 1.14.3 expected behavior is _______
  • 1.14.4-1.14.5 expected behavior is _______
  • 1.14.6 expected behavior is _______

asking downstream packages to patch means that we will each choose a different "solution" but it will push the problems onto end users.

I'm not too sure what the details of accelerating a 1.14.6 release with your best recommendations would be, but it would help ensure that documentation found on the web is consistent.

Thinking about it, I guess you could make it a compile time option whether filenames are treated as UTF-8 on Windows, allowing you to use any possible file name, or as code page values, giving a more familiar experience for Windows programmers.

IMO these kinds of "compile time switches" are really not useful to many end users. End users are often unaware of the compile time settings. Package managers are, but then we have to create complicate webs to enable users to "choose which compile time features they have".

@emmenlau
Copy link

emmenlau commented Nov 7, 2024

In my humble opinion, this issue has a much more severe impact on users than what one would imagine, at first. Having garbled file names, or being unable to create files at all, will come as a severe limitation.

I think this would justify either an additional runtime open/create option, to choose a behavior, or a dedicated API as suggested by @takluyver above.

Just my two cents...

@jhendersonHDF
Copy link
Collaborator

jhendersonHDF commented Nov 7, 2024

@hmaarrfk

The original idea suggested by myself was simply for us to revert #4172 and create a new 1.14.6 release for h5py's usage. However, it was brought up that if we are going to do a new release then it should have a proper fix for unicode support on Windows by taking one of the approaches mentioned by @takluyver in #5037. Since we want to address both the new issue we've caused for h5py, as well as the issue we were originally trying to fix in #4172, a new release may take a while, which is why I suggested possibly proceeding with a patch to revert the offending PR for h5py to be able to release with the previous behavior.

I'd then say that for 1.14.3 and 1.14.4, the expected behavior would be the behavior prior to #4172, where "HDF5 attempted to convert file paths passed to open() and remove() to UTF-16 in order to handle Unicode file paths.". For 1.14.5, we unfortunately didn't catch that this broke h5py, but this means that the expected behavior would be that the library will "try and interpret a file name as ASCII first, and only if that fails, interpret as UTF8 (via the conversion to UTF16 for MS Windows file open methods)". For a possible 1.14.6 release, the expected behavior is currently unknown since we still need to discuss internally how we want to approach fixing this issue.

@hmaarrfk
Copy link

hmaarrfk commented Nov 8, 2024

since we still need to discuss internally how we want to approach fixing this issue.

Understood.

@h-vetinari
Copy link

The original idea suggested by myself was simply for us to revert #4172 and create a new 1.14.6 release for h5py's usage. However, it was brought up that if we are going to do a new release then it should have a proper fix for unicode support on Windows by taking one of the approaches mentioned by @takluyver in #5037.

I think it's a fallacy that you need to solve both at the same time. A 1.14.6 release with the imperfect-but-working previous state is much more desirable than an indeterminate period where an as-yet-unknown solution is cooked up. Please consider going back to what worked (at least as measured by the predominant usage patterns prior to 1.14.4), and then take however much time you need for a unified solution that solves the issues from #4172 without breaking the rest.

@h-vetinari
Copy link

This is creating quite the pile-up in conda-forge and elsewhere. Basically anyone with users on windows is stuck on 1.14.3. In an effort to still enable 1.14.4, conda-forge is now doubling our CI matrix across ~130 packages just so that packages who cannot move on are still supported, while still trying to keep up-to-date where possible (moving on to 1.14.5 is not even on the radar at the moment, given this situation).

If this only gets solved - as currently milestoned - in 2.0, then we're in for a world of pain, because it'll mix the wish to use anything post-1.14.3 with having to rebuild all relevant libraries for 2.0, likely drawing in a huge amount of other work to adapt to the new major version, which would keep this whole situation from being resolved until months after the 2.0 release (which might still slip, and notwithstanding that whatever unified implementation might still contain new and exciting bugs).

This is just not sustainable. There's a reason why the regression rules in the kernel demand reverting regressions immediately - trying to layer fixes on top has simply shown again and again to not be a sustainable model (plus, if it goes on long enough, there'll be users that depend on both the new and the old behaviour, further restricting available avenues to fix things).

@derobins, given that your PR #4172 caused this, could you please comment here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component - C Library Core C library issues (usually in the src directory) Priority - 0. Blocker ⛔ This MUST be merged for the release to happen Type - Bug / Bugfix Please report security issues to [email protected] instead of creating an issue on GitHub
Projects
None yet
Development

No branches or pull requests

6 participants