Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TMPDIR and use Win API on Windows #1416

Merged
merged 5 commits into from
Aug 7, 2023

Conversation

oscarbenjamin
Copy link
Contributor

See #1206 (comment)

I am hoping that this will expose a CI failure on MinGW because there is a MinGW bug that currently does not show up in CI.

I have not yet tested it locally with MinGW (it takes a long time to test that with the Windows machine that I have available).

@oscarbenjamin
Copy link
Contributor Author

Okay it's been a bit of a roundabout journey but I think I have now understood the original problem in gh-1206 as well as the various changes and what effect they have on different platforms and why ultimately flintlib/python-flint#43 (comment) is not yet fixed: I think I have the fix here though.

The Flint CI won't run until someone approves the workflow but the CI already runs in my fork and passes:
https://github.com/oscarbenjamin/flint2/actions

I'll try to explain but it's difficult to do this succinctly:

Firstly the qsieve_factor function which is in turn used by fmpz_factor uses temporary files to store some data. Previous discussions have suggested removing this use of files and doing everything in memory. The complexity of what I am about to explain demonstrates exactly how many problems are created so clearly longer term there is significant benefit in removing the use of temporary files.

In Flint 2.9.0 the temporary file is created like:

#if defined (__WIN32) && !defined(__CYGWIN__)                                   
    srand((int) GetCurrentProcessId());                                         
#else                                                                           
    srand((int) getpid());                                                      
#endif                                                                          
    nchars = sprintf(qs_inf->fname, "%d", (int) rand());                        
    strcat(qs_inf->fname + nchars, "siqs.dat");                                 
                                                                                
    qs_inf->siqs = fopen(qs_inf->fname, "w"); 

That just creates a file in the current directory with a sort of random filename.

Since the 2.9.0 behaviour is not ideal gh-1201 changed that. Instead the appropriate APIs should be used for Windows/Unix to choose a unique temporary filename which will be in e.g. /tmp on Unix and the appropriate other place on Windows. Then there were two codepaths separated by preprocessor defines with one codepath being for MSVC and MinGW on Windows using GetTempFileNameA and the other codepath being for cygwin and Unix using mkstemp. This was almost the right code but a subtle bug was introduced in this line for the Windows codepath:

    if (GetTempFileNameA(temp_path, "siqs", /*uUnique*/ TRUE, qs_inf->fname) == 0)

Here uUnique is set to TRUE to create a unique filename but that is not how the MS API designers wanted it:

If uUnique is zero, the function attempts to form
a unique file name using the current system time.
If the file already exists, the number is
increased by one and the functions tests if this
file already exists. This continues until a unique
filename is found; the function creates a file by
that name and closes it. Note that the function
does not attempt to verify the uniqueness of the
file name when uUnique is nonzero.

https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-gettempfilenamea

So (obviously?) if you do want unique filenames then you should pass uUnique = 0.

I think at the time there was no CI for MSVC which is perhaps why this was only immediately noticed for MinGW. The bug here applies equally to MSVC and MinGW but reproducing it is tricky because it only manifests when calling qsieve_factor in parallel like:

make check MOD=fmpz_factor -j

The problem is that two separate processes will choose the same filename. The error can manifest in two different ways:

  1. Process A attempts to delete the file while process B holds an open file handle: permission denied.
  2. Process B attempts to delete the file but process A already deleted it: remove failed.

I haven't noticed this happening but there is a third way that is much worse: the two processes overwrite each other's data leading to incorrect results from fmpz_factor. It is likely (but not guaranteed) that in this scenario an error would still arise from at least one process double deleting the file or something like that.

The problem with MinGW was noticed at the time (gh-1206) and an attempt was made to fix it in gh-1279. That PR changed it so that MinGW would follow the Unix codepath rather than the Windows codepath. This would cause the MinGW job to fail in a different way because the Unix codepath assumed that /tmp was the temp directory but that path does not exist on Windows so all calls to qsieve_factor under MinGW would fail. The PR worked around this by adding a build-time configuration variable TMPPATH. The TMPPATH variable (later renamed in gh-1334) was then used for the MinGW CI job:
https://github.com/flintlib/flint2/blob/ef89816abda221fc40cb4a960430cb570ef0f43c/.github/workflows/CI.yml#L473-L475
This means that the MinGW CI job is back to using the current working directory just like Flint 2.9.0 which makes CI look like it passes except that the problem still exists outside of CI: flintlib/python-flint#43 (comment)

The problem with MSVC I guess did not manifest until CI was added in gh-1282 at which point the MSVC codepath (since MinGW was then using the Unix codepath with TMPDIR=. in CI) was changed to use tmpnam and so we finally arrive at the current code in trunk:
https://github.com/flintlib/flint2/blob/ef89816abda221fc40cb4a960430cb570ef0f43c/src/qsieve/factor.c#L212-L229
This code is broken on Windows under either MSVC or MinGW:

  1. Under MSVC the use of tmpnam is also subject to race conditions etc and its use is generally discouraged e.g.

    Never use these functions.
    https://man7.org/linux/man-pages/man3/tmpnam.3.html

  2. Under MinGW this will always fail because it tries to create a temporary file in /tmp which does not exist.

In this PR I fix all of these problems by going back to gh-1201 but with two important fixes:

  1. Use uUnique=0 to get unique filenames.
  2. Match every fopen with an fclose and ensure that qs_inf->siqs is Null if there is no file to close.

The latter point is needed because there is an important constraint in Windows that does not apply in Unix: a file cannot be deleted if any process holds an open file handle.

This merges the MinGW and MSVC codepaths again as in gh-1201 and ensures that the recommended APIs for temporary files are used in all cases.

I also removed the TMPDIR build-time configuration variable. Potentially this variable is useful but the original reason for adding it no longer applies: it was added for the MinGW CI job but the only purpose it serves there is to hide a real bug in CI so it should not be used for that. Clearly looking at what I have written above avoiding temporary files is a good idea in general and if the long term goal is to do that then maybe it is better not to add configuration options for controlling how temporary files are created just to solve a temporary problem (that is already fixed in this PR).

@fredrik-johansson fredrik-johansson merged commit f455752 into flintlib:trunk Aug 7, 2023
@fredrik-johansson
Copy link
Collaborator

Well done! This fix will obviously go in the next alpha or beta release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants