Unexpected behavior in mmap causing issues with rpm #3939

sirredbeard · 2019-03-26T16:04:51Z

Summary:

This is the result of several weeks of collaboration between the Pengwin team, WSL community members, outside experts we retained, and partners at Oracle on a bug affecting Red Hat Package Manager (rpm) on WSL, in CentOS, RHEL, Fedora (partial), Oracle Linux, and Scientific Linux. Those efforts are extensively documented here. After working through several theories and workarounds, we believe we have narrowed the issue down to unusual mmap handling on WSL affecting the implementation of Berkeley DB inside rpm. The unusual mmap behavior is consistent with our workarounds and mitigations, so we have medium to high confidence this is the issue. A handful of occasionally vague mmap issues have been reported here before (see #902 and #658) that are not quite on point, the closest being #2852. Because this issue affects a broad array of distros on WSL, we appreciate Microsoft's attention to it.

Your Windows build number: 17134 and 17763.
What you're doing and what's happening:

Install an rpm-based distro on WSL, e.g. WLinux Enterprise with Scientific Linux.
Set root password and create new default user.
su - into root.
Example 1:

[root@t470s ~]# rpm -q rpm
rpm-4.11.3-35.el7.x86_64
[root@t470s ~]# rpm --rebuilddb
[root@t470s ~]# rpm -q rpm
Segmentation fault (core dumped)

Example 2:

[hayden@t470s ~]$ sudo rm -rf /var/lib/rpm/__db*
[hayden@t470s ~]$ db_verify /var/lib/rpm/Packages
BDB5105 Verification of /var/lib/rpm/Packages succeeded.
[hayden@t470s ~]$ sudo rpm --rebuilddb
[hayden@t470s ~]$ sudo yum update
Loaded plugins: ovl
Segmentation fault (core dumped)

What should be happening:

We would expect rpm --rebuilddb to rebuild a working rpmdb.

What's wrong:

Running rpm --rebuilddb breaks rpmdb.

"When the underlying file is extended, the extended part of the mapping is actually mapped back to the beginning of the file. This is why BDB would crash when it extended the size of the file that backed the in-memory cache (one of the __db.### files), and why setting the cache size to a small value works as a work around" - Dr. Lauren Foutz, Oracle

C code which replicates the issue:
mmap_extend.c.txt
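
For readers without the attachment, here is a minimal sketch of the kind of check described in the quote above. It is not the contents of mmap_extend.c.txt; the function name check_mmap_extend, the file sizes, and the marker bytes are assumptions made for illustration. It maps a region twice the size of the file, extends the file by writing its last byte, and then checks that the extended part of the mapping shows the extended part of the file rather than the beginning of it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the attached test case.
 * Returns 0 if the mapping stays consistent with the file after the file
 * is extended, 1 if it does not, and -1 on a setup error. */
int check_mmap_extend(const char *path)
{
    const size_t half = 512 * 1024;   /* initial file size */
    const size_t total = 2 * half;    /* mapping size: 1MB */
    unsigned char *p;
    unsigned char last = 'Z';
    unsigned char via_file = 0;
    int result = -1;
    int fd;

    /* The extended half must start on a page boundary. */
    assert(half % (size_t)sysconf(_SC_PAGESIZE) == 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Start with a file half as large as the mapping. */
    if (ftruncate(fd, (off_t)half) == 0) {
        p = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'A';   /* mark the beginning of the file */

            /* Extend the file to the full mapping size by writing its last byte. */
            if (pwrite(fd, &last, 1, (off_t)(total - 1)) == 1) {
                p[half] = 'B';   /* first byte of the extended part */

                /* Read the same offset back through the file descriptor. */
                if (pread(fd, &via_file, 1, (off_t)half) == 1) {
                    result = (p[0] == 'A' && p[half] == 'B' &&
                              via_file == 'B' && p[total - 1] == 'Z') ? 0 : 1;
                }
            }
            munmap(p, total);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

Calling check_mmap_extend("/tmp/mmap_extend_check.bin") from a small main() should return 0 on a native Linux kernel. If the extended part is mapped back to the beginning of the file, as the quote describes, the marker checks would be expected to disagree and the function would return 1.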

Strace of the failing command, if applicable:

See strace, procmon, and etl files here.

For WSL launch issues, please collect detailed logs.

See strace, procmon, and etl files here.

Notes:

This issue does not affect OpenSUSE's implementation of rpm because of a unique rpm implementation in that distro.


therealkenc · 2019-03-26T18:39:45Z

Likely going to be the underlying cause behind #3742 as well, with a different but related repro. It is also probably causing Ben's observation in #3451 (message).

therealkenc · 2019-03-28T19:21:21Z

Wonderbar. But I've got to ask or it is going to nag at me. The following from Hayden's test case:

    addr1 = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fid, 0);
    /* Extend the file, write the last byte */
    printf("Extend the file to 1MB\n");
    lseek(fid, total_size - 1, SEEK_SET);
    write(fid, buf, sizeof(buf[0]));

... is semantically the same as...

    addr1 = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fid, 0);
    /* Extend the file */
    ftruncate(fid, total_size);

...which is #902. Which was allegedly impossible because:

I've looked into this and #902 and they share the commonality that both rely on functionality that the Windows kernel does not support which makes the scope of this fix a bit larger.

I'll take the win. But either (a) #902 just got fixed too or (b) I'm missing something stupid.

Brian-Perkins · 2019-03-28T21:11:21Z

When a file is mapped, we create a section with that file as the backing store. In NT, when you have a file-backed section you can't make the file smaller -- the section is essentially locking those file "pages". That is causing the #902 problem and is the rationale behind the cited comment. In the current scenario, where mmap is bigger than the file size, that is supported, and the only interesting bit is when the actual view of the extended part of the file is mapped. Currently this is done on-demand via guard pages, after verifying that the file is appropriately sized.
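
To make the contrast with the #902 case concrete, here is a small sketch of the shrinking direction: a file is mapped in full and then made smaller while the mapping is still live. The function name shrink_while_mapped is hypothetical, and the sizes (8192 and 16 bytes) are illustrative values borrowed from the discussion in this thread, not taken from the #902 test case itself.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper for illustration only.
 * Returns 0 if the file was shrunk while mapped and reports the new size,
 * the errno value if ftruncate() refused, and -1 on any other error. */
int shrink_while_mapped(const char *path)
{
    const off_t old_len = 8192;   /* size when the file is mapped */
    const off_t new_len = 16;     /* size requested while still mapped */
    struct stat st;
    void *p;
    int result = -1;
    int fd;

    assert(new_len < old_len);    /* this sketch only covers shrinking */

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, old_len) == 0) {
        /* Map the whole file, which creates the file-backed section on WSL. */
        p = mmap(NULL, (size_t)old_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            /* Shrink the file while the mapping is still in place. */
            if (ftruncate(fd, new_len) != 0)
                result = errno;
            else if (fstat(fd, &st) == 0 && st.st_size == new_len)
                result = 0;
            munmap(p, (size_t)old_len);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

On a native Linux kernel this should return 0. On an affected WSL build, where the section pins the file pages as described above, the second ftruncate() would be expected to fail and the function would return the resulting errno.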

therealkenc · 2019-03-28T22:39:23Z

In the current scenario, where mmap is bigger than the file size, that is supported

Thanks much. Makes sense.

In the #902 test case, maybe instead of actually truncating, make a note locally in the WSL VFS that the new file size is 16 bytes (not 8192 bytes), make a note that the current high water mark is 8192, and don't call down to NT at all. In other words, lie to userspace that you've shrunk the file. A read() beyond the noted size 16 can return 0 bytes as expected. If a subsequent write()/ftruncate() enlarges the file beyond 8192 bytes, extend like you are doing here. Fix up the noted file size on munmap()/close() if the file really did shrink in the end.

If that means the mapped pointer still has valid pages containing ????.... characters from offset 16 through 8191 even after the ftruncate(fd, 16), everyone can live with that. It beats an ftruncate() EINVAL any day of the week, and no working program will poke those addresses after the truncate anyway. [n.b. I assume if this was straightforward or would actually work you'd have done it already, but I'll throw it at the wall anyway.]
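
The bookkeeping suggested above can be sketched as a toy model. Everything here is hypothetical: the struct shadow_size and the shadow_* functions are names invented for illustration and say nothing about how the WSL VFS is actually structured. The sketch only tracks two numbers, the size reported to userspace and the high water mark of the backing file, and reports when a real call down to NT would be needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one mapped file. */
struct shadow_size {
    size_t logical;     /* size reported to userspace */
    size_t high_water;  /* size the backing file really has */
};

/* Record a truncate/extend request.
 * Returns 1 if the backing file must really be extended, 0 if only the
 * bookkeeping changed (the "lie to userspace" case for a shrink). */
int shadow_truncate(struct shadow_size *s, size_t new_size)
{
    s->logical = new_size;
    if (new_size > s->high_water) {
        s->high_water = new_size;
        return 1;
    }
    return 0;
}

/* Number of bytes a read() of `want` bytes at `offset` may return:
 * reads at or beyond the logical size return 0 bytes. */
size_t shadow_read_len(const struct shadow_size *s, size_t offset, size_t want)
{
    size_t avail;

    if (offset >= s->logical)
        return 0;
    avail = s->logical - offset;
    return want < avail ? want : avail;
}

/* Called on munmap()/close(). Returns 1 and stores the final size if the
 * backing file must now really be shrunk, 0 if nothing needs fixing up. */
int shadow_finish(struct shadow_size *s, size_t *final_size)
{
    assert(s->logical <= s->high_water);
    if (s->logical < s->high_water) {
        *final_size = s->logical;
        s->high_water = s->logical;
        return 1;
    }
    return 0;
}
```

With a file that starts at 8192 bytes, shadow_truncate(&s, 16) only updates the bookkeeping, a read at offset 16 yields 0 bytes, and shadow_finish() reports that the file should be shrunk to 16 bytes once the mapping is gone.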

Brian-Perkins · 2019-05-01T21:11:32Z

Fixed in Windows Insider Build 18890

ytrezq · 2019-05-07T23:12:33Z

@Brian-Perkins : unfortunately no.

therealkenc · 2019-05-07T23:35:06Z

WiredWonder · 2019-08-05T05:52:28Z

@therealkenc what is the chance of this fix being added to a 18362 servicing build? Or do we need to wait until next year / WSL2 to get a fix in a Production release?

GigabyteProductions · 2019-08-05T07:04:54Z

We'd like to see this fixed in WSL1 because we use VMware and VirtualBox on our workstations.

therealkenc · 2019-08-06T18:40:44Z

@therealkenc what is the chance of this fix being added to a 18362 servicing build?

I don't have control over that sort of thing, but historically speaking, chances low to nonexistent. It should appear in the Fall 2019 release tho. You won't have to wait until 20H1 or WSL2.

birbird · 2019-11-20T04:32:04Z

I need this fix. How can I upgrade to 18890? Preferably in a way that keeps my subsystem data.

therealkenc · 2020-06-03T03:57:44Z

It should appear in the Fall 2019 release tho.

^--- pretty sure it didn't make 19H2 aka 1909 (18363 < 18890) but 2004 should be good.

pdinc-oss · 2023-01-04T04:17:55Z

It should appear in the Fall 2019 release tho.

^--- pretty sure it didn't make 19H2 aka 1909 (18363 < 18890) but 2004 should be good.

Not seeing it in Server 2019 yet, is it being applied there?

sirredbeard mentioned this issue Mar 26, 2019

rpm --rebuilddb causes all future rpm functions to segfault/yum to hang WhitewaterFoundry/Pengwin-Enterprise#20

Closed

Brian-Perkins added the bug label Mar 26, 2019

benhillis added the fixinbound label Mar 28, 2019

therealkenc mentioned this issue May 7, 2019

WSL: ftruncate does not truncate mmap'd files #902

Closed

therealkenc added fixedininsiderbuilds and removed fixinbound labels May 7, 2019

therealkenc mentioned this issue May 14, 2019

BoltDB panics on cursor search since April update #3162

Closed

microsoft deleted a comment from rfikki May 15, 2019

SjonHortensius mentioned this issue May 16, 2019

panic: invalid page type: 0: 4 prysmaticlabs/prysm#2543

Closed

sirredbeard mentioned this issue Jun 19, 2019

dnf update does not release rpm lock WhitewaterFoundry/Fedora-Remix-for-WSL#40

Open

therealkenc mentioned this issue Aug 27, 2019

mremap breaks on MREMAP_MAYMOVE #4445

Closed

sirredbeard mentioned this issue Oct 10, 2019

Segmentation fault when calling rpm -qp #4587

Closed

therealkenc mentioned this issue Feb 6, 2020

mmap fails with EINVAL, Firefox reliably crashes #4873

Closed

therealkenc closed this as completed Jun 3, 2020

therealkenc added fixedinreleasebuild and removed bug fixedininsiderbuilds labels Jun 3, 2020
