
Defense in Depth - Traceless Sanitization #633

Open
Tracked by #221
apyrgio opened this issue Dec 5, 2023 · 11 comments
Labels
anti-forensics Forensic Inspection Mitigations container security

Comments

@apyrgio
Contributor

apyrgio commented Dec 5, 2023

Parent issue: #221

Security Concern

One aspect of sanitization that Dangerzone has not targeted yet is avoiding leaving traces of the converted file on disk. Depending on your threat model, this may be troublesome for two reasons:

  1. It allows an attacker to uncover converted files by looking for magic numbers on the disk (see the extundelete project).
  2. It can nullify a plausible deniability defense, if there is proof that a sensitive file was at some point stored on the computer.

Current Situation

Dangerzone uses Linux commands (libreoffice, gm, pdftoppm, tesseract) for the various stages of the file conversion. Most of these commands require an on-disk file to work. Dangerzone passes files from command to command using two locations:

  1. Temporary directories created on the host, which are then mounted into the container (not applicable to Qubes)
  2. The /tmp directory in the container.

Note

A few notes on the /tmp dir of a container: this directory is not guaranteed to be backed by tmpfs, unless you pass a flag like --tmpfs / --mount type=tmpfs. Even in that case, this functionality may not be available on every platform. For example, Docker states that tmpfs mounts are only available on Linux. Also, WSL used to emulate tmpfs on disk.

Note that there are contradictory accounts on whether WSL2 backs tmpfs with RAM.

In any case, we have to verify the availability of tmpfs mounts on each platform separately.
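
As a rough sketch of such a check (the container runtime and the alpine image below are placeholder assumptions, not what Dangerzone ships), one could start a throwaway container with a tmpfs mount on /tmp and inspect what the kernel inside the container reports for that mount point:

import subprocess

def tmp_is_tmpfs(runtime: str = "docker") -> bool:
    """Start a throwaway container with --tmpfs /tmp and check whether
    /proc/mounts inside the container lists /tmp as tmpfs."""
    result = subprocess.run(
        [runtime, "run", "--rm", "--tmpfs", "/tmp", "alpine",
         "sh", "-c", "grep ' /tmp ' /proc/mounts"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip().startswith("tmpfs")

if __name__ == "__main__":
    print("tmpfs-backed /tmp:", tmp_is_tmpfs())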

Upcoming Improvements

The file passing implementation will drastically change for two reasons:

  1. We are working on integrating PyMuPDF with Dangerzone (Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622), meaning that most of the file conversion steps within the sandbox can take place in memory, instead of invoking commands.
  2. We are also working on making sandbox <-> host communication rely on standard streams (stdin, stdout), instead of mounted directories (Containers Page Streaming based on PyMuPDF #627).

Remaining Problems

These two improvements will drastically reduce the need for file passing during the conversion, but one thing remains: LibreOffice, which does not accept input from stdin.

There is a Python project called pylokit, which wraps LibreOfficeKit and calls its functions directly, without starting an external process. Unfortunately, even these bindings don't offer a way to read a document from memory: https://github.com/xrmx/pylokit/blob/abdfedbdb80ee172785cead189760a77a544045a/pylokit/lokit.py#L112
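
For reference, here is a rough sketch of how pylokit is typically used, based on its README (the LibreOffice program directory and file paths below are placeholders); note that documentLoad() only takes a filesystem path, which is exactly the limitation described here:

from pylokit import Office

# Path to LibreOffice's "program" directory; varies per distribution.
with Office("/usr/lib/libreoffice/program/") as lo:
    # documentLoad() accepts only an on-disk path -- no in-memory input.
    with lo.documentLoad("/tmp/input.docx") as doc:
        doc.saveAs("/tmp/output.pdf")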

So, we have to accept that we will create a file in order to use LibreOffice, and our main line of defense will be storing this file in a tmpfs mount. However, we have to account for the cases where this is simply not available on a platform.

Suggestion

Assuming that an encrypted FUSE filesystem within the container is prohibitively complex and time-consuming, we have one more option. We can "shred" the file after LibreOffice has used it, i.e., overwrite the disk blocks where the file was stored with random data. See some existing projects:

Note that this does not offer 100% protection. As @legoktm pointed out, modern filesystems and SSDs are shred-resistant.

Note

@legoktm has pointed out that SecureDrop already uses this approach, when deleting submissions: https://github.com/freedomofpress/securedrop/blob/develop/securedrop/rm.py
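
To make the idea concrete, here is a minimal, best-effort sketch of in-place overwriting (this is not the SecureDrop or GNU shred implementation; the pass count and error handling are simplified, and the caveats about modern filesystems and SSDs above still apply):

import os
import secrets

def shred(path: str, passes: int = 1) -> None:
    """Overwrite a file's contents in place with random data, then unlink it.

    Best effort only: journaling / copy-on-write filesystems and SSD wear
    leveling can keep stale copies of the data in blocks we never touch.
    """
    size = os.path.getsize(path)
    with open(path, "r+b", buffering=0) as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            os.fsync(f.fileno())
    os.remove(path)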

@apyrgio
Contributor Author

apyrgio commented Dec 5, 2023

My take on shred-resistant filesystems and devices, in the context of Docker Desktop, is this:

  1. On Linux, we can rest assured that tmpfs mounts will work, so the discussion does not apply there.
  2. On Windows / macOS, the underlying filesystem is chosen by Docker Desktop / WSL2. My understanding is that it's ext4, but we should double-check on these platforms. ext4 filesystems with default options (i.e., without data journaling) should be shreddable.
  3. Even with shredding, traces of a file can remain on disk, e.g., due to wear leveling or bad blocks. However, shredding considerably raises the bar for the attack, in terms of physical access to the disk and expertise to retrieve the data.

That's my understanding so far, but if anyone has more experience in this, please chip in!

@deeplow
Contributor

deeplow commented Dec 5, 2023

On Linux, we can rest assured that tmpfs mounts will work, so the discussion does not apply there.

This is not so clear-cut. As we saw, Ubuntu didn't have /tmp mounted as tmpfs.

@apyrgio
Contributor Author

apyrgio commented Dec 5, 2023

That's because we currently don't use any of the --tmpfs / --mount type=tmpfs options. My understanding is that once we pass one of these flags, it will work.

Might be worth exploring whether /dev/shm is a cross-platform alternative, by the way, so that we don't have to worry about whether /tmp is RAM-backed or not. We may avoid swapping the files to disk in this case as well.

@deeplow
Contributor

deeplow commented Dec 5, 2023

That's because we currently don't use any of the --tmpfs / --mount type=tmpfs options. My understanding is that once we pass one of these flags, it will work.

I meant on the host, not in the containers. But that would be good for the containers.

@EtiennePerot
Contributor

EtiennePerot commented Jun 13, 2024

I have filed google/gvisor#10530, which would address this. No promises on practical feasibility yet, but I do think that with the document processing pipeline running in gVisor, it becomes easier to systematically guarantee that this pipeline runs entirely in unswappable memory. A gVisor tmpfs mount isn't a tmpfs mount from the host's perspective. Instead, a gVisor tmpfs mount is backed by runsc application memory. The same is true for the processes running inside the sandbox (LibreOffice, etc.). Therefore, runsc could (at least theoretically) mlock all of its own memory to ensure it cannot be paged out to swap.

@apyrgio
Contributor Author

apyrgio commented Jun 13, 2024

Oh. I hadn't realized that gVisor emulates various filesystem types. That's really awesome.

As for the mlock() solution, it does seem the most sensible one. However, I just realized that there's another upcoming feature which will complicate things here: we will soon move the second phase of the conversion (pixels to PDF) to the host (#625).

The good thing about this move is that we will no longer mount pixel data as files, and therefore we can do the reconstruction of the PDF in-memory. The bad thing is that, in order to leave no traces, we would have to do some memory management tricks (e.g., mlock()) in a cross-platform way, on Windows and macOS.

We could perhaps see how cross-platform programs like GnuPG protect their keys from being swapped to the disk. Then, along with the proposed gVisor safeguard, we can have a solid solution to this issue.

@EtiennePerot
Contributor

EtiennePerot commented Jun 14, 2024

Oh. I hadn't realized that gVisor emulates various filesystem types. That's really awesome.

I want to point out that if this wasn't the case, then it would be impossible to specify tmpfs mounts in the OCI spec given to runsc. I say this because the statement "gVisor emulates a substantial portion of the Linux syscall ABI" is sometimes misinterpreted to mean "... and the rest is just passed through to the host Linux kernel", but actually it is the opposite. Nothing is passed to the host kernel. If gVisor doesn't implement Linux feature X (where "X" could be tmpfs but really any Linux kernel feature), then you can't use feature X in gVisor at all. It is its own independent kernel.

I just realized that there's another upcoming feature which will complicate things here: we will soon move the second phase of the conversion (pixels to PDF) to the host (#625).

Right... But as long as this conversion happens within memory that is in the Dangerzone process's own address space, then it too can call mlock. The only risk is when one needs to execute subprocesses (or really anytime a fork() happens), as the mlockedness of memory pages isn't preserved across fork()s. Therefore, anything that executes in a subprocess, or any library that spawns a subprocess, becomes a potential leaky vector unless it too is made to mlock its own address space before doing memory write operations (and recursively so with any fork() that it calls in turn, of course).

One way to ensure that may be to impose a seccomp-bpf filter on the Dangerzone process that blocks use of the fork, clone, clone2, clone3, and munlock syscalls. This way it would be obvious if it (or any of its libraries) does any forking of the address space, and (unlike just doing a one-off strace) it would catch the case where one of the libraries it depends on is updated to use one of these syscalls later down the line. This would be a Linux-only solution, but it is probably fair to assume that PyMuPDF and other such libraries would work mostly in the same manner on other platforms as well.
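
As a sketch of what such a filter could look like (assuming the libseccomp Python bindings, i.e. the seccomp module from python3-seccomp, are available; the syscall list is illustrative, not a vetted policy):

import errno
import seccomp

def forbid_forking_and_munlock() -> None:
    """Install a seccomp-bpf filter that makes fork/clone-style syscalls and
    munlock/munlockall fail with EPERM, so any library that tries to fork or
    unlock memory fails loudly instead of silently undermining mlock."""
    f = seccomp.SyscallFilter(defaction=seccomp.ALLOW)
    for name in ("fork", "vfork", "clone", "clone3", "munlock", "munlockall"):
        try:
            f.add_rule(seccomp.ERRNO(errno.EPERM), name)
        except RuntimeError:
            # Syscall name unknown on this architecture / libseccomp version.
            pass
    f.load()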

If any of the pixel-to-PDF conversion pipeline does require fork() even after #622, then it may be necessary to keep the two-container approach with both running in gVisor (assuming google/gvisor#10530 is implemented). It would still be possible to gain the benefits from #622 with this approach, as the program running within the second container can be just a minimal Python program that runs the same code doing the PyMuPDF and Tesseract-OCR magic as would otherwise run on the host. Even with this approach, the parent Dangerzone process coordinating all of this would still need to do its own mlocking, since it contains the pixel data in its own address space; but at least it wouldn't need to worry about the implementation details of PyMuPDF.

We could perhaps see how cross-platform programs like GnuPG protect their keys from being swapped to the disk.

I believe the relevant file is this one. It calls mlock on platforms that have it (this includes Mac OS X). On Windows, well, there's a comment about that on lines 355-356... Although a quick search reveals that Windows does have an API for this.

@apyrgio
Contributor Author

apyrgio commented Jun 25, 2024

Just wanted to point out that all the above gives us a lot of food for thought for when we decide to tackle this issue.

One question that immediately arises from the above is: if you mlock() a region, but a library you're using decides to create a copy internally, e.g., for its own processing, won't that evade the mlock() protection? That applies to the second-container case as well, I believe.

A scenario I'm thinking of: suppose we run PyMuPDF within a gVisor sandbox, and pixel data manipulation takes place in Python's heap memory. This memory is handled by gVisor, which would have to unconditionally mlock() all of the process's pages if we want to make sure that a memory region won't be swapped out. That may prove expensive, both in terms of performance and memory.

On Windows, well, there's a comment about that on lines 355-356...

😬

@EtiennePerot
Contributor

The mlockall(2) system call takes a set of flags that help with this problem:

       MCL_CURRENT
              Lock all pages which are currently mapped into the address
              space of the process.

       MCL_FUTURE
              Lock all pages which will become mapped into the address
              space of the process in the future.  These could be, for
              instance, new pages required by a growing heap and stack
              as well as new memory-mapped files or shared memory
              regions.

       MCL_ONFAULT (since Linux 4.4)
              Used together with MCL_CURRENT, MCL_FUTURE, or both.  Mark
              all current (with MCL_CURRENT) or future (with MCL_FUTURE)
              mappings to lock pages when they are faulted in.  When
              used with MCL_CURRENT, all present pages are locked, but
              mlockall() will not fault in non-present pages.  When used
              with MCL_FUTURE, all future mappings will be marked to
              lock pages when they are faulted in, but they will not be
              populated by the lock when the mapping is created.
              MCL_ONFAULT must be used with either MCL_CURRENT or
              MCL_FUTURE or both.

Thus it's possible to lock memory ahead of time without having to fault it all in at the time the call is made. It's possible to do so for the entire address space (i.e., all current and future memory of a process) using mlockall(2).

Therefore, if a Python program calls mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT), and later some Python library decides to allocate some new memory and to copy data to this new memory, that new memory will already be mlock'd without the library needing to be aware of it. There is no additional system call that has to be issued at the time this new memory is allocated. So I don't think there is any additional cost, whether CPU or memory.
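
For illustration, here is a sketch of issuing that call from Python via ctypes (Linux-only; the MCL_* values are the Linux constants from <sys/mman.h>, and the process needs a sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK for the call to succeed):

import ctypes
import ctypes.util
import os

MCL_CURRENT = 1
MCL_FUTURE = 2
MCL_ONFAULT = 4  # Linux >= 4.4

def lock_all_memory() -> None:
    """Ask the kernel to keep all current and future pages of this process
    out of swap, locking them lazily as they are faulted in."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), "mlockall")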

@apyrgio
Contributor Author

apyrgio commented Jun 26, 2024

Thanks a lot for the explanation Etienne. I think we have a reasonable path forward here, once we decide to implement this feature 🙂

@EtiennePerot
Contributor

EtiennePerot commented Jun 29, 2024

Per my update on the gVisor bug for a fully-mlocked mode, I believe I've confirmed that such a mode is possible to implement in gVisor. However, properly implementing it only makes sense if it is useful for Dangerzone, meaning only if the rest of the Dangerzone application can also run in such a mode. Below are some notes from trying to see how practical that is.

I tried to run Dangerzone under strace using rm -rf /tmp/dangerzone.trace; mkdir /tmp/dangerzone.trace; strace -ff --output=/tmp/dangerzone.trace/trace dev_scripts/dangerzone-cli tests/test_docs/sample-doc.doc. From the generated files, there are three typical "syscall signatures" (seen by running head /tmp/dangerzone.trace/*):

  • The first one is the main Python interpreter, no surprises here.
  • The second one (the vast majority) appears to be some type of thread that mostly calls futex and nanosleep. It is probably doing some synchronization, perhaps for I/O operations?
  • The third one is the expected calls from subprocess.Popen to the container runtime (podman). Running grep -P '^f?exec.*\(' /tmp/dangerzone.trace/* makes them obvious. These are benign in the context of this bug since all they do is call execve shortly after starting.

In order to see where the Python code tries to fork, I added this to the top of dev_scripts/dangerzone:

import threading
threading.Thread = None

... and it crashed on this spot in dangerzone/logic.py:

    def convert_documents(
        self, ocr_lang: Optional[str], stdout_callback: Optional[Callable] = None
    ) -> None:
        def convert_doc(document: Document) -> None:
            self.isolation_provider.convert(
                document,
                ocr_lang,
                stdout_callback,
            )

        max_jobs = self.isolation_provider.get_max_parallel_conversions()
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_jobs) as executor:
            executor.map(convert_doc, self.documents)

I serialized the function body by replacing the last three lines with:

        for document in self.documents:
            convert_doc(document)

... and after this change there were no crashes, so we can conclude that this is probably the only "Python-code-initiated" point where the code willingly forks (as opposed to Python-runtime-initiated forks). But even with this change, there are still lots of threads created. So the Python runtime still decides it needs to fork for some reason. This means the approach I had suggested to self-sandbox the Dangerzone application in a seccomp-bpf filter forbidding fork/clone syscalls may not work as a means to enforce that the application memory remains mlocked.

The next step here is to see why the Python runtime decides to fork, and whether these threads are actually touching any memory that is sensitive or doing I/O on the documents being converted.
