
Encrypted blob store reuse DEK #53352

Conversation

@albertzaharovits (Contributor) commented Mar 10, 2020

EDITED 20.03.20: the repository secret is a textual password, not a binary AES key

Follows the approach described in #50846 (comment)

This builds upon the data encryption streams from #49896 to create an encrypted snapshot repository.
The repository encryption works with the following existing repository types: FS, Azure, S3, GCS (it possibly works with HDFS and URL as well, but these are not tested).
The encrypted repository is protected by a password stored in every node's keystore. The repository keys (KEKs, key encryption keys) are derived from the password using the PBKDF2 function, and are used to encrypt (using the AES key wrap algorithm) other symmetric keys (referred to as DEKs, data encryption keys), which are themselves used to encrypt the blobs of the regular snapshot.
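As a rough sketch of this envelope encryption, in code (the class and method names are illustrative, not the PR's actual code):

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

class EnvelopeEncryptionSketch {
    // generate a fresh DEK and wrap it with the KEK; the wrapped bytes are
    // what gets stored in the repository, never the plain DEK
    static byte[] generateAndWrapDek(SecretKey kek) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey dek = keyGen.generateKey(); // encrypts the snapshot blobs
        Cipher wrapper = Cipher.getInstance("AESWrap");
        wrapper.init(Cipher.WRAP_MODE, kek);
        return wrapper.wrap(dek);
    }
}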

The platinum license is required to snapshot to the encrypted repository, but no license is required to list or restore already encrypted snapshots.

Example of how to use the Encrypted FS Repository:

  • pull down this branch and assemble the distribution (./gradlew :distribution:archives:no-jdk-darwin-tar:assemble)
  • similarly to the un-encrypted FS repository, specify the mount point of the shared FS in the elasticsearch.yml conf file (on all the cluster nodes), eg: path.repo: ["/tmp/repo"]
  • make sure the cluster runs under a trial license (the simplest configuration is to put xpack.license.self_generated.type: trial in the elasticsearch.yml file)
  • store the repository password inside the elasticsearch.keystore, on every cluster node, eg for the test_enc_pass repository password name:
./bin/elasticsearch-keystore add repository.encrypted.test_enc_pass.password
  • start up the cluster, and create the new encrypted repository, eg:
curl -X PUT "localhost:9200/_snapshot/test_enc?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "encrypted",
  "settings": {
    "location": "/tmp/repo/enc",
    "delegate_type": "fs",
    "password_name": "test_enc_pass"
  }
}
'
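
Once the repository is created, snapshot and restore work as with any other repository, eg (the snapshot name is a placeholder):

curl -X PUT "localhost:9200/_snapshot/test_enc/snap_1?wait_for_completion=true&pretty"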

Relates #49896
Relates #41910
Obsoletes #50846 #48221

@albertzaharovits albertzaharovits self-assigned this Mar 10, 2020
@albertzaharovits (Contributor Author)

In addition to what has been recently discussed over at #50846, this PR contains the following two important changes. I am confident that they are good choices, but I still want to raise them for discussion before this PR is ready for review (integration tests are a drag):

  • instead of the repository password, the actual AES key is stored in the keystore (binary format). Using a textual password requires a salt value, which must be generated by a secure random source, in order to derive the actual repository key (KEK). It is rather complicated to broadcast the salt during repository verification, and storing the salt under its own blob adds other complications. I know this might detract from the convenience of admins who prefer text, but the key can be generated, starting from the text, with openssl enc -salt -aes-256-cbc -pass pass:secret -P (the salt must now also be managed by the admin).
  • creating the repository requires a new setting, key_name, in addition to delegate_type. This is done in anticipation of the procedure to rotate the repository key. With this setting, rotating the key is very simple, although quite long: 1) add a new secure key in the keystore, 2) reload the keystore, 3) update the repository to read_only, 4) call a new to-be-added API which decrypts and re-encrypts the DEKs using the newly added key, 5) update the repository from the old key name to the new one, and remove the read_only status (see the sketch after this list). Lastly, we could default key_name to the repository name, but changing it would still be necessary for rotation.
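
As a rough illustration of steps 3 and 5 (a sketch only: the key_name setting follows this proposal, the readonly flag is the standard repository setting, and the rotation API from step 4 does not exist yet):

# 3) mark the repository read-only, still pointing at the old key
curl -X PUT "localhost:9200/_snapshot/test_enc" -H 'Content-Type: application/json' -d'
{ "type": "encrypted", "settings": { "delegate_type": "fs", "location": "/tmp/repo/enc", "key_name": "old_key", "readonly": true } }'

# 5) switch to the new key name and clear the read-only status
curl -X PUT "localhost:9200/_snapshot/test_enc" -H 'Content-Type: application/json' -d'
{ "type": "encrypted", "settings": { "delegate_type": "fs", "location": "/tmp/repo/enc", "key_name": "new_key", "readonly": false } }'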

@tvernum Could you please think about these two issues and let me know your thoughts?

@albertzaharovits (Contributor Author)

Here is my justification for putting the actual AES key in the elasticsearch.keystore, as opposed to a textual password, as has been the case until now. In a gist, supporting a textual password is both more complex to code and inherently less secure.

The chief reason is that it simplifies the code. It's not a great saving, but this is a sensitive area of the code.
As a reference point, here is how it all works with the repository AES KEK (as a key, not a password). The repository KEK is used to AES-wrap all the DEKs. All wrapped DEKs are stored in a single blob container, each under its own blob, where the blob container is named after the KEK Id. The KEK Id is the wrapping of a known, fixed plaintext. (This all works because AES key wrap has no limit on the number of invocations (no nonce) and its ciphertext is deterministic. Moreover, even if it is not a design goal, AES-wrapping the same plaintext with different keys should produce unrelated pseudorandom permutations.) The DEK Id has no semantics: it is a randomly generated UUID, prepended to each encrypted blob. The blob storing the wrapped DEK is named after the DEK Id (and, again, sits under the KEK Id blob container).
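
A minimal sketch of the KEK Id computation just described, assuming the fixed plaintext is itself valid AES key material (the constant and class names are illustrative):

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

class KekIdSketch {
    // hypothetical fixed plaintext; it must be valid AES key material so
    // that it can be passed to Cipher#wrap
    static final byte[] FIXED_PLAINTEXT = new byte[32];

    static byte[] kekId(SecretKey kek) throws Exception {
        Cipher cipher = Cipher.getInstance("AESWrap");
        cipher.init(Cipher.WRAP_MODE, kek);
        // AES key wrap is deterministic, so the same KEK always yields the
        // same id, and different KEKs yield unrelated ids
        return cipher.wrap(new SecretKeySpec(FIXED_PLAINTEXT, "AES"));
    }
}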

To support textual passwords, more code has to be added, in addition to what has been described above.
A textual password requires a randomly generated nonce, which is used to derive the key (using the PBKDF2 algorithm). In this sense, a password can be seen as a seed for a family of encryption keys. Decryption, during restore, must indicate which key from the family to use. To do this, we can add semantics to the DEK Id, so that part of the DEK Id indicates the nonce used to generate the wrapping KEK, and the other part identifies the actual blob of the DEK (as before). The KEK Id, which identifies the blob container storing all the wrapped DEKs, must be generated from a hardcoded member of the password's family of KEKs. In other words, to generate the KEK Id (which identifies where to store and retrieve wrapped DEKs), use a hardcoded nonce and a hardcoded plaintext: use the nonce and the password to generate an AES key, which is then used to AES-wrap the hardcoded plaintext.
Besides being complicated, the hardcoded nonce exposes the password to brute forcing, and there is no convenient way around it. There is simply not enough entropy in the password to generate an identifier for it, so that it can be stored on the storage service. An identifier for the repository secret (be it password or key) is required during secret rotation. Without going into too much detail about secret rotation, the repository secret Id is used when, at some point, the repository must be informed that it should switch from using one set of wrapped DEKs to a different set, which has been wrapped with the new repository secret. The only secure way around it is to not use an identifier derived from the repository secret at all, and instead add another repository setting which identifies the blob container holding the set of wrapped DEKs.
Therefore, using a password requires adding an extra repository setting, which detracts from the convenience benefits of simpler configuration.

I am willing to change my mind and adopt the textual password again if others believe it is a very important convenience feature, but for now I will continue with the repository secret key implementation if there are no strong objections to it.

CC @tvernum (the "others" in the above actually refers to you 😃 )

@nachogiljaldo do you foresee any difference one way or the other for encrypted snapshots in Cloud? I believe that in your case repository secrets, textual or binary, must both be generated randomly, so it shouldn't matter.

@nachogiljaldo

IIUC this shouldn't make a big difference on our end.

@tvernum (Contributor) commented Mar 18, 2020

@albertzaharovits

Let me summarise to make sure I'm understanding you correctly.

  • Supporting a password in the keystore means pushing it through a KDF to produce a key, that would be the KEK
  • KDFs require a salt, which should be random
  • Therefore, a password on its own is insufficient to represent a key; you need a salt as well.
  • You can't ask the user for the salt, because that's not random
  • You can't store the salt in cluster state because then you can't restore into an empty cluster
  • So the salt needs to be stored in the snapshot data.

At that point I'm a bit lost.
You have an "id" for a KEK, so it seems like that id could be "version n, of the secure setting named x, KDF'ed using y, salted with z"
Then when you write out a DEK, you would include that metadata to indicate which KEK it is encrypted with.
I don't follow why that creates a big complication.

That said, progress over perfection. If it is substantially easier to deliver this with raw AES keys in the keystore, then let's do that.

@albertzaharovits (Contributor Author)

Thank you for thinking through it @tvernum !

Let me summarise to make sure I'm understanding you correctly.

  • Supporting a password in the keystore means pushing it through a KDF to produce a key, that would be the KEK
  • KDFs require a salt, which should be random
  • Therefore, a password on its own is insufficient to represent a key; you need a salt as well.
  • You can't ask the user for the salt, because that's not random
  • You can't store the salt in cluster state because then you can't restore into an empty cluster
  • So the salt needs to be stored in the snapshot data.

That's correct. I glossed over it, but this is indeed the core of the issue.

Next, overall I think there are two ways to look at it: either there is a single salt value, stored in the repository on the storage service, OR there are several salt values and the encrypted blobs must point to the one that was used in their case. My involved description above covered the second option, because I think (but I'm not sure) that it's the simpler one to implement.

You have an "id" for a KEK, so it seems like that id could be "version n, of the secure setting named x, KDF'ed using y, salted with z"
Then when you write out a DEK, you would include that metadata to indicate which KEK it is encrypted with.
I don't follow why that creates a big complication.

This discussion got me thinking some more, and I tried to formalize the problem.
Here are the main points:

  • we can prepend anything to the encrypted blob, to help identify the associated wrapped DEK
  • there can be multiple wrapped versions of the same DEK, each wrapped by a different KEK
  • given a KEK and a DEK Id, we should be able to find the wrapped DEK and unwrap it
  • the repository KEK must be derived from the repository secret in the keystore
  • the repository setting in the cluster state must identify the keystore secret used to derive the KEK (which is used to unwrap all the DEKs associated with the encrypted blobs)
  • the repository settings in the cluster state must be updateable in such a way as to allow switching the repository secret

That being said, I think you're right, and there is enough to be able to obtain the KEK from a textual password by using the random "name" of the DEK as a nonce. I want to try to sketch the code for it, to convince myself first, and then I'll post an update.

@original-brownbear original-brownbear self-requested a review March 20, 2020 12:03
@albertzaharovits (Contributor Author)

@tvernum I've reverted back to using a password as the repository secret. It turns out we can use the DEK Id, which is generated randomly, as a nonce to compute the KEK from the repository password. This way, every DEK is wrapped by a different KEK. Generating a KEK from the password is expensive, but given the strategy of reusing DEKs, it should amortize to the point of being irrelevant in practical use cases. The changes required to move from the raw key to the password are all encapsulated in the 406a722 commit (it's not a big deal, we can switch back if we have second thoughts).

From a security standpoint, weak passwords are vulnerable to brute force, because snapshots contain blobs with known plaintext, and PBKDF2, being a glorified hash, effectively turns the encrypted known plaintext into a salted hash. It's easy to see an example of this if you look into the code at how the blob name of the wrapped DEK is generated: the random DEK Id is used as the salt for the PBKDF2 function that generates the KEK, and then the KEK is used to "wrap" a known plaintext. In essence, AES keys are "vulnerable" to the same thing, but brute force is not a meaningful threat against 32 random bytes. Therefore textual passwords are just as secure as AES keys if they are 32 bytes long and generated randomly.

I'm still leaning towards raw AES keys, if not for avoiding the PBKDF2 code shenanigans, then at least for the implicit requirement that the repository secret must be unguessable (i.e. random) and long. If, however, we stick with passwords, maybe we should rename the setting to passphrase and specify a minimum length (currently there is none).
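
For reference, a passphrase with entropy equivalent to a raw 256-bit key can be generated with standard tooling, eg:

openssl rand -base64 32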

@original-brownbear (Member) left a comment

Just some (now partly outdated) comments that we mostly already covered on another channel just now

protected final String createRepository(final String name, final Settings settings) {
final boolean verify = randomBoolean();
protected final String createRepository(final String name) {
return createRepository(name, Settings.builder().put(repositorySettings()).put(repositorySettings(name)).build(), randomBoolean());
@original-brownbear (Member):

Why do we want to randomly turn off verification in the general case? It seems that takes away from our chances to catch some IO issues here and there? Maybe we should make this always verify and selectively turn off verification where we really need it?

@albertzaharovits (Contributor Author):

I honored the original behavior, but you're right, it should be true unless specified otherwise. I've made this change (verification is now true instead of randomBoolean).

@@ -81,13 +83,15 @@ protected Settings repositorySettings() {
return Settings.builder().put("compress", randomBoolean()).build();
}

protected final String createRepository(final String name) {
return createRepository(name, repositorySettings());
protected Settings repositorySettings(String repositoryName) {
@original-brownbear (Member):

NIT: Maybe call this additionalRepositorySettings? Or better yet + in the sense of simplicity, maybe just add the repositoryName parameter to the existing repositorySettings() method? That way we don't have to complicate the repo creation code.

@albertzaharovits (Contributor Author):

add the repositoryName parameter to the existing repositorySettings() method

Done, thanks for the suggestion!

internalCluster().getDataOrMasterNodeInstances(RepositoriesService.class).forEach(repositories -> {
RepositoryMissingException e = expectThrows(RepositoryMissingException.class, () -> repositories.repository(name));
assertThat(e.getMessage(), containsString("missing"));
assertThat(e.getMessage(), containsString(name));
@original-brownbear (Member):

I think you can just assert on e.repository() here and delete the "missing" assertion, that part really is implied by the exception type :)
Or you could just not assert on the message at all IMO.

@albertzaharovits (Contributor Author):

Or you could just not assert on the message at all IMO

Done, thanks again!

assertThat(e.getMessage(), containsString(name));
});

return name;
@original-brownbear (Member):

You never use the return value here, and it seems pointless since it's just the method input?

@albertzaharovits (Contributor Author):

I've removed the return value; I was trying to be consistent with the return value of the createRepository method.

@@ -175,7 +192,7 @@ public void testList() throws IOException {
BlobMetaData blobMetaData = blobs.get(generated.getKey());
assertThat(generated.getKey(), blobMetaData, CoreMatchers.notNullValue());
assertThat(blobMetaData.name(), CoreMatchers.equalTo(generated.getKey()));
assertThat(blobMetaData.length(), CoreMatchers.equalTo(generated.getValue()));
assertThat(blobMetaData.length(), CoreMatchers.equalTo(blobLengthFromContentLength(generated.getValue())));
@original-brownbear (Member):

NIT: Maybe it's easier to follow this logic if you just add a method:

assertBlobSize(BlobMeta meta, long contentLength)

and then use it here and override in the encrypted tests? (I think it saves a little complication and it's what we do for a bunch of other repo related assertions)

return DEKCache.computeIfAbsent(DEKId, ignored -> loadDEK(DEKId));
} catch (ExecutionException e) {
// some exception types are to be expected
if (e.getCause() instanceof IOException) {
@original-brownbear (Member):

Hmm I'm not a big fan of this kind of unwrapping. We're losing part of the stacktrace here and it may be painful to understand (test-)failures because of it?

@albertzaharovits (Contributor Author):

I used it because the loadDEK method is careful to throw IOExceptions when storage has problems and RepositoryException when there are encryption problems (wrong password or tampered metadata). I've learned that IOExceptions in readBlob can move the repository into a corrupted state, and if we wrap the IOException for the DEK load (which happens before reading the actual blob), that would amount to a change of behavior in those cases.

Moreover, I don't believe the ExecutionException wrapping adds useful information.
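
For illustration, here is a self-contained sketch of the unwrapping pattern under discussion (the cache and loader below are stand-ins; the PR's actual cache and exception types differ):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DekLookupSketch {
    final Map<String, byte[]> dekCache = new ConcurrentHashMap<>();

    byte[] getDEK(String dekId) throws IOException {
        try {
            return dekCache.computeIfAbsent(dekId, ignored -> {
                try {
                    return loadDEK(dekId);
                } catch (IOException e) {
                    // the cache forces loader failures to be wrapped
                    throw new RuntimeException(e);
                }
            });
        } catch (RuntimeException e) {
            if (e.getCause() instanceof IOException) {
                // rethrow so readBlob's IOException semantics are preserved
                throw (IOException) e.getCause();
            }
            throw e;
        }
    }

    byte[] loadDEK(String dekId) throws IOException {
        throw new IOException("stand-in for reading the wrapped DEK blob");
    }
}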

@tvernum (Contributor) commented Mar 23, 2020

I'm still leaning for raw AES keys

I have no concerns with that. We may need to think about tooling at some point, but we can ship a first release that relies on external tools to generate keys.

I am starting to feel more and more as though cutting out unnecessary complexity is a valuable step to take on this project, so if raw keys do that, then it's probably the right call to make.

@albertzaharovits albertzaharovits marked this pull request as ready for review March 31, 2020 20:06
@albertzaharovits albertzaharovits added the :Security/Security Security issues without another label label Mar 31, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-security (:Security/Security)

@albertzaharovits (Contributor Author)

@original-brownbear @tvernum this is ready for review proper.

Thank you @original-brownbear for the first review round. I've acted on all your suggestions. Unfortunately the code shifted around a bit since I applied spotless, but otherwise there were no big changes. I've mostly restructured the integ tests as we discussed.

@original-brownbear (Member)

@albertzaharovits @tvernum any objections to simply opening a clean PR for this (including the feature branch changes) against master for review, and then merging it once we're through the review?
Or just merging this to the feature branch and then continuing there?

  1. From a quick read over this one, I think it's very close already and there's no need for the feature branch complexity.
  2. My brain hurts from trying to review a diff relative to the feature branch with the feature branch now containing the previous approach :)

@albertzaharovits (Contributor Author)

@original-brownbear Thanks for the prompt feedback!

The feature branch is clean, it follows master with a few days of lag, and it does not contain previous approaches (I've not merged any of those). I prefer to keep maintaining the feature branch because it gives me the liberty of adding incremental features, most notably password rotation (but maybe others, such as password min-strength requirements and an implicit repository password name). Some integration tests might also come in a follow-up as "incremental features" (although I don't have any specific examples right now).

Yet this PR is big, and does not look "incremental", because it contains the entirety of the code for the functional EncryptedBlobStoreRepository plugin, without password rotation. I wasn't able to find a way to divide its main code into smaller testable pieces.

@original-brownbear (Member) left a comment

Thanks @albertzaharovits, I gave it a more thorough read now; looks really close + good! Thanks! Didn't find any big issues, just a few small + quick points :)

@albertzaharovits (Contributor Author)

Thank you for the feedback @original-brownbear. I have addressed all your suggestions; this is ready for another round.

@original-brownbear (Member) left a comment

Thanks for all the iterations on this @albertzaharovits ! => LGTM, I really like the approach now :) => let's merge it to the feature branch IMO!

@tvernum (Contributor) left a comment

I've had these comments sitting for a couple of weeks, and I really should have submitted them a while ago. More coming.

long packetIvCounter = ivBuffer.getLong(Integer.BYTES);
if (packetIvNonce != nonce) {
throw new IOException("Packet nonce mismatch. Expecting [" + nonce + "], but got [" + packetIvNonce + "].");
}
@tvernum (Contributor):

Can you clarify why we don't validate the per-packet nonce anymore?

@albertzaharovits (Contributor Author):

Removing the explicit nonce check was necessary in order to support using the same DEK for multiple encrypted blobs.

Previously, every blob had its own associated metadata blob, where we stored the nonce, which was explicitly verified here during decryption. But now there is no place where we can associate metadata with every blob (and this is by design, because associating something with every blob incurs at least one extra API call and is difficult to update).

There is no security problem in not explicitly checking the nonce, because the nonce is part of every packet's IV and is hence validated implicitly by the GCM algorithm when the packet's authentication tag is verified.
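
A sketch of the IV layout implied by the snippet above (a 4-byte nonce plus an 8-byte packet counter forming the 12-byte GCM IV; the names and exact sizes are assumptions):

import java.nio.ByteBuffer;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class PacketIvSketch {
    static Cipher decryptingCipher(SecretKey dek, int nonce, long packetCounter) throws Exception {
        ByteBuffer iv = ByteBuffer.allocate(Integer.BYTES + Long.BYTES);
        iv.putInt(nonce).putLong(packetCounter);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        // a tampered nonce changes the IV, so GCM tag verification fails
        cipher.init(Cipher.DECRYPT_MODE, dek, new GCMParameterSpec(128, iv.array()));
        return cipher;
    }
}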

@tvernum (Contributor):

That makes sense.
Is there still a reason why we store it in the stream then?

@tvernum (Contributor) left a comment

I made it through everything other than the crypto. I'll review that next.

AtomicReference<SingleUseKey> keyCurrentlyInUse
) {
final Object lock = new Object();
return () -> {
@tvernum (Contributor):

This feels like a complicated way to produce a Supplier of a set of keys with a sequential nonce.
Does it really need the complexity of AtomicReference over a synchronized method?

It feels like the general concept could be written quite simply with a class that implements Supplier, has an internal counter, and is synchronized on get (or uses an AtomicInteger for the counter).
I presume there is a reason why you took this approach instead, but the code doesn't tell me why that is.

@albertzaharovits (Contributor Author):

It feels like the general concept could be written quite simply with a class that implements Supplier, has an internal counter, and is synchronized on get (or uses an AtomicInteger for the counter).

This is pretty much what the code here does; here it is without production ornaments:

        return () -> {
            while (true) { // retry after an expired key has been replaced
                final SingleUseKey nonceAndKey = keyCurrentlyInUse.getAndUpdate(
                    prev -> prev.nonce < MAX_NONCE ? new SingleUseKey(prev.keyId, prev.key, prev.nonce + 1) : EXPIRED_KEY
                );
                if (nonceAndKey.nonce < MAX_NONCE) {
                    return nonceAndKey;
                } else {
                    synchronized (lock) {
                        // generate a new key and reset the nonce, atomically for both
                        if (keyCurrentlyInUse.get().nonce == MAX_NONCE) {
                            final Tuple<BytesReference, SecretKey> newKey = keyGenerator.get();
                            keyCurrentlyInUse.set(new SingleUseKey(newKey.v1(), newKey.v2(), MIN_NONCE));
                        }
                    }
                }
            }
        };

Instead of implementing Supplier with a named class, this uses the fact that Supplier is a FunctionalInterface. The internal counter is the nonce itself. Instead of synchronizing on get, this uses an AtomicReference as an AtomicInteger. The remaining code generates a new key and resets the counter. And this last part is where the preference for AtomicReference over AtomicInteger shows, because it allows the key and the nonce to be set atomically.

I have added a few more comments. I hope this all makes more sense now.
Do you think I should change something here?

@tvernum (Contributor):

It makes sense. I think I was missing the fact that you need to generate a new base key when the nonce rolls over, but I do think this code is unnecessarily difficult to follow.
That's a combination of trying too hard to do it all in a lambda and trying to keep the locks really small. Is that necessary?

Couldn't we have:

        return new CheckedSupplier<SingleUseKey, T>() {
                private Tuple<BytesReference, SecretKey> key = null;
                private int nonce = 0;
                public synchronized SingleUseKey get() throws T {
                   if (key == null || nonce == MAX_NONCE) {
                       key = keyGenerator.get();
                       nonce = MIN_NONCE;
                   } else {
                       nonce++;
                   }
                   return new SingleUseKey(key.v1(), key.v2(), nonce);
                }
        };

@albertzaharovits (Contributor Author)

@tvernum I have acted on all review points.

@albertzaharovits albertzaharovits requested a review from tvernum May 4, 2020 19:06
@rjernst rjernst added the Team:Security Meta label for security team label May 4, 2020
@albertzaharovits albertzaharovits force-pushed the repository-encrypted-client-side branch 3 times, most recently from 1979ebc to 01d7698 on November 28, 2020
@albertzaharovits albertzaharovits force-pushed the repository-encrypted-client-side branch from 01d7698 to cf8c7fd on November 30, 2020
@albertzaharovits albertzaharovits merged commit 3249cc3 into elastic:repository-encrypted-client-side Dec 2, 2020
@albertzaharovits albertzaharovits deleted the reuse-DEKs-universally branch December 2, 2020 14:09
@choobinejad

@albertzaharovits I see some changes to the way the KEK is established. Originally it looks like this was a 256-bit AES key, and now it's:

generated from the password using the PBKDF2 function

Can you provide more detail about where the KEK comes from? How is it generated? How is it stored? I think the origin and security of the KEK itself is a key component of this feature.

@albertzaharovits (Contributor Author)

@choobinejad I've created a sketch diagram in #41910 (comment) to try to explain it. Sorry it took so long...

The KEK is generated from the password using the PBKDF2 algorithm, where the salt is the id of the DEK that the KEK wraps; so there is a 1-1 relationship between DEKs and KEKs. KEKs are never stored; they exist only in memory, as a proxy that allows the same password to encrypt all the DEKs.
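
A minimal sketch of that derivation, with an assumed iteration count and PRF variant (the PR's exact parameters may differ):

import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

class KekDerivationSketch {
    static SecretKey deriveKek(char[] repositoryPassword, byte[] dekId) throws Exception {
        // the DEK id doubles as the PBKDF2 salt, giving a distinct KEK per DEK
        PBEKeySpec spec = new PBEKeySpec(repositoryPassword, dekId, 10_000, 256);
        SecretKeyFactory pbkdf2 = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA512");
        byte[] keyBytes = pbkdf2.generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES"); // in-memory only, never stored
    }
}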

@choobinejad

Thanks @albertzaharovits, that info and the sketch are really useful!
cc @mbarretta

Labels
>feature, :Security/Security (Security issues without another label), Team:Security (Meta label for security team)