
Make sure every key gets updated at some time #24

Open
vbrandl opened this issue Sep 12, 2016 · 5 comments

Comments

@vbrandl
Contributor

vbrandl commented Sep 12, 2016

When picking a key to update, parcimonie simply picks a random key from the keyring. This way it can happen that certain keys don't get updated for a long time while other keys get updated multiple times in the same period.

Maybe it is possible to implement logic that updates every key once before the next round of updates begins.
I was thinking of storing a set of all keys from the keyring, picking random keys from that set, and updating a flag in the set when a key gets updated. When all keys have been updated, a new set is generated, and so on.
This set needs to be stored persistently so it still exists after a reboot.

For keys that are added to the keyring while an update cycle is still running, the daemon could take the full set that was used to create the current update cycle and diff it against the actual keyring. Every key that is new gets added to the update set.

The structure of the set could be something simple like

fingerprint:0
fingerprint:1

where 0 or 1 indicates whether the key has or hasn't been updated in the current cycle.
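
Something like this rough sketch, perhaps (Python just to illustrate the idea; the state-file location and function names are made up):

import os
import random

STATE_FILE = os.path.expanduser('~/.config/parcimonie/update-cycle')  # made-up location

def load_cycle(keyring_fingerprints):
    # Read existing fingerprint:flag lines, if any.
    cycle = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            for line in f:
                fingerprint, flag = line.strip().rsplit(':', 1)
                cycle[fingerprint] = (flag == '1')
    # Keys added to the keyring after the cycle started join it as not-yet-updated.
    for fingerprint in keyring_fingerprints:
        cycle.setdefault(fingerprint, False)
    return cycle

def save_cycle(cycle):
    with open(STATE_FILE, 'w') as f:
        for fingerprint, updated in cycle.items():
            f.write('%s:%d\n' % (fingerprint, 1 if updated else 0))

def pick_key(cycle):
    pending = [fp for fp, updated in cycle.items() if not updated]
    if not pending:  # every key has been updated once: start a new cycle
        for fp in cycle:
            cycle[fp] = False
        pending = list(cycle)
    choice = random.choice(pending)
    cycle[choice] = True
    return choice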

Maybe there is a more elegant way to achieve this, but I think it would be good to make sure every key gets updated.

@EtiennePerot
Owner

EtiennePerot commented Sep 14, 2016

This sounds feasible as long as the set is different on each generation, in order to avoid correlation through repeated key refresh order.

I think one simplification could be that the program doesn't remember the whole set of keys, but rather it only keeps a set of keys that it has refreshed. Every time it decides to refresh a new key, it just picks a random one from the keyring that is not in that set, refreshes it, and adds it to the set. Then, when the set contains all keys in the keyring, erase the set and start over. This doesn't require having edge cases for key additions or removals, and enforces the random-ordering property.
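
Roughly (a Python sketch, in-memory only; how the keyring fingerprints get listed is left as a placeholder):

import random

refreshed = set()  # fingerprints already refreshed in the current pass

def pick_next_key(keyring_fingerprints):
    global refreshed
    keyring = set(keyring_fingerprints)
    if keyring <= refreshed:
        # Every key currently in the keyring has been refreshed: start over.
        refreshed = set()
    choice = random.choice([fp for fp in keyring if fp not in refreshed])
    refreshed.add(choice)
    return choice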

I don't see a reason to not do this at least with the set in memory, but persisting the set may have new privacy implications. For example, if the persistent storage directory is in some remote storage (or equivalently, synced with dropbox or whatever), then any write to it (i.e. once per key refresh) will be reflected on the network in a potentially-privacy-disrespecting manner: even if the data is encrypted, the upload pattern would coincide time-wise with key refreshes. Today this leak isn't much of a concern because only a small subset of key refreshes result in disk writes (those that actually bring in new key data). I definitely think persistence should not be the default, and that the setting to control that should come with a big disclaimer.

@joelpurra
Contributor

Although I understand that some keys might not be updated each period (week), I personally like that the selection is currently completely random. I'm not sure I like the idea of each key being "guaranteed" to get updated once per period quite as much. Perhaps the current random selection style is statistically close to "guaranteeing" that each key gets updated at least once per two periods? That's enough for me.

As the time to wait between key refreshes depends on the total number of keys N in the keyring, I would guess the number of keys is of less importance -- but has anyone done the math to confirm the probability that a key gets selected (or not) for a low N versus a high N? The answer should tell us whether it's likely that a key randomly goes two periods in a row without being updated.


Perhaps there is a way to reduce the chance of a key being selected twice in a period, while keeping some random properties? Say, for a keyring with N keys, remembering a list of the N/2 most recently updated keys?

That approach might also be configurable to allow for both random selection with replacement (remember 0) and without replacement (remember N). (Tried to use some statistical terms, but won't pretend I remember enough from university.)
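
For example (a sketch; `remember` is the hypothetical knob, 0 behaving like today's selection with replacement and values near N approximating selection without replacement):

import random
from collections import deque

def make_picker(remember):
    # Keep a sliding window of the `remember` most recently updated keys
    # and avoid picking any of them again while they are still in the window.
    recent = deque(maxlen=remember)

    def pick(keyring_fingerprints):
        candidates = [fp for fp in keyring_fingerprints if fp not in recent]
        if not candidates:  # e.g. the keyring shrank below `remember`
            candidates = list(keyring_fingerprints)
        choice = random.choice(candidates)
        if remember > 0:
            recent.append(choice)
        return choice

    return pick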

@vbrandl
Contributor Author

vbrandl commented Sep 16, 2016

My first thought was picking random keys from the generated set as stated in the first post, but using a set that only contains the already updated keys is more elegant.

Full persistence vs. in-memory persistence is another topic. Maybe the daemon could wait until the computer shuts down and only then write the current set to persistent storage.

I'll try to do the math for the probability of each key from a set of N keys getting updated in a round of N updates:

There are N^N different combinations in which the keys can get updated in one round (N updates per round and N keys to pick from randomly at each update). Only N! of these combinations contain every key from the set. So the probability of every key getting updated would be

(N!) / (N^N)

For a set of 10 keys, that would be

10! / (10^10) = 0.00036288 ~= 0.04%

Since N^N grows faster than N!, the probability of updating every key in a round shrinks as the keyring grows.
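
A quick check of how fast that falls off (Python):

from math import factorial

for n in (2, 5, 10, 20, 50):
    print(n, factorial(n) / float(n ** n))
# 2 -> 0.5, 5 -> 0.0384, 10 -> 0.00036288, 20 -> ~2.3e-08, 50 -> ~3.4e-21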

I don't know if my approach to this problem is correct, but that's how I remember it from statistics class. Going by this, it is actually pretty unlikely that every key gets updated in a reasonable amount of time.

@EtiennePerot
Owner

EtiennePerot commented Sep 17, 2016

Your math is correct, but it's not really showing the whole picture. What it computes is the probability that the current behavior matches the behavior described in the first post, i.e. the probability that today's parcimonie.sh would happen to do a full keyring update using the absolute minimum number of refreshes possible to do so.

I don't think this is an interesting metric to look at. A better way to show the effect of this change would be to estimate how old keys are in the current model.

Let's examine the case of a key k in a keyring of size N. In the current model, the probability of a key being older than 1 refresh is (N-1)/N, i.e. the probability that it was not picked in the last refresh. The probability of a key being older than 2 refreshes is ((N-1)/N) * ((N-1)/N), i.e. the probability that it was not picked in the last refresh and not picked in the refresh before that (which are independent events in the current model since we randomly pick keys with replacement at each refresh).

Thus, generally, the probability of a key being older than r refreshes is ((N-1)/N)^r.

Now recall that the refresh frequency is proportional to N itself, so we need to take this into account in order to come up with a time-based metric. With the default value of 1 week for $targetRefreshTime, that's N expected refreshes per week. This means that the probability of a key being older than $targetRefreshTime is that of being older than N refreshes, which is ((N-1)/N)^N (we just set r = N).

Here's what this looks like for different values of N:

%matplotlib notebook

import pandas as pd

# Probability that a given key is older than $targetRefreshTime, i.e. was not
# picked in any of the last N refreshes, for keyring sizes 2 to 47.
d = pd.DataFrame([(((N-1)/N)**N) for N in map(float, range(2, 48))], index=range(2, 48))
p = d.plot()
p.set_xlabel('Keyring size (N)')
p.set_ylabel('Probability of a given key not being refreshed often enough')

[Plot: probability of a given key being older than $targetRefreshTime (1 week), by keyring size N]

This has an upper bound of about 0.37 (((N-1)/N)^N approaches 1/e ≈ 0.368 as N grows); we can thus assert that, in the expected case, 37% of keys or fewer will be older than $targetRefreshTime.

But in practice keys don't actually need refreshing every week. What happens if we only care that keys were refreshed within the last month, approximated as 4 weeks (4*N refreshes)?

%matplotlib notebook

import pandas as pd

# Same as above, but for being older than 4 * $targetRefreshTime (4*N refreshes).
d = pd.DataFrame([(((N-1)/N)**(4*N)) for N in map(float, range(2, 48))], index=range(2, 48))
p = d.plot()
p.set_xlabel('Keyring size (N)')
p.set_ylabel('Probability of a given key not being refreshed often enough')

[Plot: probability of a given key being older than 4 * $targetRefreshTime (4 weeks), by keyring size N]

So we can assert that 2% of keys or fewer will be older than 4 * $targetRefreshTime, which is about 1 or 2 keys in an average-size keyring, so let's call that negligible.

tl;dr: This change would ensure that the roughly 37% of a user's keys that can currently lag behind stay consistently updated within $targetRefreshTime rather than within the current 4 * $targetRefreshTime.

When put in that light, maybe we should just reformulate how the script interprets $targetRefreshTime accordingly, and get essentially the same effect without the potential downsides?

@lcts

lcts commented Apr 6, 2017

> When put in that light, maybe we should just reformulate how the script interprets $targetRefreshTime accordingly, and get essentially the same effect without the potential downsides?

I would, because it would make the setting intuitive. Statistics aren't for most people, but everyone will understand a setting of "time after which X% of keys have been refreshed". If X is reasonably high, this also ensures that the fraction of refreshed keys will be essentially 1 after double that time, again something that people will intuitively guess.

Expanding on your math: for any given time x * $targetRefreshTime, the upper limit for the fraction of unrefreshed keys is 0.37^x, so it's easy to pick a target percentage. Using x = 3 yields almost exactly a 95% refresh fraction, so I would redefine $targetRefreshTime to mean exactly that.
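
Spelling out the arithmetic (0.37 being the per-period upper bound from the plots above):

for x in (1, 2, 3, 4):
    stale = 0.37 ** x
    print(x, round(stale, 4), round(1 - stale, 4))
# x=1: ~63% refreshed, x=2: ~86%, x=3: ~95%, x=4: ~98%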
