Countme should report system age, not repository age #1611

dmnks · 2023-07-28T18:12:32Z

Currently, when we compute the system's age bucket (1 through 4) to report in the weekly countme flag, we do that relative to the first-ever metadata refresh (called the epoch) of the respective repository. However, the original proposal intended it to be the absolute age bucket, that is, measured since installation.

This is because we store the cookie files (containing the timestamps) in per-repository directories (persistdirs) whose names contain a hash derived from various repository properties, including the releasever value. This means the system's age bucket is effectively reset on each Fedora system upgrade, which is not what we want.
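
To make the effect concrete, here is a minimal Python sketch of an age bucket computed relative to an epoch. The bucket ranges (week 1; weeks 2 to 4; weeks 5 to 24; later) are my reading of the dnf.conf(5) description, not libdnf's actual code, and `age_bucket` is a hypothetical name. It shows how moving the epoch to the upgrade time drops a long-running system back into bucket 1.

```python
WEEK = 7 * 24 * 60 * 60  # one counting window, in seconds


def age_bucket(epoch: int, now: int) -> int:
    """Return the countme bucket (1-4) for a system whose age is measured from epoch.

    Assumed ranges (based on dnf.conf(5), not on libdnf's code):
    week 1 -> 1, weeks 2-4 -> 2, weeks 5-24 -> 3, later -> 4.
    """
    week = (now - epoch) // WEEK + 1  # ordinal week number, starting at 1
    if week <= 1:
        return 1
    if week <= 4:
        return 2
    if week <= 24:
        return 3
    return 4


if __name__ == "__main__":
    install = 0                 # hypothetical installation time
    upgrade = 30 * WEEK         # hypothetical upgrade, 30 weeks later
    now = upgrade + 2 * 86400   # two days after the upgrade
    print(age_bucket(install, now))  # prints 4: measured from installation
    print(age_bucket(upgrade, now))  # prints 1: measured from the new persistdir
```

Measuring from the installation keeps the system in bucket 4; measuring from the cookie in the freshly created persistdir makes it look brand new.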

To fix this, we should simply keep one single cookie file for the entire system and use that to determine the system's age bucket.

There's a second countme implementation in rpm-ostree (here's why) which reportedly does the right thing. Looking at the code, they do appear to store only one cookie file per system (at /var/lib/rpm-ostree-countme/cookie), as it should be. I think we should just do the same.

To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file. When it comes to storing the new values at the end of the addCountmeFlag() function, that should already go into the system-wide cookie file. That way, systems that upgrade to the fixed DNF version would simply continue where they left off, instead of being reset to age 1. Note that this may need special care in case repositories are fetched in parallel.
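
The migration could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the cookie is assumed to be one line of four integers ("VERSION EPOCH WINDOW BUDGET"), the system-wide path is supplied by the caller, and `read_cookie`, `write_cookie` and `migrate` are hypothetical names. To cope with parallel fetches, the new cookie is written to a temporary file and atomically renamed, and a concurrent removal of the old file is tolerated.

```python
import os
import tempfile
from typing import Optional, Tuple

# Assumed cookie layout (hypothetical): (version, epoch, window, budget)
Cookie = Tuple[int, int, int, int]


def read_cookie(path: str) -> Optional[Cookie]:
    """Parse a cookie file, returning None if it is missing or malformed."""
    try:
        with open(path) as f:
            fields = f.read().split()
        version, epoch, window, budget = (int(x) for x in fields[:4])
    except (OSError, ValueError):
        return None
    return version, epoch, window, budget


def write_cookie(path: str, cookie: Cookie) -> None:
    """Write via a temp file + rename so readers never see a partial cookie."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(" ".join(str(v) for v in cookie))
    os.replace(tmp, path)  # atomic on POSIX filesystems


def migrate(old_path: str, system_path: str) -> Optional[Cookie]:
    """Load the system-wide cookie, seeding it from an old per-repo one."""
    current = read_cookie(system_path)
    if current is None:
        old = read_cookie(old_path)
        if old is not None:
            write_cookie(system_path, old)
            current = old
    try:
        os.remove(old_path)  # another process may have removed it already
    except FileNotFoundError:
        pass
    return current
```

In this sketch, the first repo to be migrated seeds the system-wide cookie and later ones are simply dropped; a fuller version might keep the oldest epoch instead.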


mattdm · 2023-07-29T09:05:05Z

> To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file.

Probably the "best" thing to do is find the oldest countme file (including disabled repos).
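
A Python sketch of "find the oldest countme file" might scan every persistdir rather than only the enabled repos, so directories left behind by repos that are now disabled are included too. Assumptions: the cookie file is named `countme` and its second whitespace-separated field is the epoch as UNIX time, with 0 meaning unset; `oldest_epoch` is a hypothetical name.

```python
import glob
import os
from typing import Optional

ROOT = "/var/lib/dnf/repos"  # persistdir location mentioned in this thread


def oldest_epoch(root: str = ROOT) -> Optional[int]:
    """Return the smallest non-zero epoch found in any cookie, or None."""
    epochs = []
    # Scan the directory itself, not the configured repos, so persistdirs
    # of repos that are now disabled are picked up as well.
    for path in glob.glob(os.path.join(root, "*", "countme")):
        try:
            with open(path) as f:
                epoch = int(f.read().split()[1])  # assumed: 2nd field = epoch
        except (OSError, ValueError, IndexError):
            continue  # unreadable or malformed cookie: skip it
        if epoch > 0:
            epochs.append(epoch)
    return min(epochs, default=None)
```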

Hacky but maybe more accurate: does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used?

Both of these will probably cause "jumps" in my data, but I'm okay with that, really.

dmnks · 2023-07-29T10:54:07Z

I think we could use the transaction with ID 1 in the DNF history database which, I believe, represents the fresh install through Anaconda. The transaction record contains the timestamp. On the CLI, you can check that with:

dnf history info 1 | grep 'Begin time'

That way, we wouldn't need to store the "epoch" in the cookie file, and would just always use the above timestamp for that.

dmnks · 2023-07-29T11:03:35Z

Thinking about it more, the first-ever transaction may not be a reliable indicator of the system age for ephemeral systems that are not installed through Anaconda but from an image (e.g. Podman containers). So we may need a different strategy (for those).

supakeen · 2023-07-31T06:55:22Z

There are more systems that are not installed through Anaconda (the ARM version often gets installed from an image, virtual machines at cloud providers, etc.), so I wouldn't special-case it :)

dmnks · 2023-07-31T09:14:05Z

Thanks, that's a useful data point to have 😄

dmnks · 2023-07-31T19:00:12Z

Just FTR, @james-antill suggested in a chat that another solution would be to keep per-repo countme files but store them in directories named after the repo ID only (without the hash).

travier · 2023-08-07T13:27:15Z

Just as FYI, here is the implementation in rpm-ostree that does not have this issue: https://github.com/coreos/rpm-ostree/blob/main/rust/src/countme/cookie.rs

mattdm · 2024-04-01T16:47:32Z

Is there any movement on this? What is the implementation like in DNF 5?

jan-kolarik · 2024-04-05T08:45:31Z

> What is the implementation like in DNF 5?

I've just checked it, it's basically a clone of the dnf4 implementation.

> Is there any movement on this?

We'll discuss it with leadership and the team in the following days and provide feedback soon.

mattdm · 2024-04-05T19:28:17Z

> I've just checked it, it's basically a clone of the dnf4 implementation.

Ah, bug-for-bug compatibility. :)

Am I possibly currently getting double-counts from people using both, or using e.g. GNOME Software + dnf5 in f39?

dmnks · 2024-04-08T11:40:29Z

If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

This is really silly and needs to be fixed ASAP. Since I wrote that code (and still remember how it works, kinda), it just makes sense for me to have a closer look, then... So I'll do just that, assigning to myself now.

dmnks · 2024-04-09T08:56:08Z

> If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

Good news, I guess. I've just checked and dnf5 uses the same persistent directories (/var/lib/dnf/repos/<repoid>-<hash>/) as dnf4, meaning that countme flags are not sent twice (once by each).

dmnks · 2024-04-12T15:25:13Z

TL;DR: A simple fix is underway. I'll be on PTO next week, so expect silence here until I'm back.

Having thought about this more, we do need to continue tracking the countme timestamps ("cookie" files in /var/lib/dnf/repos/) on a per-repo basis, as opposed to having one system-wide timestamp. This is simply because the countme flag is reported per-repo (via the metalink URL) and using a system-wide cookie would cause only one repo (whichever happens to be fetched first by dnf) to issue the flag each week, which is not what we want.
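
The argument can be checked with a deliberately simplified Python model: a repo adds the flag only if its cookie's last counted window is older than the current week, and updates the cookie afterwards. This ignores libdnf's randomized per-window budget and is not its actual logic; `flagged_repos` is a hypothetical name.

```python
from typing import Dict, List

WEEK = 7 * 24 * 60 * 60


def flagged_repos(repos: List[str], now: int, per_repo: bool) -> List[str]:
    """Return the repos that would send countme during one metadata refresh."""
    cookies: Dict[str, int] = {}  # cookie key -> last counted window
    window = now - now % WEEK     # start of the current (simplified) week
    sent = []
    for repo in repos:            # repos in the order dnf fetches them
        key = repo if per_repo else "system"  # shared key = one system cookie
        if cookies.get(key, -1) < window:
            sent.append(repo)
            cookies[key] = window
    return sent
```

With `per_repo=True`, both "fedora" and "updates" send the flag; with a single system-wide cookie, only whichever repo is fetched first does.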

However, what we do want to change is that the timestamps depend on the $releasever value: that value is part of the metalink URL, which is used to compute the hash. Therefore, the easiest fix is to just change the per-repo directory names in /var/lib/dnf/repos/ from <repoid>-<hash> to <repoid>. This was also mentioned above as one of the possible solutions.
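
Here is a Python sketch of why the current naming scheme reacts to a new $releasever. It assumes the hash is the first 16 hex digits of the SHA-256 of the substituted metalink URL; the length matches the example directory name later in this thread. The URL template is only an example and `persistdir_name` is a hypothetical name.

```python
import hashlib
from string import Template

# Example metalink template (illustrative only)
METALINK = "https://mirrors.fedoraproject.org/metalink?repo=fedora-$releasever&arch=$basearch"


def persistdir_name(repoid: str, template: str, **variables: str) -> str:
    """Current scheme, as assumed here: repo ID plus a URL-derived hash."""
    url = Template(template).substitute(variables)  # expand $releasever etc.
    digest = hashlib.sha256(url.encode()).hexdigest()
    return f"{repoid}-{digest[:16]}"


if __name__ == "__main__":
    for ver in ("39", "40"):
        print(persistdir_name("fedora", METALINK, releasever=ver, basearch="x86_64"))
```

The two printed names differ, so an upgrade starts from an empty persistdir; under the proposed scheme, the directory would be plain `fedora` for every release.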

I have a working (one-line) patch for that locally, as well as an updated countme.feature test to cover this. So that part is easy.

The tricky part is to ensure that the cookie is not reset when the existing systems upgrade to the fixed libdnf version (once released). Since the directory name changes, libdnf would think that the system doesn't yet have a cookie file and thus 1) would start over, with age set to 1 (countme=1), as if the system was just freshly installed, and 2) would possibly send the flag again in the same week, thus double-counting the system in that week. This would skew the metrics we gather on the server quite a bit.

To prevent that, the cookie file needs to stay the same when you upgrade libdnf to the fixed version, as well as if you decide to downgrade to the old version for some reason. The easiest solution to that seems to be the following:

1. In a (%post?) scriptlet in libdnf, check whether we have an existing cookie for the main repos ("fedora" and "updates"?).
2. If we do, create a non-hash symlink for each of those repos. For example, if /var/lib/dnf/repos/fedora-845d89688cb28f31 exists, a symlink named /var/lib/dnf/repos/fedora pointing to the former would be created by the scriptlet.

This way, the same cookie file would be reused after upgrading to the new libdnf version as well as after downgrading it.

What the scriptlet needs to decide, though, is which directory to choose as the symlink target if there are several, which can easily happen, for example if dnf --releasever is ever used on the system.

I think it should choose the one that corresponds to the running Fedora version, e.g. by looking at VERSION_ID in /etc/os-release. This is quite easy to do: the hash is a SHA256 of the metalink URL, so we can compute it in the scriptlet using coreutils programs.
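
The scriptlet logic is sketched below in Python for readability (the real scriptlet would be shell). Assumptions: $releasever equals VERSION_ID, persistdir names are `<repoid>-<first 16 hex digits of SHA-256(metalink URL)>`, and the metalink template is passed in by the caller. `version_id` and `link_repo` are hypothetical names. The symlink is created only if the matching persistdir exists and no link is there yet, so running it twice is harmless.

```python
import hashlib
import os
from string import Template
from typing import Optional


def version_id(os_release: str) -> Optional[str]:
    """Extract VERSION_ID from os-release(5) content, stripping quotes."""
    for line in os_release.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VERSION_ID":
            return value.strip().strip("\"'")
    return None


def link_repo(repos_dir: str, repoid: str, template: str,
              releasever: str, basearch: str) -> Optional[str]:
    """Create repos_dir/<repoid> -> <repoid>-<hash> if that persistdir exists."""
    url = Template(template).substitute(releasever=releasever, basearch=basearch)
    target = f"{repoid}-{hashlib.sha256(url.encode()).hexdigest()[:16]}"
    link = os.path.join(repos_dir, repoid)
    if not os.path.isdir(os.path.join(repos_dir, target)) or os.path.lexists(link):
        return None  # nothing to migrate, or already migrated
    os.symlink(target, link)  # relative link, resolved inside repos_dir
    return target
```

For the running system, the caller would pass `version_id(open("/etc/os-release").read())` as `releasever`.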

In fact, I also have a draft scriptlet locally which works as described above, we just need to decide on which repositories to "migrate". I'd think "fedora" and "updates" should suffice, but please let me know otherwise.

So, that's it for the status update. I've decided to dump my thoughts here because I'll be on vacation next week and might otherwise forget the details 😄 Any feedback is of course welcome in the meantime; just know that I'll only be able to respond when I'm back.

dmnks · 2024-05-09T11:09:18Z

> Hacky but maybe more accurate: does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used?

Not dnf, but there's the /etc/machine-id file which, actually, seems to fit the bill perfectly. Its modification timestamp typically reflects the installation time, as the ID is generated during system installation or first boot (by systemd) and then stays untouched.

So, scratch my above ponderings about changing the persistdir naming scheme. Instead, I've submitted #1662 which switches age counting to the machine-id file's timestamp.

Here's an updated BDD feature file which demonstrates the new logic (see the Examples table at the bottom of the Scenario Outline): https://github.com/rpm-software-management/ci-dnf-stack/blob/ab365d2bad19f69e188fb449fb6bcdd8834f5815/dnf-behave-tests/dnf/countme.feature#L44

Actually use the system's installation time (if known) as the reference point, instead of the first-ever countme event recorded for the given repo. This is what the dnf.conf(5) man page always said about the countme option; the code just never lived up to that.

This makes bucket calculation more accurate:

1. System upgrades will no longer reset the bucket to 1 (this used to be the case due to a new persistdir being created whenever $releasever changed).
2. Systems that only reach out to the repos some time after being installed will no longer appear younger than they really are.
3. Prebuilt OS images that happen to include countme cookies created at build time will no longer cause all the instances spawned from those images (physical machines, VMs or containers) to appear older than they really are.

Use the machine-id(5) file's mtime to infer the installation time. This file is semantically tied to the system's lifetime since it's typically populated at installation time or during the first boot by an installer tool or init system, respectively, and remains unchanged. The fact that it's a well-defined file with clear semantics ensures that OS images won't accidentally include a prepopulated version of this file with a timestamp corresponding to the image build, unlike our own cookie files (see point 3 above).

In some cases, such as in OCI containers without an init system running, the machine-id file may be missing or empty, even though the system is still used long-term. To cover those, keep the original, relative epoch as a fallback method. System upgrades aren't really a thing for such systems, so point 1 above doesn't apply here.

Some containers, such as those created by toolbox(1), may also choose to bind-mount the host's machine-id file, thus falling into the same bucket as their host. Conveniently, that's what we want, since the purpose of such containers is to blend with the host as much as possible.

Fixes: #1611
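
The epoch selection described in this commit message can be sketched in Python as follows. This is an illustration, not the code from #1662. The machine-id mtime is preferred, and the repo cookie's relative epoch is used when the file is missing, unreadable or empty. The names `machine_id_epoch` and `reference_epoch` and the `cookie_epoch` parameter are hypothetical.

```python
import os
from typing import Optional


def machine_id_epoch(path: str = "/etc/machine-id") -> Optional[int]:
    """Return the file's mtime as UNIX time, or None if missing or empty."""
    try:
        st = os.stat(path)
    except OSError:
        return None  # e.g. no machine-id file at all
    if st.st_size == 0:
        return None  # e.g. an OCI container without an init system
    return int(st.st_mtime)


def reference_epoch(cookie_epoch: int, path: str = "/etc/machine-id") -> int:
    """System installation time if known, else the cookie's relative epoch."""
    absolute = machine_id_epoch(path)
    return absolute if absolute is not None else cookie_epoch
```

Because only the mtime of whatever file is visible at that path is used, a toolbox(1) container that bind-mounts the host's machine-id automatically gets the host's reference point.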

dmnks mentioned this issue Jul 28, 2023

Man page section about "countme" isn't entirely clear on system age rpm-software-management/dnf#1972

Closed

j-mracek assigned inknos Aug 3, 2023

jan-kolarik added this to DNF team Apr 5, 2024

github-project-automation bot moved this to Backlog in DNF team Apr 5, 2024

dmnks assigned dmnks and unassigned inknos Apr 8, 2024

dmnks moved this from Backlog to In Progress in DNF team Apr 9, 2024

dmnks mentioned this issue May 9, 2024

Fix countme bucket calculation #1662

Merged

jan-kolarik mentioned this issue Jun 3, 2024

Countme should report system age, not repository age rpm-software-management/dnf5#1525

Closed

jan-kolarik closed this as completed in #1662 Jun 6, 2024

github-project-automation bot moved this from In Progress to Done in DNF team Jun 6, 2024

travier mentioned this issue Jul 18, 2024

Consider updating countme logic to match dnf countme changes (using /etc/machine-id as system epoch) coreos/rpm-ostree#5020

Open
