Countme should report system age, not repository age #1611

Closed
dmnks opened this issue Jul 28, 2023 · 14 comments · Fixed by #1662

@dmnks
Contributor

dmnks commented Jul 28, 2023

Currently, when we compute the system's age bucket (1 through 4) to report in the weekly countme flag, we do that relative to the first-ever metadata refresh (called the epoch) of the respective repository. However, the original proposal intended that it would be the absolute age bucket, that is, since the installation.

This is because we store the cookie files (containing the timestamps) in per-repository directories (persistdir) whose names contain hashes derived from various repository properties, including the releasever value. That means the system's age bucket is effectively reset on each Fedora system upgrade, which is not what we want.

To fix this, we should simply keep one single cookie file for the entire system and use that to determine the system's age bucket.

There's a second countme implementation in rpm-ostree (here's why) which reportedly does the right thing. Looking at the code, they do appear to store only one cookie file per system (at /var/lib/rpm-ostree-countme/cookie), as it should be. I think we should just do the same.

To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file. When it comes to storing the new values at the end of the addCountmeFlag() function, that should already go into the system-wide cookie file. That way, systems that upgrade to the fixed DNF version would simply continue where they left off, instead of being reset to age 1. Note that this may need special care in case repositories are fetched in parallel.
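
To make the reset concrete, here is a minimal sketch of the bucket arithmetic. It is not the libdnf code: the function name `age_bucket` and the bucket limits (1, 4 and 24 weeks) are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(weeks=1)
# Assumed upper bounds (in weeks) for buckets 1, 2 and 3; anything older is 4.
BUCKET_LIMITS_WEEKS = (1, 4, 24)


def age_bucket(epoch, now, limits=BUCKET_LIMITS_WEEKS):
    """Return the age bucket (1 through 4) for the age `now - epoch`."""
    # Whole weeks elapsed since the epoch; a future epoch counts as age 0.
    age_weeks = max(now - epoch, timedelta(0)) // WEEK
    for bucket, limit in enumerate(limits, start=1):
        if age_weeks < limit:
            return bucket
    return len(limits) + 1


installed = datetime(2023, 1, 2, tzinfo=timezone.utc)
now = installed + timedelta(weeks=30)
# Epoch of the new persistdir created by a system upgrade three days ago.
upgraded = now - timedelta(days=3)

print(age_bucket(installed, now))  # bucket relative to the installation
print(age_bucket(upgraded, now))   # bucket relative to the per-repo epoch
```

With these assumed limits, the first call lands in the oldest bucket while the second starts over in bucket 1, which is the reset described above.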

@mattdm

mattdm commented Jul 29, 2023

> To avoid skewing the metrics, the fix should probably include a check for an old, repo-specific cookie file and if it exists, it should load the values from it and then remove the file.

Probably the "best" thing to do is find the oldest countme file (including disabled repos).

Hacky but maybe more accurate: does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used?

Both of these will probably cause "jumps" in my data — but I'm okay with that, really.

@dmnks
Contributor Author

dmnks commented Jul 29, 2023

I think we could use the transaction ID 1 in the DNF history database which, I believe, represents the fresh install through Anaconda. The transaction record contains the timestamp. On the CLI, you can check that with:

dnf history info 1 | grep 'Begin time'

That way, we wouldn't need to store the "epoch" in the cookie file, and would just always use the above timestamp for that.

@dmnks
Contributor Author

dmnks commented Jul 29, 2023

Thinking about it more, the first-ever transaction may not be a reliable indicator of the system age for ephemeral systems that are not installed through Anaconda but from an image (e.g. Podman containers). So we may need a different strategy (for those).

@supakeen

supakeen commented Jul 31, 2023

There are more systems that are not installed through Anaconda (the ARM version often gets installed from an image, virtual machines at cloud providers, etc.), so I wouldn't special-case it :)

@dmnks
Contributor Author

dmnks commented Jul 31, 2023

Thanks, that's a useful data point to have 😄

@dmnks
Contributor Author

dmnks commented Jul 31, 2023

Just FTR, @james-antill suggested in a chat that one solution would also be keeping per-repo countme files but doing that in directories named after the repo ID only (not a hash).

@travier

travier commented Aug 7, 2023

Just as FYI, here is the implementation in rpm-ostree that does not have this issue: https://github.com/coreos/rpm-ostree/blob/main/rust/src/countme/cookie.rs

@mattdm

mattdm commented Apr 1, 2024

Is there any movement on this? What is the implementation like in DNF 5?

@jan-kolarik
Member

> What is the implementation like in DNF 5?

I've just checked it, it's basically a clone of the dnf4 implementation.

> Is there any movement on this?

We'll discuss it with leadership and the team in the following days and provide feedback soon.

@github-project-automation github-project-automation bot moved this to Backlog in DNF team Apr 5, 2024
@mattdm

mattdm commented Apr 5, 2024

> I've just checked it, it's basically a clone of the dnf4 implementation.

Ah, bug-for-bug compatibility. :)

Am I possibly currently getting double-counts from people using both, or using e.g. GNOME Software + dnf5 in f39?

@dmnks dmnks assigned dmnks and unassigned inknos Apr 8, 2024
@dmnks
Contributor Author

dmnks commented Apr 8, 2024

If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

This is really silly and needs to be fixed ASAP. Since I wrote that code (and still remember how it works, kinda), it just makes sense for me to have a closer look, then... So I'll do just that, assigning to myself now.

@dmnks
Contributor Author

dmnks commented Apr 9, 2024

> If dnf4 and dnf5 both use a different repo "persistdir", then yep, we're likely double-counting already.

Good news, I guess. I've just checked and dnf5 uses the same persistent directories (/var/lib/dnf/repos/<repoid>-<hash>/) as dnf4, meaning that the two share their cookies and the countme flag is not sent twice, once by each tool.

@dmnks dmnks moved this from Backlog to In Progress in DNF team Apr 9, 2024
@dmnks
Contributor Author

dmnks commented Apr 12, 2024

TL;DR: A simple fix is underway. I'll be on PTO next week, so expect silence here until I'm back.

Having thought about this more, we do need to continue tracking the countme timestamps ("cookie" files in /var/lib/dnf/repos/) on a per-repo basis, as opposed to having one system-wide timestamp. This is simply because the countme flag is reported per-repo (via the metalink URL) and using a system-wide cookie would cause only one repo (whichever happens to be fetched first by dnf) to issue the flag each week, which is not what we want.
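
A minimal sketch of that reasoning, assuming a simplified model in which a cookie just remembers the last weekly window in which the flag was sent. The names `window_index`, `repos_reporting` and `cookie_for` are hypothetical, and windows are counted from the Unix epoch for simplicity.

```python
WEEK_SECONDS = 7 * 24 * 3600


def window_index(timestamp):
    """Number of whole weeks since the Unix epoch (simplified window)."""
    return int(timestamp) // WEEK_SECONDS


def repos_reporting(fetch_order, cookies, now, cookie_for):
    """Return the repos that would send the flag at `now`.

    `cookies` maps a cookie name to the last window it reported in and is
    updated in place; `cookie_for` maps a repo ID to the cookie it uses.
    """
    current = window_index(now)
    reported = []
    for repo in fetch_order:
        cookie = cookie_for(repo)
        if cookies.get(cookie) == current:
            continue  # this cookie already reported in the current window
        cookies[cookie] = current
        reported.append(repo)
    return reported


now = 1_700_000_000
# One cookie per repo: every repo reports once per window.
print(repos_reporting(["fedora", "updates"], {}, now, lambda repo: repo))
# One system-wide cookie: only the repo fetched first reports.
print(repos_reporting(["fedora", "updates"], {}, now, lambda repo: "system"))
```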

However, what we do want to change is the timestamps' dependence on the $releasever value: that value is part of the metalink URL, which is used to compute the hash. Therefore, the easiest fix is to just change the per-repo directory names in /var/lib/dnf/repos/ from <repoid>-<hash> to <repoid>. This was also mentioned above as one of the possible solutions.
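
As a rough illustration of why the directory name changes on upgrade, here is a sketch that assumes the hash is a SHA-256 digest of the expanded metalink URL, truncated to 16 hex characters. The truncation length, the helper name `persistdir_name` and the URL template are assumptions for this example, not taken from the libdnf source.

```python
import hashlib

# Illustrative metalink template; the variables are expanded before hashing.
METALINK_TEMPLATE = (
    "https://mirrors.fedoraproject.org/metalink"
    "?repo=fedora-{releasever}&arch={basearch}"
)


def persistdir_name(repoid, url, hashed=True):
    """Return the per-repo directory name under /var/lib/dnf/repos/."""
    if not hashed:
        return repoid  # proposed scheme: repo ID only
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{repoid}-{digest[:16]}"  # current scheme: <repoid>-<hash>


for releasever in ("39", "40"):
    url = METALINK_TEMPLATE.format(releasever=releasever, basearch="x86_64")
    print(releasever, persistdir_name("fedora", url), persistdir_name("fedora", url, hashed=False))
```

The hashed name differs between the two releases, while the plain repo ID stays the same across the upgrade.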

I have a working (one-line) patch for that locally, as well as an updated countme.feature test to cover this. So that part is easy.

The tricky part is to ensure that the cookie is not reset when the existing systems upgrade to the fixed libdnf version (once released). Since the directory name changes, libdnf would think that the system doesn't yet have a cookie file and thus 1) would start over, with age set to 1 (countme=1), as if the system was just freshly installed, and 2) would possibly send the flag again in the same week, thus double-counting the system in that week. This would skew the metrics we gather on the server quite a bit.

To prevent that, the cookie file needs to stay the same when you upgrade libdnf to the fixed version, as well as if you decide to downgrade to the old version for some reason. The easiest solution to that seems to be the following:

  1. In a (%post?) scriptlet in libdnf, check whether we have an existing cookie for the main repos ("fedora" and "updates"?).
  2. If we do, create a non-hash symlink for each of those repos. For example, if /var/lib/dnf/repos/fedora-845d89688cb28f31 exists, a symlink named /var/lib/dnf/repos/fedora pointing to the former would be created by the scriptlet.

This way, the same cookie file would be reused after upgrading to the new libdnf version as well as after downgrading it.

What the scriptlet needs to decide, though, is which directory to choose for the symlink target if there are multiple - that can happen easily, such as if dnf --releasever is ever used on the system.

I think it should choose the one that corresponds to the running Fedora version, e.g. by looking at VERSION_ID in /etc/os-release. This is quite easy to do: the hash is a SHA256 of the metalink URL, so we can compute it in the scriptlet using coreutils programs.

In fact, I also have a draft scriptlet locally which works as described above; we just need to decide which repositories to "migrate". I'd think "fedora" and "updates" should suffice, but please let me know otherwise.
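
The migration described above could look roughly like the following. This is a Python sketch of the idea rather than the draft scriptlet itself; the helper names are hypothetical, and the 16-character truncation of the SHA-256 digest is an assumption inferred from the example directory name.

```python
import hashlib
import os


def hashed_dir_name(repoid, metalink_url):
    """Old-style directory name: <repoid>-<hash> (truncation is assumed)."""
    digest = hashlib.sha256(metalink_url.encode("utf-8")).hexdigest()
    return f"{repoid}-{digest[:16]}"


def read_version_id(os_release_text):
    """Extract VERSION_ID from the contents of /etc/os-release, if present."""
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VERSION_ID":
            return value.strip().strip("\"'")
    return None


def migrate_persistdir(repos_dir, repoid, metalink_url):
    """Create <repos_dir>/<repoid> as a symlink to the hashed directory.

    Returns True if a symlink was created, False if there was nothing to
    migrate or the non-hash name already exists.
    """
    target = hashed_dir_name(repoid, metalink_url)
    link = os.path.join(repos_dir, repoid)
    if not os.path.isdir(os.path.join(repos_dir, target)):
        return False  # no cookie directory for the running release
    if os.path.lexists(link):
        return False  # already migrated, or a real directory is in place
    os.symlink(target, link)  # relative target, so the tree stays relocatable
    return True
```

A caller would expand the metalink URL with the value returned by `read_version_id()` and then call `migrate_persistdir("/var/lib/dnf/repos", "fedora", url)`. Because the old directory is left in place, a downgraded libdnf would still find its cookie.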

So, that's for a status update. I've decided to dump my thoughts here because I'll be on vacation next week and might otherwise forget the details 😄 Any feedback is of course welcome in the meantime. Just know that I'll only be able to respond when I'm back.

dmnks added a commit to dmnks/libdnf that referenced this issue May 8, 2024
Actually use the system's installation time (if known) as the reference
point, instead of the first-ever countme event recorded for the given
repo.

This is what the dnf.conf(5) man page has been saying all along, the
code just never lived up to it.

This fixes the following issues:

1. Systems that only reach out to the repos after an initial period of
   time after their installation appear "younger" than they really are.

2. Prebuilt OS images may include repo persistdirs with countme cookies
   in them that were created at build time, making all instances spawned
   from those images (physical machines, VMs or containers) appear much
   "older" than they really are.

3. System upgrades cause the bucket to be effectively reset to 1 due to
   the fact that a changed $releasever value causes a new persistdir to
   be created.

Use the machine-id(5) file's mtime as the single source of truth.  This
file is typically tied to the system's installation or first boot where
it's populated by an installer tool or init system, respectively, and is
never changed afterwards.

Keep the "relative" epoch (first countme event) as a fallback method,
though.  This is useful on those systems that don't have a machine-id
file (such as OCI containers) but are still used long-term.  In those
cases, system upgrades aren't really a thing so the above point 3 does
not apply.

Some containers may also choose to bind-mount the machine-id file from
the host (such as what toolbox(1) does), in which case their age will be
the same as that of the host.  Conveniently, that's also what we want,
since the purpose of such containers is to blend with the host as much
as possible.

Fixes: rpm-software-management#1611
dmnks added a commit to dmnks/libdnf that referenced this issue May 9, 2024
Actually use the system's installation time (if known) as the reference
point, instead of the first-ever countme event recorded for the given
repo.

This is what the dnf.conf(5) man page always said about the countme
option, the code just never lived up to that.

This makes bucket calculation more accurate:

1. System upgrades will no longer reset the bucket to 1 (this used to be
   the case due to a new persistdir being created whenever $releasever
   changed).

2. Systems that only reach out to the repos after an initial time period
   after being installed will no longer appear younger than they really
   are.

3. Prebuilt OS images that happen to include countme cookies created at
   build time will no longer cause all the instances spawned from those
   images (physical machines, VMs or containers) to appear older than
   they really are.

Use the machine-id(5) file's mtime to infer the installation time.  This
file is semantically tied to the system's lifetime since it's typically
populated at installation time or during the first boot by an installer
tool or init system, respectively, and remains unchanged.

The fact that it's a well-defined file with clear semantics ensures that
OS images won't accidentally include a prepopulated version of this file
with a timestamp corresponding to the image build, unlike our own cookie
files (see point 3 above).

In some cases, such as in OCI containers without an init system running,
the machine-id file may be missing or empty, even though the system is
still used long-term.  To cover those, keep the original, relative epoch
as a fallback method.  System upgrades aren't really a thing for such
systems so the above point 1 doesn't apply here.

Some containers, such as those created by toolbox(1), may also choose to
bind-mount the host's machine-id file, thus falling into the same bucket
as their host.  Conveniently, that's what we want, since the purpose of
such containers is to blend with the host as much as possible.

Fixes: rpm-software-management#1611
@dmnks
Contributor Author

dmnks commented May 9, 2024

> Hacky but maybe more accurate: does dnf create any other files in /var or /etc at install time that would likely have a corresponding file date which could be used?

Not dnf, but there's the /etc/machine-id file which, actually, seems to fit the bill quite perfectly. Its modification timestamp typically reflects the installation time as the ID is generated during system installation or first boot (by systemd) and then stays untouched.

So, scratch my above ponderings about changing the persistdir naming scheme. Instead, I've submitted #1662 which switches age counting to the machine-id file's timestamp.

Here's an updated BDD feature file which demonstrates the new logic (see the Examples table at the bottom of the Scenario Outline): https://github.com/rpm-software-management/ci-dnf-stack/blob/ab365d2bad19f69e188fb449fb6bcdd8834f5815/dnf-behave-tests/dnf/countme.feature#L44
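
For readers who don't want to dig through the patch, the selection logic can be sketched as follows. This is a simplified Python illustration, not the libdnf code; `system_epoch` and its parameters are hypothetical names.

```python
import os


def system_epoch(cookie_epoch, machine_id_path="/etc/machine-id"):
    """Return the timestamp to measure the system's age from.

    Prefer the machine-id file's mtime; fall back to the relative epoch
    stored in the repo's cookie if the file is missing or empty.
    """
    try:
        st = os.stat(machine_id_path)
    except OSError:
        return cookie_epoch  # no machine-id file (or not accessible)
    if st.st_size == 0:
        return cookie_epoch  # file exists but was never populated
    return int(st.st_mtime)
```

On a regular installation this returns the machine-id file's modification time regardless of what the per-repo cookie says, so a changed $releasever no longer affects the result.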

jan-kolarik pushed a commit that referenced this issue Jun 6, 2024
Actually use the system's installation time (if known) as the reference
point, instead of the first-ever countme event recorded for the given
repo.

This is what the dnf.conf(5) man page always said about the countme
option, the code just never lived up to that.

This makes bucket calculation more accurate:

1. System upgrades will no longer reset the bucket to 1 (this used to be
   the case due to a new persistdir being created whenever $releasever
   changed).

2. Systems that only reach out to the repos after an initial time period
   after being installed will no longer appear younger than they really
   are.

3. Prebuilt OS images that happen to include countme cookies created at
   build time will no longer cause all the instances spawned from those
   images (physical machines, VMs or containers) to appear older than
   they really are.

Use the machine-id(5) file's mtime to infer the installation time.  This
file is semantically tied to the system's lifetime since it's typically
populated at installation time or during the first boot by an installer
tool or init system, respectively, and remains unchanged.

The fact that it's a well-defined file with clear semantics ensures that
OS images won't accidentally include a prepopulated version of this file
with a timestamp corresponding to the image build, unlike our own cookie
files (see point 3 above).

In some cases, such as in OCI containers without an init system running,
the machine-id file may be missing or empty, even though the system is
still used long-term.  To cover those, keep the original, relative epoch
as a fallback method.  System upgrades aren't really a thing for such
systems so the above point 1 doesn't apply here.

Some containers, such as those created by toolbox(1), may also choose to
bind-mount the host's machine-id file, thus falling into the same bucket
as their host.  Conveniently, that's what we want, since the purpose of
such containers is to blend with the host as much as possible.

Fixes: #1611
@github-project-automation github-project-automation bot moved this from In Progress to Done in DNF team Jun 6, 2024