Add the ability to distinguish unique installations to anonymous usage report #4077

Open
oleiade opened this issue Nov 27, 2024 · 5 comments

@oleiade
Member

oleiade commented Nov 27, 2024

Feature Description

Problem Definition

k6 currently collects anonymous usage information as part of its opt-out usage report (--no-usage-report). This report helps us understand k6 usage patterns to improve the tool and guide development decisions.

To better support the k6 development process, we would like to measure the number of active installations of k6 over time.

This requires the ability to track when a given installation of k6 (on a machine that has not opted out of the usage report) was last used. Since each usage report already includes a timestamp, the only additional functionality needed is a mechanism to distinguish one installation from another.

Considerations

The identifier introduced to enable this functionality would:
• Be anonymous.
• Be stored locally on the machine running k6.
• Be included in the usage report only if telemetry has not been opted out.

This identifier:
• Will not contain any personally identifiable information (PII) or system-specific data (e.g., username, hostname, IP address, etc.).
• Will comply with GDPR and other relevant privacy laws by being designed to avoid user identification and to remain strictly anonymous.

Risks

  • Storing a random identifier locally exposes it to tampering. We should not trust data that the user can modify, so it would be nice to have an identifier that k6 can reproduce or verify before submitting it.

Why This Matters

Having a reliable measure of active installations will:
• Allow us to make more informed decisions about features and improvements.
• Help us better understand k6’s reach and growth while respecting user privacy.

Suggested Solution (optional)

Inspiration & References

ID generation

  • machineid
  • In a previous role, I was exposed to a similar need, and we used a system fingerprinting mechanism that created a hash for the user system. We had the ability to verify this fingerprint, but the hash itself was cryptographic and thus non-reversible.

Proposed solution(s)

TODO

Already existing or connected issues / PRs (optional)

#4038

@oleiade
Member Author

oleiade commented Nov 27, 2024

For context, the Alloy project uses a UUID they call a "seed". This seed is saved on disk on the user system as a "seed file".
See https://github.com/grafana/alloy/blob/cc383c1edf988fd4763582c86a2e4b85bcc0f055/internal/alloyseed/alloyseed.go.

cc @joanlopez

@joanlopez
Contributor

The most challenging part I see here is to consider what you @oleiade included in the risks section, especially considering that this is an open-source project, which makes it harder to keep some secrets unrevealed.

However, I'm not quite sure it's really worth it, because as of now we're not doing anything to prevent fake data at the report level, and I see this as just a subcase of that.

Do you have any particular idea on how to solve this?

@oleiade
Member Author

oleiade commented Nov 28, 2024

@joanlopez I have a couple of ideas, but I don't necessarily think any of them are worth the hassle:

  • We could sign the UUID we generate with a public/private key pair. That would involve a bit of infrastructure work that I'm not even sure is possible. But that would work.
  • We could bake hardware- and environment-related information into the identifier: id = sha256(os_version + architecture + UUID + salt) (or something along those lines). That way both k6 and the usage report receiver are able to verify the hash in a somewhat reliable way.
  • A bunch of other ways I don't necessarily think are worth it.
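
For reference, the hash-based idea could be sketched as follows. This is only an illustration under stated assumptions: the salt value and the function names are made up, the OS version is passed in as a plain string (`runtime.GOOS` is used as a stand-in in `main`), and the inputs are joined with a NUL separator rather than plain concatenation so that different input combinations cannot collide. Because k6 is open source, any salt shipped in the binary is public, and the receiver can only recompute the hash if it also receives the inputs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"strings"
)

// salt is a hypothetical constant; in an open-source client it is not secret.
const salt = "example-salt"

// DeriveID computes sha256 over osVersion, arch, uuid and salt, joined with
// a NUL byte, and returns the digest as a hex string.
func DeriveID(osVersion, arch, uuid string) string {
	input := strings.Join([]string{osVersion, arch, uuid, salt}, "\x00")
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])
}

// VerifyID recomputes the hash from the same inputs and compares it with id.
func VerifyID(id, osVersion, arch, uuid string) bool {
	return id == DeriveID(osVersion, arch, uuid)
}

func main() {
	uuid := "123e4567-e89b-42d3-a456-426614174000" // placeholder value
	id := DeriveID(runtime.GOOS, runtime.GOARCH, uuid)

	fmt.Println(len(id))                                          // prints 64
	fmt.Println(VerifyID(id, runtime.GOOS, runtime.GOARCH, uuid)) // prints true
	fmt.Println(VerifyID(id, "other-os", runtime.GOARCH, uuid))   // prints false
}
```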

In general I don't really think any of those are worth it? Would you agree?
The core risk can also most likely be statistically mitigated, by correlating with other information we collect in the usage report, too.

@joanlopez
Contributor

> In general I don't really think any of those are worth it? Would you agree? The core risk can also most likely be statistically mitigated, by correlating with other information we collect in the usage report, too.

Yeah, I agree, as I said before. Indeed, if we want to implement any sort of hashing checks, I'd probably suggest doing it for the whole payload, and not only for this concrete field, because any part of the report could be altered.

The problem is that there's probably no really safe and cheap/easy way to do so. Just for the id, it's true that, for instance, your first suggestion would probably work and would be mostly safe, but I still have serious doubts about it really being worth it, for the aforementioned reasons.

@oleiade
Member Author

oleiade commented Nov 28, 2024

I agree. My preference would be to adopt the same UUID approach as Alloy and address any issues incrementally as they occur 👍


2 participants