Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc "phones home" even when analytics are disabled (via the updater) #4177

Closed
honnibal opened this issue Jul 7, 2020 · 10 comments
Closed

dvc "phones home" even when analytics are disabled (via the updater) #4177

honnibal opened this issue Jul 7, 2020 · 10 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@honnibal
Copy link

honnibal commented Jul 7, 2020

Even if the analytics are disabled, DVC still contacts Iterative's servers during the update check, which potentially provides another avenue for usage information to be collected: https://github.com/iterative/dvc/blob/master/dvc/updater.py . This makes DVC difficult to recommend to users in privacy-sensitive contexts.

I think the updater should be disabled when the analytics are disabled. If you're adamant that you want the updater to have this behaviour, I think it should at least call out to the PyPi server rather than https://updater.dvc.org

Separately, and this is more a subjective thing than the privacy question, I think the way the updates are advertised to the terminal is very noisy. I don't think a CLI tool should be mixing messages like this into its output:

+-----------------------------------------+
|                                         |
|     Update available 1.1.1 -> 1.1.7     |
|     Run `pip install dvc --upgrade`     |
|                                         |
+-----------------------------------------+
Stage is cached, skipping  

These messages make the tool much harder to use programmatically, and I feel it's both intrusive and unnecessary. It's up to users to track updates and choose the version they want to use.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jul 7, 2020
@efiop
Copy link
Contributor

efiop commented Jul 7, 2020

Hi @honnibal !

Thanks for the great feedback!

Regarding noisiness and providing an ability to disable checks, we have an older ticket #1894 discussing it and some ways to solve it. I think we could totally consider introducing a config option to disable update checks (both fetching and notifications). That should be enough for anyone who doesn't want updates for one reason or another. What do you think?

Regarding the notifications themselves for users that didn't turn off the updater, I agree that we could make them less annoying. Maybe by only notifying once a day/week. But what is clear is that we should take great care not print that notification to stdout, so it is not mixed up with the expected output (e.g. in machine-readable --show-json/md/etc).

There is still a concern about critical updates, that we don't currently have a mechanism for 🙁

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jul 7, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jul 7, 2020
@honnibal
Copy link
Author

honnibal commented Jul 7, 2020

Hi @efiop , thanks for the prompt reply!

I'm not sure whether I should talk about this issue here or on the old one. I wrote something for that thread but I'll paste it here instead.

@shcheklein writes:

is there a case in which users would not want to know about an update that potentially fixes a problem that can lead to data loss? We can give an option to disable it completely, and probably should do this, but we should emphasize that you are at risk when you are doing so, and it should be a second or third option after you have tuned the interval first.

Respectfully, I think you're thinking about this the wrong way. I argue this feature should be removed entirely; in fact it should never have been implemented.

It's up to users to decide which version of your software to install. It's also up to them to set up whatever notification feed they want, so that they can be as notified as they want to be. If you're concerned about critical data loss bugs, you could set up a mailing list and ask users to subscribe if they want those notifications. Heck, if you want to go the extra mile, let users who want those notifications give a phone number and then you can text them.

But if you make that mailing list and someone doesn't sign up...well, what does that tell you? It means they didn't care to get those notifications. So then if they miss those notifications, well...It's still sort of your fault for shipping a bug that lost their data. They will always feel sad about that. But, also, they could have signed up for update notifications, and the fact that they didn't know about the update will not be your fault.

Also like. If you really do ship a bug that's losing people's data...don't worry, they'll hear about it! That's kind of an unusual goof. Words will be said, articles will be shared.

Any library I use could in theory have some catastrophic bug that deletes my hard drive. But can you imagine life if every library phoned home to check for updates, and nagged me if it were out of date?

The update notifier isn't normal. It's not something other libraries do, and it's not something you should do either. This sort of thing is really holding DVC back. I think you should remove the analytics entirely as well, but that's a separate topic...

@shcheklein
Copy link
Member

@honnibal thanks for a thorough answer! I really appreciate it. Clearly, I think very differently about this. Let me clarify it, but overall we won't find a silver bullet here or one answer fits all, clearly we can take different directions and different defaults on this.

It's not something other libraries do, and it's not something you should do either.

DVC is not a library (it can be used a library and I hope it does not check for updates in this case). There are CLI tools that check for updates. Some of them quite popular:

NPM
Heroku CLI

If you really do ship a bug that's losing people's data...don't worry, they'll hear about it!

So, like I said, I look at this in a different way. We are taking an extra step (it's not free for us - it complicates DVC itself, requires some lambda that compares versions) to support this because for most our users it's better to be notified. It reminds me regular desktop software, especially browsers that has been moving this direction last few years and thus making web safer. Why CLI should be different?

It means they didn't care to get those notifications.

no. In most cases it means that folks were not even aware about them, or haven't had time to give a thought.

Again, not trying to convince you (I think this would be a philosophical debate), but as soon as:

  • we provide options to opt-out and/or regulate intervals, severity, etc
  • we extra explicit and open about what and why
  • we extra clear about data policies, what do we collect (in this case we don't collect anything), how do we use that data, etc

I think we should satisfy most user requirements and if you someone hates this - there should be a simple path to disable it (it's an explicit action and they will understand consequences).

@honnibal
Copy link
Author

honnibal commented Jul 8, 2020

Thanks for engaging with this.

It's true that DVC is a CLI tool rather than a library, and that does change the balance slightly. And it's true that some CLI tools check for updates. But consider all the other tools you use. Very very few of them do this, and I can't easily think of any where you can't disable the behaviour.

In the use-case I have in mind for DVC, I expect the tool to be executed from other software, e.g. as part of automation. So I might configure and test a container with a particular version of DVC. Once I have created that configuration, I want the work to be run with exactly that version, and later I will consider updating the components and creating a new artifact with more recent versions.

Similarly if I export a project to a colleague, I expect them to run it with the versions I specify. Once they have it working, they can consider updating components and rerunning to ensure nothing has been broken.

The results of the system should not change based on external state. Already with the update notifications, you'll get different outputs from DVC as time goes by, even if the update notifications don't break your CLI parsing scripts, they'll at least make the outputs different. So you've built in some subtle bit-rot: the same program executed with the same arguments will produce different logging next week compared to what it produces today. That's bad.

And again, all of this is completely unnecessary. You can --- should! --- simply not do this! Anyone who wants to can easily follow updates to your software, however they want. People who want to know when updates are published can subscribe. It's a more reliable way to be notified anyway! If you publish these messages into the CLI, there's no guarantee that any human is reading them.

Another thing I will say is, you're trying to solve the problem of "how can we mitigate the impact of bugs we ship to our users". Well, one way that users mitigate the impact of software bugs is to update their software less often, or to time their updates based on whether they're ready to deal with unforeseen problems. It is a perfectly valid strategy to choose software versions that are less recent, in order to avoid being on the bleeding edge of exposure to the latest bugs. People can rightly reason that if a version of DVC has been out for a while and they've been using it successfully and they haven't heard about a severe data loss bug, they probably won't suffer from a severe data loss bug tomorrow. But if they update to a version they haven't tested yet? Well, who knows?

This is a perfectly rational strategy, and I'm sure you make this decision yourself every day. People need the ability to decide whether today is a day they want to encounter new problems, or whether they'd rather stick to the old problems they've already encountered. You're actually making the exposure-to-bugs problem much worse if you make it unpleasant to use anything but the latest release of your software.

Finally, note that it is quite common to be processing data on machines that are blocked from external network access. That's a very sane configuration if you're working with even moderately sensitive data. Tools designed to work in these settings should offer the simple guarantee that they do not unexpectedly call out to the web.

@shcheklein
Copy link
Member

... all the concerns like " I expect them to run it with the versions I specify", "I want the work to be run with exactly that version", "It is a perfectly valid strategy to choose software versions that are less recent"

I think that's already the case, right? DVC does not update itself by default? And you can pin (freeze) the version with the requirements file, for example, or with a script that installs a specific version. Except snap it doesn't force any updates, as far as I know.

"Tools designed to work in these settings should offer the simple guarantee that they do not unexpectedly call out to the web.", "... if you make it unpleasant to use anything but the latest release of your software." concerns

It looks like it can be mitigated with an opt-out option, wdyt?

@honnibal
Copy link
Author

honnibal commented Jul 9, 2020

I think that's already the case, right? DVC does not update itself by default

Yes but you've made this workflow really unpleasant. Every execution comes with a large notification that the user doesn't want to see and can't disable. And this on top of the "support" message every time you type a path wrong, again which can't be disabled. And then every time you install it in a new virtualenv, you have to remember the config command to disable the analytics. All of this feels really bad. I really think it's hurt your adoption a lot.

It looks like it can be mitigated with an opt-out option, wdyt?

Yes I think it would be important to be able to opt-out, and I think it's a shame that there hasn't been an opt-out until now. I think it would be good to default to disabling the update checks for users who disable the analytics, although it's good to have a separate setting as well.

@efiop
Copy link
Contributor

efiop commented Jul 9, 2020

So to summarize: we'll introduce a config option to disable update checks/notifications and will cleanup the "need help" messages (#4126). Thanks for the feedback!

@efiop
Copy link
Contributor

efiop commented Jul 9, 2020

Closing in favor of #1894

@efiop efiop closed this as completed Jul 9, 2020
@honnibal
Copy link
Author

Thanks!

@efiop
Copy link
Contributor

efiop commented Aug 18, 2020

Hi @honnibal . FYI: core.check_update config option was added in 1.5.0, please upgrade and give it a try 🙂 Thanks for the feedback! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response we are waiting for your reply, please respond! :)
Projects
None yet
Development

No branches or pull requests

3 participants