INCIDENT: Primary nodejs.org web server taken offline by hosting provider #1659

Closed
rvagg opened this issue Jan 14, 2019 · 3 comments
rvagg (Member) commented Jan 14, 2019

DigitalOcean have taken our primary nodejs.org server offline due to suspicious traffic patterns. From the information they have provided, it looks like a simple false positive, and we're trying to get it resolved ASAP.

In the meantime, our backup server is handling the load and will hopefully provide adequate continuity. Please report anything that seems out of shape here and we'll try to get it addressed.

Until resolved:

  • @nodejs/releasers please don't try to publish any new releases; they won't go up.
  • @nodejs/website sorry, but you won't be able to publish anything for now. Let us know if you have important changes that need to go up and we may be able to sort them out manually.
rvagg (Member, Author) commented Jan 15, 2019

The server is back online. We don't have full information yet, but it appears to be a false positive: we've run afoul of some heuristic in their safety system.

Unfortunately, in the process of restoring, the SSH host keys got reset (I don't have a solid explanation for this yet, but my guess is that it's related to their snapshot logic), and I don't have a backup of the old ones. So anyone (and anything) using SSH access will get host key mismatch errors. @nodejs/releasers are going to have to fix this up before they can run releases again. If you want to get this sorted out now, run ssh direct.nodejs.org and fix the errors that come up by cleaning up your ~/.ssh/known_hosts, either manually (macOS) or with the suggested commands (Linux). There will probably be two lines you need to remove to get it working with no warnings: one for the IP address and one for the hostname.
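
For example, here's a minimal sketch of that cleanup using OpenSSH's ssh-keygen, assuming the default ~/.ssh/known_hosts location (the IP address below is a placeholder; use the one shown in the mismatch warning):

  ssh-keygen -R direct.nodejs.org   # remove the stale hostname entry from known_hosts
  ssh-keygen -R <server-ip>         # remove the stale IP entry (substitute the real IP)
  ssh direct.nodejs.org             # reconnect and verify/accept the new host key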

I unfortunately need to do the same process manually on all of the release machines, so we'll need to keep an eye on nightly builds to make sure we're getting everything we're supposed to. ci-release should go red on upload failures, so it should be obvious if anything isn't fixed.
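
One way to spot-check that nightlies are still landing (a sketch, assuming the index.json listing published under nodejs.org/download/nightly/ and a local jq install):

  # print the newest nightly's version and date; the date should keep advancing
  curl -s https://nodejs.org/download/nightly/index.json | jq -r '.[0] | "\(.version) \(.date)"'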

rvagg (Member, Author) commented Jan 16, 2019

We're over the hump on this one and our new server seems to be operating just fine. The experience has demonstrated that our current redundancy strategy is acceptable, but not perfect. There were a few problems with the setup that we've identified, but for the vast majority of people there was no noticeable impact.

We don't know exactly which heuristic we hit; I still suspect it was simply the volume of traffic being all outgoing and focused on a small number of hosts (Cloudflare edge locations). This isn't a typical pattern, and it's because we still rely on our nginx logs for metric collection, so we serve all binaries from our own servers even with Cloudflare in front. The effort to replace that with Cloudflare logging has stalled (it's not a straightforward process), but this incident is a good prompt to continue that work. Once that's done, the traffic out of DigitalOcean (and Joyent, where our backup lives) will be considerably smaller, and we could even consider downsizing the servers we use.

This incident also highlights other architectural concerns. We have been having ongoing discussions about re-engineering our architecture to better distribute resources, tools and access, rather than doing so much on a single server that ends up with "crown jewels" status and only a few people able to properly administer it.

Overall though, not a terrible experience.

rvagg closed this as completed Jan 16, 2019
rvagg (Member, Author) commented Jan 29, 2019

Finally got a response from DigitalOcean; it confirms the theory:

It appears that there was an increase of network traffic at the time that we were notified about this possible attack. It looks like this was a simple mistake by our team. Again, we apologize for accidentally disabling the network on this droplet.
