Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Deny localhost/private-ips in storage connections #26186

Merged
merged 6 commits into from
Apr 15, 2024

Conversation

rjobanp
Copy link
Contributor

@rjobanp rjobanp commented Mar 21, 2024

Motivation

Tips for reviewer

There were several approaches discussed to implement this security fix, but we felt that the simplest was to simply resolve DNS ourselves for all storage connections and validate that the resolved address is not to a local or private-ip, thereby preventing a malicious user from attempting to create a Materialize Source that connects to some service on the materialize network. Most of these attacks would already be prevented by Cilium at the K8s networking layer, but this adds additional security.

This feature is activated by a LD feature flag, which I plan to use for testing once this goes to Staging, since it's difficult to validate that it actually allows valid connections in local testing. Instead the tests just validate that when activated, it prevents local/private-ip connections.

This also fixes some bugs in the dyncfg implementation for storage #26076 which was accidentally dropping the dyncfg values when performing a StorageConfiguration update.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@rjobanp rjobanp requested review from benesch, guswynn and a team March 21, 2024 15:25
Copy link

shepherdlybot bot commented Mar 21, 2024

Risk Score:79 / 100 Bug Hotspots:1 Resilience Coverage:50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Observability
  • Feature Flag
  • Integration Test 🔍 Detected
  • QA Review
  • Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request carries a high risk, with a 79% risk score, due to predictors such as the sum of bug reports of files changed and the delta of executable lines. Historically, PRs with these characteristics are 112% more likely to introduce bugs compared to the repository baseline. Additionally, the repository's predicted bug trend is on the rise, and there's at least one file hotspot indicating recent bug fix activity. The observed bug trend in the repository has also been increasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:
What's This?

File Percentile
../src/connections.rs 92

@rjobanp rjobanp force-pushed the localhost-storage-fix branch 2 times, most recently from d2ac91d to 62f6dd1 Compare March 21, 2024 15:48
@rjobanp rjobanp mentioned this pull request Mar 21, 2024
5 tasks
Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good but there are a few things I think need to be fixed:

  • ssh tunnels also do resolution and must be checked
  • for completeness, though its probably not actually possible to spoof, we should check what aws privatelink hosts resolve to
  • we can't unwrap in rewrite_broker_addr, we have to fallback to an known-unresolvable host (like in ssh failures) and communicate this error through a method like tunnel_status

@rjobanp
Copy link
Contributor Author

rjobanp commented Mar 21, 2024

ssh tunnels also do resolution and must be checked

@guswynn I'm pretty sure I added coverage of all the places where we connect to an SSH bastion host. We don't want to do the resolution on the destination URL in the case of SSH tunnels, since customers may want to connect to a private-address within their network that is resolvable via the SSH bastion-host. We just need to ensure the bastion host itself is global to us.

@guswynn
Copy link
Contributor

guswynn commented Mar 21, 2024

@rjobanp you are totally correct, sorry about that! The kafka error comment should be it then!

@rjobanp rjobanp force-pushed the localhost-storage-fix branch 2 times, most recently from e1d9ce9 to ff655ba Compare March 21, 2024 20:59
@rjobanp
Copy link
Contributor Author

rjobanp commented Mar 21, 2024

@guswynn fixed the failing test cases and addressed your feedback, mostly.

We do now avoid unwraping in the rewrite_broker_addr method and instead return a known unresolvable host if we can't resolve the broker address. We don't need to report the error like we do for SSH since we don't need to distinguish the 'namespace' of the error, and the error will still get propagated as a DNS error by the rdkafka itself when it tries to resolve the broker address.

I also opted not to do anything for privatelink connections -- there is no user input involved with getting the vpc-endpoint address and it seems pointless to complicate that logic which is already somewhat onerous to understand.

@guswynn
Copy link
Contributor

guswynn commented Mar 21, 2024

@rjobanp I would rather we actually communicate a meaningful error to the user, instead of having them interpret whatever rdkafka says about the address failed-broker-address.dev.materialize.com, but if you want to add a TODO and leave that for later, that fine for me!

@rjobanp rjobanp force-pushed the localhost-storage-fix branch 3 times, most recently from 8a4deb0 to e142f6e Compare March 25, 2024 13:26
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic structure here looks good, but I have a wrench to throw into the works. Today, a user can specify a hostname that resolves to multiple IPs, and Materialize will try each of those IPs in order. After this change, Materialize will try only the first. That's a regression with possible availability implications. I know at least one of our customers uses DNS to provide HA for SSH tunnels, by round-robining traffic between two SSH bastion hosts.

It's going to require a good deal more elbow grease, but I think it's fundamentally possible to support this with the same structure you have here. We'll need each library (MySQL, PostgreSQL, Kafka, SSH tunnels) to support taking multiple addresses as input for the same hostname. Then resolve_address can return a Vec<IpAddr>, and we can pass that list of IPs into each library.

Here's a patch to our PostgreSQL fork that integrates an upstream change to allow taking multiple IPs as input: MaterializeInc/rust-postgres#22

The MySQL change is hopefully similarly straightforward, although we'll likely need to reintroduce our fork to avoid blocking on upstream merging our PR.

SSH is a little annoying, because we shell out to the ssh binary and there's no way AFIACT to pass in a list of IPs, even though SSH does the right thing if you pass in a hostname that resolves to multiple IPs. But I think we can just adjust our SSH tunnel code to shell out to ssh multiple times, one per IP, and it won't be too bad.

rdkafka is going to be the most annoying. The underlying librdkafka API supports resolve_cbs that return multiple addresses, but it'll require some C goop to expose this functionality in rust-rdkafka. I can assist with this, if you'd like. We can also take the chance to allow rust-rdkafka's callback to be fallible, which should clean up the error handling a good bit.

src/kafka-util/src/client.rs Outdated Show resolved Hide resolved
src/ore/src/netio/dns.rs Outdated Show resolved Hide resolved
src/ore/src/netio/dns.rs Outdated Show resolved Hide resolved
@rjobanp
Copy link
Contributor Author

rjobanp commented Mar 25, 2024

Good point on this breaking support for HA DNS configs -- let me look into refactoring MySQL and the SSH connection stuff. I would appreciate the help on rust-rdkafka if you have time to look into that!

@rjobanp
Copy link
Contributor Author

rjobanp commented Mar 26, 2024

Addressed the above feedback, and implemented the multiple resolved-ip support for MySQL, Postgres and SSH tunnels, but not yet for rdkafka. This PR is blocked until #26273 merges, which will allow the rust-postgres changes to be built

@rjobanp rjobanp force-pushed the localhost-storage-fix branch 3 times, most recently from 2e180e7 to 04f4cea Compare April 2, 2024 15:45
@rjobanp rjobanp force-pushed the localhost-storage-fix branch 2 times, most recently from b93a6f2 to e2cf39b Compare April 12, 2024 15:30
@rjobanp
Copy link
Contributor Author

rjobanp commented Apr 12, 2024

Okay I've reverted all the non-ssh kafka connection & kafka client changes so that we can get this PR merged without waiting for the support in our rdkafka fork that will allow providing multiple resolved IPs for broker connections. I've left a TODO in its place and will reword the attached issue after we merge. @guswynn ready for another look - thanks!

@rjobanp rjobanp force-pushed the localhost-storage-fix branch from e2cf39b to 3f2cbfc Compare April 12, 2024 19:46
@rjobanp rjobanp force-pushed the localhost-storage-fix branch from 3f2cbfc to a6e4449 Compare April 15, 2024 14:27
Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! I am accepting to unblock you, but before merging resolve_address should return a BTreeSet so all uses try ip's in order!

@@ -27,7 +28,7 @@ use crate::MySqlError;
#[derive(Debug, PartialEq, Clone)]
pub enum TunnelConfig {
/// Establish a direct TCP connection to the database host.
Direct,
Direct { resolved_ips: Option<Vec<IpAddr>> },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document what None means? (here and for pg)

}

fn is_global(addr: IpAddr) -> bool {
// TODO: Switch to `addr.is_global()` once stable: https://github.com/rust-lang/rust/issues/27709
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ones seem to cover more cases, so probably worth making a tracking issue in our repo for this! (actually let me know if you want help on the process of trying to push these through stabilization!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

),
}
.to_owned();
// The number of 'host' and 'hostaddr' values must be the same.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The number of 'host' and 'hostaddr' values must be the same.
// Associate each resolved ip with the exact same, singular host, for tls verification.
// We are required to do this dance because `tokio-postgres` enforces that the number of 'host' and 'hostaddr' values must be the same.

_ => bail_generic!(
"only TCP connections to a single PostgreSQL server are supported"
),
}
.to_owned();
// The number of 'host' and 'hostaddr' values must be the same.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here

pub host: String,
/// The hostname/IP of the SSH bastion server.
/// If multiple hosts are specified, they are tried in order.
pub host: Vec<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this as a key when sharing ssh tunnels through the same bastion. To ensure we don't over-connect, this should be a BTreeSet so its sorted. Actually, we should sort in the pg and mysql connection code too.

Note that there are weird cases now where:

  • a host resolves to a different number of ip's on different workers
  • different workers connect to different upstream ips for the pg/mysql/etc server, if they appear connectable at different times

Both cases result in additional ssh connections, which is not something we can work around, but might be worth a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call out! Switching to a BtreeSet across all connection types and will note those edge cases in a comment

@rjobanp rjobanp merged commit a25a115 into MaterializeInc:main Apr 15, 2024
73 checks passed
ParkMyCar added a commit that referenced this pull request Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants