-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: Deny localhost/private-ips in storage connections #26186
Conversation
MitigationsCompleting required mitigations increases Resilience Coverage.
Risk Summary:The pull request carries a high risk, with a 79% risk score, due to predictors such as the sum of bug reports of files changed and the delta of executable lines. Historically, PRs with these characteristics are 112% more likely to introduce bugs compared to the repository baseline. Additionally, the repository's predicted bug trend is on the rise, and there's at least one file hotspot indicating recent bug fix activity. The observed bug trend in the repository has also been increasing. Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity. Bug Hotspots:
|
d2ac91d
to
62f6dd1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good but there are a few things I think need to be fixed:
- ssh tunnels also do resolution and must be checked
- for completeness, though its probably not actually possible to spoof, we should check what aws privatelink hosts resolve to
- we can't unwrap in
rewrite_broker_addr
, we have to fallback to an known-unresolvable host (like in ssh failures) and communicate this error through a method liketunnel_status
@guswynn I'm pretty sure I added coverage of all the places where we connect to an SSH bastion host. We don't want to do the resolution on the destination URL in the case of SSH tunnels, since customers may want to connect to a private-address within their network that is resolvable via the SSH bastion-host. We just need to ensure the bastion host itself is global to us. |
@rjobanp you are totally correct, sorry about that! The kafka error comment should be it then! |
e1d9ce9
to
ff655ba
Compare
@guswynn fixed the failing test cases and addressed your feedback, mostly. We do now avoid unwraping in the I also opted not to do anything for privatelink connections -- there is no user input involved with getting the vpc-endpoint address and it seems pointless to complicate that logic which is already somewhat onerous to understand. |
@rjobanp I would rather we actually communicate a meaningful error to the user, instead of having them interpret whatever rdkafka says about the address |
8a4deb0
to
e142f6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The basic structure here looks good, but I have a wrench to throw into the works. Today, a user can specify a hostname that resolves to multiple IPs, and Materialize will try each of those IPs in order. After this change, Materialize will try only the first. That's a regression with possible availability implications. I know at least one of our customers uses DNS to provide HA for SSH tunnels, by round-robining traffic between two SSH bastion hosts.
It's going to require a good deal more elbow grease, but I think it's fundamentally possible to support this with the same structure you have here. We'll need each library (MySQL, PostgreSQL, Kafka, SSH tunnels) to support taking multiple addresses as input for the same hostname. Then resolve_address
can return a Vec<IpAddr>
, and we can pass that list of IPs into each library.
Here's a patch to our PostgreSQL fork that integrates an upstream change to allow taking multiple IPs as input: MaterializeInc/rust-postgres#22
The MySQL change is hopefully similarly straightforward, although we'll likely need to reintroduce our fork to avoid blocking on upstream merging our PR.
SSH is a little annoying, because we shell out to the ssh
binary and there's no way AFIACT to pass in a list of IPs, even though SSH does the right thing if you pass in a hostname that resolves to multiple IPs. But I think we can just adjust our SSH tunnel code to shell out to ssh
multiple times, one per IP, and it won't be too bad.
rdkafka is going to be the most annoying. The underlying librdkafka API supports resolve_cb
s that return multiple addresses, but it'll require some C goop to expose this functionality in rust-rdkafka. I can assist with this, if you'd like. We can also take the chance to allow rust-rdkafka's callback to be fallible, which should clean up the error handling a good bit.
Good point on this breaking support for HA DNS configs -- let me look into refactoring MySQL and the SSH connection stuff. I would appreciate the help on rust-rdkafka if you have time to look into that! |
Addressed the above feedback, and implemented the multiple resolved-ip support for MySQL, Postgres and SSH tunnels, but not yet for rdkafka. This PR is blocked until #26273 merges, which will allow the rust-postgres changes to be built |
2e180e7
to
04f4cea
Compare
b93a6f2
to
e2cf39b
Compare
Okay I've reverted all the non-ssh kafka connection & kafka client changes so that we can get this PR merged without waiting for the support in our rdkafka fork that will allow providing multiple resolved IPs for broker connections. I've left a TODO in its place and will reword the attached issue after we merge. @guswynn ready for another look - thanks! |
e2cf39b
to
3f2cbfc
Compare
…ors from kafka client dns resolution
3f2cbfc
to
a6e4449
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! I am accepting to unblock you, but before merging resolve_address
should return a BTreeSet
so all uses try ip's in order!
src/mysql-util/src/tunnel.rs
Outdated
@@ -27,7 +28,7 @@ use crate::MySqlError; | |||
#[derive(Debug, PartialEq, Clone)] | |||
pub enum TunnelConfig { | |||
/// Establish a direct TCP connection to the database host. | |||
Direct, | |||
Direct { resolved_ips: Option<Vec<IpAddr>> }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document what None
means? (here and for pg)
} | ||
|
||
fn is_global(addr: IpAddr) -> bool { | ||
// TODO: Switch to `addr.is_global()` once stable: https://github.com/rust-lang/rust/issues/27709 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These ones seem to cover more cases, so probably worth making a tracking issue in our repo for this! (actually let me know if you want help on the process of trying to push these through stabilization!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call - made a tracking issue https://github.com/MaterializeInc/materialize/issues/26649
src/postgres-util/src/tunnel.rs
Outdated
), | ||
} | ||
.to_owned(); | ||
// The number of 'host' and 'hostaddr' values must be the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The number of 'host' and 'hostaddr' values must be the same. | |
// Associate each resolved ip with the exact same, singular host, for tls verification. | |
// We are required to do this dance because `tokio-postgres` enforces that the number of 'host' and 'hostaddr' values must be the same. | |
src/postgres-util/src/tunnel.rs
Outdated
_ => bail_generic!( | ||
"only TCP connections to a single PostgreSQL server are supported" | ||
), | ||
} | ||
.to_owned(); | ||
// The number of 'host' and 'hostaddr' values must be the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here
src/ssh-util/src/tunnel.rs
Outdated
pub host: String, | ||
/// The hostname/IP of the SSH bastion server. | ||
/// If multiple hosts are specified, they are tried in order. | ||
pub host: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use this as a key when sharing ssh tunnels through the same bastion. To ensure we don't over-connect, this should be a BTreeSet
so its sorted. Actually, we should sort in the pg and mysql connection code too.
Note that there are weird cases now where:
- a host resolves to a different number of ip's on different workers
- different workers connect to different upstream ips for the pg/mysql/etc server, if they appear connectable at different times
Both cases result in additional ssh connections, which is not something we can work around, but might be worth a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call out! Switching to a BtreeSet
across all connection types and will note those edge cases in a comment
Motivation
Tips for reviewer
There were several approaches discussed to implement this security fix, but we felt that the simplest was to simply resolve DNS ourselves for all storage connections and validate that the resolved address is not to a local or private-ip, thereby preventing a malicious user from attempting to create a Materialize Source that connects to some service on the materialize network. Most of these attacks would already be prevented by Cilium at the K8s networking layer, but this adds additional security.
This feature is activated by a LD feature flag, which I plan to use for testing once this goes to Staging, since it's difficult to validate that it actually allows valid connections in local testing. Instead the tests just validate that when activated, it prevents local/private-ip connections.
This also fixes some bugs in the dyncfg implementation for storage #26076 which was accidentally dropping the dyncfg values when performing a StorageConfiguration update.
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.