Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(perf): Extract scrubbed IP addresses into the span.domain tag #3383

Merged
merged 40 commits into from
Apr 15, 2024

Conversation

gggritso
Copy link
Member

@gggritso gggritso commented Apr 5, 2024

tl;dr URLs with IP domains like http://8.8.8.8/data get the span.domain tag of *.*.*.*. Also a small refactor to support this.

Right now, if a span's description looks like GET http://8.8.8.8/data the scrubbed span.description is GET http://*.*.8.8/ and span.domain is None. This is inconsistent, and also makes for a bad user experience because people cannot find requests to certain IPs by the domain.

There are two causes:

  1. IP addresses in descriptions are ran through domain scrubbing, even though they're not domains. This turns 8.8.8.8 into *.*.8.8 because the scrubber thinks the last two 8s are a TLD
  2. The scrubbed descriptions are parsed by URL and are invalid. The parsing fails, and no domain tag is produced.

This PR makes some refactors, and turns off scrubbing of IP host strings.

Changes

  • normalize_domain --> scrub_host. Instead of a function that takes a domain string and a port and does the asterisking, the function accepts a Url::Host and does different things based on whether it's an IP or a qualified domain. This is nicer because the name is clearer and the function does fewer things
  • scrub_domain_name is a new function that only replaces a string with asterisks. This makes it possible to call this function independently of other scrubbing logic (e.g., to forgo calling it for IP addresses)
  • concatenate_host_and_port is a simple concatenator, since I saw this logic in a few places
  • scrub_ipv4 and scrub_ipv6 that fully scrub out IP addresses

Questions

This will increase the cardinality by the number of ports used with IP addresses in the wild.

I think we'll probably want to do some Big Query research to see how many unique IPs are in the dataset, and maybe only partially scrub the IP (e.g., *.*.*.67).

Otherwise, there are code-level questions:

1. Should I separate this PR out into the refactor and a separate PR to disable IP scrubbing?
2. I had a hard time deciding when to return/accept &str and String. I ended up returning String in most places, since I think the functions that get the object back to should ownership?
3. Overall, I'm not confident about Rust style here, you tell me!

@jjbayer
Copy link
Member

jjbayer commented Apr 8, 2024

@gggritso thank you for picking this up and for improving separation of concerns!

I think we'll probably want to do some Big Query research to see how many unique IPs are in the dataset, but maybe it's safe enough to try and see?

That would be OK with me! As long as we follow up with some cardinality analysis.

  1. Should I separate this PR out into the refactor and a separate PR to disable IP scrubbing?

That would be ideal, but IMO the PR is still small enough to keep them both in one.

  1. I had a hard time decising when to return/accept &str and String. I ended up returning String in most places, since I think the functions that get the object back to should ownership?

See PR comment.

@gggritso gggritso changed the title feat(perf): Extract IP addresses into the span.domain tag feat(perf): Extract scrubbed IP addresses into the span.domain tag Apr 8, 2024
@gggritso
Copy link
Member Author

gggritso commented Apr 8, 2024

@jjbayer I made a few major changes:

  1. No more actual extraction of IP addresses, all IPs are fully scrubbed out
  2. Tried to address your feedback, but I'm not good at Rust, so I might have done a bad job.

Thanks for taking an early look!

gggritso added 4 commits April 9, 2024 14:00
Instead of allocating strings (especially since it's just for
localhost), use the provided constants.
We're looking for a defined host and undefined port, rather than an
`Option` host.
@gggritso gggritso marked this pull request as ready for review April 10, 2024 14:50
@gggritso gggritso requested a review from a team as a code owner April 10, 2024 14:50
@gggritso
Copy link
Member Author

@jjbayer made another round of changes! Should be pretty close now 🙌🏻

gggritso and others added 5 commits April 12, 2024 14:01
- accept `&str`
- return `Cow`
Don't bother with one-time-long allow lists. Use a `match` and don't
allocate anything.
@gggritso
Copy link
Member Author

@Dav1dde @jjbayer thanks for bearing with me, and for the thorough review 🙏🏻 I tried to implement all your suggestions, but I might have missing something, please let me know! The major changes:

  • Cow in a lot more places
  • no more allowlists for IPs, using a match instead
  • better function docs, including unit tests
  • removed unnecessary casting

@gggritso gggritso requested review from Dav1dde and jjbayer April 12, 2024 18:49
Copy link
Member

@Dav1dde Dav1dde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of docs nits and I don't know why they weren't caught by CI :(

@gggritso
Copy link
Member Author

@Dav1dde updated! Added/fixed docstrings and committed your Cow usage suggestions as-is, thanks!

@gggritso gggritso requested a review from Dav1dde April 15, 2024 14:19
Copy link
Member

@Dav1dde Dav1dde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@gggritso gggritso merged commit 72b50ef into master Apr 15, 2024
20 checks passed
@gggritso gggritso deleted the fix/perf/extract-ip-domain-tags-ii branch April 15, 2024 14:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants