Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable per table metrics collection #5649

Closed

Conversation

sverch
Copy link
Contributor

@sverch sverch commented Jan 3, 2020

If the user is creating a large number of tables, this could cause a spike in memory. This change allows the user to turn off any metrics that scale with the number of tables.

@sverch sverch requested a review from sougou as a code owner January 3, 2020 02:09
@sverch
Copy link
Contributor Author

sverch commented Jan 3, 2020

@deepthi if you have a chance could you take a look at this?

Also, do you have suggestions on how to test a change like this?

@sverch
Copy link
Contributor Author

sverch commented Jan 3, 2020

I'd like to just build/run something based on the in-tree code to at least verify this. Does the test suite have something that can do this?

@deepthi
Copy link
Member

deepthi commented Jan 3, 2020

I'd like to just build/run something based on the in-tree code to at least verify this. Does the test suite have something that can do this?

There is a unit test for these metrics at https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletserver/schema/engine_test.go#L293

You should be able to copy/extend that to test with the flag set to false.

@sverch
Copy link
Contributor Author

sverch commented Jan 6, 2020

Thanks @deepthi! I looked at that and I'm not sure it tests exactly this change, but it's definitely a good starting point. I'll take another look and see if there's a way to make this more testable.

I think the issue is that in the way I implemented this the metrics are still being collected, just not exported to the prometheus metrics page. If I instead make the option not collect those metrics at all, that may be strictly better and make it more testable at the same time.

Even if I do that though, I would want to do the full test to make sure it works all together and that the page shows what I expect. Is there some kind of end to end integration test focused on the vttablet binary that I can hook into for that? Otherwise, I'll probably just try to set up a lot of this manually, or try to get the docker compose example working with a local binary. Just wondering if something like this has already been done.

@sverch
Copy link
Contributor Author

sverch commented Jan 6, 2020

Hmm, it looks like to actually disable stats collection I'll either need to change this query or not run stats collection at all. I'm going to keep looking to see if those stats are used in other places, but my first idea is that disabling running the query for per-table stats at all is the correct way to do this.

It still doesn't fix the problem of the end to end tests, but I can figure that out if there isn't an existing integration test tool that already does this.

@sverch
Copy link
Contributor Author

sverch commented Jan 6, 2020

Actually, I'm realizing this is actually more than just an implementation question, but also a desired result question.

  1. We can aggregate all the per table stats into single metrics streams. This means not labelling these streams with the table name. This will prevent the explosion of metrics streams I think, but also not be useful.
  2. We can disable these completely, so that these metrics streams don't show up at all.

I'm leaning towards 2, and that's what I'm working on now, because some of these stats (like "table rows") don't seem very useful without the corresponding table name.

If the user is creating a large number of tables, this could cause a
spike in memory.  This change allows the user to turn off any metrics
that scale with the number of tables.

Signed-off-by: Shaun Verch <[email protected]>
@sverch sverch force-pushed the option-to-disable-per-table-metrics branch from bada152 to f741175 Compare January 6, 2020 21:13
@sverch
Copy link
Contributor Author

sverch commented Jan 6, 2020

  1. We can disable these completely, so that these metrics streams don't show up at all.

Actually, I followed this to the end and I don't think it's a good idea. This is not just touching code related to metrics, it's also loading the table schema, so in practice the code was a bit odd. It loaded everything and just disabled the one line that actually adds them to the table object, not really saving anything.

I think I'll stick with this approach that's here now then.

@deepthi Ultimately where I am now is that I'd like to test the page that's exposed on the /metrics endpoint of a running vttablet. This is sounding more and more like an integration test. Is there something similar to that already?

@deepthi
Copy link
Member

deepthi commented Jan 6, 2020

@deepthi Ultimately where I am now is that I'd like to test the page that's exposed on the /metrics endpoint of a running vttablet. This is sounding more and more like an integration test. Is there something similar to that already?

I don't think such a test exists, you might have to test manually.
The only thing close to it is the web test at https://github.com/vitessio/vitess/blob/master/test/vtctld_web_test.py

You should still create a unit test similar to what we already have.

@sverch
Copy link
Contributor Author

sverch commented Jan 13, 2020

I'm trying to figure out what kind of test to add, but the only thing this PR changes is what is passed to this function: https://github.com/vitessio/vitess/blob/master/go/stats/counters.go#L401

That seems to eventually call into an extension to https://golang.org/pkg/expvar/, which is normally used to export variables on /debug/vars. Sure enough, I can also see these variables if I go to /debug/vars. That library is here: https://github.com/vitessio/vitess/blob/master/go/stats/export.go

Specifically, I think this function is what tells the library to output prometheus compatible output.

There are tests for that exporter library, but those are for functionality of the exporter itself: https://github.com/vitessio/vitess/blob/master/go/stats/export_test.go

Given that this is a wrapper around expvar, I think the code that is actually calling these handlers is in that library. Sure enough, it's called by the http handler.

If this was a function that I could call to get the output it would be obvious how to test this, and I could do something similar to what's in the existing unit test (where the test calls this engine object in various ways and tests the result). In this case I need to actually find some way to call the handler through the normal channels since there's not an obvious function to call.

This looks promising actually https://blog.questionable.services/article/testing-http-handlers-go/. Maybe that can give me what I want actually. I might be able to directly make a request to the /metrics endpoint.

Currently, this is returning something unexpected in my local test.  It
should return prometheus metrics but instead it returns a JSON object.
However this is the outline for how I would test this.

Signed-off-by: Shaun Verch <[email protected]>
@sverch
Copy link
Contributor Author

sverch commented Jan 20, 2020

I pushed a commit with the skeleton of the test, but the result is unexpected.

I would expect the /metrics endpoint to return prometheus metrics, as it does for a running vttablet, but it returns a json object that looks identical to the one returned by /debug/status. I suspect something is not getting properly registered, so I'll look into that.

@sverch
Copy link
Contributor Author

sverch commented Jan 20, 2020

I see the same tests of the prometheus backend but with different results: https://github.com/vitessio/vitess/blob/master/go/stats/prometheusbackend/prometheusbackend_test.go#L344

I'm going to try import that library into the test and see what I get.

@sverch
Copy link
Contributor Author

sverch commented Jan 20, 2020

I tried copying this initialization pattern from the prometheus backend tests. It still didn't return something that looked like prometheus metrics.

I was able to get something that returned prometheus metrics by copying this line but it didn't have the per table metrics. I suspect that prometheus just hasn't had a chance to run and log any metrics for tables, or it might just be that it doesn't do any scraping in the test stub because it's not querying a real database.

@sverch
Copy link
Contributor Author

sverch commented Jan 27, 2020

All right, here's what I've found so far:

  • The unit test for prometheusbackend returns things in the prometheus format, but doesn't include the metrics for the tables, even when I don't disable them
  • If I copy in the relevant code from that test into the engine test, I get the same result (prometheus format, but not new metrics for the tables)
  • If I use the handler from the expvar library, I can see the table metrics, but they have no values (so I think that means they were registered but not scraped). Plus, the format is JSON not prometheus.

In the test, all the right internal structures are updated, so I think there's another thread somehow collecting them and exporting them on this endpoint.

All the other tests sidestep this by calling everything directly. In the prometheus tests for example, the stats are collected in the tests, rather than registered in another function. In our case, we're trying to test that these registered metrics handlers are working properly, so we don't have access to them directly in the test.

I can at least test whether they are registered though, by seeing if they even show up in the JSON object returned by the expvar handler.

To me, this raises another question. Essentially, will this actually reduce memory usage? It seems like for core functionality, we have to actually load the table information anyway, so all I'm changing is whether we export it. If that's the case, I'd expect the scalability issue to be very similar.

@sverch
Copy link
Contributor Author

sverch commented Jan 27, 2020

@deepthi based on my last comment, I have two questions for you:

  • Is loading these tables necessary for core functionality? I'm assuming the vttablet has some cache of tables it knows about that it must pull anyway. This would change whether I expect this to actually reduce memory usage.
  • If I can add a test that just tests whether the metric was registered properly, would that be enough? I suspect testing that stats actually are getting collected in a unit test will be more difficult (see my last comment).

@sverch
Copy link
Contributor Author

sverch commented Jan 27, 2020

Is loading these tables necessary for core functionality? I'm assuming the vttablet has some cache of tables it knows about that it must pull anyway. This would change whether I expect this to actually reduce memory usage.

Actually, it would still be useful, because it will reduce memory usage on prometheus instances that scrape this. Still good to know whether this is the case though.

@sverch
Copy link
Contributor Author

sverch commented Feb 14, 2020

I believe #5809 will resolve the original intent of this in a more structural way.

Although this line in the PR description gives me pause:

The dropping of dimensions is not applicable to all stats vars.
We drop dimensions for counters and timings, but not gauges.

@sougou Does this mean that PR doesn't address this per table issue, because these are mostly gauges?

@sougou
Copy link
Contributor

sougou commented Feb 14, 2020

Sorry, I didn't realize that they were affected by gauges also. Since it doesn't make sense to drop a dimension for gauges, I think the best approach will be to drop those variables altogether. Something like -stats_drop_variables=.... I can amend that PR to support this if you think this will work.

@sougou
Copy link
Contributor

sougou commented Feb 14, 2020

I've updated #5809 to support a new flag -stats_drop_variables.

@sverch
Copy link
Contributor Author

sverch commented Feb 18, 2020

Thanks @sougou ! I looked at your commit and can confirm that it solves the problem that this PR was trying to address. Closing it out now.

@sverch sverch closed this Feb 18, 2020
@sverch sverch deleted the option-to-disable-per-table-metrics branch February 18, 2020 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants