Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server,admission: automatically infer disk names for stores #109350

Closed

Conversation

irfansharif
Copy link
Contributor

Part of #86857.

This commit eliminate the need to provide the disk-name common environments e.g. linux with store on EBS or GCP PD. To make use of AC's disk bandwidth tokens, users still need to specify the provisioned bandwidth, for now. So in a sense this machinery is still "disabled by default". Next steps:

  • automatically measure provisioned bandwidth, using something like github.com/irfansharif/probe, gate behind envvars or cluster settings;
  • add roachtests that make use of these disk bandwidth tokens;
  • roll it out in managed environments;
  • roll it out elsewhere.

Release note: None

@irfansharif irfansharif requested review from a team as code owners August 23, 2023 16:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @irfansharif, and @sumeerbhola)


pkg/server/node.go line 1065 at r1 (raw file):

// TODO(irfansharif): Is this platform specific? (Seems to work on M1s,
// gceworkers, GCP/AWS instances with/without attached storage devices).
func getDeviceIDsToNames() (map[uint64]string, error) {

Can you try instead use our upstream dependency gosigar which has various APIs to do just this (you might need to upgrade to a later version).

Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @knz, and @sumeerbhola)


pkg/server/node.go line 1065 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Can you try instead use our upstream dependency gosigar which has various APIs to do just this (you might need to upgrade to a later version).

I'm looking at their release docs: https://pkg.go.dev/github.com/elastic/gosigar, it doesn't expose the device ID specifically needed to do this mapping. Alternatively, it's not exposing anything that obviates the need for such a device ID given we're trying to go from a store ID (where we can deduce the path to some file within the store) => device name. Any other clues? I'll look at the code to see what platform-neutral bits to cargocult.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @irfansharif, and @sumeerbhola)


pkg/server/node.go line 1065 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm looking at their release docs: https://pkg.go.dev/github.com/elastic/gosigar, it doesn't expose the device ID specifically needed to do this mapping. Alternatively, it's not exposing anything that obviates the need for such a device ID given we're trying to go from a store ID (where we can deduce the path to some file within the store) => device name. Any other clues? I'll look at the code to see what platform-neutral bits to cargocult.

I was thinking you would use https://pkg.go.dev/github.com/cloudfoundry/gosigar#FileSystemList
We have done this in another place before. See pkg/storage/store_properties.go.

In fact I wonder if you should not simply extend this latter source file with what you need, instead of adding logic here.

Generally, pkg/server is not the right place to do OS-specific stuff (or anything relating to the OS really). If there's something you need that gosigar (and store_properties) don't help you with, let's discuss this on #server and get Ben's input.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just looked at the store spec changes since @knz is reviewing the other parts

Reviewed 4 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, and @irfansharif)


pkg/base/store_spec.go line 231 at r1 (raw file):

	}
	if len(spec.DiskName) == 0 {
		return ProvisionedRateSpec{},

needs updating


pkg/base/store_spec.go line 310 at r1 (raw file):

		fmt.Fprint(&buffer, ",")
	}
	if len(ss.ProvisionedRateSpec.DiskName) > 0 {

this needs to change if DiskName is optional, since presumably we can now specify bandwidth and not specify disk-name


pkg/base/store_spec.go line 362 at r1 (raw file):

//   - 0.2             -> 20% of the available space
//   - attrs=xxx:yyy:zzz A colon separated list of optional attributes.
//   - provisioned-rate=disk-name=<disk-name>[:bandwidth=<bandwidth-bytes/s>] The

comment needs updating


pkg/cli/cliflags/flags.go line 1002 at r1 (raw file):

bandwidth saturation, the "provisioned-rate" field can be specified with
the optional "disk-name" and "bandwidth" parameters. It's only when the
"bandwidth" parameter is specified for a given store that admission control

We could enable this solely if kvadmission.store.provisioned_bandwidth is positive, if we can infer the disk name. But if we can't infer the disk name, we don't have a good way to tell the user that even though they may be expecting disk bandwidth enforcement, it isn't happening.

@irfansharif irfansharif force-pushed the 230802.disk-name-mapping branch from e1603e1 to 5a9c520 Compare September 1, 2023 23:42
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @knz, and @sumeerbhola)


pkg/base/store_spec.go line 231 at r1 (raw file):

Previously, sumeerbhola wrote…

needs updating

Removed.


pkg/base/store_spec.go line 310 at r1 (raw file):

Previously, sumeerbhola wrote…

this needs to change if DiskName is optional, since presumably we can now specify bandwidth and not specify disk-name

Done.


pkg/base/store_spec.go line 362 at r1 (raw file):

Previously, sumeerbhola wrote…

comment needs updating

Done.


pkg/cli/cliflags/flags.go line 1002 at r1 (raw file):
Updated the doc text to make this more explicit. I actually forgot the the cluster setting can be used to easily enable this functionality - exciting!

But if we can't infer the disk name, we don't have a good way to tell the user that even though they may be expecting disk bandwidth enforcement, it isn't happening.

I'm not sure what to do about it, I think I'll ignore this for now. Perhaps we can make use of some SQL-hinting when mucking with that specific cluster setting. I added TODO next to that setting.


pkg/server/node.go line 1065 at r1 (raw file):

I was thinking you would use https://pkg.go.dev/github.com/cloudfoundry/gosigar#FileSystemList

These APIs don't seem to pull out the same kinds of device names we're after. DevName from https://pkg.go.dev/github.com/cloudfoundry/gosigar#FileSystem returns blkdev=tmpfs when running using cockroach start-single-node --store $(mktemp -d) --insecure --logtostderr on my GCE worker for example, when what I want is s1 mapped to disk ‹sda1› that I'm getting currently. Looking through the library code, it's populating the device names using /etc/mtab. If I've got the vocabulary right, this is the block device name, but I'm after the physical device name. I don't see the right primitives in gosigar to do what I'm doing in this code block.

I'm not married to having this code here. I agree this being OS-specific makes it better placed outside of pkg/server so I've added a TODO saying as much. If ok with you though, I'd really like to defer doing this. Just in the interest of building on top of this PR for some roachtests I want to check in making use of this very-unused AC mechanism.

@irfansharif irfansharif force-pushed the 230802.disk-name-mapping branch from 5a9c520 to c6ee2b4 Compare September 2, 2023 02:54
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @knz, and @sumeerbhola)


pkg/server/node.go line 1065 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I was thinking you would use https://pkg.go.dev/github.com/cloudfoundry/gosigar#FileSystemList

These APIs don't seem to pull out the same kinds of device names we're after. DevName from https://pkg.go.dev/github.com/cloudfoundry/gosigar#FileSystem returns blkdev=tmpfs when running using cockroach start-single-node --store $(mktemp -d) --insecure --logtostderr on my GCE worker for example, when what I want is s1 mapped to disk ‹sda1› that I'm getting currently. Looking through the library code, it's populating the device names using /etc/mtab. If I've got the vocabulary right, this is the block device name, but I'm after the physical device name. I don't see the right primitives in gosigar to do what I'm doing in this code block.

I'm not married to having this code here. I agree this being OS-specific makes it better placed outside of pkg/server so I've added a TODO saying as much. If ok with you though, I'd really like to defer doing this. Just in the interest of building on top of this PR for some roachtests I want to check in making use of this very-unused AC mechanism.

Looks like the linter fought me. I put some of this logic in pkg/util/sysutil instead - PTAL.

Part of cockroachdb#86857.

This commit eliminate the need to provide the disk-name common
environments e.g. linux with store on EBS or GCP PD. To make use of AC's
disk bandwidth tokens, users still need to specify the provisioned
bandwidth, for now. So in a sense this machinery is still "disabled by
default". They can also do this through the
kvadmission.store.provisioned_bandwidth cluster setting.

Next steps:

- add roachtests that make use of these disk bandwidth tokens;
- automatically measure provisioned bandwidth, using something like
  github.com/irfansharif/probe, gate behind envvars or cluster settings;
- roll it out in managed environments;
- roll it out elsewhere.

Release note: None
@irfansharif irfansharif force-pushed the 230802.disk-name-mapping branch from e719124 to 885abad Compare September 2, 2023 06:52
@knz
Copy link
Contributor

knz commented Sep 2, 2023

Let's see where this conversation goes before this merges.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @bananabrick, @irfansharif, and @knz)


-- commits line 6 at r2:
nit: elminates


pkg/kv/kvserver/kvadmission/kvadmission.go line 106 at r2 (raw file):

// don't have an easy way to communicate to the user that admission control
// enforcement for disk bandwidth is not happening, despite expectations. Maybe
// we surface this through SQL hints somehow?

what are "SQL hints"?


pkg/server/node.go line 1022 at r2 (raw file):

	// TODO(irfansharif): Use something like github.com/irfansharif/probe to
	// measure this empirically on process start, if not explicitly passed in.
	provisionedRate map[roachpb.StoreID]base.ProvisionedRateSpec

The code on master ensured these two maps were consistent. But if there was not a ProvisionedRateSpec it fell back to using clusterProvisionedBandwidth in a second place (in Node.GetPebbleMetrics).
We need a slightly different solution now that the ProvisionedRateSpec may not provide the disk name, and the auto discovery may fail, so we can't ensure the two maps are consistent.

I think we should have an invariant that provisionedRate has an entry for each store, even if the spec is empty. diskNameToStoreID will be a subset, and if something is not in diskNameToStoreID we won't be able to populate the usage (which will be 0).


pkg/server/node.go line 1029 at r2 (raw file):

	}
	stats = make(map[roachpb.StoreID]admission.DiskStats)
	for id, spec := range dsm.provisionedRate {

see my earlier comment on having an entry in provisionedRate for each store -- in that case this code should not need to change.


pkg/server/node.go line 1100 at r2 (raw file):

			// If users specify disk names manually, compare the automatically
			// observed data against what they spelled out.
			dsm.provisionedRate[storeIdent.StoreID] = specs[i].ProvisionedRateSpec

So if len(specs[i].ProvisionedRateSpec.DiskName == 0 we are not even using the ProvisionedRateSpec?

Can we test this by updating TestDiskStatsMap? Same question for the next comment.


pkg/server/node.go line 1105 at r2 (raw file):

			if storeID, found := dsm.diskNameToStoreID[diskName]; found {
				if storeID != storeIdent.StoreID {
					return errors.Newf("mismatched store ID for disk %s, expected %d got %d",

I think we should log a warning and continue, and use whatever was configured, since it is possible that what is being automatically discovered is not quite right, and which is why the operator has provided an override.


pkg/util/sysutil/device_unix.go line 27 at r2 (raw file):

// GetDeviceMap returns a map of device IDs to device names, for all physical
// devices.
func GetDeviceMap() (map[uint64]string, error) {

I see why the map is built once since the uint64 from GetDeviceID doesn't allow us to narrow down this ReadDir and related work.
But how about hiding these details behind a struct:

type DeviceNameMapper struct {
   ...
}

func MakeDeviceNameMapper() DeviceNameMapper {
   // Do the map init
   ...
}

func (mapper DeviceNameMapper) GetDeviceNameForFile(filename string) (name string, ok bool) {
   ...
}

The stat of the filename can be done here by calling os.Stat. We don't really want to do that stat using the vfs.FS of the engine since there is an underlying assumption that we are not dealing with memFS or something like that for that engine.

getDeviceMap and geDeviceIDForFilename would be the OS specific code, and the above struct would be shared.

@irfansharif
Copy link
Contributor Author

Hey @sumeerbhola, @aadityasondhi: I think what's here is pretty salvageable to get disk bandwidth tokens a bit easier to run for users. But I'll close this PR - I'll leave it to y'all to resuscitate if it's still the sort of thing we want to do.

craig bot pushed a commit that referenced this pull request Mar 18, 2024
115375: changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) r=jayshrivastava a=jayshrivastava

### sql: count ranges per partition in PartitionSpans

This change updates span partitioning to count ranges while making
partitions. This allows callers to rebalance partitions based on
range counts without having to iterate over the spans to count
ranges.

Release note: None
Epic: None

### changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) #115375

Previously, the `rebalanceSpanPartitions` would use O(ranges) memory. This change
rewrites it to use range iterators, reducing the memory usage to O(spans).

This change also adds a randomized test to assert that all spans are accounted for after
rebalancing. It also adds one more unit test.

Informs: #113898
Epic: None

### changefeedccl: add rebalancing checks

This change adds extra test coverage for partition rebalancing in
changefeeds. It adds checks which are performed after rebalancing
to assert that the output list of spans covers exactly the same keys
as the input list of spans. These checks are expensive so they only
run if the environment variable `COCKROACH_CHANGEFEED_TESTING_REBALANCING_CHECKS`
is true. This variable is true in cdc roachtests and unit tests.

Release note: None
Epic: None

119885: storage: support per-store IO metrics with fine granularity r=jbowens,abarganier a=CheranMahalingam

Currently, timeseries metrics are collected on a 10s interval which hides momentary spikes in IO. This commit introduces a central disk monitoring system that polls for disk stats at a 100ms interval. Additionally, the current system accumulates disk metrics across all block devices which includes noise from unrelated processes. This commit also adds support for exporting per-store IO metrics (i.e. IO stats for block devices that map to stores used by Cockroach).

These changes will be followed up by a PR to remove the need for customers to specify disk names when setting the provisioned bandwidth for each store as described in #109350.

Fixes: #104114, #112898.
Informs: #89786.

Epic: None.
Release note: None.

120649: changefeedccl: avoid undefined behavior in distribution test r=wenyihu6 a=jayshrivastava

The `rangeDistributionTester` would sometimes calculate log(0) when determining the node to move a range too. Most of the time, this would be some garbage value which gets ignored. Sometimes, this may return a valid node id, causing the range distribution to be wrong and failing the test failures. This change updates the tester to handle this edge case.

Closes: #120470
Release note: None

Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Cheran Mahalingam <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants