Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workload/ycsb: support workload D, E and fix key selection for insertion and reading #37804

Merged
merged 4 commits into from
May 29, 2019

Conversation

jeffrey-xiao
Copy link
Contributor

@jeffrey-xiao jeffrey-xiao commented May 24, 2019

This PR does the following items:

  1. Support workload D (95% reads skewed to newer keys, 5% writes) by adding a latest generator
  2. Support workload E (95% scans, 5% writes) by implementing scans
  3. Fix key selection for insertion by having a separate index for next row to insert rather than using the iMaxHead in the zipfian generator
  4. Fix key selection for reading by having a separate counter for number of rows instead of using the number of initial rows. I didn't fully understand the method proposed in the ycsb paper in section 5.3, so if this isn't the best way, I'm open to changes.
  5. Improve key selection by hashing before modding. This follows the logic proposed in the ycsb paper. Originally we were hashing after applying a modulo, but that always causes the first inserted key to be the hotspot rather than distribute the hotspot around.
  6. Change API for generators to be more consistent. All generators now take in a lower bound and upper bound and generate an inclusive range between the lower and upper bound (I.E. [lower, upper]).

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! Have you tried running them on a roachprod cluster? Consider adding these workloads to the set of ycsb roachtests and then running them and seeing what you see. Both of these are useful and realistic workloads.

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffrey-xiao and @nvanbenschoten)


pkg/workload/ycsb/ycsb.go, line 338 at r1 (raw file):

	updateStmts []*gosql.Stmt

	// The next row index to insert.

Seems like you never want these to be nil pointers when you interact with them below. How about avoiding the allocations above and just taking the address of the field for use with the atomics?


pkg/workload/ycsb/ycsb.go, line 126 at r3 (raw file):

		g.flags.StringVar(&g.workload, `workload`, `B`, `Workload type. Choose from A-F.`)
		g.flags.StringVar(&g.requestDistribution, `request-distribution`, `zipfian`, `Distribution for request key generation [zipfian, uniform, latest].`)
		g.flags.StringVar(&g.scanLengthDistribution, `scan-length-distribution`, `uniform`, `Distribution for scan length generation [zipfian, uniform].`)

nit: can you augment the help text for these 3 flags to mention that this is for workload E.


pkg/workload/ycsb/ycsb.go, line 510 at r3 (raw file):

	defer res.Close()
	for res.Next() {

nit: no need for this blank line. I guess if you feel like it's visually important you can add a comment inside the loop.

Copy link
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/workload/ycsb/ycsb.go, line 338 at r1 (raw file):

Previously, ajwerner wrote…

Seems like you never want these to be nil pointers when you interact with them below. How about avoiding the allocations above and just taking the address of the field for use with the atomics?

The row indexes and counters are shared between workers, so they have to be allocated.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Reviewed 2 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, 2 of 2 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffrey-xiao)


pkg/workload/ycsb/latest_generator.go, line 24 at r2 (raw file):

)

// LatestGenerator is a random number generator that generates draws from a

The official YCSB benchmark seems to use a skewed latest generator for workload D. Why did you decide to go with a uniform distribution here?

EDIT: never mind, looks like you realized this already. Want to call this SkewedLatestGenerator to mirror their implementation?


pkg/workload/ycsb/uniform_generator.go, line 62 at r3 (raw file):

}

// Uint64 returns a random Uint64 between iMin and iMax, drawn from a uniform

How does this ensure that the value is above iMin?


pkg/workload/ycsb/ycsb.go, line 406 at r1 (raw file):

// See YCSB paper section 5.3 for a complete description of how keys are chosen.
func (yw *ycsbWorker) nextReadKey() string {
	rowCount := atomic.LoadUint64(yw.rowCount)

Do we ever modify rowCount? Do we need this atomic load?


pkg/workload/ycsb/ycsb.go, line 144 at r2 (raw file):

	return workload.Hooks{
		Validate: func() error {
			switch g.workload {

Do you think we should be defaulting the request distribution differently depending on the workload?


pkg/workload/ycsb/ycsb.go, line 307 at r3 (raw file):

		return workload.QueryLoad{}, errors.Errorf("Unknown request distribution: %s", g.requestDistribution)
	}
	if err != nil {

Nice catch!

Copy link
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/workload/ycsb/uniform_generator.go, line 62 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this ensure that the value is above iMin?

Oops, forgot to make this change.


pkg/workload/ycsb/ycsb.go, line 406 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we ever modify rowCount? Do we need this atomic load?

Also forgot to increment this when iMax gets incremented.


pkg/workload/ycsb/ycsb.go, line 144 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think we should be defaulting the request distribution differently depending on the workload?

I think that's a reasonable change.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffrey-xiao and @nvanbenschoten)


pkg/workload/ycsb/uniform_generator.go, line 67 at r12 (raw file):

	z.mu.Lock()
	defer z.mu.Unlock()
	return rand.Uint64()%(z.mu.iMax-z.iMin+1) + z.iMin

This should be using z.mu.r.Uint64(), not the global rand.Uint64().

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffrey-xiao and @nvanbenschoten)


pkg/workload/ycsb/uniform_generator.go, line 67 at r12 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This should be using z.mu.r.Uint64(), not the global rand.Uint64().

Also, using mod to get integers within a range is slightly biased. This code should probably be using math/rand.Rand.Int63n. I realize this wasn't changed in this PR, but as long as you're in here...

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffrey-xiao and @nvanbenschoten)


pkg/workload/ycsb/ycsb.go, line 409 at r12 (raw file):

	// distribution and modding the samples. To properly support INSERTS, we need
	// to maintain a separate rownum counter.
	rownum := yw.randGen.IMaxHead()

This was the only call to IMaxHead. I think we can remove that method now, which was a bit of an oddity in the random generators.


pkg/workload/ycsb/ycsb.go, line 438 at r12 (raw file):

func (yw *ycsbWorker) nextReadKey() string {
	rowCount := atomic.LoadUint64(yw.rowCount)
	rowIndex := yw.hashKey(yw.requestGen.Uint64()) % rowCount

buildKeyName also calls hashKey, so this will cause us to call hashKey twice for each read key. Is that expected?


pkg/workload/ycsb/ycsb.go, line 472 at r12 (raw file):

			return err
		}
		atomic.AddUint64(yw.rowCount, 1)

Can you confirm that this is what the official YCSB implementation does? I ask because it is still possible to generate a read key which hasn't been inserted yet due to concurrency in the inserts.

Copy link
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @petermattis)


pkg/workload/ycsb/uniform_generator.go, line 67 at r12 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also, using mod to get integers within a range is slightly biased. This code should probably be using math/rand.Rand.Int63n. I realize this wasn't changed in this PR, but as long as you're in here...

Done. Unfortunate that there is no Uint64n.


pkg/workload/ycsb/ycsb.go, line 409 at r12 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This was the only call to IMaxHead. I think we can remove that method now, which was a bit of an oddity in the random generators.

Done.


pkg/workload/ycsb/ycsb.go, line 438 at r12 (raw file):

Previously, petermattis (Peter Mattis) wrote…

buildKeyName also calls hashKey, so this will cause us to call hashKey twice for each read key. Is that expected?

The issue with the previous implementation is that it gets hashed after the mod which always causes the hotspot to be the first key. The yscb paper suggests to hash then mod. Hashing twice for each read key is also consistent with the official ycsb implementation. When the request distribution is zipfian, the official yscb implementation uses a ScrambledZipfianGenerator which hashes the output of a ZipfianGenerator. Then they hash the generated key num to obtain the key name.


pkg/workload/ycsb/ycsb.go, line 472 at r12 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you confirm that this is what the official YCSB implementation does? I ask because it is still possible to generate a read key which hasn't been inserted yet due to concurrency in the inserts.

Thanks for the catch. The official implementation uses an AcknowledgedCounterGenerator which I've now added.

@jeffrey-xiao jeffrey-xiao force-pushed the ycsb-workload branch 2 times, most recently from af20f54 to eb5e742 Compare May 28, 2019 17:27
Previously, the keys for insertion were chosen based on the iMaxHead of
the zipfian distribution generator. As suggested in the ycsb paper, the
zipfian distribution generator was later changed to use a much larger
key space, but the key selection for insertion was never changed. This
commit adds an index for the next row to be inserted and a row counter
that keeps track of the largest contiguous sequence of keys able to be
read. Since IMaxHead is no longer used after these changes, it was
removed from the randGenerator interface and all implementors.

Additionally, we perform the convertion from random number to key in a
different manner than the yscb paper. The yscb paper performs a hash
then mods the number, while we mod the number then hash it. This causes
the hotspot of keys when using the zipfian to be record 0 (I.E. the
oldest record). Another hash was added to ensure that the hotspot of
keys changes. Hashing twice is also consistent with the official yscb
implementation.

Finally, there was no logic to handle reading keys that were inserted
since the key selection for reading only picked from the initial rows.
Instead of modding by the initial rows, we now mod by the total number
of rows recorded by the row counter.

Release note: None
The skewed latest generator is used for workload D and skews to the most
recently inserted keys. Additionally, changed the API for zipfGenerator
to generate an inclusive bound in [0, defaultIMax - 1], instead of [1,
defaultIMax].

Release note: None
As described in the ycsb paper, the key should be chosen by a zipfian
distribution and the number of keys to scan should be chosen by a
uniform distribution.

Add a command-line flag for choosing the scan length distribution and
also changed the API for the uniform generate to accept a lower bound
and upper bound. This makes it more consistent with the zipfian
generator and the latest generator and is needed for generating a scan
length.

Release note: None
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffrey-xiao and @nvanbenschoten)


pkg/workload/ycsb/acknowledged_counter.go, line 45 at r16 (raw file):

	c := &AcknowledgedCounter{}
	c.mu.count = initialCount
	c.mu.window = make([]bool, windowSize)

I believe this will use windowSize bytes. You could reduce that by a factor of 8 by using a bitmap. Not sure if that is worth it as we're only talking about 1MB of memory and the current code is clear and concise.


pkg/workload/ycsb/uniform_generator.go, line 67 at r12 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

Done. Unfortunate that there is no Uint64n.

Someday (not today) we may upgrade to golang.org/x/exp/rand which does have Uint64n.

@jeffrey-xiao
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request May 29, 2019
37804: workload/ycsb: support workload D, E and fix key selection for insertion and reading r=jeffrey-xiao a=jeffrey-xiao

This PR does the following items:

1. Support workload D (95% reads skewed to newer keys, 5% writes) by adding a latest generator
2. Support workload E (95% scans, 5% writes) by implementing scans
3. Fix key selection for insertion by having a separate index for next row to insert rather than using the `iMaxHead` in the zipfian generator
4. Fix key selection for reading by having a separate counter for number of rows instead of using the number of initial rows. I didn't fully understand the method proposed in the ycsb paper in section 5.3, so if this isn't the best way, I'm open to changes.
5. Improve key selection by hashing before modding. This follows the logic proposed in the ycsb paper. Originally we were hashing after applying a modulo, but that always causes the first inserted key to be the hotspot rather than distribute the hotspot around.
6. Change API for generators to be more consistent. All generators now take in a lower bound and upper bound and generate an inclusive range between the lower and upper bound (I.E. [lower, upper]).

Co-authored-by: Jeffrey Xiao <[email protected]>
@craig
Copy link
Contributor

craig bot commented May 29, 2019

Build succeeded

@craig craig bot merged commit e5e1116 into cockroachdb:master May 29, 2019
@jeffrey-xiao jeffrey-xiao deleted the ycsb-workload branch May 29, 2019 15:57
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 30, 2022
This change fixes the handling of the --request-distribution flag so
that any user overrides are respected. Before, the default request
distribution for a given workload was always used, regardless of the
flag.

This bug dates back to cockroachdb#37804. I fear this has caused a good deal of
confusion in the past, as we often turn to YCSB-A uniform as an example
of an uncontended read/update workload.

Release justification: workload only.
craig bot pushed a commit that referenced this pull request Aug 30, 2022
85920: backupccl: elide data from offline tables during RESTORE r=adityamaru a=msbutler

A previous PR (#83915) enabled BACKUP to target offline descriptors
(including IMPORTing and RESTOREing tables), but didn't fully handle their
rollback on RESTORE. This PR ensures RESTORE properly handles these offline
tables:
- An offline table created by RESTORE or IMPORT PGDUMP is fully discarded. The
  table does not exist in the restoring cluster.
- An offline table created by an IMPORT INTO has all importing data elided in
  the restore processor and is restored online to its pre import state.

If the RESTORE occurs AOST after the table returns online, the table is
RESTOREd to its post offline state.

Currently, all tables that return online are fully re-backed up. For many
situations above, this is unecessary. Future work should ensure that these
tables are not re-backed up when we know all operations within the
incremental backup interval on the table were MVCC.

This patch only applies to BACKUP and RESTORE flavors that implicitly target an
offline importing/restoring table. A future PR should allow a user to
explicitly target an existing offline table undergoing an IMPORT.

Release justification: fixes bug where offline tables were not properly handled
in restore.

86826: workload/ycsb: fix --request-distribution flag r=nvanbenschoten a=nvanbenschoten

This change fixes the handling of the `--request-distribution` flag so
that any user overrides are respected. Before, the default request
distribution for a given workload was always used, regardless of the
flag.

This bug dates back to #37804. I fear this has caused a good deal of
confusion in the past, as we often turn to YCSB-A uniform as an example
of an uncontended read/update workload.

Release justification: workload only.

87047: pgwire: fix txn options for prepared BEGIN  r=ZhouXing19 a=rafiss

fixes #87012

Release note (bug fix): The statement tag for SHOW command
results in the pgwire protocol no longer contain the number of returned
rows.

Release note (bug fix): Fixed a bug where the options given to the BEGIN
TRANSACTION command would be ignored if the BEGIN was a prepared
statement.

Release justification: low risk bug fix

87113: ci: run `acceptance` test in `ci` config r=jlinder a=rickystewart

We need this to get detailed info in the `test.xml`.

Release justification: Non-production code changes
Release note: None

87114: ci: fix nightlies that are invoking `bazci` incorrectly r=rail a=rickystewart

Some of these nightlies were broken as a result of
`c8d66a9c334dd87acd62190f314a4f3ec236ce82`.

Release justification: Non-production code changes
Release note: None

87116: bazci: delete unused `testdata` directory r=celiala a=rickystewart

Release justification: Non-production code changes
Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Aug 30, 2022
This change fixes the handling of the --request-distribution flag so
that any user overrides are respected. Before, the default request
distribution for a given workload was always used, regardless of the
flag.

This bug dates back to #37804. I fear this has caused a good deal of
confusion in the past, as we often turn to YCSB-A uniform as an example
of an uncontended read/update workload.

Release justification: workload only.
blathers-crl bot pushed a commit that referenced this pull request Aug 30, 2022
This change fixes the handling of the --request-distribution flag so
that any user overrides are respected. Before, the default request
distribution for a given workload was always used, regardless of the
flag.

This bug dates back to #37804. I fear this has caused a good deal of
confusion in the past, as we often turn to YCSB-A uniform as an example
of an uncontended read/update workload.

Release justification: workload only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants