
etcdctl/check: create new check command for memory usage #9185

Merged — 1 commit into etcd-io:master, Feb 20, 2018

Conversation

@spzala (Member) commented Jan 22, 2018

Create a new command, similar to check perf, that can check the memory
consumption of putting different workloads. Return a pass or fail message
to the user indicating whether there is enough memory for a given workload.

Fixed #9121


@spzala (Member Author) commented Jan 22, 2018

@xiang90 @gyuho hello, here is a first pass of the changes. Please take a look at the approach. Hoping I'm in the right direction. Thanks!

totalAllocAfter := statsAfter.Alloc
diffAlloc := totalAllocAfter - totalAllocBefore
// rounded percent of memory consumed
percentAlloc := (diffAlloc * 100) / totalOSMemory
@spzala (Member Author), Jan 22, 2018:

Does this make sense? A couple of example outputs:
# ./bin/etcdctl check datascale
PASS: Memory usage is optimal. Percent memory used is : 26
PASS
# ./bin/etcdctl check datascale --load="m"
PASS: Memory usage is optimal. Percent memory used is : 38
PASS
I couldn't test "l" yet; I need to increase the quota and then test. For now it times out:
# ./bin/etcdctl check datascale --load="l"
Error: etcdserver: request timed out


Contributor:

The output looks good to me. Probably "26% of memory is used, expected less than 60%" is easier to read than the current format.

Contributor:

@spzala can you try to check why "l" times out in your environment?

Member Author (@spzala):

Yup, I think it's because of my default storage size. "l" takes more than 2 GB. I will be increasing it with --quota-backend-bytes.

@codecov-io commented Jan 22, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@b03fd4c).
The diff coverage is 9.17%.


@@            Coverage Diff            @@
##             master    #9185   +/-   ##
=========================================
  Coverage          ?   75.34%           
=========================================
  Files             ?      365           
  Lines             ?    30827           
  Branches          ?        0           
=========================================
  Hits              ?    23227           
  Misses            ?     5975           
  Partials          ?     1625
Impacted Files Coverage Δ
etcdctl/ctlv3/command/util.go 21.66% <0%> (ø)
etcdctl/ctlv3/command/check.go 12.74% <12.19%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ExitWithError(ExitBadFeature, fmt.Errorf("unknown load option %v", checkDatascaleLoad))
}
cfg := checkDatascaleCfgMap[model]

Contributor:

Probably we should print out what we're about to test here.

Example output:

start to test small data scale [10000 key-value pairs, 1kb per key-value, 50 concurrent clients]
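A minimal sketch of such a printout, assuming the config struct carries limit, kvSize, and clients fields (kvSize is a hypothetical field name; the real names in the PR may differ):

fmt.Printf("Start data scale check for work load [%d key-value pairs, %d bytes per key-value, %d concurrent clients].\n",
    cfg.limit, cfg.kvSize, cfg.clients)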

}

if percentAlloc > 60 { // leaves less than 40 percent of memory
fmt.Println("FAIL: Memory usage is too high. Percent memory used is :", percentAlloc)
Contributor:

Probably print out the expected value (60%).

Also, how about printing the absolute memory usage in MB besides the percentage?
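A sketch of what the expanded messages could look like; percentAlloc comes from the snippet above, while mbUsed and mbTotal are hypothetical float64 values derived from the same runtime stats:

fmt.Printf("FAIL: Memory usage is %v%% of available. Expected less than %v%%.\n", int(percentAlloc), 60)
fmt.Printf("INFO: Approximate total process memory used : %vmb out of %vmb.\n", int(mbUsed), int(mbTotal))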

Member Author (@spzala):

@xiang90 sure, thanks much! All the comments look good to me; I agree that it's better to provide more info than less.

@xiang90 (Contributor) commented Jan 22, 2018

Looks great in general.

@spzala (Member Author) commented Jan 22, 2018

@xiang90 Thanks much!! I will work on your suggestions and update with a new commit soon.

totalAllocAfter := statsAfter.Alloc
diffAlloc := totalAllocAfter - totalAllocBefore
// rounded percent of memory consumed
percentAlloc := float64((diffAlloc * 100) / totalOSMemory)
Member Author (@spzala):

@xiang90 here are a couple of sample outputs with the latest changes:
# ./bin/etcdctl check datascale
Start data scale check for work load [10000 key-value pairs, 1024kb per key-value, 50 concurrent clients].
PASS: Memory usage is 25% of available. Expected less than 60%.
INFO: Approx total process memory used : 3mb out of 12mb.
PASS
# ./bin/etcdctl check datascale --load="m"
Start data scale check for work load [100000 key-value pairs, 1024kb per key-value, 200 concurrent clients].
PASS: Memory usage is 22% of available. Expected less than 60%.
INFO: Approx total process memory used : 6mb out of 29mb.
PASS
I couldn't test the "large" workload yet due to my VM's limited memory, but I believe that's OK for this PR. However, I am in the middle of getting bigger machines to set up a 3-node cluster, and I will be testing it there. Thanks!

Contributor:

10000 key-value pairs, 1024kb per key-value

Should this be 1024 bytes instead of kb?

INFO: Approx total process memory used : 3mb out of 12mb.

10,000 * 1KB = 10MB of data. It looks strange that etcd only consumed 3mb in this case. Probably there is some pre-allocation happening and we calculate the delta after the pre-allocation. We should probably just report the total usage instead of the delta.

Member Author (@spzala):

Yup :), good catch. It's Bytes not kb! OK, agree, let me take a closer look there. Thanks @xiang90 !

Member Author (@spzala):

@xiang90 you're right. I think using the delta as I was doesn't make much sense, because the total available Sys also changes after the put ops. I am going to use the total stats.Alloc and stats.Sys after the put operations in the next commit; let's see if that makes sense. Thanks!

Member Author (@spzala):

@xiang90 the newer sample output as of commit 3 looks like below:
# ./bin/etcdctl check datascale
Start data scale check for work load [10000 key-value pairs, 1024 bytes per key-value, 50 concurrent clients].
PASS: Memory usage is 47% of available. Expected less than 60%.
INFO: Approximate total process memory used : 8mb out of 18mb.
PASS
# ./bin/etcdctl check datascale --load="m"
Start data scale check for work load [100000 key-value pairs, 1024 bytes per key-value, 200 concurrent clients].
PASS: Memory usage is 44% of available. Expected less than 60%.
INFO: Approximate total process memory used : 24mb out of 55mb.
PASS

clients: 500,
},
"xl": {
limit: 30000000,
Member (@fanminshi):

Why 30000000 and not 10000000? I like the idea of increasing the limit by an order each time from s -> m -> l -> xl.

@spzala (Member Author), Jan 23, 2018:

Thanks @fanminshi, that's like xxl :-) :-) I think we decided on 3000000 considering this is a test, and as we go higher in number it can take more time and resources. @xiang90 any thoughts?

Member (@fanminshi):

@spzala I saw that xl = 3 * 10^7 and l = 1 * 10^6. So I am curious why xl isn't 1 * 10^7, which is a one-order increase from l.

Contributor:

xl tries to hit the upper bound aggressively, which is 3 versions of 1M objects (3M in total).

Member Author (@spzala):

@fanminshi ahhh..sorry I read one more zero in your suggestion and so was kidding with xxl :) Thanks @xiang90 for the explanation.

Member (@fanminshi):

@xiang90 makes sense. Maybe add a comment on xl to explain the 3 versions of 1M objects?

Member Author (@spzala):

@fanminshi sure, sounds good. Thanks!
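A sketch of the commented map, using the limits and client counts visible in the diff and sample outputs; the struct name checkDatascaleCfg and the xl client count are assumptions:

type checkDatascaleCfg struct {
    limit   int
    clients int
}

var checkDatascaleCfgMap = map[string]checkDatascaleCfg{
    "s": {limit: 10000, clients: 50},
    "m": {limit: 100000, clients: 200},
    "l": {limit: 1000000, clients: 500},
    // xl hits the upper bound aggressively: 3 versions of 1M objects,
    // per the discussion above, rather than a plain 10x step from l.
    "xl": {limit: 30000000, clients: 1000},
}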

} else {
fmt.Println(fmt.Sprintf("PASS: Memory usage is %v%% of available. Expected less than %v%%.", int(percentAlloc), 60))
}
fmt.Println(fmt.Sprintf("INFO: Approx total process memory used : %vmb out of %vmb.", int(mbUsed), int(mbTotal)))
Contributor:

Print this out before FAIL/PASS? Also, change Approx to Approximate to make the output look more formal?

Member Author (@spzala):

@xiang90 agree on using Approximate. On printing it, I thought it's a useful message regardless of FAIL/PASS, and having FAIL/PASS at the very end is more consistent with check perf. So I would keep it as is if that makes sense, or I can move it to the end? Thanks!

Contributor:

Ah, I see it already prints before PASS/FAIL. I was confused since I also see FAIL/PASS at lines 364/367. Probably we can remove the PASS: and FAILED: prefixes there.

Member Author (@spzala):

@xiang90 missed this comment in the latest commit but this sounds good and I am fixing it. Thanks!

@xiang90 (Contributor) commented Jan 24, 2018

INFO: Approximate total process memory used : 8mb out of 18mb.

I looked at the output again, and I believe this is not right. The 18mb is not the total memory of the box, but the total memory Go acquired from the box. The 8mb is the allocated (in-use) portion of that 18mb.

The right way to measure this is (memory acquired from the system) / (total memory of the system), which here is (18MB / total memory of the system).

Probably we need to find a way in Go to get the system memory. We can start with just Linux.
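For reference, a minimal sketch of where those two numbers come from; Alloc is the live heap in use by the Go process and Sys is what the Go runtime has obtained from the OS, so neither reflects the host's total memory:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    // Alloc: bytes of heap currently allocated (in use) by this process.
    // Sys: total bytes obtained from the operating system by the Go runtime.
    fmt.Printf("alloc=%dmb sys=%dmb\n", stats.Alloc/(1<<20), stats.Sys/(1<<20))
}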

@xiang90 (Contributor) commented Jan 24, 2018

/cc @spzala

@spzala (Member Author) commented Jan 24, 2018

@xiang90 thanks, and yes, you are right. runtime.ReadMemStats provides the memory usage of the Go process. I wondered about the approach myself too, but I couldn't find any Go-specific functions to get system memory. I also ran watch -n 2 free -m in a separate terminal to see how system memory changed. As for why I chose runtime.ReadMemStats: I read "...memory usage consumption via the Go runtime.ReadMemStats" at https://coreos.com/etcd/docs/latest/benchmarks/etcd-storage-memory-benchmark.html and so thought I should rely on ReadMemStats too. For calculating the usage of system memory, I couldn't find a Go function, so I guess we need to use Linux commands for this purpose, which I think is what you're suggesting. Thanks!

@spzala spzala added the WIP label Jan 24, 2018
@xiang90 (Contributor) commented Jan 24, 2018

@spzala check out https://github.com/pbnjay/memory, or try /proc/meminfo on Linux?
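A minimal Linux-only sketch of the /proc/meminfo route (the pbnjay/memory package wraps the same idea), assuming the usual "MemTotal: <n> kB" line format:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// totalSystemMemoryBytes reads MemTotal from /proc/meminfo.
func totalSystemMemoryBytes() (uint64, error) {
    f, err := os.Open("/proc/meminfo")
    if err != nil {
        return 0, err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        // Example line: "MemTotal:       16384000 kB"
        fields := strings.Fields(sc.Text())
        if len(fields) >= 2 && fields[0] == "MemTotal:" {
            kb, err := strconv.ParseUint(fields[1], 10, 64)
            if err != nil {
                return 0, err
            }
            return kb * 1024, nil
        }
    }
    return 0, fmt.Errorf("MemTotal not found in /proc/meminfo")
}

func main() {
    total, err := totalSystemMemoryBytes()
    if err != nil {
        panic(err)
    }
    fmt.Printf("total system memory: %d MB\n", total/(1<<20))
}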

@spzala (Member Author) commented Jan 24, 2018

@xiang90 cool, thanks!!

@spzala (Member Author) commented Jan 24, 2018

@xiang90 it seems straightforward with syscall. I got busy with some kube work but will update the changes soon. Thanks again!

@xiang90 (Contributor) commented Jan 24, 2018

@spzala we probably want this command to work at least on both Linux and OS X. Just a note if you decide to use syscall.

@spzala (Member Author) commented Jan 25, 2018

@xiang90 OK, sure. Thanks for the heads up!

@spzala (Member Author) commented Jan 25, 2018

@xiang90 using system memory as the criterion, I think it looks really good. I could only make it work on Linux so far, but here is sample output as an update for you to take a quick look at. Finding free memory on OS X seems non-obvious, but I'm going to work on it.
# ./bin/etcdctl check datascale
Start data scale check for work load [10000 key-value pairs, 1024 bytes per key-value, 50 concurrent clients].
INFO: Approximate total system memory used : 182 MB out of 2 GB.
PASS: Memory usage is 7% of available. Expected less than 60%.

# ./bin/etcdctl check datascale --load="m"
Start data scale check for work load [100000 key-value pairs, 1024 bytes per key-value, 200 concurrent clients].
INFO: Approximate total system memory used : 1548 MB out of 2 GB.
FAIL: Memory usage is 74% of available. Expected less than 60%.

@xiang90 (Contributor) commented Jan 26, 2018

Based on our dbtest, 1M 1kb keys would consume about 2.1GB of memory.

However, it seems that in your result 0.1M 1kb keys consumed 1.5GB of memory, which does not seem right. I do notice that in your setup the kv size is actually 2kb (key=1kb, value=1kb) rather than 1kb in total (we should keep it 1kb though); however, that still does not explain the huge difference between the results.

/cc @gyuho can you also take a look at this PR?

@spzala (Member Author) commented Jan 26, 2018

@xiang90 Thanks, and it's interesting. I ran 'watch' on memory changes in a separate terminal, and the usage looked pretty accurate when I ran the command. But yes, you are right, I have 1kb for each key and value; I thought that's what we wanted to do. I will change it to 1kb in total.

@spzala (Member Author) commented Jan 26, 2018

@xiang90 After changing the kv size to 1kb in total, I see a good difference, but it still doesn't quite match what you mentioned. I will push a new commit with the latest code for a better look. Meanwhile, here is what I see:
# ./bin/etcdctl check datascale --load="m"
Start data scale check for work load [100000 key-value pairs, 1024 bytes per key-value, 200 concurrent clients].
INFO: Approximate total system memory used : 545 MB out of 2 GB.
PASS: Memory usage is 25% of available. Expected less than 60%.
Also, I had watch -n 2 free -m running, which showed a similar ~530MB of memory usage.

@gyuho (Contributor) commented Jan 26, 2018

Could you run with 1M keys? Then we can compare with https://github.com/coreos/dbtester/tree/master/test-results/2018Q1-01-etcd#write-1m-keys-256-byte-key-1kb-value-best-throughput-etcd-1k-clients-with-100-conns, which fetches memory usage directly from top command in linux.

@spzala (Member Author) commented Jan 26, 2018

@gyuho thanks and sure! I tried just now with all possible free memory on my ~8gb VM, but couldn't run the big workload as it times out. I have a new machine but need to set up etcd on it, so I will get back on this soon.

@gyuho (Contributor) commented Jan 26, 2018

@spzala Also, https://github.com/coreos/etcd/tree/master/etcdctl#check-subcommand needs an update.

Are we measuring the memory usage of etcd + client? Then document that the command process needs to be collocated with etcd in the same address space.

wg.Add(len(clients))

// get the available system memory before the put operations
systemInfoBefore := &syscall.Sysinfo_t{}
Contributor:

OK, I think I see the problem here:

Right now, we are measuring the memory usage of the machine that etcdctl runs on. What we want to measure is the memory usage of the etcd server rather than etcdctl.

Probably we should query the etcd metrics endpoint to get the actual etcd server-side memory usage instead of etcdctl's. Or we need to tell the user of this command to run etcdctl on the same machine as the etcd server, and only measure the memory usage addition caused by the etcd server.

/cc @gyuho

Contributor:

Yes, that would be better.

@spzala the etcd server metrics expose process_resident_memory_bytes. Can you try this out?

Contributor:

We don't have any documentation on this metric (https://github.com/coreos/etcd/blob/master/Documentation/metrics.md), but it is what we recommend for monitoring etcd via Prometheus. See the example here: https://github.com/coreos/etcd/blob/1d99d3886f6cb5fb7ef13100c9587cc01820d38e/Documentation/op-guide/grafana.json#L448-L453

Member Author (@spzala):

@xiang90 @gyuho yes, currently I am running the server and client on the same machine. @gyuho so to get the server-side memory usage, are you suggesting using Prometheus and then looking for process_resident_memory_bytes in the metrics? Sorry for the delay, I was out on sick leave recovering from the flu. Thanks!

Member Author (@spzala):

I couldn't find any good reference on process_resident_memory_bytes in a quick search, but at least one place mentions that "process_resident_memory_bytes is the amount of memory the Prometheus process is using from the kernel."

Member Author (@spzala):

@gyuho a gentle reminder about my question above :-) I didn't understand how I can test using process_resident_memory_bytes. Are you suggesting using Prometheus and then looking for process_resident_memory_bytes in the metrics? Thanks!

Contributor:

We can query [server-host]:2379/metrics HTTP endpoint.

Member Author (@spzala):

Thanks much @gyuho, I have just pushed a new commit. For now, I have hard-coded 127.0.0.1 as the host, but if the overall approach looks good, I need some input on deciding the endpoint: should we ask the user to provide one, or is there a better way to decide it ourselves? A couple of other small things also need to be done, as I have mentioned in review comments. For 1M kv pairs, I got output that seems in line with the ~2.x GB consumption @xiang90 mentioned earlier. Below are a couple of example outputs:

# ./bin/etcdctl check datascale --load="l"
Start data scale check for work load [1000000 key-value pairs, 1024 bytes per key-value, 500 concurrent clients].
INFO: Approximate total system memory used : 2024.46875 MB out of 12 GB.
PASS: Memory usage is 15.565745892195743% of available. Expected less than 60%.
# ./bin/etcdctl check datascale --load="l"
Start data scale check for work load [1000000 key-value pairs, 1024 bytes per key-value, 500 concurrent clients].
INFO: Approximate total system memory used : 2485.078125 MB out of 13 GB.
PASS: Memory usage is 18.482729928669876% of available. Expected less than 60%.

Thanks!

Contributor:

I need some idea on deciding endpoint - should we ask user to provide endpoint?

You can get endpoints from etcdctl --endpoints flag.

Member Author (@spzala):

@gyuho thanks, true, but that's only if the user provides it, right? So that means we will use the first endpoint if --endpoints is provided with etcdctl check datascale, and if the endpoints flag is not provided then we use 127.0.0.1?


// get the available system memory after the put operations
systemInfoAfter := &syscall.Sysinfo_t{}
sysErrAfter := syscall.Sysinfo(systemInfoAfter)
Member Author (@spzala):

Seems like we aren't going to use this approach, but I noticed that this should have been called before the Delete ops above.

os.Exit(ExitError)
}
}

// get the process_resident_memory_bytes from <server:2379>/metrics
func processMemoryResidentBytes(host string) float64 {
Member Author (@spzala):

If this func looks good, I guess we should have it as a util function so it can be reused?

Contributor:

Yeah we can put it in util.go.
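A minimal sketch of what that util.go helper could look like, assuming a plain-HTTP endpoint and the Prometheus text-format line "process_resident_memory_bytes <value>"; the actual function in the PR returns a bare float64 and its error handling may differ:

package command

import (
    "bufio"
    "fmt"
    "net/http"
    "strconv"
    "strings"
)

// processMemoryResidentBytes queries <endpoint>/metrics and returns the etcd
// server's resident memory in bytes, as reported by its Prometheus instrumentation.
func processMemoryResidentBytes(endpoint string) (float64, error) {
    resp, err := http.Get(endpoint + "/metrics")
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := sc.Text()
        // HELP/TYPE lines start with '#', so this prefix only matches the sample line.
        if strings.HasPrefix(line, "process_resident_memory_bytes ") {
            return strconv.ParseFloat(strings.TrimSpace(strings.TrimPrefix(line, "process_resident_memory_bytes")), 64)
        }
    }
    return 0, fmt.Errorf("process_resident_memory_bytes not found at %s/metrics", endpoint)
}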

Member Author (@spzala):

Cool. Thanks @gyuho


// get the available system memory before the put operations
systemInfoBefore := &syscall.Sysinfo_t{}
sysErrBefore := syscall.Sysinfo(systemInfoBefore)
Member Author (@spzala):

syscall is breaking the build because it works on Linux only. I know we need to run this code on Linux and OS X, so I will work on it and push a new commit. Thanks!

Contributor:

We can define this with build tags, and let other OS just return dummy values.
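A sketch of that split, assuming two hypothetical file names and a dummy zero value on non-Linux platforms:

// sysinfo_linux.go
// +build linux

package command

import "syscall"

// totalRAMBytes returns the host's total RAM via sysinfo(2).
func totalRAMBytes() (uint64, error) {
    var si syscall.Sysinfo_t
    if err := syscall.Sysinfo(&si); err != nil {
        return 0, err
    }
    // Totalram is expressed in units of si.Unit bytes.
    return uint64(si.Totalram) * uint64(si.Unit), nil
}

// sysinfo_other.go
// +build !linux

package command

// totalRAMBytes returns a dummy value where sysinfo(2) is unavailable.
func totalRAMBytes() (uint64, error) { return 0, nil }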

Contributor:

Why do we still need this, given that we get the server process memory usage from the metrics endpoint?

Member Author (@spzala):

@xiang90 that's a good question. So far I have used the endpoint to get the memory in use (process_resident_memory_bytes), but we also need the total available memory to compute the percentage consumption. I am checking whether the endpoint metrics also return that value, and if so, you are right that we don't need syscall. Thanks!

Member Author (@spzala):

@xiang90 @gyuho so for the purpose of comparing the difference in process_resident_memory_bytes before and after the kv inserts to the actual available system memory, can I rely on process_virtual_memory_bytes? (The process_virtual_memory_bytes value isn't the same as what I get with syscall.Sysinfo on my VM, but I am not sure whether we use process_virtual_memory_bytes for monitoring in any way, and unfortunately I couldn't find much info on it.) Thanks!

Member Author (@spzala):

Also, this might be a dumb question, but when I get the endpoint metrics on a Mac, why don't I see similar output to Linux? i.e., on a Mac I don't get process_virtual_memory_bytes. Is that normal behavior or am I missing something? Thanks!

@spzala (Member Author) commented Feb 14, 2018

@xiang90 @gyuho I am now using the endpoint metrics for everything, and the command uses the user-provided endpoint if given, or localhost otherwise. Please take a look. Here are updated output examples:

# ./bin/etcdctl check datascale
Start data scale check for work load [10000 key-value pairs, 1024 bytes per key-value, 50 concurrent clients].
INFO: Approximate total system memory used : 33.17 MB out of 10.08 GB.
PASS: Memory usage is 0.32% of available. Expected less than 60%.
# ./bin/etcdctl check datascale --endpoints=http://127.0.0.1:2379
Start data scale check for work load [10000 key-value pairs, 1024 bytes per key-value, 50 concurrent clients].
INFO: Approximate total system memory used : 19.81 MB out of 10.34 GB.
PASS: Memory usage is 0.19% of available. Expected less than 60%.
# ./bin/etcdctl check datascale --load="m"
Start data scale check for work load [100000 key-value pairs, 1024 bytes per key-value, 200 concurrent clients].
INFO: Approximate total system memory used : 297.72 MB out of 10.13 GB.
PASS: Memory usage is 2.87% of available. Expected less than 60%.
# ./bin/etcdctl check datascale --load="l"
Start data scale check for work load [1000000 key-value pairs, 1024 bytes per key-value, 500 concurrent clients].
INFO: Approximate total system memory used : 1655.70 MB out of 10.10 GB.
PASS: Memory usage is 16.01% of available. Expected less than 60%.


// delete the created kv pairs
ctx, cancel = context.WithCancel(context.Background())
_, err = clients[0].Delete(ctx, checkDatascalePrefix, v3.WithPrefix())
Member Author (@spzala):

The only thing that I could not fully test was the large workload, i.e. with --load="l". In that case, I had to comment out the Delete, because the deletion eventually ran into context deadline exceeded. But I hope my testing is still OK, as after commenting out LOC 259 to 364 I could run the command fine, as shown in the example in my other comment. I guess it's my VM, because after running --load="l" I have to compact and defrag to continue using etcd. Thanks!

if strings.HasPrefix(line, `process_resident_memory_bytes`) {
residentMemory = strings.TrimSpace(strings.TrimPrefix(line, `process_resident_memory_bytes`))
}
if strings.HasPrefix(line, `process_virtual_memory_bytes`) {
Contributor:

Virtual memory is not the total available memory on the host. I guess we need to modify the etcd server to make it report the total available memory on the host.

How about just keeping the percentage calculation as a TODO, and simply reporting the memory usage and outputting pass always?
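A sketch of that simplified report, matching the sample output later in the thread; mbUsed here stands for the resident-memory usage in MB (hypothetical variable name):

// TODO: compare against the host's total memory once the server can report it.
fmt.Printf("PASS: Approximate system memory used : %.2f MB.\n", mbUsed)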

Member Author (@spzala):

@xiang90 sure, thanks. That sounds like a great approach!! (I have a new Go func for Darwin to find available memory using vm_stat (hw.physmem is not reliable at all) and one for Linux using syscall; not sure if we can use those for our TODO or on the server side, but as you suggested, let's take that up separately for a concise and clear discussion of its own.) Thanks!

@spzala (Member Author) commented Feb 15, 2018

@xiang90 I have pushed a new commit and modified the help text a little to reflect our approach. If it looks good, I will create a separate doc change to update the etcdctl README. Hope that's OK. Thanks!

@spzala (Member Author) commented Feb 15, 2018

Sample outputs:

# ./bin/etcdctl check datascale
Start data scale check for work load [10000 key-value pairs, 1024 bytes per key-value, 50 concurrent clients].
PASS: Approximate system memory used : 9.49 MB.
# ./bin/etcdctl check datascale --endpoints=http://127.0.0.1:2379 --load="m"
Start data scale check for work load [100000 key-value pairs, 1024 bytes per key-value, 200 concurrent clients].
PASS: Approximate system memory used : 181.89 MB.

@xiang90 (Contributor) commented Feb 15, 2018

We might want to add a progress bar to track the testing progress. LGTM in general; deferring to @gyuho for a final code review.

fmt.Printf("FAIL: ERROR(%v) -> %d\n", k, v)
}
ok = false
}
Contributor:

Move if ok { ... } here? ok is basically same as len(s.ErrorDist) == 0
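A rough sketch of the restructuring gyuho suggests, assuming s is the put report whose ErrorDist maps error strings to counts (as in check perf):

ok := len(s.ErrorDist) == 0
for k, v := range s.ErrorDist {
    fmt.Printf("FAIL: ERROR(%v) -> %d\n", k, v)
}
if ok {
    // ... success path from the existing block ...
}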

Member Author (@spzala):

@gyuho sure, good point. Thanks!

@spzala (Member Author) commented Feb 15, 2018

@xiang90 @gyuho I have created #9328, #9329, and #9327 and assigned them to myself as follow-up work. Thanks!!

@spzala (Member Author) commented Feb 16, 2018

@gyuho @xiang90 can I go ahead and squash now? There are many commits, and after rebasing on top of #9330 I see the build failed; squashing might give a clearer view of it. Thanks!

@gyuho (Contributor) commented Feb 16, 2018

@spzala Just rebase from the latest commit in your master branch, and squash them all?

Create a new command, similar to check perf, that can check the memory
consumption of putting different workloads on a given endpoint. If no endpoint
is provided, localhost will be used. Return a pass or fail message to the user
indicating whether there is enough memory for a given workload.

Fixed etcd-io#9121
@spzala (Member Author) commented Feb 17, 2018

Thanks @gyuho squashed.

@spzala (Member Author) commented Feb 20, 2018

@gyuho can we please get this in now if it looks good, so that I can start working on top of it? Thanks!!

@gyuho gyuho merged commit e19df69 into etcd-io:master Feb 20, 2018
@gyuho (Contributor) commented Feb 20, 2018

@spzala Let's add auto compact, auto defrag options in a separate PR? Thanks!

@spzala (Member Author) commented Feb 21, 2018

@gyuho thanks and yup, I will have a PR soon on it.
