Performance regression 1.6 to 2.0.2 #1254
FYI, this is the code that is the subject of my GopherCon talk. The extract program shown above successfully extracts 57+ million records in under 4 hours using Badger 1.6, at a fairly consistent 200-400ms per 1000 (variability mostly due to size of messages coming in from Kafka).
I profiled 5000 records.
Here's 10,000 records. Seems to be spending a crazy amount of time doing JSON.
hey @dougdonohoe, the bloom filters stored in the table seem to take up most of the time in your profile (that's where the json comes from). Can you show me the output of https://godoc.org/github.com/dgraph-io/badger#DB.CacheMetrics ? The bloom filters are loaded on DB startup into the cache and read from the cache. The default size of the cache is 1 GB so it should have enough space to hold a lot of bloom filters.
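For readers following along, a minimal sketch of fetching those cache metrics, assuming a badger v2 DB opened at a placeholder path:

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Placeholder path; open the same DB the application uses.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bloom filters are served from this cache, so hit/miss counts here
	// indicate whether they are being re-read from the tables.
	fmt.Printf("%+v\n", db.CacheMetrics())
}
```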
When do you want me to fetch the metrics? After db startup, or during processing? It seems like things get slow after some processing. Also, what is a bloom filter? Is this new in 2.0? Is it necessary?
@jarifibrahim It's also worth noting when I ran the benchmarks in #1248 and set …
Here are stats at beginning and every 1000 records:
Note that I have two tables I'm listing metrics for.
In our application, these tables will have 23+ million entries. We have other tables that have 59+ million entries. I would be worried if the cache has to contain all keys to get 1.6 level performance.
The access pattern that causes this is as follows:
Pseudo code for this has 3 million new keys being created, and for each new key doing 10 random lookups of previous keys. After about 2,022,000 records (after the 1st flush), things start getting really slow.
Under the covers, GetId(s) eventually calls:
I'm willing to share actual code, but not in this public ticket.
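A minimal sketch of the access pattern described above, assuming the badger v2 transaction API; this is a reconstruction for illustration, not the actual application code, and the key format and values are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"math/rand"

	badger "github.com/dgraph-io/badger/v2"
)

// runPattern mimics the workload described above: each iteration inserts one
// new key, then does 10 random reads of keys written earlier.
func runPattern(db *badger.DB, total int) error {
	for i := 0; i < total; i++ {
		key := []byte(fmt.Sprintf("key-%09d", i))
		if err := db.Update(func(txn *badger.Txn) error {
			return txn.Set(key, []byte("value"))
		}); err != nil {
			return err
		}

		// 10 random lookups of previously written keys.
		for j := 0; j < 10 && i > 0; j++ {
			prev := []byte(fmt.Sprintf("key-%09d", rand.Intn(i)))
			err := db.View(func(txn *badger.Txn) error {
				item, err := txn.Get(prev)
				if err != nil {
					return err
				}
				return item.Value(func([]byte) error { return nil })
			})
			if err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-pattern"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := runPattern(db, 100000); err != nil {
		log.Fatal(err)
	}
}
```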
Hey @dougdonohoe, I've raised #1256 to fix this. Can you please run your application against #1257 and let me know how it goes?
Sure. Can you remind me how to get this version using …
Here you go.
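With Go modules, an unmerged branch or commit can generally be pinned like this; the ref below is a placeholder, not the actual branch name from the PR:

```sh
# Pin Badger to a specific branch or commit (placeholder ref shown):
go get github.com/dgraph-io/badger/v2@<branch-or-commit-sha>
```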
I ran it, and performance is better (possibly slightly slower than 1.6, but hard to gauge). Unfortunately, at the first flush I got this error:
Never seen an error like this before.
I'm re-running from fresh start to see if the above repeats. |
@dougdonohoe Can you check how many vlog files you have? Badger keeps track of the …
In Badger 2.0, the …
Notice the jump from 3 to 18. In 1.6, I have files from …
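A quick way to count and list the value-log files, assuming the DB directory is known (the path below is a placeholder); Badger names these files with a .vlog extension:

```go
package main

import (
	"fmt"
	"log"
	"path/filepath"
)

func main() {
	// Placeholder directory; point this at the Badger value log directory.
	files, err := filepath.Glob(filepath.Join("/tmp/badger", "*.vlog"))
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
	fmt.Println("total vlog files:", len(files))
}
```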
@dougdonohoe The files could be missing because they got garbage collected. The error you mentioned before is a strange one. The code here (line 893 in b13b927) …
As for the performance, the code in #1257 is definitely faster than v1.6.0.
For v1.6.0, it read 2.8 GB.
@jarifibrahim - sorry, I deleted the data and started again. It's working so far (aside from the odd numbering of vlog files noted above). What I'm doing in this phase is mostly writing. I'll have better numbers to compare to 1.6 when it finishes (takes 4-5 hours).
It happened again, after several hours:
Here are all the files:
What would you like to know next?
@dougdonohoe How are you inserting the data? Stream writer?
I'm using …
In other news, I ran my test program to create 22,000,000 entries (with random lookups done for each one). In 1.6, it ran in 25m22s. In 2.0.2 with the patch, it ran in 20m40s, so that is a positive sign. OTOH, I was trying to compare the real-world performance of my Kafka extraction program, but it died due to the above filename bug. At the time of the crash, it had been running for 1h45m; at approximately the same point when last run using 1.6, it took 1h30m. So it appears to be about 16% slower in 2.x. Note that this code is predominantly write-focused.
I don't know much about the code, but I noticed this. In 1.6:
In 2.0:
Seems suspicious to have removed the atomic add.
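As a generic illustration of the difference being flagged here (this is not the actual Badger code): incrementing a shared write offset with a plain add is only safe when exactly one goroutine ever does it, whereas an atomic add stays correct under concurrent writers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var offset uint32

	// Safe under concurrent writers: every increment is atomic.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				atomic.AddUint32(&offset, 10)
			}
		}()
	}
	wg.Wait()
	fmt.Println("atomic result:", offset) // always 80000

	// A plain `offset += 10` is only correct when a single goroutine
	// performs all writes, which is the argument made for the vlog below.
	offset = 0
	for j := 0; j < 8000; j++ {
		offset += 10
	}
	fmt.Println("single-writer result:", offset) // 80000
}
```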
I spent a little more time trying to understand the Badger code base. It's a lot to come up to speed on. It is obvious the … However, one thing I don't understand in 2.0 is why the … In 1.6:
However, in 2.0 it looks like:
Also notice file … I'd like to dig in more, but I have to do some day job stuff :-)
Note that both occurrences of this bug happened well after startup. To summarize the previous info:
I've been looking through the code a bit more, and it must be failing in the …
AFAIK, aside from open, the only place that …
FYI: I've moved on to other things and have put our upgrade to 2.0 on hold. This intermittent issue where the …
Hey @dougdonohoe, do you see some specific time/pattern between the … failures? If there's any script/test program you can share that would help me reproduce the issue, that would be great. I haven't been able to reproduce the failure, and that's why we haven't fixed it yet. As for the … (line 1405 in 91c31eb):
It is a sequential call and that's why it should be okay to not have any locks there. Earlier we were using atomics but they weren't really needed.
Writes to the vlog file should always be serial. We cannot write concurrently to the vlog file, which is the write-ahead log file.
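A minimal sketch of that single-writer pattern (not Badger's code): if all appends are funneled through one goroutine, the offset can be a plain field with no locks or atomics, which is the argument being made above. The types and names here are placeholders.

```go
package main

import "fmt"

// writeReq stands in for an append to a write-ahead log.
type writeReq struct {
	data []byte
	done chan uint32 // offset at which the data was "written"
}

func main() {
	reqs := make(chan writeReq)

	// Single writer goroutine: only this goroutine touches `offset`,
	// so no synchronization on the offset itself is required.
	go func() {
		var offset uint32
		for r := range reqs {
			r.done <- offset
			offset += uint32(len(r.data))
		}
	}()

	for i := 0; i < 3; i++ {
		r := writeReq{data: []byte("entry"), done: make(chan uint32, 1)}
		reqs <- r
		fmt.Println("wrote at offset", <-r.done)
	}
	close(reqs)
}
```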
GitHub issues have been deprecated.
Issue #1187 appears to be what changed this code. Making a note in case I have to revisit this.
The …
What version of Go are you using (go version)?
go version go1.12.7 darwin/amd64
What version of Badger are you using?
2.0.2 (upgrading from 1.6.0)
Does this issue reproduce with the latest master?
Haven't tried.
What are the hardware specifications of the machine (RAM, OS, Disk)?
GCP 8 CPU (Intel Haswell), 32 GB RAM, 750 GB local SSD
What did you do?
Running code which extracts data from Kafka and saves it to a Badger DB. I'm running on the exact same hardware and disk, with the same code, against the exact same Kafka topic.
What did you expect to see?
Better or equal performance with Badger 2.
What did you see instead?
Severe slowdown after writing ~1,461,000 records. See below
1.6.0 performance:
In 1.6.0, extracting 1,000 messages takes about 300-400ms.
2.0.2 performance:
Notice that at approximately offset 1462000 (1,462,000 records), things start slowing down from a rate of 300-400ms per 1,000 records to 25-30 seconds per 1,000 records! It happens after the very first Flushing memtable debug message. If you look above, the Flushing happens at the exact same place, but things continue speedily after.
I tried the same with compression disabled and saw similar results. The options I'm using are DefaultOptions with the following tweaks:
I literally just started on the 2.0 migration today. I'm running the same code I've been running for 6 months.
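The actual tweaks are not reproduced here; the following is only a hypothetical example of the kind of adjustments applied on top of DefaultOptions in badger v2, with placeholder values and a placeholder path:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	// Hypothetical tweaks for illustration only; these are not the
	// option values used in this issue.
	opts := badger.DefaultOptions("/path/to/db").
		WithCompression(options.None). // compression was also tested disabled
		WithSyncWrites(false)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```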