Combine synchronized storing and indexing, re-organize components, HAL support (no more DOI centric approach), dependency update, import format update #92
Conversation
By default the maximum number of connections per route is set to [DEFAULT_MAX_CONN_PER_ROUTE](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_PER_ROUTE) which is 10. I additionally set the maximum number of connections for all routes in total to the same value (default is [DEFAULT_MAX_CONN_TOTAL](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_TOTAL) = 30). I don't think it makes sense to expose this parameter as well in the config file as glutton uses only a single route.
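The pool settings described above can be sketched with the low-level Elasticsearch REST client's builder callback. This is a configuration sketch, not the PR's actual code: it assumes `org.elasticsearch.client:elasticsearch-rest-client` (7.x) on the classpath, and the host/port are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

// Hypothetical builder configuration illustrating the settings discussed above.
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"))
    .setHttpClientConfigCallback(httpClientBuilder -> httpClientBuilder
        // defaults: DEFAULT_MAX_CONN_PER_ROUTE = 10, DEFAULT_MAX_CONN_TOTAL = 30
        .setMaxConnPerRoute(10)
        // cap the total pool at the per-route value, since glutton uses a single route
        .setMaxConnTotal(10));
RestClient client = builder.build();
```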
Improve parallel processing
Would love to help to get this completed. What's left to do before this can be merged?
Hi @karatekaneen! I hope you're doing well.
Lovely! I'll take it for a test spin as soon as I get the chance. Haven't worked with Solr before, so it might be a bit tricky for me to set up.
The tests seem to be broken, which makes the build fail. Here's some of the output from the failed step:
> Task :compileTestJava FAILED
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.UnpayWallMetadata;
^
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:14: error: cannot find symbol
UnpayWallReader target;
^
symbol: class UnpayWallReader
location: class UnpaidWallReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/PmidReaderTest.java:13: error: cannot find symbol
PmidReader target;
^
symbol: class PmidReader
location: class PmidReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/IstexIdsReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.IstexData;
Sorry, I forgot to work on the tests! They have been updated.
@kermitt2 Tried it out and it works perfectly. Haven't tried the HAL stuff though, since it's of no interest for us. Awesome work! The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6 GB RAM and something similar for the Elasticsearch instance. Neither of them ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.
I think this is related to the low capacity of your VM, because we are at the same time storing in LMDB (which uses memory-mapped pages, so having RAM helps) and indexing with ES, which is also very RAM-hungry. Even if the RAM does not look used, it is in reality, because of memory mapping (LMDB uses as much RAM as is available). I have a very good server and got everything processed for CrossRef in 2h 43m :)
The server has 32 CPUs and 128 GB RAM :) but I also stored the abstracts (everything running on the same server). We can also see that 8,796 CrossRef records were stored but not indexed (148,840,337 − 148,831,541), which is something I will investigate.
Did you have to change any special parameters? I increased the memory of Elasticsearch to 64 GB and of the CrossRef task to 64 GB, but my throughput is way lower than your values:
I'm using an SSD (on AWS) and I've set it up with the fastest throughput available. I'm not sure what I can do to increase the throughput 🤔
I used a recent workstation, all SSD, and unchanged parameters. It's likely that AWS SSD and CPU are not comparable with bare metal. For memory-mapping performance, half of the machine's memory should be left to the system rather than given to Elasticsearch. So with 32 GB for the Elasticsearch JVM, 32 GB must be left to the system. I think it's the same for the glutton loading task and LMDB. Personally, I used the default memory settings as well, and I think at least half of the memory was always available to the OS.
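The heap advice above maps onto Elasticsearch's `config/jvm.options` file. A minimal sketch, assuming a 64 GB machine where half the RAM is deliberately left to the OS page cache (the actual values are an assumption, not from the thread):

```
# config/jvm.options -- give the JVM half the machine's RAM,
# leaving the rest to the OS page cache that Lucene relies on
-Xms32g
-Xmx32g
```

Setting `-Xms` equal to `-Xmx` avoids heap-resize pauses; the same "leave half to the OS" rule would apply to the glutton/LMDB loading task.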
Thanks @kermitt2!
@lfoppiano What you're describing sounds like what I encountered. I had similar performance in the beginning with it dropping over time until I hit 300/s on average. I also used a VM with SSD but on GCP so maybe this only affects cloud instances and not bare metal for some reason? |
I don't remember exactly, but GCP seemed to work faster using the SSD. I forgot. Since I was using the free credits, I was limited to 250 GB maximum, so in the end I did not manage to load the full database 😢 I will try again, and if I find the solution I will post it here.
I'm testing a new instance with the Nitro hypervisor, and I think I increased the throughput of the disks... it's better, but quite far from your performance, @kermitt2
Update: it took 13h to load the full index, with performance dropping from 9k/second to 3k/second. Quite a lot.
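As a back-of-the-envelope sanity check on the update above (my calculation, not from the thread, and assuming the ~148.8M indexed CrossRef records mentioned earlier): 13 hours for the full index works out to an average around 3.2k records/second, which is consistent with a run starting at 9k/s and ending at 3k/s.

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        long records = 148_831_541L; // indexed-record count reported earlier in the thread
        long seconds = 13L * 3600L;  // 13 hours
        long avg = Math.round((double) records / seconds);
        // roughly 3.2k records/s on average, between the 9k/s start and the 3k/s end
        System.out.println("average throughput: " + avg + " records/s");
    }
}
```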
(PCIe 5)
This is a working PR for version 0.3, which introduces many changes:
Todo: