
Combine synchronized storing and indexing, re-organize components, HAL support (no more DOI centric approach), dependency update, import format update #92

Merged: 36 commits into master on Apr 28, 2024

Conversation

kermitt2
Owner

@kermitt2 kermitt2 commented Feb 25, 2024

This is a working PR for version 0.3, which introduces many changes:

  • storing (in LMDB) and indexing (in an Elasticsearch cluster) are now done in one step (no more separate Node.js indexing), in an asynchronous manner. This makes it possible to support bibliographical records beyond the DOI, because record naming/identification is now centralized. Storing and indexing run in parallel and are quite optimized (load time should be roughly similar to before, but now includes the indexing); see the sketch after this list
  • only one Java application for everything (removal of the Node.js part, which was fun to write but ultimately a bad idea, as rightly noted by karatekaneen :D)
  • update to Dropwizard 4, switch to Logback, update of various other dependencies
  • support of Elasticsearch 8.* and the latest Java client API for indexing (still the Java High Level REST Client for retrieval for the moment)
  • update of the file import formats for the CrossRef dump flavors (using "Fix import for Crossref format 2022" #83 from @lfoppiano)
  • support of the HAL archive: loading via the HAL web API (10 times faster than OAI-PMH), queries can include a HAL ID, and results include full HAL records (with or without a DOI)
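
A minimal sketch of the combined store-and-index step, for illustration only: the class and interface names, the index name "crossref", and the thread-pool size are placeholders, not glutton's actual code. The Elasticsearch 8 Java API client is bootstrapped over the low-level REST client; each record is written synchronously to the local store, and indexing is handed off to an executor so loading and indexing run in parallel.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;

public class StoreAndIndexSketch {

    /** Hypothetical stand-in for the LMDB-backed store. */
    interface KeyValueStore {
        void put(String key, String jsonValue);
    }

    private final KeyValueStore store;
    private final ElasticsearchClient esClient;
    private final ExecutorService indexingPool = Executors.newFixedThreadPool(4);

    public StoreAndIndexSketch(KeyValueStore store, ElasticsearchClient esClient) {
        this.store = store;
        this.esClient = esClient;
    }

    /** Store a record under its identifier (DOI, HAL ID, ...) and index it asynchronously. */
    public void storeAndIndex(String recordId, String json, Map<String, Object> indexedFields) {
        store.put(recordId, json);          // synchronous write to the local store
        indexingPool.submit(() -> {         // indexing runs in parallel with loading
            try {
                esClient.index(i -> i.index("crossref").id(recordId).document(indexedFields));
            } catch (Exception e) {
                // the real application counts failures (e.g. crossref_failed_indexed_records)
                e.printStackTrace();
            }
        });
    }

    /** Standard bootstrap of the Elasticsearch 8 Java API client over the low-level REST client. */
    public static ElasticsearchClient createClient(String host, int port) {
        RestClient restClient = RestClient.builder(new HttpHost(host, port, "http")).build();
        ElasticsearchTransport transport = new RestClientTransport(restClient, new JacksonJsonpMapper());
        return new ElasticsearchClient(transport);
    }
}
```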

Todo:

  • Solr cluster support (ongoing), as a more open search framework alternative... it will likely replace Elasticsearch entirely, given Elasticsearch's new license
  • load and index full PubMed records (via Medline dumps): the parsing and JSON conversion are done, only a proper command and a lookup class are still needed. This will add support for all the PubMed entries without a DOI for reference matching (around 8M records)

Marcin Kardas and others added 29 commits November 2, 2021 17:28
By default, the maximum number of connections per route is set to [DEFAULT_MAX_CONN_PER_ROUTE](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_PER_ROUTE), which is 10. I additionally set the maximum total number of connections across all routes to the same value (the default is [DEFAULT_MAX_CONN_TOTAL](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_TOTAL) = 30). I don't think it makes sense to also expose this parameter in the config file, as glutton uses only a single route.
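
As an illustration, here is how these limits can be set on the low-level Elasticsearch REST client through the HTTP client callback; the class name, host, port, and the value passed as maxConnections are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class LookupEsClientFactory {

    public static RestClient create(String host, int port, int maxConnections) {
        return RestClient.builder(new HttpHost(host, port, "http"))
                .setHttpClientConfigCallback(httpClientBuilder -> httpClientBuilder
                        // default is DEFAULT_MAX_CONN_PER_ROUTE = 10
                        .setMaxConnPerRoute(maxConnections)
                        // default is DEFAULT_MAX_CONN_TOTAL = 30; pinned to the same value
                        // because glutton only ever talks to a single route
                        .setMaxConnTotal(maxConnections))
                .build();
    }
}
```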
@kermitt2 kermitt2 marked this pull request as draft February 25, 2024 20:25
@lfoppiano lfoppiano mentioned this pull request Mar 4, 2024
@karatekaneen
Contributor

Would love to help get this completed. What's left to do before this can be merged?

@kermitt2 kermitt2 marked this pull request as ready for review April 22, 2024 18:25
@kermitt2
Owner Author

Hi @karatekaneen! I hope you're doing well.
It's actually complete, except for what is in the todo list, but that would be too much for this PR.
I was waiting for some feedback from a user, but it's fully functional according to my tests. I will try to merge it next weekend after quickly reviewing the documentation.

@karatekaneen
Contributor

Lovely! I'll take it for a test spin as soon as I get the chance. I haven't worked with Solr before, so it might be a bit tricky for me to set up.

@karatekaneen
Contributor

The tests seem to be broken, which makes the gradlew clean build command fail.
When running gradlew clean jar instead to skip the tests, it seems to work and the server starts up. I'm waiting for the data to download, and then I'll try to get everything up and running.

Here's some of the output from the failed step:

> Task :compileTestJava FAILED
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.UnpayWallMetadata;
                                   ^
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:14: error: cannot find symbol
    UnpayWallReader target;
    ^
  symbol:   class UnpayWallReader
  location: class UnpaidWallReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/PmidReaderTest.java:13: error: cannot find symbol
    PmidReader target;
    ^
  symbol:   class PmidReader
  location: class PmidReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/IstexIdsReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.IstexData;

@kermitt2
Owner Author

Sorry, I forgot to work on the tests! They have been updated.

@kermitt2 kermitt2 merged commit 98a9268 into master Apr 28, 2024
@karatekaneen
Contributor

@kermitt2 I tried it out and it works perfectly. I haven't tried the HAL stuff though, since it's of no interest to us. Awesome work!

The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6 GB RAM and something similar for the Elasticsearch instance. Neither of them ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.

@kermitt2
Owner Author

kermitt2 commented May 6, 2024

> The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6 GB RAM and something similar for the Elasticsearch instance. Neither of them ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.

I think this is related to the low capacity of your VM, because we have storing in LMDB (memory-mapped, so it is nice to have RAM) and indexing with ES at the same time, which is also very RAM hungry. Even if the RAM does not look used, it is in reality, because of memory paging (LMDB uses as much RAM as is available).

I have a very good server and got everything processed for CrossRef in 2h 43m :)

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 148831541
crossref_storing_rejected_records
             count = 8123459

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 148840337
         mean rate = 15194.88 events/second
     1-minute rate = 16574.74 events/second
     5-minute rate = 16992.97 events/second
    15-minute rate = 17176.07 events/second

BUILD SUCCESSFUL in 2h 43m 28s
3 actionable tasks: 1 executed, 2 up-to-date

real    163m28.910s
user	0m6.562s
sys	0m3.821s

The server has 32 CPUs and 128 GB RAM :) but I also stored the abstracts (everything running on the same server).

We can also see that 8,796 CrossRef records were stored but not indexed (148840337 - 148831541); this is something I will investigate.

@lfoppiano
Collaborator

> The server has 32 CPUs and 128 GB RAM :) but I also stored the abstracts (everything running on the same server).

Did you have to change any special parameters? I increased the memory of Elasticsearch to 64 GB and of the CrossRef task to 64 GB, but my numbers are way lower than yours:

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 1014324
crossref_storing_rejected_records
             count = 65676

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 1019324
         mean rate = 5671.59 events/second
     1-minute rate = 5549.65 events/second
     5-minute rate = 4115.43 events/second
    15-minute rate = 3361.42 events/second

I'm using an SSD (on AWS) and I've set it up with the fastest throughput available.

I'm not sure what I can do to increase the throughput 🤔

@kermitt2
Owner Author

I used a recent workstation, all SSD, with unchanged parameters. It's likely that AWS SSDs and CPUs are not comparable with bare metal.

For memory-paging performance, as much memory as is given to Elasticsearch should be left to the system: with 32 GB for the Elasticsearch JVM, 32 GB must be left to the system. I think it's the same for the glutton loading task and LMDB. Personally, I also used the default memory settings, and I think at least half of the memory was always available to the OS.

@lfoppiano
Collaborator

Thanks @kermitt2!
I also have some limitations on the number of CPUs, but it looks like, even though they provide SSDs, the performance is not satisfying. The SSDs' performance also decreases over time. I allocated 64 GB for Elasticsearch and 64 GB for the Java job, but I will try allocating 32 GB instead.

@karatekaneen
Contributor

@lfoppiano What you're describing sounds like what I encountered. I had similar performance in the beginning, with it dropping over time until I hit 300/s on average. I also used a VM with an SSD, but on GCP, so maybe this only affects cloud instances and not bare metal for some reason?

@lfoppiano
Collaborator

I don't remember exactly, but GCP seemed to work faster using the SSD. Since I was using the free credits, I was limited to 250 GB maximum, so in the end I did not manage to load the full database 😢

I will try again, and if I find a solution I will post it here.

@lfoppiano
Collaborator

lfoppiano commented Aug 20, 2024

I'm testing a new instance with the Nitro hypervisor, and I think I increased the throughput of the disks...

It's better, but still quite far from your performance, @kermitt2.
@karatekaneen @kermitt2 can you run the same test on your machines? I'd be interested to compare.

  • read test:
sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1024000/64/32, sectors = 2097152000, start = 0
 Timing cached reads:   41542 MB in  1.99 seconds = 20912.98 MB/sec
 Timing buffered disk reads: 1058 MB in  3.00 seconds = 352.56 MB/sec
  • write test:
dd if=/dev/zero of=/tmp/mnt/temp oflag=direct bs=128k count=16k
dd: failed to open '/tmp/mnt/temp': No such file or directory
ubuntu@ip-172-31-33-189:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 22.6979 s, 94.6 MB/s

Update: it took 13 hours to load the full index, with performance dropping from 9k/second to 3k/second, quite a lot.

8/21/24, 8:33:21 AM ============================================================

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 149803402
crossref_storing_rejected_records
             count = 8185933

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 149805141
         mean rate = 3033.63 events/second
     1-minute rate = 1564.45 events/second
     5-minute rate = 1564.89 events/second
    15-minute rate = 1565.73 events/second

@kermitt2
Owner Author

kermitt2 commented Aug 21, 2024

  • read
lopez@trainer:~$ sudo hdparm -Ttv /dev/nvme0n1p2

/dev/nvme0n1p2:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3814934/64/32, sectors = 7812984832, start = 1050624
 Timing cached reads:   49550 MB in  2.00 seconds = 24818.52 MB/sec
 Timing buffered disk reads: 5198 MB in  3.00 seconds = 1732.26 MB/sec
  • write
lopez@trainer:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.921125 s, 2.3 GB/s

(PCIe 5)
