
Combine synchronized storing and indexing, re-organize components, HAL support (no more DOI centric approach), dependency update, import format update #92

Merged: 36 commits into master on Apr 28, 2024

Conversation

kermitt2
Owner

@kermitt2 kermitt2 commented Feb 25, 2024

This is a working PR for version 0.3, which introduces many changes:

  • storing (in LMDB) and indexing (in an Elasticsearch cluster) are now done in one step (no more separate Node.js indexing), in an asynchronous manner. This makes it possible to support bibliographical records beyond the DOI, because record naming/identification is now centralized. Storing and indexing run in parallel and are quite optimized (load time should be roughly similar to before, but now includes the indexing); see the sketch after this list
  • only one Java application for everything (removal of the Node.js part, which was fun to write but ultimately a bad idea, as rightly noted by karatekaneen :D)
  • update to Dropwizard 4, switch to Logback, update of various other dependencies
  • support of Elasticsearch 8.* and the latest Java client API for indexing (still the Java High Level REST Client for retrieval for the moment)
  • update of the file import formats for the CrossRef dump flavors (using "Fix import for Crossref format 2022" #83 from @lfoppiano)
  • support of the HAL archive: loading via the HAL web API (10 times faster than OAI-PMH), queries can include a HAL ID, and results include full HAL records (with or without a DOI)
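
A minimal sketch of the combined store-and-index step, for illustration only: the class and interface names, the index name "crossref", and the thread-pool size are placeholders, not glutton's actual code. The Elasticsearch 8 Java API client is bootstrapped over the low-level REST client; each record is written synchronously to the local store, and indexing is handed off to an executor so loading and indexing run in parallel.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;

public class StoreAndIndexSketch {

    /** Hypothetical stand-in for the LMDB-backed store. */
    interface KeyValueStore {
        void put(String key, String jsonValue);
    }

    private final KeyValueStore store;
    private final ElasticsearchClient esClient;
    private final ExecutorService indexingPool = Executors.newFixedThreadPool(4);

    public StoreAndIndexSketch(KeyValueStore store, ElasticsearchClient esClient) {
        this.store = store;
        this.esClient = esClient;
    }

    /** Store a record under its identifier (DOI, HAL ID, ...) and index it asynchronously. */
    public void storeAndIndex(String recordId, String json, Map<String, Object> indexedFields) {
        store.put(recordId, json);          // synchronous write to the local store
        indexingPool.submit(() -> {         // indexing runs in parallel with loading
            try {
                esClient.index(i -> i.index("crossref").id(recordId).document(indexedFields));
            } catch (Exception e) {
                // the real application counts failures (e.g. crossref_failed_indexed_records)
                e.printStackTrace();
            }
        });
    }

    /** Standard bootstrap of the Elasticsearch 8 Java API client over the low-level REST client. */
    public static ElasticsearchClient createClient(String host, int port) {
        RestClient restClient = RestClient.builder(new HttpHost(host, port, "http")).build();
        ElasticsearchTransport transport = new RestClientTransport(restClient, new JacksonJsonpMapper());
        return new ElasticsearchClient(transport);
    }
}
```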

Todo:

  • Solr cluster support (ongoing), as a more open search framework alternative... it will likely replace Elasticsearch entirely, given Elasticsearch's new license
  • load and index full PubMed records (via Medline dumps): the parsing and JSON conversion are done, only a proper command and a lookup class are still needed. This will add support for all the PubMed entries without a DOI for reference matching (around 8M records)

Marcin Kardas and others added 29 commits November 2, 2021 17:28
By default, the maximum number of connections per route is set to [DEFAULT_MAX_CONN_PER_ROUTE](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_PER_ROUTE), which is 10. I additionally set the maximum total number of connections across all routes to the same value (the default is [DEFAULT_MAX_CONN_TOTAL](https://javadoc.io/static/org.elasticsearch.client/elasticsearch-rest-client/7.4.2/org/elasticsearch/client/RestClientBuilder.html#DEFAULT_MAX_CONN_TOTAL) = 30). I don't think it makes sense to also expose this parameter in the config file, as glutton uses only a single route.
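
As an illustration, here is how these limits can be set on the low-level Elasticsearch REST client through the HTTP client callback; the class name, host, port, and the value passed as maxConnections are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class LookupEsClientFactory {

    public static RestClient create(String host, int port, int maxConnections) {
        return RestClient.builder(new HttpHost(host, port, "http"))
                .setHttpClientConfigCallback(httpClientBuilder -> httpClientBuilder
                        // default is DEFAULT_MAX_CONN_PER_ROUTE = 10
                        .setMaxConnPerRoute(maxConnections)
                        // default is DEFAULT_MAX_CONN_TOTAL = 30; pinned to the same value
                        // because glutton only ever talks to a single route
                        .setMaxConnTotal(maxConnections))
                .build();
    }
}
```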
@kermitt2 kermitt2 marked this pull request as draft February 25, 2024 20:25
@lfoppiano lfoppiano mentioned this pull request Mar 4, 2024
@karatekaneen
Contributor

Would love to help get this completed. What's left to do before this can be merged?

@kermitt2 kermitt2 marked this pull request as ready for review April 22, 2024 18:25
@kermitt2
Owner Author

Hi @karatekaneen! I hope you're doing well.
It's actually complete, except for what is in the todo list, but that would be too much for this PR.
I was waiting for some feedback from a user, but it's fully functional according to my tests. I will try to merge it next weekend after quickly reviewing the documentation.

@karatekaneen
Contributor

Lovely! I'll take it for a test spin as soon as I get the chance. I haven't worked with Solr before, so it might be a bit tricky for me to set up.

@karatekaneen
Contributor

The tests seem to be broken, which makes the gradlew clean build command fail.
When running gradlew clean jar instead to skip the tests, it seems to work and the server starts up. I'm waiting for the data to download, and then I'll try to get everything up and running.

Here's some of the output from the failed step:

> Task :compileTestJava FAILED
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.UnpayWallMetadata;
                                   ^
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:14: error: cannot find symbol
    UnpayWallReader target;
    ^
  symbol:   class UnpayWallReader
  location: class UnpaidWallReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/PmidReaderTest.java:13: error: cannot find symbol
    PmidReader target;
    ^
  symbol:   class PmidReader
  location: class PmidReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/IstexIdsReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.IstexData;

@kermitt2
Owner Author

Sorry, I forgot to work on the tests! They have been updated.

@kermitt2 kermitt2 merged commit 98a9268 into master Apr 28, 2024
@karatekaneen
Contributor

@kermitt2 I tried it out and it works perfectly. I haven't tried the HAL stuff though, since it's of no interest to us. Awesome work!

The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6 GB RAM and something similar for the Elasticsearch instance. Neither of them ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.

@kermitt2
Owner Author

kermitt2 commented May 6, 2024

> The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6 GB RAM and something similar for the Elasticsearch instance. Neither of them ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.

I think this is related to the low capacity of your VM, because we have storing in LMDB (memory-mapped, so it is nice to have RAM) and indexing with ES at the same time, which is also very RAM hungry. Even if the RAM does not look used, it is in reality, because of memory paging (LMDB uses as much RAM as is available).

I have a very good server and got everything processed for CrossRef in 2h 43m :)

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 148831541
crossref_storing_rejected_records
             count = 8123459

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 148840337
         mean rate = 15194.88 events/second
     1-minute rate = 16574.74 events/second
     5-minute rate = 16992.97 events/second
    15-minute rate = 17176.07 events/second

BUILD SUCCESSFUL in 2h 43m 28s
3 actionable tasks: 1 executed, 2 up-to-date

real    163m28.910s
user	0m6.562s
sys	0m3.821s

The server has 32 CPUs and 128 GB RAM :) but I also stored the abstracts (everything running on the same server).

We can also see that 8,796 CrossRef records were stored but not indexed (148840337 - 148831541); this is something I will investigate.

@lfoppiano
Collaborator

> The server has 32 CPUs and 128 GB RAM :) but I also stored the abstracts (everything running on the same server).

Did you have to change any special parameters? I increased the memory of Elasticsearch to 64 GB and of the CrossRef task to 64 GB, but my numbers are way lower than yours:

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 1014324
crossref_storing_rejected_records
             count = 65676

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 1019324
         mean rate = 5671.59 events/second
     1-minute rate = 5549.65 events/second
     5-minute rate = 4115.43 events/second
    15-minute rate = 3361.42 events/second

I'm using an SSD (on AWS) and I've set it up with the fastest throughput available.

I'm not sure what I can do to increase the throughput 🤔

@kermitt2
Owner Author

I used a recent workstation, all SSD, with unchanged parameters. It's likely that AWS SSDs and CPUs are not comparable with bare metal.

For memory-paging performance, as much memory as is given to Elasticsearch should be left to the system: with 32 GB for the Elasticsearch JVM, 32 GB must be left to the system. I think it's the same for the glutton loading task and LMDB. Personally, I also used the default memory settings, and I think at least half of the memory was always available to the OS.

@lfoppiano
Collaborator

Thanks @kermitt2!
I also have some limitations on the number of CPUs, but it looks like, even though they provide SSDs, the performance is not satisfying. The SSDs' performance also decreases over time. I allocated 64 GB for Elasticsearch and 64 GB for the Java job, but I will try allocating 32 GB instead.

@karatekaneen
Contributor

@lfoppiano What you're describing sounds like what I encountered. I had similar performance in the beginning, with it dropping over time until I hit 300/s on average. I also used a VM with an SSD, but on GCP, so maybe this only affects cloud instances and not bare metal for some reason?

@lfoppiano
Collaborator

I don't remember exactly, but GCP seemed to work faster using the SSD. Since I was using the free credits, I was limited to 250 GB maximum, so in the end I did not manage to load the full database 😢

I will try again, and if I find a solution I will post it here.

@lfoppiano
Collaborator

lfoppiano commented Aug 20, 2024

I'm testing a new instance with the Nitro hypervisor, and I think I increased the throughput of the disks...

It's better, but still quite far from your performance, @kermitt2.
@karatekaneen @kermitt2 can you run the same test on your machines? I'd be interested to compare.

  • read test:
sudo hdparm -Ttv  /dev/nvme0n1

/dev/nvme0n1:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 1024000/64/32, sectors = 2097152000, start = 0
 Timing cached reads:   41542 MB in  1.99 seconds = 20912.98 MB/sec
 Timing buffered disk reads: 1058 MB in  3.00 seconds = 352.56 MB/sec
  • write test:
dd if=/dev/zero of=/tmp/mnt/temp oflag=direct bs=128k count=16k
dd: failed to open '/tmp/mnt/temp': No such file or directory
ubuntu@ip-172-31-33-189:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 22.6979 s, 94.6 MB/s

Update: it took 13 hours to load the full index, with performance dropping from 9k/second to 3k/second, quite a lot.

8/21/24, 8:33:21 AM ============================================================

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 149803402
crossref_storing_rejected_records
             count = 8185933

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 149805141
         mean rate = 3033.63 events/second
     1-minute rate = 1564.45 events/second
     5-minute rate = 1564.89 events/second
    15-minute rate = 1565.73 events/second

@kermitt2
Owner Author

kermitt2 commented Aug 21, 2024

  • read
lopez@trainer:~$ sudo hdparm -Ttv /dev/nvme0n1p2

/dev/nvme0n1p2:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3814934/64/32, sectors = 7812984832, start = 1050624
 Timing cached reads:   49550 MB in  2.00 seconds = 24818.52 MB/sec
 Timing buffered disk reads: 5198 MB in  3.00 seconds = 1732.26 MB/sec
  • write
lopez@trainer:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.921125 s, 2.3 GB/s

(PCIe 5)
