
Indexing CPU thread limit? Parsing multiple files, reads only 8 files at a time #1774

Open
JervenBolleman opened this issue Feb 6, 2025 · 9 comments

Comments

@JervenBolleman

When giving the qlever indexer 600+ files to load, only 8 files are read at a time (notice the 8 zcat processes, and up to 1600% CPU usage for the IndexBuilderMain executable in top, averaging closer to 1400%). Is there a built-in limitation? The machine this is observed on has many more idle CPU cores.

@hannahbast
Member

@JervenBolleman To verify, I just ran qlever index with the standard Qleverfile for UniProt (which reads 677 input streams) on an otherwise idle machine with 32 logical cores. Then all cores are busy and ps -ef | grep "gzip -cd" shows over 600 processes.

The average parsing speed is 4.3 M/s. What average speed is reported on your machine?

@JervenBolleman
Author

JervenBolleman commented Feb 6, 2025 via email

@joka921
Member

joka921 commented Feb 17, 2025

Hi @JervenBolleman,
If your setup allows you to compile QLever from scratch, and therefore to change the source code, you can experiment with the following constants (typically by increasing them). In the future we can also expose these so they are less hardcoded, but currently you are the only person with a larger machine than ours who has explicitly contacted us about this.

In src/index/ConstantsIndexBuilding.h:

  • NUM_PARALLEL_PARSER_THREADS = 8, which is where the limit of 8 files comes from. The question, however, is whether those 8 zcat processes are also busy (each at 100%).
  • NUM_PARALLEL_ITEM_MAPS = 10 (probably also relevant).
  • QUEUE_SIZE_[BEFORE|AFTER]_PARALLEL_PARSING = 10

I think we can also hack on this together when we meet soon.

@JervenBolleman
Author

JervenBolleman commented Feb 17, 2025

@joka921 thanks, I will set NUM_PARALLEL_PARSER_THREADS to 1/3 of the total number of CPUs, and the others to 1/2, and see how that goes. I am using const auto processor_count = std::thread::hardware_concurrency(); to get the number of cores.
That does not play nicely with constexpr, so I am trying hard-coded values first.

A first experiment suggests that it does not really help. I am wondering about my own I/O settings and whether there might be a problem there.
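
For context on why this does not play nicely with constexpr: std::thread::hardware_concurrency() is a runtime query, so its result cannot initialize a constexpr constant. Below is a minimal sketch, not QLever source code; the constant names mirror those in src/index/ConstantsIndexBuilding.h, and the 64-core count and the 1/3 and 1/2 fractions are just assumptions taken from this comment.

// sketch.cpp -- illustrative only, not QLever source code.
#include <algorithm>
#include <cstddef>
#include <thread>

// This does NOT compile: hardware_concurrency() is evaluated at runtime,
// so its result cannot initialize a constexpr constant.
//   constexpr std::size_t NUM_PARALLEL_PARSER_THREADS =
//       std::thread::hardware_concurrency() / 3;   // error

// Option 1: hard-code the values for the target machine
// (64 logical cores is an assumption, not a measured number).
constexpr inline std::size_t NUM_PARALLEL_PARSER_THREADS = 64 / 3;  // ~1/3 of CPUs
constexpr inline std::size_t NUM_PARALLEL_ITEM_MAPS = 64 / 2;       // ~1/2 of CPUs

// Option 2: give up constexpr and compute the value once at startup.
inline const std::size_t numParserThreadsAtRuntime =
    std::max<std::size_t>(1, std::thread::hardware_concurrency() / 3);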

@aindlq

aindlq commented Feb 17, 2025

@joka921 +1 for that. I manually increased the constants that you specified and it helped to reduce the index build time: on relatively small data I got down from 764 s to 676 s.

@hannahbast
Member

@JervenBolleman @aindlq Just for the record: I am currently building an index for what I assume is the same dataset (the complete UniProt RDF dump from 2025-02-05), and the average parsing speed is 4.1 M/s, on a Ryzen 9 9950X (16 cores) with 4 NVMe drives in a RAID 0, using the standard settings for the constants listed by @joka921.

@aindlq

aindlq commented Feb 17, 2025

@hannahbast

with image from docker hub:

INFO: Triples parsed: 667,236,386 [average speed 2.2 M/s, last batch 2.1 M/s, fastest 2.6 M/s, slowest 2.0 M/s]

with custom build:

INFO: Triples parsed: 667,236,386 [average speed 3.2 M/s, last batch 3.0 M/s, fastest 5.2 M/s, slowest 2.6 M/s]

I just more or less randomly increased the constants that @joka921 recommended. I am parsing N-Triples from stdin in parallel, and the fs cache is dropped before each index run.

$ lscpu

  Model name:             Intel(R) Xeon(R) Gold 5412U
    CPU family:           6
    Model:                143
    Thread(s) per core:   2
    Core(s) per socket:   24

diff --git a/src/index/ConstantsIndexBuilding.h b/src/index/ConstantsIndexBuilding.h
index d7c18029..13b01a6e 100644
--- a/src/index/ConstantsIndexBuilding.h
+++ b/src/index/ConstantsIndexBuilding.h
@@ -66,21 +66,21 @@ constexpr inline std::string_view QLEVER_INTERNAL_INDEX_INFIX = ".internal";
 // unique elements of the vocabulary are identified via hash maps. Typically, 6
 // is a good value. On systems with very few CPUs, a lower value might be
 // beneficial.
-constexpr inline size_t NUM_PARALLEL_ITEM_MAPS = 10;
+constexpr inline size_t NUM_PARALLEL_ITEM_MAPS = 24;
 
 // The number of threads that are parsing in parallel, when the parallel Turtle
 // parser is used.
-constexpr inline size_t NUM_PARALLEL_PARSER_THREADS = 8;
+constexpr inline size_t NUM_PARALLEL_PARSER_THREADS = 24;
 
 // Increasing the following two constants increases the RAM usage without much
 // benefit to the performance.
 
 // The number of unparsed blocks of triples, that may wait for parsing at the
 // same time
-constexpr inline size_t QUEUE_SIZE_BEFORE_PARALLEL_PARSING = 10;
+constexpr inline size_t QUEUE_SIZE_BEFORE_PARALLEL_PARSING = 24;
 // The number of parsed blocks of triples, that may wait for parsing at the same
 // time
-constexpr inline size_t QUEUE_SIZE_AFTER_PARALLEL_PARSING = 10;
+constexpr inline size_t QUEUE_SIZE_AFTER_PARALLEL_PARSING = 24;
 

@hannahbast
Member

@aindlq That is very valuable feedback, thanks! It would of course be great if QLever could figure that out itself given the available resources. It is not the highest-priority item on our list, but we will look into it eventually.
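
For illustration, such auto-detection could look roughly like the following sketch. This is purely hypothetical: it assumes the constants become runtime values, the 1/3 and 1/2 fractions follow the heuristic discussed above, and the clamping bounds are invented, not QLever defaults.

// auto_tune_sketch.cpp -- hypothetical auto-sizing, not part of QLever.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>

struct IndexBuildConcurrency {
  std::size_t parserThreads;
  std::size_t itemMaps;
};

// Derive the parallelism from the number of logical cores; the lower/upper
// bounds are arbitrary so that very small and very large machines both
// stay within a sane range.
IndexBuildConcurrency detectConcurrency() {
  const std::size_t cores = std::max(1u, std::thread::hardware_concurrency());
  return {std::clamp<std::size_t>(cores / 3, 8, 64),
          std::clamp<std::size_t>(cores / 2, 10, 64)};
}

int main() {
  const auto c = detectConcurrency();
  std::printf("parser threads: %zu, item maps: %zu\n", c.parserThreads,
              c.itemMaps);
}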

@JervenBolleman
Author

@hannahbast I am indexing on an original Epyc Zen 1 machine. The difference in parsing speed is equivalent to the difference in CPU frequency. It seems like there is a bottleneck operation somewhere in there.
