Probing hashmap #3

Draft · wants to merge 4 commits into master
Conversation

jelmervdl (Owner) commented Jun 30, 2023

An attempt to reduce docalign's memory hunger, since it is becoming an issue in HPLT processing.

Aah, crap, peak memory footprint isn't lower :(

This version (the focus is mostly on memory; runtime was measured on my desktop while doing other things, so it isn't realistic):

Calculated DF from 2464504 documents
DF queue performance:
  underflow: 4823
   overflow: 0
Pruned 157562017 (81.1585%) entries from DF
Very frequent ngram set is now 227054 long.
Read 307 documents into memory
Load queue performance:
  underflow: 16
   overflow: 0
Read queue performance (Note: blocks when score queue fills up):
  underflow: 4822
   overflow: 0
Score queue performance:
  underflow: 4828
   overflow: 0
  6m25.74s real   42m5.71s user   1m46.59s sys
         18629947392  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            10793957  page reclaims
                   5  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  23  voluntary context switches
             1391819  involuntary context switches
       8201683541841  instructions retired
       6036882163532  cycles elapsed
         18185654272  peak memory footprint

Current version in main repo:

Calculated DF from 2464504 documents
DF queue performance:
  underflow: 4338
   overflow: 0
Pruned 157562017 (81.1585%) entries from DF
Very frequent ngram set is now 227054 long.
Read 307 documents into memory
Load queue performance:
  underflow: 16
   overflow: 0
Read queue performance (Note: blocks when score queue fills up):
  underflow: 4818
   overflow: 0
Score queue performance:
  underflow: 4828
   overflow: 0
  13m0.76s real   1h25m24.56s user    1m43.82s sys
         18356985856  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             6078042  page reclaims
                   5  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  37  voluntary context switches
             2789048  involuntary context switches
       9243225569447  instructions retired
      11959176496660  cycles elapsed
         15795527680  peak memory footprint

jelmervdl added 2 commits July 3, 2023 16:31
And a small optimisation: only ever copy the smaller table into the larger one.

The worst-case scenario occurs because the bigger AutoProbing table has all its entries sorted by hash, which triggers many hash collisions on merge. The problem is that it takes a long time for the destination table to be doubled, because doubling is based on an entry counter, while all the entries being added cause collisions. That entry counter therefore increases only very, very slowly, maintaining a scenario in which most inserts are collisions.
jelmervdl (Owner, Author) commented
Okay, so just replacing the table isn't doing the job. Next attempt: don't let the thread-specific hash maps become so big. This is easy to test by running with a single thread and seeing what happens to the memory footprint.
