NQ parsing: IndexBuilderMain "merging partial vocabularies" takes very long time #1468

Closed
Stiksels opened this issue Aug 27, 2024 · 13 comments

Comments

@Stiksels

Issue description
Trying to build an index for a gzipped N-Quads file (~2 million named graphs, ~140 million triples). The process has been stuck on "Merging partial vocabularies" for over 2 hours now...

Logs

2024-08-27 16:11:38.977 - INFO: QLever IndexBuilder, compiled on Tue Aug 27 06:08:21 UTC 2024 using git hash d900cd
2024-08-27 16:11:38.979 - INFO: You specified the input format: NQ
2024-08-27 16:11:38.979 - INFO: Processing input triples from /dev/stdin ...
2024-08-27 16:11:38.981 - INFO: You specified "locale = nl_BE" and "ignore-punctuation = 1"
2024-08-27 16:11:38.981 - WARN: You are using Locale settings that differ from the default language or country.
        This should work but is untested by the QLever team. If you are running into unexpected problems,
        Please make sure to also report your used locale when filing a bug report. Also note that changing the
        locale requires to completely rebuild the index
2024-08-27 16:11:38.981 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-08-27 16:11:38.981 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-08-27 16:11:38.981 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-08-27 16:11:39.133 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-08-27 16:12:06.892 - INFO: Triples parsed: 10,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:12:30.420 - INFO: Triples parsed: 20,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:12:56.159 - INFO: Triples parsed: 30,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:13:24.150 - INFO: Triples parsed: 40,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:13:51.562 - INFO: Triples parsed: 50,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:14:19.591 - INFO: Triples parsed: 60,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:14:45.409 - INFO: Triples parsed: 70,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:15:15.178 - INFO: Triples parsed: 80,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fas
2024-08-27 16:15:42.560 - INFO: Triples parsed: 90,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fas
2024-08-27 16:16:16.543 - INFO: Triples parsed: 100,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fa
2024-08-27 16:16:44.003 - INFO: Triples parsed: 110,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:17:12.634 - INFO: Triples parsed: 120,000,000 [average speed 0.4 M/s, last batch 0.3 M/s, fa
2024-08-27 16:17:40.213 - INFO: Triples parsed: 130,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:18:07.096 - INFO: Triples parsed: 140,000,000 [average speed 0.4 M/s, last batch 0.4 M/s, fa
2024-08-27 16:18:12.618 - INFO: Triples parsed: 142,064,904 [average speed 0.4 M/s, last batch 0.4 M/s, fastest 0.4 M/s, slowest 0.3 M/s]
2024-08-27 16:18:12.773 - INFO: Number of triples created (including QLever-internal ones): 169,946,282 [may contain duplicates]
2024-08-27 16:18:12.774 - INFO: Merging partial vocabularies ...
Screenshot 2024-08-27 at 17 16 54
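
For anyone who wants to measure how long each phase takes from an index log like the one above, here is a minimal Python sketch. It is not part of QLever; the function names are hypothetical, and the only assumption is that progress entries follow the "timestamp - INFO: Triples parsed: N" / "Words merged: N" pattern seen in this thread. Because the pattern is not anchored to line starts, it also works when progress entries have been run together on one line.

```python
import re
from datetime import datetime

# Matches the timestamp and counter of entries such as
# "2024-08-27 16:18:12.618 - INFO: Triples parsed: 142,064,904 [...]".
PROGRESS = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) - INFO: "
    r"(Triples parsed|Words merged): ([\d,]+)"
)


def parse_progress(log_text):
    """Return (timestamp, phase, count) for every progress entry in the log text."""
    entries = []
    for stamp, phase, count in PROGRESS.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        entries.append((when, phase, int(count.replace(",", ""))))
    return entries


def average_rate(entries, phase):
    """Average items per second between the first and last entry of one phase."""
    points = [(t, n) for t, p, n in entries if p == phase]
    if len(points) < 2:
        return None
    (t0, n0), (t1, n1) = points[0], points[-1]
    seconds = (t1 - t0).total_seconds()
    return (n1 - n0) / seconds if seconds > 0 else None
```

Calling average_rate(parse_progress(text), "Words merged") on a log that stalls in the merge phase gives a number that can be compared across machines or versions.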
@ad-freiburg ad-freiburg deleted a comment Aug 27, 2024
@joka921
Member

joka921 commented Aug 27, 2024

@Stiksels Thanks for reporting this.
Can you give us access to the NQ file and the settings you used for the IndexBuilder (QLeverfile, or command-line options and settings.json file), so that we can reproduce this locally?

@Stiksels
Author

@joka921 It eventually did work; the merging of the partial vocabularies took 3+ hours.

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten-full-nq.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten-full-nq docker.io/adfreiburg/qlever:latest -c 'zcat testdata/publiq-uit-activiteiten_2024-08-23_12-30-13.nq.gz | IndexBuilderMain -F nq -f - -i uit-activiteiten-full-nq -s uit-activiteiten-full-nq.settings.json --stxxl-memory 5G | tee uit-activiteiten-full-nq.index-log.txt'

Here is the log:
uit-activiteiten-full-nq.index-log.txt

@Stiksels
Author

Some additional info, I'm running on an older MacBook Pro model:

  • year : 2020
  • processor : 2GHz Quad-Core Intel Core i5
  • graphics : Intel Iris Plus Graphics 1536 MB
  • memory : 16 GB 3733 MHz LPDDR4X
  • storage : 512 GB SSD

In our cloud/K8s setup (amd64), the index build for the compressed N-Quads file took ~2 h in total (faster than for the compressed N-Triples file).

I'll add a download link for the file shortly

@Stiksels Stiksels changed the title from 'NQ parsing: IndexBuilderMain stuck on "merging partial vocabularies"' to 'NQ parsing: IndexBuilderMain "merging partial vocabularies" takes very long time' Aug 28, 2024
@hannahbast
Member

@Stiksels Can you provide a link to the NQ file?

@Stiksels
Author

Hi @hannahbast, here is the download link (expires after 12 h):

https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBAaCmV1LW5vcnRoLTEiRjBEAiBoz0XRms9EuYERPrfkcoOoMni%2BmihELxOmzaT7Wh%2Bz7wIgcEA7pj%2BaDVHAFZT1DVkpwz4rCbwJu8XWTwR7yLro%2FTsq6AIIKRAAGgw2NTc3MzczOTIzMzIiDIHtMnfc7yEP%2BrJLTirFAg6ygDKWHKDepEyeGLBH1cNZZdF2uCqYS9LnixT3%2FywT1GhRuSl3CU%2FecRuYYok2Ps0Mzi0yb%2FLIjPCf19RXzRDGtXVphyFE9nmlR3dnuLb%2Byup3Ed6GwUa3B8C18U5O%2FoVSZq7c4iyftb48S92694iQew2LB7rGb2rRO3siBcClGWqVtHctUE%2FdxpVzMCt5omkdEF16xNnULYYXjuNu0tll1zdsLAzhZdw0lXjg9RBZSFPHpquGQOn6HNMjXlmz8FO6EDwwxjdDRCcf5Qwmj3IzOtbIBxrjCc8CAsiZrjv4qMfTCARiARlfDSbbWBoydeSFXQJFBtFRHnWCiejS79kTJITEtA%2Bmi9T%2BcFxC%2BbM8Icod5XlXENlIY4U3h8ednz27itIve7ruYFZt6YNR9eH%2BuotW6c%2BuEfea9wrBF%2FqPhN09tIkwsa%2B7tgY6iAIfatKk6rcVVCO4BPxFybPawxcHXeMIYqhaqszbwzSfCvf491esLVb7CdAf14DHzrCwh%2BM9G6eKSvqkjhFMv3lWuqM7KlItZBSY8u58pa9TAkDs1wnp8mi0B%2FgjfP%2FcYx30Fef5%2BOOR%2Fo0gJZaiuLQB6DAtYHfg%2BV%2BvVVi%2FbFKGHs9CZ8NF4QcwHKXmfbdLnF1yvxnd2lq7wDhVKcwW7qMkhjf6UuNeOO7BPawaPHeE8X4Xx80m5X3FpvU78aEBsP5f0TZwivS13ay7IfkBTGPTIgoH1%2FZvsvfDeoz5330KQKXEvK1pvSWRqz%2FYmiXZBoLnq2Wk136Hn3xShUUk3THpyy0TV1%2Fb5XA%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240828T092620Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGOJDPNQ5Y%2F20240828%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=8419105cd8b15a933bc3ad9ac2ccdcadad53478c3284d8b9a6528932e8576cef

@Stiksels
Author

Following up on this: I ran qlever index for this file on a new MacBook Pro (M3, 18 GB) and the entire process took ~5 min.

uit-activiteiten-full-nq.index-log (1).txt

Not sure if it's necessary or high prio to support older devices? I put in a request for a new laptop 😂

@hannahbast
Member

@Stiksels It's not necessarily about the age of the computer, but about the version of the compiler and maybe the operating system. The merging of the vocabularies handles many files using many threads. It seems that with older compilers and/or older operating systems, the machine code produced does something crazily non-optimal. We haven't figured out exactly what yet.
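
To illustrate what a merge of many partial vocabularies involves in general, here is a minimal k-way merge sketch in Python. It is not QLever's implementation (which is C++ and multi-threaded); the function name is hypothetical, and it assumes that each partial vocabulary is an already sorted sequence of words and that the merged result should contain each word once.

```python
import heapq


def merge_partial_vocabularies(partials):
    """Merge sorted word sequences into one sorted list without duplicates.

    heapq.merge keeps one cursor per input and always yields the smallest
    pending word, so every input is read only once, front to back.
    """
    merged = []
    for word in heapq.merge(*partials):
        # Equal words from different partial vocabularies arrive adjacently.
        if not merged or merged[-1] != word:
            merged.append(word)
    return merged


vocabulary = merge_partial_vocabularies([
    ["<a>", "<c>"],
    ["<b>", "<c>", "<d>"],
])
```

The cost per word grows with the logarithm of the number of inputs, so the number of partial vocabularies matters as well as the total number of words.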

@hannahbast
Member

I tried the link and got

Connecting to qlever-backups.s3.eu-west-1.amazonaws.com (qlever-backups.s3.eu-west-1.amazonaws.com)|52.218.122.50|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-09-02 06:59:31 ERROR 403: Forbidden.
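
The 403 is consistent with the presigned link having expired before the download attempt: SigV4 presigned URLs carry their signing time in X-Amz-Date and their lifetime in seconds in X-Amz-Expires. A minimal sketch for checking this before downloading follows; the function name is hypothetical and the example URL is a shortened stand-in that keeps only those two parameters of the link in this thread.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the UTC expiry time encoded in a SigV4 presigned URL's query string."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


# Shortened stand-in: same X-Amz-Date and X-Amz-Expires as the shared link,
# all other query parameters omitted.
url = (
    "https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz"
    "?X-Amz-Date=20240828T092620Z&X-Amz-Expires=43200"
)
expiry = presigned_url_expiry(url)
is_expired = datetime.now(timezone.utc) > expiry
```

For the link above this gives an expiry on the evening of Aug 28, 2024 (UTC), several days before the attempt on Sep 2.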

@Stiksels
Author

Stiksels commented Sep 2, 2024

@hannahbast can you try again with this new link (expires at 22h35 Brussels time):

https://qlever-backups.s3.eu-west-1.amazonaws.com/activiteiten.nq.gz?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIn%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJHMEUCIG4nac1WtrbNQv8Unm0OFIjEZhn5SbBDSn9osFEQBx%2B3AiEA3TqffWEGMi9Rfu6GcLSjVfitBsFkF%2FPMwt4LDVzqOkMq8QIIov%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgw2NTc3MzczOTIzMzIiDBZO%2FFOwnOb8wCTljCrFAnKKM5sdIEuQQCJt5IM6xD5PT8YpZlRjuFXpxvxK0w8%2B65kVbkOYcUcA7JANc2tfCKs4EifgWyGk2NVGRuNVRPCWRf1k4QCvPkrHdLISA%2BlFfblWM4islmcy0MvicMsLSzhFxYW5jQFZxkJ%2Figu8HqP9ITQR3RlsAfOpz1zMXbO3bzd%2F0%2FqDQWXAftSngLXHN8MV%2F9npzSSXprTPel8W3ecFBdHf57okh2ecs1JlatWcHRi33IhdCemGqTumc86bl1%2Ft5hhChSSNapg49L%2B6iE5AzfqUkApOlQIll%2FbI2n0Rhytz6Ko8V2tFGSz8p4ipv8MREy2SbCMXTGMNlb2rzrH5cF0isI3CrZYNFYqS1%2BXvL%2BEtAywTmMJOK9zqBKFAtwZ8%2BtVwT9wyFZ6VHgFLcNqMn5Fg9TM0A11bcNc0XdPn4Y6v%2Btgw4fHVtgY6hwL3arbpzKgU%2BgcBKkQlFLotVuBWfLdnxehpW9W98E3EdSblcopiM%2FJBygRmIQNprTdwk7%2FrCzQlFtjt%2Bd2OlaAKglwzZu3CMtZNa24D%2FcKzOi%2F4S0ybsn9EJyXzrap6YmpZO1HKKPG%2Fm1P3rNldrwzYTP3Oynk3EgRkVtFazAjjDS5V6dd%2BteWjDBbhcqLVZVjOLFWs%2BOyVdcg9itxdEICB21GMOwGi3EoSEn8mYQ%2BcozyspDF5HvWEydNBMiUSasTUmtIrP4WQ79RQ1UABWyrzgiTmyBaANoHIbFsSRFBRVIDVmsc3t3d6123Mi%2FQ2Vkm%2BI5iproOiyCpLlFKNg9LW2tg2OxoYbA%3D%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240902T083606Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAZSJBUEDGB5T5MNG2%2F20240902%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Signature=e1c89a13866d371980cd3e742e91837b91519f4da2fc9751c1bcac9d8ced5b9e

@Stiksels
Author

Stiksels commented Sep 23, 2024

Update on the issue above: the difference in indexing performance between the Intel Mac and the Apple silicon Mac was due to the qlever version used (0.5.3 vs. any higher version).

Docker image : sha256:92c4a6431e26384b6af701d1958d11d2df213f92e63f99067b6278d16ed104c4

With version 0.5.3 and the index.py command manually overridden to use nq instead of ttl, the dataset linked above is indexed smoothly on both machines. With qlever 0.5.4 and up, the indexing process gets stuck on "merging partial vocabularies".

index.py, def execute:

index_cmd = (f"{args.cat_input_files} | {args.index_binary}"
             f" -F nq -f -"
             f" -i {args.name}"
             f" -s {args.name}.settings.json")

Mac silicon

Command: index

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten docker.io/adfreiburg/qlever:latest -c 'zcat publiq-uit-activiteiten_2024-08-23_12-30-13.nq.gz | IndexBuilderMain -F nq -f - -i uit-activiteiten -s uit-activiteiten.settings.json --stxxl-memory 5G | tee uit-activiteiten.index-log.txt'
  • Triples parsed: 142,064,904 [average speed 1.1 M/s, last batch 1.1 M/s, fastest 1.2 M/s, slowest 1.1 M/s]
  • Words merged: 14,670,392 [average speed 0.2 M/s, last batch 0.3 M/s, fastest 0.3 M/s, slowest 0.3 M/s]
  • Total time: < 5 min
    uit-activiteiten.index-log.txt

Mac Intel

Command: index

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten-full-nq.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten-full-nq docker.io/adfreiburg/qlever:latest -c 'zcat testdata/publiq-uit-activiteiten_2024-08-23_12-30-13.nq.gz | IndexBuilderMain -F nq -f - -i uit-activiteiten-full-nq -s uit-activiteiten-full-nq.settings.json --stxxl-memory 5G | tee uit-activiteiten-full-nq.index-log.txt'
  • Triples parsed: 142,064,904 [average speed 0.4 M/s, last batch 0.4 M/s, fastest 0.5 M/s, slowest 0.4 M/s]
  • Words merged: 14,670,392 [average speed 0.1 M/s, last batch 0.1 M/s, fastest 0.1 M/s, slowest 0.1 M/s]
  • Total time: < 15 min
    uit-activiteiten-full-nq.index-log.txt

@Stiksels
Author

Also, I'm working in a virtual environment with Python 3.12.6

@Stiksels
Author

Stiksels commented Oct 2, 2024

With the latest Docker image (docker pull adfreiburg/qlever:commit-996315f):

  • Building an index for one relatively large NQ dataset works smoothly
  • Building an index for multiple NQ datasets gets stuck on the step "merging partial vocabularies"

Steps to reproduce:

Command: index

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > demo-gene.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.demo-gene docker.io/adfreiburg/qlever:latest -c 'zcat data-nq/*.nq.gz | IndexBuilderMain -F nq -f - -i demo-gene -s demo-gene.settings.json --stxxl-memory 5G | tee demo-gene.index-log.txt'

2024-10-02 15:26:22.647 - INFO: QLever IndexBuilder, compiled on Tue Oct  1 14:47:32 UTC 2024 using git hash 996315
2024-10-02 15:26:22.649 - INFO: You specified the input format: NQ
2024-10-02 15:26:22.649 - INFO: Processing input triples from /dev/stdin ...
2024-10-02 15:26:22.649 - INFO: You specified "locale = nl_BE" and "ignore-punctuation = 1"
2024-10-02 15:26:22.649 - WARN: You are using Locale settings that differ from the default language or country.
        This should work but is untested by the QLever team. If you are running into unexpected problems,
        Please make sure to also report your used locale when filing a bug report. Also note that changing the
        locale requires to completely rebuild the index
2024-10-02 15:26:22.650 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-10-02 15:26:22.650 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-10-02 15:26:22.650 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-10-02 15:26:22.708 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-10-02 15:31:48.852 - INFO: Triples parsed: 437,147,767 [average speed 1.3 M/s, last batch 1.3 M/s, fastest 1.4 M/s, slowest 1.3 M/s] 
2024-10-02 15:31:48.866 - INFO: Number of triples created (including QLever-internal ones): 508,776,783 [may contain duplicates]
2024-10-02 15:31:48.867 - INFO: Merging partial vocabularies ...
2024-10-02 15:41:06.642 - INFO: Words merged: 30,000,000 [average speed 0.1 M/s, last batch 0.0 M/s, fastest 0.4 M/s, slowest 0.0 M/s]
^C
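
As a rough sense of scale for the merge step: the logs say that one partial vocabulary is created per batch and that num-triples-per-batch is 100,000. Assuming (this is an assumption, not confirmed in the thread) that batches are counted over the parsed triples, the number of partial vocabularies can be estimated with a ceiling division. The function name is hypothetical.

```python
def estimated_batches(num_triples, triples_per_batch):
    """Ceiling division: number of batches needed to hold num_triples."""
    return (num_triples + triples_per_batch - 1) // triples_per_batch


# Parsed-triple counts taken from the two logs in this thread.
single_dataset = estimated_batches(142_064_904, 100_000)
multiple_datasets = estimated_batches(437_147_767, 100_000)
```

Under that assumption the multi-dataset run has roughly three times as many partial vocabularies to merge as the single-dataset run, which may be relevant when comparing their merge behaviour.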

@Stiksels
Author

Stiksels commented Nov 5, 2024

This seems to have been a resource allocation issue: after (significantly) increasing the memory limit in Docker Desktop (and making sure there are no conflicts with running servers), indexing multiple datasets now runs stably.
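
For readers who hit the same symptom, one way to see how much memory is actually available to the container is to read MemTotal from /proc/meminfo inside it. With Docker Desktop this typically reflects the memory assigned to its Linux VM rather than the Mac's physical RAM. The sketch below is a generic Linux check, not a QLever tool; the function names are hypothetical, and a per-container limit set with docker run -m would not show up here.

```python
def mem_total_gib(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo content, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def read_mem_total_gib(path="/proc/meminfo"):
    """Read the total memory visible to the current Linux environment, in GiB."""
    with open(path) as f:
        return mem_total_gib(f.read())
```

Comparing the result of read_mem_total_gib() with the value passed to --stxxl-memory (5G in the commands above) gives a quick indication of whether the index build has any headroom.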

@Stiksels Stiksels closed this as completed Nov 5, 2024