
Bad performance when importing Netscape bookmarks > 3000 entries #985

Closed · virtualtam opened this issue Oct 4, 2017 · 6 comments

@virtualtam (Member) commented Oct 4, 2017

Performance issue

When importing Netscape bookmark files containing a substantial number of entries (>3000), the following happens:

  • a PHP timeout is reached: max_execution_time (default: 30 seconds; a stop-gap override is sketched after this list)
  • when using a reverse proxy, an HTTP 504 Gateway Timeout error is raised because the application stops responding
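As a stop-gap (not a fix), the execution limit can be raised while a large import runs; a minimal sketch, assuming the host allows overriding max_execution_time at runtime:

<?php
// Stop-gap only: raise the PHP execution limit for the import request.
// The value is illustrative; some hosts cap or ignore runtime overrides.
ini_set('max_execution_time', '300');  // seconds; PHP's default is 30
// php.ini equivalent: max_execution_time = 300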

I've observed this behaviour in the following environments:

This is an issue, as the import feature can be a game changer for users looking to migrate their data to Shaarli (#902, #969).

Troubleshooting

I've written a script to generate random/fake bookmark dumps of configurable size: generate_netscape_bookmarks.py

Performance starts degrading as soon as the imported file contains more than ~3000 entries, and only gets worse beyond that point :(

The NetscapeBookmarkParser library behaves decently when parsing these files, so the root cause probably lies in the NetscapeBookmarkUtils->import() method; the usual suspects are the foreach() { ... } loop, the final disk write operation, or both.
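To isolate those suspects, a quick timing sketch (names such as $bookmarks and $linkDb->save() are illustrative, not Shaarli's actual API):

<?php
// Rough profiling sketch: time the import loop and the final disk write
// separately to see which one dominates on a ~3000-entry file.
$t0 = microtime(true);
foreach ($bookmarks as $bookmark) {
    // per-link processing (illustrative placeholder)
}
$t1 = microtime(true);
$linkDb->save();  // hypothetical final write to the datastore
$t2 = microtime(true);
printf("loop: %.2fs, write: %.2fs\n", $t1 - $t0, $t2 - $t1);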

virtualtam added this to the 0.10.0 milestone Oct 4, 2017
@izumitelj commented Oct 5, 2017

I agree this is a very important issue for new and serious users. I've been struggling with importing 4583 links from Diigo for almost a week now. First I had some issues with the server and file sizes (#969), and now I'm stuck with probable parsing errors as described in #902.

I also have some code in descriptions, as well as various other characters that should probably be escaped. I've manually escaped all < and > characters with a text editor, but I still get the ...unknown file format. Nothing was imported. error.

Since Delicious, Diigo and maybe other services don't properly escape problematic characters on export, would it be a bad idea for Shaarli's Netscape parser to just accept anything between <DD> and the next <DT> as-is?
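A rough sketch of that lenient strategy (the regex and function name are illustrative, not the parser's actual code):

<?php
// Lenient capture: take everything between <DD> and the next <DT> (or the
// end of input) verbatim, without requiring well-formed HTML in between.
function extractDescriptions(string $html): array
{
    // 's': '.' also matches newlines; 'i': case-insensitive; the lookahead
    // stops at the next <DT> without consuming it.
    preg_match_all('/<DD>(.*?)(?=<DT|$)/si', $html, $matches);
    return array_map('trim', $matches[1]);
}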

@virtualtam (Member, Author) commented:

Writing a parser is far from being a trivial task, especially with formats as crappy as the Netscape one.

The current parser supports most standard use cases:

  • importing bookmarks from a browser dump (Firefox, Chrome, etc.)
  • importing bookmarks from other services, provided the data has been properly sanitized (escaping, HTML entities, etc.)

Which leaves us with a couple of edge cases when users attempt to import large dumps with funky data...

I've started writing a grammar-based lexer/parser for the Netscape format at https://github.com/virtualtam/hoa-netscape-bookmark-parser ; this will take time for experimenting and testing, but if it works we'll end up with a library that's way better and more maintainable :)
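For readers unfamiliar with the approach, this is the rough shape of a grammar-driven parser built on Hoa\Compiler (file names are illustrative; the actual grammar lives in the repository linked above):

<?php
// Illustrative sketch: Hoa\Compiler loads a .pp grammar file (token and
// rule definitions) and produces an AST, instead of pattern-matching the
// whole file with regular expressions.
require 'vendor/autoload.php';

$compiler = Hoa\Compiler\Llk\Llk::load(
    new Hoa\File\Read('Netscape.pp')  // hypothetical grammar file
);
$ast = $compiler->parse(file_get_contents('bookmarks.html'));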

@izumitelj commented:
After manually escaping all < and > in descriptions and titles, I still got the ...unknown file format. Nothing was imported. error.

Then I tried to split the original file (4583 links, 1098716 bytes) in two, and this time I successfully imported it that way.

Therefore I can confirm this issue. It would also be nice if Shaarli's message could provide more details about the error, because ...unknown file format. Nothing was imported. is misleading.

@ArthurHoaro (Member) commented:

> It would also be nice if Shaarli's message could provide more details about the error, because ...unknown file format. Nothing was imported. is misleading.

Please add this to your config file before running the import: "dev": { "debug": true }. You should then have a very detailed log file in your data folder after the import.
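For context, that setting goes in Shaarli's data/config.json.php; a minimal excerpt (other keys omitted), assuming the stock layout where the JSON is wrapped in a PHP comment so the file can't be fetched directly:

<?php /*
{
    "dev": {
        "debug": true
    }
}
*/ ?>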

@izumitelj commented Oct 7, 2017

I've tried that already, and the output is even less informative. So here is the exact log from the failed import of the full file (the one that I escaped manually but didn't split into two smaller ones, which works):

[2017-10-07 11:36:04.168555] [info] PARSING LINE #0
[2017-10-07 11:36:04.168654] [debug] [#0] Content: 
[2017-10-07 11:36:04.168782] [info] File parsing ended

@ArthurHoaro (Member) commented:

Well, this log means that no content at all is parsed after sanitization. Maybe it's related to the PR I just made (shaarli/netscape-bookmark-parser#43).

@virtualtam The performance issue comes from the history writing. I'll submit a PR to fix it.

ArthurHoaro added a commit to ArthurHoaro/Shaarli that referenced this issue Oct 7, 2017
With large imports, it has a large impact on performance and isn't really useful.

Instead, write an IMPORT event, which lets clients using the history service resync their DB.

-> 15k link import done in 6 seconds.

Fixes shaarli#985
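The shape of that change, sketched with illustrative names (this is not the actual patch):

<?php
// Before: one history entry was written per imported link, i.e. thousands
// of disk writes. After: links are saved in one batch and a single IMPORT
// event is recorded; history consumers resync their DB from that event.
foreach ($bookmarks as $bookmark) {
    $linkDb[] = toLink($bookmark);  // in-memory only (toLink() is a hypothetical helper)
}
$linkDb->save();        // one datastore write for the whole batch
$history->addImport();  // hypothetical: records the single IMPORT event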