
Bad performance when importing Netscape bookmarks > 3000 entries #985

Closed · virtualtam opened this issue Oct 4, 2017 · 6 comments

@virtualtam (Member) commented Oct 4, 2017

Performance issue

When importing Netscape bookmark files containing a substantial number of entries (>3000), the following happens:

  • a PHP timeout is reached: max_execution_time (default: 30 seconds; a stop-gap override is sketched after this list)
  • when using a reverse proxy, an HTTP 504 Gateway Timeout error is raised because the application stops responding
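As a stop-gap (not a fix), the execution limit can be raised while a large import runs; a minimal sketch, assuming the host allows overriding max_execution_time at runtime:

<?php
// Stop-gap only: raise the PHP execution limit for the import request.
// The value is illustrative; some hosts cap or ignore runtime overrides.
ini_set('max_execution_time', '300');  // seconds; PHP's default is 30
// php.ini equivalent: max_execution_time = 300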

I've observed this behaviour in the following environments:

This is an issue, as the import feature can be a game changer for users looking to migrate their data to Shaarli (#902, #969).

Troubleshooting

I've written a script to generate random/fake bookmark dumps of configurable size: generate_netscape_bookmarks.py

Performance starts degrading as soon as the imported file contains more than ~3000 entries, and only gets worse beyond that point :(

The NetscapeBookmarkParser library behaves decently when parsing these files, so the root cause probably lies in the NetscapeBookmarkUtils->import() method; the usual suspects are the foreach() { ... } loop, the final disk write operation, or both.
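To isolate those suspects, a quick timing sketch (names such as $bookmarks and $linkDb->save() are illustrative, not Shaarli's actual API):

<?php
// Rough profiling sketch: time the import loop and the final disk write
// separately to see which one dominates on a ~3000-entry file.
$t0 = microtime(true);
foreach ($bookmarks as $bookmark) {
    // per-link processing (illustrative placeholder)
}
$t1 = microtime(true);
$linkDb->save();  // hypothetical final write to the datastore
$t2 = microtime(true);
printf("loop: %.2fs, write: %.2fs\n", $t1 - $t0, $t2 - $t1);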

virtualtam added this to the 0.10.0 milestone Oct 4, 2017
@izumitelj commented Oct 5, 2017

I agree this is a very important issue for new and serious users. I've been struggling with importing 4583 links from Diigo for almost a week now. First I had some issues with the server and file sizes (#969), and now I'm stuck with probable parsing errors as described in #902.

I also have some code in descriptions, as well as various other characters that should probably be escaped. I've manually escaped all < and > characters with a text editor, but I still get the ...unknown file format. Nothing was imported. error.

Since Delicious, Diigo and maybe other services don't properly escape problematic characters on export, would it be a bad idea for Shaarli's Netscape parser to just accept anything between <DD> and the next <DT> as-is?
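A rough sketch of that lenient strategy (the regex and function name are illustrative, not the parser's actual code):

<?php
// Lenient capture: take everything between <DD> and the next <DT> (or the
// end of input) verbatim, without requiring well-formed HTML in between.
function extractDescriptions(string $html): array
{
    // 's': '.' also matches newlines; 'i': case-insensitive; the lookahead
    // stops at the next <DT> without consuming it.
    preg_match_all('/<DD>(.*?)(?=<DT|$)/si', $html, $matches);
    return array_map('trim', $matches[1]);
}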

@virtualtam (Member, Author) commented:

Writing a parser is far from being a trivial task, especially with formats as crappy as the Netscape one.

The current parser supports most standard use cases:

  • importing bookmarks from a browser dump (Firefox, Chrome, etc.)
  • importing bookmarks from other services, provided the data has been properly sanitized (escaping, HTML entities, etc.)

Which leaves us with a couple of edge cases when users attempt to import large dumps with funky data...

I've started writing a grammar-based lexer/parser for the Netscape format at https://github.com/virtualtam/hoa-netscape-bookmark-parser ; this will take time for experimenting and testing, but if it works we'll end up with a library that's way better and more maintainable :)
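For readers unfamiliar with the approach, this is the rough shape of a grammar-driven parser built on Hoa\Compiler (file names are illustrative; the actual grammar lives in the repository linked above):

<?php
// Illustrative sketch: Hoa\Compiler loads a .pp grammar file (token and
// rule definitions) and produces an AST, instead of pattern-matching the
// whole file with regular expressions.
require 'vendor/autoload.php';

$compiler = Hoa\Compiler\Llk\Llk::load(
    new Hoa\File\Read('Netscape.pp')  // hypothetical grammar file
);
$ast = $compiler->parse(file_get_contents('bookmarks.html'));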

@izumitelj commented:
After manually escaping all < and > in descriptions and titles, I still got the ...unknown file format. Nothing was imported. error.

Then I tried to split the original file (4583 links, 1098716 bytes) in two, and this time I successfully imported it that way.

Therefore I can confirm this issue. It would also be nice if Shaarli's message could provide more details about the error, because ...unknown file format. Nothing was imported. is misleading.

@ArthurHoaro (Member) commented:

> It would also be nice if Shaarli's message could provide more details about the error, because ...unknown file format. Nothing was imported. is misleading.

Please add this to your config file before running the import: "dev": { "debug": true }. You should then have a very detailed log file in your data folder after the import.
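For context, that setting goes in Shaarli's data/config.json.php; a minimal excerpt (other keys omitted), assuming the stock layout where the JSON is wrapped in a PHP comment so the file can't be fetched directly:

<?php /*
{
    "dev": {
        "debug": true
    }
}
*/ ?>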

@izumitelj commented Oct 7, 2017

I've tried that already, and the output is even less informative. So here is the exact log from the failed import of the full file (the one that I escaped manually but didn't split into two smaller ones, which works):

[2017-10-07 11:36:04.168555] [info] PARSING LINE #0
[2017-10-07 11:36:04.168654] [debug] [#0] Content: 
[2017-10-07 11:36:04.168782] [info] File parsing ended

@ArthurHoaro (Member) commented:

Well, this log means that no content at all is parsed after sanitization. Maybe it's related to the PR I just made (shaarli/netscape-bookmark-parser#43).

@virtualtam The performance issue comes from the history writing. I'll submit a PR to fix it.

ArthurHoaro added a commit to ArthurHoaro/Shaarli that referenced this issue Oct 7, 2017
With large imports, it has a large impact on performance and isn't really useful.

Instead, write an IMPORT event, which lets clients using the history service resync their DB.

-> 15k link import done in 6 seconds.

Fixes shaarli#985
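The shape of that change, sketched with illustrative names (this is not the actual patch):

<?php
// Before: one history entry was written per imported link, i.e. thousands
// of disk writes. After: links are saved in one batch and a single IMPORT
// event is recorded; history consumers resync their DB from that event.
foreach ($bookmarks as $bookmark) {
    $linkDb[] = toLink($bookmark);  // in-memory only (toLink() is a hypothetical helper)
}
$linkDb->save();        // one datastore write for the whole batch
$history->addImport();  // hypothetical: records the single IMPORT event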