
Speed up downloads of large reports #4245

Open
andrewjbtw opened this issue Oct 18, 2023 · 3 comments

@andrewjbtw

When downloading reports of over about 75,000 druids, the download speed diminishes over time. This results in multi-hour download times for large reports.

Example: I started a report download for 500,000 druids this morning. It never finished because my wifi connection dropped two hours later.

I didn't screenshot the speed right away, but I think it started at about 60 KB/sec.

After about 15 minutes this was the speed:

[Screenshot: download speed at 10:03 AM]

And then it kept dropping:

[Screenshot: download speed at 10:22 AM]
[Screenshot: download speed at 10:41 AM]
[Screenshot: download speed at 11:48 AM]

At this point my wifi dropped, about 460,000 rows into the report.

If the download had maintained the speed it started with, it still would have taken a while, but I think it would have finished. The drop in speed means the last rows take much longer than the first. I did the same download on Monday and it took over three hours.

@justinlittman
Contributor

I'm fairly sure that deep pagination is problematic for Lucene (and hence Solr). See, for example, https://solr.apache.org/guide/8_11/pagination-of-results.html#performance-problems-with-deep-paging
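For reference, the cursor-based fetching that guide recommends in place of deep `start`/`rows` paging looks roughly like this. This is a minimal sketch in Python against Solr's HTTP API, not this project's code; the core URL, query, and `id` sort field are hypothetical:

```python
# Minimal sketch of Solr cursorMark pagination (hypothetical core URL and field names).
import requests

SOLR_SELECT = "http://localhost:8983/solr/my_core/select"  # hypothetical Solr endpoint

def fetch_all_docs(query="*:*", page_size=1000):
    cursor = "*"  # Solr's initial cursorMark value
    while True:
        params = {
            "q": query,
            "rows": page_size,
            "sort": "id asc",      # cursorMark requires a sort on the uniqueKey field
            "cursorMark": cursor,
            "wt": "json",
        }
        data = requests.get(SOLR_SELECT, params=params).json()
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # the cursor stops advancing once results are exhausted
            break
        cursor = next_cursor
```

Because the cursor encodes where the previous page left off, each request costs about the same regardless of how deep into the result set it is, which is why it avoids the slowdown described in the linked page.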

@andrewjbtw
Author

So the report is being generated as if a browser were paging through the results, rather than through a query that asks for a larger number of rows at a time?

@justinlittman
Contributor

If my understanding is correct, behind the scenes the reporter is iterating over pages of results from Solr. This is the standard way of dealing with a large result set (instead of just asking for all of the results in a single request).

There might be some performance gains from increasing the page size when iterating for a download.
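To illustrate that pattern, here is a minimal sketch (hypothetical URL and parameters, not the actual reporter code) of offset-based page iteration. Each request asks Solr to skip `start` documents before returning a page, which is why deeper pages get progressively slower; a larger page size only reduces the number of requests, not that per-page cost:

```python
# Minimal sketch of offset-based (start/rows) iteration over a Solr result set.
import requests

SOLR_SELECT = "http://localhost:8983/solr/my_core/select"  # hypothetical Solr endpoint

def fetch_by_pages(query="*:*", page_size=100):
    start = 0
    while True:
        params = {"q": query, "start": start, "rows": page_size, "wt": "json"}
        data = requests.get(SOLR_SELECT, params=params).json()
        docs = data["response"]["docs"]
        if not docs:
            break
        yield from docs
        start += page_size  # deeper offsets make each successive page more expensive
```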
