Issue 292: Parallelise plot creation using `multiprocessing` #346

baileythegreen · 2021-10-18T19:41:25Z

Uses multiprocessing to speed up plot generation. See:

Use multiprocessing to speed up plot file output creation #292.

Closes #292.

Type of change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality not to work as expected)
This change requires a documentation update
This is a documentation update

Action Checklist


codecov · 2021-10-18T19:47:41Z

Codecov Report

Merging #346 (122f6d4) into master (9ae767f) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
+ Coverage   76.26%   76.33%   +0.06%     
==========================================
  Files          52       52              
  Lines        3388     3398      +10     
==========================================
+ Hits         2584     2594      +10     
  Misses        804      804

baileythegreen · 2021-10-18T20:03:58Z

The codefactor test fails because I have a list comprehension calling the plotting functions and not assigning to anything. How much do we care about codefactor being unhappy with something like that?

widdowquinn · 2021-10-19T07:29:57Z

> The codefactor test fails because I have a list comprehension calling the plotting functions and not assigning to anything. How much do we care about codefactor being unhappy with something like that?

I'm not especially bothered about codefactor's view on that, but I am curious what the list comprehension brings to the code over, say, a for loop. Is it more legible or efficient this way?

widdowquinn · 2021-10-19T07:30:55Z

If you demonstrated the speedup, it would be useful to note the values (on your system, and others if you tried) here.

baileythegreen · 2021-10-19T08:35:12Z

I haven't run timed tests on this yet; that's on the to-do list.

There is a slight efficiency bump that comes with comprehension statements over a conventional for loop; I think it has something to do with how it checks whether there is still more to do. It is likely small in this case, and I'll have to spend some time looking for where I learned that, if you want a source.

widdowquinn · 2021-10-19T08:39:08Z

If you're doing timed tests, you won't need a source - the data will show it ;)

baileythegreen · 2021-10-19T11:33:56Z

@widdowquinn Here are some times for plotting the output of comparisons between 49 genomes:

| iteration method | time (s) |
| --- | --- |
| for loop | 50.251 |
| list comp. + mp | 28.069 |
| for loop + mp | 28.445 |
| gen. exp. + mp | 30.613 |

The main difference is clearly the use of multiprocessing. I believe the list comprehension is slightly faster than the for loop because it is implemented directly in C, whereas the for loop has some Python overhead, but I still haven't found where I learned this, so I can't explain it better right now. The speed-up is not huge here. A list comprehension also carries some overhead from building the list; in this case the list is bounded in size, because the number of plots does not depend on the number of genomes. That overhead can be avoided with a generator expression, but that requires some additional lines of code to catch the end exception.

For loop:

for func, args in plotting_commands:
    pool.apply_async(func, args, {})

List comprehension:

[pool.apply_async(func, args, {}) for func, args in plotting_commands]

Generator expression:

plots = (pool.apply_async(func, args, {}) for func, args in plotting_commands)
while True:
    try:
        next(plots)
    except StopIteration:
        break
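
All three variants above drop the `AsyncResult` objects that `pool.apply_async()` returns. An exception raised inside a worker is only re-raised in the parent when `.get()` is called on the corresponding result, so a failed plot can otherwise pass unnoticed. The following is a minimal sketch of keeping the handles and waiting on them. The names `dispatch_all`, `draw_heatmap`, and `draw_scatter` are hypothetical stand-ins, not pyani functions, and `plotting_commands` is assumed to be a list of `(func, args)` pairs as in the snippets above.

```python
import multiprocessing


def draw_heatmap(name):
    """Hypothetical stand-in for a plotting function."""
    return f"heatmap:{name}"


def draw_scatter(x_name, y_name):
    """Hypothetical stand-in for a second plotting function."""
    return f"scatter:{x_name}-{y_name}"


def dispatch_all(pool, plotting_commands):
    """Submit every (func, args) pair, then wait for each result.

    Works with any pool exposing apply_async(), e.g. multiprocessing.Pool.
    """
    # Submit everything first so the workers can run concurrently.
    pending = [pool.apply_async(func, args, {}) for func, args in plotting_commands]
    # .get() blocks until the task finishes and re-raises any worker exception.
    return [result.get() for result in pending]


if __name__ == "__main__":
    commands = [
        (draw_heatmap, ("identity",)),
        (draw_heatmap, ("coverage",)),
        (draw_scatter, ("identity", "coverage")),
    ]
    with multiprocessing.Pool(processes=2) as pool:
        print(dispatch_all(pool, commands))
```

In this form the list comprehension is assigned to a name, and the returned values come back in the same order as `plotting_commands`.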

The for loop is probably the more widely known construction and easier for most people to read, though I don't think the list comprehension is too bad in this case. If we are concerned with readability, I would focus more on the block that directly precedes the generation of the plotting commands:

    for matdata in [
        MatrixData(*_)
        for _ in [
            ("identity", pd.read_json(results.df_identity), {}),
            ("coverage", pd.read_json(results.df_coverage), {}),
            ("aln_lengths", pd.read_json(results.df_alnlength), {}),
            ("sim_errors", pd.read_json(results.df_simerrors), {}),
            ("hadamard", pd.read_json(results.df_hadamard), {}),
        ]
    ]:

It was a while before I registered that that was a for loop containing a nested list comprehension.

widdowquinn · 2021-10-19T11:55:17Z

The list comp/for loop time difference doesn't look hugely significant for a single run. The difference could accumulate for larger datasets, though. I assume this is an average over 3-5 (or more) runs, so the variance could be informative.
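
One way to obtain the per-method mean and spread mentioned above is to repeat each measurement and summarise it. This is a sketch only: `time_repeated` and `busy_work` are hypothetical names, and `busy_work` merely stands in for one real plotting run.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func() `repeats` times; return (mean, sample stdev) in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # statistics.stdev needs at least two samples.
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return statistics.mean(durations), spread


def busy_work():
    """Hypothetical placeholder for one full plotting run."""
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    mean, spread = time_repeated(busy_work, repeats=5)
    print(f"{mean:.4f} s ± {spread:.4f} s over 5 runs")
```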

The generator expression overhead is possibly due to the generator not calling pool.apply_async() until the while loop runs - I'd expect that extra while loop to be where the overhead is.
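
The laziness described here is easy to confirm in isolation. In this sketch, `record` is a hypothetical stand-in for `pool.apply_async()` that simply logs each call; nothing is logged until the generator is consumed.

```python
calls = []


def record(value):
    """Hypothetical stand-in for pool.apply_async(); logs each call."""
    calls.append(value)
    return value


# Building the generator expression runs none of the calls...
lazy = (record(n) for n in range(3))
calls_before = list(calls)

# ...whereas a list comprehension runs all of them immediately.
eager = [record(n) for n in range(3, 6)]
calls_after_eager = list(calls)

# A plain for loop drains the generator and handles StopIteration itself.
for _ in lazy:
    pass
calls_after_drain = list(calls)
```

Draining the generator with a `for` loop gives the same effect as the explicit `try`/`except StopIteration` block, without the extra lines.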

I'm not sure that last structure is a nested list comprehension, though. The form is a for loop operating on a list comprehension:

for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]:
    do_something(item)

A nested list comprehension would be:

[do_something(item) for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]]

As a rule, if there's no compelling justification otherwise, for clarity it may be best to reserve list comprehensions for cases where the resulting list is actually needed. Otherwise there's the nagging feeling that an assignment is missing. A standard for loop might be more readable where no returned list is needed. I'm sure I've broken the rule more than once, though - and probably didn't note my justification at the time, either.
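
To make the distinction concrete, here is a small sketch with hypothetical `output` and `do_something` functions. Both forms make the same calls, but the nested comprehension also builds a list of return values that nothing uses.

```python
def output(pair):
    """Hypothetical constructor: turn a (name, value) tuple into a dict."""
    name, value = pair
    return {"name": name, "value": value}


seen = []


def do_something(item):
    """Hypothetical side-effecting function; returns None, as many plot calls do."""
    seen.append(item["name"])


pairs = [("identity", 1), ("coverage", 2)]

# A for loop operating on a list comprehension: nothing is left over.
for item in [output(_) for _ in pairs]:
    do_something(item)

# A nested list comprehension: same side effects, plus a throwaway list.
leftover = [do_something(item) for item in [output(_) for _ in pairs]]
```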

widdowquinn · 2021-10-19T11:59:40Z

The multiprocessing speed-up is a good win, though - nice work!


baileythegreen added 2 commits October 18, 2021 20:05

Add multiprocessing support to pyani plot

f157b96

Merge branch 'master' of https://github.com/widdowquinn/pyani into is…

77d576f

…sue_292

baileythegreen added the performance (the issue relates to making pyani more efficient) and visualisation (issues relating to plot outputs) labels Oct 18, 2021

baileythegreen requested a review from widdowquinn as a code owner October 18, 2021 19:41

baileythegreen marked this pull request as draft October 18, 2021 19:42

baileythegreen added 6 commits October 19, 2021 13:54

Change list comprehension to for loop for readability

2b09fb0

Rename write_run_heatmaps()

2846bdf

The new name will be less confusing because this function also runs the code that generates distribution and the scatter plots.

Add new dependencies

a23d142

For generating trees

Change local variable name to avoid conflicts

451e2d3

Merge branch 'master' of https://github.com/widdowquinn/pyani into is…

a5002ea

…sue_292

Remove import time

122f6d4

baileythegreen marked this pull request as ready for review November 29, 2021 18:23

baileythegreen changed the title ~~Issue 292~~ Issue 292: Parallelise plot creation using multiprocessing Nov 30, 2021

baileythegreen merged commit 4dd4f83 into master Nov 30, 2021

baileythegreen deleted the issue_292 branch May 24, 2022 04:33
