Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 292: Parallelise plot creation using multiprocessing #346

Merged
merged 8 commits into from
Nov 30, 2021

Conversation

baileythegreen
Copy link
Contributor

Uses multiprocessing to speed up plot generation. See:

Closes #292.

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality not to work as expected)
  • This change requires a documentation update
  • This is a documentation update

Action Checklist

  • Work on a single issue/concept (if there are multiple separate issues to address, please use a separate pull request for each)
  • Fork the pyani repository under your own account (please allow write access for repository maintainers)
  • Set up an appropriate development environment (please see CONTRIBUTING.md)
  • Create a new branch with a short, descriptive name
  • Work on this branch
    • style guidelines have been followed
    • new code is commented, especially in hard-to-understand areas
    • corresponding changes to documentation have been made
    • tests for the change have been added that demonstrate the fix or feature works
  • Test locally with pytest -v non-passing code will not be merged
  • Rebase against origin/master
  • Check changes with flake8 and black before submission
  • Commit branch
  • Submit pull request, describing the content and intent of the pull request
  • Request a code review
  • Continue the discussion at Pull requests section in the pyani repository

@baileythegreen baileythegreen added performance the issue relates to making pyani more efficient visualisation issues relating to plot outputs labels Oct 18, 2021
@baileythegreen baileythegreen marked this pull request as draft October 18, 2021 19:42
@codecov
Copy link

codecov bot commented Oct 18, 2021

Codecov Report

Merging #346 (122f6d4) into master (9ae767f) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
+ Coverage   76.26%   76.33%   +0.06%     
==========================================
  Files          52       52              
  Lines        3388     3398      +10     
==========================================
+ Hits         2584     2594      +10     
  Misses        804      804              

@baileythegreen
Copy link
Contributor Author

The codefactor test fails because I have a list comprehension calling the plotting functions and not assigning to anything. How much do we care about codefactor being unhappy with something like that?

@widdowquinn
Copy link
Owner

The codefactor test fails because I have a list comprehension calling the plotting functions and not assigning to anything. How much do we care about codefactor being unhappy with something like that?

I'm not especially bothered about codefactor's view on that, but I am curious what the list comprehension brings to the code over, say, a for loop. Is it more legible or efficient this way?

@widdowquinn
Copy link
Owner

If you demonstrated the speedup, it would be useful to note the values (on your system, and others if you tried) here.

@baileythegreen
Copy link
Contributor Author

I haven't run timed tests on this yet; that's on the to-do list.

There is a slight efficiency bump that comes with the comprehension statements over a conventional for loop; I think something about how it checks of there is still more to do? Likely small in this case, and I'll have to spend some time looking for where I learned that, if you want a source.

@widdowquinn
Copy link
Owner

If you're doing timed tests, you won't need a source - the data will show it ;)

@baileythegreen
Copy link
Contributor Author

@widdowquinn Here are some times for plotting the output of comparisons between 49 genomes:

iteration method time (s)
for loop 50.251
list comp. + mp 28.069
for loop + mp 28.445
gen. exp. + mp  30.613

The main difference is clearly the use of multiprocessing. I believe the list comprehension is slightly faster than the for loop because it is implemented directly in C and the for loop has some Python overhead, but I still haven't found where I learned this, so can't explain it better right now. The speed-up is not huge here; with a list comprehension, there is also some amount of overhead that goes along with the generation of the list; this is of finite size in this case, because the number of plots is not dependent on the number of genomes; it can be avoided with a generator expression, but that comes with some additional lines of code to catch the end exception.

For loop:

for func, args in plotting_commands:
    pool.apply_async(func, args, {})

List comprehension:

[pool.apply_async(func, args, {}) for func, args in plotting_commands]

Generator expression:

plots = (pool.apply_async(func, args, {}) for func, args in plotting_commands)
while True:
    try:
        next(plots)
    except StopIteration:
        break

The for loop is probably the more widely known construction, easier for most people to read, though I don't think the list comprehension is too bad in this case. If we are concerned with readability, I would focus more on this that directly precedes the generation of the plotting commands:

    for matdata in [
        MatrixData(*_)
        for _ in [
            ("identity", pd.read_json(results.df_identity), {}),
            ("coverage", pd.read_json(results.df_coverage), {}),
            ("aln_lengths", pd.read_json(results.df_alnlength), {}),
            ("sim_errors", pd.read_json(results.df_simerrors), {}),
            ("hadamard", pd.read_json(results.df_hadamard), {}),
        ]
    ]:

It was a while before I registered that that was a for loop containing a nested list comprehension.

@widdowquinn
Copy link
Owner

widdowquinn commented Oct 19, 2021

The list comp/for loop time difference doesn't look hugely significant for a single run. The difference could accumulate for larger datasets, though. I assume this is an average over 3-5 (or more) runs, so the variance could be informative.

The generator comprehension overhead is possibly due to the generator not calling pool.apply_async() until the while loop happens - I'd expect that extra while loop to be where the overhead is.

I'm not sure that last structure is a nested list comprehension, though. The form is a for loop operating on a list comprehension:

for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]:
    do_something(item)

A nested list comprehension would be:

[do_something(item) for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]

As a rule, if there's no compelling justification otherwise, for clarity it may be best to reserve list comprehensions for those cases where the result of the operation needs to return a list as the result of the operations. Otherwise there's the nagging feeling that an assignment is missing. A standard for loop might be more readable where no returned list is needed. I'm sure I've broken the rule more than once, though - and probably didn't note my justification at the time, either.

@widdowquinn
Copy link
Owner

The multiprocessing speed-up is a good win, though - nice work!

@baileythegreen baileythegreen marked this pull request as ready for review November 29, 2021 18:23
@baileythegreen baileythegreen changed the title Issue 292 Issue 292: Parallelise plot creation using multiprocessing Nov 30, 2021
@baileythegreen baileythegreen merged commit 4dd4f83 into master Nov 30, 2021
@baileythegreen baileythegreen deleted the issue_292 branch May 24, 2022 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance the issue relates to making pyani more efficient visualisation issues relating to plot outputs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use multiprocessing to speed up plot file output creation
2 participants