Issue 292: Parallelise plot creation using multiprocessing
#346
Codecov Report
@@ Coverage Diff @@
## master #346 +/- ##
==========================================
+ Coverage 76.26% 76.33% +0.06%
==========================================
Files 52 52
Lines 3388 3398 +10
==========================================
+ Hits 2584 2594 +10
Misses 804 804 |
The codefactor test fails because I have a list comprehension calling the plotting functions and not assigning to anything. How much do we care about codefactor being unhappy with something like that? |
I'm not especially bothered about codefactor's view on that, but I am curious what the list comprehension brings to the code over, say, a for loop. Is it more legible or efficient this way? |
If you demonstrated the speedup, it would be useful to note the values (on your system, and others if you tried) here. |
I haven't run timed tests on this yet; that's on the to-do list. There is a slight efficiency bump that comes with comprehensions over a conventional for loop; I think it has something to do with how it checks whether there is still more to do. It is likely small in this case, and I'll have to spend some time looking for where I learned that, if you want a source. |
If you're doing timed tests, you won't need a source - the data will show it ;) |
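One way to collect such timings is sketched below with the standard-library `timeit` and `statistics` modules. The `submit` function and `ITEMS` list are hypothetical stand-ins rather than pyani code; the sketch only shows how to report a mean and spread over repeated runs instead of a single measurement.

```python
import statistics
import timeit


def submit(item):
    """Hypothetical stand-in for a cheap call such as pool.apply_async."""
    return item


# Hypothetical workload; the real list of plotting commands is much shorter.
ITEMS = list(range(1000))


def with_for_loop():
    for item in ITEMS:
        submit(item)


def with_list_comprehension():
    [submit(item) for item in ITEMS]


def time_variant(func, repeats=5, number=200):
    """Return (mean, stdev) in seconds over `repeats` runs of `number` calls."""
    runs = timeit.repeat(func, repeat=repeats, number=number)
    return statistics.mean(runs), statistics.stdev(runs)


for variant in (with_for_loop, with_list_comprehension):
    mean, stdev = time_variant(variant)
    print(f"{variant.__name__}: {mean:.4f}s +/- {stdev:.4f}s")
```

The absolute numbers depend on the machine and Python version, so it is probably worth noting both alongside any results posted here.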
@widdowquinn Here are some times for plotting the output of comparisons between 49 genomes:
The main difference is clearly the use of multiprocessing. I believe the list comprehension is slightly faster than the for loop because it is implemented directly in C and the for loop has some Python overhead, but I still haven't found where I learned this, so can't explain it better right now. The speed-up is not huge here. With a list comprehension there is also some overhead from generating the list; the list is of fixed size in this case, because the number of plots does not depend on the number of genomes. That overhead can be avoided with a generator expression, but that comes with some additional lines of code to catch the end exception.

For loop:

    for func, args in plotting_commands:
        pool.apply_async(func, args, {})

List comprehension:

    [pool.apply_async(func, args, {}) for func, args in plotting_commands]

Generator expression:

    plots = (pool.apply_async(func, args, {}) for func, args in plotting_commands)
    while True:
        try:
            next(plots)
        except StopIteration:
            break

The for loop is probably the more widely known construction and easier for most people to read, though I don't think the list comprehension is too bad in this case. If we are concerned with readability, I would focus more on the code that directly precedes the generation of the plotting commands:

    for matdata in [
        MatrixData(*_)
        for _ in [
            ("identity", pd.read_json(results.df_identity), {}),
            ("coverage", pd.read_json(results.df_coverage), {}),
            ("aln_lengths", pd.read_json(results.df_alnlength), {}),
            ("sim_errors", pd.read_json(results.df_simerrors), {}),
            ("hadamard", pd.read_json(results.df_hadamard), {}),
        ]
    ]:

It was a while before I registered that that was a for loop containing a nested list comprehension. |
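For comparison, here is a minimal sketch of the for-loop form that also keeps the `AsyncResult` handles, so that `.get()` can wait for each job and re-raise any exception from a worker. It assumes a hypothetical `draw_plot` function and swaps in `multiprocessing.pool.ThreadPool`, which shares the `apply_async` interface, purely so the sketch runs anywhere without pickling or main-guard concerns; the PR itself uses a process pool.

```python
from multiprocessing.pool import ThreadPool


def draw_plot(name, scale):
    """Hypothetical stand-in for a plotting function; returns a fake path."""
    if scale <= 0:
        raise ValueError(f"scale must be positive for {name}")
    return f"{name}_x{scale}.png"


def run_plotting_commands(pool, plotting_commands):
    """Submit every (func, args) pair, then wait for and return all results."""
    # Keeping the AsyncResult handles lets us call .get() later, which
    # blocks until the job finishes and re-raises any worker exception.
    pending = []
    for func, args in plotting_commands:
        pending.append(pool.apply_async(func, args, {}))
    return [result.get() for result in pending]


# Hypothetical commands, in the same (func, args) shape as in the thread.
plotting_commands = [
    (draw_plot, ("identity", 1)),
    (draw_plot, ("coverage", 2)),
    (draw_plot, ("hadamard", 3)),
]

with ThreadPool(processes=2) as pool:
    outputs = run_plotting_commands(pool, plotting_commands)

print(outputs)  # ['identity_x1.png', 'coverage_x2.png', 'hadamard_x3.png']
```

Results come back in submission order because `.get()` is called on the handles in that order. A generator expression can also be drained with a plain `for` loop, which handles `StopIteration` implicitly, so the explicit `while`/`try` block is not strictly required.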
The list comp/for loop time difference doesn't look hugely significant for a single run. The difference could accumulate for larger datasets, though. I assume this is an average over 3-5 (or more) runs, so the variance could be informative. The generator expression overhead is possibly due to the generator not calling

I'm not sure that last structure is a nested list comprehension, though. The form is a for loop operating on a list comprehension:

    for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]:
        do_something(item)

A nested list comprehension would be:

    [do_something(item) for item in [output(_) for _ in [tuple1, tuple2, ..., tuple4]]]

As a rule, unless there's a compelling justification otherwise, it may be clearest to reserve list comprehensions for cases where the resulting list is actually needed. Otherwise there's the nagging feeling that an assignment is missing. A standard for loop might be more readable where no returned list is needed. I'm sure I've broken the rule more than once, though - and probably didn't note my justification at the time, either. |
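A sketch of how that rule might apply to the `MatrixData` loop: build the list in one named step, then run a plain for loop over it. The `MatrixData` namedtuple, its field names, and the small nested lists below are assumptions standing in for pyani's own class and the dataframes read from the results.

```python
from collections import namedtuple

# Hypothetical stand-in for pyani's MatrixData; field names are assumptions.
MatrixData = namedtuple("MatrixData", "name data graphic_args")

# Hypothetical stand-ins for the dataframes loaded with pd.read_json.
raw_matrices = [
    ("identity", [[1.0, 0.9], [0.9, 1.0]], {}),
    ("coverage", [[1.0, 0.8], [0.8, 1.0]], {}),
]

# Step 1: a list comprehension, because the list itself is the result.
matrices = [MatrixData(*fields) for fields in raw_matrices]

# Step 2: a plain for loop for the work done per matrix. Recording the
# shape here is only a placeholder for queueing the plotting commands.
shapes = {}
for matdata in matrices:
    shapes[matdata.name] = (len(matdata.data), len(matdata.data[0]))

print(shapes)  # {'identity': (2, 2), 'coverage': (2, 2)}
```

Naming the intermediate list costs one extra line but makes it obvious where the comprehension ends and the loop begins.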
The new name will be less confusing because this function also runs the code that generates the distribution and scatter plots.
For generating trees
multiprocessing
Uses multiprocessing to speed up plot generation. See: multiprocessing to speed up plot file output creation #292. Closes #292.