-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MULTIPLE CHANGES to BUGFIX Dropped Futures #67
base: master
Are you sure you want to change the base?
Conversation
This is a surgical change in SCOOP that fixes many of its previous issues and finally makes it useful along with the communication error robustness
This still does not completely fix the problem as i have later found out. I will fix this over the weekend and add a description of the new logic in a file in the scoop root directory. |
1. Made the fault tolerance scheme a heartbeat scheme. 1. Each worker has a parallel process (not thread, see below) that sends the heartbeat every TIME_BETWEEN_HEARTBEATS. 2. The broker checks every TASK_CHECK_INTERVAL seconds for and worker with assigned jobs that does has not sent a heartbeat in the last TIME_BEFORE_LOSING_WORKER seconds. It then considers this worker dead and performs the following: 1. Cleans up all futures created by that worker from self.assigned_tasks and self.unassigned_tasks. 2. Requests a resend of all futures that were assigned to be executed on that worker (This could potentially fail if there is no connection to the host that spawned these processes. Need to fix this case) 3. Deletes entry of this future from the heartbeat times dict. 1. Made the local broker a python forked process instead of thread because running long C programs from pypthon causes python to be unable to switch threads 2. Made the sending of heartbeat a forked process due to point 2 3. Changed the scheduling to be FCFS: 1. Changed the updateQueue to remove the dependence on lowwatermark as that didnt make sense when we only fetch a job once one is completed (i.e. local queue is empty and greenlet has returned something). Also the time length only changes after having fetched a job. 2. Changed the container for self.available_workers to a set to prevent spamming of futureRequests which may occur if the UpdateQueue receives replies instead of futures after sending the future request. 4. Made control loop clearer and added comments explaining some of the underlying mechanisms 5. Made task sending safe against worker loss (sometimes workers are lost and before they can be removed from the qeue a task is assigned leading to a ZMQError).
Thanks for those, I'll review them this week and merge them as soon as possible. |
The following is a summary of changes made to scoop
|
Made the following changes 1. Added a new function append_init which is used by futures.submit. This function adds newly spawned futures to the broker's task queue. 2. Added new state variable request_in_process to the FutureQueue object in order to track the state of the future request. This is used to track the state of a future request so that a future request is sent iff the previous future request has been answered 3. Also, replaced function recvFutures with recvIncoming which passes the received incoming messages back to updateQueue where they are processed. Changed _recv to only perform preprocessing of the messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the line 329 of scoop/_types.py the function _delete is still used "sending_future._delete()", however this function was removed in this same commit. Program fails with the error: AttributeError: 'Future' object has no attribute "_delete"
Yeah, I had made a few more changes but not pushed it here. Will do today.
…On Thu, Jul 27, 2017 at 3:24 AM, Néstor F. Aguirre ***@***.*** > wrote:
***@***.**** requested changes on this pull request.
In the line 329 of scoop/_types.py the function _delete is still used
"sending_future._delete()", however this function was removed in this same
commit. Program fails with the error: AttributeError: 'Future' object has
no attribute "_delete"
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#67 (review)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AEXsfveb7my9RaZRSF0jaI1xYSwgw21sks5sR-bWgaJpZM4NzNlZ>
.
|
Hmm.. Its strange, I can't seem to find he part of code that you're talking about. Are you sure you're not using an older version of this? BTW, If you're running this on linux, you may be interested in trying out the IGITUGraz fork (you should use the branch TUGmaster). It has some system-dependent fixes that improve process cleanup during interruption and exceptions. |
I found the error in this pull request. Not in a specific branch. I was interested in testing your modifications: "Fixing case where the resend request could fail if source worker is lost" and "Made the local broker a python forked process instead of thread". Thanks! |
The thing is there are some commits after that in this pull request. titled as follows
search for the commit ID's on this page you should find them. These commits contain the fixes to the issues you mentioned and more. Enjoy! |
Interestingly, when I check the diff of ea14f32, I see that the very line you mention has been changed, so right now, I really have no clue where it is that you're seeing that patch of code. This is a link to the file at that commit (you should be able to see the commit ID in the link) If you go to the line you specified (329) you'll see that there is no usage of _delete PS: I think it would be easier for you if you simply pulled this branch into your local repo and tried it out. |
OK. I will try it. Thanks a lot ! |
This is a surgical change in SCOOP that fixes many of its previous issues and finally makes it useful along with the communication error robustness.
I have slightly changed the run control loop, sendResult, made the deletes and appends more explicit, and fixed #65 fixed #66 (which was already fixed by #63) and fixed #64.
All tests seem to be passing except the SLURM tests.
Also added some documentation to the code.