Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, all of the cluster state, such as executor info and task info, is stored in the sled db, and a global lock is used to handle concurrency. This has two costs: the serialization and deserialization overhead is large, and the global lock becomes a bottleneck when hundreds of thousands of tasks need to be handled.
Describe the solution you'd like
The scheduler mainly maintains two kinds of state: state related to executors and state related to jobs. Each kind has both stable and volatile parts, and state with different stability is better handled in different ways:
Stable state
We may still store stable state in the sled db as the ground truth, which helps with fast recovery. However, it is better to also cache it in memory to reduce the serialization and deserialization cost.
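A minimal sketch of this write-through caching. It assumes a hypothetical `PersistentStore` trait standing in for sled (it is not sled's API), an in-memory `FakeStore` so the sketch runs without a sled dependency, and `String` values whose byte encoding plays the role of serialization. Writes go to both the store and the cache; reads are served from the cache only:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the persistent store (sled in this issue).
pub trait PersistentStore {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn scan(&self) -> Vec<(String, Vec<u8>)>;
}

/// In-memory fake so the sketch runs without sled; counts writes.
#[derive(Default)]
pub struct FakeStore {
    pub data: HashMap<String, Vec<u8>>,
    pub writes: usize,
}

impl PersistentStore for FakeStore {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.writes += 1;
        self.data.insert(key.to_string(), value);
    }

    fn scan(&self) -> Vec<(String, Vec<u8>)> {
        self.data.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }
}

/// Stable state cached in memory and written through to the store.
pub struct StableStateCache<S: PersistentStore> {
    store: S,
    cache: HashMap<String, String>,
}

impl<S: PersistentStore> StableStateCache<S> {
    /// Rebuild the cache from the store with one scan (e.g. after a restart).
    pub fn recover(store: S) -> Self {
        let cache: HashMap<String, String> = store
            .scan()
            .into_iter()
            .map(|(k, v)| (k, String::from_utf8(v).expect("valid UTF-8")))
            .collect();
        Self { store, cache }
    }

    /// Serialize once on write and keep the decoded value in memory.
    pub fn put(&mut self, key: &str, value: String) {
        self.store.put(key, value.clone().into_bytes());
        self.cache.insert(key.to_string(), value);
    }

    /// Reads never touch the store, so no deserialization happens here.
    pub fn get(&self, key: &str) -> Option<&String> {
        self.cache.get(key)
    }

    /// Hand the store back, e.g. to simulate a restart in a test.
    pub fn into_store(self) -> S {
        self.store
    }
}
```

The store is only read in `recover`, which is the fast-recovery path the stable state is persisted for; during normal operation every read is a hash map lookup.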
Volatile state
It is better not to store volatile state in the db and to keep it only in memory. When the scheduler restarts, this volatile cluster state will be lost.
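A minimal sketch of in-memory-only executor state, assuming hypothetical names and caller-supplied timestamps in seconds. As one possible way to avoid the single global lock described above, it takes a write lock only when an executor registers, while heartbeats and slot reservations take a shared read lock and update atomics. This fine-grained design is an assumption, not something the proposal specifies:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::sync::RwLock;

/// Volatile per-executor state: never written to the persistent store.
pub struct ExecutorLiveState {
    last_heartbeat_secs: AtomicU64,
    available_slots: AtomicU32,
}

/// The RwLock guards only the set of registered executors; per-executor
/// updates go through atomics under a shared read lock.
#[derive(Default)]
pub struct VolatileExecutorState {
    executors: RwLock<HashMap<String, ExecutorLiveState>>,
}

impl VolatileExecutorState {
    /// Register an executor with all of its task slots available.
    pub fn register(&self, id: &str, total_slots: u32, now_secs: u64) {
        let live = ExecutorLiveState {
            last_heartbeat_secs: AtomicU64::new(now_secs),
            available_slots: AtomicU32::new(total_slots),
        };
        self.executors.write().unwrap().insert(id.to_string(), live);
    }

    /// Record a heartbeat; returns false for an unknown executor.
    pub fn heartbeat(&self, id: &str, now_secs: u64) -> bool {
        let executors = self.executors.read().unwrap();
        match executors.get(id) {
            Some(live) => {
                live.last_heartbeat_secs.store(now_secs, Ordering::Relaxed);
                true
            }
            None => false,
        }
    }

    /// Atomically take one free slot; returns false if none are left.
    pub fn try_reserve_slot(&self, id: &str) -> bool {
        let executors = self.executors.read().unwrap();
        match executors.get(id) {
            Some(live) => live
                .available_slots
                .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
                .is_ok(),
            None => false,
        }
    }

    /// Slots currently free on an executor, if it is registered.
    pub fn available_slots(&self, id: &str) -> Option<u32> {
        let executors = self.executors.read().unwrap();
        executors
            .get(id)
            .map(|live| live.available_slots.load(Ordering::Acquire))
    }

    /// Executors whose last heartbeat is at most `timeout_secs` old (sorted).
    pub fn alive_executors(&self, now_secs: u64, timeout_secs: u64) -> Vec<String> {
        let executors = self.executors.read().unwrap();
        let mut ids: Vec<String> = executors
            .iter()
            .filter(|(_, live)| {
                let last = live.last_heartbeat_secs.load(Ordering::Relaxed);
                now_secs.saturating_sub(last) <= timeout_secs
            })
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

A restarted scheduler would simply start from an empty `VolatileExecutorState::default()` and repopulate it as executors register and send heartbeats again.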
The following lists which state is stable and which is volatile:
Stable:
- Executors
  - Identification info
    - id
    - host
    - port
    - grpc_port
  - Resources
    - total task slots
- Jobs
  - Job
    - metadata
    - status
  - Stage
    - plan
    - status
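A minimal sketch of the stable entries as plain Rust types. All names are hypothetical, the `name` field is a placeholder for job metadata, the stage plan is assumed to be stored as serialized bytes, and the `executors/...` and `jobs/...` keys are a made-up layout for the store. Heartbeats and free slots deliberately do not appear, since they are volatile:

```rust
/// Stable executor identification info.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExecutorMetadata {
    pub id: String,
    pub host: String,
    pub port: u16,
    pub grpc_port: u16,
}

/// Stable executor resources: the total only, not what is currently free.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ExecutorSpec {
    pub total_task_slots: u32,
}

/// Hypothetical status shared by jobs and stages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Queued,
    Running,
    Completed,
    Failed,
}

/// Stable stage state: its plan and status.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Stage {
    pub stage_id: usize,
    pub plan: Vec<u8>, // serialized plan; the encoding is an assumption
    pub status: Status,
}

/// Stable job state: metadata, status and stages.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Job {
    pub job_id: String,
    pub name: String, // placeholder for job metadata
    pub status: Status,
    pub stages: Vec<Stage>,
}

impl ExecutorMetadata {
    /// Key under a hypothetical store layout.
    pub fn storage_key(&self) -> String {
        format!("executors/{}", self.id)
    }
}

impl Job {
    /// Key under a hypothetical store layout.
    pub fn storage_key(&self) -> String {
        format!("jobs/{}", self.job_id)
    }

    /// Key for one of this job's stages under the same layout.
    pub fn stage_key(&self, stage_id: usize) -> String {
        format!("jobs/{}/stages/{}", self.job_id, stage_id)
    }
}
```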
Volatile:
- Executors
  - Liveness info
    - heartbeat timestamp
  - Internal state
    - memory usage
  - Available resources
    - available task slots
- Jobs
  - Task
    - definition
    - status
  - Additional counters
    - pending tasks for each stage
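On the job side, the volatile entries could be kept as in the following minimal sketch. The names (`TaskId`, `TaskStatus`, `VolatileJobState`) are hypothetical, a task's definition is reduced to its (job, stage, partition) id, and the per-stage pending counter is updated on every status change, so the scheduler can ask how many tasks are pending in a stage without scanning all tasks:

```rust
use std::collections::HashMap;

/// Hypothetical task identity; stands in for the full task definition.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TaskId {
    pub job_id: String,
    pub stage_id: usize,
    pub partition: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

/// Volatile job-side state, kept only in memory.
#[derive(Default)]
pub struct VolatileJobState {
    tasks: HashMap<TaskId, TaskStatus>,
    // Additional counter: pending tasks per (job_id, stage_id).
    pending_per_stage: HashMap<(String, usize), usize>,
}

impl VolatileJobState {
    /// Add or reset a task to Pending; re-adding a pending task is not double-counted.
    pub fn add_pending(&mut self, id: TaskId) {
        let stage = (id.job_id.clone(), id.stage_id);
        if self.tasks.insert(id, TaskStatus::Pending) != Some(TaskStatus::Pending) {
            *self.pending_per_stage.entry(stage).or_insert(0) += 1;
        }
    }

    /// Update a task's status and keep the pending counter in sync.
    /// Returns false if the task is unknown.
    pub fn set_status(&mut self, id: &TaskId, status: TaskStatus) -> bool {
        let current = match self.tasks.get_mut(id) {
            Some(current) => current,
            None => return false,
        };
        let was_pending = *current == TaskStatus::Pending;
        *current = status;
        let stage = (id.job_id.clone(), id.stage_id);
        let counter = self.pending_per_stage.entry(stage).or_insert(0);
        match (was_pending, status == TaskStatus::Pending) {
            (true, false) => *counter = counter.saturating_sub(1),
            (false, true) => *counter += 1,
            _ => {}
        }
        true
    }

    /// Number of tasks still pending in one stage of a job.
    pub fn pending_in_stage(&self, job_id: &str, stage_id: usize) -> usize {
        self.pending_per_stage
            .get(&(job_id.to_string(), stage_id))
            .copied()
            .unwrap_or(0)
    }
}
```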