[Ballista] Support to better manage cluster state, like alive executors, executor available task slots, etc #1703

yahoNanJing · 2022-01-29T08:39:41Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, all cluster state, such as executor info and task info, is stored in the sled db, and a global lock is used to handle concurrency. The serialization and deserialization cost is high, and the global lock becomes a bottleneck when hundreds of thousands of tasks need to be handled.

Describe the solution you'd like

The scheduler mainly maintains two kinds of state: one relates to executors and the other relates to jobs. Each kind includes both stable and volatile state, and it is better to handle the two in different ways:

Stable state
We can still store it in the sled db as the ground truth, which helps with fast recovery. However, it is better to also cache it in memory to reduce the serialization and deserialization cost.
Volatile state
It is better not to store it in the db and to keep it only in memory. When the scheduler restarts, this volatile cluster state is lost.
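
To make the two policies concrete, here is a minimal sketch, not the actual Ballista implementation. It assumes a hypothetical `StableStore` trait standing in for the sled db, a toy `MemStore` fake, and a toy `|`-separated encoding; `SchedulerState`, `ExecutorMeta`, and every method name are illustrative only. Stable state is written through to the store and served from an in-memory cache, while heartbeats live only in memory.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};
use std::time::Instant;

/// Hypothetical stand-in for the persistent store (the sled db in this issue).
pub trait StableStore {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn scan(&self) -> Vec<(String, Vec<u8>)>;
}

/// In-memory fake so the sketch has no external dependencies.
#[derive(Default)]
pub struct MemStore {
    entries: HashMap<String, Vec<u8>>,
    pub writes: usize,
}

impl StableStore for MemStore {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
        self.writes += 1;
    }
    fn scan(&self) -> Vec<(String, Vec<u8>)> {
        self.entries.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }
}

/// Stable executor info. The '|'-separated encoding is a toy assumption.
#[derive(Clone, Debug, PartialEq)]
pub struct ExecutorMeta {
    pub host: String,
    pub port: u16,
    pub total_task_slots: u32,
}

impl ExecutorMeta {
    fn encode(&self) -> Vec<u8> {
        format!("{}|{}|{}", self.host, self.port, self.total_task_slots).into_bytes()
    }
    fn decode(bytes: &[u8]) -> Option<Self> {
        let text = std::str::from_utf8(bytes).ok()?;
        let mut parts = text.split('|');
        let host = parts.next()?.to_string();
        let port = parts.next()?.parse().ok()?;
        let total_task_slots = parts.next()?.parse().ok()?;
        Some(Self { host, port, total_task_slots })
    }
}

pub struct SchedulerState<S: StableStore> {
    store: Mutex<S>,                                  // ground truth for stable state
    executors: RwLock<HashMap<String, ExecutorMeta>>, // decoded cache of stable state
    heartbeats: RwLock<HashMap<String, Instant>>,     // volatile, memory only
}

impl<S: StableStore> SchedulerState<S> {
    /// Rebuild the stable cache from the store; volatile state starts empty.
    pub fn recover(store: S) -> Self {
        let executors = store
            .scan()
            .into_iter()
            .filter_map(|(id, bytes)| ExecutorMeta::decode(&bytes).map(|meta| (id, meta)))
            .collect();
        Self {
            store: Mutex::new(store),
            executors: RwLock::new(executors),
            heartbeats: RwLock::new(HashMap::new()),
        }
    }

    /// Write-through: persist first, then update the cache.
    pub fn save_executor(&self, id: &str, meta: ExecutorMeta) {
        self.store.lock().unwrap().put(id, meta.encode());
        self.executors.write().unwrap().insert(id.to_string(), meta);
    }

    /// Reads are served from the cache, so no deserialization is needed.
    pub fn executor(&self, id: &str) -> Option<ExecutorMeta> {
        self.executors.read().unwrap().get(id).cloned()
    }

    /// Heartbeats never touch the store.
    pub fn record_heartbeat(&self, id: &str) {
        self.heartbeats.write().unwrap().insert(id.to_string(), Instant::now());
    }

    pub fn last_heartbeat(&self, id: &str) -> Option<Instant> {
        self.heartbeats.read().unwrap().get(id).copied()
    }

    /// Hand the store back, simulating a scheduler shutdown.
    pub fn into_store(self) -> S {
        self.store.into_inner().unwrap()
    }
}
```

Calling `into_store` and then `recover` on the returned store simulates a scheduler restart: executor metadata comes back from the store, while heartbeat timestamps are gone until executors report again. Using one lock per map rather than a single global lock is a design assumption of this sketch.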

The following lists which state is stable and which is volatile:

Stable:
- Executors
  - Identification info
    - id
    - host
    - port
    - grpc_port
  - Resources
    - total task slots
- Jobs
  - Job
    - metadata
    - status
  - Stage
    - plan
    - status
Volatile:
- Executors
  - Liveness Info
    - heartbeat timestamp
  - Internal state
    - memory usage
  - Available resources
    - available task slots
- Jobs
  - Task
    - definition
    - status
  - Additional counters
    - pending tasks for each stage

yahoNanJing added the enhancement New feature or request label Jan 29, 2022

yahoNanJing mentioned this issue May 19, 2022

Ballista Enhancement Overview apache/datafusion-ballista#7

Open

15 tasks

andygrove added the ballista label Feb 5, 2022

yahoNanJing mentioned this issue Feb 11, 2022

Refactor scheduler state with different management policy for volatile and stable states #1810

Merged

alamb closed this as completed in #1810 Feb 16, 2022
