[Ballista] Support to better manage cluster state, like alive executors, executor available task slots, etc #1703

Closed
yahoNanJing opened this issue Jan 29, 2022 · 0 comments · Fixed by #1810
Labels
enhancement New feature or request

yahoNanJing commented Jan 29, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, all cluster state, such as executor info and task info, is stored in the sled db, and a global lock is used to handle concurrency. Not only is the serialization and deserialization cost large, but the global lock also becomes a bottleneck when hundreds of thousands of tasks need to be handled.

Describe the solution you'd like

The scheduler mainly maintains two kinds of state: one relates to executors and the other relates to jobs. Each kind has stable parts and volatile parts, and state with different stability is better handled in different ways:

  • Stable state
    We may still store it in the sled db as the ground truth, which helps with fast recovery. However, it is better to also cache it in memory to reduce the serialization and deserialization cost.
  • Volatile state
    It is better not to store it in the db and to keep it only in memory. When the scheduler restarts, this volatile cluster state will be lost.
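
To make the split above concrete, here is a minimal sketch, not the actual Ballista implementation. It assumes a hypothetical `StableStore` trait standing in for the sled db (backed here by an in-memory `FakeStore` so the example is self-contained), a toy `|`-separated encoding instead of the real serialization, and hypothetical names `SchedulerState`, `register_executor`, and `take_slot`. Stable executor metadata is written through to the store and cached decoded in memory, while available task slots live only in memory.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the persistent store (the sled db in this issue).
pub trait StableStore: Send + Sync {
    fn put(&self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory fake, used only to keep the sketch self-contained.
#[derive(Default)]
pub struct FakeStore {
    inner: RwLock<HashMap<String, Vec<u8>>>,
}

impl StableStore for FakeStore {
    fn put(&self, key: &str, value: Vec<u8>) {
        self.inner.write().unwrap().insert(key.to_string(), value);
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.inner.read().unwrap().get(key).cloned()
    }
}

/// Stable executor state: identification info plus total task slots.
#[derive(Clone, Debug, PartialEq)]
pub struct ExecutorMeta {
    pub id: String,
    pub host: String,
    pub port: u16,
    pub grpc_port: u16,
    pub total_task_slots: u32,
}

impl ExecutorMeta {
    // Toy encoding (assumption); a real scheduler would use its own format.
    fn encode(&self) -> Vec<u8> {
        format!(
            "{}|{}|{}|{}|{}",
            self.id, self.host, self.port, self.grpc_port, self.total_task_slots
        )
        .into_bytes()
    }

    fn decode(bytes: &[u8]) -> Option<Self> {
        let text = std::str::from_utf8(bytes).ok()?;
        let mut parts = text.split('|');
        Some(Self {
            id: parts.next()?.to_string(),
            host: parts.next()?.to_string(),
            port: parts.next()?.parse().ok()?,
            grpc_port: parts.next()?.parse().ok()?,
            total_task_slots: parts.next()?.parse().ok()?,
        })
    }
}

pub struct SchedulerState {
    store: Arc<dyn StableStore>,
    // Decoded cache of stable state: reads skip deserialization.
    executors: RwLock<HashMap<String, ExecutorMeta>>,
    // Volatile state: never persisted, lost on restart.
    available_slots: RwLock<HashMap<String, u32>>,
}

impl SchedulerState {
    pub fn new(store: Arc<dyn StableStore>) -> Self {
        Self {
            store,
            executors: RwLock::new(HashMap::new()),
            available_slots: RwLock::new(HashMap::new()),
        }
    }

    /// Write-through: persist the stable part, then update the in-memory maps.
    pub fn register_executor(&self, meta: ExecutorMeta) {
        self.store.put(&meta.id, meta.encode());
        self.available_slots
            .write()
            .unwrap()
            .insert(meta.id.clone(), meta.total_task_slots);
        self.executors.write().unwrap().insert(meta.id.clone(), meta);
    }

    /// Serve from the cache; fall back to the store on a miss (e.g. after a restart).
    pub fn executor(&self, id: &str) -> Option<ExecutorMeta> {
        let cached = self.executors.read().unwrap().get(id).cloned();
        if let Some(meta) = cached {
            return Some(meta);
        }
        let meta = ExecutorMeta::decode(&self.store.get(id)?)?;
        self.executors
            .write()
            .unwrap()
            .insert(id.to_string(), meta.clone());
        Some(meta)
    }

    /// Reserve one slot; touches only in-memory state, no db round trip.
    pub fn take_slot(&self, id: &str) -> bool {
        let mut slots = self.available_slots.write().unwrap();
        if let Some(n) = slots.get_mut(id) {
            if *n > 0 {
                *n -= 1;
                return true;
            }
        }
        false
    }

    pub fn available_slots(&self, id: &str) -> Option<u32> {
        self.available_slots.read().unwrap().get(id).copied()
    }
}
```

In this sketch, a scheduler restarted on the same store can still recover the executor metadata, but its available-slot count is unknown until the executor reports in again. Separate locks per map are one possible way to avoid a single global lock; finer-grained schemes are equally plausible.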

The following lists which state is stable and which is volatile:

  • Stable
    • Executors
      • Identification info
        • id
        • host
        • port
        • grpc_port
      • Resources
        • total task slots
    • Jobs
      • Job
        • metadata
        • status
      • Stage
        • plan
        • status
  • Volatile
    • Executors
      • Liveness Info
        • heartbeat timestamp
      • Internal state
        • memory usage
      • Available resources
        • available task slots
    • Jobs
      • Task
        • definition
        • status
      • Additional counters
        • pending tasks for each stage
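
As an illustration of the volatile half of the classification above, the following is a minimal sketch of an in-memory structure holding heartbeat timestamps, memory usage, available task slots, and per-stage pending task counters. All type, field, and method names here (`VolatileClusterState`, `heartbeat`, `alive_executors`, and so on) are hypothetical, as are the byte unit for memory usage and the `(job_id, stage_id)` key; none of them come from the Ballista code base.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Volatile per-executor state, kept only in memory.
#[derive(Debug, Clone)]
pub struct ExecutorVolatile {
    pub last_heartbeat: Instant,
    pub memory_usage_bytes: u64,
    pub available_task_slots: u32,
}

#[derive(Default)]
pub struct VolatileClusterState {
    executors: HashMap<String, ExecutorVolatile>,
    // Additional counter: pending tasks keyed by (job_id, stage_id).
    pending: HashMap<(String, u32), usize>,
}

impl VolatileClusterState {
    /// Record a heartbeat, replacing the previous volatile snapshot.
    pub fn heartbeat(
        &mut self,
        executor_id: &str,
        now: Instant,
        memory_usage_bytes: u64,
        available_task_slots: u32,
    ) {
        self.executors.insert(
            executor_id.to_string(),
            ExecutorVolatile {
                last_heartbeat: now,
                memory_usage_bytes,
                available_task_slots,
            },
        );
    }

    /// Executors whose last heartbeat is within `timeout` of `now`, sorted by id.
    pub fn alive_executors(&self, now: Instant, timeout: Duration) -> Vec<String> {
        let mut ids: Vec<String> = self
            .executors
            .iter()
            .filter(|(_, e)| now.saturating_duration_since(e.last_heartbeat) <= timeout)
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }

    /// Sum of available task slots across alive executors only.
    pub fn total_available_slots(&self, now: Instant, timeout: Duration) -> u32 {
        self.executors
            .values()
            .filter(|e| now.saturating_duration_since(e.last_heartbeat) <= timeout)
            .map(|e| e.available_task_slots)
            .sum()
    }

    pub fn set_pending_tasks(&mut self, job_id: &str, stage_id: u32, count: usize) {
        self.pending.insert((job_id.to_string(), stage_id), count);
    }

    /// Pending tasks for a stage; unknown stages count as zero.
    pub fn pending_tasks(&self, job_id: &str, stage_id: u32) -> usize {
        self.pending
            .get(&(job_id.to_string(), stage_id))
            .copied()
            .unwrap_or(0)
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the methods, is a design choice of this sketch that keeps the liveness check easy to test. Because nothing here is persisted, a restarted scheduler would start with an empty structure and repopulate it from incoming heartbeats.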