[Enhancement] Add cluster idle HTTP api #53850

Merged (6 commits) on Dec 17, 2024

Contributor

@gengjun-git gengjun-git commented Dec 11, 2024

Why I'm doing:

Add a cluster idle API to help determine whether the cluster is idle.

What I'm doing:

Add a new daemon thread, WarehouseIdleChecker, that tracks the number of tasks currently running and the time at which the last task ended. If no tasks are running and the time since the last task ended exceeds Config.warehouse_idle_check_interval_seconds, the system is considered idle.

StmtExecutor is the entry point for all SQL execution in the system, so for synchronous SQL it is enough to count the statements executing in StmtExecutor. Asynchronous SQL, such as broker load, has to be counted separately. The asynchronously executed tasks that are counted are: Stream Load, Broker Load, Spark Load, Routine Load, Backup/Restore, and Schema Change.

SQL executed internally by the system, such as statistics collection, does not need to be counted. A new isInternal field in StmtExecutor indicates whether a statement was initiated internally or by the user.
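
The idle rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's WarehouseIdleChecker: the class `IdleTrackerSketch` and its methods are hypothetical, it tracks a single counter rather than one per task type, and it assumes the start time passed to the constructor stands in for the "last finish time" until a task actually finishes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the idle rule; not the PR's WarehouseIdleChecker.
class IdleTrackerSketch {
    private final AtomicLong running = new AtomicLong(0);
    private final AtomicLong lastFinishMs;
    private final long idleIntervalMs;

    // Assumption: the start time is used as the initial "last finish time".
    IdleTrackerSketch(long idleIntervalSeconds, long startMs) {
        this.idleIntervalMs = idleIntervalSeconds * 1000L;
        this.lastFinishMs = new AtomicLong(startMs);
    }

    // Called when a statement or job starts; internal work is not counted.
    void onStart(boolean isInternal) {
        if (!isInternal) {
            running.incrementAndGet();
        }
    }

    // Called when a statement or job ends; records the finish time.
    void onFinish(boolean isInternal, long nowMs) {
        if (!isInternal) {
            running.decrementAndGet();
            lastFinishMs.set(nowMs);
        }
    }

    // Idle when nothing is running and the last finish is older than the interval.
    boolean isIdle(long nowMs) {
        return running.get() == 0 && nowMs - lastFinishMs.get() > idleIntervalMs;
    }
}
```

In this sketch, a caller would wrap each user-initiated statement or job in `onStart(false)` / `onFinish(false, now)`, while internal work such as statistics collection would pass `true` and leave the counters untouched.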

api request
/api/idle_status

api response

{
    "isClusterIdle": false,
    "clusterIdleTime": -1,
    "warehouses": [
        {
            "id": 0,
            "name": "default_warehouse",
            "isIdle": false,
            "idleTime": -1
        }
    ]
}
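
A client could build the request for this endpoint along the following lines. This is a sketch only: the helper class `IdleStatusRequestSketch` is hypothetical, the host and port are placeholders for the FE HTTP address of a given deployment, and the use of GET is an assumption that is not stated in the description.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; host and port are placeholders supplied by the caller.
class IdleStatusRequestSketch {
    static HttpRequest build(String feHost, int httpPort) {
        // The path comes from the PR description; the GET method is an assumption.
        URI uri = URI.create("http://" + feHost + ":" + httpPort + "/api/idle_status");
        return HttpRequest.newBuilder(uri).GET().build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the body parsed as JSON with the fields shown in the response above.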

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@gengjun-git gengjun-git requested review from a team as code owners December 11, 2024 12:10
@gengjun-git gengjun-git changed the title [Feature] Add cluster idle HTTP api [Enhancement] Add cluster idle HTTP api Dec 11, 2024
if (warehouse == null) {
throw ErrorReportException.report(ErrorCode.ERR_UNKNOWN_WAREHOUSE, String.format("name: %s", warehouseName));
}
Warehouse warehouse = getWarehouse(warehouseName);

try {
long workerGroupId = selectWorkerGroupInternal(warehouse.getId()).orElse(StarOSAgent.DEFAULT_WORKER_GROUP_ID);

The most risky bug in this code is:
Potential null pointer dereference due to the use of getWarehouse which might return null.

You can modify the code like this:

public Warehouse getWarehouse(String warehouseName) {
    Warehouse warehouse = nameToWh.get(warehouseName);
    if (warehouse == null) {
        throw ErrorReportException.report(ErrorCode.ERR_UNKNOWN_WAREHOUSE, String.format("name: %s", warehouseName));
    }
    return warehouse;
}

public Warehouse getWarehouse(long warehouseId) {
    Warehouse warehouse = idToWh.get(warehouseId);
    if (warehouse == null) {
        throw ErrorReportException.report(ErrorCode.ERR_UNKNOWN_WAREHOUSE, String.format("id: %d", warehouseId));
    }
    return warehouse;
}

This ensures that Warehouse is not null before proceeding with operations on it, thus preventing potential null pointer exceptions.

@HangyuanLiu HangyuanLiu self-assigned this Dec 12, 2024
@gengjun-git gengjun-git requested review from a team as code owners December 13, 2024 10:44
@gengjun-git gengjun-git force-pushed the add_cluster_idle_api branch 3 times, most recently from 40a1c79 to 3294a0a Compare December 13, 2024 15:33
@@ -396,6 +397,7 @@ protected void runRunningJob() throws AlterCancelException {
this.finishedTimeMs = System.currentTimeMillis();

GlobalStateMgr.getCurrentState().getEditLog().logAlterJob(this);
WarehouseIdleChecker.updateJobLastFinishTime(warehouseId);
Contributor

Would it be better to put this in AlterJobV2::run? Then you would not need to add it to each job separately.

Contributor Author

done.
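
The suggestion amounts to moving the hook into the shared entry point of the base class. A minimal sketch of that shape follows; `AlterJobSketch` and `SchemaChangeJobSketch` are hypothetical stand-ins rather than the real AlterJobV2 hierarchy, and a list append stands in for the call to WarehouseIdleChecker.updateJobLastFinishTime.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the base alter job class.
abstract class AlterJobSketch {
    // Records which warehouses had a finish time reported (illustration only).
    static final List<Long> finishedWarehouses = new ArrayList<>();

    protected final long warehouseId;
    protected boolean finished = false;

    AlterJobSketch(long warehouseId) {
        this.warehouseId = warehouseId;
    }

    // Shared entry point: the finish-time hook lives here, once, for every subclass.
    final void run() {
        runRunningJob();
        if (finished) {
            // Stands in for WarehouseIdleChecker.updateJobLastFinishTime(warehouseId).
            finishedWarehouses.add(warehouseId);
        }
    }

    protected abstract void runRunningJob();
}

// Hypothetical subclass: it no longer needs its own call to the hook.
class SchemaChangeJobSketch extends AlterJobSketch {
    SchemaChangeJobSketch(long warehouseId) {
        super(warehouseId);
    }

    @Override
    protected void runRunningJob() {
        finished = true;
    }
}
```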

import com.starrocks.warehouse.IdleStatus;
import io.netty.handler.codec.http.HttpMethod;

public class IdleAction extends RestBaseAction {
Contributor

Please indicate what "Idle" refers to. The class name does not convey what the action returns.

Contributor Author

done

@@ -146,10 +151,7 @@ public List<Long> getAllComputeNodeIds(long warehouseId) {
}

private List<Long> getAllComputeNodeIds(long warehouseId, long workerGroupId) {
Warehouse warehouse = idToWh.get(warehouseId);
Contributor

Why do we need to remove the warehouse existence check?

Contributor Author

The check is already done in the getWarehouse function.

Signed-off-by: gengjun-git <[email protected]>
Signed-off-by: gengjun-git <[email protected]>
Signed-off-by: gengjun-git <[email protected]>
Seaven
Seaven previously approved these changes Dec 16, 2024
stephen-shelby
stephen-shelby previously approved these changes Dec 16, 2024
@@ -2755,6 +2755,9 @@ public class Config extends ConfigBase {
@ConfField(mutable = true)
public static int lake_warehouse_max_compute_replica = 3;

@ConfField(mutable = true, comment = "time interval to check whether warehouse is idle")
public static long warehouse_idle_check_interval_seconds = 60;
Contributor

Consider setting this to 0 in the open-source version to disable the check by default?

Contributor Author

Added a new config, warehouse_idle_check_enable, because warehouse_idle_check_interval_seconds is also used when checking the last job finish time, so it cannot double as the on/off switch.
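
The split between the switch and the interval might look roughly like this. The holder class `IdleConfigSketch` and both methods are hypothetical, and the `false` default for the switch is an assumption; only the two config names and the 60-second interval default come from the thread.

```java
// Hypothetical config holder; not the real Config class.
class IdleConfigSketch {
    // Assumption: disabled by default. The PR thread does not state the default.
    static volatile boolean warehouseIdleCheckEnable = false;
    // Default taken from the diff above.
    static volatile long warehouseIdleCheckIntervalSeconds = 60;

    // The daemon performs its check only when the switch is on.
    static boolean shouldCheck() {
        return warehouseIdleCheckEnable;
    }

    // The interval remains a pure threshold for last-finish-time comparisons.
    static boolean exceededInterval(long lastFinishMs, long nowMs) {
        return nowMs - lastFinishMs > warehouseIdleCheckIntervalSeconds * 1000L;
    }
}
```

Keeping the two settings separate means the threshold never has to take a sentinel value such as 0 to express "disabled".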

}

@Override
public void execute(BaseRequest request, BaseResponse response) {
Contributor

Should this API endpoint be protected by authentication?

Contributor Author

Not needed; the response contains no secret information.

Contributor

It could be a security concern, but I will defer to your decision.

Comment on lines 67 to 73
if (runningSQL.get() == 0
&& runningStreamLoad == 0
&& runningBrokerSparkLoad == 0
&& runningRoutineLoad == 0
&& runningBackupRestore == 0
&& runningAlterJob == 0
&& runningTask == 0
Contributor

Sum them and check whether the sum == 0?

Contributor Author

done
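
The suggested simplification could be written along these lines. The class and method names are hypothetical; the parameter names follow the counters in the snippet under review, and the sketch assumes none of the counters can go negative, since otherwise a zero sum would not imply that every counter is zero.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper illustrating the "sum and compare" form of the check.
class RunningCountSketch {
    static boolean nothingRunning(AtomicLong runningSQL,
                                  long runningStreamLoad,
                                  long runningBrokerSparkLoad,
                                  long runningRoutineLoad,
                                  long runningBackupRestore,
                                  long runningAlterJob,
                                  long runningTask) {
        // Assumes every counter is >= 0, so the sum is zero only if all are zero.
        long total = runningSQL.get()
                + runningStreamLoad
                + runningBrokerSparkLoad
                + runningRoutineLoad
                + runningBackupRestore
                + runningAlterJob
                + runningTask;
        return total == 0;
    }
}
```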

nshangyiming
nshangyiming previously approved these changes Dec 16, 2024
HangyuanLiu
HangyuanLiu previously approved these changes Dec 16, 2024
Signed-off-by: gengjun-git <[email protected]>
Signed-off-by: gengjun-git <[email protected]>

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

[FE Incremental Coverage Report]

fail : 123 / 228 (53.95%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/scheduler/TaskRunFIFOQueue.java | 0 | 10 | 00.00% | [250, 251, 253, 254, 255, 256, 257, 258, 259, 261] |
| 🔵 com/starrocks/backup/BackupHandler.java | 0 | 4 | 00.00% | [936, 937, 938, 939] |
| 🔵 com/starrocks/alter/AlterHandler.java | 0 | 6 | 00.00% | [210, 211, 212, 213, 215, 216] |
| 🔵 com/starrocks/load/routineload/RoutineLoadMgr.java | 0 | 8 | 00.00% | [830, 831, 833, 834, 835, 837, 839, 841] |
| 🔵 com/starrocks/scheduler/TaskRunScheduler.java | 0 | 12 | 00.00% | [226, 227, 228, 229, 230, 231, 232, 233, 234, 236, 238, 240] |
| 🔵 com/starrocks/load/streamload/StreamLoadMgr.java | 0 | 8 | 00.00% | [721, 723, 724, 725, 726, 728, 729, 731] |
| 🔵 com/starrocks/load/loadv2/SparkLoadJob.java | 0 | 4 | 00.00% | [840, 841, 845, 846] |
| 🔵 com/starrocks/load/loadv2/LoadMgr.java | 0 | 8 | 00.00% | [831, 832, 834, 835, 836, 838, 840, 842] |
| 🔵 com/starrocks/alter/AlterJobMgr.java | 0 | 4 | 00.00% | [629, 630, 631, 632] |
| 🔵 com/starrocks/statistic/FullStatisticsCollectJob.java | 0 | 1 | 00.00% | [249] |
| 🔵 com/starrocks/server/WarehouseManager.java | 4 | 8 | 50.00% | [92, 93, 142, 203] |
| 🔵 com/starrocks/backup/RestoreJob.java | 1 | 2 | 50.00% | [1871] |
| 🔵 com/starrocks/load/streamload/StreamLoadTask.java | 1 | 2 | 50.00% | [1219] |
| 🔵 com/starrocks/warehouse/WarehouseIdleChecker.java | 36 | 68 | 52.94% | [49, 50, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 76, 78, 79, 80, 81, 83, 85, 86, 130] |
| 🔵 com/starrocks/alter/AlterJobV2.java | 10 | 12 | 83.33% | [211, 226] |
| 🔵 com/starrocks/scheduler/SqlTaskRunProcessor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/scheduler/PartitionBasedMvRefreshProcessor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/sql/ast/UserVariable.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/statistic/HyperStatisticsCollectJob.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/statistic/ExternalFullStatisticsCollectJob.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/common/Config.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/qe/feedback/PlanAdvisorExecutor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/qe/StmtExecutor.java | 10 | 10 | 100.00% | [] |
| 🔵 com/starrocks/load/pipe/filelist/RepoExecutor.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/server/GlobalStateMgr.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/scheduler/TaskRunExecutor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/qe/ConnectContext.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/alter/RollupJobV2.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/warehouse/IdleStatus.java | 11 | 11 | 100.00% | [] |
| 🔵 com/starrocks/load/routineload/RoutineLoadJob.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/http/rest/IdleAction.java | 9 | 9 | 100.00% | [] |
| 🔵 com/starrocks/datacache/DataCacheSelectExecutor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/alter/OnlineOptimizeJobV2.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/connector/metadata/MetadataExecutor.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/statistic/StatisticExecutor.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/load/loadv2/BrokerLoadJob.java | 4 | 4 | 100.00% | [] |
| 🔵 com/starrocks/alter/LakeTableSchemaChangeJob.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/backup/BackupJob.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/statistic/StatisticsCollectJob.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/alter/SchemaChangeJobV2.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/http/HttpServer.java | 1 | 1 | 100.00% | [] |

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@andyziye andyziye disabled auto-merge December 17, 2024 02:21
@andyziye andyziye merged commit 6cd9fbc into StarRocks:main Dec 17, 2024
44 of 45 checks passed

@Mergifyio backport branch-3.4

@github-actions github-actions bot removed the 3.4 label Dec 17, 2024

@Mergifyio backport branch-3.3

@Mergifyio backport branch-3.2

Contributor

mergify bot commented Dec 17, 2024

backport branch-3.4

✅ Backports have been created

Contributor

mergify bot commented Dec 17, 2024

backport branch-3.3

✅ Backports have been created

Contributor

mergify bot commented Dec 17, 2024

backport branch-3.2

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Dec 17, 2024
(cherry picked from commit 6cd9fbc)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/common/Config.java
#	fe/fe-core/src/main/java/com/starrocks/statistic/HyperStatisticsCollectJob.java
mergify bot pushed a commit that referenced this pull request Dec 17, 2024
(cherry picked from commit 6cd9fbc)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/common/Config.java
#	fe/fe-core/src/main/java/com/starrocks/qe/feedback/PlanAdvisorExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/server/GlobalStateMgr.java
#	fe/fe-core/src/main/java/com/starrocks/statistic/HyperStatisticsCollectJob.java
mergify bot pushed a commit that referenced this pull request Dec 17, 2024
(cherry picked from commit 6cd9fbc)

# Conflicts:
#	fe/fe-core/src/main/java/com/starrocks/alter/LakeTableSchemaChangeJob.java
#	fe/fe-core/src/main/java/com/starrocks/alter/OnlineOptimizeJobV2.java
#	fe/fe-core/src/main/java/com/starrocks/alter/RollupJobV2.java
#	fe/fe-core/src/main/java/com/starrocks/alter/SchemaChangeJobV2.java
#	fe/fe-core/src/main/java/com/starrocks/backup/BackupJob.java
#	fe/fe-core/src/main/java/com/starrocks/backup/RestoreJob.java
#	fe/fe-core/src/main/java/com/starrocks/common/Config.java
#	fe/fe-core/src/main/java/com/starrocks/connector/metadata/MetadataExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/datacache/DataCacheSelectExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/load/routineload/RoutineLoadJob.java
#	fe/fe-core/src/main/java/com/starrocks/load/streamload/StreamLoadMgr.java
#	fe/fe-core/src/main/java/com/starrocks/load/streamload/StreamLoadTask.java
#	fe/fe-core/src/main/java/com/starrocks/qe/ConnectContext.java
#	fe/fe-core/src/main/java/com/starrocks/qe/StmtExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/qe/feedback/PlanAdvisorExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/server/GlobalStateMgr.java
#	fe/fe-core/src/main/java/com/starrocks/server/WarehouseManager.java
#	fe/fe-core/src/main/java/com/starrocks/sql/util/CustomizedQueryExecutor.java
#	fe/fe-core/src/main/java/com/starrocks/statistic/HyperStatisticsCollectJob.java
#	fe/fe-core/src/test/java/com/starrocks/scheduler/MVRefreshTestBase.java
#	fe/fe-core/src/test/java/com/starrocks/sql/optimizer/rule/transformation/materialization/MvRewriteTestBase.java
wanpengfei-git pushed a commit that referenced this pull request Dec 17, 2024
wanpengfei-git pushed a commit that referenced this pull request Dec 19, 2024
@gengjun-git
Contributor Author

ignore backport check: 3.2.14
