HLT and cloud switchover interface #61

smorovic · 2015-02-02T13:00:30Z

Making HLT nodes available to the cloud requires cooperation between the mechanism built into hltd for starting CMSSW jobs and a service that runs VM instances on the same machines.

An external tool, possibly integrated with the LevelZero FM, will control which FU nodes should stop HLT and switch to cloud mode. The tool will then contact hltd on those nodes directly, using the CGI interface.

From hltd version 1.6.0, an API is available for taking an FU out of HLT. Activation works in a similar way to new-run notifications: a CGI script creates a file in the watch directory (standard location is '/fff/data'). The request needs to be sent to hltd (by default on port 9000), using the following URL form: http://host:9000/cgi-bin/exclude_cgi.py
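As an illustration of how the external tool might issue this request, here is a minimal sketch. It assumes a plain HTTP GET with no parameters is enough to trigger the script; the function names, the timeout and the example hostname are hypothetical and not part of hltd.

```python
from urllib.request import urlopen

HLTD_PORT = 9000  # default hltd port, as stated above


def exclude_url(host, port=HLTD_PORT):
    """Build the URL of the exclude CGI script on one FU."""
    return "http://{}:{}/cgi-bin/exclude_cgi.py".format(host, port)


def request_exclude(host, port=HLTD_PORT, timeout=5.0):
    """Send the request (assumed: plain GET) and return the HTTP status code.

    The CGI call only drops a file into the watch directory, so a 200
    response means the request was accepted, not that the switch is done.
    """
    with urlopen(exclude_url(host, port), timeout=timeout) as response:
        return response.status
```

For a hypothetical node name, `exclude_url("fu-example")` gives `http://fu-example:9000/cgi-bin/exclude_cgi.py`.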

The FU then determines the last lumisection completed on the BU and signals the CMSSW processes to finish within two additional lumisections. During this time, hltd switches into "activatingCloud" mode and stops accepting new run start events from the BU hltd. A number of available resources are masked in the box file so that the BU stops requesting data for machines in switchover. Once the CMSSW jobs and local merge scripts have finished, all core resources are moved to /etc/appliance/resources/cloud and the FU finally switches to "cloud" mode, at which point virtual machines can run.
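Two steps of this sequence can be sketched as follows. This is only an illustration, not the hltd implementation: it assumes each core is represented by one file in a resource directory, and the source directories are passed in by the caller because their locations are not given here. Both function names are hypothetical.

```python
import os
import shutil

CLOUD_DIR = "/etc/appliance/resources/cloud"  # target directory named above


def final_lumisection(last_completed_ls, extra=2):
    """Last lumisection CMSSW may process: two beyond the last completed one."""
    return last_completed_ls + extra


def move_cores_to_cloud(source_dirs, cloud_dir=CLOUD_DIR):
    """Move every resource file from source_dirs into cloud_dir.

    Assumption: one file per core. Returns the moved file names in order.
    """
    os.makedirs(cloud_dir, exist_ok=True)
    moved = []
    for directory in source_dirs:
        for name in sorted(os.listdir(directory)):
            shutil.move(os.path.join(directory, name),
                        os.path.join(cloud_dir, name))
            moved.append(name)
    return moved
```

In this sketch the move would run only after the CMSSW jobs and merge scripts have exited, so that no resource file is still in use.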

Since a recent conclusion was to activate VM startup through hltd, one possibility is for hltd to run a script that signals the local cloud service ("cloud tool") that VMs can be started (once the CMSSW processes have finished).

An "include" interface is also provided in hltd 1.6.0. Currently, however, it only returns the core files to their usual place and allows hltd to accept new HLT runs.

I propose to modify this interface ("include_cgi.py") so that hltd executes a script or command that tells the cloud tool to stop the VMs before the switchover to HLT is completed. The script called by hltd can be synchronous, i.e. it returns only when the VMs are shut down (note that the CGI calls are still asynchronous: they create a file which triggers the action in hltd and finish immediately).
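A synchronous call of this kind could look like the sketch below. The command to run is not defined yet, so it is passed in as an argument; the function name and the choice to treat exit status 0 as "VMs are down" are assumptions.

```python
import subprocess


def stop_vms(command, timeout=None):
    """Run the cloud-tool stop command and block until it exits.

    `command` is an argument list such as ["/path/to/stop-script"]
    (hypothetical). Returns True if the command exited with status 0.
    Raises subprocess.TimeoutExpired if `timeout` seconds pass first.
    """
    result = subprocess.run(command, timeout=timeout)
    return result.returncode == 0
```

With this shape, hltd would continue the switch back to HLT only when `stop_vms(...)` returns True, and could stay in "deactivatingCloud" otherwise.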

In addition, hltd will update a file containing the name of its current mode. The cloud service could poll this file for mode changes (if necessary). I propose "/fff/data/mode" as the file location, with the following possible modes: "HLT", "activatingCloud", "deactivatingCloud", "cloud".
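On the cloud service side, polling could be as simple as the following sketch. It assumes the file holds just the mode name as plain text, which is not specified above; the function names, polling interval and timeout are hypothetical.

```python
import time

MODE_FILE = "/fff/data/mode"  # proposed location
KNOWN_MODES = ("HLT", "activatingCloud", "deactivatingCloud", "cloud")


def read_mode(path=MODE_FILE):
    """Return the current mode, or None if the file is missing or unrecognized."""
    try:
        with open(path) as handle:
            mode = handle.read().strip()
    except OSError:
        return None
    return mode if mode in KNOWN_MODES else None


def wait_for_mode(target, path=MODE_FILE, interval=1.0, timeout=60.0):
    """Poll the mode file until it reports `target`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if read_mode(path) == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A cloud tool could call `wait_for_mode("cloud")` before starting VMs. If hltd writes the file in place rather than replacing it atomically, a reader may briefly see partial content, which this sketch treats as "unknown" and retries.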

Monitoring: the hltd mode can be monitored through elasticsearch. The mode can be written to the box info files, which are fed into the central ES index (every 5 seconds per machine). In addition, a separate monitoring chain could be implemented through cloud services by polling the content of the "/fff/data/mode" file.

Handling of runs during cloud mode: new run requests will be ignored, but hltd can cache the last such request so that it can join the ongoing run once cloud mode is switched off. Otherwise it would not rejoin HLT until the next run started after switching back.
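The caching idea can be sketched like this. The class and method names are hypothetical and this is not how hltd is structured; it only shows keeping the most recent ignored request and handing it back once the mode returns to "HLT".

```python
class RunRequestCache:
    """Remember the last run request received while not in HLT mode."""

    def __init__(self):
        self.mode = "HLT"
        self.pending_run = None

    def on_new_run(self, run_number):
        """Return the run to start now, or None if the request is only cached."""
        if self.mode == "HLT":
            return run_number
        self.pending_run = run_number  # overwrite: only the last one is kept
        return None

    def on_mode_change(self, mode):
        """Record the new mode; on return to HLT, hand back the cached run."""
        self.mode = mode
        if mode == "HLT" and self.pending_run is not None:
            run, self.pending_run = self.pending_run, None
            return run
        return None
```

The sketch does not check whether the cached run is still ongoing when HLT mode resumes; a real implementation would need to verify that before joining.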

smorovic added the enhancement label Feb 2, 2015

smorovic self-assigned this Feb 2, 2015
