EMU: Sample Temperature not correctly logged #5756

DominicOram · 2020-09-23T15:48:40Z

As a user on EMU I would like my nexus file to correctly log my sample temperature. Run numbers 104626 and 104412 had issues where about 2 seconds into the run the value of their sample temperature dropped to 0K and never returned to a sensible value.

Acceptance criteria

It is understood why this happened
There is a solution to stop it happening again

DominicOram · 2020-09-23T16:15:24Z

For 104626 the timeline of the issue was:
2020-09-15 00:39:24 Run started
2020-09-15 00:39:12 Temperature is 225.870403
2020-09-15 00:39:27 Temperature is NULL (Archive_off)
2020-09-15 00:39:28 Temperature is 224.9636
2020-09-15 00:39:28 Temperature is 225.87040
2020-09-15 00:49:30 Run Ended
2020-09-15 00:49:31 Temperature is NULL (Archive_off)

The nexus file shows:
225.8704 at -14
224.96361 at 2
0 at 2

Supporting logs:
emu_logs.zip

DominicOram · 2020-09-23T16:30:07Z

It appears they were in the CCR_highT configuration at the time, where the Temp_Sample block is pointing to EUROTHRM_01:A02:TEMP and logged on change at a rate of every 30 seconds. It appears that the logging did not work correctly as there are no values between 00:39 and 00:49.

FreddieAkeroyd · 2020-09-23T16:34:20Z

is that logged in inst archive as well as block ?

DominicOram · 2020-09-23T16:35:24Z

At 00:39:27-00:39:28 the eurotherm logs also show reading mismatches for a number of PVs but not the temperature itself.
EUROTHRM_01-20200915.log

DominicOram · 2020-09-23T16:38:11Z

is that logged in inst archive as well as block ?

Yes

DominicOram · 2020-09-23T16:47:10Z

For the inst archive we have:

channel	time	nanosecs	severity	status	num_val	float_val
2095	00:30:31	8.64E+08	5	5	NULL	225.8704	NULL	...	NULL	98782151
2095	00:39:27	1.84E+08	5	5	NULL	224.9636	NULL	...	NULL	98804581
2095	00:39:28	4.16E+08	4	3	NULL	225.8704	NULL	...	NULL	98804582
2095	00:49:50	97899840	5	5	NULL	225.0644	NULL	...	NULL	98826668
2095	00:49:51	2.43E+08	5	5	NULL	225.8704	NULL	...	NULL	98826669
2095	01:01:43	8.89E+08	5	5	NULL	225.9712	NULL	...	NULL	98855056

Which implies an error in the eurotherm as it should be logging every second

FreddieAkeroyd · 2020-09-23T16:48:41Z

is there a logging deadband?

DominicOram · 2020-09-23T16:52:48Z

No deadbands on the logging. Given how upset the eurotherm logs seem to be I could be convinced that's the issue

KathrynBaker · 2020-09-23T16:54:44Z

There are framing and parity errors listed in the eurotherm logs from the days either side of this, so my instinct is that the Eurotherm and IOCs could do with a restart at the very least

FreddieAkeroyd · 2020-09-23T17:02:24Z

Do we need to check the cabling?

FreddieAkeroyd · 2020-09-23T18:39:46Z

So eurotherm power cycled and moved to a different moxa port, see how that goes...

DominicOram · 2020-09-24T12:44:23Z

This seems to be a Eurotherm problem across the board, though EMU's is particularly bad. #5611 and #5733 are also likely symptoms of this same issue. Some examples from other instruments:

Interestingly some of these logs hint at the issue occurring every 2 hours

FreddieAkeroyd · 2020-09-24T12:54:40Z

Do you know why we have a reply timeout at 200ms? Looking at the log of emu below the eurotherm takes a bit longer to respond and then sends out the previous replies to the new commands. If the reply timeout was 1 seconds (which is the stream device default) then it would likely work without error. Does Labview have a longer timeout?

[2020-09-23 19:42:40] 2020/09/23 19:42:40.894956 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: No reply within 200 ms to "[7m<04>[27m0011OP[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.709220 L0 IN:EMU:EUROTHRM_01:A01:RBV: No reply within 200 ms to "[7m<04>[27m0011PV[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.909204 L0 IN:EMU:EUROTHRM_01:A02:RBV: No reply within 200 ms to "[7m<04>[27m0022PV[7m<05>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012150 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Input "[7m<02>[27mPV 0.527"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012163 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012171 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: got "PV 0.527" where "OP" was expected

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051903 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: Input "[7m<02>[27mPV 0.52[7m<19>[27m[7m<04>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051918 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051928 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: got "PV 0.52[7m<19>[27m[7m<04>[27m" where "OP" was expected

[2020-09-23 20:35:54] 2020/09/23 20:35:54.688715 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Match passed again

[2020-09-23 20:35:54] 2020/09/23 20:35:54.749439 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: Match passed again

DominicOram · 2020-09-24T12:56:25Z

The 200 ms appears to come from the original implementation https://github.com/ISISComputingGroup/EPICS-eurotherm2k/blame/master/eurotherm2kApp/protocol/eurotherm2k.proto

FreddieAkeroyd · 2020-09-24T12:58:45Z

I think it is worth us increasing it, it might stop errors and we might also be bombarding the eurotherm with too many commands and that could cause an issue

KathrynBaker · 2020-09-24T13:03:15Z

The more recent LabVIEW driver has a timeout of 2000ms, the older one is a little more complicated so might not be worth the time to check on it

FreddieAkeroyd · 2020-09-24T13:04:21Z

e.g. reply from RBV seems to end up in Op query 0.5 seconds later, 1s coudl be enough here but 2s might be better

[2020-09-23 20:35:53] 2020/09/23 20:35:53.709220 L0 IN:EMU:EUROTHRM_01:A01:RBV: No reply within 200 ms to "[7m<04>[27m0011PV[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.909204 L0 IN:EMU:EUROTHRM_01:A02:RBV: No reply within 200 ms to "[7m<04>[27m0022PV[7m<05>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012150 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Input "[7m<02>[27mPV 0.527"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012163 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012171 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: got "PV 0.527" where "OP" was expected

FreddieAkeroyd · 2020-09-24T13:15:05Z

@DominicOram it is not just every two hours, it is the same two hours on different instruments? maybe they all got power cycled at the same time, or maybe something more central is happening?

FreddieAkeroyd · 2020-09-25T09:36:05Z

Having changed replytimeout to 2000ms there have been no comms errors overnight, will create a PR

Tom-Willemsen · 2020-09-25T09:41:28Z

As per @John-Holt-Tessella comment in teams, https://github.com/ISISComputingGroup/ibex_developers_manual/wiki/Eurotherm states that increasing the timeouts on a eurotherm is a "gotcha"... I think this would need further testing before being deployed more widely

KathrynBaker · 2020-09-25T10:23:49Z

Given that the ISIS version of the driver used a 2000ms timeout, we might need to alter the scan loops rather than keep to the timeout. Evidence tells us it isn't long enough. We might also be hitting here different ways of communicating with Eurotherms, as I doubt the original users are using the comms protocol we are. Maybe we should consider again the move to modbus control (#4240) as a more general situation rather that the specialist one for a single device. (The Eurotherms on site can all speak Modbus, we just don't use it as fas as I'm aware and instead we stuck with the protocol as had been used in SECI.)

FreddieAkeroyd · 2020-09-25T10:25:53Z

It says "some scans depend on the timeout", that makes sense for readtimeout but not replytimout? The issue looks like it is hitting the eurotherm with more commands when it is busy, which is caused by a low replytimeout. The multiple scan loops don't stop us continuing to send commands, they just reduce that rate a bit so it eventually catches up. Maybe we should revisit the scan loops - i am not aware of the history though.

FreddieAkeroyd · 2020-09-25T10:27:42Z

I had also left camonitors on the alarm status on EMU yesterday, so far no alarms raised either

FreddieAkeroyd · 2020-09-25T11:01:13Z

I have now left a monitor on READ_RESTART.LCNT as well to see if I can see any intermediate SCAN failures, you only get an alarm if 10 fail in succession

FreddieAkeroyd · 2020-10-12T11:14:11Z

I am not sure how often we use read timeout - it looks like it may read using InTerminator = \x03 and then wait for the bcc character to appear so it can be discarded by flush on next read?

FreddieAkeroyd · 2020-10-12T11:16:33Z

Maybe we could enhance streamdevice for a regex terminator to allow for \x03 + any other single character?

Tom-Willemsen · 2020-10-13T08:40:22Z

meeting on friday to decide path forwards

FreddieAkeroyd · 2020-10-15T20:09:10Z

EMU had a brief sequence of timeouts on 28/9 and 5/10, but otherwise clean log files from 24/9

FreddieAkeroyd · 2020-10-15T20:54:52Z

@John-Holt-Tessella rather than writing a full asyn driver It may be possible to add a new asyn interpose interface https://epics.anl.gov/modules/soft/asyn/R4-38/asynDriver.html#interposeInterfaces this could potentially be used to remove a BCC character on input as I believe a \x03 is always followed by a BCC on input. For example asynInterposeCom adjusts input that passes by to add/remove RFC 2117 characters, it may provide a good example. I think the modbus driver code behind modbusInterposeConfig add/removes checksums on pasing bytes so could also be worth a look as an example.

Tom-Willemsen · 2020-10-16T13:25:31Z

In a meeting with @FreddieAkeroyd @KathrynBaker @DominicOram and myself we have decided:

We will continue hotfixing EMU with the above-mentioned hotfix in the short term
- This is because we are not confident enough that the hotfix will work correctly for other instruments (in particular instruments with 6+ eurotherm crates)
- We do not want to spend significant time testing this hotfix for all instruments, as this will not affect the long term solution anyway.
In the medium/long term we want to move all eurotherms onto using modbus. Ticket 4240, which has been proposed.

For the avoidance of doubt, the hotfix applied on EMU is:

$ git diff .
diff --git a/support/eurotherm2k/master/data/eurotherm2k.proto b/support/eurotherm2k/master/data/eurotherm2k.proto
index 423e85041..df0e66cc1 100644
--- a/support/eurotherm2k/master/data/eurotherm2k.proto
+++ b/support/eurotherm2k/master/data/eurotherm2k.proto
@@ -5,7 +5,7 @@
 locktimeout = 5000;

 # Uses a timeout when there is no termination character (i.e. for reads)
-replytimeout = 200;
+replytimeout = 2000;
 readtimeout  = 100;
 extrainput   = Error;

diff --git a/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto b/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
index 423e85041..df0e66cc1 100644
--- a/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
+++ b/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
@@ -5,7 +5,7 @@
 locktimeout = 5000;

 # Uses a timeout when there is no termination character (i.e. for reads)
-replytimeout = 200;
+replytimeout = 2000;
 readtimeout  = 100;
 extrainput   = Error;

DominicOram added the support label Sep 23, 2020

DominicOram added the in progress label Sep 23, 2020

DominicOram added this to the SPRINT_2020_09_17 milestone Sep 23, 2020

DominicOram added ready and removed in progress labels Sep 23, 2020

Tom-Willemsen self-assigned this Oct 12, 2020

Tom-Willemsen added in progress and removed ready labels Oct 12, 2020

Tom-Willemsen added impeded and removed in progress labels Oct 13, 2020

kjwoodsISIS modified the milestones: SPRINT_2020_09_17, SPRINT_2020_10_15 Oct 15, 2020

Tom-Willemsen mentioned this issue Oct 16, 2020

EUROTHERM: MODBUS Control #4240

Closed

Tom-Willemsen added review and removed impeded labels Oct 16, 2020

Tom-Willemsen added the 3 label Oct 16, 2020

Alistair-McGann-Tessella added under review completed and removed under review labels Oct 20, 2020

John-Holt-Tessella removed the review label Oct 22, 2020

kjwoodsISIS added the fixed label Nov 11, 2020

kjwoodsISIS closed this as completed Nov 11, 2020

kjwoodsISIS removed the completed label Dec 4, 2020

FreddieAkeroyd mentioned this issue Apr 6, 2023

eurotherm: test increasing replytimeout #7697

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EMU: Sample Temperature not correctly logged #5756

EMU: Sample Temperature not correctly logged #5756

DominicOram commented Sep 23, 2020 •

edited

Loading

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 23, 2020

KathrynBaker commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020

DominicOram commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020

KathrynBaker commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 24, 2020

FreddieAkeroyd commented Sep 25, 2020

Tom-Willemsen commented Sep 25, 2020

KathrynBaker commented Sep 25, 2020

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Oct 12, 2020

FreddieAkeroyd commented Oct 12, 2020

Tom-Willemsen commented Oct 13, 2020

FreddieAkeroyd commented Oct 15, 2020 •

edited

Loading

FreddieAkeroyd commented Oct 15, 2020 •

edited

Loading

Tom-Willemsen commented Oct 16, 2020

EMU: Sample Temperature not correctly logged #5756

EMU: Sample Temperature not correctly logged #5756

Comments

DominicOram commented Sep 23, 2020 • edited Loading

Acceptance criteria

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020

DominicOram commented Sep 23, 2020 • edited Loading

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 23, 2020

KathrynBaker commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

FreddieAkeroyd commented Sep 23, 2020

DominicOram commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020

DominicOram commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020

KathrynBaker commented Sep 24, 2020

FreddieAkeroyd commented Sep 24, 2020 • edited Loading

FreddieAkeroyd commented Sep 24, 2020

FreddieAkeroyd commented Sep 25, 2020

Tom-Willemsen commented Sep 25, 2020

KathrynBaker commented Sep 25, 2020

FreddieAkeroyd commented Sep 25, 2020 • edited Loading

FreddieAkeroyd commented Sep 25, 2020 • edited Loading

FreddieAkeroyd commented Sep 25, 2020 • edited Loading

FreddieAkeroyd commented Oct 12, 2020

FreddieAkeroyd commented Oct 12, 2020

Tom-Willemsen commented Oct 13, 2020

FreddieAkeroyd commented Oct 15, 2020 • edited Loading

FreddieAkeroyd commented Oct 15, 2020 • edited Loading

Tom-Willemsen commented Oct 16, 2020

DominicOram commented Sep 23, 2020 •

edited

Loading

DominicOram commented Sep 23, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 24, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Sep 25, 2020 •

edited

Loading

FreddieAkeroyd commented Oct 15, 2020 •

edited

Loading

FreddieAkeroyd commented Oct 15, 2020 •

edited

Loading