Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EMU: Sample Temperature not correctly logged #5756

Closed
DominicOram opened this issue Sep 23, 2020 · 37 comments
Closed

EMU: Sample Temperature not correctly logged #5756

DominicOram opened this issue Sep 23, 2020 · 37 comments

Comments

@DominicOram
Copy link
Contributor

DominicOram commented Sep 23, 2020

As a user on EMU I would like my nexus file to correctly log my sample temperature. Run numbers 104626 and 104412 had issues where about 2 seconds into the run the value of their sample temperature dropped to 0K and never returned to a sensible value.

Acceptance criteria

  • It is understood why this happened
  • There is a solution to stop it happening again
@DominicOram
Copy link
Contributor Author

For 104626 the timeline of the issue was:
2020-09-15 00:39:24 Run started
2020-09-15 00:39:12 Temperature is 225.870403
2020-09-15 00:39:27 Temperature is NULL (Archive_off)
2020-09-15 00:39:28 Temperature is 224.9636
2020-09-15 00:39:28 Temperature is 225.87040
2020-09-15 00:49:30 Run Ended
2020-09-15 00:49:31 Temperature is NULL (Archive_off)

The nexus file shows:
225.8704 at -14
224.96361 at 2
0 at 2

Supporting logs:
emu_logs.zip

@DominicOram
Copy link
Contributor Author

It appears they were in the CCR_highT configuration at the time, where the Temp_Sample block is pointing to EUROTHRM_01:A02:TEMP and logged on change at a rate of every 30 seconds. It appears that the logging did not work correctly as there are no values between 00:39 and 00:49.

@FreddieAkeroyd
Copy link
Member

is that logged in inst archive as well as block ?

@DominicOram
Copy link
Contributor Author

At 00:39:27-00:39:28 the eurotherm logs also show reading mismatches for a number of PVs but not the temperature itself.
EUROTHRM_01-20200915.log

@DominicOram
Copy link
Contributor Author

is that logged in inst archive as well as block ?

Yes

@DominicOram
Copy link
Contributor Author

DominicOram commented Sep 23, 2020

For the inst archive we have:

channel time nanosecs severity status num_val float_val
2095 00:30:31 8.64E+08 5 5 NULL 225.8704 NULL ... NULL 98782151
2095 00:39:27 1.84E+08 5 5 NULL 224.9636 NULL ... NULL 98804581
2095 00:39:28 4.16E+08 4 3 NULL 225.8704 NULL ... NULL 98804582
2095 00:49:50 97899840 5 5 NULL 225.0644 NULL ... NULL 98826668
2095 00:49:51 2.43E+08 5 5 NULL 225.8704 NULL ... NULL 98826669
2095 01:01:43 8.89E+08 5 5 NULL 225.9712 NULL ... NULL 98855056

Which implies an error in the eurotherm as it should be logging every second

@FreddieAkeroyd
Copy link
Member

is there a logging deadband?

@DominicOram
Copy link
Contributor Author

No deadbands on the logging. Given how upset the eurotherm logs seem to be I could be convinced that's the issue

@KathrynBaker
Copy link
Member

There are framing and parity errors listed in the eurotherm logs from the days either side of this, so my instinct is that the Eurotherm and IOCs could do with a restart at the very least

@FreddieAkeroyd
Copy link
Member

Do we need to check the cabling?

@FreddieAkeroyd
Copy link
Member

So eurotherm power cycled and moved to a different moxa port, see how that goes...

@DominicOram
Copy link
Contributor Author

This seems to be a Eurotherm problem across the board, though EMU's is particularly bad. #5611 and #5733 are also likely symptoms of this same issue. Some examples from other instruments:

Interestingly some of these logs hint at the issue occurring every 2 hours

@FreddieAkeroyd
Copy link
Member

Do you know why we have a reply timeout at 200ms? Looking at the log of emu below the eurotherm takes a bit longer to respond and then sends out the previous replies to the new commands. If the reply timeout was 1 seconds (which is the stream device default) then it would likely work without error. Does Labview have a longer timeout?

[2020-09-23 19:42:40] 2020/09/23 19:42:40.894956 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: No reply within 200 ms to "[7m<04>[27m0011OP[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.709220 L0 IN:EMU:EUROTHRM_01:A01:RBV: No reply within 200 ms to "[7m<04>[27m0011PV[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.909204 L0 IN:EMU:EUROTHRM_01:A02:RBV: No reply within 200 ms to "[7m<04>[27m0022PV[7m<05>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012150 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Input "[7m<02>[27mPV 0.527"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012163 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012171 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: got "PV 0.527" where "OP" was expected

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051903 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: Input "[7m<02>[27mPV 0.52[7m<19>[27m[7m<04>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051918 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.051928 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: got "PV 0.52[7m<19>[27m[7m<04>[27m" where "OP" was expected

[2020-09-23 20:35:54] 2020/09/23 20:35:54.688715 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Match passed again

[2020-09-23 20:35:54] 2020/09/23 20:35:54.749439 L0 IN:EMU:EUROTHRM_01:A02:OUTPUT: Match passed again

@DominicOram
Copy link
Contributor Author

@FreddieAkeroyd
Copy link
Member

I think it is worth us increasing it, it might stop errors and we might also be bombarding the eurotherm with too many commands and that could cause an issue

@KathrynBaker
Copy link
Member

The more recent LabVIEW driver has a timeout of 2000ms, the older one is a little more complicated so might not be worth the time to check on it

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Sep 24, 2020

e.g. reply from RBV seems to end up in Op query 0.5 seconds later, 1s coudl be enough here but 2s might be better

[2020-09-23 20:35:53] 2020/09/23 20:35:53.709220 L0 IN:EMU:EUROTHRM_01:A01:RBV: No reply within 200 ms to "[7m<04>[27m0011PV[7m<05>[27m"

[2020-09-23 20:35:53] 2020/09/23 20:35:53.909204 L0 IN:EMU:EUROTHRM_01:A02:RBV: No reply within 200 ms to "[7m<04>[27m0022PV[7m<05>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012150 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: Input "[7m<02>[27mPV 0.527"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012163 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: mismatch after 1 byte "[7m<02>[27m"

[2020-09-23 20:35:54] 2020/09/23 20:35:54.012171 L0 IN:EMU:EUROTHRM_01:A01:OUTPUT: got "PV 0.527" where "OP" was expected

@FreddieAkeroyd
Copy link
Member

@DominicOram it is not just every two hours, it is the same two hours on different instruments? maybe they all got power cycled at the same time, or maybe something more central is happening?

@FreddieAkeroyd
Copy link
Member

Having changed replytimeout to 2000ms there have been no comms errors overnight, will create a PR

@Tom-Willemsen
Copy link
Contributor

As per @John-Holt-Tessella comment in teams, https://github.com/ISISComputingGroup/ibex_developers_manual/wiki/Eurotherm states that increasing the timeouts on a eurotherm is a "gotcha"... I think this would need further testing before being deployed more widely

@KathrynBaker
Copy link
Member

Given that the ISIS version of the driver used a 2000ms timeout, we might need to alter the scan loops rather than keep to the timeout. Evidence tells us it isn't long enough. We might also be hitting here different ways of communicating with Eurotherms, as I doubt the original users are using the comms protocol we are. Maybe we should consider again the move to modbus control (#4240) as a more general situation rather that the specialist one for a single device. (The Eurotherms on site can all speak Modbus, we just don't use it as fas as I'm aware and instead we stuck with the protocol as had been used in SECI.)

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Sep 25, 2020

It says "some scans depend on the timeout", that makes sense for readtimeout but not replytimout? The issue looks like it is hitting the eurotherm with more commands when it is busy, which is caused by a low replytimeout. The multiple scan loops don't stop us continuing to send commands, they just reduce that rate a bit so it eventually catches up. Maybe we should revisit the scan loops - i am not aware of the history though.

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Sep 25, 2020

I had also left camonitors on the alarm status on EMU yesterday, so far no alarms raised either

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Sep 25, 2020

I have now left a monitor on READ_RESTART.LCNT as well to see if I can see any intermediate SCAN failures, you only get an alarm if 10 fail in succession

@FreddieAkeroyd
Copy link
Member

I am not sure how often we use read timeout - it looks like it may read using InTerminator = \x03 and then wait for the bcc character to appear so it can be discarded by flush on next read?

@FreddieAkeroyd
Copy link
Member

Maybe we could enhance streamdevice for a regex terminator to allow for \x03 + any other single character?

@Tom-Willemsen Tom-Willemsen self-assigned this Oct 12, 2020
@Tom-Willemsen
Copy link
Contributor

meeting on friday to decide path forwards

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Oct 15, 2020

EMU had a brief sequence of timeouts on 28/9 and 5/10, but otherwise clean log files from 24/9

@FreddieAkeroyd
Copy link
Member

FreddieAkeroyd commented Oct 15, 2020

@John-Holt-Tessella rather than writing a full asyn driver It may be possible to add a new asyn interpose interface https://epics.anl.gov/modules/soft/asyn/R4-38/asynDriver.html#interposeInterfaces this could potentially be used to remove a BCC character on input as I believe a \x03 is always followed by a BCC on input. For example asynInterposeCom adjusts input that passes by to add/remove RFC 2117 characters, it may provide a good example. I think the modbus driver code behind modbusInterposeConfig add/removes checksums on pasing bytes so could also be worth a look as an example.

@Tom-Willemsen
Copy link
Contributor

In a meeting with @FreddieAkeroyd @KathrynBaker @DominicOram and myself we have decided:

  • We will continue hotfixing EMU with the above-mentioned hotfix in the short term
    • This is because we are not confident enough that the hotfix will work correctly for other instruments (in particular instruments with 6+ eurotherm crates)
    • We do not want to spend significant time testing this hotfix for all instruments, as this will not affect the long term solution anyway.
  • In the medium/long term we want to move all eurotherms onto using modbus. Ticket 4240, which has been proposed.

For the avoidance of doubt, the hotfix applied on EMU is:

$ git diff .
diff --git a/support/eurotherm2k/master/data/eurotherm2k.proto b/support/eurotherm2k/master/data/eurotherm2k.proto
index 423e85041..df0e66cc1 100644
--- a/support/eurotherm2k/master/data/eurotherm2k.proto
+++ b/support/eurotherm2k/master/data/eurotherm2k.proto
@@ -5,7 +5,7 @@
 locktimeout = 5000;

 # Uses a timeout when there is no termination character (i.e. for reads)
-replytimeout = 200;
+replytimeout = 2000;
 readtimeout  = 100;
 extrainput   = Error;

diff --git a/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto b/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
index 423e85041..df0e66cc1 100644
--- a/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
+++ b/support/eurotherm2k/master/eurotherm2kApp/protocol/eurotherm2k.proto
@@ -5,7 +5,7 @@
 locktimeout = 5000;

 # Uses a timeout when there is no termination character (i.e. for reads)
-replytimeout = 200;
+replytimeout = 2000;
 readtimeout  = 100;
 extrainput   = Error;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants