EMU: Sample Temperature not correctly logged #5756
Comments
For 104626 the timeline of the issue was: The nexus file shows: Supporting logs: |
It appears they were in the |
is that logged in inst archive as well as block ? |
At 00:39:27-00:39:28 the eurotherm logs also show reading mismatches for a number of PVs but not the temperature itself. |
Yes |
For the inst archive we have:
Which implies an error in the eurotherm as it should be logging every second |
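A quick way to check the "should be logging every second" expectation against an archive export is to look for spacing larger than the expected period. This is only a sketch: the function name `find_logging_gaps`, the 0.5 s tolerance and the timestamps are all made up for illustration and are not taken from the EMU archive.

```python
from datetime import datetime, timedelta


def find_logging_gaps(timestamps,
                      expected_period=timedelta(seconds=1),
                      tolerance=timedelta(seconds=0.5)):
    """Return (previous, current) pairs whose spacing exceeds period + tolerance."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected_period + tolerance:
            gaps.append((prev, cur))
    return gaps


# Illustrative timestamps only, NOT real archive data.
start = datetime(2020, 1, 1, 0, 0, 0)
samples = [start + timedelta(seconds=s) for s in (0, 1, 2, 7, 8)]
gaps = find_logging_gaps(samples)
for prev, cur in gaps:
    print(f"gap of {(cur - prev).total_seconds():.0f}s between {prev} and {cur}")
```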
is there a logging deadband? |
No deadbands on the logging. Given how upset the eurotherm logs seem to be I could be convinced that's the issue |
There are framing and parity errors listed in the eurotherm logs from the days either side of this, so my instinct is that the Eurotherm and IOCs could do with a restart at the very least |
Do we need to check the cabling? |
So eurotherm power cycled and moved to a different moxa port, see how that goes... |
This seems to be a Eurotherm problem across the board, though EMU's is particularly bad. #5611 and #5733 are also likely symptoms of this same issue. Some examples from other instruments:
Interestingly some of these logs hint at the issue occurring every 2 hours |
Do you know why we have a reply timeout of 200 ms? Looking at the log of EMU below, the Eurotherm takes a bit longer to respond and then sends out the previous replies to the new commands. If the reply timeout was 1 second (which is the StreamDevice default) then it would likely work without error. Does LabVIEW have a longer timeout?
|
The 200 ms appears to come from the original implementation https://github.com/ISISComputingGroup/EPICS-eurotherm2k/blame/master/eurotherm2kApp/protocol/eurotherm2k.proto |
I think it is worth us increasing it, it might stop errors and we might also be bombarding the eurotherm with too many commands and that could cause an issue |
The more recent LabVIEW driver has a timeout of 2000ms, the older one is a little more complicated so might not be worth the time to check on it |
e.g. the reply from RBV seems to end up in the Op query 0.5 seconds later; 1 s could be enough here but 2 s might be better
|
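To illustrate the mismatch described above, here is a toy model of a poller talking to a slow device. It is not StreamDevice or the real Eurotherm: the function `simulate_polling`, the 500 ms device latency, the 200 ms poll interval and the command names are assumptions chosen to mirror the numbers discussed in this thread. Times are integer milliseconds of simulated time, so nothing actually sleeps.

```python
from collections import deque


def simulate_polling(commands, device_latency_ms, reply_timeout_ms, poll_interval_ms=200):
    """Return [(command, reply_or_None), ...] as the client would see them.

    Assumptions: the device answers commands strictly in order, each taking
    device_latency_ms, and late replies stay in the client's input buffer.
    """
    now = 0
    device_free_at = 0
    pending = deque()  # (arrival_time_ms, reply_text) in wire order
    results = []
    for cmd in commands:
        # The device starts on this command once it has finished the previous one.
        started = max(now, device_free_at)
        arrival = started + device_latency_ms
        device_free_at = arrival
        pending.append((arrival, f"reply-to-{cmd}"))

        deadline = now + reply_timeout_ms
        if pending[0][0] <= deadline:
            # Whatever arrives first is taken as the answer, even if it is stale.
            arrival_time, reply = pending.popleft()
            now = max(now, arrival_time)
            results.append((cmd, reply))
        else:
            now = deadline
            results.append((cmd, None))  # reply timeout
        now += poll_interval_ms
    return results


short = simulate_polling(["PV", "OP", "SL"], device_latency_ms=500, reply_timeout_ms=200)
long = simulate_polling(["PV", "OP", "SL"], device_latency_ms=500, reply_timeout_ms=2000)
print(short)
print(long)
```

In this model the 200 ms timeout gives up on the first query, then reads the late `PV` reply as the answer to `OP`, while the 2000 ms timeout keeps every reply paired with its own command.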
@DominicOram it is not just every two hours, it is the same two hours on different instruments? maybe they all got power cycled at the same time, or maybe something more central is happening? |
Having changed replytimeout to 2000ms there have been no comms errors overnight, will create a PR |
As per @John-Holt-Tessella comment in teams, https://github.com/ISISComputingGroup/ibex_developers_manual/wiki/Eurotherm states that increasing the timeouts on a eurotherm is a "gotcha"... I think this would need further testing before being deployed more widely |
Given that the ISIS version of the driver used a 2000ms timeout, we might need to alter the scan loops rather than keep to the timeout. Evidence tells us it isn't long enough. We might also be hitting here different ways of communicating with Eurotherms, as I doubt the original users are using the comms protocol we are. Maybe we should consider again the move to Modbus control (#4240) as a more general solution rather than the specialist one for a single device. (The Eurotherms on site can all speak Modbus, we just don't use it as far as I'm aware and instead we stuck with the protocol as had been used in SECI.) |
It says "some scans depend on the timeout"; that makes sense for readtimeout but not replytimeout? The issue looks like it is hitting the Eurotherm with more commands when it is busy, which is caused by a low replytimeout. The multiple scan loops don't stop us continuing to send commands, they just reduce that rate a bit so it eventually catches up. Maybe we should revisit the scan loops - I am not aware of the history though. |
I had also left camonitors on the alarm status on EMU yesterday, so far no alarms raised either |
I have now left a monitor on READ_RESTART.LCNT as well to see if I can see any intermediate SCAN failures, you only get an alarm if 10 fail in succession |
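The reason for watching a counter as well as the alarm can be sketched as follows. This is not the IOC's actual logic and does not model `READ_RESTART.LCNT`; the class `ConsecutiveFailureAlarm` is hypothetical, and only the threshold of 10 comes from the comment above.

```python
class ConsecutiveFailureAlarm:
    """Raise an alarm only after `threshold` failures in a row."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.consecutive = 0  # length of the current run of failures
        self.total = 0        # every failure, including ones that never alarm

    def record(self, ok):
        """Record one scan outcome; return True if the alarm is raised."""
        if ok:
            self.consecutive = 0
        else:
            self.consecutive += 1
            self.total += 1
        return self.consecutive >= self.threshold


alarm = ConsecutiveFailureAlarm()
# Nine failures, one success, then ten failures (made-up sequence).
outcomes = [False] * 9 + [True] + [False] * 10
raised = [alarm.record(ok) for ok in outcomes]
print(raised.index(True), alarm.total)
```

Nine failures followed by one success never raise the alarm, which is why intermittent failures are only visible through a separate count.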
I am not sure how often we use read timeout - it looks like it may read using |
Maybe we could enhance streamdevice for a regex terminator to allow for \x03 + any other single character? |
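As a rough picture of what such a terminator would match, the same pattern can be tried in Python. This is not StreamDevice syntax, and the frame contents and check bytes below are arbitrary illustrative values rather than real Eurotherm traffic; `split_frames` is a hypothetical helper.

```python
import re

# ETX (\x03) followed by exactly one byte of any value.
# DOTALL is needed so that "." also matches a newline byte.
FRAME_END = re.compile(rb"\x03.", re.DOTALL)


def split_frames(buffer):
    """Split buffer on 'ETX + one byte'; return (complete_frames, leftover)."""
    frames = []
    pos = 0
    for match in FRAME_END.finditer(buffer):
        frames.append(buffer[pos:match.start()])
        pos = match.end()
    return frames, buffer[pos:]


# The second frame's trailing byte is itself \x03, to show it is still consumed.
data = b"\x02PV12.3\x03A\x02OP45.0\x03\x03\x02SL"
frames, leftover = split_frames(data)
print(frames, leftover)
```

An ETX that has arrived without its following byte is left in `leftover`, so the frame is only completed once the trailing character has been read.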
meeting on friday to decide path forwards |
EMU had a brief sequence of timeouts on 28/9 and 5/10, but otherwise clean log files from 24/9 |
@John-Holt-Tessella rather than writing a full asyn driver it may be possible to add a new asyn interpose interface https://epics.anl.gov/modules/soft/asyn/R4-38/asynDriver.html#interposeInterfaces which could potentially be used to remove a BCC character on input, as I believe a \x03 is always followed by a BCC on input. For example, asynInterposeCom adjusts input that passes by to add/remove RFC 2217 characters, so it may provide a good example. I think the modbus driver code behind modbusInterposeConfig adds/removes checksums on passing bytes, so it could also be worth a look as an example. |
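The filtering such an interpose layer would do can be sketched in Python. A real layer would be written in C against the asyn interfaces, none of which are used here; the class `BccStripper` is hypothetical and relies on the assumption stated above that every \x03 on input is followed by exactly one BCC byte. It keeps state between calls because the BCC may arrive in a later read than its ETX.

```python
ETX = 0x03


class BccStripper:
    """Drop the single byte that follows each ETX in an incoming byte stream."""

    def __init__(self):
        self._drop_next = False  # True when the previous byte seen was an ETX

    def filter(self, chunk):
        """Return chunk with any post-ETX check byte removed."""
        out = bytearray()
        for byte in chunk:
            if self._drop_next:
                # This is the BCC: discard it, whatever its value.
                self._drop_next = False
                continue
            out.append(byte)
            if byte == ETX:
                self._drop_next = True
        return bytes(out)


stripper = BccStripper()
# The ETX and its check byte arrive in separate reads (illustrative bytes only).
first = stripper.filter(b"\x02PV12.3\x03")
second = stripper.filter(b"A\x02OP")
print(first, second)
```

With the check byte removed, a fixed \x03 terminator would be enough for the layer above.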
In a meeting with @FreddieAkeroyd @KathrynBaker @DominicOram and myself we have decided:
For the avoidance of doubt, the hotfix applied on EMU is:
|
As a user on EMU I would like my nexus file to correctly log my sample temperature. Run numbers 104626 and 104412 had issues where about 2 seconds into the run the value of their sample temperature dropped to 0K and never returned to a sensible value.
Acceptance criteria