TDK Lambda Genesys IOC: Investigate (and fix?) why the IOC can't connect easily #1594

Closed
KathrynBaker opened this issue Sep 15, 2016 · 27 comments

Comments

@KathrynBaker
Member

KathrynBaker commented Sep 15, 2016

During the Muon Front End testing it was realised that something exceedingly odd was occurring in relation to the Genesys PSUs.

Using the IOC to connect was failing, unless another program (HTERM in this instance) had already connected. This is not desirable behaviour, as connecting to 14 PSUs manually does take time.

The following scenarios should help narrow down where the problem lies. It is unclear whether it is specific to the Genesys, whether Server 2012 and NPort are part of the problem, or whether it is something else entirely.

These are the scenarios I can recall testing, with their results:

  1. The PSU and computer have both just been power cycled; start the IOC. Comms will not initialise (you will see multiple timeout messages).
  2. On the computer, stop the IOC, connect to the PSU via HTERM, disconnect in HTERM (do NOT send any commands), then start the IOC. Everything should be fine.
  3. Power cycle the PSU. Everything should continue to work.
  4. Stop the IOC, then start the IOC (any way you like). Everything should continue to work.

I can't remember whether these were explicitly tested, or what the results were if they were:

  • Stop the IOC, power cycle the PSU, start the IOC
  • Power cycle the computer

This will need access to a Genesys, may need a Server 2012 system, and may or may not require a Moxa to be included in the testing. As yet, a test Genesys has not been sourced, nor has there been an opportunity to explore and test with the existing MUONFE system. Some time was spent on this, but a workaround was used instead, with the awareness that it may need to be redone between now and any fix for the problem.

@ChrisM-S

I had a random conversation at lunch today (with someone talking about comms. issues with some hardware, their system didn't use flow control and got occasional overwrites which brought a cryo-magnet down! - also controlled from EPICS). It rang a small bell in my head about this problem. Is it possible that flow control needs to be enabled (software or hardware) for this power supply to respond? (and also to work reliably?)

Two scenarios I can think of are:

  1. The device assumes software flow control pre-emptively, i.e. it requires an XON to be sent before it will communicate, but will then continue to communicate unless it detects an overrun?
  2. The device assumes/requires hardware flow control and requires RTS, DCD or some other signal handshake before communication can start. In a no-flow control situation the signal may just remain asserted?

The one thing that I imagine HTERM will do is establish flow control first, even if it then works without it. Is it possible to just try enabling flow control in the EPICS parameters (hardware or software, unless the device appears to have a preference) and see if this works?

The other possible lesson is to check whether we should be regularly using flow control in some scenarios where traffic could get heavy. In this case, a buffer was overwritten and a query command ending with a ? turned into the same command without it, which set a setpoint to 0 :-)

@KathrynBaker
Member Author

I know that HTERM is not set up with flow control, and I can’t find any reference to using a flow control setting other than “none” in the Genesys manuals. It might be worth a try, but if that is the solution then there is something more fundamentally wrong with the system.

Typically we don’t turn flow control on because the devices can’t handle it. Sending random flow control signals to something that can’t deal with them is as bad as not sending them to something that needs them.

@ChrisM-S

Yes, it is a bit of a minefield!

For what it’s worth, I really needed it on the Thurlby PS for the binary ICE (but didn’t have it enabled or even think of it) because the thing crashed if you wrote too fast before it was ready and overwrote its buffer. In the end, I implemented my own flow control by making sure I read the response to the previous command before sending another (and by only sending complete commands that produce output).
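A minimal sketch of that "read the reply before sending the next command" discipline, assuming a byte-oriented transport and a single-character reply terminator. The `transport` struct, `command_and_wait()`, the mock device and the `SET 5` command are all hypothetical names made up for illustration; none of them come from the Thurlby or Genesys drivers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport: write returns bytes written or -1;
 * read_byte returns 1 for a byte, 0 on timeout, -1 on error. */
typedef struct {
    int (*write_bytes)(void *ctx, const char *buf, size_t len);
    int (*read_byte)(void *ctx, char *out);
    void *ctx;
} transport;

/* Send one complete command, then block until the terminated reply has
 * been read. Returns the reply length (terminator stripped), or -1 on
 * write failure, timeout, read error or reply overflow. */
static int command_and_wait(const transport *t, const char *cmd, char term,
                            char *reply, size_t reply_size)
{
    size_t len = strlen(cmd);
    size_t n = 0;
    char c;

    if (reply_size == 0)
        return -1;
    if (t->write_bytes(t->ctx, cmd, len) != (int)len)
        return -1;
    for (;;) {
        if (t->read_byte(t->ctx, &c) != 1)
            return -1;              /* no reply: do not send anything else */
        if (c == term)
            break;
        if (n + 1 >= reply_size)
            return -1;              /* reply does not fit in the buffer */
        reply[n++] = c;
    }
    reply[n] = '\0';
    return (int)n;
}

/* Mock device for trying the sketch out: replays a canned reply. */
typedef struct {
    const char *reply;  /* canned response, including the terminator */
    size_t pos;
    int writes;
} mock_device;

static int mock_write(void *ctx, const char *buf, size_t len)
{
    mock_device *d = ctx;
    (void)buf;
    d->writes++;
    d->pos = 0;         /* each new command restarts the canned reply */
    return (int)len;
}

static int mock_read(void *ctx, char *out)
{
    mock_device *d = ctx;
    if (d->reply[d->pos] == '\0')
        return 0;       /* nothing left to read: behave like a timeout */
    *out = d->reply[d->pos++];
    return 1;
}
```

Because the caller only gets control back once the terminator has been seen (or the read has failed), a second command can never be written into the device's buffer while it is still busy with the first.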

When you next get a chance to look at the Lambda Genesys, if you widen out the HTERM screen there are four green indicators for CTS, DSR, RI and DCD. It would be very interesting to see if any of these change/flash when you “connect” to the port.

@KathrynBaker
Member Author

KathrynBaker commented Nov 9, 2016

For reference, I have yet to find a solution, but I have tried the following in the asyn setup, none of which has worked:

  • parity as none, even and odd
  • clocal as Y and N
  • crtscts as N and Y
  • ixon/ixoff as NN, YY, YN and NY

Unless an option was explicitly specified or combined with another, everything else was left in its default mode (the first value listed above), and so far no joy talking to the Lambda Genesys directly from EPICS.

I have even tried clocal and crtscts in each combination.

I haven't (yet) gone through every permutation of ALL the values, as that feels like clutching at straws.

@KathrynBaker
Member Author

KathrynBaker commented Nov 9, 2016

Things to try next time:

  • Stopping and starting NPort
  • Accessing the Moxa ports without NPort

Possible other solutions:

  • Write a script to run during computer startup that just does that connect/disconnect

Because this can't be fixed this time around, I'm sending it back to the backlog, and will need to remember to repropose it when the next shutdown is coming around.

@KathrynBaker KathrynBaker removed their assignment Nov 9, 2016
@KathrynBaker KathrynBaker removed this from the SPRINT_2016_11_03 milestone Nov 9, 2016
@ChrisM-S

If HTERM and LabVIEW are both able to talk to the Lambda Genesys with everything as is, there must be some instruction or setup which is different/missing (or differently timed) in the EPICS driver. It might be worth using Sysinternals “Portmon” (or even the breakout box?) to check the signals (DCD, RTS etc.) to see if there is something there. If not chased down, it is pretty likely to rear its head again at some point.

@KathrynBaker
Member Author

Hence returning this ticket to the backlog. There wasn’t time to chase this down any further this shutdown, at least not without losing the ability to do other work which was also necessary. As such, we don’t have a solution, but there are some alternatives to try next time around which may or may not provide answers, including the use of port monitors (a second person is likely needed from a safety perspective with a physical one, as it requires messing around in the back of a PSU).

@KathrynBaker
Member Author

One thing I haven't tested that might be worth a try is to use the MOXA via its IP address and port number rather than the COM port, as that would remove NPort on Server 2012 as a potential issue.

@KathrynBaker
Member Author

There is now a spare TDK connected to COM20 (Port 16 on the first MOXA) on NDEMUONFE. Due to location and safety concerns, ideally we would be in the vicinity of the PSU (on the Mezzanine, in front of a Danfysik) while testing. We should not leave it plugged into the mains and turned on, and under no circumstances can we turn the output power on.

There are various tests we can attempt, and as this is a spare, we could potentially keep testing during cycle.

To be tried:

  • Connect to port on MOXA 'directly'
  • Borrow the port and try to connect from a Windows 7 LabVIEW system (just for sanity)
  • Borrow the port and try to connect from a Windows 7 EPICS system
  • Try the various connection options with a sniffer on the lines to see just what lines are going up and down each time

@kjwoodsISIS kjwoodsISIS added awaiting and removed ready labels Feb 9, 2017
@KathrynBaker
Member Author

James has confirmed that the PSUs are currently on (no current, don't apply one), so a restart would give us a number of them to try and test in different ways. I may go down tomorrow morning; otherwise I will be down there on Monday.

@KathrynBaker
Member Author

A quick update to add as I've discovered a few more things (no solution yet) - but if anyone has any ideas, please weigh in.

I haven't been able to try any of the alternatives yet, as adding the ports to other locations has not been successful. However, in making these comparisons I started noticing some differences in the Async-Settings when looking at the server of the Moxa, in particular for the ports that hadn't been talked to via HTERM. I realised that there may be an issue with the RTS/CTS that I hadn't noticed before. I have gone through those settings again and, rather than fixing the problem, I've realised that I can use them to recreate it.

e.g.

  1. Connect to COM17 (port 13 if looking at the Moxa directly) via HTERM
  2. Start the IOC
  3. GENESYS_01:1 starts and connects; the others continue to show the timeout issues
  4. Stop the IOC
  5. Include a setting of crtscts to N (don't use hardware flow control, which should be the setting on the MOXA)
  6. Start the IOC
  7. Nothing connects now; they all get timeouts
  8. Stop the IOC, do the connection via HTERM, restart the IOC, and any ports that have had a HTERM connect will now be working again

My next stop is to try variations of N and No in that setting of crtscts (and double check that parameter name), but I'm not convinced it isn't a red herring. At least we can now put this back into the error state without having to restart the computer - although just what is having this effect I'm not sure.

@ChrisM-S

ChrisM-S commented Apr 24, 2017 via email

@FreddieAkeroyd
Member

Looking at the asyn source code, you need to use single letters (Y and N), but the value is case insensitive. It uses the Win32 SetCommConfig() function, and the command Chris mentions prints out most of the relevant parameters.

@FreddieAkeroyd
Member

I've pushed a mod to asyn that will print out a lot of extra information via the dbior IOC command - you'd need to rebuild/redeploy to get it

@KathrynBaker
Member Author

KathrynBaker commented Apr 25, 2017 via email

@ChrisM-S

ChrisM-S commented Apr 25, 2017 via email

@KathrynBaker
Member Author

KathrynBaker commented Apr 25, 2017 via email

@KathrynBaker
Member Author

Changing the settings in the Moxa has no effect. It looks like this might be down to how Server 2012 is initialising the ports, so I'm going to go down that path to see if I can find a solution from that side, as there isn't one easily available in software.

@KathrynBaker
Member Author

Next update. I tried all the combinations, and from the st.cmd they will turn the DTR and RTS circuits to ON or to handshaking, but not one will set them to OFF.

I'll hold off using the extras from Freddie for a little while. However, I did try a couple of other things (including installing a new version of NPort Administrator!)

I used the same mode command to set the two specific fields to OFF, and that allows the port to communicate easily. However, even when set that way, the setting isn't maintained across a restart. But at least with that method we can write a bat script that can be called on instrument restart to get things back into working order.

Or can we call that line from the st.cmd? That would be even more reliable, and adaptable to other instruments.
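A rough sketch of the restart helper idea, written in C rather than as a bat file. It assumes the two fields are DTR and RTS and that the line takes the form `mode COMn dtr=off rts=off`; the exact command used above is not recorded in this thread, so treat that string as an assumption. `format_mode_off()`, `reset_ports()` and `dry_run()` are hypothetical names, and the COM range is whatever the PSUs happen to be mapped to.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "mode COM<n> dtr=off rts=off" (assumed form of the command).
 * Returns 0 on success, -1 if the buffer is too small. */
static int format_mode_off(char *buf, size_t size, int com_port)
{
    int n = snprintf(buf, size, "mode COM%d dtr=off rts=off", com_port);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Apply the command to every port in [first, last] through `run`.
 * Returns the number of ports for which formatting or running failed. */
static int reset_ports(int first, int last, int (*run)(const char *))
{
    char cmd[64];
    int failures = 0;

    for (int port = first; port <= last; port++) {
        if (format_mode_off(cmd, sizeof cmd, port) != 0 || run(cmd) != 0)
            failures++;
    }
    return failures;
}

/* Dry-run runner: records the last command instead of executing it,
 * so the sketch can be tried away from the instrument PC. */
static char last_command[64];

static int dry_run(const char *cmd)
{
    strncpy(last_command, cmd, sizeof last_command - 1);
    last_command[sizeof last_command - 1] = '\0';
    return 0;
}
```

On the instrument PC the standard `system` function could be passed as the runner, e.g. `reset_ports(17, 20, system)`, with the range adjusted to the real port mapping. Keeping the runner as a parameter means the port loop can be checked without touching any hardware.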

@FreddieAkeroyd
Member

The asyn code is:

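```c
/* Excerpt: handling of the "crtscts" option on Windows */
else if (epicsStrCaseCmp(key, "crtscts") == 0) {
    if (epicsStrCaseCmp(val, "Y") == 0) {
        /* Y: use CTS for output flow control, RTS in handshake mode */
        tty->commConfig.dcb.fOutxCtsFlow = TRUE;
        tty->commConfig.dcb.fRtsControl = RTS_CONTROL_HANDSHAKE;
    }
    else if (epicsStrCaseCmp(val, "N") == 0) {
        /* N: no CTS flow control, but the RTS line is still enabled */
        tty->commConfig.dcb.fOutxCtsFlow = FALSE;
        tty->commConfig.dcb.fRtsControl = RTS_CONTROL_ENABLE;
    }
    /* excerpt ends here */
```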

So setting it to N sets RTS_CONTROL_ENABLE, and there is currently no way to set RTS_CONTROL_DISABLE. If crtscts was specified in st.cmd, I'd expect the settings to remain unchanged, but maybe Windows applies some default? The previous mods I did just printed more information, and I think we have worked out what we need to do. I'm not sure whether RTS_CONTROL_ENABLE or RTS_CONTROL_DISABLE is correct above, or whether another option is required. I can add a special option for us (D for disable) for the moment and then query the correct setting with the EPICS mailing list or Mark Rivers.
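To make the proposed "D" value concrete, here is a portable sketch of what such a three-way option could look like. This is not the actual asyn change: `mock_dcb`, `equal_nocase()` and `set_crtscts()` are hypothetical stand-ins so the example compiles without `<windows.h>`, and the constants simply mirror the Win32 `RTS_CONTROL_*` values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-ins for the Win32 constants; on Windows they come from <windows.h>. */
enum {
    RTS_CONTROL_DISABLE = 0,
    RTS_CONTROL_ENABLE = 1,
    RTS_CONTROL_HANDSHAKE = 2
};

/* Only the two DCB fields touched by the crtscts option. */
typedef struct {
    int fOutxCtsFlow;   /* non-zero: wait for CTS before transmitting */
    int fRtsControl;    /* one of the RTS_CONTROL_* values */
} mock_dcb;

/* Case-insensitive equality, standing in for epicsStrCaseCmp() == 0. */
static int equal_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Apply a crtscts value to the DCB. Returns 0 on success, or -1 for an
 * unrecognised value, in which case the DCB is left untouched. */
static int set_crtscts(mock_dcb *dcb, const char *val)
{
    if (equal_nocase(val, "Y")) {
        dcb->fOutxCtsFlow = 1;
        dcb->fRtsControl = RTS_CONTROL_HANDSHAKE;
    } else if (equal_nocase(val, "N")) {
        dcb->fOutxCtsFlow = 0;
        dcb->fRtsControl = RTS_CONTROL_ENABLE;   /* RTS still driven */
    } else if (equal_nocase(val, "D")) {
        dcb->fOutxCtsFlow = 0;
        dcb->fRtsControl = RTS_CONTROL_DISABLE;  /* RTS turned off */
    } else {
        return -1;
    }
    return 0;
}
```

With this shape, "N" keeps its existing meaning, so configurations that already specify it are unaffected, while "D" is the only value that turns the RTS line off. Multi-letter values such as "No" are rejected rather than silently ignored.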

@FreddieAkeroyd
Member

In EPICS, disabling CRTSCTS or DTR flow control via the asyn options sets the line to ENABLE, i.e. permanently high; there is no option to "disable" the mechanism. I've now pushed a mod to asyn that allows both the crtscts and clocal options to be set to the value "D" to turn off the mechanisms.

@FreddieAkeroyd
Member

BTW @KathrynBaker @ChrisM-S good detective work :-)

@KathrynBaker
Member Author

Given the proximity to beam start, I'm not going to make any changes to NDEMUONFE this shutdown. As Freddie has handled the modification elsewhere, I'm going to mark this ticket as ready for review, and create a new one for modifying the setup on MUONFE before next cycle.
