Random segmentation fault (core dumped) in ARM64 #151

Closed
su600 opened this issue Dec 30, 2020 · 38 comments

Comments

@su600

su600 commented Dec 30, 2020

Hi,
My program reads tag values in a loop, once per second.
It runs in Docker (base image is Python 3.7.4) on an ARM64 platform. I have tested many times, and this segmentation fault (core dumped) happens randomly.
Both pylogix==0.6.4 and 0.7.7 have this problem.

My code looks like this:

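import time
from pylogix import PLC

# rockwellip, taglist and tagValue are defined elsewhere in the program

def rockwellread(rockwellip):
    comm = PLC()  # a new driver instance is created on every call
    comm.IPAddress = rockwellip
    for i in range(len(taglist)):
        tagValue[i] = comm.Read(taglist[i]).Value
        # tagValue[i] = random.random()
    return tagValue

while 1:
    rockwellread(rockwellip)
    time.sleep(1)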

With this code, random.random() never causes a segmentation fault (core dumped). Once I use comm.Read, it happens randomly, maybe after dozens or hundreds of loops.

I am not familiar with C/C++, and don't know how to solve this. Please help.

@TheFern2
Collaborator

TheFern2 commented Dec 30, 2020

Hi @su600, this isn't C++, it's Python. Can you do me a favor and print out the i value to see if it is faulting on the same tag? If it happens at the same i value, let us know the tag type.

Also post the full stack trace when it happens.
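
For the stack trace, one option worth trying is Python's standard-library faulthandler module, which can dump the Python traceback when the interpreter receives SIGSEGV. The sketch below also prints the loop index before each read, so the last line in the log shows which tag was in flight. The helper names (enable_crash_traces, read_with_index) and the read callable are made up for illustration; they are not part of pylogix.

```python
import faulthandler
import sys


def enable_crash_traces():
    """Ask the interpreter to dump the Python traceback on SIGSEGV and similar signals."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


def read_with_index(read, tags):
    """Read tags one at a time, printing the index first so a crash can be located."""
    values = []
    for i, tag in enumerate(tags):
        # flush=True so the line reaches the container log before a possible crash
        print(f"reading i={i} tag={tag}", flush=True)
        values.append(read(tag))
    return values
```

With pylogix this might be called as read_with_index(lambda t: comm.Read(t).Value, taglist), after calling enable_crash_traces() once at startup. Running the script with python -X faulthandler has the same effect as calling faulthandler.enable().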

@su600
Author

su600 commented Dec 30, 2020

I know pylogix is a Python project, so it is weird that a segmentation fault (core dumped) happens.
Maybe pylogix is a wrapper around a C API? When the problem happens, the Docker container exits immediately and there is no further error information.

My code above is just an example to test this problem. The i value at the crash is random, and the number of loops that run before the segmentation fault (core dumped) is also random.

@evaldes2015

Maybe the segfault comes from the IP stack? Since your code is running the read in a tight loop you may be overwhelming the device you are talking to. If this happens it might send garbage back.

Put something like time.sleep(0.020) in your loop and see if this prevents the segfault.

@dmroeder
Owner

pylogix is a pure python project.
Full stack trace might be helpful as @TheFern2 points out.

One problem I see with your example, each time you are calling rockwellread(), you are creating a new instance of the driver while not closing it. The PLC will eventually flush the connections, but you are creating new connections faster than the PLC will flush them, it will eventually get tired of this.

Why read each tag individually instead of reading the whole list at once? It would be much faster to read the list.

import time
from pylogix import PLC

tag_list = ["Result_Value", "SwitchValue", "Time.Minute"]

def my_function(tag_list):
    ret = comm.Read(tag_list)
    return ret

with PLC("192.168.1.61") as comm:
    while True:
        x = my_function(tag_list)
        print(x)
        time.sleep(1)

@TheFern2
Collaborator

TheFern2 commented Dec 30, 2020

@evaldes2015 He already has a time.sleep(1), one second, in the initial example. 0.2 is milliseconds, 0.02 is what, microseconds? Point is, that will make it faster, not slower. I agree he might as well add a big timeout of at least 10s; if the segmentation error still occurs, then it is most likely an issue with the container. That's of course after trying Dustin's previous example.

@dmroeder Nice catch on the number of PLC instances without closing.

@evaldes2015

@TheFern2 If you look at his code, there's a tight loop inside his readrockwell function where he reads a bunch of tags. He only sleeps between calls to the function, not between reads. If his device is an ENBT, he might overload it by doing this.

@dmroeder
Owner

I think @evaldes2015 was suggesting putting a 20ms delay between each read. That would be a little different than the 1 second call to start the read process.

I agree with you @TheFern2 , the segmentation fault is likely some container issue. I don't work with containers, so it's a little hard for me to troubleshoot. I don't believe this to be a pylogix issue, though I'm willing to be wrong. Of course, how the connection is handled is still an issue.

@TheFern2
Collaborator

@evaldes2015 Ah gotcha, yes another sleep inside the readrockwell, I misunderstood. I would make all time sleeps 5 or 10s just for testing purposes, and slowly decrease them. Just to ensure the seg error doesn't occur. Because if it does occur even with timeouts then we'd be out of ideas as far as pylogix goes.

@dmroeder
Owner

Sleep is in seconds. 0.2 seconds = 200ms. 0.02 seconds = 20ms
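
To make the two delays concrete: time.sleep takes seconds, so 0.020 is 20 ms. Below is a minimal sketch of the difference between pausing between individual reads and pausing between whole read cycles. read_cycle, poll and the read callable are illustrative names, not pylogix API.

```python
import time


def read_cycle(read, tags, per_read_delay=0.020):
    """Read each tag in turn, pausing per_read_delay seconds (0.020 s = 20 ms) between reads."""
    values = []
    for n, tag in enumerate(tags):
        if n:
            time.sleep(per_read_delay)  # delay between reads, inside the cycle
        values.append(read(tag))
    return values


def poll(read, tags, cycles, cycle_delay=1.0, per_read_delay=0.020):
    """Run a fixed number of cycles, pausing cycle_delay seconds after each one."""
    results = []
    for _ in range(cycles):
        results.append(read_cycle(read, tags, per_read_delay))
        time.sleep(cycle_delay)  # delay between cycles, like the time.sleep(1) in the original loop
    return results
```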

@TheFern2
Collaborator

TheFern2 commented Dec 30, 2020

@su600 post details on image and container if @dmroeder and @evaldes2015 suggestions don't work.

Oh nvm is on initial post, I don't have an arm64, that's not a raspberry pi, is it?

@su600
Author

su600 commented Dec 31, 2020

My hardware is based on an NXP ARM Cortex-A53.
Actually, my project collects data with pylogix, sends it over a socket to an OPC UA server, and also writes it to InfluxDB. After many tests, we have traced the segfault to pylogix's Read, so the code I posted before is just a simplified test program.

My original code looks like this. I already use Read inside a with block, and read from a tag list.

import time
import os
import csv
import socket
import pandas as pd
import logging

from influxdb_client import InfluxDBClient, Point, WriteOptions
from influxdb_client.client.write_api import SYNCHRONOUS

from pylogix import PLC
######################### Rockwell ##############################
global rockwellip,opcuaip,opcuaport,cycle,rockwell_device_list,taglist

with open('./configure.txt') as f:
    print('[Reading configuration file configure.txt]')
    ff=f.readlines()
    rockwellip = ff[0][:-1].replace("'",'').split('=')[1]
    opcuaip= ff[1][:-1].replace("'",'').split('=')[1]
    opcuaport=int(ff[2][:-1].split('=')[1])
    cycle= int(ff[3].split('=')[1])
    print(f'Device address: {rockwellip}  Acquisition cycle: {cycle}s')
    logging.warning(f'Device address: {rockwellip}  Acquisition cycle: {cycle}s')
    print(f'UA Server address: {opcuaip}:{opcuaport}')
    logging.warning(f'UA Server address: {opcuaip}:{opcuaport}')

    ###########################  InfluxDB connection settings  ##########################################
    influxdbip = 'http://' + ff[4][:-1].replace("'", '').split('=')[1]
    influxdbport = ff[5][:-1].replace("'", '').split('=')[1]
    token = ff[6][:-1].replace("'", '').split('=')[1] + '=='
    bucket = ff[7][:-1].replace("'", '').split('=')[1]
    org = ff[8][:-1].replace("'", '').split('=')[1]
    influxdbip=influxdbip+":"+influxdbport
    print(influxdbip,token, bucket, org)
    ###########################################################################################
client = InfluxDBClient(url=influxdbip, token=token, org=org)
write_api = client.write_api(write_options=SYNCHRONOUS)

'''
    get taglist and uaname from variables.csv
'''
print('[Reading variable table file variables.csv]')
variables=pd.read_csv('variables.csv')
taglist=variables['tagname'].tolist()
uaname=variables['uaname'].tolist()


def rockwellread(rockwellip,taglist):   
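    '''
    Data acquisition read function
    rockwellip: device IP address, from configure.txt
    taglist: variable table, from variables.csv
    '''
    print('Start reading variable data')
    # logging.warning('Start reading variable data')
    taglist=taglist

    ### read tag values by group, 10 tags one group
    def readten(tag_list):
        l = len(tag_list)  # length of the tag list; more than 10 tags must be read in batches to avoid errors
        x,y=divmod(l,10) # built-in that returns the quotient and remainder
        xx = 1  ## flag for the fewer-than-10 case, where the second if below would read an empty list; set to 0 in that case
        if x == 0:
            x = 1  # fewer tags than one full group: force a single pass
            xx = 0
        a = 0  # index of the current group
        val = []  # accumulated values from every group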
       
        for n in range(x):
            if n < x:
                val = val + comm.Read(tag_list[10 * a:10 * (a + 1)])
                a += 1
                n += 1
            if n == x and y != 0 and xx!=0:
                val = val + comm.Read(tag_list[10 * a:10 * a + y])
        vall=val
        return vall

    with PLC() as comm:
        comm.IPAddress=rockwellip
        tagname = []
        tagvalue = []
        aa = readten(taglist)  # call the function to read the tags in batches
 
    print("readten done")
    for a in aa:
        tagvalue.append(a.Value)
   
    return tagvalue

def opcuasocket(opcuaip,opcuaport,cycle,uaname,bucket,org):

    cc = int(cycle)
    ii = 1
    while 1:
        tagvalue = rockwellread(rockwellip,taglist)
        logging.warning(tagvalue)
        logging.warning(f'Loop Time {ii}')
        ii += 1
        time.sleep(cc)

############################ MAIN #######################################
print("Program Start")
opcuasocket(opcuaip,opcuaport,cycle,uaname,bucket,org)


I have also set ulimit -s and ulimit -c for the container, but the problem still happens.

Maybe it is a problem with Docker or something else. I'll do more tests with the Docker container and other Python versions.
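
The batching that readten performs (at most 10 tags per Read call) can also be written with a stepped range, which handles the short final group and the fewer-than-10 case without extra flags. This is only a sketch of the same idea; chunked and read_in_batches are hypothetical helper names, and read_many stands in for a call such as comm.Read that accepts a list and returns a list.

```python
def chunked(items, size=10):
    """Yield consecutive slices of items, each holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def read_in_batches(read_many, tags, size=10):
    """Call read_many once per batch and concatenate the results in order."""
    results = []
    for batch in chunked(tags, size):
        results.extend(read_many(batch))
    return results
```

Inside the with PLC() block, this might be used as aa = read_in_batches(comm.Read, taglist).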

@su600
Author

su600 commented Dec 31, 2020

One more thing, my taglist is like this:

PC_TEMPZONE[122]
PC_TEMPZONE[123]
PC_TEMPZONE[124]
PC_TEMPZONE[125]
RML_INPUT_ERROR_ALARM[1]
RML_INPUT_ERROR_ALARM[2]
RML_INPUT_ERROR_ALARM[3]
RML_INPUT_ERROR_ALARM[4]
RML_INPUT_ERROR_ALARM[5]
RML_INPUT_ERROR_ALARM[6]
RML_INPUT_ERROR_ALARM[7]
RML_INPUT_ERROR_ALARM[8]
RML_INPUT_ERROR_ALARM[9]
RML_INPUT_ERROR_ALARM[10]
RML_INPUT_ERROR_ALARM[11]
RML_INPUT_ERROR_ALARM[12]
RML_INPUT_ERROR_ALARM[13]
RML_INPUT_ERROR_ALARM[14]
RML_INPUT_ERROR_ALARM[15]
RML_INPUT_ERROR_ALARM[16]
RML_INPUT_ERROR_ALARM[17]
RML_INPUT_ERROR_ALARM[18]
RML_INPUT_ERROR_ALARM[19]
RML_INPUT_ERROR_ALARM[20]
RML_INPUT_ERROR_ALARM[21]
RML_INPUT_ERROR_ALARM[22]
RML_INPUT_ERROR_ALARM[23]
RML_INPUT_ERROR_ALARM[24]
Local:12:O.Data.0
HYD_ENABLED
VACUUM_PUMP_ENABLED

The first 4 tags are temperatures of type float, the RML_INPUT_ERROR_ALARM tags are alarm signals of type bool, and the last 3 tags are also bool. For general use, the tag list is imported from a CSV file and can be edited outside the program, so I list every element of the array in the CSV file rather than reading it as an array.
With pylogix==0.6.4, I get the right value for all of the above.
After updating to 0.7.7, I get "None" for the RML_INPUT_ERROR_ALARM[x] tags when reading this tag list; the other tags return the right value.
Reading the single tag RML_INPUT_ERROR_ALARM[1] on its own works fine.
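
If the CSV has to keep listing individual elements, consecutive elements of the same array could still be collapsed into a starting tag plus a count before reading. The sketch below only does the grouping on tag names; group_array_runs is a hypothetical helper, not something pylogix provides.

```python
import re

# Matches names such as RML_INPUT_ERROR_ALARM[12]; names without a trailing [n] do not match.
_ELEMENT = re.compile(r"^(?P<base>.+)\[(?P<index>\d+)\]$")


def group_array_runs(tags):
    """Collapse consecutive NAME[i], NAME[i+1], ... tags into (first_tag, count) pairs."""
    runs = []
    expected = None  # (base, index) that would extend the current run
    for tag in tags:
        match = _ELEMENT.match(tag)
        if match is None:
            runs.append((tag, 1))
            expected = None
            continue
        base, index = match.group("base"), int(match.group("index"))
        if expected == (base, index):
            first_tag, count = runs[-1]
            runs[-1] = (first_tag, count + 1)
        else:
            runs.append((tag, 1))
        expected = (base, index + 1)
    return runs
```

Each (first_tag, count) pair could then be passed to a counted read such as comm.Read(first_tag, count). Whether that behaves as hoped for BOOL arrays on a particular controller and pylogix version is something to verify, not something this sketch guarantees.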

@dmroeder
Owner

dmroeder commented Dec 31, 2020

So for your last post, you say that if you read those tags as a list... for example...

tags = ['PC_TEMPZONE[122]',
            'PC_TEMPZONE[123]',
            'PC_TEMPZONE[124]',
            'PC_TEMPZONE[125]',
            'RML_INPUT_ERROR_ALARM[1]',
            'RML_INPUT_ERROR_ALARM[2]',
            'RML_INPUT_ERROR_ALARM[3]',
            'RML_INPUT_ERROR_ALARM[4]',
            'RML_INPUT_ERROR_ALARM[5]',
            'RML_INPUT_ERROR_ALARM[6]']
ret = comm.Read(tags)
print(ret)

... you have a value of None returned for the BOOL's but the REAL's will have the correct value? I just tried this and the BOOL's return True/False

edit: when the value is None, is the status "Success", or something else?

@su600
Author

su600 commented Dec 31, 2020

So for your last post, you say that if you read those tags as a list... for example...

tags = ['PC_TEMPZONE[122]',
            'PC_TEMPZONE[123]',
            'PC_TEMPZONE[124]',
            'PC_TEMPZONE[125]',
            'RML_INPUT_ERROR_ALARM[1]',
            'RML_INPUT_ERROR_ALARM[2]',
            'RML_INPUT_ERROR_ALARM[3]',
            'RML_INPUT_ERROR_ALARM[4]',
            'RML_INPUT_ERROR_ALARM[5]',
            'RML_INPUT_ERROR_ALARM[6]']
ret = comm.Read(tags)
print(ret)

... you have a value of None returned for the BOOL's but the REAL's will have the correct value? I just tried this and the BOOL's return True/False

edit: when the value is None, is the status "Success", or something else?

Value=None, status=path destination unknown.

@dmroeder
Owner

So Path Destination Unknown is the PLC's response, meaning it cannot find the tag you are requesting. I would triple check that the tag is spelled right when you are pulling it from the CSV. I've seen where people don't strip the whitespace, leaving a space at the beginning or end when they are parsing a CSV/TXT file. It's not always obvious. When they type the tag name themselves, they get it right and it works.
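
A quick way to rule this out is to normalize the names right after loading and to print the suspicious ones with repr, which makes stray spaces visible. This is a minimal sketch, assuming the names arrive as a list of strings; clean_tag_names and changed_by_cleaning are hypothetical helpers.

```python
def clean_tag_names(raw_names):
    """Strip surrounding whitespace from every name and drop entries that end up empty."""
    cleaned = [str(name).strip() for name in raw_names]
    return [name for name in cleaned if name]


def changed_by_cleaning(raw_names):
    """Return the repr of every name that stripping would alter, for easy inspection."""
    return [repr(name) for name in raw_names if str(name) != str(name).strip()]
```

For the code in this thread, that could mean taglist = clean_tag_names(variables['tagname'].tolist()).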

@TheFern2
Collaborator

TheFern2 commented Dec 31, 2020

I think we have multiple layers here.

First we need to establish that your code that reads tags works fine from a laptop without using docker. I'm talking just pylogix and reading a list of tags, do not add any db, opc, or any other logic.

If that code works fine without docker, next is to try it on docker again just pylogix. If that gives you a seg fault then is definitely a docker issue.

https://dev.to/mizutani/how-to-get-core-file-of-segmentation-fault-process-in-docker-22ii

If pylogix works fine on both laptop and container, then add the other logic little by little, first db, then opc, etc. Containers are super fragile for example I've had a container that wouldn't stay on just because a db wasn't named properly.

Edit: I wrote all this before the last two responses.

@TheFern2
Collaborator

TheFern2 commented Dec 31, 2020

Btw, I would suggest you wrap the code where you read tags in a try/except and print tag.Name and tag.Status in the except branch, to make it easy to see which tags are failing.
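
Something along these lines, as a sketch: it wraps the read in try/except and reports every response whose status is not Success. The TagName, Value and Status attribute names are an assumption about the response objects in the installed pylogix version, so adjust them if they differ; failed_reads is a hypothetical helper.

```python
def failed_reads(read_many, tags):
    """Return (tag, status) pairs for every tag that did not read successfully."""
    try:
        responses = read_many(tags)
    except Exception as exc:
        # Python-level errors only; a segmentation fault cannot be caught here.
        return [(tag, f"exception: {exc!r}") for tag in tags]
    return [(r.TagName, r.Status) for r in responses if r.Status != "Success"]
```

Called as failed_reads(comm.Read, taglist), an empty result would mean every tag in the list read with status Success.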

@dmroeder
Owner

@TheFern2 you are right, there are two problems going on here. Reading and the segmentation fault.

Segmentation fault suggests that the python interpreter crashed. From my experience, this happens when a program which binds to some other language crashes. For example, OpenCV, where they have a python layer that binds to C. I don't believe pylogix is causing the segmentation fault, something else is crashing.

@su600
Author

su600 commented Dec 31, 2020

I think we have multiple layers here.

First we need to establish that your code that reads tags works fine from a laptop without using docker. I'm talking just pylogix and reading a list of tags, do not add any db, opc, or any other logic.

If that code works fine without docker, next is to try it on docker again just pylogix. If that gives you a seg fault then is definitely a docker issue.

https://dev.to/mizutani/how-to-get-core-file-of-segmentation-fault-process-in-docker-22ii

If pylogix works fine on both laptop and container, then add the other logic little by little, first db, then opc, etc. Containers are super fragile for example I've had a container that wouldn't stat on just because a db wasn't named properly.

Edit: I wrote all this before the last two responses.

Yes, we are doing the same thing you described, and it will take some time.

@su600
Author

su600 commented Dec 31, 2020

So Path Destination Unknown is the PLC's response, meaning it cannot find the tag you are requesting. I would triple check that the tag is spelled right when you are pulling it from the CSV. I've seen where people don't strip the whitespace, leaving a space at the beginning or end when they are parsing a CSV/TXT file. It's not always obvious. When they type the tag name themselves, they get it right and it works.

The tag name is right and nothing in the code changed; I only updated pylogix from 0.6.4 to 0.7.7.
Maybe it is related to the order of the tag list.

@dmroeder
Owner

So Path Destination Unknown is the PLC's response, meaning it cannot find the tag you are requesting. I would triple check that the tag is spelled right when you are pulling it from the CSV. I've seen where people don't strip the whitespace, leaving a space at the beginning or end when they are parsing a CSV/TXT file. It's not always obvious. When they type the tag name themselves, they get it right and it works.

The tag name is right and nothing in the code changed; I only updated pylogix from 0.6.4 to 0.7.7.
Maybe it is related to the order of the tag list.

Interesting. I read a list of 10 tags or so where the first few were REAL and the rest were individual BOOL's of an array. I'll do some more experiments to see if maybe the number of tags matter.

@SDEarl

SDEarl commented Dec 31, 2020

Can confirm this same issue ("None" returned for bools when a list of tags is read) and had to downgrade to 0.6.7 before it would work. Single tag reads worked though.

Edit: I keep the tag lists below ten items to prevent issues

@dmroeder
Owner

Can confirm this same issue ("None" returned for bools when a list of tags is read) and had to downgrade to 0.6.7 before it would work. Single tag reads worked though.

Edit: I keep the tag lists below ten items to prevent issues

Curious, what controller are you working with?

@SDEarl

SDEarl commented Dec 31, 2020

ControlLogix, but I don't know the model off the top of my head. I had used 0.6.7 on a different project with the same PLC. I switched to Python 3.9 in between projects and upgraded pylogix at that time. I had a heck of a time trying to get this to work and went back to the version I had used previously, so I couldn't tell you whether it worked on any version between 0.6.7 and 0.7.x.
Edit: Not sure if it helps at all, but I have a direct PC-to-PLC Ethernet connection with nothing in between.

@TheFern2
Collaborator

@dmroeder I've added a todo for myself on the project to add a list read for booleans to PylogixTests.py; I guess that one slipped through the cracks. We had a multiread test but not an extensive one. I wonder if it fails after a certain number of bytes.

@TheFern2
Collaborator

@dmroeder I am about to push a new test to master that tests BaseBOOLArray. At first glance it looks like the latest code can only read 4 bool tags with Success; 5 and above all return AssertionError: 'Success' != 'Path destination unknown'. The problem with our initial multi-read test was that I wasn't testing a big list, and the multiread fixture really only made sure it got a response. Now I am checking that the status is Success.

@su600
Author

su600 commented Jan 1, 2021

Happy new year!

@dmroeder @TheFern2

I found this issue:
python-pillow/Pillow#1935
Maybe we can learn something from it, something about Docker container privileges or stack size control.

Sorry, I accidentally tapped the close button on my phone.

@su600 su600 closed this as completed Jan 1, 2021
@su600 su600 reopened this Jan 1, 2021
@kyle-github

@su600 I am not tracking what the Pillow issue has to do with this? Are you suggesting that pylogix has deep stack usage and therefore runs into problems on Alpine?

There is an informative post about a similar problem with Alpine/musl.

@TheFern2
Collaborator

TheFern2 commented Jan 1, 2021

@dmroeder did a bit of troubleshooting this morning, I can confirm latest 0.7.7 there's a bug for boolean list read. #154 will def prevent this in the future as far as testing goes.

C:\git\pylogix>py -3.7 pylogix/testing/test.py

########### List Read

BaseBOOLArray[0] True Success
BaseBOOLArray[1] False Success
BaseBOOLArray[2] False Success
BaseBOOLArray[3] False Success
BaseBOOLArray[4] None Path destination unknown
BaseBOOLArray[5] None Path destination unknown
BaseBOOLArray[6] None Path destination unknown
BaseBOOLArray[7] None Path destination unknown

########### Array Read

BaseBOOLArray[0] [True, False, True, False, True, False, True, False, True, False] Success

However the same test passes fine in 0.7.5:

########### List Read

BaseBOOLArray[0] True Success
BaseBOOLArray[1] False Success
BaseBOOLArray[2] False Success
BaseBOOLArray[3] False Success
...
BaseBOOLArray[124] True Success
BaseBOOLArray[125] False Success
BaseBOOLArray[126] True Success
BaseBOOLArray[127] True Success

########### Array Read

BaseBOOLArray[0] [True, False, True, False, True, False, True, False, True, False] Success

@dmroeder
Owner

dmroeder commented Jan 2, 2021

I pushed a commit that should take care of this BOOL array in list issue.

@su600
Author

su600 commented Jan 7, 2021

@dmroeder @TheFern2
After adding settrace to my code, I located the line where the segmentation fault occurs.

2020-06-02T13:13:39.901294250Z line, /usr/local/lib/python3.8/site-packages/pylogix/eip.py:872
2020-06-02T13:13:39.901389875Z call, /usr/local/lib/python3.8/site-packages/pylogix/eip.py:1618
2020-06-02T13:13:39.901415875Z line, /usr/local/lib/python3.8/site-packages/pylogix/eip.py:1626
2020-06-02T13:13:39.901441125Z line, /usr/local/lib/python3.8/site-packages/pylogix/eip.py:1627
2020-06-02T13:13:39.912815875Z Segmentation fault (core dumped)

site-packages/pylogix/eip.py:1627 is part = self.Socket.recv(4096) in the recv_data function.

The segmentation fault occurs randomly and has nothing to do with the tag list; it also occurs when I read just 1 tag in a loop.

Could you analyze the code and provide a solution to avoid this? Thank you

@evaldes2015

You're assuming that the issue is in pylogix. It may be network hardware or a driver. Have you done anything to rule that out?

@su600
Author

su600 commented Jan 7, 2021

You're assuming that the issue is in pylogix. It may be network hardware or a driver. Have you done anything to rule that out?

I mean it is an issue with the Python socket. I'm going to run a test on a Raspberry Pi.
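
One way to test that idea without pylogix or a PLC in the picture is to exercise socket.recv in isolation on the same container. The sketch below is not taken from pylogix; it uses a local socket pair from the standard library and repeatedly calls recv(4096), the same call that appears at eip.py:1627 in the trace. If this loop also crashes on the ARM64 image, the problem sits below pylogix. If it never does, that is weaker evidence, since a local socket pair does not go through the network driver. The function name recv_soak is hypothetical.

```python
import socket


def recv_soak(iterations=1000, payload=b"x" * 512):
    """Send and receive a payload over a local socket pair; return total bytes received."""
    left, right = socket.socketpair()
    received = 0
    try:
        for _ in range(iterations):
            left.sendall(payload)
            remaining = len(payload)
            while remaining:
                part = right.recv(4096)
                if not part:
                    raise ConnectionError("socket closed unexpectedly")
                remaining -= len(part)
                received += len(part)
    finally:
        left.close()
        right.close()
    return received
```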

@dmroeder
Owner

dmroeder commented Jan 7, 2021

@su600, do you experience this when you run your code outside of docker?

@TheFern2
Collaborator

TheFern2 commented Jan 7, 2021

When you say you ran one tag in a loop, did you happen to put a time.sleep in the loop? Please put a time.sleep(1) and monitor, if the issue occurs keep increasing sleep by 1s, until the stack trace doesn't occur. This looks more like the container network resources are throttling the connection, but that's just an educated guess I don't deal with docker much for pylogix.

@su600
Author

su600 commented Jan 7, 2021

The segmentation fault occurred at line 1627 six times in a row. After more tests, the faulting line also turned out to be random, but it is mostly related to the socket.

I have put sleep(5) in each loop, and even added sleep(0.1) and a fresh with PLC() between each Read when reading by group, and it still happens. I'll do more tests with the sleep duration.

Our hardware OS is built in-house, and it is complicated to prepare the environment outside of Docker.
I'll do some tests on a Raspberry Pi first.
I think it may have nothing to do with pylogix.
Maybe the fault is in our OS, Docker, the hardware, or the socket, and pylogix just exposed the problem.

@TheFern2
Collaborator

TheFern2 commented Jan 7, 2021

I would check if you can run a network monitoring tool for the container and see if it's hitting its limits. Also, did you put a try/except around the tag reads? This will certainly prevent the program from crashing. When the except branch is hit, log to a file and see how many times it happens. You might also want to log cpu_percent and virtual_memory from psutil. You could also log these two all the time and then compare with the entries logged in the except branch to see if there's a huge change.
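
A sketch of what that logging could look like. psutil is a third-party package and must be installed separately; cpu_percent and virtual_memory are its functions. Note that try/except only catches Python exceptions, so it will record socket errors and timeouts but cannot intercept a segmentation fault, which kills the interpreter. logged_read and resource_snapshot are hypothetical names.

```python
import logging

try:
    import psutil  # third-party; optional in this sketch
except ImportError:
    psutil = None

log = logging.getLogger("plc_reader")


def resource_snapshot():
    """Return CPU and memory usage percentages, or None values when psutil is missing."""
    if psutil is None:
        return {"cpu_percent": None, "mem_percent": None}
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
    }


def logged_read(read_many, tags):
    """Run one read; on a Python exception, log it with a resource snapshot and return None."""
    try:
        return read_many(tags)
    except Exception:
        log.exception("read failed; resources=%s", resource_snapshot())
        return None
```

Calling logging.basicConfig(filename="reader.log", level=logging.INFO) once at startup would send these entries to a file.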

@su600
Author

su600 commented Jan 7, 2021

I have tested on Raspberry Pi 3B, armv7l, nothing goes wrong, both in Docker and outside Docker.
I'll close this issue.
Thank you all.

@su600 su600 closed this as completed Jan 7, 2021