Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SocketAborted problem #17

Open
MaxKinny opened this issue Mar 3, 2020 · 15 comments
Open

SocketAborted problem #17

MaxKinny opened this issue Mar 3, 2020 · 15 comments
Assignees

Comments

@MaxKinny
Copy link

MaxKinny commented Mar 3, 2020

My Environment:
Activated MMA 12.0 works correctly on my Ubuntu 16.04 system.

I install WolframClient with following code:

pip install wolframclient -i https://pypi.doubanio.com/simple

Then I execute following codes in python 3.6.10:

from wolframclient.evaluation import WolframLanguageSession 
sess = WolframLanguageSession(path) 
sess.start() 

where, the path I try different values like the result of :

$ which WolframKernel

in my case, it's '/usr/local/bin/WolframKernel' , or the output of

First@$CommandLine 

which is
'/usr/local/Wolfram/Mathematica/12.0/SystemFiles/Kernel/Binaries/Linux-x86-64/WolframKernel'
or use this path
'/usr/local/Wolfram/Mathematica/12.0/Executables/WolframKernel' which is default path of WolframLanguageSession().

Then errors are vary when I choose different paths, but can be concluded into 2 types.

  1. When path = '/usr/local/Wolfram/Mathematica/12.0/SystemFiles/Kernel/Binaries/Linux-x86-64/WolframKernel' , it immediately returns following errors info:
Socket exception: Socket operation aborted.
Failed to start.
---------------------------------------------------------------------------
SocketAborted                             Traceback (most recent call last)
~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _kernel_start(self)
    436                 timeout=self.get_parameter("STARTUP_TIMEOUT"),
--> 437                 abort_event=_StartEvent(self.kernel_proc, self.trigger_termination_requested),
    438             )

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/zmqsocket.py in recv_abortable(socket, timeout, abort_check_period, abort_event, **kwargs)
     49                     if abort_event.is_set():
---> 50                         raise SocketAborted("Socket operation aborted.")
     51                 if timeout and (time.perf_counter() - start > timeout):

SocketAborted: Socket operation aborted.

During handling of the above exception, another exception occurred:

WolframKernelException                    Traceback (most recent call last)
<ipython-input-4-84c234a8332d> in <module>
----> 1 sess.start()

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/localsession.py in start(self, block, timeout)
    155                 self.terminate()
    156             finally:
--> 157                 raise e
    158 
    159     def start_future(self):

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/localsession.py in start(self, block, timeout)
    150             future = self.start_future()
    151             if future and block:
--> 152                 future.result(timeout=timeout)
    153         except Exception as e:
    154             try:

~/anaconda3/envs/detectron2/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/detectron2/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in run(self)
    525             # Kernel start requested.
    526             if payload is self.START:
--> 527                 self._safe_kernel_start()
    528                 future.set_result(True)
    529                 future = None

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _safe_kernel_start(self)
    364                 self._kernel_terminate()
    365             finally:
--> 366                 raise e
    367 
    368     _KERNEL_OK = b"OK"

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _safe_kernel_start(self)
    358         """ Start a kernel. If something went wrong, clean-up resources that may have been created. """
    359         try:
--> 360             self._kernel_start()
    361         except Exception as e:
    362             logger.warning("Failed to start.")

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _kernel_start(self)
    454             logger.warning("Socket exception: %s", se)
    455             raise WolframKernelException(
--> 456                 "Failed to communicate with kernel: %s." % self.kernel
    457             )
    458 

WolframKernelException: Failed to communicate with kernel: [path].
  1. When path != '/usr/local/Wolfram/Mathematica/12.0/SystemFiles/Kernel/Binaries/Linux-x86-64/WolframKernel', it will wait for several seconds and return following errors:
Socket exception: Failed to read any message from socket tcp://127.0.0.1:39963 after 20.1 seconds and 200 retries.
Failed to start.
---------------------------------------------------------------------------
SocketOperationTimeout                    Traceback (most recent call last)
~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _kernel_start(self)
    436                 timeout=self.get_parameter("STARTUP_TIMEOUT"),
--> 437                 abort_event=_StartEvent(self.kernel_proc, self.trigger_termination_requested),
    438             )

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/zmqsocket.py in recv_abortable(socket, timeout, abort_check_period, abort_event, **kwargs)
     54                 "Failed to read any message from socket %s after %.1f seconds and %i retries."
---> 55                 % (socket.uri, time.perf_counter() - start, retry)
     56             )

SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39963 after 20.1 seconds and 200 retries.

During handling of the above exception, another exception occurred:

WolframKernelException                    Traceback (most recent call last)
<ipython-input-3-84c234a8332d> in <module>
----> 1 sess.start()

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/localsession.py in start(self, block, timeout)
    155                 self.terminate()
    156             finally:
--> 157                 raise e
    158 
    159     def start_future(self):

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/localsession.py in start(self, block, timeout)
    150             future = self.start_future()
    151             if future and block:
--> 152                 future.result(timeout=timeout)
    153         except Exception as e:
    154             try:

~/anaconda3/envs/detectron2/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/detectron2/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in run(self)
    525             # Kernel start requested.
    526             if payload is self.START:
--> 527                 self._safe_kernel_start()
    528                 future.set_result(True)
    529                 future = None

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _safe_kernel_start(self)
    364                 self._kernel_terminate()
    365             finally:
--> 366                 raise e
    367 
    368     _KERNEL_OK = b"OK"

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _safe_kernel_start(self)
    358         """ Start a kernel. If something went wrong, clean-up resources that may have been created. """
    359         try:
--> 360             self._kernel_start()
    361         except Exception as e:
    362             logger.warning("Failed to start.")

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py in _kernel_start(self)
    454             logger.warning("Socket exception: %s", se)
    455             raise WolframKernelException(
--> 456                 "Failed to communicate with kernel: %s." % self.kernel
    457             )
    458 

WolframKernelException: Failed to communicate with kernel: [path]
@DbxDev
Copy link
Contributor

DbxDev commented Mar 3, 2020

Some preliminary questions:

  • What version of the library do you grab from the package repository you specified?
  • Can you run the executables from a terminal and get a prompt? (In[1]:=)

@MaxKinny
Copy link
Author

MaxKinny commented Mar 3, 2020

Answer:

  • The version of wolframclient library is 1.1.2.
  • Yes. Every executable works well from all different paths.

By the way, my windows10_x86_64 laptop works well with MMA 11.3 and 1.1.2 library. On my laptop, I executed wl.NSolve('- 2*x + 3 == 0') expecting a "x=1.5" result, but only got a empty tuple. What's the usage of this NSolve function in python?

@DbxDev
Copy link
Contributor

DbxDev commented Mar 3, 2020

Can you provide a minimal example script that trigger the errors. From what I see above no code snippet is a working example (e.g. it is missing path).
Also make sure to close session or use with when initializing sessions otherwise orphan kernels are spawned.

@MaxKinny
Copy link
Author

MaxKinny commented Mar 4, 2020

Below two examples will give errors in one of 2 types depending on the path I choose.
minimal example1:

with WolframLanguageSession(path) as sess:
        sess.start()

minimal example2:

with WolframLanguageSession(path) as sess:
        sess.evaluate(wl.range(5))

@MaxKinny
Copy link
Author

MaxKinny commented Mar 4, 2020

Interestingly and strangely, I install MMA 12.0 and 1.1.2 library in docker container. In the container, minimal example scripts work well.

@DbxDev
Copy link
Contributor

DbxDev commented Mar 4, 2020

These are not minimal working example scripts. Think of a piece of code that could work if stored in a file and run using:
$ python3 example.py
For example they both miss the import statements.
Thanks.

@MaxKinny
Copy link
Author

MaxKinny commented Mar 4, 2020

These are not minimal working example scripts. Think of a piece of code that could work if stored in a file and run using

It's difficult to find the core part that generates the problem. But I tried to figure out it and found some clues.

Type 1 error occurs because abort_event.is_set() is True,where is_set() function will return self.subprocess.poll() is not None or self.abort_event.is_set(). I found that self.abort_event.is_set() is False, self.subprocess.poll() is 127. So it is self.subprocess that causes the problem.

Then I tried to find where the self.subprocess value is assigned. I found that in the 419 line of file

~/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py

in function _kernel_start(), there is a code block which assigns value to self.kernel_proc first and assgins value to self.subprocess later:

self.kernel_proc = Popen(cmd,
              stdin=self._stdin,
              stdout=self._stdout,
              stderr=self._stderr,
               startupinfo=startupinfo,)

so, I print out arguments that convey to the Popen():
cmd:

['/usr/local/Wolfram/Mathematica/12.0/SystemFiles/Kernel/Binaries/Linux-x86-64/WolframKernel', '-noprompt', '-initfile', '/home/[my_user_name]/anaconda3/envs/detectron2/lib/python3.6/site-packages/wolframclient/evaluation/kernel/initkernel.m', '-run', 'ClientLibrary`Private`SlaveKernelPrivateStart["tcp://127.0.0.1:41535", "tcp://127.0.0.1:40675"];']

stdin: -1, stdout: -1, stderr: -1, startupinfo: None

Then, 437 line in file kernelcontroller.py, self.kernel_proc will convey to self.subprocess by following codes:

abort_event=_StartEvent(self.kernel_proc, self.trigger_termination_requested)

, where self refers to class _StartEvent. But strangely, I found that the value of self.kernel_proc here is still None before conveying!

Therefore, I dug into function recv_abortable in file zmqsocket.py, to find out exactly when self.subprocess.poll() changes to 127. I changed the codes block of recv_abortable into

def recv_abortable(
            socket, timeout=None, abort_check_period=0.1, abort_event=None, **kwargs
        ):
            print(abort_event.is_set())  # print False 
            if not socket.bound:
                raise SocketException("ZMQ socket not bound.")
            print(abort_event.is_set())    # print True !!
            if timeout and timeout < 0:
                raise ValueError("Timeout must be a positive number.")

It seems that after executing socket.bound, self.subprocess.poll() changes to 127!

Hope it will help!

@cspipaon
Copy link

I think I have the same or similar issue on Wolfram Client Library version 1.1.4. I described this in this StackExchange post. My technical knowledge is limited but if there is a way to solve this please let me know.

@DbxDev
Copy link
Contributor

DbxDev commented Dec 14, 2020

It's not possible to debug issues without a minimal runnable script which hasn't been provided. As stated on the above mentioned StackExchange post, it's possible that some orphan kernels exist.
Use ps -eaf | grep "WolframKernel.*SlaveKernelPrivateStart" | grep -v grep to find them. If you have none, then please provide a working minimal example so I can try to understand what could go wrong.
Thanks.

@Stephan-Brandt
Copy link

Stephan-Brandt commented Jun 9, 2021

Facing the same issue on Ubuntu 20.04 with wolframclient==1.1.6 installed in a conda environment.

I cannot provide a minimal working example since nothing works, so here is my minimal crashing example:

from wolframclient.evaluation import WolframLanguageSession

with  WolframLanguageSession() as session:
    pass

This prints

Socket exception: Socket operation aborted.
Failed to start.

and the handling of that exception throws

Failed to communicate with kernel: /usr/local/Wolfram/Mathematica/12.1/Executables/WolframKernel.

Your explanation with the orphan kernels seems plausible since it worked once in the beginning, but the problem persists after reboot.

ps -eaf | grep "WolframKernel.*SlaveKernelPrivateStart" | grep -v grep does not find anything.

@DbxDev
Copy link
Contributor

DbxDev commented Jun 9, 2021

Here are some investigations to try narrow the source of bugs. The problem you're facing is that the kernel process ends abruptly. @MaxKinny that's why self.subprocess.poll() has a value. Whereas it should be None as long as the process is running (see doc).

Could you setup Python logging to see if there is anything useful printed using:

import logging
logging.basicConfig(level=logging.DEBUG)

It'll print to stdout. You should see the command used to start the WolframLanguage kernel. On my machine I get something like this:

DEBUG:wolframclient.evaluation.kernel.kernelcontroller:Kernel called using command: /Applications/Mathematica.app/Contents/MacOS/WolframKernel -noprompt -initfile /Users/dorianb/.../WolframClientForPython/wolframclient/evaluation/kernel/initkernel.m -run ClientLibrary`Private`KernelPrivateStart["tcp://127.0.0.1:57055", "tcp://127.0.0.1:57057"];.
INFO:wolframclient.evaluation.kernel.kernelcontroller:Kernel process started with PID: 33202

Note that this is the code for wolframclient 1.1.6. I'll use it later, and you'd need to use the one printed on your machine, if running an older version of the library.

Using this command, you can try to mock the python side with Wolfram Language and see if the problem remains.

In a fresh kernel, open two ZMQ sockets with appropriate types:

Needs["ZeroMQLink`"];
out = SocketOpen[Automatic, "ZMQ_PUSH"]
in = SocketOpen[Automatic, "ZMQ_PULL"]
portout = out["DestinationPort"]
portin = in["DestinationPort"]

Now edit the starting command with the above ports. In my case the command becomes:

/Applications/Mathematica.app/Contents/MacOS/WolframKernel -noprompt -initfile /Users/dorianb/.../WolframClientForPython/wolframclient/evaluation/kernel/initkernel.m -run 'ClientLibrary`Private`KernelPrivateStart["tcp://127.0.0.1:53004", "tcp://127.0.0.1:53008"];'

And run it in a terminal. Hopefully you'll see an error in your terminal that will help debugging your issue.

While the above runs, go back to the kernel and run the following.
First read the "OK" string returned when the process has started:

ByteArrayToString[SocketReadMessage[in]] === "OK"

If you get this it means that the kernel initialization was successful. I doubt you'll receive the message. But if you do, you can also try to evaluate something using:

ZMQSocketWriteMessage[out, BinarySerialize[{$MachineName, Now}]]
wxf = SocketReadMessage[in]

@DbxDev
Copy link
Contributor

DbxDev commented Jun 9, 2021

Also for the record, I just was able to run a 12.0 kernel using the version 1.1.6 of the library, but on MacOS, not Ubuntu.

@DbxDev DbxDev self-assigned this Jul 13, 2021
@DbxDev DbxDev mentioned this issue Jul 15, 2021
@GSam
Copy link

GSam commented Sep 16, 2021

import logging
logging.basicConfig(level=logging.DEBUG)

Used this code to help debug and realized that one of my Wolfram folders had the wrong permissions on Ubuntu. Must have accidentally run as root somewhere. Thanks for that tip!

@DbxDev
Copy link
Contributor

DbxDev commented Sep 16, 2021

I'm glad you were able to track this down.
Do you mind telling me what type of folder it was? I may add some pre-flight checks to the library so that it provides better error messages.

@GSam
Copy link

GSam commented Sep 16, 2021

I'm glad you were able to track this down.
Do you mind telling me what type of folder it was? I may add some pre-flight checks to the library so that it provides better error messages.

/home/ubuntu/.cache/Wolfram/WolframScript accidentally belonged to root.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants