import problem on titan #1243

Closed
andre-merzky opened this issue Feb 24, 2017 · 3 comments

@andre-merzky
Member

In an installation that worked before, I suddenly see an import problem popping up in the agent:

2017-02-23 19:01:56,719: agent_0.AgentWorker.0.child: agent_0.AgentWorker.0           : MainThread     : ERROR   : ERROR in agent main loop: cannot import name jsonapi
Traceback (most recent call last):
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/worker/agent.py", line 515, in idle_cb
    return self.check_units()
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/worker/agent.py", line 560, in check_units
    self.advance(cu_list, publish=False, push=True, prof=False)
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 961, in advance
    Component.advance(self, units, state, publish, push, prof)
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 903, in advance
    output.put(_unit)
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 513, in put
    _uninterruptible(self._q.send_json, msg)
  File "/lustre/atlas/scratch/merzky1/bip103/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017220.0001-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 45, in _uninterruptible
    return f(*args, **kwargs)
  File "/lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/ve_titan/lib/python2.7/site-packages/zmq/sugar/socket.py", line 506, in send_json
ImportError: cannot import name jsonapi

However, jsonapi is installed and loadable in the pilot ve:

$ source ../ve_titan/bin/activate
$ python -c 'import jsonapi'
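
A note on the check above: it imports a top-level `jsonapi` module, while the traceback fails inside zmq's `send_json`. The error wording ("cannot import name") suggests a `from ... import jsonapi` statement, most likely against a zmq subpackage. Assuming the module in question is `zmq.utils.jsonapi` (an assumption, not confirmed in this thread), a minimal sketch of a closer probe could look like this. `probe_import` is a hypothetical helper, not part of radical.pilot or pyzmq; it also reports which file each module was loaded from, which helps spot a second, shadowing install.

```python
import importlib


def probe_import(module_name):
    """Try to import module_name and report where it was loaded from."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Keep the message so it can be compared with the agent log.
        return {"module": module_name, "ok": False, "file": None, "error": str(exc)}
    return {
        "module": module_name,
        "ok": True,
        "file": getattr(module, "__file__", None),
        "error": None,
    }


if __name__ == "__main__":
    # 'zmq.utils.jsonapi' is an assumed module path, see the note above.
    for name in ("zmq", "zmq.utils.jsonapi"):
        print(probe_import(name))
```

Running this with the ve's python on both the login node and a compute node would show whether the two resolve the same files.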

No idea what's up yet, but this needs fixing before looking into #1235 and #1237 ...

Stack:

$ radical-stack 
python            : 2.7.9
virtualenv        : /autofs/nccs-svm1_home1/merzky1/ve_rc2
radical.utils     : v0.45.RC2@no-branch
saga-python       : v0.45.RC2@no-branch
radical.pilot     : v0.45.RC2@no-branch
radical.analytics : v0.1-113-gb032808@devel
@marksantcroos
Contributor

marksantcroos commented Feb 24, 2017

Did you try to start fresh?

@andre-merzky
Member Author

I want to first understand what's happening here... But yeah, I'll do that eventually...

@andre-merzky
Member Author

andre-merzky commented Feb 27, 2017

I gave up trying to find the cause of this. My current assumption is that the Lustre file system cache is inconsistent between nodes, but I don't really want to spend the time to confirm this or rule it out. Just as a data point, I also see things like this on some nodes:

~/sandbox/rp.session.titan-ext1.merzky1.017224.0008-pilot.0000 $ cat agent_1.err
/lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017224.0008-pilot.0000/bootstrap_2.sh: line 40: /lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/ve_titan/bin/python: No such file or directory
/lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/rp.session.titan-ext1.merzky1.017224.0008-pilot.0000/bootstrap_2.sh: line 40: exec: /lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/ve_titan/bin/python: cannot execute: No such file or directory

~/sandbox/rp.session.titan-ext1.merzky1.017224.0008-pilot.0000 $ l /lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/ve_titan/bin/python
-rwxr-xr-x 1 merzky1 merzky1 14258 Feb 27 03:16 /lustre/atlas1/bip103/scratch/merzky1/radical.pilot.sandbox/ve_titan/bin/python*

And this right after agent_0 used the very same python executable on a different node...
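
If the cause really is a file that exists but is not yet visible on some nodes (an unconfirmed assumption), one way to gather more data points is to poll for the path before using it and log how long it takes to appear. The sketch below is only an illustration: `wait_for_path` and its parameters are hypothetical names, not part of radical.pilot or its bootstrap scripts, and polling would not fix an inconsistent cache, only work around a delayed one.

```python
import os
import time


def wait_for_path(path, timeout=30.0, interval=0.5):
    """Poll until path exists or timeout seconds have passed.

    Returns True if the path became visible, False on timeout.
    """
    deadline = time.time() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Path taken from the error output above.
    python_exe = ("/lustre/atlas1/bip103/scratch/merzky1/"
                  "radical.pilot.sandbox/ve_titan/bin/python")
    print(wait_for_path(python_exe, timeout=5.0))
```

A `False` result on one node while another node already runs the same executable would support the cache-inconsistency theory.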

So, I am gonna close this, and if deployment hiccups like this become too much of a problem, we'll have to take this to ORNL support.

PS.: I pasted a version with the wrong ve location - corrected now above.
