GridGain client hangs up when networkTimeout exception. #95

Open
andresgomezfrr opened this issue Feb 6, 2015 · 4 comments

@andresgomezfrr

Hi all,

I have found a problem that occurs when GridGain throws a networkTimeout exception. You can reproduce it by following these steps:

  1. Sample client:

I built a sample GridGain client that puts and gets random K/V objects in a grid cache. The example stores an object in the cache and queries that object 100 milliseconds later.

The example's source is available on this gist:
https://gist.github.com/andresgomez92/f3bf78682acaecc8cde6
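For readers who cannot open the gist, the loop described above can be sketched roughly as follows. This is not the gist's code: a plain `ConcurrentHashMap` stands in for the grid cache so the sketch runs without GridGain, and the names `PutGetLoopSketch` and `putThenGet` are hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a ConcurrentHashMap stands in for the GridGain cache (assumption).
class PutGetLoopSketch {
    /** Stores one random K/V pair, waits delayMs, reads it back and returns the value read. */
    static String putThenGet(Map<String, String> cache, long delayMs) {
        String key = UUID.randomUUID().toString();
        String value = UUID.randomUUID().toString();

        cache.put(key, value);
        System.out.println("PUT  --> KEY: " + key + " VALUE: " + value);

        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and carry on with the read.
            Thread.currentThread().interrupt();
        }

        String read = cache.get(key);
        System.out.println("GET  --> KEY: " + key + " VALUE: " + read);
        return read;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        // The real client loops indefinitely; three rounds are enough for a sketch.
        for (int i = 0; i < 3; i++) {
            putThenGet(cache, 100);
        }
    }
}
```

With a real grid cache in place of the map, each `put` and `get` is a network operation, which is where the packet loss in the next step takes effect.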

While the client is running, you see output like this:

PUT  --> KEY: 484c30c9-5b46-4271-9c71-6c72a8375524 VALUE: cae8750f-fee3-41cf-8467-ac84f5b8d5b7
GET  --> KEY: 484c30c9-5b46-4271-9c71-6c72a8375524 VALUE: cae8750f-fee3-41cf-8467-ac84f5b8d5b7
PUT  --> KEY: 45d78cbd-9d88-4a40-ad91-716eb82759c3 VALUE: 95e59ea9-06a4-439c-a46e-6eca83d58b36
GET  --> KEY: 45d78cbd-9d88-4a40-ad91-716eb82759c3 VALUE: 95e59ea9-06a4-439c-a46e-6eca83d58b36
PUT  --> KEY: 5cc5438b-d93a-4cce-a17d-51449c09fc29 VALUE: 4a85e0e3-2baa-4e6a-bafd-aa7ce33f8b3b
GET  --> KEY: 5cc5438b-d93a-4cce-a17d-51449c09fc29 VALUE: 4a85e0e3-2baa-4e6a-bafd-aa7ce33f8b3b
PUT  --> KEY: f1afe780-39f0-4af9-a146-423a5cd871ca VALUE: a3d9a91e-b88c-40c8-ba49-daf4ca5d16fc
GET  --> KEY: f1afe780-39f0-4af9-a146-423a5cd871ca VALUE: a3d9a91e-b88c-40c8-ba49-daf4ca5d16fc
PUT  --> KEY: 75727e96-34d5-492d-bbe5-ac9d3974014f VALUE: a62d1ff5-0568-4de3-9d77-bd031f9a426b
GET  --> KEY: 75727e96-34d5-492d-bbe5-ac9d3974014f VALUE: a62d1ff5-0568-4de3-9d77-bd031f9a426b
PUT  --> KEY: a2edad1a-c37c-4a1a-9b53-5ec3a16b6685 VALUE: 50775a30-edc8-4c79-82f5-7578c88719ce
  2. Now I use an application to simulate packet loss. You can find the application here: https://github.com/tylertreat/Comcast

While my client is running, I enable the packet loss simulation using this command:

 comcast --device=bond1 --packet-loss=40% 

I know that 40% packet loss is probably high, but that is not the problem. Once packet loss is enabled, the client gets noticeably slower, and after a few minutes it throws this exception:

GET  --> KEY: f2fe30a5-efcf-4247-ae73-defaef89c587 VALUE: 6d1f465a-7e88-4842-a85e-36f1d066ae2e
PUT  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
GET  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
PUT  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
GET  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
Exception in thread "main" class org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException: Cache update timeout out (consider increasing networkTimeout configuration property).
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.checkTimeout(GridNearAtomicUpdateFuture.java:301)
    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$19.onTimeout(GridDhtAtomicCache.java:1847)
    at org.gridgain.grid.kernal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:138)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.lang.Thread.run(Unknown Source)

When this happens, my Java example client hangs. I then disable packet loss using this command:

 comcast --mode stop --device=bond1 

After that, my GridGain node works fine, and I can check my K/V objects using ggvisorcmd.sh. If I disable my node, I can see that my GridGain client detects it, like this:

PUT  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
GET  --> KEY: 7125fb63-85c0-4ddc-bb0a-0bd6e1b03b5b VALUE: 639ebe67-ac2c-4880-a533-682d7e84066f
PUT  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
GET  --> KEY: a837db75-b571-44cd-bd10-92061e4ed4e7 VALUE: 0133d8cb-6085-44a5-ba27-4033020c03c6
Exception in thread "main" class org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException: Cache update timeout out (consider increasing networkTimeout configuration property).
For more information see:
    Troubleshooting:      http://bit.ly/GridGain-Troubleshooting
    Documentation Center: http://bit.ly/GridGain-Documentation

    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.checkTimeout(GridNearAtomicUpdateFuture.java:301)
    at org.gridgain.grid.kernal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$19.onTimeout(GridDhtAtomicCache.java:1847)
    at org.gridgain.grid.kernal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:138)
    at org.gridgain.grid.util.worker.GridWorker.run(GridWorker.java:151)
    at java.lang.Thread.run(Unknown Source)

[11:57:27] Topology snapshot [ver=77, nodes=1, CPUs=4, heap=3.5GB]
[11:57:48] Topology snapshot [ver=78, nodes=2, CPUs=8, heap=10.0GB]

But my GridGain client can no longer write or query K/V objects; it stays hung.

I think that when GridGain throws org.gridgain.grid.cache.GridCacheAtomicUpdateTimeoutException, the client should return null, as if it had not found the specified key, and should continue working normally.
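The behaviour requested above can be approximated on the client side. The following is a minimal sketch using only `java.util.concurrent`, not GridGain code: `TimeoutTolerantCalls` and `callOrNull` are hypothetical names, and the lambdas stand in for real cache calls. It assumes the example loop stops because the exception propagates out of the `main` thread, which the thread has not confirmed.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: wraps an arbitrary operation; no GridGain API is used here.
class TimeoutTolerantCalls {
    // Daemon threads, so an operation that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "cache-call");
        t.setDaemon(true);
        return t;
    });

    /** Runs op; returns null if it fails or does not finish within timeoutMs. */
    static <T> T callOrNull(Callable<T> op, long timeoutMs) {
        Future<T> future = POOL.submit(op);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Deadline passed or the operation threw: treat it like a cache miss.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        String ok = callOrNull(() -> "value", 1000);
        String slow = callOrNull(() -> { Thread.sleep(5000); return "late"; }, 100);
        String failed = callOrNull(() -> { throw new IllegalStateException("simulated timeout"); }, 1000);
        System.out.println(ok + " / " + slow + " / " + failed);
    }
}
```

Mapping every failure to null hides the difference between a missing key and a broken connection, so a real client would probably log the exception before returning.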

@dsetrakyan

Thanks for the detailed instructions. We will try to reproduce and get back to you.

@andresgomezfrr
Author

Any update?

@dsetrakyan

Can you try increasing the network timeout, as suggested by the exception? The default is 4000ms, so I would recommend setting it to 10000ms to give it enough time to deal with 40% packet loss.
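As a rough illustration, the setting might look like this in a Spring XML node configuration. The property name `networkTimeout` is taken from the exception message; the bean id and the `org.gridgain.grid.GridConfiguration` class name are assumptions about the 6.x configuration layout and should be checked against the version in use.

```xml
<!-- Sketch only: bean id and class name are assumptions for GridGain 6.x. -->
<bean id="grid.cfg" class="org.gridgain.grid.GridConfiguration">
    <!-- Network timeout in milliseconds; 10000 is the value recommended above. -->
    <property name="networkTimeout" value="10000"/>
</bean>
```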

If that does not help, we will need to take a look at the thread dumps from each node.
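Thread dumps are normally captured with a JDK tool such as `jstack` against the process id of each node. If that is not convenient, a dump can also be produced from inside the JVM; the sketch below uses only the standard `Thread.getAllStackTraces()` call, and the class name `ThreadDumpSketch` is hypothetical.

```java
import java.util.Map;

// Sketch only: prints the name, state and stack of every live thread in this JVM.
class ThreadDumpSketch {
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling `dumpAllThreads()` from the client once it has stopped making progress should show which threads are blocked and where.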

@dsetrakyan

Also, please make sure that you are running version 6.6.2.
