NATS cluster: tons of Error reading from client: invalid character '"' after top-level value #270
Quick question: are all your servers of the same version?
@kozlovic Yes, they're all on the same gnatsd 0.8.0 version.
Although verbose, the -DV log information could be helpful to the NATS team. I believe something other than a route connection is connecting to the route port.
Do you have the log from 10.21.0.31 by any chance?
Also, the error says that it gets an INFO from
@derekcollison No problem. Do you know a good place to post logs so that I don't pollute this thread?
@kozlovic I have them: Start of 10.21.0.31:
Start of 10.21.0.9:
I think you could just attach the logs by dragging them into the comment box.
Any chance that one of the servers is trying to connect to the client port?
@derekcollison Attached log: 10.21.0.9.verbose.zip

@kozlovic The config file (the second one) in my first post is what all 30 slaves are using (it's a template, and the only dynamic part is the # Managed by startup script one).

Just a random thought, but looking at the verbose log, could it be that all 30 (+3 seed) instances are restarting/dropping connections for some reason, and that causes data to be only partially available on the wire?
Thanks for the log, would you please provide the log for
Are you able to have a successful deploy, or does it always end up like this? If you try to bring up one server at a time, do you still get this issue? Thanks!
@awdrius I think I was able to reproduce the "invalid character" issue with lots of servers (all correctly connecting to the route port). I will keep you posted. Thanks again for the report!
@kozlovic That's good news (not that it's good that it happens (-. ). Let me know if you need any help.
Will do. In the meantime, if you want to experiment, see if it works better when deploying one server at a time...
@awdrius I have made some progress, so don't waste your time experimenting. I should get an update on the next step by the end of the day.
So I tried getting similar behavior while bringing slaves online one by one, and got the same thing. I started gnatsd on all three Mesos masters to act as seed nodes. Checking /routez confirmed that they were all fine. Then I brought one slave's gnatsd process online with no issues and no errors in any of the logs. Once I started the second gnatsd process on a second slave, all three initial nodes started complaining: master 1 (10.21.0.2):
master 2 (10.21.0.3):
master 3 (10.21.0.4):
Slave 1 (in order of start) (10.21.0.27):
Slave 2 (10.21.0.35):
Summary: it looks like after starting the second slave, all seed gnatsd instances started complaining about the first slave. Out of curiosity I started gnatsd on a third slave and got the following: master 1 (10.21.0.2):
master 2 (10.21.0.3):
master 3 (10.21.0.4):
Slave 1 (in order of start) (10.21.0.27):
Slave 2 (10.21.0.35):
Slave 3 (10.21.0.28):
In split buffer conditions, a buffer is used to accumulate bytes. After processing, this buffer needs to be reset. Resolves #270
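For readers following along, here is a minimal sketch (in Go, but not the actual gnatsd parser code) of the split-buffer pattern the fix describes: bytes that arrive split across reads are accumulated in a pending buffer, and that buffer has to be reset once the complete message has been processed, otherwise stale bytes get re-parsed on the next read and can produce errors like the "invalid character" one above.

```go
package main

import "fmt"

// parser accumulates bytes that arrive split across multiple reads.
// This is a simplified illustration, not the gnatsd implementation.
type parser struct {
	pending []byte // partial message carried over between reads
}

// feed consumes one read's worth of bytes and returns any complete
// newline-terminated messages. Trailing partial bytes are kept in
// pending until the rest arrives.
func (p *parser) feed(buf []byte) []string {
	data := append(p.pending, buf...)
	var msgs []string
	start := 0
	for i, b := range data {
		if b == '\n' {
			msgs = append(msgs, string(data[start:i]))
			start = i + 1
		}
	}
	// Crucially, reset the pending buffer to just the unprocessed tail;
	// keeping already-processed bytes around is the kind of bug the fix
	// above addresses.
	p.pending = append([]byte(nil), data[start:]...)
	return msgs
}

func main() {
	p := &parser{}
	fmt.Println(p.feed([]byte("INFO {\"ver")))   // nothing complete yet
	fmt.Println(p.feed([]byte("sion\":1}\nPI"))) // first message completes
	fmt.Println(p.feed([]byte("NG\n")))          // second message completes
}
```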
@kozlovic Good to know. I didn't get notified about your message, so I posted that nice wall of text (-.
Awesome. I'm happy to be helpful.
@awdrius Feel free to pull from master (0.8.1) and let us know if the issue is corrected. Thanks again for the feedback.
@derekcollison Cloned the repo and built it. The binary reports version 0.8.0beta2, but I can confirm that the fix is in place by checking the changed source file. Deployed it to the 3 seed nodes and 30 slave nodes. I can confirm that I no longer see a bunch of cascading errors, and /routez reports 32 routes, a number that is no longer jumping. I do still see errors in the log files. master-1 (10.21.0.2):
master-2 (10.21.0.3):
master-3 (10.21.0.4):
Random slave (10.21.0.12):
I have a single service that subscribes to 2 subjects and a test requester, and I can confirm that request-reply (on queue subscribers) works fine. I'm going to run more tests tomorrow and will report any findings.
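For anyone wanting to reproduce this kind of smoke test, here is a minimal sketch of request-reply over a queue subscriber using the NATS Go client; the subject and queue names are made up for illustration, and the import path shown is the current one (older setups imported github.com/nats-io/nats).

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local gnatsd; adjust the URL for a clustered setup.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Queue subscriber: only one member of the "workers" queue group
	// receives each request, which is what load-balanced request-reply uses.
	// "service.echo" and "workers" are hypothetical names.
	_, err = nc.QueueSubscribe("service.echo", "workers", func(m *nats.Msg) {
		nc.Publish(m.Reply, append([]byte("ack: "), m.Data...))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Test requester: send a request and wait up to two seconds for a reply.
	resp, err := nc.Request("service.echo", []byte("hello"), 2*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Data))
}
```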
Master builds should report version 0.8.1; can you double-check?
Yes. Either those are older logs (I am not sure you would be able to get the full mesh with the prior bug), or the version should definitely report something else. Maybe some caching issue?
I built gnatsd following the Building section of the docs (go build). I now have a version built using the Dockerfile in the git repo root and will deploy it shortly. As for 0.8.0beta2: I restarted one gnatsd instance on one of the slaves and got that cascading thing again (with /routez reporting num_routes jumping up and down), but it stabilized after a couple of seconds. I'll follow up with results from the 0.8.1 version.
@derekcollison @kozlovic I can confirm that it works flawlessly with gnatsd 0.8.1. No more errors in the logs, and no more route count going up and down when a single gnatsd instance restarts. Apologies for the earlier false alarm, and thanks for the quick turnaround.
Glad to know it works now! And again, thank you for the report.
Hello, I'm starting to introduce NATS as a microservice communication protocol on Mesos. I'm using the Mesos master nodes as seed nodes for NATS, and each Mesos slave runs gnatsd pointing to those seed nodes. I observed a peculiar flood of messages once I start/restart a bunch of Mesos slaves (and the gnatsd instances running on them).
This is a small extract from the gnatsd log:
I tried using the -DV flag for more info, but I could not find anything related, so I decided not to include it here as it's super verbose.
As a side effect, I noticed that sometimes (rarely) some of the gnatsd instances on Mesos slaves keep running but are not connected to the rest of the cluster. Another case is when I restart gnatsd on one of the slaves and it produces a cascading flood of logs like the above. While observing the /routez monitoring endpoint I can see "num_routes" going up and down, as if route propagation cycles through the other live gnatsd instances.
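To watch that route count programmatically, here is a small sketch that polls /routez and prints num_routes. The host and the 8222 monitoring port are assumptions and need to match the http setting in the server configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// routez holds the only field we care about from the /routez response.
type routez struct {
	NumRoutes int `json:"num_routes"`
}

func main() {
	// Assumed monitoring address; it must match the server's "http" setting.
	const url = "http://127.0.0.1:8222/routez"

	for i := 0; i < 10; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("poll error:", err)
		} else {
			var r routez
			if err := json.NewDecoder(resp.Body).Decode(&r); err == nil {
				fmt.Println("num_routes:", r.NumRoutes)
			}
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```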
Configuration files for seed nodes (the routes change depending on the server):
Configuration for slave nodes:
I'm not certain whether there is something wrong with the configuration files, or whether my idea of running gnatsd on each Mesos slave (for a mesh cluster) is just not the best use case for NATS.
If you need more details, let me know. I can alter the configuration and deploy test code if needed.