jaeger-spark-dependencies failing to access AWS Elasticsearch #668

Closed
mehstg opened this issue Sep 20, 2019 · 33 comments

Comments

@mehstg

mehstg commented Sep 20, 2019

Currently using the Jaeger Operator version 1.13.1 with an Amazon Elasticsearch v6.8 backend.

The 'jaeger-spark-dependencies' pods are currently erroring out with the following message:
19/09/17 23:56:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/17 23:56:27 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2019-09-17T00:00Z, reading from jaeger-span-2019-09-17 index, result storing to jaeger-dependencies-2019-09-17
19/09/17 23:57:28 ERROR NetworkClient: Node [https://10.3.146.146:9200] failed (java.net.SocketTimeoutException: connect timed out); no other nodes left - aborting...
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

All other pods can communicate with the ES backend successfully. Any ideas?

@pavolloffay
Member

Did you try to configure ES_NODES_WAN_ONLY/elasticsearchNodesWanOnly in the dependencies spec?

@mehstg
Author

mehstg commented Sep 20, 2019

Hi there

I am unsure where that would go in the spec. The documentation does not seem clear on this. What do you mean by 'dependencies spec'?

@pavolloffay
Member

@mehstg
Author

mehstg commented Sep 23, 2019

Like this?

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: <ES URL>
  dependencies:
    ElasticsearchNodesWanOnly: true

@pavolloffay
Member

It looks correct. You can verify the config by inspecting the cron job spec created by the operator. It should define an environment variable for this option.

@mehstg
Author

mehstg commented Sep 24, 2019

Confirmed that this still shows false in the cron job. The documentation doesn't seem to say anywhere where this should go or how it should be formatted in the YAML.

@pavolloffay
Member

We haven't documented every possible option in the CR. I often point folks to the godoc to see all the possible options; however, it might be harder to read for non-Go developers.

Unfortunately, there is a mistake: the property has to start with a lowercase letter. The name of the property is given in the json: annotation in the godoc.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: <ES URL>
  dependencies:
    elasticsearchNodesWanOnly: true

@mehstg
Author

mehstg commented Sep 24, 2019 via email

@pavolloffay
Member

Please let us know if it worked

@mehstg
Author

mehstg commented Sep 25, 2019

Hi Pavol

Unfortunately not. I can still see:

      ES_CLIENT_NODE_ONLY:  false
      ES_NODES_WAN_ONLY:    false

when I describe the job.

@pavolloffay
Member

Alright, we made one more mistake: the dependencies node should be nested under storage.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: <ES URL>
    dependencies:
      elasticsearchNodesWanOnly: true
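
A rough Go sketch of why the nesting matters, assuming the CR is decoded into nested structs in which the dependencies options belong to the storage section. The struct and function names below are hypothetical and heavily trimmed (the real types are the ones in the operator's godoc), and JSON stands in for YAML so the sketch needs only the standard library. Go's encoding/json silently ignores unknown fields by default, so in this sketch a dependencies block placed directly under spec produces no error and simply has no effect.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical, trimmed-down mirror of the CR layout discussed in this thread.
type dependenciesSpec struct {
	ElasticsearchNodesWanOnly *bool `json:"elasticsearchNodesWanOnly,omitempty"`
}

type storageSpec struct {
	Type         string           `json:"type"`
	Dependencies dependenciesSpec `json:"dependencies"`
}

type jaegerSpec struct {
	Strategy string      `json:"strategy"`
	Storage  storageSpec `json:"storage"`
}

// wanOnly reports whether the option was found at spec.storage.dependencies.
func wanOnly(specJSON string) (bool, error) {
	var s jaegerSpec
	if err := json.Unmarshal([]byte(specJSON), &s); err != nil {
		return false, err
	}
	v := s.Storage.Dependencies.ElasticsearchNodesWanOnly
	return v != nil && *v, nil
}

func main() {
	// dependencies placed directly under spec: decoded without error, but ignored.
	misplaced := `{"strategy":"production","storage":{"type":"elasticsearch"},"dependencies":{"elasticsearchNodesWanOnly":true}}`
	// dependencies nested under storage: picked up.
	nested := `{"strategy":"production","storage":{"type":"elasticsearch","dependencies":{"elasticsearchNodesWanOnly":true}}}`

	for _, doc := range []string{misplaced, nested} {
		ok, err := wanOnly(doc)
		fmt.Println(ok, err)
	}
}
```
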

@mehstg
Author

mehstg commented Sep 30, 2019

Thanks for that. I can now see:
ES_NODES_WAN_ONLY: true

Unfortunately my job is still failing with the same error
kubectl logs jaeger-spark-dependencies-1569801300-57frm -n tracing
19/09/30 00:03:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/30 00:03:27 INFO ElasticsearchDependenciesJob: Running Dependencies job for 2019-09-30T00:00Z, reading from jaeger-span-2019-09-30 index, result storing to jaeger-dependencies-2019-09-30
19/09/30 00:04:28 ERROR NetworkClient: Node [https://vpc-prod-eu-west-1-prod01-jaeger-pzzdivavqjkmvtkxccfwua6ekm.eu-west-1.es.amazonaws.com:9200] failed (java.net.SocketTimeoutException: connect timed out); no other nodes left - aborting...
Exception in thread "main" org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

@pavolloffay
Member

@mehstg could you please paste the full Jaeger CR? I would like to see the full configuration, especially whether you are using TLS.

@mehstg
Author

mehstg commented Sep 30, 2019

Of course.

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://vpc-prod-eu-west-1-prod01-jaeger-pzzdivavqjkmvtkxccfwua6ekm.eu-west-1.es.amazonaws.com
    dependencies:
      elasticsearchNodesWanOnly: true
  ingress:
    enabled: false
  agent:
    strategy: DaemonSet
  collector:
    image: jaegertracing/jaeger-collector:1.14.0
  query:
    image: jaegertracing/jaeger-query:1.14.0
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
    iam.amazonaws.com/role: prod-eu-west-1-prod01-pod-tracing-jaeger

I have just noticed we are pinning particular versions of the collector/query. Not sure if this could cause an issue.

@pavolloffay
Member

The version should not matter that much in this case, but I would not recommend pinning the image versions. The Jaeger operator knows best which version of each component should be used.

I cannot debug your use case, but you can try the following configuration:

  1. Use http://vpc-prod-eu-west-1-prod01-jaeger-pzzdivavqjkmvtkxccfwua6ekm.eu-west-1.es.amazonaws.com (mind http instead of https) as the URL for spark-dependencies
  2. Set ES_CLIENT_NODE_ONLY to true

You can either edit the job spec manually, which requires undeploying the operator, or deploy spark-dependencies manually without the operator.

https://github.com/jaegertracing/spark-dependencies#elasticsearch

@mehstg
Author

mehstg commented Oct 2, 2019

I have no issue undeploying/redeploying. I cannot connect to ES via http though. It is blocked on the security group and we will not be able to modify that in production for security reasons.
I did see on another thread that someone seemed to fix this by using the flag es.net.ssl = true; however, I haven't managed to find that in the operator yet.

@mehstg
Author

mehstg commented Oct 4, 2019

@pavolloffay Is there any way I can just disable the jaeger-spark-dependencies? I am not sure it is even functionality I am using.

@pavolloffay
Member

Yes, enabled: false in the dependencies node.

@Crevil

Crevil commented Oct 13, 2019

I also have problems getting the dependencies job to run. I've tried combinations of elasticsearchNodesWanOnly and elasticsearchClientNodeOnly with no luck. I'm unsure if this is an issue with the operator setting up the jobs or with the jobs themselves.
I'm thinking it's the latter, as the configuration changes are reflected in the job environment variables just fine.

Any leads on what I can do to debug further?

@Crevil

Crevil commented Oct 13, 2019

I think I might be on to something. If server-urls contains a port number, that port is not propagated to ES_NODES, which then defaults to 9200.
For AWS hosts the port should be 443 when accessing them over HTTPS.

I’ll see if I can replicate this in a simple setup.

@pavolloffay
Member

ES_NODES should contain exactly the value of server-urls

{Name: "ES_NODES", Value: sFlagsMap["es.server-urls"]},

Could you please paste here Jaeger CR and job spec which does not contain the same values for ES urls?
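
To illustrate the point, here is a minimal sketch of how the job's environment could be derived from the CR, assuming the storage options have already been flattened into a map keyed like es.server-urls. The envVar type and the dependenciesEnv function are hypothetical stand-ins, not the operator's actual code; only the ES_NODES line mirrors the snippet quoted above. The takeaway is that the server URL reaches the job verbatim, so a URL without a port stays without a port.

```go
package main

import (
	"fmt"
	"strconv"
)

// envVar is a hypothetical stand-in for a Kubernetes name/value environment entry.
type envVar struct {
	Name  string
	Value string
}

// dependenciesEnv sketches how the job environment could be assembled:
// ES_NODES is copied verbatim from the es.server-urls option, and the two
// booleans are rendered as "true"/"false" strings.
func dependenciesEnv(flags map[string]string, wanOnly, clientNodeOnly bool) []envVar {
	return []envVar{
		{Name: "ES_NODES", Value: flags["es.server-urls"]},
		{Name: "ES_NODES_WAN_ONLY", Value: strconv.FormatBool(wanOnly)},
		{Name: "ES_CLIENT_NODE_ONLY", Value: strconv.FormatBool(clientNodeOnly)},
	}
}

func main() {
	flags := map[string]string{"es.server-urls": "https://es.example.com"} // placeholder URL
	for _, e := range dependenciesEnv(flags, true, false) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```
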

@Crevil

Crevil commented Oct 14, 2019

I didn’t get any further with it yesterday. Just wanted to add a bit more context. I’m back at my computer tomorrow and will make sure to post the info you asked for.

When giving this more thought: could it be that the two components have different defaults? We specified the host name in server-urls without a port number and it worked for everything except the dependencies job.

@Crevil

Crevil commented Oct 15, 2019

Ok. Specifying the port number did in fact work as you expected. I had a bad configuration of the jaeger instance, which kept the change from being reflected in the jobs.

In other words: setting elasticsearchNodesWanOnly: true and making sure to specify the port of the AWS Elasticsearch host makes the jobs work as expected.

spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://vpc-jaeger-tracing-unique.region.es.amazonaws.com:443
    dependencies:
      enabled: true
      elasticsearchNodesWanOnly: true

This leaves me with the assumption from above that the two components handle default values differently.
The collector worked fine without the 443 port in the host, but the jobs did not. That matches how @mehstg specified the host in this issue.

@Crevil

Crevil commented Oct 15, 2019

It does indeed look to be the case.

In the jaeger collector no port number is added if none is specified: elastic.SetURL(c.Servers...), but the spark dependencies job will add a default port number 9200:

* `ES_NODES`: A comma separated list of elasticsearch hosts advertising http. Defaults to
              localhost. Add port section if not listening on port 9200. (...)

So this explains the odd behaviour of the collector working while the jobs do not.

I guess this is covered by the documentation of jaeger-collector, which states that a full URL must be specified:

--es.server-urls string    The comma-separated list of Elasticsearch servers, must be full url i.e. http://localhost:9200 (default "http://127.0.0.1:9200")

It might be nice if the operator guarded against this as well, or maybe the collector could fail to start if the port is missing. What do you think?

@pavolloffay
Member

@Crevil thanks for digging into this!

I do not understand how the collector can work when the port number is missing.

@pavolloffay
Member

Note that port 9200 in spark-dependencies is added automatically by the Spark ES connector.

@pavolloffay
Member

I think the job should automatically set es.nodes.wan.only if the ES_NODES are specified. Most people are struggling with this.

Then we can check the hosts string and generate a warning if the port is missing.

@Crevil

Crevil commented Oct 18, 2019

Setting es.nodes.wan.only would indeed have made my issues easier. This would be a breaking change, right?

olivere/elastic makes no modifications to the URL before connecting (it just uses url.Parse() on the provided values), and it uses an http.Client by default with a TLS-configured http.Transport underneath.

Go's http.Transport uses connectMethodForRequest to connect. This in turn uses canonicalAddr to get the outbound address. Notice how a default port is added based on the scheme here. So HTTPS URLs will use port 443.

As the AWS Elasticsearch service exposes the servers over HTTPS, this default port value is set and it works.
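
The behaviour described above can be reproduced with the standard library alone. The sketch below is not the net/http source: canonicalAddr is unexported, so the hypothetical dialAddr function only mimics the idea of falling back to a scheme-based default port when the parsed URL has none.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// dialAddr returns the host:port a Go HTTP client would end up connecting to:
// the explicit port if the URL has one, otherwise the default for the scheme.
func dialAddr(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	port := u.Port() // empty when the URL has no explicit port; url.Parse adds none
	if port == "" {
		switch u.Scheme {
		case "https":
			port = "443"
		case "http":
			port = "80"
		default:
			return "", fmt.Errorf("no default port for scheme %q", u.Scheme)
		}
	}
	return net.JoinHostPort(u.Hostname(), port), nil
}

func main() {
	for _, raw := range []string{
		"https://vpc-jaeger-tracing-unique.region.es.amazonaws.com",
		"http://localhost:9200",
	} {
		addr, err := dialAddr(raw)
		fmt.Println(addr, err)
	}
}
```

For the HTTPS URL without a port this yields port 443, whereas the dependencies job, given the same string, would try 9200.
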

@pavolloffay
Member

> Setting es.nodes.wan.only would indeed have made my issues easier. This would be a breaking change, right?

It should not be a breaking change. The clients will still be able to connect; the ES client will just switch off auto-discovery.

To make that work I will have to submit a PR to the operator to not set es.nodes.wan.only if the value was not specified in the CR.

@pavolloffay
Member

I have submitted #708 for the operator and jaegertracing/spark-dependencies#79 in spark-dependencies.

@Crevil

Crevil commented Oct 18, 2019

Great. Thanks for taking time to add these changes. It should make it a lot easier for future users of this operator. 🙏

@pavolloffay
Member

np, thanks for digging into this. You helped the most!

@pavolloffay
Member

done in #708
