Cannot configure Stackdriver output plugin #761

Closed
theFroh opened this issue Sep 11, 2018 · 40 comments
Labels
not-an-issue waiting-for-user Waiting for more information, tests or requested changes

Comments

@theFroh

theFroh commented Sep 11, 2018

Bug Report

Describe the bug
I have followed the configuration guide for Stackdriver in the manual, but have had no success in establishing a connection to Stackdriver.

To Reproduce

  1. Install fluent-bit on an Ubuntu 16.04 LTS box
  2. Create a service account following Google's instructions and copy the JSON key into /etc/google/auth/
  3. Modify /etc/td-agent-bit/td-agent-bit.conf to include:
    [OUTPUT]
        Name  stackdriver
        Match *
        google_service_credentials /etc/google/auth/________.json
    
  4. Restart the agent to reload the configuration via systemctl restart td-agent-bit.service
  5. Note that the authorisation phase of connecting to Stackdriver fails via systemctl status td-agent-bit.service:
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [ info] [engine] started (pid=16981)
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [error] [oauth2] could not get an upstream connection
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [error] [out_stackdriver] error retrieving oauth2 access token
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [ warn] [out_stackdriver] token retrieval failed
    

Expected behavior
I expected authentication to succeed against Stackdriver.

Your Environment

  • Version used: 0.14.1
  • Configuration: Default configuration but with the [OUTPUT] section as described above.
    I had to comment out the Plugins_File plugins.conf line as this file does not exist by default and I couldn't find any documentation on the intended contents of such a file. (I also attempted putting the [OUTPUT] config for stackdriver into this file, as well as just leaving the file blank)
  • Environment name and version (e.g. Kubernetes? What version?): Our own VPS
  • Server type and version: N/A
  • Operating System and version: Ubuntu 16.04.3 LTS
  • Filters and plugins: Default configuration; no filters, just the stackdriver output plugin.

Additional context
I'm trying to use fluent-bit to consume and send through server stats from a VPS we have, that is not part of our Google Cloud cluster.

@edsiper edsiper self-assigned this Sep 11, 2018
@edsiper
Member

edsiper commented Sep 11, 2018

Hi @theFroh,

looking at the error I see the following

...[2018/09/11 02:24:10] [error] [oauth2] could not get an upstream connection

That means the plugin could not establish a network connection with Google services. Please validate on your end that your system can reach the following HTTPS endpoints:

@theFroh
Author

theFroh commented Sep 12, 2018

Hey @edsiper,

The machine definitely has outbound access, and in particular, those two end-points are definitely accessible from the machine:

$ nmap -p 443 logging.googleapis.com www.googleapis.com

Starting Nmap 7.01 ( https://nmap.org ) at 2018-09-12 05:37 UTC
Nmap scan report for logging.googleapis.com (172.217.25.170)
Host is up (0.0018s latency).
Other addresses for logging.googleapis.com (not scanned): 2404:6800:4006:803::200a 172.217.167.74 172.217.167.106 216.58.196.138 216.58.199.74 216.58.200.106 216.58.203.106 216.58.220.106 172.217.25.138
rDNS record for 172.217.25.170: sin01s16-in-f10.1e100.net
PORT    STATE SERVICE
443/tcp open  https

Nmap scan report for www.googleapis.com (216.58.203.106)
Host is up (0.0017s latency).
Other addresses for www.googleapis.com (not scanned): 2404:6800:4006:803::200a 216.58.220.138 172.217.25.138 172.217.167.74 172.217.167.106 216.58.196.138 216.58.199.42 216.58.199.74 216.58.200.106
rDNS record for 216.58.203.106: syd09s15-in-f10.1e100.net
PORT    STATE SERVICE
443/tcp open  https

Cheers for assisting!
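
The scan above probes only one resolved address per host (the rest are listed as "not scanned"), so it can hide a failure that affects a single address family. As a rough cross-check, the sketch below resolves a host and attempts a plain TCP connect to every address returned, keeping IPv4 and IPv6 results apart. The helper names (`resolve_by_family`, `try_connect`, `check_host`) are illustrative and not part of Fluent Bit; this is a diagnostic sketch, not a description of how the plugin connects.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolve_by_family(host, port):
    """Group every address getaddrinfo() returns for host:port by address family."""
    grouped = {}
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        grouped.setdefault(family, []).append(sockaddr)
    return grouped


def try_connect(family, sockaddr, timeout=3.0):
    """Attempt a plain TCP connect; return None on success, or the OSError raised."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return exc
    return None


def check_host(host, port=443):
    """Return (family name, sockaddr, error-or-None) for each resolved address."""
    report = []
    for family, addrs in resolve_by_family(host, port).items():
        for sockaddr in addrs:
            name = FAMILY_NAMES.get(family, str(family))
            report.append((name, sockaddr, try_connect(family, sockaddr)))
    return report
```

For the hosts in this thread one would call `check_host("www.googleapis.com")` and `check_host("logging.googleapis.com")`. If the IPv4 rows succeed while the IPv6 rows fail (or the reverse), that points at an address-family problem rather than a firewall.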

@edsiper
Member

edsiper commented Sep 15, 2018

Would you please enable trace messages with 'Log_Level trace' (in the [SERVICE] section) and share the output?

@theFroh
Author

theFroh commented Sep 17, 2018

No worries, that only really adds a JWT signature printout, though.

Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [ info] [engine] started (pid=1810)
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [out_stackdriver] JWT signature:
Sep 17 01:16:52 hostname td-agent-bit[1810]: xxx.xxx.xxx
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [error] [oauth2] could not get an upstream connection
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [error] [out_stackdriver] error retrieving oauth2 access token
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [ warn] [out_stackdriver] token retrieval failed
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [router] match rule cpu.0:stdout.0
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [router] match rule cpu.0:stackdriver.0

The JWT signature has a payload containing (with our correct account name removed):

{
  "iss": "<STATS SERVICE ACCOUNT>@<PROJECT NAME>.iam.gserviceaccount.com",
  "scope": "https://www.googleapis.com/auth/logging.write",
  "aud": "https://www.googleapis.com/oauth2/v4/token",
  "exp": 1537150012,
  "iat": 1537147012
}

And header:

{
  "alg": "RS256",
  "typ": "JWT"
}

I can't check if the JWT itself is valid as I've not got the secret or public key to verify with.

@edsiper
Member

edsiper commented Sep 17, 2018

I will try to replicate the problem on a 16.04 box. I tested again on my 18.04 box and it works fine.

@edsiper
Member

edsiper commented Sep 25, 2018

No issues here. If you generate a new token file, does it work?

@theFroh
Author

theFroh commented Sep 26, 2018

What is providing your 16.04 testing box? Mine is just a standard, run of the mill VPS; not provided by AWS or the like.

To generate a new token, I've followed the following steps from Google as they seem the most applicable:

  1. Checked "Authorizing an Agent" -- which indicates that I should create a Service Account.
  2. Followed here and created a new Service Account with both Logging > Logs Writer and Monitoring > Monitoring Metric Writer roles.
  3. Used the default JSON private key export option to generate a key file.
  4. Moved this key file onto the server in question, dropped it into /etc/google/auth/ and then updated /etc/td-agent-bit/td-agent-bit.conf so that google_service_credentials is set correctly.
  5. systemctl restart td-agent-bit and systemctl status td-agent-bit

This reports the same [error] [oauth2] could not get an upstream connection

Am I missing any steps here, or misinterpreting any of the documentation, whether on Fluent Bit's or Google's end?

EDIT: I have also just nabbed the JWT signature from the logs again; it is definitely referencing the correct account in there.

@stevenarvar

stevenarvar commented Oct 30, 2018

I deployed fluentbit 0.14 in K8S cluster.

The important config is the env variable
QA >> kubectl exec fluent-bit-77zr7 -n kube-system env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=fluent-bit-77zr7
GOOGLE_SERVICE_CREDENTIALS=/gcp/stackdriver-service-account.json

From the fluent-bit-ds.yaml file:

    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.5
        imagePullPolicy: Always
        ports:
          - containerPort: 2020
        env:
        - name: GOOGLE_SERVICE_CREDENTIALS
          value: /gcp/stackdriver-service-account.json
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: ssa-volume
          mountPath: /gcp
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
.
.
.
.
      volumes:
      - name: ssa-volume
        secret:
          secretName: stackdriver-service-account

The above config needs a secret to be created like so:

kubectl create secret generic --namespace=kube-system stackdriver-service-account --from-file=./stackdriver-service-account.json

I mostly followed the instructions from here:
https://docs.fluentbit.io/manual/installation/kubernetes

I swapped the elasticsearch OUTPUT with stackdriver. But I also tried the simple configmap suggested here: https://docs.fluentbit.io/manual/output/stackdriver

Got the StackDriver authentication working I believe:

QA >> kubectl logs -n kube-system fluent-bit-77zr7
Fluent-Bit v0.14.5
Copyright (C) Treasure Data

[2018/10/30 19:28:51] [ info] [engine] started (pid=1)
[2018/10/30 19:28:51] [ info] [oauth2] HTTP Status=200
[2018/10/30 19:28:51] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved

Problem is I don't see logs in my stackdriver project.

The final configmap I use is:

QA >> cat fluent-bit-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [INPUT]
        Name  cpu
        Tag   cpu

    [OUTPUT]
        Name        stackdriver
        Match       *

I did not set the env variables such as SERVICE_ACCOUNT_EMAIL & SERVICE_ACCOUNT_SECRET because I already have GOOGLE_SERVICE_CREDENTIALS set up. I did not set resource to global, as I assumed this is already the default.

Are there any other logs I can get to dig in more? I don't know what else to try at this point.

@varun-da

varun-da commented Jan 7, 2019

@stevenarvar look under Global, the first filter. Not under the service account.

@varun-da

varun-da commented Jan 8, 2019

@theFroh this is definitely an issue with not being able to reach the Google API servers from the box. Please check your connectivity from the box to those services. I was getting the same error, and once I allowed the traffic through, it worked. I do still get a few errors when the pod starts, but afterwards it works. The initial connection failures in my environment happen because I am running Istio, and its pods have to initialise before traffic is routed correctly. I have tested with v0.14.9 and v1.0.1.

I had to enable traffic to the following urls:

logging.googleapis.com
www.googleapis.com

logs:

Fluent-Bit v0.14.9
Copyright (C) Treasure Data

[2019/01/07 23:24:09] [ info] [engine] started (pid=1)
[2019/01/07 23:24:09] [error] [oauth2] could not get an upstream connection
[2019/01/07 23:24:09] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/01/07 23:24:09] [ warn] [out_stackdriver] token retrieval failed
.
.
.
[2019/01/07 23:24:10] [error] [io] TCP connection failed: logging.googleapis.com:443 (Connection refused)
.
.
.
[2019/01/07 23:24:12] [ info] [oauth2] HTTP Status=200
[2019/01/07 23:24:12] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved
Fluent Bit v1.1.0
Copyright (C) Treasure Data

[2019/01/08 16:41:39] [ info] [storage] initializing...
[2019/01/08 16:41:39] [ info] [storage] in-memory
[2019/01/08 16:41:39] [ info] [storage] normal synchronization mode, checksum disabled
[2019/01/08 16:41:39] [ info] [engine] started (pid=1)
[2019/01/08 16:41:39] [error] [oauth2] could not get an upstream connection
[2019/01/08 16:41:39] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/01/08 16:41:39] [ warn] [out_stackdriver] token retrieval failed
.
.
.
[2019/01/08 16:41:40] [error] [io] TCP connection failed: logging.googleapis.com:443 (Connection refused)
.
.
.
[2019/01/08 16:41:43] [ info] [oauth2] HTTP Status=200
[2019/01/08 16:41:43] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved

@edsiper
Member

edsiper commented Jan 8, 2019

Is there any extra information that we could add to the documentation? Or is it good to close the ticket?

@varun-da

varun-da commented Jan 8, 2019

@edsiper the two domains should be added to the docs. Also, the log should print the full URL at which access was denied or the request failed, for example: made a call to https://www.googleapis.com/oauth2/token to get the token and failed, connection refused (or, in case of an HTTP error, received HTTP: 404, etc.). That way it is clear from the logs what is happening.

@theFroh
Author

theFroh commented Jan 9, 2019

@varun-da Just in response to your own reply before, definitely understand that it is a likely cause, but the first thing we checked off in this issue was connectivity from the box to those two addresses. I can confirm I still have connectivity.

I'm still hitting the issue, though:

Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ info] [engine] started (pid=15149)
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [out_stackdriver] JWT signature:
Jan 09 09:00:15 hostname td-agent-bit[15149]: removed
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [error] [oauth2] could not get an upstream connection
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [error] [out_stackdriver] error retrieving oauth2 access token
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ warn] [out_stackdriver] token retrieval failed
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [router] match rule cpu.0:stdout.0
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [router] match rule cpu.0:stackdriver.0
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
Jan 09 09:00:16 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [input cpu.0] [mem buf] size = 317

Cheers for the assistance!

@varun-da

varun-da commented Jan 9, 2019

@theFroh the next step I would take is to make a verbose curl call from that box to the googleapis.com server, using the JWT token, to get the oauth2 token. Perhaps @edsiper can point to the documentation for doing this.

I think I found it: https://developers.google.com/identity/protocols/OAuth2ServiceAccount

Here is the example from the page with the -v flag added; you would have to replace the JWT token with the one generated by the fluent-bit instance on that machine:

curl -v -d 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=<JWT token from fluent-bit instance>' https://www.googleapis.com/oauth2/v4/token

curl -v -d 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiI3NjEzMjY3OTgwNjktcjVtbGpsbG4xcmQ0bHJiaGc3NWVmZ2lncDM2bTc4ajVAZGV2ZWxvcGVyLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmdvb2dsZWFwaXMuY29tL2F1dGgvcHJlZGljdGlvbiIsImF1ZCI6Imh0dHBzOi8vYWNjb3VudHMuZ29vZ2xlLmNvbS9vL29hdXRoMi90b2tlbiIsImV4cCI6MTMyODU3MzM4MSwiaWF0IjoxMzI4NTY5NzgxfQ.RZVpzWygMLuL-n3GwjW1_yhQhrqDacyvaXkuf8HcJl8EtXYjGjMaW5oiM5cgAaIorrqgYlp4DPF_GuncFqg9uDZrx7pMmCZ_yHfxhSCXru3gbXrZvAIicNQZMFxrEEn4REVuq7DjkTMyCMGCY1dpMa8aWfTQFt3Eh7smLchaZsU' https://www.googleapis.com/oauth2/v4/token

This would definitely help in debugging this further.
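
The same request can also be made from Python's standard library, which is handy where curl is unavailable. This is a sketch of the curl call above, not of Fluent Bit's own client; `build_token_request_body` and `request_access_token` are illustrative names, and the JWT passed in must be the one printed in the agent's log.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.googleapis.com/oauth2/v4/token"  # endpoint from the curl example
JWT_BEARER_GRANT = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_token_request_body(jwt_assertion):
    """Produce the same percent-encoded form string that is passed to curl -d."""
    return urllib.parse.urlencode(
        {"grant_type": JWT_BEARER_GRANT, "assertion": jwt_assertion}
    )


def request_access_token(jwt_assertion, timeout=10.0):
    """POST the assertion and return the decoded JSON response.

    Raises urllib.error.HTTPError on an HTTP error status and
    urllib.error.URLError when the connection itself cannot be established.
    """
    body = build_token_request_body(jwt_assertion).encode("ascii")
    request = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A `URLError` here would suggest a connection-level problem, while an `HTTPError` would mean the endpoint was reached but rejected the assertion.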

@theFroh
Author

theFroh commented Jan 11, 2019

@varun-da Ah, that's definitely a great way to test here.

Running it myself with the token as reported in the logs yields a success in my books:

*   Trying 2404:6800:4006:802::200a...
* Connected to www.googleapis.com (2404:6800:4006:802::200a) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 596 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_ECDSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: *.googleapis.com (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: EC
* 	 certificate version: #3
* 	 subject: C=US,ST=California,L=Mountain View,O=Google LLC,CN=*.googleapis.com
* 	 start date: Wed, 19 Dec 2018 08:17:00 GMT
* 	 expire date: Wed, 13 Mar 2019 08:17:00 GMT
* 	 issuer: C=US,O=Google Trust Services,CN=Google Internet Authority G3
* 	 compression: NULL
* ALPN, server accepted to use http/1.1
> POST /oauth2/v4/token HTTP/1.1
> Host: www.googleapis.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 747
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 747 out of 747 bytes
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Vary: X-Origin
< Vary: Referer
< Date: Fri, 11 Jan 2019 01:56:36 GMT
< Server: ESF
< Cache-Control: private
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
< Accept-Ranges: none
< Vary: Origin,Accept-Encoding
< Transfer-Encoding: chunked
< 
{
  "access_token": "<access token omitted>",
  "expires_in": 3600,
  "token_type": "Bearer"
* Connection #0 to host www.googleapis.com left intact
}

Which doesn't really clear anything up unfortunately. I wonder how Fluentbit's networking differs.

@sudharsh

+1, I am hit by this too. I get a 200 when I do the curl with the JWT token copied from the logs, and the same oauth error from fluentbit logs.

@jakeswenson

I'm getting the exact same thing:

....
 Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* We are completely uploaded and fine
< HTTP/2 200
< content-type: application/json; charset=utf-8
< vary: X-Origin
< vary: Referer
< vary: Origin,Accept-Encoding
< date: Mon, 25 Mar 2019 19:21:31 GMT
< server: ESF
< cache-control: private
< x-xss-protection: 1; mode=block
< x-frame-options: SAMEORIGIN
< x-content-type-options: nosniff
< alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
< accept-ranges: none
<
{
  "access_token": "Removed",
  "expires_in": 3600,
  "token_type": "Bearer"
* Connection #0 to host www.googleapis.com left intact
}

@jakeswenson

jakeswenson commented Mar 26, 2019

How did other folks resolve this?

Fluent Bit v1.0.4
Copyright (C) Treasure Data

[2019/03/26 15:55:49] [debug] [storage] [cio stream] new stream registered: syslog.0
[2019/03/26 15:55:49] [ info] [storage] initializing...
[2019/03/26 15:55:49] [ info] [storage] in-memory
[2019/03/26 15:55:49] [ info] [storage] normal synchronization mode, checksum disabled
[2019/03/26 15:55:49] [ info] [engine] started (pid=40718)
[2019/03/26 15:55:49] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/03/26 15:55:49] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/03/26 15:55:49] [debug] [out_stackdriver] JWT signature: <SNIP>
[2019/03/26 15:55:49] [error] [oauth2] could not get an upstream connection
[2019/03/26 15:55:49] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/03/26 15:55:49] [ warn] [out_stackdriver] token retrieval failed
[2019/03/26 15:55:49] [debug] [router] match rule syslog.0:stdout.0
[2019/03/26 15:55:49] [debug] [router] match rule syslog.0:stackdriver.0

i've tracked the error back to this line:

flb_error("[oauth2] could not get an upstream connection");

i don't know what can cause flb_upstream_conn_get to fail...

@theFroh
Author

theFroh commented Mar 27, 2019

I was never able to.

@edsiper
Member

edsiper commented Mar 27, 2019

that specific upstream connection error is a TCP connection error reaching the HTTPS end-point.

@jakeswenson

Thanks for the pointer @edsiper
In my case this is on a freebsd jail, but curl works fine with https reaching the google apis. any pointers as to how to diagnose this SSL/TLS issue?
I can try getting a tcp dump to see if that shows any issues...

@edsiper
Member

edsiper commented Mar 28, 2019

@jakeswenson did you try tls.debug N ?:

https://docs.fluentbit.io/manual/configuration/tls_ssl

If you try the same thing on a Linux box, does it work? I am wondering whether there is an issue on BSD that needs to be fixed.

@jakeswenson

jakeswenson commented Mar 28, 2019

@edsiper I just tried with that setting and I am not seeing any new output. Does stackdriver respect this tls setting?

Fluent Bit v1.0.4
Copyright (C) Treasure Data

[2019/03/28 13:11:51] [debug] [storage] [cio stream] new stream registered: dummy.0
[2019/03/28 13:11:51] [debug] [storage] [cio stream] new stream registered: syslog.0
[2019/03/28 13:11:51] [ info] [storage] initializing...
[2019/03/28 13:11:51] [ info] [storage] in-memory
[2019/03/28 13:11:51] [ info] [storage] normal synchronization mode, checksum disabled
[2019/03/28 13:11:51] [ info] [engine] started (pid=87027)
[2019/03/28 13:11:51] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/03/28 13:11:51] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/03/28 13:11:51] [debug] [out_stackdriver] JWT signature: <SNIP>
[2019/03/28 13:11:51] [error] [oauth2] could not get an upstream connection
[2019/03/28 13:11:51] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/03/28 13:11:51] [ warn] [out_stackdriver] token retrieval failed
[2019/03/28 13:11:51] [debug] [router] match rule dummy.0:stdout.0
[2019/03/28 13:11:51] [debug] [router] match rule dummy.0:stackdriver.0
[0] dummy.log: [1553803912.848473852, {"message"=>"dummy"}]
[2019/03/28 13:11:56] [debug] [task] created task=0x801c40300 id=0 OK
[1] dummy.log: [1553803913.852387878, {"message"=>"dummy"}]
[2] dummy.log: [1553803914.863814322, {"message"=>"dummy"}]
[3] dummy.log: [1553803915.908904521, {"message"=>"dummy"}]
[2019/03/28 13:11:56] [debug] [retry] new retry created for task_id=0 attemps=1
[2019/03/28 13:11:56] [debug] [sched] retry=0x801c26f80 0 in 11 seconds

I ran with tls.debug 3

here is my config

[SERVICE]
	Flush 5
	Daemon off
	Log_Level trace
	Coro_Stack_Size 65536
	Parsers_File /usr/local/etc/fluent-bit/parsers.conf
[INPUT]
	Name dummy
	Tag dummy.log
[INPUT]
	Name syslog
	Path /tmp/in_syslog
	Chunk_Size 32
	Buffer_Size 64
	Tag syslog.log
[OUTPUT]
	Name stdout
	Match dummy.*
[OUTPUT]
	Name stackdriver
	Match dummy.*
	google_service_credentials /etc/gcp.creds.json
	resource global
	tls        On
	tls.verify Off
	tls.debug 3

also i ran a tcpdump and the only traffic i am getting is DNS requests for www.googleapis.com and logging.googleapis.com (both resolve) and no actual TCP traffic...
image

i can try to find a linux box to try this on, but it may take some time... until then it seems like the error is in the http library after dns but before actually sending a packet.... any thoughts @edsiper?

@edsiper
Member

edsiper commented Mar 28, 2019

we use a pretty common libc function to resolve DNS:

https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L215

Hmm, I'm not sure what it could be, since you should at least see a warning or error message.

@jakeswenson

I've been able to patch and build my own version of fluent bit that prints a bit more logging, to try to find where the error is.
https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L311
this line is failing with errno 22 (EINVAL)
i have no idea why or what this means... any thoughts @edsiper?
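
For reference, a numeric errno can be translated into its symbolic name and message with Python's standard library. Errno 22 is `EINVAL` ("Invalid argument"), which for `connect()` typically indicates a problem with the arguments passed, such as the address structure or its length, rather than a refusal by the remote host. `describe_errno` is an illustrative helper name.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```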

@edsiper
Member

edsiper commented Mar 28, 2019 via email

@jakeswenson

yes, connect()

@sebbacon

sebbacon commented Jul 7, 2019

This appears to be related to ipv6. If I turn off ipv6 support as follows, things work as expected.

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1

@jakeswenson

jakeswenson commented Jul 18, 2019

Wait, what? @sebbacon thanks for testing that disabling ipv6 fixes it. I think it's a poor experience if, instead of the plugin filtering out the ipv6 address when it doesn't support it, I have to modify my machine to disable ipv6 just to run fluent-bit.
can anyone point me at the code that is at issue and i can try to look in to fixing this?

@jakeswenson

Also i can verify that i have ipv6 enabled (on loopback...) and that google (obviously) has an AAAA record:

# host www.googleapis.com                                                                   
www.googleapis.com is an alias for googleapis.l.google.com.                                             
googleapis.l.google.com has address 172.217.3.202                                                       
googleapis.l.google.com has address 172.217.14.202                                                      
googleapis.l.google.com has address 172.217.14.234                                                      
googleapis.l.google.com has IPv6 address 2607:f8b0:400a:803::200a
# ifconfig                                                                                  
lo0: flags=8048<LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384                                          
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>                                           
        inet6 ::1 prefixlen 128 tentative                                                               
        inet6 fe80::1%lo0 prefixlen 64 tentative scopeid 0x1                                            
        inet 127.0.0.1 netmask 0xff000000                                                               
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>                                                       
        groups: lo                                                                                      
epair1b: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500                           
        options=8<VLAN_MTU>                                                                 
        inet 10.0.51.50 netmask 0xffff0000 broadcast 10.0.255.255                                        
        nd6 options=1<PERFORMNUD>                                                                       
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)                                             
        status: active                                                                                  
        groups: epair

@arabustams
Contributor

arabustams commented Jul 19, 2019

In an environment where IPv6 has higher priority than IPv4, I found that it failed to establish upstream connections to oauth2 and Stackdriver Logging, so I reported #1348.

The fix has been merged into v1.2.
If you can use Fluent Bit v1.2, run it with the following setting.

[OUTPUT]
    Name stackdriver
    Match *
    IPv6 On

You can specify IPv6 On in the out_stackdriver configuration, as with other output plugins; the out_stackdriver module then uses IPv6 mode explicitly.

However, oauth2 is a little different.
With the fix, the oauth2 module attempts to connect in IPv6 mode if the upstream connection over IPv4 failed.
I wonder if it might be better to make the oauth2 module configurable like the output plugins...
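
The fallback idea described here (try one address family, then the other if the connection fails) can be illustrated with a short sketch. This is not Fluent Bit's code, which is written in C; `connect_with_fallback` and its parameters are hypothetical names used only to show the idea.

```python
import socket


def connect_with_fallback(host, port,
                          families=(socket.AF_INET, socket.AF_INET6),
                          timeout=3.0):
    """Try each address family in order; return (connected socket, errors so far).

    Raises OSError if no address in any family accepts the connection.
    """
    errors = []
    for family in families:
        try:
            candidates = socket.getaddrinfo(
                host, port, family=family, type=socket.SOCK_STREAM
            )
        except OSError as exc:  # includes socket.gaierror: no address of this family
            errors.append(exc)
            continue
        for fam, socktype, proto, _canon, sockaddr in candidates:
            sock = None
            try:
                sock = socket.socket(fam, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock, errors
            except OSError as exc:
                errors.append(exc)
                if sock is not None:
                    sock.close()
    raise OSError(f"could not connect to {host}:{port}; {len(errors)} attempt(s) failed")
```

Returning the accumulated errors alongside the socket keeps the failed attempts visible, so a caller can log which family was skipped instead of silently succeeding on the second try.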

@arabustams
Contributor

In addition, the out_bigquery plugin probably has the same problem.
Since I was not able to test with BigQuery, and fixing out_stackdriver was enough for me, I did not fix out_bigquery.

edsiper added a commit that referenced this issue Jul 19, 2019
@edsiper
Member

edsiper commented Jul 19, 2019

thanks everyone for the report, I've added ipv6 mode to out_bigquery on 466191c

edsiper added a commit that referenced this issue Jul 19, 2019
@jakeswenson

I built and ran fluent-bit 1.2.1 on my freebsd machine and I'm still getting the same error:

# ./fluent-bit -c /etc/logs.conf
Fluent Bit v1.2.1
Copyright (C) Treasure Data

[2019/07/19 08:43:32] [debug] [storage] [cio stream] new stream registered: dummy.0
[2019/07/19 08:43:32] [debug] [storage] [cio stream] new stream registered: syslog.1
[2019/07/19 08:43:32] [ info] [storage] initializing...
[2019/07/19 08:43:32] [ info] [storage] in-memory
[2019/07/19 08:43:32] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2019/07/19 08:43:32] [ info] [engine] started (pid=43877)
[2019/07/19 08:43:32] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/07/19 08:43:32] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/07/19 08:43:32] [debug] [out_stackdriver] JWT signature:
eyJhbG<SNIP>
[2019/07/19 08:43:32] [error] [oauth2] could not get an upstream connection
[2019/07/19 08:43:32] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/07/19 08:43:32] [ warn] [out_stackdriver] token retrieval failed
[2019/07/19 08:43:32] [debug] [router] match rule dummy.0:stdout.0
[2019/07/19 08:43:32] [debug] [router] match rule dummy.0:stackdriver.1
[2019/07/19 08:43:32] [ info] [sp] stream processor started
^C[engine] caught signal (SIGINT)
[2019/07/19 08:43:35] [ info] [input] pausing dummy.0
[2019/07/19 08:43:35] [ info] [input] pausing syslog.1

config:

# cat /etc/logs.conf
[SERVICE]
        Flush 5
        Daemon off
        Log_Level trace
        Coro_Stack_Size 65536
        Parsers_File /usr/local/etc/fluent-bit/parsers.conf
[INPUT]
        Name dummy
        Tag dummy.log
[INPUT]
        Name syslog
        Path /tmp/in_syslog
        Chunk_Size 32
        Buffer_Size 64
        Tag syslog.log
[OUTPUT]
        Name stdout
        Match dummy.*
[OUTPUT]
        Name stackdriver
        Match dummy.*
        google_service_credentials /etc/gcp.creds.json
        resource global
        tls        On
        tls.verify Off
        tls.debug 4
        IPv6 On

It doesn't matter whether I set IPv6 to On or Off; I get the same error.

is there anything else i can do to help debug this?

@edsiper
Member

edsiper commented Jul 25, 2019

It looks like the output above doesn't contain trace messages. Would you please re-run it? (I see trace enabled in the config, but I don't see it in the output.)

@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label Jul 25, 2019
@jakeswenson

@edsiper as i'm sure you know trace requires fluent-bit to be built with tracing enabled... https://docs.fluentbit.io/manual/configuration/file#config_section

I'm certain it's not built that way by default, and I need to read up on how it's enabled using the options framework.

Are there any log lines in particular you're looking for from tracing?

@edsiper
Member

edsiper commented Aug 27, 2020

FYI: the Stackdriver output plugin has been improved heavily of late (thanks to the Google team's involvement in the project), so I am closing this ticket. Please create a new one if you still face an issue.

@edsiper edsiper closed this as completed Aug 27, 2020
@rquinlivan

I am still seeing this in 1.7. The stackdriver plugin logs nothing even at trace.

@rquinlivan

rquinlivan commented May 24, 2021

@theFroh @edsiper Can we reopen this issue? I am seeing the same issues with ipv6 reported in this thread. I installed the 1.7.4 amd64 version via the Debian package.

@edsiper
Member

edsiper commented May 24, 2021

For new issues, please open a new ticket.

FYI: v1.7.6 was tested extensively with Stackdriver on Google Cloud: a 10-hour run sending 150k messages per second, no issues found.

rawahars pushed a commit to rawahars/fluent-bit that referenced this issue Oct 24, 2022

9 participants