all-in-one with non-memory storage (Kubernetes) #740

Closed
emailtovamos opened this issue Oct 30, 2019 · 30 comments
Comments

@emailtovamos

I could find an example of an all-in-one Jaeger instance with in-memory storage, but there is no such example for doing it with Elasticsearch. Where can I find one?
I understand that one has to have Elasticsearch running already, and THEN one can incorporate the corresponding changes into that YAML file. But is there a simple way/file to get both the Jaeger instance AND Elasticsearch from the same file?

For someone like me who mainly wants persistent storage (with default options) and doesn't want to figure out/manage the details of Elasticsearch, this would really help.
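Roughly what I mean: a CR along these lines, assuming Elasticsearch is already reachable somewhere (the instance name and URL below are just placeholders):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: with-elasticsearch
spec:
  storage:
    type: elasticsearch
    options:
      es:
        # placeholder URL; assumes an Elasticsearch cluster is already running
        server-urls: http://elasticsearch:9200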

@objectiser
Contributor

There is another example that uses Badger local storage - this should give you what you are looking for. We need to update the documentation to clearly outline this option.

@emailtovamos
Author

Thanks, I will try it out. But is there any other setup needed for Badger, like creating a volume separately? Or does your example already take care of it?

@objectiser
Contributor

@emailtovamos No additional setup - the example yaml sets up the volume for local storage.
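For reference, the relevant parts of that example look roughly like this (a sketch, not the exact file; the emptyDir volume is what backs the local storage):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: with-badger
spec:
  storage:
    type: badger
    options:
      badger:
        ephemeral: false
        directory-key: /badger/key
        directory-value: /badger/data
  volumeMounts:
    - name: data
      mountPath: /badger
  volumes:
    - name: data
      emptyDir: {}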

@emailtovamos
Author

Thanks @objectiser - I tried the Badger storage as per the file you mentioned. It worked fine, as I could see my services in the UI. But just to test the persistence, I deleted the Jaeger instance pod. Another pod sprang back up as expected, but this time I could no longer see my services in the UI. Is this expected behaviour? I was expecting it to still show the old data.

@jpkrohling
Contributor

jpkrohling commented Oct 31, 2019

For production purposes, you would probably want to provision the storage yourself and specify the volume/volume mount in the Jaeger CR. The Jaeger Operator will only create emptyDir volumes, which effectively makes it only slightly better than ephemeral storage.

edit: I meant to say that our examples are using emptyDir, not that the operator will create emptyDir volumes (which doesn't make any sense...)

@emailtovamos
Author

You mean like the options shown here: https://www.jaegertracing.io/docs/1.14/operator/#storage-options ?

@jpkrohling
Contributor

The example that @objectiser mentioned, and that you are probably using, is the right way; just replace the emptyDir in the volume definition with a production-quality concrete storage type: https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes
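Concretely, assuming the volume in that example is named data, the change is limited to the volumes entry, e.g. (the claim name here is just an example):

  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: jaeger-badger-data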

@emailtovamos
Author

OK, so I created a PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: jaegerpvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "1Gi"
  storageClassName: "my-storageclass"

Then I referenced it in the Jaeger instance with Badger:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
spec: 
  storage:
    type: badger
    options:
      badger:
        ephemeral: false
        directory-key: "/badger/key"
        directory-value: "/badger/data"
    volumeMounts:
    - name: data
      mountPath: /badger
    volumes:
    - name: data
      persistentVolumeClaim:
            claimName: jaegerpvc

I can see the PVC and PV fine when I do e.g. kubectl get pv --all-namespaces.
I was expecting the pod to restart and/or mention the PVC, but nothing changed. How can I make sure it is working as expected? Or am I missing another step?

@jpkrohling
Contributor

The indentation looks odd. Do you get any error messages when you try to apply this resource? Could you please start the operator with --log-level=debug and share the logs?
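For example, assuming the operator runs as the usual jaeger-operator Deployment, the flag can be added to its container args (an excerpt of the Deployment spec; the names here are assumptions):

      containers:
        - name: jaeger-operator
          args: ["start", "--log-level=debug"]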

@emailtovamos
Author

{"level":"debug","ts":1572625156.8930852,"caller":"app/span_processor.go:124","msg":"Span written to the storage by the collector","trace-id":"65c73a5837feb416","span-id":"65c73a5837feb416"}
{"level":"debug","ts":1572625157.892143,"caller":"processors/thrift_processor.go:116","msg":"Span(s) received by the agent","bytes-received":331}
{"level":"debug","ts":1572625157.8931003,"caller":"app/span_processor.go:124","msg":"Span written to the storage by the collector","trace-id":"5e915efb5e9f8483","span-id":"5e915efb5e9f8483"}
{"level":"debug","ts":1572625158.8920205,"caller":"processors/thrift_processor.go:116","msg":"Span(s) received by the agent","bytes-received":329}
{"level":"debug","ts":1572625158.892912,"caller":"app/span_processor.go:124","msg":"Span written to the storage by the collector","trace-id":"23c6edea6faf6a10","span-id":"23c6edea6faf6a10"}

@emailtovamos
Author

The above logs look as expected, right? I just deleted the pod and it restarted, but in the UI I could no longer see the older traces.

@objectiser
Contributor

objectiser commented Nov 4, 2019

@emailtovamos Can you try with the modified indentation as below:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
spec: 
  storage:
    type: badger
    options:
      badger:
        ephemeral: false
        directory-key: "/badger/key"
        directory-value: "/badger/data"
  volumeMounts:
  - name: data
    mountPath: /badger
  volumes:
  - name: data
    persistentVolumeClaim:
       claimName: jaegerpvc

The volumes and volumeMounts should be at the same level as storage.

@emailtovamos
Author

emailtovamos commented Nov 4, 2019

Thanks @objectiser! It works now! When I delete the Jaeger pod and a new pod gets created, I can now see the traces from the old pod in the Jaeger UI.

One last question:
Is there any option so that it only keeps the latest 1GB of data, or the latest 7 days of data, or something similar? No matter how much storage I give my PVC, it will eventually fill up. What's the usual way to deal with this? I couldn't find such an option in the documentation: https://www.jaegertracing.io/docs/1.13/deployment/#badger-local-storage

@objectiser
Contributor

There is a badger.span-store-ttl option, which defaults to 72 hours; it can be found here: https://www.jaegertracing.io/docs/1.14/cli/#jaeger-all-in-one-badger
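In the CR it would go alongside the other badger options, for example (the value shown is just the default):

  storage:
    type: badger
    options:
      badger:
        ephemeral: false
        directory-key: /badger/key
        directory-value: /badger/data
        span-store-ttl: 72h0m0s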

@emailtovamos
Author

Thanks!
span-store-ttl: "72h0m0s"
Is the above formatting OK? I mean, with the quotes around the duration value.

@objectiser
Contributor

Yes, I believe so - let us know if you run into any problems with it.

@emailtovamos
Author

Thanks.
Since Badger isn't mentioned in the Storage Options section of this page (https://www.jaegertracing.io/docs/1.14/operator/#storage-options), can I add information about setting up Badger as discussed above here (https://github.com/jaegertracing/documentation/blob/master/content/docs/1.14/operator.md#storage-options) and open a pull request?

@objectiser
Contributor

Yes please!

@emailtovamos
Author

Sure, will do.

BTW, I tried to re-apply the YAML with the span-store-ttl value, and now it is no longer running; it gives the following Badger-related error ("Failed to init storage factory"). I am not sure whether this can be resolved without deleting the data, and if the data has to be deleted, how.

{ 
   "level":"fatal",
   "ts":1572870354.2647974,
   "caller":"all-in-one/main.go:105",
   "msg":"Failed to init storage factory",
   "error":"Unable to replay value log: \"/badger/data/000006.vlog\": Value log truncate required to run DB. This might result in data loss.",
   "errorVerbose":"Value log truncate required to run DB. This might result in data loss.\ngithub.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger.init\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger/errors.go:98\nruntime.doInit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:5222\nruntime.doInit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:5217\nruntime.doInit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:5217\nruntime.doInit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:5217\nruntime.doInit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:5217\nruntime.main\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:190\nruntime.goexit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/asm_amd64.s:1357\nUnable to replay value log: \"/badger/data/000006.vlog\"\ngithub.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger.(*valueLog).Replay\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger/value.go:772\ngithub.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger.Open\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/dgraph-io/badger/db.go:306\ngithub.com/jaegertracing/jaeger/plugin/storage/badger.(*Factory).Initialize\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/plugin/storage/badger/factory.go:119\ngithub.com/jaegertracing/jaeger/plugin/storage.(*Factory).Initialize\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/plugin/storage/factory.go:108\nmain.main.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/all-in-one/main.go:104\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/all-in-one/main.go:171\nruntime.main\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:203\nruntime.goexit\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/asm_amd64.s:1357",
   "stacktrace":"main.main.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/all-in-one/main.go:105\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/all-in-one/main.go:171\nruntime.main\n\t/home/travis/.gimme/versions/go1.13.4.linux.amd64/src/runtime/proc.go:203"
}

@objectiser
Contributor

@emailtovamos An option to truncate was added recently. You could try this out by using the jaegertracing/all-in-one:latest image in the CR (under the allInOne node).
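For example (assuming badger.truncate is the option name exposed for the new flag):

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
spec:
  allInOne:
    image: jaegertracing/all-in-one:latest
  storage:
    type: badger
    options:
      badger:
        ephemeral: false
        truncate: true
        directory-key: /badger/key
        directory-value: /badger/data
  # plus the same volumeMounts / volumes as in the CR above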

@objectiser
Contributor

If this is reproducible, could you provide the log of the pod that fails, from before you restart it and hit this "Failed to init storage factory" error? It might help us detect and avoid this failure.

@emailtovamos
Author

Thanks @objectiser. I added the truncate option, although I have yet to check whether it avoids that problem.
BTW, since I am setting a PVC for Badger with some amount of storage (e.g. 50GB), what happens once this storage limit is reached? Is there any automated way to handle this, or does one have to manually create more storage?

@objectiser
Contributor

@burmanm would you be able to answer?

@burmanm

burmanm commented Nov 6, 2019

I'm not sure what the question really is. If the database runs out of disk space, there's nothing it can do. It can't free space, since it can't write the deletes, and it can't rearrange the data either.

@emailtovamos
Author

Thanks @burmanm. A practical scenario that happened to me today:
The Badger database filled up and no new traces were being saved. But if I have set my span-store-ttl option to, say, 48h, and 48 hours have already passed since a high-frequency update, I should expect the pod to start writing traces again, right? Since some data would be "old" enough to get deleted.

@burmanm

burmanm commented Nov 7, 2019

No, it would not continue. Writes never happen in place; the SST files are immutable. So when the TTL expires, the next compaction will remove the old data (and write new SST files without the expired data). But since there's no disk space, the compaction cannot run.

Also, I would assume that at that point the WAL contains operations that are also in the memtable, which can't be flushed for proper compaction, so the WAL can't be cleaned either. To stay consistent, compaction can't really proceed, since it can't write all the data to disk and produce correctly sorted SST files.

So you should always keep enough free disk space to ensure that compactions can take place. The same applies to the Cassandra backend, as both are based on LSM trees.

@emailtovamos
Author

Thanks @burmanm for the detailed explanation.
How can I check whether the deletion is really happening? I have now given it enough storage and set 24 hours as the span-store-ttl. But when I checked the disk space used by the pod in the Google Cloud console, I did not see any drop in usage 24 hours after starting.

@emailtovamos
Author

Actually, when I search for the old traces I can't find them, which is the expected behaviour.

The only thing I'm worried about is the constant increase in disk usage. I was expecting it to stay around the level it reached at the end of the first 24 hours, but it has almost always been increasing, apart from a few drops. So no matter how much space I assign, there is always a chance of hitting the limit!
[screenshot: pod disk usage in the Google Cloud console, steadily increasing over time]

@pavolloffay
Member

pavolloffay commented Nov 11, 2019

@emailtovamos could you please open an issue in the main repository regarding Badger not cleaning up the data properly?
