task(buffers): ensure disk_v1 buffer uses same directory as before #10430

Closed
Tracked by #9476
tobz opened this issue Dec 13, 2021 · 3 comments
Labels: domain: buffers · meta: regression · type: task

tobz commented Dec 13, 2021

We didn't properly account for this ahead of time, so we need to make sure the directories used between 0.18.x and master match when the buffer type is disk (the alias for disk_v1); otherwise, we'd have introduced a regression.

For users who upgraded to 0.19.1+: you may be here after seeing a warning message similar to this:

Found both old and new buffers with data for 'foo' sink. This may indicate that you upgraded to 0.19.x prior to a regression being fixed which deals with disk buffer directory names. Using new buffers and ignoring old. See #10430 for more information.

You can suppress this message by renaming the old buffer data directory to something else. Current path for old buffer data directory: /tmp/vector/foo_buffer, suggested path for renaming: /tmp/vector/foo_id
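The rename is a single mv once Vector is stopped. The sketch below uses the example paths from the warning message (it creates a stand-in directory first so it is safe to run as-is); substitute your own data_dir and sink ID, and note that the foo_buffer_old target name is just one choice -- any name that no longer ends in _buffer works:

```shell
# Paths come from the example warning message above; substitute your own
# data_dir and sink ID. Stop Vector before moving buffer directories.
mkdir -p /tmp/vector/foo_buffer                       # stand-in for the old buffer dir
mv /tmp/vector/foo_buffer /tmp/vector/foo_buffer_old  # any new name works
```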

See the comment below for how to manually resolve this issue.

tobz commented Jan 12, 2022

Well, running the same configuration (using the disk v1 buffer) on both 0.18.1 and 0.19.0 shows differing data directories, so that's a definite regression. Ugh. :(

tobz commented Jan 14, 2022

Closed via #10826.

@tobz tobz closed this as completed Jan 14, 2022
tobz commented Jan 18, 2022

Manual resolution of simultaneous old/new disk buffers

As part of 0.19.0, we released some changes to the buffer code that were an incremental part of our overall work on improving buffers. This work was intended to be backwards-compatible and transparent to you, the end user. Unfortunately, we (me, really) introduced a regression that unintentionally changed the data directory path for disk buffers. We pushed a subsequent partial fix -- #10826 -- but if you're here, it's because you've hit a particular corner case that we could not handle in the partial fix.

The regression itself essentially had the effect of creating a new buffer: since the data directory path changed, Vector would leave the old data in a directory ending in _buffer and create a new directory ending in _id.

What this means for you, as a user, is that you now have orphaned data still present in the "old" buffer -- the data directory suffixed with _buffer -- that you may or may not want to process.

My data is ephemeral / old enough that I don't care about it anymore:

If you do not care about this data, potentially because now it's too old to matter, then you can simply delete the old data directory, the one suffixed with _buffer, and the warning message will go away.
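Assuming the example paths from the warning message, this is a single destructive command; the sketch creates a stand-in directory first so it can be run safely, and you should substitute your own data_dir and sink ID:

```shell
mkdir -p /tmp/vector/foo_buffer   # stand-in for the leftover old buffer dir
# Destructive: only do this if the buffered data is disposable.
rm -rf /tmp/vector/foo_buffer
```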

I want to process all the data leftover in the old buffer:

If you want to process all of the messages in the "old" buffer, follow the steps below. For the sake of example, we use a sink called foo. The name of the sink refers to the identifier given in the configuration. For example:

In TOML:

# `my_sink_id` is the ID of the sink.
[sinks.my_sink_id]
...

In YAML:

# `my_sink_id` is the ID of the sink.
sinks:
  my_sink_id:
...

Additionally, some of the log messages will refer to a full directory, which will be rooted at /tmp/vector for the sake of example. This will likely be different for you, but should match whatever you have data_dir set to in your configuration.

Alright, now that we know how to grab the ID of the sink, the actual steps:

  1. Stop the Vector process running the affected configuration, if it is still running.
  2. Rename the "new" data directory so that it no longer ends in the _id suffix: for example, from foo_id to foo_id_moved.
  3. Start Vector.
  4. The following log message should appear in the logs during startup: Migrated old buffer data directory from '/tmp/vector/foo_buffer' to '/tmp/vector/foo_id' for 'foo' sink.
  5. Wait until Vector drains the buffer and can be safely stopped. (see note below)
  6. Stop the running Vector process.
  7. Rename the moved "new" data directory back to its original name. This would be going from foo_id_moved back to foo_id, per the example in step 2.
  8. Start Vector.
  9. The following log message should appear in the logs during startup: Archived old buffer data directory from '/tmp/vector/foo_buffer' to '/tmp/vector/foo_buffer_old' for 'foo' sink.
  10. At this point, the /tmp/vector/foo_buffer_old directory will still contain some leftover files, but it holds no more records and can be safely deleted. If it still held records, you would see the original message again -- Found both old and new buffers with data for 'foo' sink. -- indicating that there was still data.
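The rename portions of the steps above can be sketched as shell commands. This sketch runs against a scratch directory (/tmp/vector-demo, an assumption made so it is safe to execute) rather than your real data_dir, and the Vector stop/start/drain steps appear only as comments since process management varies by install:

```shell
root=/tmp/vector-demo                        # scratch stand-in for your data_dir
mkdir -p "$root/foo_buffer" "$root/foo_id"   # simulate the old and new buffer dirs

# step 1: stop Vector
mv "$root/foo_id" "$root/foo_id_moved"       # step 2: hide the new buffer
# steps 3-4: start Vector; it migrates foo_buffer -> foo_id and logs the
#            "Migrated old buffer data directory" message
# steps 5-6: wait for the old buffer to drain, then stop Vector
mv "$root/foo_id_moved" "$root/foo_id"       # step 7: restore the new buffer
# steps 8-10: start Vector again; the drained old directory is archived as
#             foo_buffer_old and can then be deleted
```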

Quiescing Vector to allow buffer to drain:

The most complicated step here is step 5: draining the old buffer. This requires ensuring that the buffer is fully drained before moving on, which can be hard to accomplish without also figuring out how to stop accepting events in Vector. One possible option is simply duplicating your configuration and changing the source(s) so that they no longer receive data. This could mean listening on a different port for a source that accepts HTTP requests or raw TCP connections, and so on. For sources that pull their data from an external system, it may mean pointing the source at a non-existent bucket/topic/etc. so that it runs but has no chance of actually receiving any events.

In certain circumstances, these options may be too heavyweight, so another option is to simply remove everything but the sink in question from the duplicated config and add a dummy source -- the socket source works well here -- to allow the sink to be configured correctly, but in a way that ensures it never receives any data. Here's an example of a dummy socket source:

[sources.dummy_source]
type = "socket"
address = "0.0.0.0:60000"
mode = "tcp"

# IMPORTANT: The sink ID _must_ stay the same, as it relates to which data directories it uses on disk.
[sinks.my_sink_id]
inputs = ["dummy_source"]
...

This example uses a very high port that should not conflict with much of anything, but there's a chance you may need to pick another port if something is already listening on that one.
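One way to check the port before reusing it -- a sketch assuming a Linux host where iproute2's ss is available; fall back to whatever socket-listing tool your platform ships:

```shell
port=60000
# If ss exists and shows a listener on the port, pick a different one.
if command -v ss >/dev/null 2>&1 && ss -ltn | grep -q ":$port "; then
  echo "port $port is already in use; pick another"
else
  echo "port $port looks free (or ss is unavailable on this host)"
fi
```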

Internal remediation

Firstly, we're very sorry.

It sucks that this bug managed to slip through our review process, and made it out into a release. While bugs are inevitable, they're especially impactful when they're related to the very features we ship in Vector to help you avoid data loss in the first place, like buffers.

We've already opened #10895 to better guard against this particular issue happening again, and we will continue to look for ways to better codify and test for a consistent user experience between releases.
