How to know when to shutdown Vector? #23114

isbm · 2025-05-27T13:51:13Z

Hi, peeps.

Here's the problem: Vector is used in a volatile, short-lived pipeline. The scenario is the following:

1. Something is uploading a file, while all logs are recorded elsewhere.
2. Each log entry is sent to Vector.
3. Vector probably does something with them (or even drops some, depending on the .vrl choices), so I cannot verify completion by counting messages in and out: I don't know what it dropped or what the user configured in .vrl. I also don't know which message is the last one, even though I can send a final message as a "Stream is finished" marker. Vector seems to handle it asynchronously, so the marker can arrive before the other messages. Later on they all land in TimescaleDB correctly anyway, so the async behavior isn't a problem for Vector itself.
4. Once the file hits EOF, the whole pipeline must be shut down, and then tests or other checks must be run on the data. Specifically, we are using TimescaleDB (PostgreSQL).

But here is the catch: how do I know that Vector has finished and all the data is already out of it? In other words, when do I begin my log data analysis: immediately after EOF? After 10 seconds? A minute? Longer?

The easiest way would be to know how many messages it has processed so far, or at least to have some way to "kick" it via an API or similar, to ensure it shuts down properly and flushes everything. For example, if I send SIGINT to the process (https://vector.dev/docs/reference/cli/#vector_graceful_shutdown_limit_secs), Vector claims "I am shutting down gracefully", but it actually still seems to be chewing through log messages in the pool of the PostgreSQL sink, effectively killing them, which skews my analysis results. 😉 Currently I have just duct-taped it with a sleep of ~10 seconds followed by a SIGINT (and that works). But this also sucks for obvious reasons.

So what would be the idea here? How do I properly shut it down and be 100% sure that Vector is really finished?

@jszwedko it would be nice if you could shed a bit of light here! (thanks)

Answered by jszwedko

May 30, 2025

It is possible to disable the timeout via --no-graceful-shutdown-limit. See https://vector.dev/docs/reference/cli/#vector_no_graceful_shutdown_limit. Note Vector will shut down earlier if there is no more input to process; the limit is just the maximum amount of time it will wait.

jszwedko · 2025-05-27T14:10:46Z

Maintainer

Hi @isbm ,

Vector unfortunately lacks a good way to know when processing is done. I think the workaround you came up with is probably the best you could do. #11095 is tracking these sorts of "ETL" use-cases.

0 replies

isbm · 2025-05-27T15:23:31Z

Author

@jszwedko well OK, but "graceful shutdown" on SIGINT should still at least catch all the waiting/in-process messages in all pools/buffers and wait until they are done, right? IOW, you don't quit while any pool/source/sink across the whole thing still contains anything, and you quit iff all sources/sinks have a length of 0. Basically I would still expect:

1. Streaming finished (i.e. SIGINT issued, because there is no such thing as "EOF" in the OTEL spec).
2. Vector drops the client connection to block any further input.
3. Vector says "OK, but wait, I am still processing some VRL/other stuff".
4. Once the process quits, that's the event to catch to start the tests/analysis.

If that is not the case right now, I would consider it a bug that needs to be fixed, no? 😉 Because otherwise it means you are losing data.

Unless I am wrong, I don't think the fix is hard. All that needs to be done is to define a "length" on each sink/source in a trait and force every implementation to provide it. Then you just loop over all defined sources/sinks, checking whether they are at 0 and sleeping for 0.05 in between 😉. Would that fix work? Because if yes, we would PR that. At some point...
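
A minimal sketch of the polling idea described in this comment, written in Python rather than Rust for brevity. `Drainable`, `pending()`, `FakeSink`, and `wait_until_drained` are hypothetical names used only for illustration; none of them is part of Vector, and the 0.05 interval is assumed to mean seconds.

```python
import time
from typing import Iterable, Protocol


class Drainable(Protocol):
    """Hypothetical interface: a component that can report its backlog."""

    def pending(self) -> int: ...


class FakeSink:
    """Stand-in component that flushes one event every time it is polled."""

    def __init__(self, backlog: int) -> None:
        self.backlog = backlog

    def pending(self) -> int:
        remaining = self.backlog
        if self.backlog > 0:
            self.backlog -= 1  # simulate progress between polls
        return remaining


def wait_until_drained(components: Iterable[Drainable], poll_secs: float = 0.05) -> int:
    """Poll every component until all report zero pending events.

    Returns the number of polling rounds that were needed.
    """
    components = list(components)
    rounds = 0
    while True:
        rounds += 1
        # Query every component each round (no short-circuiting).
        counts = [c.pending() for c in components]
        if sum(counts) == 0:
            return rounds
        time.sleep(poll_secs)
```

Note that this loop has no upper bound: a component that never reaches zero would block shutdown forever, which is presumably why a time limit exists alongside any such check.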

1 reply

jszwedko May 27, 2025
Maintainer

Right, what you describe is the expected behavior. When Vector starts shutting down, sources are expected to stop first (e.g. don't allow any more connections) but any in-flight data is still handled up to the graceful shutdown timeout (60 seconds by default).
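
The behavior described here (sources stop first, then in-flight data is handled until either nothing is left or the limit expires) can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not Vector's implementation: `graceful_shutdown`, `in_flight`, and `deliver` are hypothetical names, and the caller is assumed to have already stopped every source.

```python
import time
from collections import deque
from typing import Any, Callable


def graceful_shutdown(
    in_flight: deque,
    deliver: Callable[[Any], None],
    limit_secs: float = 60.0,
) -> int:
    """Drain `in_flight` through `deliver` until it is empty or time runs out.

    Assumes sources are already stopped, so nothing new is appended while
    this runs. Returns the number of events left undelivered (0 = clean).
    """
    deadline = time.monotonic() + limit_secs
    # Exit as soon as the queue is empty; the limit is only an upper bound.
    while in_flight and time.monotonic() < deadline:
        deliver(in_flight.popleft())
    return len(in_flight)
```

A non-zero return value corresponds to the data-loss case: the limit expired while events were still waiting.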

isbm · 2025-05-30T10:20:26Z

Author

@jszwedko OK, so in this case the "timeout" is a thing anyway. Only one question remains:

Should we make the timeout dynamic?

I.e. Vector must be fully aware when it is safe to terminate completely because all sinks/sources are already empty, instead of waiting for no reason until the last second of a fixed timeout, which can (potentially) also be too short, as you never know what kind of processing the user does.

The benefit of this approach is that Vector would not need to wait a minute, but could terminate immediately as long as there are no more messages left to process, thus saving DevOps time before they fire up their tests; likewise, it would wait even longer if necessary.

12 replies

jszwedko May 30, 2025
Maintainer

It is possible to disable the timeout via --no-graceful-shutdown-limit. See https://vector.dev/docs/reference/cli/#vector_no_graceful_shutdown_limit. Note Vector will shut down earlier if there is no more input to process; the limit is just the maximum amount of time it will wait.
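
As a rough sketch of how a test harness might use this: start Vector with the flag, send SIGINT once the upload is finished, and treat process exit as the signal to begin analysis. The helper names `vector_command` and `stop_and_wait` and the config path are hypothetical; the sketch assumes a POSIX system and a `vector` binary on `PATH`.

```python
import signal
import subprocess


def vector_command(config_path: str) -> list[str]:
    """Build a Vector command line with the shutdown time limit disabled."""
    return ["vector", "--config", config_path, "--no-graceful-shutdown-limit"]


def stop_and_wait(proc: subprocess.Popen) -> int:
    """Send SIGINT, then block until the process has exited.

    Returns the process exit code.
    """
    proc.send_signal(signal.SIGINT)
    return proc.wait()


# Usage (not executed here; "vector.yaml" is a placeholder path):
#   proc = subprocess.Popen(vector_command("vector.yaml"))
#   ... feed the pipeline until the file reaches EOF ...
#   stop_and_wait(proc)
#   ... start the analysis against TimescaleDB ...
```

Process exit is only a trustworthy "done" signal if every sink actually drains its in-flight data during graceful shutdown.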

Answer selected by pront

pront Jun 11, 2025
Maintainer

I believe the setting Jesse shared above, along with https://vector.dev/docs/reference/cli/#vector_graceful_shutdown_limit_secs, answers this question. But feel free to follow up with further questions.

Ichmed Jun 11, 2025

The way I understand it, the timeout mechanism is designed in a way that would work for our use case, but it's not correctly implemented in the postgres sink (I have not tested it with other sinks). There doesn't seem to be anything in that sink that explicitly deals with the graceful shutdown, so I'm assuming that by default the graceful shutdown is ignored by components and it is up to the author to act on it. If that is the case, maybe it should be enforced externally on components?

In either case, with our setup the console output shows that graceful shutdown has started, but Vector terminates immediately afterwards (the console says it will wait 59s for all components to finish) and we are missing some data in the postgres DB.

pront Jun 11, 2025
Maintainer

That is a good observation. We will need to enhance the postgres sink shutdown, feel free to create an issue for this task.

isbm Jun 11, 2025
Author

@pront well, that's what I was talking about the whole thread. 😆 Now the question is how to do that. As I already wrote above: one way is to create a must-implement method in a trait that would hook into shutdown for each sink/source.

So if a specific sink/source does nothing, that's an issue with that sink/source, but Vector must call that method on shutdown, effectively flushing everything and blocking new incoming data.

Would something like that do?
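
The proposal could look roughly like the sketch below, shown in Python with an abstract base class standing in for a Rust trait. `Component`, `on_shutdown`, `BufferedSink`, and `shutdown_all` are hypothetical names for illustration and do not reflect Vector's actual component interfaces.

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical base type: every source/sink must define a shutdown hook."""

    @abstractmethod
    def on_shutdown(self) -> None:
        """Stop accepting input and flush anything still buffered."""


class BufferedSink(Component):
    """Toy sink that buffers events and writes them out on shutdown."""

    def __init__(self) -> None:
        self.buffer: list[str] = []
        self.written: list[str] = []
        self.accepting = True

    def send(self, event: str) -> bool:
        """Buffer an event; return False once the sink has been shut down."""
        if not self.accepting:
            return False
        self.buffer.append(event)
        return True

    def on_shutdown(self) -> None:
        self.accepting = False            # block new data first
        self.written.extend(self.buffer)  # then flush what is left
        self.buffer.clear()


def shutdown_all(components: list[Component]) -> None:
    """Call the mandatory hook on every component."""
    for component in components:
        component.on_shutdown()
```

A subclass that omits `on_shutdown` cannot be instantiated, so a component that does nothing on shutdown has to say so explicitly with an empty method.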
