Vector's memory usage is erratic #22892

abhisgup · 2025-04-16T09:41:05Z

abhisgup
Apr 16, 2025

Today I conducted two runs (each lasting around an hour) to ingest logs into a deployment of vector running on AWS. The rate at which the logs were generated for both the runs was the same.

Vector was reading the SQS event notifications coming from the input S3 bucket to perform some transformations on the logs and output them to a output S3 bucket.

I noticed that Vector was consuming around 3 GB of memory for the first 40 minutes of the first run and then the memory used shot up to around 25 GB. The entire duration of the second run used around 25 GB.

Between the two runs, the docker container of vector was restarted.

timberio/vector:0.46.1-debian is the docker image I was using.

Few days ago I had used timberio/vector:0.46.0.custom.fba8185-debian docker image and conducted multiple runs (with the same log generation rate and duration) where I observed that vector would be producing the output appropriately (there was no significant lag) while consuming 3 GB of memory for quite some time and then the memory requirment would suddenly shoot up to 20-30 GB.

Any idea why the memory usage of vector might be so erratic?

pront · 2025-04-16T13:51:59Z

pront
Apr 16, 2025
Maintainer

Hi @abhisgup,

In your S3-to-S3 pipeline using Vector (timberio/vector:0.46.1-debian), seeing memory rise from ~3 GB
to 25–30 GB can be expected under specific conditions — but it may also indicate configuration issues or a bug.

Just because Vector's process memory stays high doesn't always mean it's actively using it.
Rust and many memory allocators may not return freed heap memory to the OS immediately.
The memory might be marked as free internally (and reusable by Vector), but still appear in OS tools
like top or htop as "used."

For complex topics like this, I usually have a notebook with multiple timeseries to help me get an overview at an abstract level:
https://vector.dev/guides/developer/debugging/#visualizing-and-querying-internal-metrics

I will try to describe all the important Vector themes that affect memory usage.

🔍 What affects Vector's memory usage?

1. In-Memory Buffers

Vector uses internal buffers between components (e.g., source ➝ transform ➝ sink).
These are usually bounded by event count, not bytes (e.g., 100 events).
If events are large (e.g., multi-KB JSON logs), memory usage scales with them.

2. Batching in Sinks

Sinks accumulate events into batches (based on count, size, or timeout).
Vector holds entire batches in memory before flushing them to S3.
Default: batch.max_bytes = 10MB, batch.max_events = 500, timeout_secs = 5.

Example: A 50MB batch will use ~50MB (very rough estimation) memory until flushed.

3. High-Cardinality Output Keys

If your S3 key_prefix includes high-entropy fields (e.g., time with seconds), you create many
active batches in memory.
Each unique key path may have its own batch buffer.
This can blow up memory even with small per-batch sizes.

4. Transform Behavior

Stateless transforms (like remap, filter) are lightweight and stream-friendly.
Stateful transforms (like aggregate, dedupe) hold data in memory and can grow depending depending on the config settings.
Long aggregation intervals can build up memory.
Expensive Lua transforms can also build up a lot of memory.

5. Concurrency in S3 Source

Vector processes multiple S3 files concurrently.
SQS settings and CPU count affect parallel fetches.
Large files + high concurrency -> more memory used.

🧠 When Is a Memory Spike "Normal"?

A memory spike can be expected if:

There's a burst of incoming S3 objects via SQS.
Batch flush is delayed (e.g., timeout not hit yet).
Large files are being ingested.
Batching or transformation config enables large memory usage.

Memory should stabilize after the burst.

🚨 When Should You Worry?

If you consistently hit 25–30 GB even with light input.
If you don’t have disk buffers enabled and batch limits are high.
If high memory correlates with S3 sink output stuck or slow.

✅ What You Can Do

Inspect your batch settings:
- e.g., batch.max_bytes = "5MB", max_events = 500.
Avoid over-partitioning with key_prefix (e.g., %S in paths).
Use buffer.type = "disk" to reduce memory footprint and offload to the disk.
Check for stateful transforms that might hold too much in memory.
Upgrade Vector — memory management and S3 sink have seen fixes in later versions.

📚 References

0 replies

abhisgup · 2025-04-17T04:24:14Z

abhisgup
Apr 17, 2025
Author

@pront Thanks a lot for responding. I have gone through the links you shared and will be trying the suggestions.

sources:
  s3_input_log_source:
    type: aws_s3
    region: us-west-2
    sqs:
      delete_message: true
      queue_url: "redacted_sqs_url"

transforms:
  split_single_line_logs_into_multiple_lines:
    type: remap
    inputs:
      - s3_input_log_source
    source: |
      start_symbol = "{\"user_id\""
      end_symbol = "}"
      records = split!(.message, end_symbol + start_symbol)
      new_records = []
      start_idx = 0
      end_idx = length(records) - 1
      for_each(records) -> |index, record| {
        if index == start_idx {
          record = record + end_symbol
        } else if index == end_idx {
          record = start_symbol + record
        } else {
          record = start_symbol + record + end_symbol
        }
        new_records = push(new_records, record)
      }
      .message = new_records
      . = unnest(.message)

  drop_invalid_logs:
    type: filter
    inputs:
      - split_single_line_logs_into_multiple_lines
    condition: |
      !match_any!(.message, [
          r'/url\"\:\{\"original\"\:\"\/\/\"\,/',
          r'/url\"\:\{\"original\"\:\"\/403\.shtml\"\,/'
        ]
      )

  drops_fields_and_format_as_key_equals_value:
    type: remap
    inputs:
        - drop_invalid_logs
    source: |
      field_ordering = ["timestamp", "user_id", "product_type", "server_ip", "server_response_time", "user_agent", "method", "mime_type", "bytes", "status_code", "version", "scheme", "domain", "uri"]
      .message = parse_json!(.message)
      .message = encode_key_value!({
        "timestamp": .message.event_timestamp,
        "user_id": .message.user_id,
        "product_type": .message.product_type,
        "server_ip": .message.server_ip,
        "server_response_time": .message.server_response_time,
        "user_agent": .message.user_agent.original,
        "method": .message.http.request.method,
        "mime_type": .message.http.request.mime_type,
        "bytes": .message.http.response.bytes,
        "status_code": .message.http.response.status_code,
        "version": .message.http.version,
        "scheme": .message.url.scheme,
        "domain": .message.url.domain,
        "uri": .message.url.original,
      }, field_ordering)

sinks:
  s3_output_log_sink:
    type: aws_s3
    inputs:
      - drops_fields_and_format_as_key_equals_value
    encoding:
      codec: text
    bucket: 'redacted_s3_bucket'
    region: us-west-2
    key_prefix: output/log-
    compression: none
    batch:
      timeout_secs: 5

Each input file has JSON formatted log records without a newline separator because of which I had to write split_single_line_logs_into_multiple_lines transformation to split the JSON log records into their own events. Post this there are a few simple remap transformations.

Vector ingested around 9 GB of gzipped logs over a period of 1 hour in all the runs, with the log generation (in the input S3 bucket) rate being constant. 2.56 MB of compressed logs per second 9*1024/(60*60). https://aws.amazon.com/ec2/instance-types/c6a/#:~:text=c6a.16xlarge,20 is a powerful machine running vector for these test runs. The CPU and memory are practically unconstrainted for this workload as I wanted to figure out the max resources can vector take up. With a constant rate of incoming logs, vector was able to produce the output at a constant rate using 3 GB of memory for quite sometime but then memory usage suddenly shot up. How can I figure out why this happened and is there a mechanism using which I can ensure a more predictable memory usage profile?

If vector always took 20-30 GB for this workload, I could have done the capacity planning taking that into account. But vector has produced the output without any extra delay even when using 3 GB of memory for quite some time (40 minutes out of 1 hour in the first run). That makes it difficult to do capacity planning - I can still plan as per the highest memory usage but that might lead me to overprovision the machines which isn't very economical. High memory usage is less of a problem than the variance in memory usage for the same workload.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Vector's memory usage is erratic #22892

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Vector's memory usage is erratic #22892

Uh oh!

abhisgup Apr 16, 2025

Replies: 2 comments

Uh oh!

Uh oh!

pront Apr 16, 2025 Maintainer

🔍 What affects Vector's memory usage?

1. In-Memory Buffers

2. Batching in Sinks

3. High-Cardinality Output Keys

4. Transform Behavior

5. Concurrency in S3 Source

🧠 When Is a Memory Spike "Normal"?

🚨 When Should You Worry?

✅ What You Can Do

📚 References

Uh oh!

Uh oh!

abhisgup Apr 17, 2025 Author

abhisgup
Apr 16, 2025

pront
Apr 16, 2025
Maintainer

abhisgup
Apr 17, 2025
Author