A large number of files leads to a slow throughput rate to Kafka #15467
Unanswered · WilliamEricCheung asked this question in Q&A
Replies: 1 comment
- Since no one has commented on my post, we tried modifying our scripts to keep only about 100 logs within 4 hours. Now Vector runs smoothly, but this is not the solution we want for our business.
- #15276 (comment)
A note for the community
Problem
Background

Hello, we are migrating service logs through a vector (1 host) -- kafka (3 hosts) -- logstash (3 hosts) -- opensearch (1 kibana host) pipeline. Our services write about 25 gzip files per hour into a certain directory, each about 900 MB in size, and we have a script that removes logs older than one day from this directory to reduce storage pressure.
When we started Vector at Nov 16, 2022 @ 9:00:00, throughput was steady until 12:00:00; during those 3 hours we got about 450M index hits per hour in Kibana. However, the hit rate dropped rapidly after 12 PM: we got only 100M at 3 PM, 80M at 6 PM, 70M at 9 PM, 60M at 0 AM on Nov 17, 2022, 50M at 3 AM, and 40M at 6 AM.
Our thinking
Our services write 25 files every hour, so over a whole day we accumulate 24 * 25 = 600 files. Even though we use a script to keep only one day of logs, the total file size is still very heavy: 600 * 900 MB = 540 GB.
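If the growing file count is indeed the bottleneck, note that (as far as I understand the file source) Vector keeps re-globbing the include pattern and tracking a checkpoint for every matching file, so a directory with 600 resident files is a much heavier scan than the ~25 files present in the first hour. Below is a hedged sketch of file-source options that are commonly used to limit this work; the option names are taken from the file source documentation and should be verified against 0.23 before use, and the values are placeholders.

```toml
# Hypothetical tuning sketch for a directory that accumulates ~600 files per day.
# Verify these option names against the Vector 0.23 file source docs before use.

[sources.request_logs]
type    = "file"
include = ["/local/vectorLog/requests.log.*.gz"]
oldest_first             = true    # drain older files before newer ones
ignore_older_secs        = 86400   # skip files not modified within the last day (placeholder value)
glob_minimum_cooldown_ms = 60000   # re-scan the directory at most once per minute (placeholder value)
```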
Question
Basic Information
Source format: gzip files
Files directory: /local/vectorLog
File name format: /local/vectorLog/requests.log.yyyy-MM-dd-HH.gz
File size: ~900 MB
Configuration
Version
vector 0.23.3 (x86_64-unknown-linux-gnu af8c9e1 2022-08-10)
Debug Output
Example Data
No response
Additional Context
No response
References
No response