[indexer-alt] Run one pipeline at a time #19990

lxfind · 2024-10-23T22:00:13Z

Description

This should be the default mode for running indexer pipelines, for the following reasons:

It is naturally scalable, since we could run each pipeline on a different host with operational ease.
We should measure throughput using this as a starting point, instead of starting with the full set and dividing as throughput drops. This represents the most scalable setup, and the best number will give us confidence in one go.
It allows easy extension of custom indexers.

Test plan

CI

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

vercel · 2024-10-23T22:00:17Z

The latest updates on your projects.

3 Skipped Deployments

Name	Status	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Oct 23, 2024 10:00pm
sui-kiosk	⬜️ Ignored (Inspect)	Oct 23, 2024 10:00pm
sui-typescript-docs	⬜️ Ignored (Inspect)	Oct 23, 2024 10:00pm

amnn · 2024-10-24T07:35:00Z

Copying my reply in Slack for this -- it seems premature to force the binary to only allow single pipelines for now, but maybe I'm missing some context. Let's discuss in the stand-up:

This sounds like a reasonable hypothesis but do we have enough data today to bake this in? To me it seems not, because:

Running pipelines separately means each one has to download checkpoint data itself, which means more network traffic and higher costs for us.

Pipelines are never truly fault separated because the reader requires that they are all within a certain range of each other.

Certain pipelines are necessarily coupled, like the live objects set and objects history (this is why I introduced the regulator component to the ingestion service, which stops it from running too far ahead of any one pipeline).
The benefit of just adding a new pipeline and letting it run alone from genesis exists today with the --pipeline flag.
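
To make the coupling point concrete, here is a minimal sketch of the regulator idea: ingestion may only fetch a checkpoint if doing so keeps it within a fixed lead of the slowest pipeline. This is an illustration only, not the actual indexer-alt code; the `Regulator` type, its methods, the `max_lead` parameter, and the pipeline names are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical regulator (not the real indexer-alt type): gates ingestion
/// on the progress of the slowest pipeline.
struct Regulator {
    /// Maximum number of checkpoints ingestion may run ahead of the slowest pipeline.
    max_lead: u64,
    /// Highest checkpoint each pipeline has reported as processed.
    watermarks: HashMap<String, u64>,
}

impl Regulator {
    fn new(max_lead: u64) -> Self {
        Self {
            max_lead,
            watermarks: HashMap::new(),
        }
    }

    /// Record that `pipeline` has processed everything up to and including
    /// `checkpoint`. Watermarks never move backwards.
    fn report(&mut self, pipeline: &str, checkpoint: u64) {
        let watermark = self.watermarks.entry(pipeline.to_string()).or_insert(0);
        *watermark = (*watermark).max(checkpoint);
    }

    /// Ingestion may fetch checkpoint `next` only if it is at most `max_lead`
    /// past the slowest pipeline that has reported so far.
    fn may_ingest(&self, next: u64) -> bool {
        match self.watermarks.values().min() {
            Some(&slowest) => next <= slowest.saturating_add(self.max_lead),
            // No pipeline has reported yet, so there is nothing to wait for.
            None => true,
        }
    }
}
```

With `max_lead` set to 100 and two pipelines reporting checkpoints 500 and 450, `may_ingest(550)` is allowed while `may_ingest(551)` is not: the slower pipeline holds back ingestion for both, which is the sense in which the pipelines are not fully fault separated.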

Let's try it as an experiment -- we have all the machinery to do that and in fact it is one of the experiments in the list already, but I don't see the need to bake it in until we have the results.

For example, it seems bad to have an architecture that forces us to write a new pipeline that needs to download a day's worth of checkpoint data to write a single row into kv_epoch_start (...and same again for kv_epoch_end, kv_protocol_configs, kv_feature_flags, etc).
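
The download cost in that example can be put into rough numbers. The sketch below compares how many checkpoint fetches are needed when pipelines share one ingestion service against when each pipeline runs alone and fetches for itself. The function name and the checkpoint count used in the usage note are illustrative assumptions, not measurements.

```rust
/// Hypothetical helper: total checkpoint fetches needed to index `checkpoints`
/// checkpoints with `pipelines` pipelines.
///
/// With shared ingestion, each checkpoint is fetched once and fanned out to
/// every pipeline. Without it, every pipeline fetches every checkpoint itself.
fn checkpoint_fetches(checkpoints: u64, pipelines: u64, shared_ingestion: bool) -> u64 {
    if pipelines == 0 {
        // Nothing to index, so nothing to fetch.
        return 0;
    }
    if shared_ingestion {
        checkpoints
    } else {
        checkpoints.saturating_mul(pipelines)
    }
}
```

Assuming, purely for illustration, 100,000 checkpoints per epoch, the four epoch-level tables named above (kv_epoch_start, kv_epoch_end, kv_protocol_configs, kv_feature_flags) would need 400,000 fetches per epoch if each ran as its own pipeline, against 100,000 if they shared ingestion, and in either case each pipeline writes on the order of one row per epoch.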

We have multiple bottlenecks to navigate (checkpoint download rate and cost, indexer CPU, indexer egress, database ingress, database writes in rows and bytes, database reads). Our aim is to saturate all of these, and I'd like to keep our variables open to do that until we know we can't or shouldn't mess with one of those variables.

[indexer-alt] Run one pipeline at a time

dfeb591

lxfind requested a review from amnn October 23, 2024 22:00

lxfind closed this Dec 11, 2024

lxfind deleted the indexer-alt-one-pipeline-at-a-time branch December 11, 2024 07:06
