
RPC timeout issue 1.15.x #31718


Open
JSRossiter opened this issue Apr 25, 2025 · 18 comments

Comments

@JSRossiter

System information

Geth version: 1.15.10 (full node)
CL client & version: lighthouse v7.0
OS & Version: Linux

Expected behaviour

Similar RPC performance to 1.14

Actual behaviour

After upgrading from 1.14.13 to 1.15.9 (then 1.15.10), the majority of my app's RPC requests started timing out.
My app subscribes to pending transactions and then, for each received pending transaction hash, calls eth_getTransactionByHash.
When I run this app it also causes other clients to time out, including lighthouse and eth.syncing in geth attach.
If I stop the pending transaction subscription, the timeouts persist for other clients until I restart geth.
CPU usage is elevated (geth uses 110-150% instead of 10-50%) while the issue is occurring.
Nothing out of the ordinary in the geth logs; if an eth_call request times out it is logged.
Lighthouse logs "Error during execution engine upcheck".

I don't think I have any abnormal config; I have TransactionHistory = 0. Are there any other new options I've missed?
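For context, the flag-form equivalent of that setting (assuming the TransactionHistory config field still corresponds to the --history.transactions CLI flag) would be roughly:

geth --history.transactions 0

i.e. keep the transaction index for the entire chain rather than only for recent blocks.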

@JSRossiter
Author

Updating this to note that the behaviour also occurs under very light RPC load; it just doesn't happen as immediately as it does when subscribing to pending transactions. I also set up the Prometheus metrics dashboard; these charts show the issue starting at 14:21.
  • eth_syncing calls start taking 10s+, up from ~150ms
  • Various RPC QPS increase - this might be an application-side thing (repeating failures), but they only reach 8-9 req/s combined (chart shows 5-min avg)
  • CPU usage +100%
  • Goroutine count increases steadily
  • Disk read 180 MB/s -> 1.8 GB/s

Block processing storage read/update also get slower/spikier, but I'd assume that's caused by resource constraints.
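For anyone wanting to reproduce these charts: they come from geth's built-in metrics. Assuming default ports, something like

geth --metrics --metrics.addr 127.0.0.1 --metrics.port 6060

exposes a Prometheus-format endpoint at http://127.0.0.1:6060/debug/metrics/prometheus for the dashboard to scrape.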

[metrics dashboard charts showing the issue starting at 14:21]

@s1na
Contributor

s1na commented Apr 28, 2025

We believe there is a goroutine leak somewhere in the transaction indexer which is causing it. More details are needed to find the root of the leak. Is it possible for you to enable --pprof --pprof.blockprofilerate 1 and then run the following command when the node is struggling:

curl http://localhost:6060/debug/pprof/block > block.prof
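A goroutine dump captured at the same time would also help; the standard Go pprof endpoints should be exposed on the same server once --pprof is enabled (assuming the default port 6060):

curl 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutine.prof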

@rjl493456442
Member

When I run this app it also causes other clients to time out, including lighthouse and eth.syncing in geth attach.
If I stop the pending transaction subscription, the timeouts persist for other clients until I restart geth.

Do you mean that all the subsequent RPCs fail with a timeout error?

Can you reproduce the issue? If so, please dump out the Go stacks and the block/mutex profiles:

  • geth attach --exec 'debug.stacks()' ./geth.ipc > stack.debug
  • geth attach --exec 'debug.blockProfile("block.profile", 10)' ./geth.ipc
  • geth attach --exec 'debug.mutexProfile("mutex.profile", 10)' ./geth.ipc
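Once captured, the profiles should be viewable locally with the standard Go tooling, e.g. go tool pprof -top block.profile, but just attaching the raw files here is fine.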

@JSRossiter
Author

I'm not sure of the exact breakdown of the subsequent failures, but I think eth.syncing always times out, and other requests time out frequently but less than 100% of the time.

Here's the requested output:

https://gist.github.com/JSRossiter/3d1b675df57d85917e586ab9f380b842

profiles.zip

@rjl493456442
Member

There are 120 goroutines waiting for txIndex progress:

github.com/ethereum/go-ethereum/core.(*txIndexer).txIndexProgress(...)
	github.com/ethereum/go-ethereum/core/txindexer.go:307
github.com/ethereum/go-ethereum/core.(*BlockChain).TxIndexProgress(...)
	github.com/ethereum/go-ethereum/core/blockchain_reader.go:407

And one request is being served which blocks all the subsequent ones.

goroutine 4271 [syscall]:
syscall.Syscall6(0x11, 0xe7, 0x7ff88ad57370, 0x1a86b91a, 0xc7b8aa, 0x0, 0x0)
	syscall/syscall_linux.go:95 +0x39
syscall.pread(0xc000bc7dc0?, {0x7ff88ad57370?, 0x457949?, 0xc00ac9ad58?}, 0x47b43f?)
	syscall/zsyscall_linux_amd64.go:1229 +0x45
syscall.Pread(...)
	syscall/syscall_unix.go:226
internal/poll.(*FD).Pread(0x0?, {0x7ff88ad57370, 0xc046d3b008?, 0x1a86b91a}, 0xc7b8aa)
	internal/poll/fd_unix.go:191 +0xee
os.(*File).pread(...)
	os/file_posix.go:38
os.(*File).ReadAt(0xc0093cf758, {0x7ff88ad57370?, 0x47f48c?, 0xc00ac9ae08?}, 0x4efb93?)
	os/file.go:142 +0x14c
github.com/cockroachdb/pebble/objstorage/objstorageprovider.(*vfsReadHandle).ReadAt(0xc060c0e8b0?, {0x2130690?, 0x319b0e0?}, {0x7ff88ad57370?, 0x1a86b91a?, 0x1a86b91a?}, 0xc7b8aa?)
	github.com/cockroachdb/pebble@v1.1.2/objstorage/objstorageprovider/vfs_readable.go:155 +0xc9
github.com/cockroachdb/pebble/sstable.(*Reader).readBlock(0xc021654008, {0x2130690, 0x319b0e0}, {0x11?, 0x11?}, 0x0, {0x21323a0, 0xc060c0e8b0}, 0x0, 0x0)
	github.com/cockroachdb/pebble@v1.1.2/sstable/reader.go:565 +0x310
github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock(0xc060c0e588, 0x1)
	github.com/cockroachdb/pebble@v1.1.2/sstable/reader_iter_single_lvl.go:398 +0x297
github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekGEHelper(0xc060c0e588, {0x2fa9d20, 0x14, 0x14}, 0x7ff8f71b88dc?, 0x0)
	github.com/cockroachdb/pebble@v1.1.2/sstable/reader_iter_single_lvl.go:712 +0x3cc
github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekPrefixGE(0xc060c0e588?, {0x2fa9d20?, 0x14?, 0x14?}, {0x2fa9d20?, 0x1?, 0x2fa9d20?}, 0x0?, 0x0?)
	github.com/cockroachdb/pebble@v1.1.2/sstable/reader_iter_single_lvl.go:828 +0x33f
github.com/cockroachdb/pebble/sstable.(*twoLevelIterator).SeekPrefixGE(0xc060c0e588, {0x2fa9d20, 0x14, 0x14}, {0x2fa9d20?, 0x38?, 0x1a94300?}, 0x1?)
	github.com/cockroachdb/pebble@v1.1.2/sstable/reader_iter_two_lvl.go:559 +0x765
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc04116e600, {0x2fa9d20, 0x14, 0x14}, {0x2fa9d20, 0x14, 0x14}, 0x0)
	github.com/cockroachdb/pebble@v1.1.2/level_iter.go:770 +0x116
github.com/cockroachdb/pebble.(*getIter).Next(0xc04116e598)
	github.com/cockroachdb/pebble@v1.1.2/get_iter.go:207 +0xc12
github.com/cockroachdb/pebble.(*getIter).First(0xc00ac9b6c0?)
	github.com/cockroachdb/pebble@v1.1.2/get_iter.go:65 +0x13
github.com/cockroachdb/pebble.(*Iterator).iterFirstWithinBounds(0xc04116e008)
	github.com/cockroachdb/pebble@v1.1.2/iterator.go:2024 +0xcf
github.com/cockroachdb/pebble.(*Iterator).First(0xc04116e008)
	github.com/cockroachdb/pebble@v1.1.2/iterator.go:1622 +0xe5
github.com/cockroachdb/pebble.(*DB).getInternal(0xc0001e1208, {0x2fa9d20, 0x14, 0x14}, 0x0, 0x0)
	github.com/cockroachdb/pebble@v1.1.2/db.go:592 +0x3ad
github.com/cockroachdb/pebble.(*DB).Get(...)
	github.com/cockroachdb/pebble@v1.1.2/db.go:520
github.com/ethereum/go-ethereum/ethdb/pebble.(*Database).Get(0x10001004813b2?, {0x2fa9d20?, 0x3?, 0x4?})
	github.com/ethereum/go-ethereum/ethdb/pebble/pebble.go:309 +0xfa
github.com/ethereum/go-ethereum/core/rawdb.ReadTxIndexTail({0x7ff96ebec280?, 0xc00a7562b8?})
	github.com/ethereum/go-ethereum/core/rawdb/accessors_chain.go:264 +0x39
github.com/ethereum/go-ethereum/core.(*txIndexer).report(0xc00ac9bf20?, 0x1555296)
	github.com/ethereum/go-ethereum/core/txindexer.go:287 +0x7c
github.com/ethereum/go-ethereum/core.(*txIndexer).loop(0xc0519e12c0, 0xc0477ad008)
	github.com/ethereum/go-ethereum/core/txindexer.go:251 +0x3e5
created by github.com/ethereum/go-ethereum/core.newTxIndexer in goroutine 1
	github.com/ethereum/go-ethereum/core/txindexer.go:70 +0x146

The blocking call is only a single database read, which should be fast enough.
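For reference, the count above comes from the stack dump; on a fresh dump, something like grep -c 'txIndexProgress' stack.debug should give a rough count of the goroutines stuck at that point.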

@rjl493456442
Member

rjl493456442 commented Apr 29, 2025

I tried to reproduce it locally with the script below, but all the RPC requests were answered quickly, with no timeout issue.

By dumping out the Go stacks, I found no goroutines piled up waiting for txIndex progress.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	ctx := context.Background()
	rpcClient, err := rpc.DialContext(ctx, "ws://localhost:8546")
	if err != nil {
		log.Fatalf("Failed to connect to RPC: %v", err)
	}
	client := ethclient.NewClient(rpcClient)

	// Subscribe to pending transactions
	txHash := make(chan common.Hash)
	sub, err := client.Client().Subscribe(ctx, "eth", txHash, "newPendingTransactions")
	if err != nil {
		log.Fatalf("Failed to subscribe to pending transactions: %v", err)
	}
	fmt.Println("Subscribed to pending transactions...")

	for {
		select {
		case err := <-sub.Err():
			log.Fatalf("Subscription error: %v", err)
		case h := <-txHash:
			// Fetch each announced transaction concurrently.
			go func(txHash common.Hash) {
				tx, pending, err := client.TransactionByHash(ctx, txHash)
				if err == nil {
					fmt.Printf("Transaction found, pending: %t, hash: %x\n", pending, tx.Hash())
				} else {
					fmt.Printf("Transaction not found: %v\n", err)
				}
			}(h)
		}
	}
}
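For reference, the script assumes a node with the WebSocket server enabled, e.g. geth started with --ws --ws.api eth; ws://localhost:8546 is the default WS endpoint.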

@oleksandr-everstake

We had similar problems with our RPC servers.

System information
Geth version: 1.15.9
CL client & version: Prysm v6.0.0
OS & Version: Linux

After updating to version 1.15.9, we received complaints from teams using our RPCs about a large number of timeouts when executing requests.

The request execution time increased significantly compared to previous versions. Below are the statistics from our HAProxy that distributes traffic between the servers.

Most third-party projects using ETH RPC have standard timeout values, so even when a request did complete correctly it took too long, and our clients received errors.

Unfortunately, I don't have statistics on the processing time of specific methods yet (we are currently working on that), but it seems there were problems with all the methods being called.

Only after rolling back to version 1.15.5 did we manage to stabilize the situation and return to normal operation.

Currently, we are still running version 1.15.5 and have not yet updated to 1.15.10.

[HAProxy request time statistics]

@rjl493456442
Member

@oleksandr-everstake

What kind of RPC requests are served by your node? Specifically, is eth_getLogs also served?

Can you please try v1.15.10 again with the --history.logs.disable flag?

@oleksandr-everstake

Yes, eth_getLogs requests are served, and as far as I remember from the logs, that type of request had the biggest problems; their processing time increased significantly compared to previous versions.
I will follow your recommendation and try 1.15.10.

@JSRossiter
Author

Thanks for noting that 1.15.5 works @oleksandr-everstake - I wasn't sure I had any options that didn't involve re-syncing. This version seems to be working for me too with a pending transaction subscription running.
I'm also seeing a significant degradation in latency for eth_getLogs and eth_syncing calls on 1.15.10 vs 1.15.5. The charts below show me downgrading geth to 1.15.5 and the corresponding decrease in latency, under similar load.

[RPC latency charts: before and after downgrading to 1.15.5]

@s1na
Contributor

s1na commented Apr 29, 2025

It's unclear at this point whether the GetTransaction API is causing this slowdown and making the node struggle under the load of other methods (getLogs and call) as well, or whether those methods are themselves experiencing goroutine leaks similar to GetTransaction. If anyone is seeing the slowdown while performing no GetTransaction calls, only getLogs and eth_call, please let us know.

@oleksandr-everstake

I tried updating Geth to version 1.15.10 with the --history.logs.disable option. I got an error in Geth and the client is now stuck on a block. I need to resynchronize now for normal operation.

ERROR[04-30|09:46:49.666] Beacon backfilling failed err="retrieved hash chain is invalid: missing parent"

@rjl493456442
Member

@oleksandr-everstake did you have an unclean shutdown?

@oleksandr-everstake

We use a docker-compose file to run our CL and EL. The version was changed, then we ran docker-compose down && docker-compose up -d.

@oleksandr-everstake

I launched one of the servers on version 1.15.10. We are still watching how it works. If there are any problems, we will come back with questions.

@JSRossiter
Author

Hate to be that guy, but what's the status on this? Prague is in <5 days and I'd like to know if I need to switch to a different EL client.
Is forking 1.15.5 and just applying the mainnet activation block an option? It's difficult for me to interpret which other changes are important.

@s1na
Contributor

s1na commented May 2, 2025

@JSRossiter you are very right to ask this. We have a working fix: #31752. It is being reviewed right now and will likely be merged on Monday.

@JSRossiter
Author

Great, appreciate your work and the update.
