RPC timeout issue 1.15.x #31718
Updating this to note that the behaviour also occurs under very light RPC load; it just doesn't happen immediately the way it does when subscribing to pending transactions. I also set up the Prometheus metrics dashboard, and the charts showed the issue starting at 14:21. Block processing storage reads/updates also get slower and spikier, but I'd assume that's caused by resource constraints.
We believe there is a goroutine leak somewhere in the transaction indexer which is causing it. More details are needed to find the root of the leak. Is it possible for you to enable
Do you mean the subsequent RPCs all fail with a timeout error? Can you reproduce the issue? If so, please dump out the go stacks and the block profile.
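In case it helps others provide the same data, here is a minimal sketch (my own, not from the thread) of pulling the goroutine dump from geth's pprof endpoint, assuming geth was started with --pprof (which serves net/http/pprof on 127.0.0.1:6060 by default; adjust --pprof.addr / --pprof.port if you changed them):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// debug=2 returns every goroutine with its full stack trace as plain text.
	// Assumes geth is running with --pprof on the default 127.0.0.1:6060.
	resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("Failed to fetch goroutine profile: %v", err)
	}
	defer resp.Body.Close()

	dump, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("Failed to read response: %v", err)
	}
	fmt.Print(string(dump))
}
```

The block profile is exposed at /debug/pprof/block on the same server; if I remember correctly, geth only records it when started with a non-zero --pprof.blockprofilerate.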
Not sure of the exact breakdown of subsequent failures, but I think eth.syncing always times out and other requests do frequently, though less than 100% of the time. Here's the requested output: https://gist.github.com/JSRossiter/3d1b675df57d85917e586ab9f380b842
There are 120 goroutines waiting for txIndex progress, and one request being served which blocks all the subsequent ones. The blocking line is only a single database read, which should be fast enough.
I tried to reproduce it locally with the script below, but all the RPC requests are answered quickly, and dumping out the go stacks shows no goroutines piling up waiting for txIndex progress.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	ctx := context.Background()
	rpcClient, err := rpc.DialContext(ctx, "ws://localhost:8546")
	if err != nil {
		log.Fatalf("Failed to connect to RPC: %v", err)
	}
	client := ethclient.NewClient(rpcClient)

	// Subscribe to pending transactions
	txHash := make(chan common.Hash)
	sub, err := client.Client().Subscribe(ctx, "eth", txHash, "newPendingTransactions")
	if err != nil {
		log.Fatalf("Failed to subscribe to pending transactions: %v", err)
	}
	fmt.Println("Subscribed to pending transactions...")

	// For every announced hash, fetch the transaction concurrently.
	for {
		select {
		case err := <-sub.Err():
			log.Fatalf("Subscription error: %v", err)
		case h := <-txHash:
			go func(txHash common.Hash) {
				tx, pending, err := client.TransactionByHash(ctx, txHash)
				if err == nil {
					fmt.Printf("Transaction found, pending: %t, hash: %x\n", pending, tx.Hash())
				} else {
					fmt.Printf("Transaction not found: %v\n", err)
				}
			}(h)
		}
	}
}
```
We had similar problems with our RPC servers. After updating to version 1.15.9, we received complaints from teams using our RPCs about the large number of timeouts on their requests; request execution time increased significantly compared to previous versions. We also have statistics from our HAProxy, which distributes traffic between the servers. Most third-party projects that need ETH RPC use standard timeout values, so even when a request was eventually executed correctly it took too long and our clients received errors. Unfortunately, I don't have statistics on the processing time of specific methods yet (we are working on it), but it seems there were problems with all the methods being called. Only after rolling back to version 1.15.5 did we manage to stabilize the situation and return to normal operation. We are currently still running 1.15.5 and have not yet updated to 1.15.10.
What kind of RPC requests are served by your node? Specifically, is eth_getLogs among them? Can you please try v1.15.10 again with --history.logs.disable?
Yes, eth_getLogs requests are served, and as far as I remember from the logs, this type of request had the biggest problems. Their processing time increased significantly compared to previous versions.
Thanks for noting that 1.15.5 works, @oleksandr-everstake - I wasn't sure if I had any options that didn't involve re-syncing. This version seems to be working for me too, with a pending transaction subscription running.
It's unclear at this point whether the GetTransaction API is causing this slowdown and making the node struggle under the load of other methods (getLogs and call) as well, or whether those methods are themselves experiencing goroutine leaks similar to GetTransaction. If anyone is not calling GetTransaction at all but does use getLogs and eth_call and still experiences the slowdown, please let us know.
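If it helps isolate this, below is a minimal sketch (the HTTP endpoint, the block range, and the WETH balanceOf call are placeholder choices on my part) that exercises eth_getLogs and eth_call in a loop with per-request timings, and no eth_getTransactionByHash traffic at all:

```go
package main

import (
	"context"
	"log"
	"math/big"
	"time"

	ethereum "github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	ctx := context.Background()
	client, err := ethclient.Dial("http://localhost:8545") // assumed local HTTP endpoint
	if err != nil {
		log.Fatalf("Failed to connect: %v", err)
	}

	head, err := client.BlockNumber(ctx)
	if err != nil {
		log.Fatalf("Failed to fetch head block number: %v", err)
	}
	from := head
	if head > 10 {
		from = head - 10
	}

	for {
		// eth_getLogs over a small recent range.
		start := time.Now()
		query := ethereum.FilterQuery{
			FromBlock: new(big.Int).SetUint64(from),
			ToBlock:   new(big.Int).SetUint64(head),
		}
		logs, err := client.FilterLogs(ctx, query)
		log.Printf("eth_getLogs: %d logs, err=%v, took %s", len(logs), err, time.Since(start))

		// eth_call against an arbitrary contract: WETH balanceOf(zero address),
		// used purely as a cheap read-only call.
		start = time.Now()
		weth := common.HexToAddress("0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2")
		data := common.FromHex("0x70a08231") // balanceOf(address) selector
		data = append(data, common.LeftPadBytes(common.Address{}.Bytes(), 32)...)
		_, err = client.CallContract(ctx, ethereum.CallMsg{To: &weth, Data: data}, nil)
		log.Printf("eth_call: err=%v, took %s", err, time.Since(start))

		time.Sleep(time.Second)
	}
}
```

Comparing the per-request timings on 1.15.9/1.15.10 against 1.15.5 should show whether these methods degrade on their own or only alongside GetTransaction load.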
I tried to update Geth to version 1.15.10 with the option --history.logs.disable, but I got an error in Geth and the client is now stuck on a block. I will need to resynchronize to get back to normal operation.

ERROR[04-30|09:46:49.666] Beacon backfilling failed err="retrieved hash chain is invalid: missing parent"
@oleksandr-everstake did you have an unclean shutdown?
We use a docker-compose file to run our CL and EL. The version was changed, and then the containers were restarted.
I did launch one of the servers on version 1.15.10. We are still watching how it works. If there are any problems, we will come back with questions.
Hate to be that guy, but what's the status on this? Prague is less than 5 days away and I'd like to know if I need to switch to a different EL client.
@JSRossiter you are very right to ask this. We have a working fix: #31752. It is being reviewed right now and will likely be merged on Monday.
Great, appreciate your work and the update.
System information
Geth version: 1.15.10 (full node)
CL client & version: lighthouse v7.0
OS & Version: Linux
Expected behaviour
Similar RPC performance to 1.14
Actual behaviour
After upgrading from 1.14.13 to 1.15.9 (then 1.15.10), the majority of my app's RPC requests started timing out.
My app subscribes to pending transactions and then for each received pending transaction hash calls eth_getTransactionByHash.
When I run this app it also causes other clients to time out, including lighthouse and eth.syncing in geth attach. If I stop the pending transaction subscription, the timeouts persist for other clients until I restart geth.
CPU usage is elevated (geth uses 110-150% instead of 10-50%) while the issue is occurring.
Nothing out of the ordinary in the geth logs; if an eth_call request times out, it is logged.
Lighthouse logs "Error during execution engine upcheck"
I don't think I have any abnormal config; I have TransactionHistory = 0. Are there any other new options I've missed?