fix: fix batch ordering issue #160


Merged

Conversation

marshallyale
Contributor

@marshallyale marshallyale commented Jan 22, 2025

Fixes issue #75
This is my first open source pull request so I apologize for any formatting issues. Additionally, I don't know the repository as well as others so there may be a better way to implement the fix.

I believe I found the root cause of this. I opened this pull request with a fix, but I'll also copy/paste what I believe is causing the error.

The main issue in the code is inside raw_client.rs inside the recv method implementation (snippet below):

fn recv(
    &self,
    receiver: &Receiver<ChannelMessage>,
    req_id: usize,
) -> Result<serde_json::Value, Error> {
    loop {
        // Try to take the lock on the reader. If we manage to do so, we'll become the reader
        // thread until we get our response
        match self._reader_thread(Some(req_id)) {
            Ok(response) => break Ok(response),
            Err(Error::CouldntLockReader) => {
                match receiver.recv()? {
                    // Received our response, returning it
                    ChannelMessage::Response(received) => break Ok(received),
                    ChannelMessage::WakeUp => {
                        // ...
When recv is first called, _reader_thread runs. Inside _reader_thread, if the request id matches the response id, everything works fine. However, if the request id does not match the response id, we run the following code:

Some(resp_id) => {
    // We have an id, but it's not our response. Notify the thread and
    // move on
    trace!("Reader thread received response for {}", resp_id);
    if let Some(sender) = self.waiting_map.lock()?.remove(&resp_id) {
        sender.send(ChannelMessage::Response(resp))?;
    } else {
        warn!("Missing listener for {}", resp_id);
    }
}

The channel that the response is sent back into is not unique: every request shares the same sender.clone() and receiver. The only validation performed is a check that the request id is still present in self.waiting_map. As a result, the `match receiver.recv()?` above hands the caller whatever response arrives next in the channel, without any check that it matches the caller's request id.

This is fixed by giving every request id its own channel. The fix can be verified with the code @johnzweng used to demonstrate the issue:

use bitcoin::block::Header;
use electrum_client::{Client, ElectrumApi};
use std::process;

fn main() {
    let url = "tcp://exs.dyshek.org:50001";
    let client = Client::new(url).unwrap();

    loop {
        let heights: Vec<u32> = vec![1, 4, 8, 12, 222, 6666];
        let headers: Vec<Header> = client.batch_block_header(heights).unwrap();

        if headers.get(0).unwrap().time != 1231469665 {
            // time of block 1
            println!("error 0");
            process::exit(1);
        }
        if headers.get(1).unwrap().time != 1231470988 {
            // time of block 4
            println!("error 1");
            process::exit(1);
        }
        if headers.get(2).unwrap().time != 1231472743 {
            // time of block 8
            println!("error 2");
            process::exit(1);
        }
        if headers.get(3).unwrap().time != 1231474888 {
            // time of block 12
            println!("error 3");
            process::exit(1);
        }
        if headers.get(4).unwrap().time != 1231770653 {
            // time of block 222
            println!("error 4");
            process::exit(1);
        }
        if headers.get(5).unwrap().time != 1236456633 {
            // time of block 6666
            println!("error 5");
            process::exit(1);
        }
        println!("==========================================\n")
    }
}

If you run this against the original code, it will normally error out within 1-10 cycles. After the fix, it runs indefinitely.
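The unique-channel approach can be sketched with a simplified, self-contained model (hypothetical types: String stands in for serde_json::Value, and WaitingMap is an illustrative stand-in, not the actual raw_client.rs code):

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

// Simplified model of the fix: each pending request id owns its own
// Sender, so the reader thread can only ever deliver a response to
// the caller that registered that id.
struct WaitingMap {
    inner: Mutex<HashMap<usize, Sender<String>>>,
}

impl WaitingMap {
    fn new() -> Self {
        WaitingMap {
            inner: Mutex::new(HashMap::new()),
        }
    }

    // Register a request id and return the unique Receiver for it.
    fn register(&self, req_id: usize) -> Receiver<String> {
        let (tx, rx) = channel();
        self.inner.lock().unwrap().insert(req_id, tx);
        rx
    }

    // Reader side: route a response to the channel registered for its id.
    fn deliver(&self, resp_id: usize, resp: String) {
        if let Some(sender) = self.inner.lock().unwrap().remove(&resp_id) {
            sender.send(resp).unwrap();
        } else {
            eprintln!("Missing listener for {}", resp_id);
        }
    }
}

fn main() {
    let map = WaitingMap::new();
    let rx1 = map.register(1);
    let rx2 = map.register(2);

    // Responses arrive out of order, as JSON-RPC allows...
    map.deliver(2, "resp-2".to_string());
    map.deliver(1, "resp-1".to_string());

    // ...but each caller still receives exactly its own response.
    assert_eq!(rx1.recv().unwrap(), "resp-1");
    assert_eq!(rx2.recv().unwrap(), "resp-2");
    println!("ok");
}
```

Because each Receiver is handed back only to the caller that registered the id, mixed-up deliveries like the one in the bug report become impossible by construction.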

@notmandatory
Member

@marshallyale thanks for identifying and working on this. Please rebase your branch since I just merged #159, which changes this project's MSRV and CI checks. Also, you have a few clippy errors that need to be fixed. We'll need some time to get it reviewed, so please be patient with getting this merged. 🙂

@notmandatory notmandatory added the bug Something isn't working label Jan 23, 2025
@marshallyale marshallyale force-pushed the fix_transaction_ordering branch from 8c98446 to 2d3f4a5 Compare January 24, 2025 20:02
@marshallyale
Contributor Author

No worries, I should have just rebased and pushed this. I'm working off my forked repo currently so there's no rush on getting any changes synced. Let me know if there's anything else I need to change!

@oleonardolima oleonardolima self-requested a review January 27, 2025 14:33
Contributor

@oleonardolima oleonardolima left a comment


Thanks for working on this fix!

I've reviewed the issue and the proposed changes. It does look like a library problem: the client isn't managing and matching context between requests and responses, and the responses to a batch call may arrive out of order (as the JSON-RPC spec allows).

Using an exclusive channel for each request in a batch solves the problem; however, I'm trying to think through any shortcomings of that approach.

@oleonardolima
Contributor

I'm taking another look at this one this week, and I think it's best to come up with two #[test]s: (i) checking/asserting that the request_ids are correct within a batch request, and (ii) that single requests and batch requests don't mix up request_ids either. Having a test running on CI/locally makes it easier to iterate on this solution and other ones.

@ValuedMammal
Contributor

I think the unique channels approach in this PR is a valid solution. Perhaps an alternative would be to change ChannelMessage::Response to carry the response id in a tuple of (usize, Value) and use this when inserting into answers during batch_call. Since answers is an ordered map, we'll still return the responses in the correct order.
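That alternative could be sketched like this (a simplified, hypothetical model: String stands in for serde_json::Value, and collect_batch is an illustrative helper, not a function in this crate):

```rust
use std::collections::BTreeMap;

// The reader tags each response with its request id; batch_call then
// inserts into an ordered map keyed by id, so iterating the map yields
// responses in request order regardless of arrival order.
fn collect_batch(tagged: Vec<(usize, String)>) -> Vec<String> {
    let mut answers: BTreeMap<usize, String> = BTreeMap::new();
    for (id, value) in tagged {
        answers.insert(id, value);
    }
    answers.into_values().collect()
}

fn main() {
    // Responses arrive out of order...
    let arrived = vec![
        (3, "header-3".to_string()),
        (1, "header-1".to_string()),
        (2, "header-2".to_string()),
    ];
    // ...but the ordered map restores request order.
    assert_eq!(
        collect_batch(arrived),
        vec!["header-1", "header-2", "header-3"]
    );
    println!("ok");
}
```

The key property used here is that BTreeMap iterates its entries in ascending key order, so inserting by response id is enough to recover the original request order.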

@notmandatory
Member

@marshallyale did you see the comment above? If you don't have time to look at it now we may open a new PR to try out his suggestion.

@marshallyale
Contributor Author

Yes, apologies for the delay. I will write some tests and look into the ChannelMessage::Response this weekend sometime.

@marshallyale marshallyale force-pushed the fix_transaction_ordering branch from 2d3f4a5 to 294dab4 Compare March 15, 2025 18:17
@marshallyale
Contributor Author

I've added a single test that calls batch_block_header and asserts that the timestamps match the expected times. This is looped 10 times, although it would normally fail within 5 iterations. I had to use a different server (exs.dyshek.org:50001) since it appears that electrum.blockstream.info always returns results in order, which is not true of other servers.

I looked at implementing tests that compare request ids to response ids, but there isn't a great way to do this without some code refactoring and additional loops. Neither batch_call nor raw_call returns the JSON response ids, so we would have to split the raw request/response matching out into a separate function and then iterate over it to produce the result in its current format.

Additionally, I looked into changing ChannelMessage::Response from Value to (req_id, value). This runs into the original issue of every request sharing the same receiving channel: the receiver would get a response, have to check whether its id matches its own request id, and if not, put the message back into the channel until the correct request picks it up. It could definitely be done, but it would require more reworking than the current separate-channels approach.

I think this could probably be done in a more elegant/faster way, but that would require more of an overhaul of the send/receive logic. Maybe this is just a function of me being very new to the code, but I personally find the recv and _reader_thread code confusing, which makes it hard to come up with alternatives without a major refactor.

Finally, it looks like the tests failed because actions/cache@v2 is deprecated. Want me to update to v3 (or v4)?

ValuedMammal added a commit to ValuedMammal/rust-electrum-client that referenced this pull request Mar 16, 2025
Contributor

@ValuedMammal ValuedMammal left a comment


Approach ACK

@ValuedMammal
Contributor

Finally, it looks like the tests failed because actions/cache@v2 is deprecated. Want me to update to v3 (or v4)?

If you want to push an update to actions/cache that would be fine, or else we can do it in another PR (#167).

@marshallyale
Contributor Author

I added the changes and updated actions/cache to v4. Hopefully that's easier than multiple PRs.

Contributor

@ValuedMammal ValuedMammal left a comment


ACK b6230be

marshallyale and others added 3 commits March 28, 2025 12:34
Fixes issue bitcoindevkit#75
The raw client's waiting map was using the same
channel for every request/response. When items
were put back into the channel inside of
_reader_thread, the waiting receiver in recv
would just take the next response without
validating it against the request id.
This fixes that by using unique channels for
each request/response pair in the waiting map.
@ValuedMammal ValuedMammal force-pushed the fix_transaction_ordering branch from b6230be to a3488f4 Compare March 28, 2025 16:39
Contributor

@ValuedMammal ValuedMammal left a comment


reACK a3488f4

Thanks again @marshallyale

@ValuedMammal ValuedMammal merged commit 862280a into bitcoindevkit:master Mar 28, 2025
4 checks passed