Skip to content

Fix get_custody_groups not sorting the result #7711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: unstable
Choose a base branch
from

Conversation

nethoxa
Copy link

@nethoxa nethoxa commented Jul 7, 2025

Issue Addressed

The function get_custody_groups does not follow the spec as it uses a hashset, which, upon iterated, follows an arbitrary order. The impact, at much, is LH nodes broadcasting and downloading data unorderly, while other nodes would do it in order (kind of LH nodes being misleading). However, as all nodes have the same values, and so the same downloaded data, no punishment in the peer score is applied (at first, would need to dig more).

Nevermind, this contradicts the specs which clearly enforces that the returned custody_groups must be ordered:

link

def get_custody_groups(node_id: NodeID, custody_group_count: uint64) -> Sequence[CustodyIndex]:
    assert custody_group_count <= NUMBER_OF_CUSTODY_GROUPS

    current_id = uint256(node_id)
    custody_groups: List[CustodyIndex] = []
    while len(custody_groups) < custody_group_count:
        custody_group = CustodyIndex(
            bytes_to_uint64(hash(uint_to_bytes(current_id))[0:8]) % NUMBER_OF_CUSTODY_GROUPS
        )
        if custody_group not in custody_groups:
            custody_groups.append(custody_group)
        if current_id == UINT256_MAX:
            # Overflow prevention
            current_id = uint256(0)
        else:
            current_id += 1

    assert len(custody_groups) == len(set(custody_groups))
    return sorted(custody_groups) <===================== HERE

This was not triggered by EF test suite as in the handler, it sorted the array, so even if the function returned the wrong value, the test passed:

link

    fn result(&self, _case_index: usize, _fork_name: ForkName) -> Result<(), Error> {
        let spec = E::default_spec();
        let node_id = U256::from_str_radix(&self.node_id, 10)
            .map_err(|e| Error::FailedToParseTest(format!("{e:?}")))?;
        let raw_node_id = node_id.to_be_bytes::<32>();
        let mut computed = get_custody_groups(raw_node_id, self.custody_group_count, &spec)
            .map(|set| set.into_iter().collect::<Vec<_>>())
            .expect("should compute custody groups");
        computed.sort();

        let expected = &self.result;
        if computed == *expected {
            Ok(())
        } else {
            Err(Error::NotEqual(format!(
                "Got {computed:?}\nExpected {expected:?}"
            )))
        }
    }

Proposed Changes

  1. Change HashSet to BTreeSet, which has the same functions and is assured to be ordered, as the test proves
  2. Remove the call to sort from the handler
  3. Change the types of the functions that call get_custody_groups to handle the correct type

Additional Info

As reference, check Prysm. I think another check for the length of the set should be added as in the specs, even if redundant, just in case.

@nethoxa nethoxa requested a review from jxs as a code owner July 7, 2025 18:14
Copy link

cla-assistant bot commented Jul 7, 2025

CLA assistant check
All committers have signed the CLA.

Copy link

cla-assistant bot commented Jul 7, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nethoxa
Copy link
Author

nethoxa commented Jul 8, 2025

About the length check, it is not needed as it is done to prevent duplicates, but a BTreeSet is a set, so no duplicates allowed.

@chong-he chong-he added test improvement Improve tests ready-for-review The code is ready for review das Data Availability Sampling labels Jul 9, 2025
@jimmygchen
Copy link
Member

Hi @nethoxa, thanks for the PR and bringing this up!

is LH nodes broadcasting and downloading data unorderly, while other nodes would do it in order (kind of LH nodes being misleading)

I don't fully understand what you mean by this - when broadcasting we only use the custody columns for filtering, and when we download via gossip / rpc, the order is not determined by the get_custody_group ordering. This should not result in any peer penalty. The ordering should have no impact to functionality. Please correct me if I'm wrong.

I'm actually working on an validator-custody related refactor, and I'm now considering to preserve the exact order of the get_custody_group result (ordered but NOT sorted), so that we don't have to maintain multiple set of custody column indices for different CGCs. When the CGC increases, we can just derive the column indices by taking a bigger slice of the full column set.

Consider a scenario where the node's custody column is 8, it custodies the following columns
[32, 79, 125, 35, 10, 112, 52, 56]

Let's say mid way through the epoch, the node's CGC increased due to an extra validator being attached to the beacon node, now:

  • we immediately update our advertised CGC to 9
  • the custody columns are now [32, 79, 125, 35, 10, 112, 52, 56, 94], with 94 being the additional column
  • we immediately subscribe to the additional custody subnet 94
  • however our DA check still requires the previous 8 for all slots before the next epoch

We now have to maintain two sets of custody groups, one for current epoch, and one for the next epoch. However if we maintain one full ordered list (128 ordered indices), we could easily take slices of it to determine the columns for any custody count.

There are different ways to approach this of course. But I don't see the need to make our internal implementation match the spec implementation, unless it results in undesired behaviour?

@nethoxa
Copy link
Author

nethoxa commented Jul 23, 2025

Hi @jimmygchen, as you say, no functionality rn would be affected nor any penalty. About this:

But I don't see the need to make our internal implementation match the spec implementation

I think you should, as on futures upgrades on top may rely on compliance with the specs, so could lead to some issues in the future (has happened recently, that's why I'm saying it). About your work on an validator-custody related refactor, lemme some time to think about it.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
das Data Availability Sampling ready-for-review The code is ready for review test improvement Improve tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants