@lingzhq (Collaborator) commented Jun 12, 2025

Introduces a new data selection op, inspired by the DaaR paper, that automatically selects the most semantically diverse subset of data samples across domains.

  • Converts input samples into embeddings
  • Clusters the embeddings into pseudo-domains
  • Selects samples based on various distances to maximize diversity
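
The three steps above can be illustrated with a minimal, dependency-free sketch. This is not the op added in this PR: the function names (`kmeans`, `select_diverse`), the farthest-point centroid initialization, and the selection rule (round-robin over clusters, preferring samples farthest from their own centroid) are all assumptions made for illustration. The embeddings are assumed to be precomputed by any embedding model.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(embeddings, k, iters=20):
    """Plain k-means (requires k <= len(embeddings)).

    Returns (labels, centroids). The clusters stand in for pseudo-domains.
    """
    # Deterministic farthest-point initialization (an illustrative choice).
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        far = max(embeddings, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(list(far))

    labels = [0] * len(embeddings)
    for _ in range(iters):
        # Assign every sample to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(v, centroids[c])) for v in embeddings
        ]
        # Recompute centroids; keep the old one if a cluster is empty.
        for c in range(k):
            members = [v for v, lab in zip(embeddings, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def select_diverse(embeddings, k, n_select):
    """Hypothetical selection rule, not the DaaR criterion.

    Visits clusters in round-robin order and, within each cluster, takes
    samples in order of decreasing distance to the cluster centroid.
    Returns the indices of the selected samples.
    """
    labels, centroids = kmeans(embeddings, k)
    ranked = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: dist(embeddings[i], centroids[c]), reverse=True)
        ranked.append(members)

    target = min(n_select, len(embeddings))
    selected, rank = [], 0
    while len(selected) < target:
        for members in ranked:
            if rank < len(members) and len(selected) < target:
                selected.append(members[rank])
        rank += 1
    return selected


if __name__ == "__main__":
    # Two well-separated toy "domains" in 2-D.
    toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(select_diverse(toy, k=2, n_select=2))
```

Alternating between clusters keeps the selected subset balanced across pseudo-domains, while the within-cluster ranking favors less typical samples. A real implementation would likely combine several distance signals, as the description suggests.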

[WIP] Additional operators derived from DaaR are under development.
