generated from opentensor/bittensor-subnet-template
Labels: discussion (In Discussion)
Description
The dataset currently used for evaluation has the following properties:
- Fetch 2048 samples from the historical synthetic dataset
- Fetch 2048 samples from the latest 48 hours of synthetic data
- Evaluate on the 4096 samples gathered
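The sampling scheme above can be sketched roughly as follows. The function name, signature, and data representation are illustrative assumptions, not the subnet's actual API:

```python
import random

def build_eval_set(historical, recent_48h, n_each=2048, seed=None):
    """Sketch of the current scheme: draw n_each samples from the
    historical pool and n_each from the last-48-hour pool, then
    evaluate on their union. `historical` and `recent_48h` are
    assumed to be lists of samples; names are hypothetical."""
    rng = random.Random(seed)
    old = rng.sample(historical, min(n_each, len(historical)))
    new = rng.sample(recent_48h, min(n_each, len(recent_48h)))
    # 4096 samples total when both pools are large enough
    return old + new
```

Because the 48-hour pool changes between evaluation runs, two validators (or two runs of the same validator) score models against different data, which is one source of the variance discussed below.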
While this setup has been effective at preventing blatant overfitting, it has some limitations. Chiefly, the constantly changing dataset introduces a higher degree of variance in scoring.
This issue aims to discuss potential solutions and the details of the implementation.
Some suggestions from miners:
- Use a fixed, epoch-based dataset. A fixed dataset would reduce variance, but requires safeguards against overfitting
- Re-evaluate existing models on some fixed cycle. The implementation here also depends on how re-evaluation interacts with emissions, and on how models are re-scored in the case of "fluke" scores
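One way the epoch-based suggestion could work is to derive the evaluation set deterministically from the epoch number, so that every validator scores against the same fixed subset within an epoch, while the subset rotates across epochs to limit overfitting to any single snapshot. This is a hypothetical sketch, not an agreed-upon design:

```python
import hashlib
import random

def epoch_dataset(pool, epoch, n=4096):
    """Hypothetical epoch-based selection: hash the epoch number into
    a seed so all validators independently derive the same fixed
    subset of `pool` for a given epoch. The subset changes each
    epoch, so miners cannot overfit one static dataset."""
    digest = hashlib.sha256(f"epoch-{epoch}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.sample(pool, min(n, len(pool)))
```

Within an epoch, scores would be directly comparable across validators and re-runs, addressing the variance issue; the rotation across epochs provides the overfitting safeguard mentioned above.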
More recently, the sample size for evaluation has been adjusted (as seen here) to help reduce variance, but other approaches may also be effective.