Note
If you're a non-Stanford student and interested in submitting to the leaderboard, please create a pull request adding your result to the second table. To remain in the top 5, your submission must be verified: invite marcelroed to a minimal repo containing a uv project with `pyproject.toml`, `uv.lock`, and `main.py`. Your run should be reproducible on a single H100 by executing `uv run main.py`.
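For reference, a minimal `pyproject.toml` for such a project might look like the sketch below; the project name and dependencies are placeholders, not requirements.

```toml
# Hypothetical minimal pyproject.toml -- name and dependencies are placeholders.
[project]
name = "leaderboard-submission"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "torch",
    "numpy",
]
```

Running `uv run main.py` will resolve these dependencies and generate `uv.lock` automatically if it does not already exist.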
To submit to the leaderboard, submit a pull request that adds your results to the Markdown table below. The table should be sorted by increasing loss.
Note that your submission can run for at most 1.5 hours on an H100, and that you may only use the OpenWebText training dataset that we provide. The code must clearly be your own work, and you can't use external implementations for systems-critical aspects of your model.
The top 3 submissions will receive a prize at the end of the quarter, and the top 3 external submissions will receive a T-shirt. To make this fair, we will reorder the top 5 scoring students based on our reproduced training runs. Make sure you save a snapshot of your best code so it can be reproduced by us! We will reach out to the top few students after results have stabilized. Leading submissions that cannot be verified will be removed.
In your pull request description, you should include:
- The final validation loss that was recorded
- A link to an associated learning curve that clearly shows a wallclock-time x-axis spanning less than 1.5 hours. You may either upload an image directly to the repo (use the `./images` folder) or link to a publicly viewable plot from a service like Weights & Biases (a minimal plotting sketch follows this list).
- A description of what you did
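As a sketch of the learning-curve requirement, the placeholder script below (made-up data, hypothetical paths) plots logged (wallclock, loss) pairs with hours on the x-axis.

```python
# Hedged sketch with fake data: plot validation loss against wallclock hours.
# In a real run, append (time.time() - start, val_loss) after each evaluation.
import os
import matplotlib.pyplot as plt

history = [(600 * i, 4.8 - 0.25 * i) for i in range(1, 9)]  # placeholder (seconds, loss)

hours = [t / 3600 for t, _ in history]
losses = [loss for _, loss in history]

plt.plot(hours, losses)
plt.axvline(1.5, linestyle="--", label="1.5 h budget")
plt.xlabel("wallclock time (hours)")
plt.ylabel("validation loss")
plt.legend()
os.makedirs("images", exist_ok=True)
plt.savefig("images/learning_curve.png")
```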
We are considering adding an automated validation loss check, since it's easy to measure your metrics incorrectly in a way that places you higher on the leaderboard than you should be. If your loss seems too good to be true, validate that your training and validation datasets are correct by checking decoded samples, and make sure your vocab is correct with 32k tokens. It should not be easy to get a validation loss better than 3.3.
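If you want a quick check, here is one possible sanity pass; the file path, dtype, and tokenizer interface are assumptions about your setup, not a prescribed format.

```python
# Hedged sanity check -- the path, dtype, and tokenizer here are assumptions.
import numpy as np

tokens = np.memmap("data/owt_valid.bin", dtype=np.uint16, mode="r")

# With a 32k vocab, every token id should be below 32,000.
assert int(tokens.max()) < 32_000, "token id out of range for a 32k vocab"

# Decode a short window and read it -- it should look like OpenWebText prose.
# print(tokenizer.decode(tokens[:256].tolist()))  # requires your tokenizer
```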