    Repositories list

    • Achieve error-rate fairness between societal groups for any score-based classifier.
      Python
      41902 Updated Aug 21, 2025
    • Python
      1300 Updated Aug 19, 2025
    • Python
      0400 Updated Aug 5, 2025
    • A framework for few-shot evaluation of language models.
      Python
      2.7k100 Updated May 4, 2025
    • folktexts (Public)
      Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data!
      Jupyter Notebook
      42400 Updated Apr 8, 2025
    • Code to reproduce the paper "Do causal predictors generalize better to new domains?"
      Python
      141200 Updated Feb 7, 2025
    • Jupyter Notebook
      0100 Updated Jan 22, 2025
    • Code to reproduce the paper "Questioning the Survey Responses of Large Language Models"
      Jupyter Notebook
      2900 Updated Dec 8, 2024
    • Code to reproduce the experiments in the paper "Training on the Test Task Confounds Evaluation and Emergence".
      Jupyter Notebook
      11100 Updated Dec 3, 2024
    • lawma (Public)
      Lawma: A lightly fine-tuned Llama model for legal classification tasks.
      Jupyter Notebook
      02100 Updated Sep 14, 2024
    • BenchBench is a Python package to evaluate multi-task benchmarks.
      Python
      11600 Updated Jul 18, 2024
    • Datasets derived from US census data
      Python
      2226874 Updated May 15, 2024
    • tttlm (Public)
      Test-time-training on nearest neighbors for large language models
      Python
      54500 Updated Apr 18, 2024
    • Code for "Is your model predicting the past?"
      Jupyter Notebook
      0200 Updated Mar 10, 2024
    • whynot (Public)
      A Python sandbox for decision making in dynamics
      Python
      4342282 Updated Aug 21, 2023