This is the companion repo to this article about analyzing the GPT training pipeline using the weights of gpt-oss.
- `extract_embeddings.py` extracts the embedding matrix of the gpt-oss model.
- `embedding_norms_to_csv.py` computes the L2 norms of the extracted embeddings, with an option to consider only non-ASCII tokens.
- `embedding_distances_to_mean.py` computes the distance of each embedding to the mean embedding.
- `plot_token_norms.py` plots the calculated norms.
- `find_chinese_tokens.py` and `chinese_token_ids.py` isolate Chinese tokens.
- `get_github_counts.py` gets the number of occurrences of token texts on GitHub using the search API.
- `token_translation.py` evaluates the completions of various models given different tokens.
- `analyze_model_accuracy.py` analyzes the results of the previous script.
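
To make the norm and distance computations concrete, here is a minimal, dependency-free sketch on toy data. It is not the code from the scripts above: the function names (`l2_norm`, `distances_to_mean`, `token_norms`), the `non_ascii_only` parameter, and the toy tokens and vectors are all hypothetical, and the sketch assumes that "distance to the average" means the Euclidean distance to the mean embedding.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of one embedding vector."""
    return math.sqrt(sum(x * x for x in vec))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distances_to_mean(vectors):
    """Euclidean distance of each vector to the mean vector (assumed metric)."""
    mean = mean_vector(vectors)
    return [l2_norm([x - m for x, m in zip(vec, mean)]) for vec in vectors]


def token_norms(tokens, vectors, non_ascii_only=False):
    """Return (token_id, token_text, norm) rows, optionally skipping pure-ASCII tokens."""
    rows = []
    for token_id, (text, vec) in enumerate(zip(tokens, vectors)):
        if non_ascii_only and text.isascii():
            continue
        rows.append((token_id, text, l2_norm(vec)))
    return rows


# Toy stand-ins for the tokenizer vocabulary and the embedding matrix (hypothetical data).
toy_tokens = ["the", "你好", "café"]
toy_embeddings = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]

if __name__ == "__main__":
    for row in token_norms(toy_tokens, toy_embeddings, non_ascii_only=True):
        print(row)
    print(distances_to_mean(toy_embeddings))
```

For the real embedding matrix, a vectorized array library would be the natural choice; plain lists are used here only to keep the example self-contained.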