Resources on Checklist and Testing ML Systems #57

@SoloSynth1

Description

Resources to check - a check mark means we have read the resource thoroughly (ongoing effort; feel free to add and/or update):

  1. Resources from Tiffany:
  • Alexander, R., Katz, L., Moore, C., Wong, M. W.-C., & Schwartz, Z. (2024). Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs.
  • Gawande, A. (2010). The Checklist Manifesto. Penguin Books India.
  • Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E., & Larochelle, H. (2021). Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164), 1–20.
  • Jordan, J. (2020). Effective testing for machine learning systems.
  • Yan, E. (2020). How to Test Machine Learning Code and Systems.
  • Ribeiro, M., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
    • Focuses on NLP models
    • Three kinds of post-training tests: Minimum Functionality Tests (MFT), Invariance Tests (INV), and Directional Expectation Tests (DIR); see the sketch after this list.
  • Cheng, D., Cao, C., Xu, C., & Ma, X. (2018). Manifesting Bugs in Machine Learning Code: An Explorative Study with Mutation Testing. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) (pp. 313-324).
  • Openja, M., Khomh, F., Foundjem, A., Jiang, Z. M., Abidi, M., Hassan, A. E., & others (2023). Studying the Practices of Testing Machine Learning Software in the Wild. arXiv preprint arXiv:2312.12604.
  • Silva, S., & De França, B. (2023). A Case Study on Data Science Processes in an Academia-Industry Collaboration. In Proceedings of the XXII Brazilian Symposium on Software Quality (pp. 1–10).
  • Ben Braiek, H., & Khomh, F. (2020). On testing machine learning programs. Journal of Systems and Software, 164, 110542.
  • Wattanakriengkrai, S., Chinthanet, B., Hata, H., Kula, R., Treude, C., Guo, J., & Matsumoto, K. (2022). GitHub repositories with links to academic papers: Public access, traceability, and evolution. Journal of Systems and Software, 183, 111117.
  • Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2024). An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering, 50(1), 85-105.
  • Dakhel, A. M., Nikanjam, A., Majdinasab, V., Khomh, F., & Desmarais, M. C. (2024). Effective test generation using pre-trained Large Language Models and mutation testing. Information and Software Technology, 107468.
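
To make the three CheckList test types concrete, here is a minimal pytest-style sketch. `predict_proba` is a hypothetical stand-in for the model under test (the paper's actual CheckList library adds templating and perturbation tooling on top of these ideas); all names here are illustrative, not from the paper.

```python
# Minimal sketch of CheckList-style behavioral tests (Ribeiro et al., 2020).
# `predict_proba` is a toy stand-in returning P(positive) per input text.

def predict_proba(texts):
    """Toy sentiment scorer standing in for a real model under test."""
    return [0.9 if "good" in t.lower() else 0.1 for t in texts]

def label(texts):
    return ["positive" if p >= 0.5 else "negative" for p in predict_proba(texts)]

def test_minimum_functionality():
    # MFT: simple, unambiguous inputs the model must get right.
    assert label(["This is a good movie."]) == ["positive"]
    assert label(["This movie was terrible."]) == ["negative"]

def test_invariance():
    # INV: a label-irrelevant perturbation (changing a name) must not flip the prediction.
    assert label(["Alice said it was good."]) == label(["Bob said it was good."])

def test_directional_expectation():
    # DIR: a perturbation with a known direction (appending extra praise)
    # should not lower the positive-sentiment score.
    base, = predict_proba(["The plot was good."])
    more, = predict_proba(["The plot was good. The acting was good too."])
    assert more >= base
```

Run with `pytest`; the point is that all three tests probe the model's behavior directly rather than relying on held-out accuracy.
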
  2. Resources from our own research:
  • Yu, B. (2017). Testing on the Toilet: Keep Cause and Effect Clear.
  • Kent, K. (2024). Prefer Narrow Assertions in Unit Tests.
  • Yu, B. (2018). Testing on the Toilet: Keep Tests Focused.
  • Winters, T. (2024). Test Failures Should Be Actionable.
  • Trenk, A. (2014). Testing on the toilet: Writing descriptive test names.
  • Odena, A., & Goodfellow, I. (2018). TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing.
    • Coverage-guided fuzzing is related in spirit to mutation testing, but it mutates the model's inputs (guided by activation coverage) rather than the program under test; see the sketch after this list.
    • "quantify the area covered by radial neighborhoods around these activation vectors"

Labels

help wanted (Extra attention is needed), research (Studies and/or research needed)
