Resources on Checklist and Testing ML Systems #57

@SoloSynth1

Description

Resources to check - a check mark means we have read the resource thoroughly (ongoing effort; feel free to add and/or update):

  1. Resources from Tiffany:
  • Alexander, R., Katz, L., Moore, C., Wong, M. W.-C., & Schwartz, Z. (2024). Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs.
  • Gawande, A. (2010). The Checklist Manifesto. Penguin Books India.
  • Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d'Alché-Buc, F., Fox, E., & Larochelle, H. (2021). Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164), 1–20.
  • Jordan, J. (2020). Effective testing for machine learning systems.
  • Yan, E. (2020). How to Test Machine Learning Code and Systems.
  • Ribeiro, M., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118.
    • Focuses on NLP models
    • Three kinds of post-training tests: Minimum Functionality Tests (MFT), Invariance Tests (INV), and Directional Expectation Tests (DIR); see the sketch after this list.
  • Cheng, D., Cao, C., Xu, C., & Ma, X. (2018). Manifesting Bugs in Machine Learning Code: An Explorative Study with Mutation Testing. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) (pp. 313-324).
  • Openja, M., Khomh, F., Foundjem, A., Jiang, Z. M., Abidi, M., Hassan, A. E., & others (2023). Studying the Practices of Testing Machine Learning Software in the Wild. arXiv preprint arXiv:2312.12604.
  • Silva, S., & De França, B. (2023). A Case Study on Data Science Processes in an Academia-Industry Collaboration. In Proceedings of the XXII Brazilian Symposium on Software Quality (pp. 1–10).
  • Ben Braiek, H., & Khomh, F. (2020). On testing machine learning programs. Journal of Systems and Software, 164, 110542.
  • Wattanakriengkrai, S., Chinthanet, B., Hata, H., Kula, R., Treude, C., Guo, J., & Matsumoto, K. (2022). GitHub repositories with links to academic papers: Public access, traceability, and evolution. Journal of Systems and Software, 183, 111117.
  • Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2024). An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering, 50(1), 85-105.
  • Dakhel, A. M., Nikanjam, A., Majdinasab, V., Khomh, F., & Desmarais, M. C. (2024). Effective test generation using pre-trained Large Language Models and mutation testing. Information and Software Technology, 107468.
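
To make the three CheckList test types concrete, here is a minimal pytest-style sketch. `predict_proba` is a hypothetical stand-in for the model under test (the paper's actual CheckList library adds templating and perturbation tooling on top of these ideas); all names here are illustrative, not from the paper.

```python
# Minimal sketch of CheckList-style behavioral tests (Ribeiro et al., 2020).
# `predict_proba` is a toy stand-in returning P(positive) per input text.

def predict_proba(texts):
    """Toy sentiment scorer standing in for a real model under test."""
    return [0.9 if "good" in t.lower() else 0.1 for t in texts]

def label(texts):
    return ["positive" if p >= 0.5 else "negative" for p in predict_proba(texts)]

def test_minimum_functionality():
    # MFT: simple, unambiguous inputs the model must get right.
    assert label(["This is a good movie."]) == ["positive"]
    assert label(["This movie was terrible."]) == ["negative"]

def test_invariance():
    # INV: a label-irrelevant perturbation (changing a name) must not flip the prediction.
    assert label(["Alice said it was good."]) == label(["Bob said it was good."])

def test_directional_expectation():
    # DIR: a perturbation with a known direction (appending extra praise)
    # should not lower the positive-sentiment score.
    base, = predict_proba(["The plot was good."])
    more, = predict_proba(["The plot was good. The acting was good too."])
    assert more >= base
```

Run with `pytest`; the point is that all three tests probe the model's behavior directly rather than relying on held-out accuracy.
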
  2. Resources from our own research:
  • Yu, B. (2017). Testing on the Toilet: Keep Cause and Effect Clear.
  • Kent, K. (2024). Prefer Narrow Assertions in Unit Tests.
  • Yu, B. (2018). Testing on the Toilet: Keep Tests Focused.
  • Winters, T. (2024). Test Failures Should Be Actionable.
  • Trenk, A. (2014). Testing on the toilet: Writing descriptive test names.
  • Odena, A., & Goodfellow, I. (2018). TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing.
    • Coverage-guided fuzzing is related in spirit to mutation testing, but it mutates the model's inputs (guided by activation coverage) rather than the program under test; see the sketch after this list.
    • "quantify the area covered by radial neighborhoods around these activation vectors"

Labels

help wanted (Extra attention is needed), research (Studies and/or research needed)
