[GSOC 2025 PreProposal] Project 8 Benchmarking and Optimization #4975
tanishy7777 started this conversation in GSoC Discussions
-
I saw that you already submitted a pre-proposal. Is the submitted one the same as the above? In any case, quickly reading through the above, my first thought was that your pre-proposal could be improved by listing concrete examples of the areas where you would add performance benchmarks, e.g., a list of the first 5 benchmarks you'll write and what they are supposed to measure and cover. See https://www.mdanalysis.org/benchmarks/ (and https://github.com/MDAnalysis/mdanalysis/tree/develop/benchmarks) for what's already there; the issue tracker may also have issues open for specific ones.
-
Describe your relevant background and experiences.
For instance, you may wish to include your educational background and relevant experience with MDAnalysis, WESTPA, and/or Molecular Nodes and molecular dynamics, computational physics/chemistry/materials, or any other skills relevant to the project you are interested in.
Educational Background:
I am a sophomore at the Indian Institute of Technology (IIT) Gandhinagar, pursuing a Bachelor's degree in Artificial Intelligence.
My Strengths:
I am among the top 0.5% of the 1,000,000 students who took the Joint Entrance Examination (JEE) in 2023, a highly competitive exam covering Physics, Chemistry, and Mathematics. This achievement earned me admission to IIT Gandhinagar.
I was selected for the Dean's List in my first semester and received an academic citation in my third semester. My current CPI is 9.20.
As part of my coursework, I completed "Introduction to Biology", which covered topics such as protein synthesis, DNA replication, and evolution, and "Undergraduate Science Laboratory", where we conducted experiments in chemistry and physics. In chemistry, I performed fingerprint detection, synthesis of Nylon 6,6, the photochemical reaction of iodine and oxalate, electrolysis of water, and removal of organic pollutants using light. These courses sparked my interest in chemistry and biology. While I am not an expert, I have a strong foundational understanding of these fields, which will help me grasp relevant concepts and contribute effectively to a GSoC project with MDAnalysis.
Link to the lab reports for Undergraduate Science Laboratory: https://iitgnacin-my.sharepoint.com/:f:/g/personal/23110328_iitgn_ac_in/EvOvQVb9VptMmOi9YkLSz_EBnKNpRD_Szsomk5qei3Y4_A?e=wWCLTV
Experience with MDAnalysis:
I have been actively contributing to MDAnalysis since December 2024 and have had 7 pull requests merged into the repository. During this time, I have gained deep insight into the codebase, as well as Python concepts such as object-oriented programming, magic methods, and data-type mutability. I am currently working on two PRs related to parallelization of analysis classes.
This experience has taught me how to navigate and contribute to a large and complex codebase without being overwhelmed, and has further strengthened my confidence in collaborative open-source development.
Other Skills:
Languages: Python, C
Tools/Technologies: Git, GitHub, ASV (Airspeed Velocity), Cython
What is the background? What is the overarching question? You can also comment on why this is an interesting or difficult problem.
Clearly define the overall goal of what you want to find out.
Background:
MDAnalysis is an actively developed open-source Python library for analyzing molecular simulations. As with many open-source projects, contributions from multiple developers over time add new functionality, fix bugs, and improve usability. However, this continuous evolution of the codebase can unintentionally introduce performance regressions: while the library becomes more capable and covers more cases, individual operations may silently become slower. It is therefore valuable to benchmark the various components of the code over the lifetime of the project, so that performance regressions, i.e., code that still works but runs slower than it used to, can be identified.
Once we identify the offending commits, we can recover performance either by adding optimized code while preserving functionality or by simply reverting to the pre-regression version of the code.
Currently, performance assessment using the ASV framework exists but its coverage is limited, which hinders identification of bottlenecks and performance regressions over time.
Overarching Question:
How can we benchmark the MDAnalysis codebase to identify performance regressions and bottlenecks, and how can we optimize them?
Why This Is an Interesting Problem:
Performance directly affects user experience, especially in molecular dynamics, where datasets are often large and computational demands are high. Faster analysis tools mean researchers can process simulations more efficiently, enabling quicker scientific insights and advancing research in computational chemistry, biophysics, and related fields.
This project is also challenging. For example, ASV's regression search uses binary search over historical commits to identify when a performance regression was introduced. This method assumes that performance degrades monotonically; when it does not, the search may miss the actual regression commit or flag false positives (a toy illustration of this failure mode appears in the approach section below). Such cases require manual verification to confirm whether there is a genuine performance drop.
Overall Goal:
The goal of this project is to significantly improve the performance benchmarking infrastructure of MDAnalysis by expanding the ASV benchmark coverage across the core library and commonly used analysis tools. This will enable systematic detection of performance regressions and bottlenecks over the library's development lifecycle.
In addition, the project aims to analyze historical performance trends, identify and prioritize areas for optimization, and create comprehensive documentation to help contributors write and maintain benchmarks. A further impactful deliverable is the optimization of at least one critical performance bottleneck identified through benchmarking.
Describe how you are going to reach your goal (i.e., answer the overarching question).
Which algorithms are you going to use? Are there any libraries or other packages you want to use? Do you need to research different solutions?
Be as concrete as possible; you want to convince your audience that it is feasible to solve this problem and you have an idea how to tackle it.
My Approach to Achieving the Goal
To systematically benchmark and optimize MDAnalysis, I will follow the three-part plan detailed below.
Libraries and Tools:
ASV (Airspeed Velocity):
I will use ASV to write and manage performance benchmarks. ASV allows benchmarking across historical commits and visualizing performance trends, which is essential for identifying regressions over time.
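As a concrete illustration, here is a minimal sketch of the kind of ASV benchmark I would add. The class and method names here are hypothetical; the sketch follows ASV's convention that methods prefixed with time_ are timed (with setup() excluded from the measurement) and assumes the MDAnalysisTests package is installed for its bundled PSF/DCD test files.

```python
import MDAnalysis as mda
from MDAnalysis.analysis import rms
from MDAnalysisTests.datafiles import PSF, DCD  # small test trajectory from MDAnalysisTests


class RMSDBenchmark:
    """Hypothetical ASV benchmark timing an RMSD analysis over a short trajectory."""

    def setup(self):
        # setup() runs before each timing run and is excluded from the measurement
        self.universe = mda.Universe(PSF, DCD)
        self.atoms = self.universe.select_atoms("backbone")

    def time_rmsd(self):
        # ASV measures the wall-clock time of methods prefixed with time_
        rms.RMSD(self.atoms, self.atoms).run()
```

Running `asv run` over a range of commits would then produce per-commit timings for this benchmark that feed into trend graphs like those on https://www.mdanalysis.org/benchmarks/.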
cProfile + RunSnake:
After detecting a performance regression using ASV, I will use cProfile to profile the affected code and identify the functions consuming the most execution time.
To make profiling insights more actionable, I will use RunSnake, a GUI tool that visualizes cProfile output as a "square map" or as sortable tables, making it easier to drill down into hotspots and to compare pre- and post-regression commits.
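A minimal sketch of such a profiling session, with a placeholder workload() standing in for the regressed MDAnalysis call under investigation:

```python
import cProfile
import pstats


def workload():
    # placeholder for the regressed MDAnalysis call being investigated
    return sum(i * i for i in range(10**6))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Dump stats to a file RunSnake can open (e.g. `runsnake workload.prof`),
# and print the 10 most expensive calls by cumulative time to the terminal.
profiler.dump_stats("workload.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

Comparing two such dumps, one taken before the regression and one after, makes it easy to see which functions grew in cumulative time.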
Parallelization:
In cases where performance can be improved through parallel processing, I will explore parallelizing AnalysisBase classes using Python’s multiprocessing, joblib, or existing MDAnalysis parallel frameworks. I am currently working on two PRs that parallelize AnalysisBase classes, giving me direct experience with performance tuning in this context.
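For illustration, here is a minimal sketch of how the split-apply-combine parallel backends introduced in MDAnalysis 2.8.0 are enabled on supported analysis classes, assuming the installed version supports the multiprocessing backend for RMSD and again using the MDAnalysisTests data files:

```python
import MDAnalysis as mda
from MDAnalysis.analysis import rms
from MDAnalysisTests.datafiles import PSF, DCD

if __name__ == "__main__":  # guard required by multiprocessing's spawn start method
    u = mda.Universe(PSF, DCD)
    rmsd = rms.RMSD(u.select_atoms("backbone"))
    # Trajectory frames are split across 4 worker processes and the
    # per-frame results are combined afterwards.
    rmsd.run(backend="multiprocessing", n_workers=4)
    print(rmsd.results.rmsd[:5])
```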
Research and Prioritization:
Community Input and Usage Data:
I will analyze the GitHub issue tracker and discussions, and engage with the MDAnalysis community (e.g., via Discord and the forums) to identify the most commonly used features. This will help prioritize which modules and functions to benchmark and optimize first, ensuring the project delivers maximum user impact.
Regression Search and Handling Non-Monotonic Trends:
ASV's regression search uses binary search to locate the commit that introduced a regression. I will complement this with manual inspection or alternative methods when the performance history is non-monotonic.
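To make the failure mode concrete, here is a toy illustration (not ASV's actual implementation) of why bisecting a non-monotonic timing series can point at the wrong commit:

```python
# Toy model: one timing per commit. The persistent regression lands at
# index 5, but there is a transient spike at index 3.
timings = [1.0, 1.0, 1.0, 2.0, 1.0, 2.5, 2.5, 2.5]


def bisect_regression(timings, baseline=1.0, tol=0.5):
    """Find the first 'slow' commit, assuming timings grow monotonically."""
    lo, hi = 0, len(timings) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if timings[mid] > baseline + tol:
            hi = mid       # looks slow: regression is at mid or earlier
        else:
            lo = mid + 1   # looks fast: regression must be later
    return lo


# Reports index 3 (the transient spike), not index 5 (the real regression).
print(bisect_regression(timings))
```

This is why a flagged commit needs to be re-benchmarked, or its neighboring commits inspected, before concluding that it actually introduced the regression.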
Feasibility and Technical Readiness
This approach is technically feasible because MDAnalysis already integrates ASV, and I have been contributing to the codebase since December 2024, with 7 merged PRs. My familiarity with MDAnalysis internals, my ongoing work on parallelization, and my experience with Python performance tools (ASV, cProfile) put me in a strong position to deliver these outcomes effectively.
I am also prepared to research optimization techniques (e.g., Cython, NumPy vectorization) where needed to improve performance while preserving the behavior of the existing code.
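As an example of the kind of optimization I would research, here is a toy comparison of a naive Python double loop against a NumPy-broadcast version of the same pairwise-distance computation. (MDAnalysis itself ships an optimized MDAnalysis.lib.distances.distance_array for this, so the example is purely illustrative of the technique.)

```python
import numpy as np


def distances_loop(a, b):
    # Naive O(N*M) Python loops: easy to read, slow for large arrays
    out = np.empty((len(a), len(b)))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i, j] = np.sqrt(((x - y) ** 2).sum())
    return out


def distances_vectorized(a, b):
    # Broadcasting replaces both Python loops with a single C-level pass
    diff = a[:, None, :] - b[None, :, :]          # shape (N, M, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))


a = np.random.rand(100, 3)
b = np.random.rand(100, 3)
assert np.allclose(distances_loop(a, b), distances_vectorized(a, b))
```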