[GSOC 2025 PreProposal] Project 8 Benchmarking and Optimization #4975
tanishy7777 started this conversation in GSoC Discussions
-
I saw that you already submitted a pre-proposal. Is the submitted one the same as the above? In any case, quickly reading through the above, my first thought was that your pre-proposal could be improved by listing concrete examples of the areas where you would add performance benchmarks, e.g., a list of the first 5 benchmarks you'll write and what they are supposed to measure and cover. See https://www.mdanalysis.org/benchmarks/ (and https://github.com/MDAnalysis/mdanalysis/tree/develop/benchmarks) for what's already there; the issue tracker may also have issues open for specific ones.
-
Describe your relevant background and experiences.
For instance, you may wish to include your educational background and relevant experience with MDAnalysis, WESTPA, and/or Molecular Nodes and molecular dynamics, computational physics/chemistry/materials, or any other skills relevant to the project you are interested in.
Educational Background:
I am a sophomore at the Indian Institute of Technology (IIT) Gandhinagar, pursuing a Bachelor's degree in Artificial Intelligence.
My Strengths:
I am among the top 0.5% of the 1,000,000 students who took the Joint Entrance Examination (JEE) in 2023, a highly competitive exam covering Physics, Chemistry, and Mathematics. This achievement earned me admission to IIT Gandhinagar.
I was selected for the Dean's List in my first semester and received an academic citation in my third semester. My current CPI is 9.20.
As part of my coursework, I completed "Introduction to Biology", which covered topics such as protein synthesis, DNA replication, and evolution, and "Undergraduate Science Laboratory", where we conducted experiments in chemistry and physics. In chemistry, I performed fingerprint detection, synthesis of Nylon 6,6, the photochemical reaction of iodine and oxalate, electrolysis of water, and removal of organic pollutants using light. These courses sparked my interest in chemistry and biology. While I am not an expert, I have a strong foundational understanding of these fields, which will help me grasp relevant concepts and contribute effectively to a GSoC project with MDAnalysis.
Link to the lab reports for Undergraduate Science Laboratory: https://iitgnacin-my.sharepoint.com/:f:/g/personal/23110328_iitgn_ac_in/EvOvQVb9VptMmOi9YkLSz_EBnKNpRD_Szsomk5qei3Y4_A?e=wWCLTV
Experience with MDAnalysis:
I have been actively contributing to MDAnalysis since December 2024 and have had 7 pull requests merged into the repository. During this time, I have gained deep insight into the codebase, as well as Python concepts such as object-oriented programming, magic methods, and data-type mutability. I am currently working on two PRs related to parallelization of analysis classes.
This experience has taught me how to navigate and contribute to a large and complex codebase without being overwhelmed, and has further strengthened my confidence in collaborative open-source development.
Other Skills:
Languages: Python, C
Tools/Technologies: Git, GitHub, ASV (Airspeed Velocity), Cython
What is the background? What is the overarching question? You can also comment on why this is an interesting or difficult problem.
Clearly define the overall goal of what you want to find out.
Background:
MDAnalysis is an actively developed open-source Python library for analyzing molecular simulations. As with many open-source projects, contributions from multiple developers over time add new functionality, fix bugs, and improve usability. However, this continuous evolution of the codebase can unintentionally introduce performance regressions: while the library becomes more capable and covers more cases, individual operations may silently become slower. It is therefore valuable to benchmark the various components of the code over the lifetime of the project, so that performance regressions, i.e., code that still works but runs slower than it used to, can be identified.
Once we identify the offending commits, we can recover performance either by adding optimized code while preserving functionality or by simply reverting to the pre-regression version of the code.
Currently, performance assessment using the ASV framework exists but its coverage is limited, which hinders identification of bottlenecks and performance regressions over time.
Overarching Question:
How can we benchmark the MDAnalysis codebase to identify performance regressions and bottlenecks, and how can we optimize them?
Why This Is an Interesting Problem:
Performance directly affects user experience, especially in molecular dynamics, where datasets are often large and computational demands are high. Faster analysis tools mean researchers can process simulations more efficiently, enabling quicker scientific insights and advancing research in computational chemistry, biophysics, and related fields.
This project is also challenging. For example, ASV's regression search uses binary search over historical commits to identify when a performance regression was introduced. This method assumes that performance degrades monotonically; when it does not, the search may miss the actual regression commit or flag false positives (a toy illustration of this failure mode appears in the approach section below). Such cases require manual verification to confirm whether there is a genuine performance drop.
Overall Goal:
The goal of this project is to significantly improve the performance benchmarking infrastructure of MDAnalysis by expanding the ASV benchmark coverage across the core library and commonly used analysis tools. This will enable systematic detection of performance regressions and bottlenecks over the library's development lifecycle.
In addition, the project aims to analyze historical performance trends, identify and prioritize areas for optimization, and create comprehensive documentation to help contributors write and maintain benchmarks. A further impactful deliverable is the optimization of at least one critical performance bottleneck identified through benchmarking.
Describe how you are going to reach your goal (i.e., answer the overarching question).
Which algorithms are you going to use? Are there any libraries or other packages you want to use? Do you need to research different solutions?
Be as concrete as possible; you want to convince your audience that it is feasible to solve this problem and you have an idea how to tackle it.
My Approach to Achieving the Goal
To systematically benchmark and optimize MDAnalysis, I will follow the three-part plan detailed below.
Libraries and Tools:
ASV (Airspeed Velocity):
I will use ASV to write and manage performance benchmarks. ASV allows benchmarking across historical commits and visualizing performance trends, which is essential for identifying regressions over time.
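As a concrete illustration, here is a minimal sketch of the kind of ASV benchmark I would add. The class and method names here are hypothetical; the sketch follows ASV's convention that methods prefixed with time_ are timed (with setup() excluded from the measurement) and assumes the MDAnalysisTests package is installed for its bundled PSF/DCD test files.

```python
import MDAnalysis as mda
from MDAnalysis.analysis import rms
from MDAnalysisTests.datafiles import PSF, DCD  # small test trajectory from MDAnalysisTests


class RMSDBenchmark:
    """Hypothetical ASV benchmark timing an RMSD analysis over a short trajectory."""

    def setup(self):
        # setup() runs before each timing run and is excluded from the measurement
        self.universe = mda.Universe(PSF, DCD)
        self.atoms = self.universe.select_atoms("backbone")

    def time_rmsd(self):
        # ASV measures the wall-clock time of methods prefixed with time_
        rms.RMSD(self.atoms, self.atoms).run()
```

Running `asv run` over a range of commits would then produce per-commit timings for this benchmark that feed into trend graphs like those on https://www.mdanalysis.org/benchmarks/.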
cProfile + RunSnake:
After detecting a performance regression using ASV, I will use cProfile to profile the affected code and identify the functions consuming the most execution time.
To make profiling insights more actionable, I will use RunSnake, a GUI tool that visualizes cProfile output as a "square map" or as sortable tables, making it easier to drill down into hotspots and to compare pre- and post-regression commits.
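A minimal sketch of such a profiling session, with a placeholder workload() standing in for the regressed MDAnalysis call under investigation:

```python
import cProfile
import pstats


def workload():
    # placeholder for the regressed MDAnalysis call being investigated
    return sum(i * i for i in range(10**6))


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Dump stats to a file RunSnake can open (e.g. `runsnake workload.prof`),
# and print the 10 most expensive calls by cumulative time to the terminal.
profiler.dump_stats("workload.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

Comparing two such dumps, one taken before the regression and one after, makes it easy to see which functions grew in cumulative time.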
Parallelization:
In cases where performance can be improved through parallel processing, I will explore parallelizing AnalysisBase classes using Python’s multiprocessing, joblib, or existing MDAnalysis parallel frameworks. I am currently working on two PRs that parallelize AnalysisBase classes, giving me direct experience with performance tuning in this context.
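For illustration, here is a minimal sketch of how the split-apply-combine parallel backends introduced in MDAnalysis 2.8.0 are enabled on supported analysis classes, assuming the installed version supports the multiprocessing backend for RMSD and again using the MDAnalysisTests data files:

```python
import MDAnalysis as mda
from MDAnalysis.analysis import rms
from MDAnalysisTests.datafiles import PSF, DCD

if __name__ == "__main__":  # guard required by multiprocessing's spawn start method
    u = mda.Universe(PSF, DCD)
    rmsd = rms.RMSD(u.select_atoms("backbone"))
    # Trajectory frames are split across 4 worker processes and the
    # per-frame results are combined afterwards.
    rmsd.run(backend="multiprocessing", n_workers=4)
    print(rmsd.results.rmsd[:5])
```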
Research and Prioritization:
Community Input and Usage Data:
I will analyze the GitHub issue tracker and discussions, and engage with the MDAnalysis community (e.g., via Discord and the forums) to identify the most commonly used features. This will help prioritize which modules and functions to benchmark and optimize first, ensuring the project delivers maximum user impact.
Regression Search and Handling Non-Monotonic Trends:
ASV's regression search uses binary search to locate the commit that introduced a regression. I will complement this with manual inspection or alternative methods when the performance history is non-monotonic.
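To make the failure mode concrete, here is a toy illustration (not ASV's actual implementation) of why bisecting a non-monotonic timing series can point at the wrong commit:

```python
# Toy model: one timing per commit. The persistent regression lands at
# index 5, but there is a transient spike at index 3.
timings = [1.0, 1.0, 1.0, 2.0, 1.0, 2.5, 2.5, 2.5]


def bisect_regression(timings, baseline=1.0, tol=0.5):
    """Find the first 'slow' commit, assuming timings grow monotonically."""
    lo, hi = 0, len(timings) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if timings[mid] > baseline + tol:
            hi = mid       # looks slow: regression is at mid or earlier
        else:
            lo = mid + 1   # looks fast: regression must be later
    return lo


# Reports index 3 (the transient spike), not index 5 (the real regression).
print(bisect_regression(timings))
```

This is why a flagged commit needs to be re-benchmarked, or its neighboring commits inspected, before concluding that it actually introduced the regression.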
Feasibility and Technical Readiness
This approach is technically feasible because MDAnalysis already integrates ASV, and I have been contributing to the codebase since December 2024, with 7 merged PRs. My familiarity with MDAnalysis internals, my ongoing work on parallelization, and my experience with Python performance tools (ASV, cProfile) put me in a strong position to deliver these outcomes effectively.
I am also prepared to research optimization techniques (e.g., Cython, NumPy vectorization) where needed to improve performance while preserving the behavior of the existing code.
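As an example of the kind of optimization I would research, here is a toy comparison of a naive Python double loop against a NumPy-broadcast version of the same pairwise-distance computation. (MDAnalysis itself ships an optimized MDAnalysis.lib.distances.distance_array for this, so the example is purely illustrative of the technique.)

```python
import numpy as np


def distances_loop(a, b):
    # Naive O(N*M) Python loops: easy to read, slow for large arrays
    out = np.empty((len(a), len(b)))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i, j] = np.sqrt(((x - y) ** 2).sum())
    return out


def distances_vectorized(a, b):
    # Broadcasting replaces both Python loops with a single C-level pass
    diff = a[:, None, :] - b[None, :, :]          # shape (N, M, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))


a = np.random.rand(100, 3)
b = np.random.rand(100, 3)
assert np.allclose(distances_loop(a, b), distances_vectorized(a, b))
```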