GSoC 2024 ‐ Aman Bhansali


About me

I'm Aman Bhansali, an undergraduate from the Indian Institute of Technology Jodhpur, originally from Purnia, Bihar, India. I am passionate about mathematics and programming, with a strong interest in numerical linear algebra and scientific computing. My experience in areas like computer vision and NLP has deepened my understanding of the role mathematical models play in algorithm development. I am also fascinated by applications of machine learning to real-world challenges.

Project overview

In linear algebra, BLAS (Basic Linear Algebra Subprograms) routines are the standard building blocks for performing operations on matrices and vectors. They are widely used in the development of advanced linear algebra software owing to their efficiency, portability, and extensive accessibility. The major goals of the project are:

  • Add JavaScript implementations.
  • Add C implementations.
  • Re-implement BLAS subroutines by porting the existing lapack-netlib Fortran implementation to free-form Fortran 95.
  • Write Node.js bindings to allow calling BLAS routines in compiled C/Fortran from JavaScript.

The approach I followed for the project was the same as outlined in the proposal: start with the Level 1 routines. Within each level, the priority is real before complex data types and, within each, double precision before single precision, for both the real and complex variants. I had proposed completing each routine in two PRs, one for the JavaScript implementation and the other for the C/Fortran implementation, but with practice and understanding I did the entire implementation in a single PR, which also reduced review time.


Project recap

During the community bonding period, I began coding by focusing on the BLAS Level 1 routines, as outlined in tracking issue #2039. I initially concentrated on mastering the simpler single- and double-precision routines before advancing to the complex ones. The complex routines were challenging to grasp at first, but after working on a few packages, I became used to them. The C/Fortran implementations were likewise initially challenging, but with understanding and practice they became easier. As a result, the real single- and double-precision routines were implemented entirely within the planned timeframe. The complex single- and double-precision routines that are independent of other BLAS routines have also been fully implemented, including their C and Fortran implementations. However, because the tooling required for integrating packages across C and Fortran is still under development, a few of the complex Level 1 routines remain.

Native Implementation Level 1 Signature:

sswap( N, x, strideX, y, strideY )

Ndarray Implementation Level 1 Signature:

sswap( N, x, strideX, offsetX, y, strideY, offsetY )

The additional offset parameters in the ndarray implementation give the user the freedom to select the starting index for the operation, in addition to the stride.
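
For instance, here is a minimal sketch of both signatures in use, assuming stdlib's package layout (@stdlib/blas/base/sswap and @stdlib/array/float32); the values are purely illustrative:

var Float32Array = require( '@stdlib/array/float32' );
var sswap = require( '@stdlib/blas/base/sswap' );

var x = new Float32Array( [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 ] );
var y = new Float32Array( [ 7.0, 8.0, 9.0, 10.0, 11.0, 12.0 ] );

// Native API: swap the first three elements of `x` and `y`:
sswap( 3, x, 1, y, 1 );

// ndarray API: swap every other element of `x`, starting at index 1,
// with consecutive elements of `y`, starting at index 3:
sswap.ndarray( 3, x, 2, 1, y, 1, 3 );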

After completing the Level 1 routines, I switched to the Level 2 routines, which involve matrix-vector operations. At this stage, things became more interesting and, at the same time, more challenging. My approach remained systematic: start with the real single- and double-precision routines, beginning with their native JavaScript implementations based on the reference Fortran implementation. Here, code and research became equally important, leading to an extensive round of refactoring and discussions with Athan, during which we worked out a way to implement modern BLAS routines. The existing reference lapack-netlib implementation is Fortran-based and hence follows a column-major layout by default. In JavaScript, however, we can give the user the freedom to choose whether to pass the matrix in row-major or column-major order. This flexibility matters because matrices are represented as flat arrays in JavaScript, ensuring contiguous memory allocation. The key innovation, however, comes after this.

Let's take a small example:

A = [ 1, 2, 3 ]
    [ 4, 5, 6 ]                 (2x3 matrix)
A = [ 1, 2, 3, 4, 5, 6 ]        (row-major)
A = [ 1, 4, 2, 5, 3, 6 ]        (column-major)

Native Implementation Signature:

sgemv( trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )           (reference Fortran)
sgemv( order, trans, M, N, alpha, A, lda, x, strideX, beta, y, strideY )    (JavaScript, with the added order parameter)
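
As a quick illustration using the 2x3 matrix from the example above, here is a hedged sketch of the native call; the package path and the 'row-major' / 'no-transpose' flag strings follow stdlib conventions and should be treated as assumptions:

var Float32Array = require( '@stdlib/array/float32' );
var sgemv = require( '@stdlib/blas/base/sgemv' );

// 2x3 matrix [ [1,2,3], [4,5,6] ] stored in row-major order (lda = 3):
var A = new Float32Array( [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 ] );
var x = new Float32Array( [ 1.0, 1.0, 1.0 ] );
var y = new Float32Array( [ 0.0, 0.0 ] );

// Compute y = 1.0*A*x + 0.0*y:
sgemv( 'row-major', 'no-transpose', 2, 3, 1.0, A, 3, x, 1, 0.0, y, 1 );
// y => [ 6.0, 15.0 ]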

Similar to the ndarray implementations of the Level 1 routines, there is an offset parameter associated with the matrix and each vector. Additionally, there are use cases where we need to perform operations on a specific submatrix within a larger global matrix. To accommodate this, our ndarray implementation includes the following parameters:

  • sa1: stride along the first dimension of matrix A.
  • sa2: stride along the second dimension of matrix A.

These parameters give users full flexibility to utilize the BLAS implementation as needed, even allowing for negative stride values depending on the use case. At each stage of implementation, the idea was to reduce code duplication and maintain cache optimality.

Ndarray Implementation Signature:

sgemv( trans, M, N, alpha, A, strideA1, strideA2, offsetA, x, strideX, offsetX, beta, y, strideY, offsetY )

A short example to understand how stride parameters for matrix A operate:

A = [ 999.0, 6.0,   999.0, 5.0,   999.0 ], 
    [ 999.0, 999.0, 999.0, 999.0, 999.0 ], 
    [ 999.0, 4.0,   999.0, 3.0,   999.0 ],
    [ 999.0, 999.0, 999.0, 999.0, 999.0 ], 
    [ 999.0, 2.0,   999.0, 1.0,   999.0 ]

A = [ 999.0, 6.0, 999.0, 5.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 4.0, 999.0, 3.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 2.0, 999.0, 1.0, 999.0 ] (row-major)
Now suppose the submatrix required for the operation is:
A = [ 1, 2 ],
    [ 3, 4 ],
    [ 5, 6 ]
Hence, the stride parameters, along with the offset, provide the flexibility needed here:
sa1 = -10
sa2 = -2
offsetA = 23

With the initial offset, we land on the element 1 in the flattened array, and by stepping with sa1 (1 --> 3 --> 5) and sa2 (1 --> 2) we traverse the submatrix and perform the required operation seamlessly.
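
Putting this together, here is a hedged sketch of the corresponding ndarray call; the package path and the 'no-transpose' flag string follow stdlib conventions and are assumptions here:

var Float32Array = require( '@stdlib/array/float32' );
var sgemv = require( '@stdlib/blas/base/sgemv' );

// The 5x5 global matrix from above, flattened in row-major order:
var A = new Float32Array( [ 999.0, 6.0, 999.0, 5.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 4.0, 999.0, 3.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 999.0, 2.0, 999.0, 1.0, 999.0 ] );

var x = new Float32Array( [ 1.0, 1.0 ] );
var y = new Float32Array( [ 0.0, 0.0, 0.0 ] );

// Compute y = 1.0 * A_sub * x + 0.0 * y, where A_sub = [ [1,2], [3,4], [5,6] ]:
sgemv.ndarray( 'no-transpose', 3, 2, 1.0, A, -10, -2, 23, x, 1, 0, 0.0, y, 1, 0 );
// y => [ 3.0, 7.0, 11.0 ]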

Once the implementation of a package was ready, the next step was adding test suites. However, most existing BLAS test suites were neither comprehensive nor extensive. The objective therefore shifted toward developing a thorough test suite for each package, ensuring 100% code coverage; I began by designing examples on paper. Also, given our unique ndarray implementation, covering all edge cases was important, particularly validating how the custom strides, offsets, and other parameters interact. Each possible combination of the parameters involved was checked, and we can now claim a robust test suite for our routines.

The same approach carried over to the other Level 2 routines: the operation may look similar, but the matrix representation varies across packages. For example:

  • SGEMV - where A is an MxN general matrix.
  • SSYMV - where A is an NxN symmetric matrix.
  • STPMV - where A is an NxN triangular matrix supplied in packed form.
  • STBMV - where A is an NxN triangular band matrix.

After getting a few of the real single- and double-precision routines over the finish line, I moved on to implementing a Level 3 routine: matrix-matrix multiplication (sgemm). This proved to be one of the most complex and challenging packages I worked on. Most existing implementations stick closely to the native Fortran approach, but we went further by adding ndarray support. This made the task even more complex, as we had to balance flexibility with performance. The key challenge was ensuring that our ndarray implementation could smoothly handle different matrix formats while still matching the performance of the Fortran-based routines.

Ndarray Implementation Signature:

sgemm( transA, transB, M, N, K, alpha, A, strideA1, strideA2, offsetA, B, strideB1, strideB2, offsetB, beta, C, strideC1, strideC2, offsetC )

Maintaining cache optimality here was difficult, since around five parameters vary simultaneously: transA, transB, [strideA1, strideA2], [strideB1, strideB2], and [strideC1, strideC2]. This led to an extensive discussion phase, and with Athan's help the package was eventually implemented. The benchmarks and test suites are both highly extensive and cover all possible combinations of the varying parameters.
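
For illustration, here is a minimal sketch of an sgemm ndarray call on small row-major matrices; the package path and flag strings again follow stdlib conventions and are assumptions:

var Float32Array = require( '@stdlib/array/float32' );
var sgemm = require( '@stdlib/blas/base/sgemm' );

// Compute C = 1.0*A*B + 0.0*C for 2x2 matrices stored in row-major order:
var A = new Float32Array( [ 1.0, 2.0, 3.0, 4.0 ] );
var B = new Float32Array( [ 5.0, 6.0, 7.0, 8.0 ] );
var C = new Float32Array( [ 0.0, 0.0, 0.0, 0.0 ] );

// Row stride is 2, column stride is 1, offsets are 0:
sgemm.ndarray( 'no-transpose', 'no-transpose', 2, 2, 2, 1.0, A, 2, 1, 0, B, 2, 1, 0, 0.0, C, 2, 1, 0 );
// C => [ 19.0, 22.0, 43.0, 50.0 ]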

This summarizes my entire workflow, though there were times when, to reduce code duplication, we created independent packages and then refactored the existing implementations on top of them. This serves as a quick recap of my work; moreover, for almost every package there is a story behind the implementation.

Completed work

Merged PRs:

Open PRs:

Current state

As previously mentioned, we have fully implemented all real single- and double-precision Level 1 routines, including their JavaScript, C, and Fortran versions. For the complex single- and double-precision routines, approximately 5 out of 11 remain in each category, with infrastructure work ongoing as discussed. For Level 2 routines, we have implemented nearly all real single- and double-precision routines in JavaScript, while their C and Fortran implementations are pending the ongoing tooling development. For Level 3 routines, we have implemented the single- and double-precision matrix multiplication routines (sgemm and dgemm), with approximately 5 packages left in each category. Issue #2039 (RFC: Add BLAS bindings and implementations for linear algebra) tracks the progress of these implementations.

What remains

As previously mentioned, 5 out of the 11 complex single- and double-precision Level 1 routines still need to be implemented; we can complete these once the tooling work is done. In Level 2, we need to implement the complex single- and double-precision routines, as well as the C and Fortran implementations for the real single- and double-precision routines. For Level 3, around 5 packages remain, and the gemm routines await their C and Fortran implementations. I will continue working to complete the implementation of the entire BLAS routine set.

Challenges and lessons learned

I faced a few challenges, and I would like to discuss some of them here. Maintaining stdlib's code standards was the first challenge I encountered after understanding the overall code flow. When multiple people work on related topics, maintaining standards becomes crucial to ensure a consistent user experience across all packages. Gradually, as I worked on multiple packages and carefully considered the feedback on each of my PRs, I was able to overcome this challenge. Special thanks to Athan for correcting and guiding me at every stage. Another challenge was that, as discussed above, some packages depend on other packages, and adding their C/Fortran implementations was not yet possible; this was unexpected at the start of the project, and infrastructure work is needed to support it. Since the project was extensive and required discussion at numerous stages, there was never a moment when I felt blocked: there was always something to work on.

The most important lessons that I learned throughout were:

  • Doing something is not enough; striving for perfection is what turns ordinary into excellence.
  • Always seek improvement—continuously think, read, and explore new possibilities to optimize your work and expand its use cases.  

Conclusion

Overall, the entire program was a great experience filled with learning and execution. I learned how the features and operations I use every day are implemented under the hood, and how rigorous the testing and benchmarking process is. Throughout this journey, I acquired knowledge in various areas, along with the art of writing clean code. I would like to thank my mentors, with special thanks to Athan, who provided unwavering support and guidance whenever I encountered challenges. I look forward to continuing to contribute to stdlib, completing the remaining packages, and eventually helping to build the infrastructure on which further applications and algorithms can be crafted.
