-
Note that the interior-point method on GPUs relies on a condensation strategy, which is more susceptible to ill-conditioning than the full-space approaches used on CPUs. Because of this, in double precision the convergence behavior becomes less reliable when `tol` is set below 1e-4. In single precision the practical limit is probably around 1e-2, which may not be usable for many instances. The documentation will be updated in a future PR.
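For intuition on those thresholds: a direct factorization of an ill-conditioned KKT system typically delivers only about half the working digits, i.e. roughly `sqrt(eps)` (as also noted further down in this thread), and condensation erodes that further. A quick check in Julia:

```julia
# Achievable IPM tolerance is roughly sqrt(eps) of the working precision;
# condensation worsens the conditioning, pushing the practical limit higher
# (about 1e-4 in Float64 and 1e-2 in Float32, as stated above).
sqrt(eps(Float64))  # ≈ 1.5e-8
sqrt(eps(Float32))  # ≈ 3.5e-4
```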
-
You are right. I tested your hypothesis: I converted the input data manually to the required types (CuArray of Int32 or Float32) for GPU optimization, and I also made a CPU version without CuArray. Both the non-condensed CPU version and the GPU version fail to compute AC OPF in single precision. The CPU version fares better, however, finding an objective of 1.09e5 instead of the 1.07e5 solution on the 89-bus system (with results worsening as network size increases). GPU FP32 appears almost completely unusable. As these findings suggest that FP32 won't find use in this kind of optimization, I am closing this issue, but I have some questions to help with future hardware decisions and my understanding of the software. I am tagging @amontoison as some of them are probably his expertise.

Does cuDSS (or any other GPU solver used by MadNLP) use 64-bit tensor cores (TF64) in any way, such as those available in the Ampere architecture? Is cuSOLVER the same thing as the CuCholeskySolver solver in MadNLP? I got the impression that cuSOLVER is an umbrella for several solvers such as Cholesky, RF or GLU. While cuDSS currently supports only a single GPU, I noticed there are papers using multiple GPUs with cuSOLVER. Does any solver here support multiple GPUs?

I am really looking forward to future upgrades and documentation expansion. I cannot express enough how incredible the work you have done is. I will try to support you by submitting issues if I find any.
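For reference, the manual conversion described above presumably looks something like this minimal sketch (the array names are made up for illustration, not the actual OPF data structures):

```julia
using CUDA

# Illustrative placeholders for problem data (not the real OPF fields).
bus_index = collect(1:89)
p_demand  = rand(89)

# GPU, single precision: move data into Int32 / Float32 CuArrays.
bus_index_gpu = CuArray{Int32}(bus_index)
p_demand_gpu  = CuArray{Float32}(p_demand)

# CPU, single precision: same element types, but plain Arrays (no CuArray).
p_demand_f32 = Float32.(p_demand)
```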
-
We never got good results with single precision for solving nonlinear optimization problems. The sparse Cholesky factorization (CuCholeskySolver) is in CUSOLVER, but it is no longer documented; I discovered it by looking into the header files, and you can't find it online. Regarding multiple GPUs, the cusolverMg library within CUSOLVER can solve dense linear systems on multiple devices. For TF64 support in cuDSS, I don't know what they do internally (it is more or less a black box), but because it requires recent GPU architectures and CUDA 12.x, I strongly suspect they want to use the latest hardware accelerations available.
-
Currently, the computation times for a 13k-bus system are 27 s for JuMP, 12.5 s for AMPL, 8 s for ExaModels on CPU, 5.0 s on a GTX 1060 and 2.2 s on an RTX 3090. I wonder whether even the RTX 3090 is saturated, and it is not a good double-precision GPU. It could even happen that an A100 takes longer due to the slower single-core performance of EPYC CPUs compared to desktop/workstation CPUs. I am basically trying to determine whether there is any benefit from using compute X100 chips (e.g. A100, H100 or V100) compared to good gaming chips (which have more than 10x lower double-precision compute performance). Next week I will connect to our university supercomputer and try it on an A100 (with a Zen 3 EPYC). I will also try even bigger networks to try to saturate it. I will make sure to report back to you here.
-
Thanks @KSepetanc for sharing these in detail. You may play around with the `tol` parameter of MadNLP to see how far you can push in single precision, but my expectation is 1e-3 at most. As @amontoison mentioned, IPM causes quite serious ill-conditioning, so we can typically solve only up to sqrt(eps), and condensation makes it even worse on the GPU. We have primarily tested the solver on A*/V*/H*-class hardware so far, but experience on consumer GPUs would be quite valuable to us. Please feel free to give us updates.

Multi-GPU support is something we're envisioning, but nothing is concrete at this point. We have a specialized implementation for SC OPFs based on a Schur complement approach. One of the primary reasons for going multi-GPU is that we want more memory rather than more cores. We have tried to go for larger problems by considering multi-period OPFs, and it seems that most of the time is spent GCing device memory. As you pointed out in one of the discussions, there seem to be some inefficiencies in managing device memory right now.
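On the device-memory point, one workaround worth trying (a sketch using the public CUDA.jl memory utilities, not a MadNLP feature) is to release buffers explicitly between solves instead of waiting for implicit GC pressure:

```julia
using CUDA

# Force a Julia GC pass so finalizers of dead CuArrays run,
# then hand the freed blocks back to the CUDA memory pool.
GC.gc()
CUDA.reclaim()

# Optional: print how much device memory is currently in use / pooled.
CUDA.memory_status()
```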
-
Just to add to what @amontoison said, low-accuracy IPM solutions frequently do not tell you much. As you will appreciate, IPMs work by approximately tracking a trajectory (the central path) to a particular solution. The path itself is often very bendy, and a low-accuracy solution can lie well short of the one desired. Frequently, IPMs are used as the first stage of the solution process to try to predict the active/binding constraints at a solution; thereafter a "crossover" is used to refine the solution. The main issue is that a low-accuracy solution often gives a very poor prediction of the optimal active set. There are additional issues for nonlinear problems (such as the lack of a strictly complementary solution) that can make things even harder. Normally, I wouldn't recommend single precision for anything but a properly scaled linear program; double is better, and sometimes one may even need to simulate quad precision in the latter stages for nasty nonlinear problems. If you can make headway with GPUs, that would be very useful.
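For readers less familiar with the terminology, the central path being tracked here is the family of solutions of the perturbed (barrier) optimality conditions, written in a standard generic form for $\min f(x)$ s.t. $c(x) = 0$, $x \ge 0$:

$$
\nabla f(x) - \nabla c(x)^{\top} y - z = 0, \qquad c(x) = 0, \qquad x_i z_i = \mu \;\; (i = 1,\dots,n), \qquad x, z > 0 .
$$

As $\mu \to 0$ these solutions trace the central path toward a KKT point; the "crossover" mentioned above then tries to identify the optimal active set exactly and polish the solution.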
-
@nimgould, I see now the importance of high precision for IPM. In fact, I am interested in whether quad precision can be simulated, as I have a nasty nonlinear (ill-conditioned) problem. One of my PhD goals was to solve a bilevel optimization where the lower level is AC OPF, so I used a convex QCQP (quadratically constrained quadratic program) local AC OPF approximation (a second-order Taylor expansion with a few additional approximations and relaxations) in the lower level, converted the lower level to SOCP (second-order cone) form, and used the KKT conditions (primal SOCP, dual SOCP and SOCP complementarity conditions) to reduce the bilevel problem to a single-level equivalent. As IPM solves a KKT system during the solution procedure, there is serious ill-conditioning, because the KKT conditions of this KKT-constrained system do not hold due to linear dependency. I got around this issue by smoothing the SOC complementarity conditions. The result is that IPM convergence can usually be achieved for up to 50-bus x 24-hour systems, which is an improvement to some extent. However, maybe we can achieve more with quad precision. I will reference the papers: local AC OPF, bilevel model, bilevel solution techniques benchmark and SOC smoothing technique (section 4).

@sshin23 the paper you linked (which you also authored) got me going to find better solutions than AMPL. That is how I found out about ExaModels and MadNLP and, with help, got them working at a proof-of-concept level. It was my intention to get in contact with the authors, which I inadvertently did by creating a GitHub issue. Creating a custom solver is mostly beyond my capacity, as I work mostly alone (I am from Croatia, which is a small country) with a primary focus on using existing solvers and packages to develop efficient equations and models for electrical power engineering problems (e.g. industrial-grade SC OPF). It is my understanding that the solver from the paper is not yet available as a user tool. I see that cuDSS is scheduled to have multi-GPU support in the future (announcement). Could you comment on whether you think a parallel cuDSS could be competitive with your approach? Also, do you think the condensation approach could or should be avoided in a future MadNLP release, given that cuDSS is actually a sparse solver that will get multi-GPU support and that the benefit of avoiding condensation is better convergence?
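As a generic illustration of the kind of smoothing involved (the scalar analogue only, not necessarily the SOC-specific technique from the paper referenced above), a complementarity pair $a \ge 0$, $b \ge 0$, $ab = 0$ can be replaced by a smoothed Fischer-Burmeister equation:

$$
\varphi_\mu(a,b) = a + b - \sqrt{a^2 + b^2 + 2\mu} = 0 ,
$$

which for $\mu > 0$ is smooth and equivalent to $a > 0$, $b > 0$, $ab = \mu$, recovering exact complementarity as $\mu \to 0$.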
-
@KSepetanc Well, thanks for reaching out to us :) As our GPU capabilities become more mature, we want power users to test our tools and provide us with constructive comments. You might be the ideal person for that role! Please share any comments/suggestions with us.

Simulating quadruple precision is a great idea. In fact, I played with Quadmath.jl and DoubleFloats.jl recently, on CPUs, together with LDLFactorizations.jl. Quadruple precision works perfectly with Quadmath.jl, and I expect it would be the same for DoubleFloats.jl once these issues are resolved: JuliaMath/DoubleFloats.jl#199, JuliaMath/DoubleFloats.jl#200.

Condensation will be necessary on GPUs. What we mean by "condensation" here doesn't mean that our linear system is dense; the system is still sparse after the "condensation". It just means that we are eliminating the inequality block. Without condensation, we cannot guarantee positive definiteness of the KKT system, and there is no good way to solve indefinite sparse systems on GPUs efficiently, as far as we know. So, at this point, we expect that condensation will be needed for any GPU implementation of nonlinear optimization algorithms with direct linear solvers.
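To make the trade-off concrete, here is the usual elimination in generic IPM notation (a sketch; MadNLP's actual condensed KKT formulation may differ in its details). With slacks $s$ for the inequalities, diagonal barrier matrices $\Sigma_x, \Sigma_s \succ 0$, Lagrangian Hessian $W$ and constraint Jacobian $J$, the full-space step solves

$$
\begin{bmatrix} W + \Sigma_x & J^{\top} \\ J & -\Sigma_s^{-1} \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix},
$$

and eliminating the second block row (using $\Delta y = \Sigma_s (J \Delta x - r_2)$) gives the condensed system

$$
\left( W + \Sigma_x + J^{\top} \Sigma_s J \right) \Delta x = r_1 + J^{\top} \Sigma_s r_2 .
$$

The condensed matrix stays sparse and, with suitable regularization, is positive definite, which is what allows a sparse Cholesky solver (cuDSS / CuCholeskySolver) to be used on the GPU. The cost is that the barrier entries of $\Sigma_s$, which range from roughly $\mu$ to $1/\mu$ near convergence, now appear inside $J^{\top} \Sigma_s J$, worsening the conditioning relative to the full-space system.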
-
@sshin23 thank you for the info, and sorry for the late reply. I have been preparing a project application, "Coordinated Hydrogen and Electricity SyStems and markets", for which we expect to get funding. Part of it is about ill-conditioned bilevel models with AC OPF and OGF, and we could definitely write papers with quad or Double64 (extended precision) optimization. If you could help us run it, we are open to cooperation. I haven't yet run on our supercomputer to compare the consumer GPU with the compute GPU; I expect I will have more time for this next week.
-
@sshin23 I got the results (computation times) for different hardware and AD/solver configurations. The A100 GPU is about 20% faster on a 78k-bus AC OPF compared to the RTX 3090 (albeit the 3090 is paired with a CPU with slightly faster single-core performance). This indicates that compute is not the bottleneck despite the computation being in double precision; presumably memory bandwidth is. This is an important result, as it means that consumer GPUs can be used for all but the most demanding cases (due to their lower memory capacity). I will email you the full results tomorrow.
-
Performant double-precision GPUs (such as the V100 or A100) are very expensive and not available to most scientific communities. In my opinion, we should support Float32 GPU optimization.

Currently, there is a `convert_data` function to prepare data for different backends, but it has no option to prepare data for 32-bit optimization (despite `ExaCore()` having support for it). Running the AC OPF example from the documentation with the `T=Float32` setting and MadNLP with CuCholeskySolver results in a restoration failure in the second iteration.

Additionally, `convert_data` is not a documented function. If possible, it should be documented and its usage explained in the provided example codes.