-
Note that the interior-point method on GPUs relies on a condensation strategy, which is more susceptible to ill-conditioning than the full-space approaches used on CPUs. Because of this, in double precision the convergence behavior becomes less reliable when `tol` is set below 1e-4. In single precision the practical limit is probably around 1e-2, which may not be usable for many instances. The documentation will be updated in a future PR.
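For intuition on those thresholds: a direct factorization of an ill-conditioned KKT system typically delivers only about half the working digits, i.e. roughly `sqrt(eps)` (as also noted further down in this thread), and condensation erodes that further. A quick check in Julia:

```julia
# Achievable IPM tolerance is roughly sqrt(eps) of the working precision;
# condensation worsens the conditioning, pushing the practical limit higher
# (about 1e-4 in Float64 and 1e-2 in Float32, as stated above).
sqrt(eps(Float64))  # ≈ 1.5e-8
sqrt(eps(Float32))  # ≈ 3.5e-4
```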
-
You are right. I tested your hypothesis: I converted the input data manually to the required types (CuArray of Int32 or Float32) for GPU optimization, and I also made a CPU version without CuArray. Both the non-condensed CPU version and the GPU version fail to compute AC OPF in single precision. The CPU version fares better, however, finding an objective of 1.09e5 instead of the 1.07e5 solution on the 89-bus system (with results worsening as network size increases). GPU FP32 appears almost completely unusable. As these findings suggest that FP32 won't find use in this kind of optimization, I am closing this issue, but I have some questions to help with future hardware decisions and my understanding of the software. I am tagging @amontoison as some of them are probably his expertise.

Does cuDSS (or any other GPU solver used by MadNLP) use 64-bit tensor cores (TF64) in any way, such as those available in the Ampere architecture? Is cuSOLVER the same thing as the CuCholeskySolver solver in MadNLP? I got the impression that cuSOLVER is an umbrella for several solvers such as Cholesky, RF or GLU. While cuDSS currently supports only a single GPU, I noticed there are papers using multiple GPUs with cuSOLVER. Does any solver here support multiple GPUs?

I am really looking forward to future upgrades and documentation expansion. I cannot express enough how incredible the work you have done is. I will try to support you by submitting issues if I find any.
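For reference, the manual conversion described above presumably looks something like this minimal sketch (the array names are made up for illustration, not the actual OPF data structures):

```julia
using CUDA

# Illustrative placeholders for problem data (not the real OPF fields).
bus_index = collect(1:89)
p_demand  = rand(89)

# GPU, single precision: move data into Int32 / Float32 CuArrays.
bus_index_gpu = CuArray{Int32}(bus_index)
p_demand_gpu  = CuArray{Float32}(p_demand)

# CPU, single precision: same element types, but plain Arrays (no CuArray).
p_demand_f32 = Float32.(p_demand)
```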
-
We never got good results with single precision for solving nonlinear optimization problems. The sparse Cholesky factorization (CuCholeskySolver) is in CUSOLVER, but it is no longer documented; I discovered it by looking into the header files, and you can't find it online. Regarding multiple GPUs, the cusolverMg library within CUSOLVER can solve dense linear systems on multiple devices. For TF64 support in cuDSS, I don't know what they do internally (it is more or less a black box), but because it requires recent GPU architectures and CUDA 12.x, I strongly suspect they want to use the latest hardware accelerations available.
-
Currently, the computation times for a 13k-bus system are 27 s for JuMP, 12.5 s for AMPL, 8 s for ExaModels on CPU, 5.0 s on a GTX 1060 and 2.2 s on an RTX 3090. I wonder whether even the RTX 3090 is saturated, and it is not a good double-precision GPU. It could even happen that an A100 takes longer due to the slower single-core performance of EPYC CPUs compared to desktop/workstation CPUs. I am basically trying to determine whether there is any benefit from using compute X100 chips (e.g. A100, H100 or V100) compared to good gaming chips (which have more than 10x lower double-precision compute performance). Next week I will connect to our university supercomputer and try it on an A100 (with a Zen 3 EPYC). I will also try even bigger networks to try to saturate it. I will make sure to report back to you here.
-
Thanks @KSepetanc for sharing these in detail. You may play around with the `tol` parameter of MadNLP to see how far you can push in single precision, but my expectation is 1e-3 at most. As @amontoison mentioned, IPM causes quite serious ill-conditioning, so we can typically solve only up to sqrt(eps), and condensation makes it even worse on the GPU. We have primarily tested the solver on A*/V*/H*-class hardware so far, but experience on consumer GPUs would be quite valuable to us. Please feel free to give us updates.

Multi-GPU support is something we're envisioning, but nothing is concrete at this point. We have a specialized implementation for SC OPFs based on a Schur complement approach. One of the primary reasons for going multi-GPU is that we want more memory rather than more cores. We have tried to go for larger problems by considering multi-period OPFs, and it seems that most of the time is spent GCing device memory. As you pointed out in one of the discussions, there seem to be some inefficiencies in managing device memory right now.
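On the device-memory point, one workaround worth trying (a sketch using the public CUDA.jl memory utilities, not a MadNLP feature) is to release buffers explicitly between solves instead of waiting for implicit GC pressure:

```julia
using CUDA

# Force a Julia GC pass so finalizers of dead CuArrays run,
# then hand the freed blocks back to the CUDA memory pool.
GC.gc()
CUDA.reclaim()

# Optional: print how much device memory is currently in use / pooled.
CUDA.memory_status()
```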
-
Just to add to what @amontoison said, low-accuracy IPM solutions frequently do not tell you much. As you will appreciate, IPMs work by approximately tracking a trajectory (the central path) to a particular solution. The path itself is often very bendy, and a low-accuracy solution can lie well short of the one desired. Frequently, IPMs are used as the first stage of the solution process to try to predict the active/binding constraints at a solution; thereafter a "crossover" is used to refine the solution. The main issue is that a low-accuracy solution often gives a very poor prediction of the optimal active set. There are additional issues for nonlinear problems (such as the lack of a strictly complementary solution) that can make things even harder. Normally, I wouldn't recommend single precision for anything but a properly scaled linear program; double is better, and sometimes one may even need to simulate quad precision in the latter stages for nasty nonlinear problems. If you can make headway with GPUs, that would be very useful.
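For readers less familiar with the terminology, the central path being tracked here is the family of solutions of the perturbed (barrier) optimality conditions, written in a standard generic form for $\min f(x)$ s.t. $c(x) = 0$, $x \ge 0$:

$$
\nabla f(x) - \nabla c(x)^{\top} y - z = 0, \qquad c(x) = 0, \qquad x_i z_i = \mu \;\; (i = 1,\dots,n), \qquad x, z > 0 .
$$

As $\mu \to 0$ these solutions trace the central path toward a KKT point; the "crossover" mentioned above then tries to identify the optimal active set exactly and polish the solution.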
-
@nimgould, I see now the importance of high precision for IPM. In fact, I am interested in whether quad precision can be simulated, as I have a nasty nonlinear (ill-conditioned) problem. One of my PhD goals was to solve a bilevel optimization where the lower level is AC OPF, so I used a convex QCQP (quadratically constrained quadratic program) local AC OPF approximation (a second-order Taylor expansion with a few additional approximations and relaxations) in the lower level, converted the lower level to SOCP (second-order cone) form, and used the KKT conditions (primal SOCP, dual SOCP and SOCP complementarity conditions) to reduce the bilevel problem to a single-level equivalent. As IPM solves a KKT system during the solution procedure, there is serious ill-conditioning, because the KKT conditions of this KKT-constrained system do not hold due to linear dependency. I got around this issue by smoothing the SOC complementarity conditions. The result is that IPM convergence can usually be achieved for up to 50-bus x 24-hour systems, which is an improvement to some extent. However, maybe we can achieve more with quad precision. I will reference the papers: local AC OPF, bilevel model, bilevel solution techniques benchmark and SOC smoothing technique (section 4).

@sshin23 the paper you linked (which you also authored) got me going to find better solutions than AMPL. That is how I found out about ExaModels and MadNLP and, with help, got them working at a proof-of-concept level. It was my intention to get in contact with the authors, which I inadvertently did by creating a GitHub issue. Creating a custom solver is mostly beyond my capacity, as I work mostly alone (I am from Croatia, which is a small country) with a primary focus on using existing solvers and packages to develop efficient equations and models for electrical power engineering problems (e.g. industrial-grade SC OPF). It is my understanding that the solver from the paper is not yet available as a user tool. I see that cuDSS is scheduled to have multi-GPU support in the future (announcement). Could you comment on whether you think a parallel cuDSS could be competitive with your approach? Also, do you think the condensation approach could or should be avoided in a future MadNLP release, given that cuDSS is actually a sparse solver that will get multi-GPU support and that the benefit of avoiding condensation is better convergence?
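As a generic illustration of the kind of smoothing involved (the scalar analogue only, not necessarily the SOC-specific technique from the paper referenced above), a complementarity pair $a \ge 0$, $b \ge 0$, $ab = 0$ can be replaced by a smoothed Fischer-Burmeister equation:

$$
\varphi_\mu(a,b) = a + b - \sqrt{a^2 + b^2 + 2\mu} = 0 ,
$$

which for $\mu > 0$ is smooth and equivalent to $a > 0$, $b > 0$, $ab = \mu$, recovering exact complementarity as $\mu \to 0$.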
-
@KSepetanc Well, thanks for reaching out to us :) As our GPU capabilities become more mature, we want power users to test our tools and provide us with constructive comments. You might be the ideal person for that role! Please share any comments/suggestions with us.

Simulating quadruple precision is a great idea. In fact, I played with Quadmath.jl and DoubleFloats.jl recently, on CPUs, together with LDLFactorizations.jl. Quadruple precision works perfectly with Quadmath.jl, and I expect it would be the same for DoubleFloats.jl once these issues are resolved: JuliaMath/DoubleFloats.jl#199, JuliaMath/DoubleFloats.jl#200.

Condensation will be necessary on GPUs. What we mean by "condensation" here doesn't mean that our linear system is dense; the system is still sparse after the "condensation". It just means that we are eliminating the inequality block. Without condensation, we cannot guarantee positive definiteness of the KKT system, and there is no good way to solve indefinite sparse systems on GPUs efficiently, as far as we know. So, at this point, we expect that condensation will be needed for any GPU implementation of nonlinear optimization algorithms with direct linear solvers.
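To make the trade-off concrete, here is the usual elimination in generic IPM notation (a sketch; MadNLP's actual condensed KKT formulation may differ in its details). With slacks $s$ for the inequalities, diagonal barrier matrices $\Sigma_x, \Sigma_s \succ 0$, Lagrangian Hessian $W$ and constraint Jacobian $J$, the full-space step solves

$$
\begin{bmatrix} W + \Sigma_x & J^{\top} \\ J & -\Sigma_s^{-1} \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix},
$$

and eliminating the second block row (using $\Delta y = \Sigma_s (J \Delta x - r_2)$) gives the condensed system

$$
\left( W + \Sigma_x + J^{\top} \Sigma_s J \right) \Delta x = r_1 + J^{\top} \Sigma_s r_2 .
$$

The condensed matrix stays sparse and, with suitable regularization, is positive definite, which is what allows a sparse Cholesky solver (cuDSS / CuCholeskySolver) to be used on the GPU. The cost is that the barrier entries of $\Sigma_s$, which range from roughly $\mu$ to $1/\mu$ near convergence, now appear inside $J^{\top} \Sigma_s J$, worsening the conditioning relative to the full-space system.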
-
@sshin23 thank you for the info, and sorry for the late reply. I have been preparing a project application, "Coordinated Hydrogen and Electricity SyStems and markets", for which we expect to get funding. Part of it is about ill-conditioned bilevel models with AC OPF and OGF, and we could definitely write papers with quad or Double64 (extended precision) optimization. If you could help us run it, we are open to cooperation. I haven't yet run on our supercomputer to compare the consumer GPU with the compute GPU; I expect I will have more time for this next week.
-
@sshin23 I got the results (computation times) for different hardware and AD/solver configurations. The A100 GPU is about 20% faster on a 78k-bus AC OPF compared to the RTX 3090 (albeit the 3090 is paired with a CPU with slightly faster single-core performance). This indicates that compute is not the bottleneck despite the computation being in double precision; presumably memory bandwidth is. This is an important result, as it means that consumer GPUs can be used for all but the most demanding cases (due to their lower memory capacity). I will email you the full results tomorrow.
-
Performant double-precision GPUs (such as the V100 or A100) are very expensive and not available to most scientific communities. In my opinion, we should support Float32 GPU optimization.

Currently, there is a `convert_data` function to prepare data for different backends, but it has no option to prepare data for 32-bit optimization (despite `ExaCore()` having support for it). Running the AC OPF example from the documentation with the `T=Float32` setting and MadNLP with CuCholeskySolver results in a restoration failure in the second iteration.

Additionally, `convert_data` is not a documented function. If possible, it should be documented and its usage explained in the provided example codes.