Deployment Best Practices for cuPyNumeric #5
-
I'm starting to think about what the deployment options are for cuPyNumeric, and I'd like to get your advice. Here are the properties that I think I'd like to see in any potential solution:
My impression is that for a pure-Python project, where all dependencies are available on PyPI, this is possible to achieve today with a pyproject.toml-based project specification (see e.g. https://github.com/scipy/scipy/blob/main/pyproject.toml#L49). Then when you install the project, the exact resolved versions are recorded in a lock file alongside it (see https://docs.astral.sh/uv/guides/projects/#project-structure).

By contrast, as far as I'm aware, Conda's support is extremely poor. You can create YAML files that correspond to an environment, but they have loose (approximate) dependencies only. Conda has no native notion of a lock file. While you can generate one manually (by "freezing" the list of installed packages to a file), there are no tools to work with such files automatically. Conda has no notion of a minimal re-solve based on an edit to an environment YAML file (or even any notion of updating an environment at all; once you create your environment you either manually install into it or throw it away and re-create it from scratch).

I've heard that cuPyNumeric supports PyPI but I'm not sure how high quality this solution actually is (e.g., does it automatically pull in CUDA, NCCL, UCX, etc. as required?). Meanwhile, everyone I know deploys cuPyNumeric via Conda, but based on my understanding of the state of Conda-based tools, there really is no good solution to be found in that direction. Also consider that I often require access to experimental builds, and if those are not published to PyPI then in many cases I'll be out of luck.

If there are any best practices, or thoughts on how to do this, I'd appreciate them. And perhaps it would be good to document this as well.
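To make this concrete, here's a rough sketch of the kind of project file I mean (the package names and version bounds are placeholders, not my actual dependencies); the loose, human-written constraints live in pyproject.toml, and a tool like uv records the exact solved versions in a machine-generated uv.lock next to it:

```toml
# Hypothetical sketch: a minimal pyproject.toml with loose, human-written
# constraints. The exact resolved versions would live in the machine-generated
# lock file (e.g. uv.lock), not here.
[project]
name = "my-analysis"            # placeholder name
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26",              # loose bounds only; the lock file pins exact versions
    "h5py>=3.12",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```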
-
The PyPI packages support CUDA and multi-node execution out-of-the-box, but may be less performant than conda packages without tweaking (e.g. the version of UCX distributed on PyPI doesn't include InfiniBand support). There are also some rough edges with respect to MPI compatibility (MPI is not distributed on PyPI); see https://docs.nvidia.com/legate/latest/networking-wheels.html.
For good reason. Conda is a better fit for distributing libraries that are not pure-Python.
I don't see why what you ask couldn't be done on top of conda. That said, I don't know of any such solutions. Maybe @cryos @jameslamb @bryevdv know of one?
Our space allocation on PyPI is not sufficient to accommodate nightly builds, but we're looking into using a custom repository to upload these (once we have some more automation in place).
-
Sounds like a user-created conda env file so far.
Are you asking for something different than

```yaml
  - zipp=3.21.0=pyhd8ed1ab_1
  - zlib=1.3.1=h8359307_2
  - zstd=1.5.6=hb46c0d2_0
  - pip:
    - accessible-pygments==0.0.5
    - alabaster==1.0.0
    - bokeh==3.8.0.dev1+7.g4035dcc3
```

Maybe that's different from a "true lock file" somehow?
This isn't really up to you, or the tools. If some packages are not available across platforms, then they are not available. No workflow is going to fix that. But assuming they are, I don't see why an exported env file with exact constraints as above would not install them. Are you stating you have encountered otherwise?
Just re: conda, the assumption from the start was always that environments be cheap, so that the recommended way to "update" an env was to make a new one. I think that assumption has been tested through various periods of growing pains with different solvers straining under load. But IMO this POV is reflected in the fact that an env file is specifically a recipe for making new environments. Once an env is created it has no particular association with the env file that was used to create it, and editing that env file is not really relevant to it in any way at all.

So, AFAIK there has not been a change in this philosophical approach that envs are "disposable", so this might be a point you would be forced to compromise on. But to be clear, I have not been actively involved in conda dev since ~2013, so maybe there are options I don't know about these days.
Also sounds like conda env files to me. I think it might help to try and implement a POC of the workflow you want and see what specific issues you run into.
-
The hashes above are non-portable, because they encode (a) the architecture, (b) the OS, (c) the Python version. (And possibly others? E.g., they may be sensitive to CUDA versions as well.) That means you cannot take a file generated by exporting an environment on one machine and use it to recreate that environment on a machine with a different architecture, OS, or Python version.
According to the official Conda documentation:
However, what this really does is just reproduce the original, loosely specified versions you requested in your original environment file: if you asked for a loose version range, that loose range is what you get back, not the exact versions the solver actually installed.
See above. The Conda hashes are platform-specific. And Conda's tools for dealing with them are particularly coarse, forcing you to choose between: (a) the loose, human-specified versions, and (b) the fully-specified version which is locked to the exact platform you solved it on. Any of the above moves force you to throw away (b) and revert to (a).

Obviously I cannot force a binary package to be available on a platform it's not available on, but that's not what I'm talking about. I'm saying: if you installed version 1.2.3 on x86 Linux, you should also install 1.2.3 on ARM macOS (and yes, those will be different files, and I am relying on some level of sanity from the maintainers in keeping some level of compatibility between them, but you're stuck with that either way). In practice the majority of my dependencies (besides cuPyNumeric) are platform-agnostic Python code anyway, so this is a non-issue in the majority of cases.

For what it's worth: my experience the last time I managed a Conda-based environment was that, even with CI-based testing, our CI would break regularly because the human-written environment was not fully specified and something would get upgraded in an incompatible way. I would then spend about one day a month figuring out what changed and going back into the human-written environment file to backport a tighter version bound. In some cases that required adding packages that were not even direct dependencies to our Conda environment, because the thing that broke was a transitive dependency. I ended up essentially littering our human-written environment file with manual version bounds on things I didn't care about, but which would break my software if I didn't constrain them.

But even that wasn't enough: in a handful of cases, I discovered that Conda Forge package maintainers re-uploaded the exact same X.Y.Z point version with a new file. This became a problem because the maintainers upgraded their OS infrastructure and ended up capturing dependencies on libc versions newer than what our machines had. Therefore, specifying an exact X.Y.Z version was not even enough to guarantee a reproducible install.
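To illustrate what I mean by most dependencies being platform-agnostic, here's a rough sketch (placeholder package names) of how, on the PyPI side, a single human-written spec can carry the rare platform-specific dependency via a PEP 508 environment marker, while a resolver that produces a cross-platform lock file (e.g. uv) records exact versions for every target platform from this one file:

```toml
# Hypothetical sketch (placeholder package names): one human-written spec that
# stays platform-agnostic except where a dependency genuinely only exists on
# some platforms, expressed with a PEP 508 environment marker.
[project]
name = "example-app"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26,<2.1",                                       # same resolved pin should land on every platform
    "linux-only-accelerator>=2.0; sys_platform == 'linux'",   # rare platform-specific case
]
```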
I think you missed my point here. I am strongly against "environments as pets" and strongly in favor of "environments as cattle". My issue is not whether environments are disposable, but the quality of the tools for working with them. As you have seen above, the Conda tools are simply not well thought out. Here's an example that "just works" in uv:
I don't even know how to do this in Conda.
Also, note that all of this happens on the

For what it's worth, I tried uv late last week and it's such a breath of fresh air. Everything I've listed above just works. The PyPI issues are unfortunate and may stand in the way of using it, but in every other respect it is so far ahead of Conda that the two are really not remotely comparable.

Let me add something else: the ultimate end user of this is not me. The end user is a domain scientist who is not an expert programmer. I need to come up with a system that is easy to use, that does the right thing automatically, doesn't require a lot of hacks (especially editing machine-generated anything) to work with, and is hard to get wrong. In my initial testing uv ticks all of those boxes. My experience with Conda is that it makes this difficult or impossible, and what happens in practice is that scientists put no effort into reproducibility and every new machine becomes a one-off, hand-assembled environment.
-
@manopapad Can we get this fixed? It seems like a glaring hole in the ecosystem. Otherwise uv seems to be working for me in my initial testing last week.
-
Explicit environment management was one of the principal features of conda when it came out. Just as a refresher, prior to 2012 the situation was considerably messier.
It's entirely believable that 13 years later there are even newer and better tools that can take advantage of more recent developments like binary wheels.
-
Please understand, if I sound frustrated it's because of a painful previous experience trying my hardest to get Conda to work for another application, not because I have anything against Conda per se. I am not trying to minimize Conda's historical contributions, and I am interested in technical solutions to my problems, which as best I can tell still exist.

My understanding from the discussion so far is that uv (or any pure PyPI-based solution) is insufficient at the moment due to @manopapad's comment nv-legate/cupynumeric#1182 (comment). I do hope this gets resolved, but it puts uv (and poetry, pipenv, and the whole slew of other Python-only technologies) on the back burner for the moment. That is not my preference, but it's the reality I have to work with.

I identified my ideal requirements in the top comment in this issue, and I listed a number of specific technical challenges in my comment nv-legate/cupynumeric#1182 (comment). These are not abstract issues. These are specific problems I hit in production using Conda in my last application, and which I am concerned would likely impact any future application. If there are solutions to these problems I am very interested to hear them. If there are no solutions to these problems, I strongly advise the cuPyNumeric team to consider pushing improvements into the PyPI ecosystem so that at least in the future users have better options available.

Given what I know at the moment, here's the best I can come up with as a solution (though it's really only half of one) to my problems:
I am still not sure what to do about GASNet vs UCX-based builds. As best I can tell, those would need to be two entirely separate environments. Overall it's still a very manual process, and I worry about it breaking after I hand the project off to users, but unless Conda has some new features I'm not aware of, or someone fixes the PyPI-based build, I'm not sure what else I can do.
-
I did some more digging and found pixi. pixi is basically uv but for the Conda ecosystem. Similar to uv, you have a project specification, and environments are implicitly solved every time you run a command like `pixi shell`. The last solution is stored automatically in a `pixi.lock` file and ensures every user sees the same set of packages (until someone explicitly upgrades a dependency).

Lock files are cross-platform in the sense that pixi forces you to pick what your supported platforms are, and then solves each one independently. So even if I develop on x86, I can configure for ARM at the same time and know that the project specification has a solution on that platform, even without running on it.

pixi also allows you to generate multiple environments from a single project specification. I have used this feature for example to prototype different environments for UCX and GASNet. This gives me some confidence that if someone needed to deploy this to a Slingshot machine down the line, in principle it would work. Again, you solve every environment all the time, so it's impossible to miss a dependency on a rarely used platform.

pixi is cross-language, so it supports both Python and non-Python packages.

For posterity, here's a copy of my `pyproject.toml`. (Note: edited to add a CPU-only configuration so you can install this on machines without CUDA. The default and gex environments still assume CUDA is available.)

```toml
[project]
name = "..."
version = "0.1.0"
description = "..."
readme = "README.md"
authors = [
{ name = "...", email = "..." },
]
requires-python = ">=3.12"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.pixi.workspace]
channels = [
"conda-forge",
"legate/label/experimental",
"legate/label/ucc140",
]
platforms = ["linux-64", "linux-aarch64"]
[tool.pixi.dependencies]
basedpyright = "==1.28.3"
cupynumeric = ">=25.3.2"
h5py = ">=3.12.1"
numpy = ">=1.26.4,<2.1"
ruff = ">=0.11.5"
scipy = ">=1.15.2"
scipy-stubs = ">=1.15.2"
[tool.pixi.pypi-dependencies]
my_project_name = { path = ".", editable = true }
[tool.pixi.tasks]
check = "basedpyright"
format = "ruff format"
[tool.pixi.feature.gpu.system-requirements]
cuda = "12.2" # must have CUDA installed on the system
[tool.pixi.feature.gpu.dependencies]
cupy = ">=13.4.1"
cupynumeric = { version = ">=25.3.2", build = "*_gpu" }
[tool.pixi.feature.gex]
channels = [
"conda-forge",
"legate/label/gex-experimental",
"legate/label/experimental",
]
platforms = ["linux-64", "linux-aarch64"]
[tool.pixi.feature.gex.dependencies]
realm-gex-wrapper = "*"
legate-mpi-wrapper = "*"
[tool.pixi.environments]
default = { features = ["gpu"], solve-group = "default" }
cpu = { solve-group = "cpu" }
gex = { features = ["gpu", "gex"], solve-group = "gex" }
```

I'd encourage others to try this out and see if it works for them, since it's a lot more reproducible than one-off Conda environments (even when you start from an exported environment file).
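To illustrate the multiple-environments mechanism a bit further, here's a hypothetical extension (not part of my actual file above) that layers a "dev" feature with test-only dependencies on top of the GPU feature; sharing the default solve-group keeps its package versions aligned with the default environment:

```toml
# Hypothetical sketch, not part of the file above: a "dev" feature with
# test-only dependencies, layered on top of the "gpu" feature.
[tool.pixi.feature.dev.dependencies]
pytest = ">=8"

# The corresponding entry added under [tool.pixi.environments]; reusing the
# "default" solve-group keeps shared packages at identical versions:
# dev = { features = ["gpu", "dev"], solve-group = "default" }
```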