Deployment Best Practices for cuPyNumeric #5
-
I'm starting to think about what the deployment options are for cuPyNumeric, and I'd like to get your advice. Here are the properties that I think I'd like to see in any potential solution:
My impression is that for a pure-Python project, where all dependencies are available on PyPI, this is possible to achieve today with a pyproject.toml-based project specification (see e.g. https://github.com/scipy/scipy/blob/main/pyproject.toml#L49). Then when you install the project, the exact resolved versions are recorded in a lock file alongside it (see https://docs.astral.sh/uv/guides/projects/#project-structure).

By contrast, as far as I'm aware, Conda's support is extremely poor. You can create YAML files that correspond to an environment, but they have loose (approximate) dependencies only. Conda has no native notion of a lock file. While you can generate one manually (by "freezing" the list of installed packages to a file), there are no tools to work with such files automatically. Conda has no notion of a minimal re-solve based on an edit to an environment YAML file (or even any notion of updating an environment at all; once you create your environment you either manually install into it or throw it away and re-create it from scratch).

I've heard that cuPyNumeric supports PyPI but I'm not sure how high quality this solution actually is (e.g., does it automatically pull in CUDA, NCCL, UCX, etc. as required?). Meanwhile, everyone I know deploys cuPyNumeric via Conda, but based on my understanding of the state of Conda-based tools, there really is no good solution to be found in that direction. Also consider that I often require access to experimental builds, and if those are not published to PyPI then in many cases I'll be out of luck.

If there are any best practices, or thoughts on how to do this, I'd appreciate them. And perhaps it would be good to document this as well.
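To make this concrete, here's a rough sketch of the kind of project file I mean (the package names and version bounds are placeholders, not my actual dependencies); the loose, human-written constraints live in pyproject.toml, and a tool like uv records the exact solved versions in a machine-generated uv.lock next to it:

```toml
# Hypothetical sketch: a minimal pyproject.toml with loose, human-written
# constraints. The exact resolved versions would live in the machine-generated
# lock file (e.g. uv.lock), not here.
[project]
name = "my-analysis"            # placeholder name
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26",              # loose bounds only; the lock file pins exact versions
    "h5py>=3.12",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```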
-
The PyPI packages support CUDA and multi-node execution out-of-the-box, but may be less performant than conda packages without tweaking (e.g. the version of UCX distributed on PyPI doesn't include InfiniBand support). There are also some rough edges with respect to MPI compatibility (MPI is not distributed on PyPI); see https://docs.nvidia.com/legate/latest/networking-wheels.html.
For good reason. Conda is a better fit for distributing libraries that are not pure-Python.
I don't see why what you ask couldn't be done on top of conda. That said, I don't know of any such solutions. Maybe @cryos @jameslamb @bryevdv know of one?
Our space allocation on PyPI is not sufficient to accommodate nightly builds, but we're looking into using a custom repository to upload these (once we have some more automation in place).
-
Sounds like a user-created conda env file so far.
Are you asking for something different than

```yaml
  - zipp=3.21.0=pyhd8ed1ab_1
  - zlib=1.3.1=h8359307_2
  - zstd=1.5.6=hb46c0d2_0
  - pip:
    - accessible-pygments==0.0.5
    - alabaster==1.0.0
    - bokeh==3.8.0.dev1+7.g4035dcc3
```

Maybe that's different from a "true lock file" somehow?
This isn't really up to you, or the tools. If some packages are not available across platforms, then they are not available. No workflow is going to fix that. But assuming they are, I don't see why an exported env file with exact constraints as above would not install them. Are you stating you have encountered otherwise?
Just re: conda, the assumption from the start was always that environments be cheap, so that the recommended way to "update" an env was to make a new one. I think that assumption has been tested through various periods of growing pains with different solvers straining under load. But IMO this POV is reflected in the fact that an env file is specifically a recipe for making new environments. Once an env is created it has no particular association with the env file that was used to create it, and editing that env file is not really relevant to it in any way at all.

So, AFAIK there has not been a change in this philosophical approach that envs are "disposable", so this might be a point you would be forced to compromise on. But to be clear, I have not been actively involved in conda dev since ~2013, so maybe there are options I don't know about these days.
Also sounds like conda env files to me. I think it might help to try and implement a POC of the workflow you want and see what specific issues you run into.
-
The hashes above are non-portable, because they encode (a) the architecture, (b) the OS, (c) the Python version. (And possibly others? E.g., they may be sensitive to CUDA versions as well.) That means you cannot take a file generated by exporting an environment on one machine and use it to recreate that environment on a machine with a different architecture, OS, or Python version.
According to the official Conda documentation:
However, what this really does is just reproduce the original, loosely specified versions you requested in your original environment file: if you asked for a loose version range, that loose range is what you get back, not the exact versions the solver actually installed.
See above. The Conda hashes are platform-specific. And Conda's tools for dealing with them are particularly coarse, forcing you to choose between: (a) the loose, human-specified versions, and (b) the fully-specified version which is locked to the exact platform you solved it on. Any of the above moves force you to throw away (b) and revert to (a).

Obviously I cannot force a binary package to be available on a platform it's not available on, but that's not what I'm talking about. I'm saying: if you installed version 1.2.3 on x86 Linux, you should also install 1.2.3 on ARM macOS (and yes, those will be different files, and I am relying on some level of sanity from the maintainers in keeping some level of compatibility between them, but you're stuck with that either way). In practice the majority of my dependencies (besides cuPyNumeric) are platform-agnostic Python code anyway, so this is a non-issue in the majority of cases.

For what it's worth: my experience the last time I managed a Conda-based environment was that, even with CI-based testing, our CI would break regularly because the human-written environment was not fully specified and something would get upgraded in an incompatible way. I would then spend about one day a month figuring out what changed and going back into the human-written environment file to backport a tighter version bound. In some cases that required adding packages that were not even direct dependencies to our Conda environment, because the thing that broke was a transitive dependency. I ended up essentially littering our human-written environment file with manual version bounds on things I didn't care about, but which would break my software if I didn't constrain them.

But even that wasn't enough: in a handful of cases, I discovered that Conda Forge package maintainers re-uploaded the exact same X.Y.Z point version with a new file. This became a problem because the maintainers upgraded their OS infrastructure and ended up capturing dependencies on libc versions newer than what our machines had. Therefore, specifying an exact X.Y.Z version was not even enough to guarantee a reproducible install.
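To illustrate what I mean by most dependencies being platform-agnostic, here's a rough sketch (placeholder package names) of how, on the PyPI side, a single human-written spec can carry the rare platform-specific dependency via a PEP 508 environment marker, while a resolver that produces a cross-platform lock file (e.g. uv) records exact versions for every target platform from this one file:

```toml
# Hypothetical sketch (placeholder package names): one human-written spec that
# stays platform-agnostic except where a dependency genuinely only exists on
# some platforms, expressed with a PEP 508 environment marker.
[project]
name = "example-app"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "numpy>=1.26,<2.1",                                       # same resolved pin should land on every platform
    "linux-only-accelerator>=2.0; sys_platform == 'linux'",   # rare platform-specific case
]
```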
I think you missed my point here. I am strongly against "environments as pets" and strongly in favor of "environments as cattle". My issue is not whether environments are disposable, but the quality of the tools for working with them. As you have seen above, the Conda tools are simply not well thought out. Here's an example that "just works" in uv:
I don't even know how to do this in Conda.
Also, note that all of this happens on the

For what it's worth, I tried uv late last week and it's such a breath of fresh air. Everything I've listed above just works. The PyPI issues are unfortunate and may stand in the way of using it, but in every other respect it is so far ahead of Conda that the two are really not remotely comparable.

Let me add something else: the ultimate end user of this is not me. The end user is a domain scientist who is not an expert programmer. I need to come up with a system that is easy to use, that does the right thing automatically, doesn't require a lot of hacks (especially editing machine-generated anything) to work with, and is hard to get wrong. In my initial testing uv ticks all of those boxes. My experience with Conda is that it makes this difficult or impossible, and what happens in practice is that scientists put no effort into reproducibility and every new machine becomes a one-off, hand-assembled environment.
-
@manopapad Can we get this fixed? It seems like a glaring hole in the ecosystem. Otherwise uv seems to be working for me in my initial testing last week.
-
Explicit environment management was one of the principal features of conda when it came out. Just as a refresher, prior to 2012 the situation was considerably messier.
It's entirely believable that 13 years later there are even newer and better tools that can take advantage of more recent developments like binary wheels.
-
Please understand, if I sound frustrated it's because of a painful previous experience trying my hardest to get Conda to work for another application, not because I have anything against Conda per se. I am not trying to minimize Conda's historical contributions, and I am interested in technical solutions to my problems, which as best I can tell still exist.

My understanding from the discussion so far is that uv (or any pure PyPI-based solution) is insufficient at the moment due to @manopapad's comment nv-legate/cupynumeric#1182 (comment). I do hope this gets resolved, but it puts uv (and poetry, pipenv, and the whole slew of other Python-only technologies) on the back burner for the moment. That is not my preference, but it's the reality I have to work with.

I identified my ideal requirements in the top comment in this issue, and I listed a number of specific technical challenges in my comment nv-legate/cupynumeric#1182 (comment). These are not abstract issues. These are specific problems I hit in production using Conda in my last application, and which I am concerned would likely impact any future application. If there are solutions to these problems I am very interested to hear them. If there are no solutions to these problems, I strongly advise the cuPyNumeric team to consider pushing improvements into the PyPI ecosystem so that at least in the future users have better options available.

Given what I know at the moment, here's the best I can come up with as a solution (though it's really only half of one) to my problems:
I am still not sure what to do about GASNet vs UCX-based builds. As best I can tell, those would need to be two entirely separate environments. Overall it's still a very manual process, and I worry about it breaking after I hand the project off to users, but unless Conda has some new features I'm not aware of, or someone fixes the PyPI-based build, I'm not sure what else I can do.
-
I did some more digging and found pixi. pixi is basically uv but for the Conda ecosystem. Similar to uv, you have a project specification, and environments are implicitly solved every time you run a command like `pixi shell`. The last solution is stored automatically in a `pixi.lock` file and ensures every user sees the same set of packages (until someone explicitly upgrades a dependency).

Lock files are cross-platform in the sense that pixi forces you to pick what your supported platforms are, and then solves each one independently. So even if I develop on x86, I can configure for ARM at the same time and know that the project specification has a solution on that platform, even without running on it.

pixi also allows you to generate multiple environments from a single project specification. I have used this feature for example to prototype different environments for UCX and GASNet. This gives me some confidence that if someone needed to deploy this to a Slingshot machine down the line, in principle it would work. Again, you solve every environment all the time, so it's impossible to miss a dependency on a rarely used platform.

pixi is cross-language, so it supports both Python and non-Python packages.

For posterity, here's a copy of my `pyproject.toml`. (Note: edited to add a CPU-only configuration so you can install this on machines without CUDA. The default and gex environments still assume CUDA is available.)

```toml
[project]
name = "..."
version = "0.1.0"
description = "..."
readme = "README.md"
authors = [
{ name = "...", email = "..." },
]
requires-python = ">=3.12"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.pixi.workspace]
channels = [
"conda-forge",
"legate/label/experimental",
"legate/label/ucc140",
]
platforms = ["linux-64", "linux-aarch64"]
[tool.pixi.dependencies]
basedpyright = "==1.28.3"
cupynumeric = ">=25.3.2"
h5py = ">=3.12.1"
numpy = ">=1.26.4,<2.1"
ruff = ">=0.11.5"
scipy = ">=1.15.2"
scipy-stubs = ">=1.15.2"
[tool.pixi.pypi-dependencies]
my_project_name = { path = ".", editable = true }
[tool.pixi.tasks]
check = "basedpyright"
format = "ruff format"
[tool.pixi.feature.gpu.system-requirements]
cuda = "12.2" # must have CUDA installed on the system
[tool.pixi.feature.gpu.dependencies]
cupy = ">=13.4.1"
cupynumeric = { version = ">=25.3.2", build = "*_gpu" }
[tool.pixi.feature.gex]
channels = [
"conda-forge",
"legate/label/gex-experimental",
"legate/label/experimental",
]
platforms = ["linux-64", "linux-aarch64"]
[tool.pixi.feature.gex.dependencies]
realm-gex-wrapper = "*"
legate-mpi-wrapper = "*"
[tool.pixi.environments]
default = { features = ["gpu"], solve-group = "default" }
cpu = { solve-group = "cpu" }
gex = { features = ["gpu", "gex"], solve-group = "gex" }
```

I'd encourage others to try this out and see if it works for them, since it's a lot more reproducible than one-off Conda environments (even when you start from an exported environment file).
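To illustrate the multiple-environments mechanism a bit further, here's a hypothetical extension (not part of my actual file above) that layers a "dev" feature with test-only dependencies on top of the GPU feature; sharing the default solve-group keeps its package versions aligned with the default environment:

```toml
# Hypothetical sketch, not part of the file above: a "dev" feature with
# test-only dependencies, layered on top of the "gpu" feature.
[tool.pixi.feature.dev.dependencies]
pytest = ">=8"

# The corresponding entry added under [tool.pixi.environments]; reusing the
# "default" solve-group keeps shared packages at identical versions:
# dev = { features = ["gpu", "dev"], solve-group = "default" }
```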