Replies: 3 comments 2 replies
-
I think this is possible, but it's a bit rough / under-documented at the moment. You'll need to do a few things to ensure you have the right hardware and software environment by setting some things in the Job that's submitted to Kubernetes. The https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/profiles/ endpoint lists some of the information you'll need. BTW, the valid tags for the gpu-pytorch image are at https://mcr.microsoft.com/v2/planetary-computer/gpu-pytorch/tags/list.
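If it helps, here is a rough sketch for inspecting both of those from Python. The Bearer-token header and the KBATCH_TOKEN environment variable are assumptions on my part; the MCR tag list itself needs no authentication.

# Sketch: inspect the available profiles and the valid gpu-pytorch image tags.
import os
import httpx

PROFILES_URL = "https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/profiles/"
TAGS_URL = "https://mcr.microsoft.com/v2/planetary-computer/gpu-pytorch/tags/list"

token = os.environ["KBATCH_TOKEN"]  # assumed to hold your kbatch API token
profiles = httpx.get(PROFILES_URL, headers={"Authorization": f"Bearer {token}"})
profiles.raise_for_status()
print(profiles.json())  # hardware/software profiles the kbatch service knows about

tags = httpx.get(TAGS_URL)  # the registry tag list is public, no auth needed
tags.raise_for_status()
print(tags.json()["tags"])  # valid tags for the gpu-pytorch image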
-
Thank you very much for the pointers! Getting closer, but still no luck.

Traceback (most recent call last):
File "/opt/miniconda3/envs/kbatch/bin/kbatch", line 8, in <module>
sys.exit(cli())
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/cli.py", line 281, in submit_job
result = _core.submit_job(
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 152, in submit_job
data = make_job(job, profile=profile).to_dict()
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_backend.py", line 240, in make_job
job_spec = _make_job_spec(job, profile, labels, annotations)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_backend.py", line 87, in _make_job_spec
resources = profile.get("resources", {})
AttributeError: 'str' object has no attribute 'get'

Adding it to the configuration YAML file works (or perhaps does nothing). The kbatch API (ver. 0.4.1; Python ver. 3.8.13) is highly flaky anyway; adding an uppercase letter to the job name throws a seemingly random JSON decoding error.
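To make the failure concrete: _make_job_spec calls profile.get("resources", {}), so it expects profile to already be a mapping, while what arrives from the command line is the bare profile name. A minimal sketch (the dict shape shown is an assumption, not the service's confirmed schema):

# What _make_job_spec receives when the profile comes through as a plain name:
profile = "gpu-pytorch"
resources = profile.get("resources", {})  # AttributeError: 'str' object has no attribute 'get'

# What the code appears to expect instead is something dict-shaped, e.g.:
profile = {"resources": {"limits": {"nvidia.com/gpu": 1}}}  # hypothetical shape
resources = profile.get("resources", {})  # works: {'limits': {'nvidia.com/gpu': 1}}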
With these two lines in the YAML file:

image: "mcr.microsoft.com/planetary-computer/gpu-pytorch:latest"
profile: "gpu-pytorch"

and no particular flags passed on the command line, fetching the job logs initially fails:

Traceback (most recent call last):
File "/opt/miniconda3/envs/kbatch/bin/kbatch", line 8, in <module>
sys.exit(cli())
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/cli.py", line 348, in logs
result = _core.logs(pod_name, kbatch_url, token, read_timeout=read_timeout)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 196, in logs
result = next(gen)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 233, in _logs
r.raise_for_status()
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/httpx/_models.py", line 736, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/jobs/logs/job-pod-id-here/'
For more information check: https://httpstatuses.com/500

Eventually, the logs can get pulled, but the process once more fails at the exact same position: no NVIDIA driver found.
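For anyone hitting the same 500s, a rough retry sketch against the logs endpoint shown in the error above. The Bearer-token header, the KBATCH_TOKEN variable, and the pod name are placeholders/assumptions:

# Sketch: poll the logs endpoint until it stops returning 5xx responses.
import os
import time
import httpx

LOGS_URL = "https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/jobs/logs/{pod}/"
token = os.environ["KBATCH_TOKEN"]
pod_name = "job-pod-id-here"  # placeholder

for attempt in range(10):
    r = httpx.get(LOGS_URL.format(pod=pod_name),
                  headers={"Authorization": f"Bearer {token}"}, timeout=60)
    if r.status_code < 500:
        r.raise_for_status()
        print(r.text)
        break
    time.sleep(30)  # the 500s appear to be transient, so back off and retry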
-
Brilliant, that was the missing piece! That combination works. If anyone is interested in the profile loading bug in the kbatch API, you may want to check the merge request I just submitted. Again, many thanks to you both!
-
Hello Planetary Computer community,
I would like to launch a long-running job on the Planetary Computer with GPU computations on a CUDA-capable accelerator. I have set up the code, verified in the online Hub and the GPU/PyTorch environment that it works, and have also created the necessary YAML config and shell script files as per the kbatch API documentation.
The job gets executed and starts to run through, but eventually fails because no NVIDIA driver is installed in the environment (typical PyTorch error message: "RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx"). Sanity checks with a script that runs the usual suspects (nvidia-smi, lshw -C display) confirm this.

Do I need to use another image than the default mcr.microsoft.com/planetary-computer/python:latest? If so, I couldn't find a list of alternatives (I've tried pre-built ones from the Docker Hub, but to no avail). Does the Planetary Computer batch environment actually provide a CUDA accelerator? Perhaps I am missing a step, flag, etc.?
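For reference, a sketch of the kind of sanity check mentioned above (assuming PyTorch is installed in the image; this is illustrative, not the exact script):

# Sketch: report whether a CUDA device and NVIDIA driver are visible to the job.
import shutil
import subprocess
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

if shutil.which("nvidia-smi"):
    subprocess.run(["nvidia-smi"], check=False)  # prints driver and GPU info if a driver is present
else:
    print("nvidia-smi not found on PATH")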
Apologies if this question is stupid, but I have not found an answer to my issue and am still new to Kubernetes, Dask, etc.
Thank you very much!