Replies: 3 comments 2 replies
-
I think this is possible, but it's a bit rough / under-documented at the moment. You'll need to do a few things to ensure you have the right hardware and software environment by setting some things in the Job that's submitted to Kubernetes. The https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/profiles/ endpoint lists some of the information you'll need. BTW, the valid tags for the gpu-pytorch image are at https://mcr.microsoft.com/v2/planetary-computer/gpu-pytorch/tags/list.
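If it helps, here is a rough sketch for inspecting both of those from Python. The Bearer-token header and the KBATCH_TOKEN environment variable are assumptions on my part; the MCR tag list itself needs no authentication.

# Sketch: inspect the available profiles and the valid gpu-pytorch image tags.
import os
import httpx

PROFILES_URL = "https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/profiles/"
TAGS_URL = "https://mcr.microsoft.com/v2/planetary-computer/gpu-pytorch/tags/list"

token = os.environ["KBATCH_TOKEN"]  # assumed to hold your kbatch API token
profiles = httpx.get(PROFILES_URL, headers={"Authorization": f"Bearer {token}"})
profiles.raise_for_status()
print(profiles.json())  # hardware/software profiles the kbatch service knows about

tags = httpx.get(TAGS_URL)  # the registry tag list is public, no auth needed
tags.raise_for_status()
print(tags.json()["tags"])  # valid tags for the gpu-pytorch image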
-
Thank you very much for the pointers! Getting closer, but still no luck.

Traceback (most recent call last):
File "/opt/miniconda3/envs/kbatch/bin/kbatch", line 8, in <module>
sys.exit(cli())
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/cli.py", line 281, in submit_job
result = _core.submit_job(
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 152, in submit_job
data = make_job(job, profile=profile).to_dict()
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_backend.py", line 240, in make_job
job_spec = _make_job_spec(job, profile, labels, annotations)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_backend.py", line 87, in _make_job_spec
resources = profile.get("resources", {})
AttributeError: 'str' object has no attribute 'get'

Adding it to the configuration YAML file works (or perhaps does nothing). The kbatch API (ver. 0.4.1; Python ver. 3.8.13) is highly flaky anyway; adding an uppercase letter to the job name throws a seemingly random JSON decoding error.
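To make the failure concrete: _make_job_spec calls profile.get("resources", {}), so it expects profile to already be a mapping, while what arrives from the command line is the bare profile name. A minimal sketch (the dict shape shown is an assumption, not the service's confirmed schema):

# What _make_job_spec receives when the profile comes through as a plain name:
profile = "gpu-pytorch"
resources = profile.get("resources", {})  # AttributeError: 'str' object has no attribute 'get'

# What the code appears to expect instead is something dict-shaped, e.g.:
profile = {"resources": {"limits": {"nvidia.com/gpu": 1}}}  # hypothetical shape
resources = profile.get("resources", {})  # works: {'limits': {'nvidia.com/gpu': 1}}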
With these two lines in the YAML file:

image: "mcr.microsoft.com/planetary-computer/gpu-pytorch:latest"
profile: "gpu-pytorch"

and no particular flags passed on the command line, fetching the job logs initially fails:

Traceback (most recent call last):
File "/opt/miniconda3/envs/kbatch/bin/kbatch", line 8, in <module>
sys.exit(cli())
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/cli.py", line 348, in logs
result = _core.logs(pod_name, kbatch_url, token, read_timeout=read_timeout)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 196, in logs
result = next(gen)
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/kbatch/_core.py", line 233, in _logs
r.raise_for_status()
File "/opt/miniconda3/envs/kbatch/lib/python3.8/site-packages/httpx/_models.py", line 736, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '500 Internal Server Error' for url 'https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/jobs/logs/job-pod-id-here/'
For more information check: https://httpstatuses.com/500

Eventually, the logs can get pulled, but the process once more fails at the exact same position: no NVIDIA driver found.
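For anyone hitting the same 500s, a rough retry sketch against the logs endpoint shown in the error above. The Bearer-token header, the KBATCH_TOKEN variable, and the pod name are placeholders/assumptions:

# Sketch: poll the logs endpoint until it stops returning 5xx responses.
import os
import time
import httpx

LOGS_URL = "https://pccompute.westeurope.cloudapp.azure.com/compute/services/kbatch/jobs/logs/{pod}/"
token = os.environ["KBATCH_TOKEN"]
pod_name = "job-pod-id-here"  # placeholder

for attempt in range(10):
    r = httpx.get(LOGS_URL.format(pod=pod_name),
                  headers={"Authorization": f"Bearer {token}"}, timeout=60)
    if r.status_code < 500:
        r.raise_for_status()
        print(r.text)
        break
    time.sleep(30)  # the 500s appear to be transient, so back off and retry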
-
Brilliant, that was the missing piece! That combination works. If anyone is interested in the profile loading bug in the kbatch API, you may want to check the merge request I just submitted. Again, many thanks to you both!
-
Hello Planetary Computer community,
I would like to launch a long-running job on the Planetary Computer with GPU computations on a CUDA-capable accelerator. I have set up the code, verified in the online Hub and the GPU/PyTorch environment that it works, and have also created the necessary YAML config and shell script files as per the kbatch API documentation.
The job gets executed and starts to run through, but eventually fails because no NVIDIA driver is installed in the environment (typical PyTorch error message: "RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx"). Sanity checks with a script that runs the usual suspects (nvidia-smi, lshw -C display) confirm this.

Do I need to use another image than the default mcr.microsoft.com/planetary-computer/python:latest? If so, I couldn't find a list of alternatives (I've tried pre-built ones from the Docker Hub, but to no avail). Does the Planetary Computer batch environment actually provide a CUDA accelerator? Perhaps I am missing a step, flag, etc.?
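For reference, a sketch of the kind of sanity check mentioned above (assuming PyTorch is installed in the image; this is illustrative, not the exact script):

# Sketch: report whether a CUDA device and NVIDIA driver are visible to the job.
import shutil
import subprocess
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

if shutil.which("nvidia-smi"):
    subprocess.run(["nvidia-smi"], check=False)  # prints driver and GPU info if a driver is present
else:
    print("nvidia-smi not found on PATH")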
Apologies if this question is stupid, but I have not found an answer to my issue and am still new to Kubernetes, Dask, etc.
Thank you very much!