[Jobs] Add huggingface-cli jobs commands #3211

Open · wants to merge 32 commits into main

141 changes: 141 additions & 0 deletions docs/source/en/guides/cli.md
@@ -604,3 +604,144 @@ Copy-and-paste the text below in your GitHub issue.
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
```

## huggingface-cli jobs

Run compute jobs on Hugging Face infrastructure with a familiar Docker-like interface.

`huggingface-cli jobs` is a command-line tool that lets you run anything on Hugging Face's infrastructure (including GPUs and TPUs!) with simple commands. Think `docker run`, but for running code on A100s.

```bash
# Directly run Python code
>>> huggingface-cli jobs run python:3.12 python -c "print('Hello from the cloud!')"

# Use GPUs without any setup
>>> huggingface-cli jobs run --flavor a10g-small pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel \
... python -c "import torch; print(torch.cuda.get_device_name())"

# Run from Hugging Face Spaces
>>> huggingface-cli jobs run hf.co/spaces/lhoestq/duckdb duckdb -c "select 'hello world'"

# Run a Python script with `uv` (experimental)
>>> huggingface-cli jobs uv run my_script.py
```

### ✨ Key Features

- 🐳 **Docker-like CLI**: Familiar commands (`run`, `ps`, `logs`, `inspect`) to run and manage jobs
- 🔥 **Any Hardware**: From CPUs to A100 GPUs and TPU pods - switch with a simple flag
- 📦 **Run Anything**: Use Docker images, HF Spaces, or your custom containers
- 🔐 **Simple Auth**: Just use your HF token
- 📊 **Live Monitoring**: Stream logs in real-time, just like running locally
- 💰 **Pay-as-you-go**: Only pay for the seconds you use

### Quick Start

#### 1. Run your first job

```bash
# Run a simple Python script
>>> huggingface-cli jobs run python:3.12 python -c "print('Hello from HF compute!')"
```

This command runs the job and shows the logs. You can pass `--detach` to run the Job in the background and only print the Job ID.
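For example (the printed ID below is illustrative):

```bash
# Run in the background and print only the Job ID
>>> huggingface-cli jobs run --detach python:3.12 python -c "print('Hello!')"
6877b757344d8f02f6001012
```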

#### 2. Check job status

```bash
# List your running jobs
>>> huggingface-cli jobs ps

# Inspect the status of a job
>>> huggingface-cli jobs inspect <job_id>

# View logs from a job
>>> huggingface-cli jobs logs <job_id>

# Cancel a job
>>> huggingface-cli jobs cancel <job_id>
```
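These CLI commands map onto Python helpers that this PR also exports (`run_job`, `list_jobs`, `inspect_job`, `fetch_job_logs`, `cancel_job`). A minimal sketch, assuming the top-level helpers mirror the `HfApi.run_job(image, command)` call from the `JobUrl` docstring; the exact signatures may differ:

```python
# Sketch only: signatures are assumptions based on the docstring example
# HfApi.run_job("ubuntu", ["echo", "hello"]).
from huggingface_hub import cancel_job, fetch_job_logs, inspect_job, list_jobs, run_job

job_url = run_job("python:3.12", ["python", "-c", "print('hello')"])
print(job_url.job_id)  # JobUrl parses the job id out of the returned URL

print(list_jobs())                  # assumed: lists your jobs as JobInfo objects
print(inspect_job(job_url.job_id))  # assumed: returns a JobInfo for one job
fetch_job_logs(job_url.job_id)      # assumed: streams the job's logs
cancel_job(job_url.job_id)          # assumed: cancels the running job
```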

#### 3. Run on GPU

You can also run jobs on GPUs or TPUs with the `--flavor` option. For example, to run a PyTorch job on an A10G GPU:

```bash
# Use an A10G GPU to check PyTorch CUDA
>>> huggingface-cli jobs run --flavor a10g-small pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel \
... python -c "import torch; print(f'This code ran with the following GPU: {torch.cuda.get_device_name()}')"
```

Running this will show the following output!

```bash
This code ran with the following GPU: NVIDIA A10G
```

That's it! You're now running code on Hugging Face's infrastructure. For more detailed information, check out the [Quickstart Guide](docs/quickstart.md).

### Common Use Cases

- **Model Training**: Fine-tune or train models on GPUs (T4, A10G, A100) without managing infrastructure
- **Synthetic Data Generation**: Generate large-scale datasets using LLMs on powerful hardware
- **Data Processing**: Process massive datasets with high-CPU configurations for parallel workloads
- **Batch Inference**: Run offline inference on thousands of samples using optimized GPU setups
- **Experiments & Benchmarks**: Run ML experiments on consistent hardware for reproducible results
- **Development & Debugging**: Test GPU code without local CUDA setup

### Pass Environment Variables and Secrets

You can pass environment variables to your job with the `-e` option:

```bash
# Pass environment variables
>>> huggingface-cli jobs run -e FOO=foo -e BAR=bar python:3.12 python -c "import os; print(os.environ['FOO'], os.environ['BAR'])"
```

```bash
# Pass environment variables from a local .env file
>>> huggingface-cli jobs run --env-file .env python:3.12 python -c "import os; print(os.environ['FOO'], os.environ['BAR'])"
```
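Here, `.env` is a standard dotenv file with one `KEY=value` pair per line:

```bash
# .env
FOO=foo
BAR=bar
```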

```bash
# Pass secrets - they will be encrypted server side
>>> huggingface-cli jobs run -s MY_SECRET=psswrd python:3.12 python -c "import os; print(os.environ['MY_SECRET'])"
```

```bash
# Pass secrets from a local .env.secrets file - they will be encrypted server side
>>> huggingface-cli jobs run --secrets-file .env.secrets python:3.12 python -c "import os; print(os.environ['MY_SECRET'])"
```

### Hardware

Available `--flavor` options:

- CPU: `cpu-basic`, `cpu-upgrade`
- GPU: `t4-small`, `t4-medium`, `l4x1`, `l4x4`, `a10g-small`, `a10g-large`, `a10g-largex2`, `a10g-largex4`, `a100-large`
- TPU: `v5e-1x1`, `v5e-2x2`, `v5e-2x4`

(updated March 2025 from the Hugging Face [suggested_hardware docs](https://huggingface.co/docs/hub/en/spaces-config-reference))
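For example, to run a CPU-heavy job on the upgraded CPU flavor:

```bash
# Pick a flavor from the list above
>>> huggingface-cli jobs run --flavor cpu-upgrade python:3.12 python -c "import os; print(os.cpu_count())"
```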

### UV Scripts (Experimental)

Run UV scripts (Python scripts with inline dependencies) on HF infrastructure:

```bash
# Run a UV script (creates temporary repo)
>>> huggingface-cli jobs uv run my_script.py

# Run with persistent repo
>>> huggingface-cli jobs uv run my_script.py --repo my-uv-scripts

# Run with GPU
>>> huggingface-cli jobs uv run ml_training.py --flavor t4-small

# Pass arguments to script
>>> huggingface-cli jobs uv run process.py input.csv output.parquet --repo data-scripts

# Run a script directly from a URL
>>> huggingface-cli jobs uv run https://huggingface.co/datasets/username/scripts/resolve/main/example.py
```

UV scripts are Python scripts that include their dependencies directly in the file using a special comment syntax. This makes them perfect for self-contained tasks that don't require complex project setups. Learn more about UV scripts in the [UV documentation](https://docs.astral.sh/uv/guides/scripts/).
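For reference, a minimal UV script declares its dependencies in a PEP 723 comment block at the top of the file:

```python
# /// script
# dependencies = [
#     "pandas",
# ]
# ///
# uv reads the block above and installs the dependencies before running
import pandas as pd

print(pd.DataFrame({"a": [1, 2, 3]}).describe())
```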
15 changes: 15 additions & 0 deletions src/huggingface_hub/__init__.py
@@ -165,6 +165,7 @@
"add_space_variable",
"auth_check",
"cancel_access_request",
"cancel_job",
"change_discussion_status",
"comment_discussion",
"create_branch",
@@ -194,6 +195,7 @@
"duplicate_space",
"edit_discussion_comment",
"enable_webhook",
"fetch_job_logs",
"file_exists",
"get_collection",
"get_dataset_tags",
@@ -210,11 +212,13 @@
"get_user_overview",
"get_webhook",
"grant_access",
"inspect_job",
"list_accepted_access_requests",
"list_collections",
"list_datasets",
"list_inference_catalog",
"list_inference_endpoints",
"list_jobs",
"list_lfs_files",
"list_liked_repos",
"list_models",
@@ -251,6 +255,7 @@
"resume_inference_endpoint",
"revision_exists",
"run_as_future",
"run_job",
"scale_to_zero_inference_endpoint",
"set_space_sleep_time",
"space_info",
@@ -792,6 +797,7 @@
"auth_switch",
"cached_assets_path",
"cancel_access_request",
"cancel_job",
"change_discussion_status",
"comment_discussion",
"configure_http_backend",
@@ -825,6 +831,7 @@
"enable_webhook",
"export_entries_as_dduf",
"export_folder_as_dduf",
"fetch_job_logs",
"file_exists",
"from_pretrained_fastai",
"from_pretrained_keras",
@@ -851,12 +858,14 @@
"grant_access",
"hf_hub_download",
"hf_hub_url",
"inspect_job",
"interpreter_login",
"list_accepted_access_requests",
"list_collections",
"list_datasets",
"list_inference_catalog",
"list_inference_endpoints",
"list_jobs",
"list_lfs_files",
"list_liked_repos",
"list_models",
@@ -907,6 +916,7 @@
"resume_inference_endpoint",
"revision_exists",
"run_as_future",
"run_job",
"save_pretrained_keras",
"save_torch_model",
"save_torch_state_dict",
@@ -1143,6 +1153,7 @@ def __dir__():
add_space_variable, # noqa: F401
auth_check, # noqa: F401
cancel_access_request, # noqa: F401
cancel_job, # noqa: F401
change_discussion_status, # noqa: F401
comment_discussion, # noqa: F401
create_branch, # noqa: F401
@@ -1172,6 +1183,7 @@ def __dir__():
duplicate_space, # noqa: F401
edit_discussion_comment, # noqa: F401
enable_webhook, # noqa: F401
fetch_job_logs, # noqa: F401
file_exists, # noqa: F401
get_collection, # noqa: F401
get_dataset_tags, # noqa: F401
@@ -1188,11 +1200,13 @@ def __dir__():
get_user_overview, # noqa: F401
get_webhook, # noqa: F401
grant_access, # noqa: F401
inspect_job, # noqa: F401
list_accepted_access_requests, # noqa: F401
list_collections, # noqa: F401
list_datasets, # noqa: F401
list_inference_catalog, # noqa: F401
list_inference_endpoints, # noqa: F401
list_jobs, # noqa: F401
list_lfs_files, # noqa: F401
list_liked_repos, # noqa: F401
list_models, # noqa: F401
@@ -1229,6 +1243,7 @@ def __dir__():
resume_inference_endpoint, # noqa: F401
revision_exists, # noqa: F401
run_as_future, # noqa: F401
run_job, # noqa: F401
scale_to_zero_inference_endpoint, # noqa: F401
set_space_sleep_time, # noqa: F401
space_info, # noqa: F401
126 changes: 126 additions & 0 deletions src/huggingface_hub/_jobs_api.py
@@ -0,0 +1,126 @@
# coding=utf-8
# Copyright 2019-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional

from huggingface_hub import constants
from huggingface_hub._space_api import SpaceHardware
from huggingface_hub.utils._datetime import parse_datetime
from huggingface_hub.utils._http import fix_hf_endpoint_in_url


class JobStage(str, Enum):
"""
Enumeration of the possible stages of a Job on the Hub.

Value can be compared to a string:
```py
assert JobStage.COMPLETED == "COMPLETED"
```

Taken from https://github.com/huggingface/moon-landing/blob/main/server/job_types/JobInfo.ts#L61 (private url).
"""

# Copied from moon-landing > server > lib > Job.ts
COMPLETED = "COMPLETED"
CANCELED = "CANCELED"
ERROR = "ERROR"
DELETED = "DELETED"
RUNNING = "RUNNING"


class JobUrl(str):
"""Subclass of `str` describing a job URL on the Hub.

`JobUrl` is returned by `HfApi.run_job`. It inherits from `str` for backward
compatibility. At initialization, the URL is parsed to populate properties:
- endpoint (`str`)
- namespace (`Optional[str]`)
- job_id (`str`)
- url (`str`)

Args:
url (`Any`):
String value of the job url.
endpoint (`str`, *optional*):
Endpoint of the Hub. Defaults to <https://huggingface.co>.

Example:
```py
>>> HfApi.run_job("ubuntu", ["echo", "hello"])
JobUrl('https://huggingface.co/jobs/lhoestq/6877b757344d8f02f6001012', endpoint='https://huggingface.co', job_id='6877b757344d8f02f6001012')
```

Raises:
[`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError)
If URL cannot be parsed.
"""

def __new__(cls, url: Any, endpoint: Optional[str] = None):
url = fix_hf_endpoint_in_url(url, endpoint=endpoint)
return super(JobUrl, cls).__new__(cls, url)

def __init__(self, url: Any, endpoint: Optional[str] = None) -> None:
super().__init__()
# Parse URL
self.endpoint = endpoint or constants.ENDPOINT
namespace, job_id = url.split("/")[-2:]

# Populate fields
self.namespace = namespace
self.job_id = job_id
self.url = str(self) # just in case it's needed

def __repr__(self) -> str:
return f"JobUrl('{self}', endpoint='{self.endpoint}', job_id='{self.job_id}')"


@dataclass
class JobStatus:
stage: JobStage
message: Optional[str]

def __init__(self, **kwargs) -> None:
self.stage = kwargs["stage"]
self.message = kwargs.get("message")


@dataclass
class JobInfo:
id: str
created_at: Optional[datetime]
docker_image: Optional[str]
space_id: Optional[str]
command: Optional[List[str]]
arguments: Optional[List[str]]
environment: Optional[Dict[str, Any]]
secrets: Optional[Dict[str, Any]]
flavor: Optional[SpaceHardware]
status: Optional[JobStatus]

def __init__(self, **kwargs) -> None:
self.id = kwargs["id"]
created_at = kwargs.get("createdAt") or kwargs.get("created_at")
self.created_at = parse_datetime(created_at) if created_at else None
self.docker_image = kwargs.get("dockerImage") or kwargs.get("docker_image")
self.space_id = kwargs.get("spaceId") or kwargs.get("space_id")
self.command = kwargs.get("command")
self.arguments = kwargs.get("arguments")
self.environment = kwargs.get("environment")
self.secrets = kwargs.get("secrets")
self.flavor = kwargs.get("flavor")
        status = kwargs.get("status")
        self.status = JobStatus(**status) if isinstance(status, dict) else None
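
# --- Illustrative usage (not part of this module): how `JobInfo` parses a raw
# API payload. The camelCase keys mirror those handled in `JobInfo.__init__`;
# all values below are made up for demonstration.
if __name__ == "__main__":
    payload = {
        "id": "6877b757344d8f02f6001012",
        "createdAt": "2025-07-16T12:00:00.000Z",
        "dockerImage": "python:3.12",
        "spaceId": None,
        "command": ["python", "-c", "print('hi')"],
        "arguments": [],
        "environment": {},
        "secrets": {},
        "flavor": "cpu-basic",
        "status": {"stage": "RUNNING", "message": None},
    }
    job = JobInfo(**payload)
    assert job.status.stage == JobStage.RUNNING
    assert job.created_at is not None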