Poor performance of image embedding model #111

@bmachek

Description

Hi there,

I'm thinking of using uform as a replacement for CLIP in my application.
Since uform supports ONNX out of the box, it would be a great addition to my existing ONNX-based stack.

However, performance seems poor on my Mac M4 Pro (24 GB).

I'm using the following code to generate a large number of image embeddings:

import os

import numpy as np
from PIL import Image
from tqdm import tqdm
from uform import get_model, Modality

# EMBEDDINGS_FILE, AVA_LABELS_FILE, AVA_IMAGES_DIR, EMBEDDING_BATCH_SIZE
# and AVADataset are defined elsewhere in my script.

def generate_embeddings():
    """PHASE 1: Generate and save embeddings using the UForm model."""
    if os.path.exists(EMBEDDINGS_FILE):
        print(f"Embeddings file already exists at {EMBEDDINGS_FILE}. Skipping.")
        return

    print("--- Starting Phase 1: Embedding Generation (UForm) ---")

    processors, models = get_model(
        'unum-cloud/uform3-image-text-multilingual-base',
        device=None,
        modalities=[Modality.IMAGE_ENCODER],
        backend="onnx",
    )

    model_image = models[Modality.IMAGE_ENCODER]
    processor_image = processors[Modality.IMAGE_ENCODER]
    
    embedding_dim = 256
    ava_dataset = AVADataset(AVA_LABELS_FILE, AVA_IMAGES_DIR)
    
    all_image_paths, all_scores_list, all_genres_list = [], [], []
    for path, score, genres in tqdm(ava_dataset, desc="Collecting valid dataset items"):
        all_image_paths.append(path)
        all_scores_list.append(score)
        all_genres_list.append(genres)

    print(f"Generating embeddings for {len(all_image_paths)} images...")
    
    all_embeds = []
    kept_indices = []  # dataset rows that actually produced an embedding

    for i in tqdm(range(0, len(all_image_paths), EMBEDDING_BATCH_SIZE), desc="Generating embeddings in batches"):
        batch_paths = all_image_paths[i:i + EMBEDDING_BATCH_SIZE]
        batch_images = []
        for j, image_path in enumerate(batch_paths):
            try:
                image = Image.open(image_path).convert("RGB")
                batch_images.append(image)
                kept_indices.append(i + j)
            except Exception as e:
                print(f"Error processing image {image_path}: {e}")
                continue

        if not batch_images:
            continue

        image_data = processor_image(batch_images)
        # encode() returns (features, embeddings) when return_features=True;
        # only the pooled embeddings are kept.
        _, image_embeddings = model_image.encode(image_data, return_features=True)
        all_embeds.extend(image_embeddings)

    all_embeds = np.array(all_embeds)

    print(f"Saving {len(all_embeds)} items to {EMBEDDINGS_FILE}...")
    np.savez_compressed(
        EMBEDDINGS_FILE,
        embeddings=all_embeds,
        # Filter so scores/genres stay aligned with the embeddings
        # even when some images failed to load.
        scores=np.array(all_scores_list)[kept_indices],
        genres=np.array(all_genres_list)[kept_indices],
        embedding_dim=embedding_dim)
    print("--- Phase 1 Finished ---")

I tried EMBEDDING_BATCH_SIZE values from 1 to 256, but I can't seem to get past roughly 1 embedding/s.
The images in the AVA dataset are small, so to my understanding the process should be faster. With OpenCLIP I got 8 to 16 embeddings/s with similarly sized models.
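To rule out my own measurement, this is roughly how I time it (a minimal sketch; measure_throughput, the warm-up, and the batch count are my own choices and not part of the pipeline above):

import time

from PIL import Image

def measure_throughput(model_image, processor_image, image_paths, batch_size, n_batches=20):
    """Rough embeddings/s estimate over a fixed number of batches."""
    def run_batch(paths):
        images = [Image.open(p).convert("RGB") for p in paths]
        # Same call as in the pipeline above; only the pooled embeddings matter.
        _, embeddings = model_image.encode(processor_image(images), return_features=True)
        return len(images)

    # Warm up one batch so one-time session/compile cost is not counted.
    run_batch(image_paths[:batch_size])

    done = 0
    start = time.perf_counter()
    for i in range(batch_size, (n_batches + 1) * batch_size, batch_size):
        done += run_batch(image_paths[i:i + batch_size])
    elapsed = time.perf_counter() - start
    print(f"{done / elapsed:.2f} embeddings/s over {done} images")

The numbers it reports match what tqdm shows, so I don't think the bottleneck is in my measurement or in image loading.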

This is example output from my script:

--- Starting Phase 1: Embedding Generation (UForm) ---
2025-10-09 09:35:34.196 python[39166:1255416] 2025-10-09 09:35:34.196442 [W:onnxruntime:, coreml_execution_provider.cc:113 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 112 number of nodes in the graph: 1056 number of nodes supported by CoreML: 739
Loading AVA labels...
Collecting valid dataset items: 255508it [00:01, 243766.45it/s]
Generating embeddings for 255508 images...
Generating embeddings in batches:   0%|                                                                                                                                              | 0/31939 [00:00<?, ?it/s]Context leak detected, CoreAnalytics returned false
Context leak detected, CoreAnalytics returned false
Context leak detected, CoreAnalytics returned false
Generating embeddings in batches:   0%|                                                                                                                                   | 3/31939 [00:05<15:00:16,  1.69s/it]

As you can see, CoreML is used, which is fine for my Mac. But if I watch asitop, only the CPU cores of my M4 are busy; there is no ANE and no GPU load.
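My reading of the GetCapability warning is that only 739 of the 1056 graph nodes run in CoreML, split across 112 partitions, so the rest falls back to the CPU and tensors presumably bounce between partitions on every run. To test whether forcing the compute units changes anything, I could build the onnxruntime session directly (a sketch assuming onnxruntime >= 1.20, where the CoreML EP accepts the ModelFormat / MLComputeUnits provider options; the model path is hypothetical, it would be wherever uform cached the ONNX image encoder):

import onnxruntime as ort

# Hypothetical path to the cached uform ONNX image encoder.
MODEL_PATH = "image_encoder.onnx"

session = ort.InferenceSession(
    MODEL_PATH,
    providers=[
        (
            "CoreMLExecutionProvider",
            {
                "ModelFormat": "MLProgram",  # newer CoreML model format
                "MLComputeUnits": "ALL",     # allow ANE + GPU + CPU
            },
        ),
        "CPUExecutionProvider",              # fallback for unsupported nodes
    ],
)
print(session.get_providers())

I haven't found a way to pass such provider options through get_model, though, so maybe I'm looking in the wrong place.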

Any ideas? (aka Help me, Obi-Wan Kenobi) 😄

Should the CoreML / ONNX warnings give me a hint?

Am I doing something wrong?

Best regards,
Bastian
