[Bug]: EmbeddingStrategy mixes generator & embedding configs + leftover mock causes “fried rice” variations and 403s #1574

@NickKotte

crawl4ai version

0.7.6

Expected Behavior

The adaptive crawler should expand the query into relevant variations when configured with:
strategy="embedding",
embedding_model="openai/text-embedding-3-small",
embedding_llm_config=LLMConfig(
  provider="openai/gpt-4o-mini",  # openai/text-embedding-3-small fails here too
  api_token=OPENAI_API_KEY,
),
Query:
'capabilities, services,
certifications, description of work,
products'

Query Expansion:
Original query expanded to 12 variations

  1. Where can I compare prices for various products?
  2. What new products have been launched this year?
  3. What products are recommended for pet owners?
  4. What are the must-have products for outdoor activities?

Current Behavior

•	Query variations are replaced by a hard-coded “fried rice” list.
•	embedding_llm_config is reused for both generation and embeddings, so the wrong provider/model can hit the wrong API:
	•	Chat model sent to embeddings endpoint → 403.
	•	Embedding model used as a “provider” for text generation → failures or zero variations.
•	Embedding dimension sometimes mismatches the configured embedding_model.
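
The mix-up described above could be caught before any request is sent. The following is a minimal sketch, not crawl4ai code: `ModelRoles` and `looks_like_embedding_model` are hypothetical names, and the name-based check assumes that embedding model names contain the word "embedding" (true for the OpenAI models mentioned in this issue, not guaranteed for other providers).

```python
from dataclasses import dataclass


def looks_like_embedding_model(provider: str) -> bool:
    """Heuristic (assumption): embedding model names contain 'embedding'."""
    model = provider.split("/", 1)[-1]  # drop the "openai/" style prefix
    return "embedding" in model.lower()


@dataclass(frozen=True)
class ModelRoles:
    """Hypothetical container that keeps the two roles apart."""

    generator_provider: str  # chat model used for query expansion
    embedding_provider: str  # model sent to the embeddings endpoint

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means OK."""
        problems = []
        if looks_like_embedding_model(self.generator_provider):
            problems.append(
                f"{self.generator_provider} looks like an embedding model "
                "but is configured for text generation"
            )
        if not looks_like_embedding_model(self.embedding_provider):
            problems.append(
                f"{self.embedding_provider} looks like a chat model "
                "but is configured for embeddings"
            )
        return problems
```

With the values from this report, `ModelRoles("openai/gpt-4o-mini", "openai/text-embedding-3-small").validate()` returns an empty list, while swapping the two arguments reports both problems.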

Is this reproducible?

Yes

Inputs Causing the Bug

Around line 700, the map_query_semantic_space function uses leftover mock data.
It also doesn't use the correct model for query expansion or for embedding.
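
For illustration, here is a rough sketch of query expansion that never falls back to canned data. It is not the actual map_query_semantic_space implementation: `expand_query` and its `generate` parameter are hypothetical, and the chat model call is injected as a plain callable so the sketch stays independent of any provider.

```python
import re
from typing import Callable, List

# Matches list markers that chat models often prepend, e.g. "1. ", "2) ", "- ".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def expand_query(query: str, n: int, generate: Callable[[str], str]) -> List[str]:
    """Ask `generate` for up to n variations, one per line.

    Raises ValueError instead of substituting mock data when nothing usable
    comes back, so a misconfigured model is visible to the caller.
    """
    prompt = f"Write {n} search query variations, one per line, for: {query}"
    raw = generate(prompt)

    variations: List[str] = []
    for line in raw.splitlines():
        text = _LIST_MARKER.sub("", line).strip()
        if text and text not in variations:  # skip blanks and duplicates
            variations.append(text)

    if not variations:
        raise ValueError(f"no variations generated for query {query!r}")
    return variations[:n]
```

Failing loudly here is a design choice for the sketch: zero variations or an unrelated list would otherwise pass silently into the embedding step.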

Steps to Reproduce

A) Hard-coded query variations:
1.	Use strategy="embedding" and call AdaptiveCrawler.digest(...).
2.	Observe the variations list: always food-related (“fried rice…”) regardless of the query.
B) 403 when embeddings are requested:
1.	Create the config:
`AdaptiveConfig(
  strategy="embedding",
  embedding_model="openai/text-embedding-3-small",
  embedding_llm_config=LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=OPENAI_API_KEY,
  ),
  n_query_variations=12,
)`
2.	Run digest(...).
3.	Intermittently see `403 - You are not allowed to generate embeddings from this model`, or end up with variations: 0 and embedding dimensions/behavior inconsistent with the configured model.
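
The dimension mismatch in step 3 can be detected with a small check on the returned vector. This is a sketch under stated assumptions: `check_embedding_dim` and `EXPECTED_DIMS` are hypothetical names, and the table lists the default output sizes OpenAI documents for these models (a request that sets a custom `dimensions` value would need a different expectation).

```python
from typing import Sequence

# Default embedding sizes for a few OpenAI models (assumes no custom
# `dimensions` parameter was passed with the request).
EXPECTED_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_embedding_dim(embedding_model: str, vector: Sequence[float]) -> None:
    """Raise ValueError if `vector` does not match the configured model."""
    model = embedding_model.split("/", 1)[-1]  # drop the "openai/" prefix
    expected = EXPECTED_DIMS.get(model)
    if expected is None:
        return  # unknown model: nothing to compare against
    if len(vector) != expected:
        raise ValueError(
            f"{embedding_model} should return {expected}-dimensional vectors, "
            f"got {len(vector)}; a different model may have been used"
        )
```

A 384-dimensional vector arriving while `embedding_model="openai/text-embedding-3-small"` is configured would raise here, which points to a different model having produced the embedding.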

Code snippets

adaptive = AdaptiveCrawler(crawler, adaptive_cfg)
result = await adaptive.digest(start_url=start_url, query=query)

Or simply run the adaptive crawler example available in the Crawl4AI repository.

OS

macOS

Python version

3.13.5

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 403 - {'error': {'message': 'You are not allowed to generate embeddings from this model', 'type': 'invalid_request_error', 'param': None, 'code': None}}
and
Adaptive Crawl Stats - Query:
'capabilities, services,
certifications, description of work,
products'
Query Expansion:
Original query expanded to 4 variations

  1. how to add flavor to vegetable fried rice?
  2. what are the best vegetables to use in fried rice?
  3. are there any tips for making healthy fried rice with vegetables?
  4. how do I make vegetable fried rice from scratch?
    ...
