
Draft: Real vectors for tests #980


Open · wants to merge 4 commits into base: dev

Conversation

@I8dNLo (Contributor) commented Apr 29, 2025

Proof of concept for using real vectors in tests.


netlify bot commented Apr 29, 2025

Deploy Preview for poetic-froyo-8baba7 ready!

  • 🔨 Latest commit: 888d94e
  • 🔍 Latest deploy log: https://app.netlify.com/sites/poetic-froyo-8baba7/deploys/681a05d3acc47c00089cdd6b
  • 😎 Deploy Preview: https://deploy-preview-980--poetic-froyo-8baba7.netlify.app

@I8dNLo I8dNLo changed the base branch from master to dev April 29, 2025 10:27

coderabbitai bot commented Apr 29, 2025

📝 Walkthrough


The changes standardize the dense vector size used across the test modules at 384. Randomly generated vectors in test setups are replaced with predefined sample queries imported from tests.fixtures.points.sample_queries; this affects the initialization of query vectors and batch queries across multiple test files and keeps test data consistent. Helper functions and test-data initialization now use these sample queries instead of NumPy-generated random vectors.

In addition, the tests/fixtures/points.py module was extended to load and preprocess external vector data from disk, to provide a new sample_queries function that samples vectors from this dataset, and to let the vector-generation functions optionally use these preloaded vectors. The tests for multivector discovery queries now use the new sample_random_multivector function instead of the previous random multivector generator. No signatures or declarations of exported or public entities outside tests/fixtures/points.py were changed.
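The fixture shape described above can be sketched roughly as follows. This is a hypothetical reconstruction: the names `_PRELOADED_VECTORS` and `DENSE_VECTOR_SIZE`, and the in-memory stand-in for the on-disk dataset, are assumptions — the real tests/fixtures/points.py may load and name things differently.

```python
# Hypothetical sketch of the sampling fixture the walkthrough describes.
# _PRELOADED_VECTORS stands in for the vector data the real module loads from disk.
import random

DENSE_VECTOR_SIZE = 384  # the standardized dense vector size

_PRELOADED_VECTORS = [
    [float((i + j) % 7) for j in range(DENSE_VECTOR_SIZE)] for i in range(100)
]

def sample_queries(n: int) -> list[list[float]]:
    """Return n query vectors sampled without replacement from the preloaded set."""
    return random.sample(_PRELOADED_VECTORS, n)
```

Tests would then call `sample_queries(k)` wherever they previously generated `k` random NumPy vectors.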

Suggested reviewers

  • coszio

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5319ed7 and 888d94e.

📒 Files selected for processing (3)
  • tests/congruence_tests/test_common.py (1 hunks)
  • tests/congruence_tests/test_group_search.py (4 hunks)
  • tests/congruence_tests/test_multivector_discovery_queries.py (5 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/congruence_tests/test_multivector_discovery_queries.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/congruence_tests/test_group_search.py
  • tests/congruence_tests/test_common.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7


@coderabbitai coderabbitai bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, it is likely a permissions issue; please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (37)
docs/source/qdrant_client.http.api.snapshots_api.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the HTTP API index.
Ensure this new docs file is referenced in docs/source/qdrant_client.http.api.rst so that the snapshots API appears in the HTTP API toctree:


🏁 Script executed:

rg -n "snapshots_api.rst" docs/source/qdrant_client.http.api.rst

Length of output: 64


Add snapshots_api.rst to HTTP API toctree
The new docs/source/qdrant_client.http.api.snapshots_api.rst file isn’t referenced in the HTTP API index. Please update:

  • File: docs/source/qdrant_client.http.api.rst
    Under the existing .. toctree:: directive, add:
       snapshots_api.rst
docs/source/qdrant_client.http.api.cluster_api.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the HTTP API index.
Make sure this file is added to docs/source/qdrant_client.http.api.rst so the cluster API is included:


🏁 Script executed:

rg -n "cluster_api.rst" docs/source/qdrant_client.http.api.rst

Length of output: 62


Add cluster_api.rst to HTTP API docs index
The cluster_api.rst page isn’t referenced in docs/source/qdrant_client.http.api.rst, so the cluster endpoints won’t be included in the generated documentation. Please update the toctree in that file:

• File needing update:
docs/source/qdrant_client.http.api.rst

• Example patch:

 .. toctree::
    :maxdepth: 2

    collection_api.rst
+   cluster_api.rst
    # …other modules
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# docs/source/qdrant_client.http.api.rst

.. toctree::
   :maxdepth: 2

   collection_api.rst
   cluster_api.rst
   # …other modules
docs/source/qdrant_client.fastembed_common.rst (1)

4-7: 💡 Verification agent

❓ Verification inconclusive

Verify inclusion in the main documentation.
Ensure this file appears in your toctree (e.g., in docs/source/qdrant_client.rst or docs/source/index.rst):


🏁 Script executed:

rg -n "fastembed_common.rst" docs/source/qdrant_client.rst docs/source/index.rst

Length of output: 80


Add fastembed_common.rst to the main toctree
The file docs/source/qdrant_client.fastembed_common.rst is not referenced in any toctree in docs/source/qdrant_client.rst or docs/source/index.rst.

• In docs/source/qdrant_client.rst (or docs/source/index.rst), insert:

.. toctree::
   :maxdepth: 2
   :caption: Qdrant Client Reference

   qdrant_client.fastembed_common
.github/workflows/integration-tests.yml (2)

60-68: 💡 Verification agent

🧩 Analysis chain

Ensure HF_TOKEN is correctly documented

The addition of the Hugging Face token suggests integration with embedding models, which aligns with the PR objective of using real vectors for tests. Ensure this requirement is documented for developers who will run these tests locally.


🏁 Script executed:

#!/bin/bash
# Check if HF_TOKEN usage is documented in the project

echo "Checking for HF_TOKEN documentation in README or docs..."
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type md .
rg -i 'HF_TOKEN|hugging\s*face\s*token' --type rst .
rg -i 'HF_TOKEN|hugging\s*face\s*token' README.md || echo "No mention of HF_TOKEN in README.md"

echo "Checking for HF_TOKEN usage in test code..."
rg -i 'HF_TOKEN|hugging\s*face' --type py ./tests

Length of output: 513


Add HF_TOKEN setup documentation for local integration tests

The HF_TOKEN environment variable is now required to run the integration tests, but there’s no mention of this in the project docs. Please document how to obtain and set HF_TOKEN for local development and CI.

• In README.md (or under a new “Integration Tests” section), include:

  • A note that HF_TOKEN must be set before running ./tests/integration-tests.sh
  • Instructions or a link on how to generate an access token from Hugging Face (e.g., via https://huggingface.co/settings/tokens)
  • Examples for setting the variable in shell (export HF_TOKEN=…) or in a .env file

• If you maintain separate contributor or developer docs, add the same information there.
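A minimal snippet for such a README section might look like this. The token value is a placeholder, and the script path is taken from the review comment above; the actual docs should use the project's real conventions.

```shell
# Placeholder token; generate a real one at https://huggingface.co/settings/tokens
export HF_TOKEN="hf_your_token_here"

# Fail fast with a clear message if the token is missing
: "${HF_TOKEN:?HF_TOKEN must be set before running integration tests}"

# ./tests/integration-tests.sh   # then run the integration tests
```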


40-51: 💡 Verification agent

🧩 Analysis chain

Verify if conditional tests should run on all Python versions

The async client and inspection cache tests are currently limited to Python 3.10.x. Consider whether these should also run on other Python versions to ensure consistent behavior across environments.


🏁 Script executed:

#!/bin/bash
# Check if there are any Python version-specific dependencies or features used in the async client
# or inspection cache that would justify limiting these tests to Python 3.10.x only

echo "Checking for Python 3.10-specific imports or features in async client code..."
rg -i 'python\s*3\.10' --type py ./qdrant_client
rg -i 'match\s+.*:' --type py ./qdrant_client  # Pattern matching (3.10 feature)
rg -i 'PEP.*?585' --type py ./qdrant_client

echo "Checking if test scripts have any Python version checks..."
cat ./tests/async-client-consistency-check.sh
cat ./tests/inspection-cache-consistency-check.sh

Length of output: 2715


I don’t see any Python-3.10-only constructs in the client code or test scripts, and the tooling scripts themselves invoke generic python (rather than pinning to 3.10). The consistency-check steps should work on all supported Python versions.

Update the workflow to run these tests on every Python version

 .github/workflows/integration-tests.yml
@@ lines 40-51
-      - name: Run async client generation tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/async-client-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run async client generation tests
+        run: ./tests/async-client-consistency-check.sh
+        shell: bash

-      - name: Run inspection cache population tests
-        run: |
-          if [[ ${{ matrix.python-version }} == "3.10.x" ]]; then
-            ./tests/inspection-cache-consistency-check.sh
-          fi
-        shell: bash
+      - name: Run inspection cache population tests
+        run: ./tests/inspection-cache-consistency-check.sh
+        shell: bash

This ensures both checks execute across all Python versions in your CI matrix.

📝 Committable suggestion


      - name: Run async client generation tests
        run: ./tests/async-client-consistency-check.sh
        shell: bash

      - name: Run inspection cache population tests
        run: ./tests/inspection-cache-consistency-check.sh
        shell: bash
qdrant_client/embed/models.py (1)

1-12: ⚠️ Potential issue

Add from __future__ import annotations for Python 3.8 compatibility

list[...] generics are evaluated at runtime and therefore raise TypeError: 'type' object is not subscriptable on Python 3.8.
Unless the client has officially dropped 3.8 support, this will be a runtime blocker.

+from __future__ import annotations
 from typing import Union

Alternatively, fall back to typing.List / typing.Dict, but the future-import keeps the code cleaner.

Also applies to: 13-23

qdrant_client/embed/type_inspector.py (1)

42-50: ⚠️ Potential issue

Avoid premature return False when iterating collections

inspect() exits on the first non-BaseModel element, potentially missing valid models later in the iterable.

-                else:
-                    return False
+                else:
+                    # Skip non-pydantic items and continue inspecting
+                    continue

With this change, the loop continues scanning the remaining elements, ensuring no false negatives.

📝 Committable suggestion


        elif isinstance(points, Iterable):
            for point in points:
                if isinstance(point, BaseModel):
                    self.parser.parse_model(point.__class__)
                    if self._inspect_model(point):
                        return True
                else:
                    # Skip non-pydantic items and continue inspecting
                    continue
        return False
qdrant_client/common/version_check.py (1)

12-18: 🛠️ Refactor suggestion

Add timeout & exception handling to avoid client hangs

httpx.get is executed without a timeout and outside a try/except block.
A transient network issue will block indefinitely or raise and crash the caller.

-    response = httpx.get(rest_uri, headers=rest_headers, auth=auth_provider)
+    try:
+        response = httpx.get(
+            rest_uri,
+            headers=rest_headers,
+            auth=auth_provider,
+            timeout=5.0,  # seconds – tune if necessary
+        )
+    except httpx.HTTPError as exc:
+        logging.debug(f"Failed to fetch server version: {exc!r}")
+        return None
📝 Committable suggestion


def get_server_version(
    rest_uri: str, rest_headers: dict[str, Any], auth_provider: Optional[BearerAuth]
) -> Optional[str]:
    try:
        response = httpx.get(
            rest_uri,
            headers=rest_headers,
            auth=auth_provider,
            timeout=5.0,  # seconds – tune if necessary
        )
    except httpx.HTTPError as exc:
        logging.debug(f"Failed to fetch server version: {exc!r}")
        return None

    if response.status_code == 200:
        version_info = response.json().get("version", None)
qdrant_client/hybrid/fusion.py (1)

7-11: 🛠️ Refactor suggestion

Rank index should be 1-based in Reciprocal Rank Fusion

Classical RRF uses 1 / (k + rank) where rank starts at 1 (best item).
Using the 0-based loop index over-weights every entry by 1.

-    def compute_score(pos: int) -> float:
+    def compute_score(pos: int) -> float:
         ranking_constant = (
             2  # the constant mitigates the impact of high rankings by outlier systems
         )
-        return 1 / (ranking_constant + pos)
+        return 1 / (ranking_constant + pos + 1)
📝 Committable suggestion


    def compute_score(pos: int) -> float:
        ranking_constant = (
            2  # the constant mitigates the impact of high rankings by outlier systems
        )
        return 1 / (ranking_constant + pos + 1)
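The off-by-one can be seen directly by comparing the two formulas in isolation (a standalone sketch; `k=2` is the ranking constant from the snippet above, and the function names are illustrative):

```python
# Compare the zero-based and one-based RRF score for the top-ranked item.
def rrf_score_zero_based(pos: int, k: int = 2) -> float:
    return 1 / (k + pos)        # loop index used directly as the rank

def rrf_score_one_based(pos: int, k: int = 2) -> float:
    return 1 / (k + pos + 1)    # classical RRF: best item has rank 1

# With k=2, the best item (loop index 0) scores 1/2 in the zero-based variant
# but 1/3 in the classical formulation — every entry is shifted by one rank step.
assert rrf_score_zero_based(0) == 1 / 2
assert rrf_score_one_based(0) == 1 / 3
```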
qdrant_client/_pydantic_compat.py (2)

65-67: 🛠️ Refactor suggestion

Return a homogeneous type from model_config

For Pydantic v2 the function now returns a ConfigDict, not a plain dict, which breaks the declared return annotation and downstream dict operations (e.g. model_config(...)["populate_by_name"]). Wrap with dict() for v2 to keep behaviour identical across versions.

-        return model.model_config
+        return dict(model.model_config)
📝 Committable suggestion


def model_config(model: Type[BaseModel]) -> dict[str, Any]:
    if PYDANTIC_V2:
        return dict(model.model_config)

19-20: ⚠️ Potential issue

Prevent possible KeyError in to_jsonable_python for Pydantic v1

ENCODERS_BY_TYPE[type(x)] will raise when type(x) has no registered encoder (e.g., Decimal). A safe fallback keeps the helper usable for arbitrary objects:

-    def to_jsonable_python(x: Any) -> Any:
-        return ENCODERS_BY_TYPE[type(x)](x)
+    def to_jsonable_python(x: Any) -> Any:
+        encoder = ENCODERS_BY_TYPE.get(type(x))
+        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
📝 Committable suggestion


    def to_jsonable_python(x: Any) -> Any:
        encoder = ENCODERS_BY_TYPE.get(type(x))
        return encoder(x) if encoder else json.loads(json.dumps(x, default=str))
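The fallback behaviour can be exercised in isolation. This is a sketch with a stubbed encoder registry, not pydantic's real `ENCODERS_BY_TYPE`:

```python
import json
from decimal import Decimal

# Stub registry: only int has an encoder, mimicking a type with no registered encoder.
ENCODERS_BY_TYPE = {int: str}

def to_jsonable_python(x):
    encoder = ENCODERS_BY_TYPE.get(type(x))
    return encoder(x) if encoder else json.loads(json.dumps(x, default=str))

assert to_jsonable_python(7) == "7"                 # registered encoder used
assert to_jsonable_python(Decimal("1.5")) == "1.5"  # graceful fallback via str()
```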
qdrant_client/http/api_client.py (2)

86-89: 🛠️ Refactor suggestion

timeout is copied but not removed from query params

After converting "timeout" to an integer you still leave "timeout" inside params, so the same value is sent twice: once as an HTTP query parameter and once as the HTTPX timeout. This can confuse the server and breaks caching. Remove it from params after extraction.

-        if "params" in kwargs and "timeout" in kwargs["params"]:
-            kwargs["timeout"] = int(kwargs["params"]["timeout"])
+        if "params" in kwargs and "timeout" in kwargs["params"]:
+            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
📝 Committable suggestion


        url = urljoin((self.host or ""), url.format(**path_params))
        if "params" in kwargs and "timeout" in kwargs["params"]:
            kwargs["timeout"] = int(kwargs["params"].pop("timeout"))
        request = self._client.build_request(method, url, **kwargs)

109-120: ⚠️ Potential issue

Robust JSON parsing when handling 429 responses

Indexing resp["status"] will raise KeyError if the field is absent. Use .get() chaining to avoid masking the original error handling with an accidental KeyError.

-                resp = response.json()
-                message = resp["status"]["error"] if resp["status"] and resp["status"]["error"] else ""
+                resp = response.json()
+                message = (
+                    resp.get("status", {}).get("error")  # graceful if key missing
+                    or ""
+                )

Replicate the same fix in the async branch (lines 193-203).

📝 Committable suggestion


        if response.status_code == 429:
            retry_after_s = response.headers.get("Retry-After", None)
            try:
                resp = response.json()
                message = (
                    resp.get("status", {}).get("error")  # graceful if key missing
                    or ""
                )
            except Exception:
                message = ""

            if retry_after_s:
                raise ResourceExhaustedResponse(message, retry_after_s)

        if response.status_code in [200, 201, 202]:
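One edge case worth noting in the suggestion above: `resp.get("status", {})` still returns `None` when the key is present with a null value, which the original `resp["status"] and …` condition guarded against. An or-guard covers both cases — a standalone sketch, with `extract_error_message` as an illustrative name:

```python
# Extract a nested error message defensively; works whether "status" is
# missing, null, or a populated object.
def extract_error_message(resp: dict) -> str:
    return (resp.get("status") or {}).get("error") or ""

assert extract_error_message({}) == ""
assert extract_error_message({"status": None}) == ""
assert extract_error_message({"status": {"error": "rate limited"}}) == "rate limited"
```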
qdrant_client/http/api/aliases_api.py (1)

18-44: 💡 Verification agent

🧩 Analysis chain

Consider extracting common utility functions to a shared module

The JSON serialization utilities (to_json and jsonable_encoder) are duplicated across multiple API modules. Consider extracting these to a shared utility module to reduce duplication.


🏁 Script executed:

#!/bin/bash
# Check how many files contain duplicated JSON serialization utility functions

echo "Files with to_json function:"
rg "def to_json\(model: BaseModel" --type py qdrant_client/

echo "Files with jsonable_encoder function:"
rg "def jsonable_encoder\(" --type py qdrant_client/

Length of output: 1745


Extract common JSON serialization utilities to a shared module

The to_json and jsonable_encoder functions are duplicated in nine API modules. Centralize them into a single utility file to reduce maintenance overhead and ensure consistency.

Files containing the duplication:

  • qdrant_client/http/api/aliases_api.py
  • qdrant_client/http/api/beta_api.py
  • qdrant_client/http/api/collections_api.py
  • qdrant_client/http/api/distributed_api.py
  • qdrant_client/http/api/indexes_api.py
  • qdrant_client/http/api/points_api.py
  • qdrant_client/http/api/search_api.py
  • qdrant_client/http/api/service_api.py
  • qdrant_client/http/api/snapshots_api.py

Suggested steps:

  1. Create a new file, e.g. qdrant_client/http/api/utils.py, containing the shared functions:
    # qdrant_client/http/api/utils.py
    from typing import Any, Union, SetIntStr, DictIntStrAny
    from pydantic import BaseModel
    from .version import PYDANTIC_V2
    
    def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
        if PYDANTIC_V2:
            return model.model_dump_json(*args, **kwargs)
        return model.json(*args, **kwargs)
    
    def jsonable_encoder(
        obj: Any,
        include: Union[SetIntStr, DictIntStrAny] = None,
        exclude=None,
        by_alias: bool = True,
        skip_defaults: bool = None,
        exclude_unset: bool = True,
        exclude_none: bool = True,
    ):
        if hasattr(obj, "json") or hasattr(obj, "model_dump_json"):
            return to_json(
                obj,
                include=include,
                exclude=exclude,
                by_alias=by_alias,
                exclude_unset=bool(exclude_unset or skip_defaults),
                exclude_none=exclude_none,
            )
        return obj
  2. In each API module, remove the local definitions and update imports:
    - from pydantic import BaseModel
    - from .version import PYDANTIC_V2
    -
    - def to_json(…): …
    - def jsonable_encoder(…): …
    + from .utils import to_json, jsonable_encoder

This refactor will eliminate duplication and centralize future changes.

qdrant_client/http/api/search_api.py (1)

7-8: 🛠️ Refactor suggestion

Avoid import * – switch to explicit imports

Star imports obfuscate the public surface, make static analysis harder, and are flagged by Ruff (F403/F405). Import the required symbols explicitly or keep the models as m alias only.

-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m            # single, explicit entry-point
📝 Committable suggestion


from qdrant_client.http.models import models as m  # single, explicit entry-point
🧰 Tools
🪛 Ruff (0.8.2)

7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/hybrid/formula.py (2)

109-116: ⚠️ Potential issue

exponent.is_integer() fails when exponent is int

int objects gained an is_integer() method only in Python 3.12; on earlier supported versions, an int exponent raises AttributeError here.

-        if base >= 0 or (base != 0 and exponent.is_integer()):
+        if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):

Alternatively, cast to float and guard with math.isfinite.

📝 Committable suggestion


        # Check for valid input
        if base >= 0 or (base != 0 and (isinstance(exponent, int) or (isinstance(exponent, float) and exponent.is_integer()))):
            try:
                return math.pow(base, exponent)
            except OverflowError:
                pass

        raise_non_finite_error(f"{base}^{exponent}")
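The guarded check can be sketched as a standalone function (`checked_pow` and the `ValueError` are illustrative, not the module's actual names; the real code raises via `raise_non_finite_error`):

```python
import math

def checked_pow(base: float, exponent: float) -> float:
    # Negative bases are only valid with integer exponents; int.is_integer()
    # exists only on Python 3.12+, hence the explicit isinstance check.
    is_int_exponent = isinstance(exponent, int) or (
        isinstance(exponent, float) and exponent.is_integer()
    )
    if base >= 0 or (base != 0 and is_int_exponent):
        return math.pow(base, exponent)
    raise ValueError(f"non-finite result: {base}^{exponent}")

assert checked_pow(-2, 3) == -8.0   # int exponent: allowed for a negative base
assert checked_pow(4, 0.5) == 2.0   # fractional exponent: fine for base >= 0
```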

348-358: ⚠️ Potential issue

Replace assert False in tests

assert False is stripped under optimisation (python -O). Use pytest.raises or raise AssertionError directly.

-        parse_variable("$score[invalid]")
-        assert False
+        with pytest.raises(ValueError):
+            parse_variable("$score[invalid]")

Repeat for the second occurrence.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

350-350: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


356-356: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

pyproject.toml (1)

25-35: 🛠️ Refactor suggestion

Consider upper-bounding protobuf & numpy to avoid unexpected breaking majors

protobuf = ">=3.20.0" and the numpy constraints for the non-3.9 Python ranges are currently unbounded on the upper side. Both libraries regularly ship backwards-incompatible major releases (e.g., protobuf 5.*). A conservative constraint such as <5 (protobuf) and <2.2 (numpy) gives you time to test new majors before the ecosystem forces them in.

-protobuf = ">=3.20.0"
+protobuf = ">=3.20.0,<5"
📝 Committable suggestion


pydantic = ">=1.10.8,!=2.0.*,!=2.1.*,!=2.2.0"  # can't use poetry ">=1.10.8,<2.0 || >=2.2.1" since pip is complaining
grpcio = { version = ">=1.41.0", allow-prereleases = true }
protobuf = ">=3.20.0,<5"
urllib3 = ">=1.26.14,<3"
portalocker = "^2.7.0"
fastembed = [
    { version = "0.6.1", optional = true },
]
fastembed-gpu = [
    { version = "0.6.1", optional = true },
]
qdrant_client/connection.py (4)

98-103: ⚠️ Potential issue

stream-stream interceptor: remove unnecessary await

grpc.aio.StreamStreamCall isn’t awaitable. Replace with the pattern used above.


89-94: ⚠️ Potential issue

stream-unary interceptor suffers from identical pattern

Apply the same fix (call = continuation(...)) and adjust the return path.


70-76: ⚠️ Potential issue

Double-await bug: intercept_unary_unary breaks on real calls

response is awaited once here and converted to a plain protobuf message.
process_response then tries to await it again, raising TypeError: object ResponseMessage is not awaitable.

-        response = await continuation(new_details, next_request)
-        return await postprocess(response) if postprocess else response
+        call = continuation(new_details, next_request)          # call is awaitable
+        if postprocess:
+            return await postprocess(call)
+        return await call
📝 Committable suggestion


        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, False
        )
        next_request = next(new_request_iterator)
        call = continuation(new_details, next_request)
        if postprocess:
            return await postprocess(call)
        return await call

80-85: ⚠️ Potential issue

Same double-await issue for unary-stream

For streaming responses the call object is not awaitable at all, so the current await continuation(...) raises immediately.

-        response_it = await continuation(new_details, next(new_request_iterator))
-        return await postprocess(response_it) if postprocess else response_it
+        call = continuation(new_details, next(new_request_iterator))
+        return await postprocess(call) if postprocess else call
📝 Committable suggestion


        new_details, new_request_iterator, postprocess = await self._fn(
            client_call_details, iter((request,)), False, True
        )
        call = continuation(new_details, next(new_request_iterator))
        return await postprocess(call) if postprocess else call
qdrant_client/conversions/common_types.py (1)

68-75: ⚠️ Potential issue

list[...] breaks Py < 3.9 compatibility

list[PointId] and similar PEP-585 generics are only valid syntax starting with Python 3.9.
The header comment (lines 143-145) says we still support Python 3.7, so importing this
module on 3.7/3.8 will raise SyntaxError before users can even reach the runtime
version-check in their own code.

-PointsSelector = Union[
-    list[PointId],
+from typing import List  # at top of file
+PointsSelector = Union[
+    List[PointId],

Update every other occurrence of the builtin-generic syntax (list[...], dict[...])
in this file (there are several) or drop 3.7/3.8 from the supported versions list.

Committable suggestion skipped: line range outside the PR's diff.
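A minimal sketch of the pre-3.9-compatible spelling, using a simplified stand-in for `PointId` (the real alias is defined elsewhere in the module):

```python
from typing import Dict, List, Union

PointId = Union[int, str]  # simplified stand-in for the real alias

# typing.List / typing.Dict parse on Python 3.7+, unlike list[...] / dict[...]
PointsSelector = Union[List[PointId], Dict[str, List[PointId]]]

selector: PointsSelector = [1, 2, "abc"]
```

Note that `from __future__ import annotations` would not help here: these aliases are ordinary runtime assignments, and postponed evaluation only covers annotations.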

qdrant_client/http/api/distributed_api.py (2)

4-7: 🛠️ Refactor suggestion

Duplicate & wildcard imports clutter the namespace

Lines 4-7 import BaseModel twice and pull everything from
qdrant_client.http.models via *. Apart from the Ruff warnings, this:

  • masks the second BaseModel reference,
  • makes static analysis and IDE completion much harder,
  • risks name collisions.
-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m

(The as m alias is already used below.)
Please remove the redundant import and avoid import *.

📝 Committable suggestion


-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m
 from pydantic.version import VERSION as PYDANTIC_VERSION
🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)


18-23: ⚠️ Potential issue

to_json does not propagate exclude_unset/none to Pydantic v2 correctly

model_dump_json() takes exclude_none but not exclude_unset; passing an
unexpected kwarg raises TypeError.
Because jsonable_encoder always forwards exclude_unset=True, code paths running
under Pydantic v2 will explode.

Consider splitting the kwargs explicitly:

-def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
-    if PYDANTIC_V2:
-        return model.model_dump_json(*args, **kwargs)
+def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
+    if PYDANTIC_V2:
+        kwargs.pop("exclude_unset", None)  # not supported
+        return model.model_dump_json(*args, **kwargs)
📝 Committable suggestion


def to_json(model: BaseModel, *args: Any, **kwargs: Any) -> str:
    if PYDANTIC_V2:
        kwargs.pop("exclude_unset", None)  # not supported
        return model.model_dump_json(*args, **kwargs)
    else:
        return model.json(*args, **kwargs)
qdrant_client/embed/model_embedder.py (2)

32-35: ⚠️ Potential issue

ModelEmbedderWorker.start() causes TypeError

start() passes an unexpected threads keyword to ModelEmbedderWorker.__init__,
which only accepts batch_size and **kwargs.

-        return cls(threads=1, batch_size=batch_size, **kwargs)
+        # ParallelWorkerPool already sets the number of processes; no need for “threads”.
+        return cls(batch_size=batch_size, **kwargs)

Without this fix every attempt to spin up a worker process fails immediately.

📝 Committable suggestion


    @classmethod
    def start(cls, batch_size: int, **kwargs: Any) -> "ModelEmbedderWorker":
        # ParallelWorkerPool already sets the number of processes; no need for “threads”.
        return cls(batch_size=batch_size, **kwargs)

268-273: 🛠️ Refactor suggestion

Early return inside list traversal skips subsequent inference objects

If the first element of a list is not an inference object we return None
prematurely, silently discarding the rest of the list.

-        if isinstance(data, list):
-            for value in data:
-                if not isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
-                    return None
-                self._accumulate(value)
+        if isinstance(data, list):
+            for value in data:
+                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
+                    self._accumulate(value)
+            return None

This ensures all items are inspected.

📝 Committable suggestion


        if isinstance(data, list):
            for value in data:
                if isinstance(value, get_args(INFERENCE_OBJECT_TYPES)):
                    self._accumulate(value)
            return None
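The intended traversal can be illustrated with a toy helper (names hypothetical, not the real `_accumulate` API): every element is inspected, matching items are accumulated, and non-matching items are skipped rather than aborting the whole loop.

```python
from typing import Any, Callable

def accumulate_all(items: list, is_inference_obj: Callable[[Any], bool], sink: list) -> None:
    # inspect every element; do not bail out on the first non-matching one
    for value in items:
        if is_inference_obj(value):
            sink.append(value)

collected: list = []
accumulate_all(["doc", 42, "query"], lambda v: isinstance(v, str), collected)
# collected now holds both strings, despite the non-matching 42 in between
```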
qdrant_client/http/api/service_api.py (1)

4-7: 🛠️ Refactor suggestion

Remove the duplicate BaseModel import and avoid import *.

BaseModel is imported twice (lines 4 & 5), triggering Ruff F811 and creating ambiguous symbols.
In addition, the wildcard import on line 7 hides the public surface and hurts IDE auto-completion.

-from pydantic import BaseModel
-from pydantic.main import BaseModel
-from qdrant_client.http.models import *
+from pydantic import BaseModel
+from qdrant_client.http import models as m

(The last line lets you access generated models through the existing m alias.)
This keeps the namespace clean and eliminates the redefinition warning.

📝 Committable suggestion


from pydantic import BaseModel
from pydantic.version import VERSION as PYDANTIC_VERSION
from qdrant_client.http import models as m
🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/http/api/collections_api.py (1)

4-7: 🛠️ Refactor suggestion

Same duplicate import & wildcard issue as in service_api.py.

Please apply the import-cleanup change here as well to silence F811/F403 and keep the public namespace tidy.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/grpc/collections_pb2.py (1)

50-52: ⚠️ Potential issue

Avoid polluting the module namespace with None.

globals()['None'] = 0 creates a module attribute literally named None and risks severe confusion:

from qdrant_client.grpc.collections_pb2 import None  # still a SyntaxError: None is a reserved keyword

Although it doesn't overwrite the built-in constant, the attribute cannot be imported or referenced by name at all; it is only reachable via getattr(collections_pb2, "None"), which trips up introspection, IDE completion, and anyone reading the module.
If the enum generator must expose a “None” value, prefer a safer alias such as Modifier_None = 0.
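A small illustration of why the attribute is awkward, using a `SimpleNamespace` as a stand-in for the generated module:

```python
import types

# toy stand-in for the generated protobuf module namespace
collections_pb2 = types.SimpleNamespace()

# what globals()['None'] = 0 effectively does inside the module
setattr(collections_pb2, "None", 0)

# `from collections_pb2 import None` cannot be written (None is a keyword),
# so the only way to read the value back is getattr:
modifier_none = getattr(collections_pb2, "None")
```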

qdrant_client/embed/embedder.py (2)

278-286: 🛠️ Refactor suggestion

⚠️ Potential issue

**options or {} parses, but not the way it reads

Inside a call expression ** accepts a full expression, so **options or {} binds as **(options or {}). The code therefore runs, but the precedence is non-obvious and invites misreading.

Refactor once and reuse locally-normalised kwargs:

-embedding_model_inst = self.get_or_init_model(model_name=model_name, **options or {})
+model_opts: dict[str, Any] = options or {}
+embedding_model_inst = self.get_or_init_model(model_name=model_name, **model_opts)

Consider extracting a small helper such as _normalize_options(options) to DRY this pattern across all _embed_* helpers.

Also applies to: 296-304, 327-338, 348-356, 365-373, 380-385


26-30: ⚠️ Potential issue

Config via class keywords only works on Pydantic v2

arbitrary_types_allowed=True in the class header is valid syntax (class keywords are forwarded to __init_subclass__), and Pydantic v2 accepts configuration this way, but Pydantic v1 does not and fails at class-creation time.
Move the flag into a configuration block so the model works under both major versions. Note that v2 rejects a model defining both a Config class and model_config, so pick one mechanism:

-class ModelInstance(BaseModel, Generic[T], arbitrary_types_allowed=True):  # type: ignore[call-arg]
-    model: T
-    options: dict[str, Any]
-    deprecated: bool = False
+class ModelInstance(BaseModel, Generic[T]):
+    model: T
+    options: dict[str, Any]
+    deprecated: bool = False
+
+    # Pydantic v2; for v1 use `class Config: arbitrary_types_allowed = True` instead
+    model_config = {"arbitrary_types_allowed": True}
📝 Committable suggestion


class ModelInstance(BaseModel, Generic[T]):
    model: T
    options: dict[str, Any]
    deprecated: bool = False

    # Pydantic v2; for v1 use `class Config: arbitrary_types_allowed = True` instead
    model_config = {"arbitrary_types_allowed": True}
qdrant_client/http/api/snapshots_api.py (1)

2-7: 🛠️ Refactor suggestion

Duplicate import & star-import hide real problems

  1. BaseModel is imported twice (line 4 and 5).
  2. from qdrant_client.http.models import * makes type-checking noisy and is the reason SnapshotPriority appears “possibly undefined”.
-from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
-from pydantic import BaseModel
-from pydantic.main import BaseModel
+from typing import IO, TYPE_CHECKING, Any, Dict, Set, TypeVar, Union
+from pydantic import BaseModel

Replace the star import with explicit names (or keep the models as m alias only) and import SnapshotPriority directly:

-from qdrant_client.http.models import *
-from qdrant_client.http.models import models as m
+from qdrant_client.http.models import models as m, SnapshotPriority

This eliminates re-definitions, improves IDE support, and resolves the Ruff warnings F403/F811/F405.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/http/api/points_api.py (1)

2-8: 🛠️ Refactor suggestion

Import hygiene: duplicated BaseModel and uncontrolled namespace

Same issue as in snapshots_api.py:

  • Redundant BaseModel import (line 4 & 5)
  • from qdrant_client.http.models import * obscures missing symbols such as WriteOrdering.

Adopt the same fix (keep a single BaseModel import and replace the star import with explicit names or the existing alias m). This will also silence Ruff F403/F811/F405.

🧰 Tools
🪛 Ruff (0.8.2)

5-5: Redefinition of unused BaseModel from line 4

(F811)


7-7: from qdrant_client.http.models import * used; unable to detect undefined names

(F403)

qdrant_client/async_qdrant_fastembed.py (1)

398-425: 🛠️ Refactor suggestion

⚠️ Potential issue

Avoid relying on assert for production-level validation

assert statements are stripped when Python is executed with the -O (optimized) flag, which would silently bypass all the collection-compatibility checks. Converting these into explicit exceptions keeps the safeguards intact in all execution modes and allows you to surface a precise error type/message to the caller.

-assert (
-    embeddings_size == vector_params.size
-), f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+if embeddings_size != vector_params.size:
+    raise ValueError(
+        f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+    )

(…repeat for the remaining assert statements in this block…)

📝 Committable suggestion


        (embeddings_size, distance) = self._get_model_params(model_name=self.embedding_model_name)
        vector_field_name = self.get_vector_field_name()
        assert isinstance(
            collection_info.config.params.vectors, dict
        ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
        assert (
            vector_field_name in collection_info.config.params.vectors
        ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}, expected {vector_field_name}"
        vector_params = collection_info.config.params.vectors[vector_field_name]
-       assert (
-           embeddings_size == vector_params.size
-       ), f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+       if embeddings_size != vector_params.size:
+           raise ValueError(
+               f"Embedding size mismatch: {embeddings_size} != {vector_params.size}"
+           )
        assert (
            distance == vector_params.distance
        ), f"Distance mismatch: {distance} != {vector_params.distance}"
        sparse_vector_field_name = self.get_sparse_vector_field_name()
        if sparse_vector_field_name is not None:
            assert (
                sparse_vector_field_name in collection_info.config.params.sparse_vectors
            ), f"Collection have incompatible vector params: {collection_info.config.params.vectors}"
            if self.sparse_embedding_model_name in IDF_EMBEDDING_MODELS:
                modifier = collection_info.config.params.sparse_vectors[
                    sparse_vector_field_name
                ].modifier
                assert (
                    modifier == models.Modifier.IDF
                ), f"{self.sparse_embedding_model_name} requires modifier IDF, current modifier is {modifier}"
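One way to keep these checks alive under -O is a tiny helper that raises instead of asserting. A sketch, with `require` as a hypothetical name:

```python
def require(condition: bool, message: str) -> None:
    # unlike `assert`, this check survives `python -O`
    if not condition:
        raise ValueError(message)

embeddings_size, expected_size = 384, 512
try:
    require(
        embeddings_size == expected_size,
        f"Embedding size mismatch: {embeddings_size} != {expected_size}",
    )
except ValueError as exc:
    error_text = str(exc)
```

The same helper can replace every assert in the block above, keeping the exact error messages.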
qdrant_client/async_client_base.py (1)

382-386: 🛠️ Refactor suggestion

Inconsistent “sync” method in an async interface

upload_points is defined as a synchronous method inside an otherwise asynchronous base class.
All real implementations that hit the network or disk will need an async API to avoid blocking the event loop.

Consider:

-    def upload_points(
+    async def upload_points(
         self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
     ) -> None:

and mark it @abstractmethod (see next comment).

📝 Committable suggestion


    async def upload_points(
        self, collection_name: str, points: Iterable[types.PointStruct], **kwargs: Any
    ) -> None:
        raise NotImplementedError()
qdrant_client/local/async_qdrant_local.py (1)

916-961: ⚠️ Potential issue

Critical bug: inconsistent initial scroll limit
In create_collection(), the first scroll uses limit=2 while subsequent loops use batch_size = 100. This will only copy two points initially instead of the intended batch size. It should read:

-            (records, next_offset) = await self.scroll(
-                from_collection_name, limit=2, with_vectors=True
-            )
+            (records, next_offset) = await self.scroll(
+                from_collection_name, limit=batch_size, with_vectors=True
+            )

Please align the initial limit with batch_size to ensure consistent data copying.

📝 Committable suggestion


        if src_collection and from_collection_name:
            batch_size = 100
            (records, next_offset) = await self.scroll(
                from_collection_name, limit=batch_size, with_vectors=True
            )
            self.upload_records(collection_name, records)
            while next_offset is not None:
                (records, next_offset) = await self.scroll(
                    from_collection_name, offset=next_offset, limit=batch_size, with_vectors=True
                )
                self.upload_records(collection_name, records)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/congruence_tests/test_discovery.py (1)

25-25: Improved test consistency by using sample queries.

Replacing random vector generation with predefined sample queries creates more consistent test behavior. However, the function still accepts a dims parameter that's no longer used.

Consider updating the function signature to remove the unused parameter:

-def random_vector(dims: int) -> list[float]:
+def random_vector(_: Optional[int] = None) -> list[float]:

Or better yet, rename the function to reflect its new behavior:

-def random_vector(dims: int) -> list[float]:
+def sample_vector(_: Optional[int] = None) -> list[float]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5adcf93 and 1073e87.

📒 Files selected for processing (7)
  • tests/congruence_tests/test_collections.py (1 hunks)
  • tests/congruence_tests/test_common.py (1 hunks)
  • tests/congruence_tests/test_discovery.py (1 hunks)
  • tests/congruence_tests/test_group_search.py (4 hunks)
  • tests/congruence_tests/test_query.py (3 hunks)
  • tests/congruence_tests/test_query_batch.py (3 hunks)
  • tests/congruence_tests/test_recommendation.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/congruence_tests/test_recommendation.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (17)
tests/congruence_tests/test_collections.py (1)

104-104: Vector size standardization improves test consistency.

The vector size increase from 2 to 384 aligns with the standardized vector size used across the test suite. This change ensures all test vectors have consistent dimensions, creating more realistic test scenarios.

tests/congruence_tests/test_common.py (2)

17-19: Standardized vector dimensions improve test realism.

Updating all dense vector size constants to 384 creates consistency across different vector types and makes tests more representative of real-world scenarios. This change supports the PR objective of using real vectors for testing.


22-24: Standardized sparse vector dimensions improve consistency.

Unifying sparse vector dimensions to 384 (from previously varied sizes) ensures consistency between dense and sparse vector tests. This standardization better represents real-world usage patterns.

tests/congruence_tests/test_discovery.py (1)

19-19: Added sample_queries import for using real vectors.

Good addition of the import for sample_queries from fixtures, supporting the move away from random vectors to predefined real vectors for testing.

tests/congruence_tests/test_query_batch.py (4)

22-26: Improved imports for real vector testing.

Good addition of sample_queries import along with the existing random vector generation functions, supporting the transition to using real vectors for testing.


46-50: Enhanced test realism with sample queries for text vectors.

Replacing random vector generation with predefined sample queries creates more consistent and realistic test inputs. This systematic change to how text queries and prefetch queries are constructed aligns well with the PR objective.


58-59: Standardized approach to image and code vector queries.

Good consistency in applying the same pattern of using sample queries for both image and code vector queries. This ensures all vector types benefit from the improved test realism.

Also applies to: 66-67


107-114: Improved DBSF fusion query tests with real vectors.

Using sample queries for the DBSF fusion prefetch queries completes the systematic replacement of random vectors, ensuring all test scenarios use realistic vectors.

tests/congruence_tests/test_group_search.py (5)

21-21: Good addition of the sample_queries import.

This change aligns with the PR's objective to use real vectors for testing, replacing random vectors with predefined samples.


30-33: Excellent standardization of test vectors.

Replacing random vector generation with predefined sample queries improves test determinism and consistency. This change ensures that tests use the same vector data across runs, making test results more reproducible.


222-223: Good standardization of vector sizes.

Using the text_vector_size constant instead of hardcoded values makes the code more maintainable and consistent with other tests. This change supports the standardization of vector sizes across the test suite.


233-239: Good update to query vector initialization.

Using sample_queries() instead of random vectors improves test consistency. The wrapping of the query vector in a NumPy array is maintained for the first test case, which preserves the existing test coverage for different input types.


241-243: Simplified function call by passing vector directly.

Passing the vector directly without unnecessary conversion improves code clarity. This change is consistent with the goal of standardizing how vectors are handled in tests.

tests/congruence_tests/test_query.py (4)

36-36: Good addition of the sample_queries import.

This change aligns with the PR's objective to standardize test vectors across the test suite.


50-58: Excellent standardization of query vector initialization.

Using predefined sample queries from sample_queries() instead of random vectors ensures test consistency and reproducibility. The modification to dense_vector_query_text_bis on line 56 (creating a slightly different vector by adding 42.0) is preserved, maintaining the existing test behavior while benefiting from standardized test data.


1465-1467: Good standardization of vector configuration.

Using the text_vector_size constant for vector configuration instead of hardcoded values improves consistency and maintainability. This change supports the standardization of vector sizes across the test suite.


1469-1469: Good update to fixture generation parameters.

Using the text_vector_size parameter in the generate_fixtures call ensures consistency with the vector configuration and other test files. This change further supports the standardization effort.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/fixtures/points.py (1)

141-142: Consider an edge case for the skip_vectors functionality

When using preloaded vectors and the skip_vectors option is enabled, there's no special handling for the case where vectors are skipped based on random chance (as seen in lines 143-146).

Consider how the skip_vectors functionality should work with preloaded vectors. If a vector is skipped, the index-to-vector mapping might become inconsistent.

You might want to add a comment explaining this behavior or adjust the code to handle this edge case properly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1073e87 and 5319ed7.

📒 Files selected for processing (1)
  • tests/fixtures/points.py (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: Integration tests
tests/fixtures/points.py

[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - poetic-froyo-8baba7
  • GitHub Check: Header rules - poetic-froyo-8baba7
  • GitHub Check: Pages changed - poetic-froyo-8baba7
🔇 Additional comments (4)
tests/fixtures/points.py (4)

26-42: Verify vector dimensionality matches

The function random_vectors now uses preloaded vectors when idx is provided. However, there's no validation to ensure that the dimensionality of the preloaded vectors matches the vector_sizes parameter.

Add validation to ensure that the dimensions of the preloaded vectors match the expected dimensions:

def random_vectors(vector_sizes: Union[dict[str, int], int], idx=None) -> models.VectorStruct:
    if isinstance(vector_sizes, int):
        if idx:
+            # Verify vector dimensions match
+            if len(_text_vectors_clean[idx]) != vector_sizes:
+                raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[idx])} does not match requested dimension {vector_sizes}")
            return _text_vectors_clean[idx]
        else:
            return np.random.random(vector_sizes).round(3).tolist()

This will help avoid subtle bugs when the dimensions don't match.


46-48: Handle potential index out of range errors

The random_multivectors function now uses sample_random_multivector but there's no check to ensure that vec_count is not larger than the total number of available vectors in _text_vectors_clean.

Add validation to ensure that vec_count doesn't exceed the number of available vectors:

def random_multivectors(vector_sizes: Union[dict[str, int], int]) -> models.VectorStruct:
    if isinstance(vector_sizes, int):
        vec_count = random.randint(1, 10)
+        vec_count = min(vec_count, len(_text_vectors_clean))
        return sample_random_multivector(vector_sizes, vec_count)

This will prevent index out of range errors.


51-53: Apply the same validation for multi-vector dictionary case

Similar to the previous comment, the multi-vector dictionary case needs the same validation to prevent potential index out of range errors.

Add validation for the dictionary case as well:

for vector_name, vector_size in vector_sizes.items():
    vec_count = random.randint(1, 10)
+    vec_count = min(vec_count, len(_text_vectors_clean))
    vectors[vector_name] = sample_random_multivector(vector_size, vec_count)

This ensures consistency between the two code paths.


124-125: Add bounds check for number of points

The generate_points function samples indices without checking if num_points exceeds the number of available vectors.

Add a validation to ensure num_points doesn't exceed the available vectors:

+ if num_points > len(_text_vectors_clean):
+     raise ValueError(f"Requested {num_points} points, but only {len(_text_vectors_clean)} vectors are available")
sampled_vectors = np.random.choice(len(_text_vectors_clean), size=num_points, replace=False)

This prevents potential runtime errors when requesting more points than available vectors.

Comment on lines +13 to +15
_text_vectors = np.load("data/text.npy", mmap_mode="r")
_text_vectors_unique = np.unique(_text_vectors, axis=0)
_text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()

🛠️ Refactor suggestion

Add fallback mechanism for missing data files

Currently, if the data files are missing, the test will fail completely. Consider adding a fallback mechanism that uses randomly generated vectors when the data files are not found.

Implement a try-except block to gracefully handle missing data files:

try:
    _text_vectors = np.load("data/text.npy", mmap_mode="r")
    _text_vectors_unique = np.unique(_text_vectors, axis=0)
    _text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
    _using_real_vectors = True
except FileNotFoundError:
    print("Warning: data/text.npy not found. Using randomly generated vectors instead.")
    _text_vectors_clean = []  # Empty list as fallback
    _using_real_vectors = False

try:
    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
    _query_vectors_unique = np.unique(_query_vectors, axis=0)
    _query_vectors_clean = _query_vectors_unique.tolist()
except FileNotFoundError:
    print("Warning: data/queries.npy not found. Using randomly generated vectors instead.")
    _query_vectors_clean = []  # Empty list as fallback

Then modify the functions to check _using_real_vectors and fall back to random generation when needed:

def sample_queries(n: int) -> list[np.array]:
+    if not _using_real_vectors or len(_query_vectors_clean) == 0:
+        # Fallback to random generation
+        return [np.random.random(384).round(3).tolist() for _ in range(n)]
    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
    return [_query_vectors_clean[i] for i in sampled_vectors]

This would make the tests more robust and still allow them to run without the data files.

Also applies to: 18-24, 65-68, 124-124

🧰 Tools
🪛 GitHub Actions: Integration tests

[error] 13-13: FileNotFoundError: No such file or directory: 'data/text.npy'. The test failed because the required data file 'data/text.npy' is missing.


⚠️ Potential issue

Fix the file path reference issue

The code is attempting to load data files with hardcoded paths, but the pipeline has failed with a FileNotFoundError: No such file or directory: 'data/text.npy'. This indicates that either:

  1. The data files are missing from the repository, or
  2. The path is incorrect relative to where the tests are executed

Consider one of these solutions:

  1. Include the data files in the PR or document how to obtain them
  2. Use a relative path that accounts for the project structure:
- _text_vectors = np.load("data/text.npy", mmap_mode="r")
+ _text_vectors = np.load(os.path.join(os.path.dirname(__file__), "../../data/text.npy"), mmap_mode="r")

Don't forget to add the import os statement at the top of the file.

📝 Committable suggestion


Suggested change
- _text_vectors = np.load("data/text.npy", mmap_mode="r")
- _text_vectors_unique = np.unique(_text_vectors, axis=0)
- _text_vectors_clean = _text_vectors_unique[~np.isnan(_text_vectors_unique).any(axis=1)].tolist()
+ import os
+ _text_vectors = np.load(
+     os.path.join(os.path.dirname(__file__), "../../data/text.npy"),
+     mmap_mode="r",
+ )
+ _text_vectors_unique = np.unique(_text_vectors, axis=0)
+ _text_vectors_clean = _text_vectors_unique[
+     ~np.isnan(_text_vectors_unique).any(axis=1)
+ ].tolist()
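As an alternative to os.path.join, a pathlib-based helper keeps the resolution relative to the test module and fails with a clearer message when the file is absent (a sketch only; the directory layout and the helper name `load_vectors` are assumptions):

```python
from pathlib import Path

import numpy as np

# Resolve the data directory relative to this module rather than the
# process working directory, so tests pass regardless of where pytest runs.
DATA_DIR = Path(__file__).resolve().parent.parent.parent / "data"

def load_vectors(name: str) -> np.ndarray:
    path = DATA_DIR / name
    if not path.exists():
        raise FileNotFoundError(
            f"Test data file {path} is missing; fetch the vector dataset first."
        )
    # mmap keeps memory usage low for large vector files
    return np.load(path, mmap_mode="r")
```

The explicit existence check turns a cryptic mid-test failure into an actionable message at load time.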

Comment on lines +65 to +68
def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
    return [_text_vectors_clean[i] for i in sampled_vectors]


🛠️ Refactor suggestion

Add dimension validation to sample_random_multivector

The sample_random_multivector function doesn't validate that the dimensions of the preloaded vectors match the requested vec_size.

Add dimension validation:

def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
    sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
+    # Verify vector dimensions match
+    for i in sampled_vectors:
+        if len(_text_vectors_clean[i]) != vec_size:
+            raise ValueError(f"Preloaded vector dimension {len(_text_vectors_clean[i])} does not match requested dimension {vec_size}")
    return [_text_vectors_clean[i] for i in sampled_vectors]

This will help catch dimension mismatches early.

📝 Committable suggestion


Suggested change
- def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
-     sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
-     return [_text_vectors_clean[i] for i in sampled_vectors]
+ def sample_random_multivector(vec_size: int, vec_count: int) -> list[list[float]]:
+     sampled_vectors = np.random.choice(len(_text_vectors_clean), size=vec_count, replace=False)
+     # Verify vector dimensions match
+     for i in sampled_vectors:
+         if len(_text_vectors_clean[i]) != vec_size:
+             raise ValueError(
+                 f"Preloaded vector dimension {len(_text_vectors_clean[i])} "
+                 f"does not match requested dimension {vec_size}"
+             )
+     return [_text_vectors_clean[i] for i in sampled_vectors]
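The same validation can be done once with a single NumPy shape check instead of a per-vector loop. This sketch adds an explicit `pool` argument purely for illustration (the real helper reads the module-level `_text_vectors_clean`):

```python
import numpy as np

def sample_random_multivector(pool: list, vec_size: int, vec_count: int) -> list:
    arr = np.asarray(pool, dtype=np.float32)
    # One shape check replaces the per-vector loop
    if arr.ndim != 2 or arr.shape[1] != vec_size:
        raise ValueError(
            f"Preloaded vectors have shape {arr.shape}, expected (*, {vec_size})"
        )
    idx = np.random.choice(arr.shape[0], size=vec_count, replace=False)
    return arr[idx].tolist()
```

Converting the pool to an array once also makes the sampling step a single fancy-indexing operation.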

Comment on lines +18 to +24
def sample_queries(n: int) -> list[np.array]:
    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
    _query_vectors_unique = np.unique(_query_vectors, axis=0)
    _query_vectors = _query_vectors_unique.tolist()
    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
    return [_query_vectors[i] for i in sampled_vectors]


🛠️ Refactor suggestion

Improve data loading efficiency and error handling

The sample_queries function loads the data from disk on every call, which is inefficient. Additionally, it lacks error handling for the file loading operation.

Consider loading the query vectors once at module level, similar to how you handle _text_vectors:

+ _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
+ _query_vectors_unique = np.unique(_query_vectors, axis=0)
+ _query_vectors_clean = _query_vectors_unique.tolist()

def sample_queries(n: int) -> list[np.array]:
-    _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
-    _query_vectors_unique = np.unique(_query_vectors, axis=0)
-    _query_vectors = _query_vectors_unique.tolist()
-    sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-    return [_query_vectors[i] for i in sampled_vectors]
+    sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+    return [_query_vectors_clean[i] for i in sampled_vectors]

Also, add exception handling to provide better error messages if the files can't be loaded.

📝 Committable suggestion


Suggested change
- def sample_queries(n: int) -> list[np.array]:
-     _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
-     _query_vectors_unique = np.unique(_query_vectors, axis=0)
-     _query_vectors = _query_vectors_unique.tolist()
-     sampled_vectors = np.random.choice(len(_query_vectors), size=n, replace=False)
-     return [_query_vectors[i] for i in sampled_vectors]
+ # Module-level loading & preprocessing of query vectors
+ _query_vectors = np.load("data/queries.npy", allow_pickle=True).astype(np.float32)
+ _query_vectors_unique = np.unique(_query_vectors, axis=0)
+ _query_vectors_clean = _query_vectors_unique.tolist()
+
+ def sample_queries(n: int) -> list[np.array]:
+     sampled_vectors = np.random.choice(len(_query_vectors_clean), size=n, replace=False)
+     return [_query_vectors_clean[i] for i in sampled_vectors]
