-
-
Notifications
You must be signed in to change notification settings - Fork 617
Description
Bug Description:
Hi,
I'm joining a vector table (t2
, primary search source) with another one (t1
) that contains attributes needed to filter the result set.
In some cases I need to use a OR
condition on the right-side table. This however generates a super slow query, much slower than running the query twice with either side of the OR
condition.
Looks like an (unneeded?) full table scan is performed in that case?
> DESCRIBE t1;
+-------------------+-----------+------------+
| Field | Type | Properties |
+-------------------+-----------+------------+
| id | bigint | |
[...]
| site_id | mva64 | |
| parent_id | mva64 | |
| flag | bool | |
+-------------------+-----------+------------+
19 rows in set (0.00 sec)
> DESCRIBE t2;
+------------+--------------+------------+
| Field | Type | Properties |
+------------+--------------+------------+
| id | bigint | |
| account_id | bigint | |
| t1_id | bigint | |
[...]
| embedding | float_vector | knn |
+------------+--------------+------------+
5 rows in set (0.00 sec)
This is the result of running the needed query, with the OR
condition: (t1.flag=1 OR ANY(parent_id) IN (67890))
> SELECT t1_id, knn_dist() AS knn_dist, t1.site_id AS site_id, t1.parent_id AS parent_id
FROM t2
INNER JOIN t1 ON t1.id=t2.t1_id
WHERE KNN(embedding, 4000, (...vector...), {ef=2000, oversampling=3.0, rescore=1})
AND knn_dist<=0.35
AND account_id=3
AND ANY(site_id)=12345
AND (t1.flag=1 OR ANY(parent_id) IN (67890))
LIMIT 2000 OPTION max_matches=2000;
show meta;
[...]
2000 rows in set (0.27 sec)
--- 2000 out of 5049 results in 269ms ---
+----------------+-------+
| Variable_name | Value |
+----------------+-------+
| total | 2000 |
| total_found | 5049 |
| total_relation | eq |
| time | 0.269 |
+----------------+-------+
Result returned in 269ms.
Now lets run the same query but with the right side of the OR
condition removed: (t1.flag=1)
> SELECT t1_id, knn_dist() AS knn_dist, t1.site_id AS site_id, t1.parent_id AS parent_id
FROM t2
INNER JOIN t1 ON t1.id=t2.t1_id
WHERE KNN(embedding, 4000, (...vector...), {ef=2000, oversampling=3.0, rescore=1})
AND knn_dist<=0.35
AND account_id=3
AND ANY(site_id)=12345
AND t1.flag=1
LIMIT 2000 OPTION max_matches=2000;
show meta;
Empty set (0.01 sec)
--- 0 out of 0 results in 10ms ---
+----------------+-------+
| Variable_name | Value |
+----------------+-------+
| total | 0 |
| total_found | 0 |
| total_relation | eq |
| time | 0.010 |
+----------------+-------+
4 rows in set (0.00 sec)
This is super fast.
Now with the left side of the OR
condition removed: (ANY(parent_id) IN (67890))
> SELECT t1_id, knn_dist() AS knn_dist, t1.site_id AS site_id, t1.parent_id AS parent_id
FROM t2
INNER JOIN t1 ON t1.id=t2.t1_id
WHERE KNN(embedding, 4000, (...vector...), {ef=2000, oversampling=3.0, rescore=1})
AND knn_dist<=0.35
AND account_id=3
AND ANY(site_id)=56391
AND ANY(parent_id) IN (3677791)
LIMIT 2000 OPTION comment='console',max_matches=2000;
show meta;
[...]
2000 rows in set (0.02 sec)
--- 2000 out of 5049 results in 19ms ---
+----------------+-------+
| Variable_name | Value |
+----------------+-------+
| total | 2000 |
| total_found | 5049 |
| total_relation | eq |
| time | 0.019 |
+----------------+-------+
4 rows in set (0.00 sec)
Still very fast, <20ms!
Increasing join_batch_size does improve the response time of the complete query (eg, using join_batch_size=4000
returns in 150ms instead of 260ms with the default join_batch_size=1000
), but this is still an order of magnitude greater.
Switching the order of the OR
condition doesn't change anything.
Secondary indices are disabled (secondary_indexes = 0
).
Manticore Search Version:
13.13.0 e5465fe44@25100704 (columnar 8.1.0 e1522a2@25100213) (secondary 8.1.0 e1522a2@25100213) (knn 8.1.0 e1522a2@25100213) (embeddings 1.0.1)
Operating System Version:
Ubuntu 24.04
Have you tried the latest development version?
No
Internal Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.
- Implementation completed
- Tests developed
- Documentation updated
- Documentation reviewed