Filtered Search
## ANN Search Filtering
This document explains how to use the filtering capabilities to improve Approximate Nearest Neighbor (ANN) search.
### Why Filtering?
Filters allow you to narrow down search results dynamically based on:
- Metadata (e.g., tags, IDs, labels)
- Numeric thresholds (e.g., only items above/below a value)
- Custom user-defined logic
This improves both precision and flexibility of search.
#### Example: Python API
```python
from rust_annie import AnnIndex
import numpy as np
# 1. Create an index with vector dimension 128
index = AnnIndex(dimension=128)
# 2. Add data with metadata
vector0 = np.random.rand(128).astype(np.float32)
vector1 = np.random.rand(128).astype(np.float32)
index.add_item(0, vector0, metadata={"category": "A"})
index.add_item(1, vector1, metadata={"category": "B"})
# 3. Define a filter function (e.g., only include items where category == "A")
def category_filter(metadata):
return metadata.get("category") == "A"
# 4. Perform search with the filter applied
query_vector = np.random.rand(128).astype(np.float32)
results = index.search(query_vector, k=5, filter=category_filter)
print("Filtered search results:", results)
Supported Filters¶
This library supports applying filters to narrow down ANN search results dynamically.
| Filter type | Example |
|---|---|
| Equals | Filter.equals("category", "A") |
| Greater than | Filter.gt("score", 0.8) |
| Less than | Filter.lt("price", 100) |
| Custom predicate | Filter.custom(lambda metadata: ...) |
Filters work on the metadata you provide when adding items to the index.
New Feature: Filtered Search with Custom Python Callbacks¶
The library now supports filtered search using custom Python callbacks, allowing for more complex filtering logic directly in Python.
Example: Filtered Search with Python Callback¶
from rust_annie import AnnIndex, Distance
import numpy as np
# Create index
index = AnnIndex(3, Distance.EUCLIDEAN)
data = np.array([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0]
], dtype=np.float32)
ids = np.array([10, 20, 30], dtype=np.int64)
index.add(data, ids)
# Filter function
def even_ids(id: int) -> bool:
return id % 2 == 0
# Filtered search
query = np.array([1.0, 2.0, 3.0], dtype=np.float32)
filtered_ids, filtered_dists = index.search_filter_py(
query,
k=3,
filter_fn=even_ids
)
print(filtered_ids) # [10, 30] (20 is filtered out)
Sorting Behavior¶
The BruteForceIndex now uses total_cmp for sorting, which provides NaN-resistant sorting behavior. This change ensures that any NaN values in the data are handled consistently, preventing potential issues with partial comparisons.
Benchmarking Indices¶
The library now includes a benchmarking function to evaluate the performance of different index types, specifically PyHnswIndex and AnnIndex. This function measures the average, maximum, and minimum query times, providing insights into the efficiency of each index type.
Example: Benchmarking Script¶
import numpy as np
import time
from rust_annie import PyHnswIndex, AnnIndex
def benchmark(index_cls, name, dim=128, n=10_000, q=100, k=10):
print(f"\nBenchmarking {name} with {n} vectors (dim={dim})...")
# Data
data = np.random.rand(n, dim).astype(np.float32)
ids = np.arange(n, dtype=np.int64)
queries = np.random.rand(q, dim).astype(np.float32)
# Index setup
index = index_cls(dims=dim)
index.add(data, ids)
# Warm-up + Timing
times = []
for i in range(q):
start = time.perf_counter()
_ = index.search(queries[i], k=k)
times.append((time.perf_counter() - start) * 1000)
print(f" Avg query time: {np.mean(times):.3f} ms")
print(f" Max query time: {np.max(times):.3f} ms")
print(f" Min query time: {np.min(times):.3f} ms")
if __name__ == "__main__":
benchmark(PyHnswIndex, "HNSW")
benchmark(AnnIndex, "Brute-Force")
Integration & Extensibility¶
- Filters are exposed from Rust to Python via PyO3 bindings.
- New filters can be added by extending
src/filters.rsin the Rust code. - Filters integrate cleanly with the existing ANN index search logic, so adding or combining filters doesn't require changes in the core search API.
See also¶
```