Self-contained tooling for indexing and searching the Nemotron Personas Japan dataset with local emulators (Qdrant, Elasticsearch, Neo4j). The CLI coordinates data ingestion, vector and keyword search, and persona graph context so developers can explore personas without reading the source first.
Data flow
- Parquet shards -> `PersonaRepository` (stream records with optional limits) -> `PersonaIndexer` -> Qdrant (vector), Elasticsearch (keyword), Neo4j (context graph)
- Query text -> embedder -> `PersonaSearchService` -> merge vector hits, keyword fallbacks, graph context
Key components
- `PersonaRepository` streams records from one or more parquet files (batch aware, limit ready).
- `PersonaIndexer` normalizes persona text fields, builds embeddings, and writes to each emulator service.
- `PersonaSearchService` embeds the query, runs Qdrant vector search, enriches results with Elasticsearch keyword hits and Neo4j persona context, then returns a combined list.
Each search result exposes a `score` field:
- For candidates returned by Qdrant, `score` is the cosine similarity reported by Qdrant (higher is better).
- When Elasticsearch supplies a fallback persona that was not in the vector shortlist, the Elasticsearch `_score` is mapped into the same `score` field.
- `--verbose` mode prints hit counts: `vector_hits`, `keyword_hits`, `context_calls`, and `results` so you can diagnose which backend produced the answer set.
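The score handling above can be illustrated with a small fusion sketch. It is an assumption-laden example, not the logic in `search_ja_persona/search.py`: the `Hit` class, the `fuse_hits` function, and the ordering rule (vector hits first, keyword fallbacks appended) are all hypothetical. Note that a cosine similarity and an Elasticsearch `_score` are on different scales, so the sketch records which backend produced each score instead of sorting them together.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    persona_id: str
    score: float
    source: str  # "vector" (cosine similarity) or "keyword" (Elasticsearch _score)


def fuse_hits(
    vector_hits: list[tuple[str, float]],
    keyword_hits: list[tuple[str, float]],
    limit: int,
) -> list[Hit]:
    """Keep vector hits in order, then append keyword-only personas as fallbacks.

    Hypothetical merge rule: the real service may rank or weight hits differently.
    """
    results = [Hit(pid, score, "vector") for pid, score in vector_hits]
    seen = {hit.persona_id for hit in results}
    for pid, es_score in keyword_hits:
        if pid not in seen:  # only personas missing from the vector shortlist
            results.append(Hit(pid, es_score, "keyword"))
            seen.add(pid)
    return results[:limit]


vector = [("p1", 0.82), ("p2", 0.77)]
keyword = [("p2", 7.4), ("p3", 5.1)]
fused = fuse_hits(vector, keyword, limit=5)
for hit in fused:
    print(hit.persona_id, hit.score, hit.source)
```

Here `p2` keeps its cosine score because it was already in the vector shortlist, while `p3` enters the list carrying its Elasticsearch `_score`.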
| Path | Purpose |
|---|---|
| `search_ja_persona/cli.py` | Rich CLI entry point for indexing, searching, downloading, and clearing emulators |
| `search_ja_persona/indexer.py` | Batch ingestion into Qdrant, Elasticsearch, and Neo4j |
| `search_ja_persona/search.py` | Query orchestration and hit fusion logic |
| `search_ja_persona/embeddings.py` | Embedding backends (hashed n-gram, SentenceTransformers, fastembed) |
| `search_ja_persona/services.py` | Thin HTTP transports for emulator APIs |
| `qa_samples/qa_sample.parquet` | 1k-row sample used by quick QA flows |
| `scripts/generate_qa_sample.py` | Regenerate the QA sample parquet from Hugging Face |
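For readers unfamiliar with the "hashed n-gram" backend named in the table, the general technique looks roughly like this. The sketch is generic and is not taken from `search_ja_persona/embeddings.py`; the function names, the 256-dimension default, and the use of character trigrams are assumptions chosen for illustration.

```python
import hashlib
import math


def hashed_ngram_embedding(text: str, dim: int = 256, n: int = 3) -> list[float]:
    """Map text to a fixed-size, L2-normalized vector by hashing character n-grams.

    Hypothetical parameters: `dim` and `n` are illustrative defaults only.
    """
    normalized = text.strip().lower()
    if not normalized:
        return [0.0] * dim
    if len(normalized) < n:
        grams = [normalized]
    else:
        grams = [normalized[i : i + n] for i in range(len(normalized) - n + 1)]
    vec = [0.0] * dim
    for gram in grams:
        # A cryptographic digest is stable across processes, unlike built-in hash().
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query = hashed_ngram_embedding("介護支援専門員")
same = hashed_ngram_embedding("介護支援専門員")
print(len(query), round(cosine(query, same), 6))  # 256 1.0
```

Character n-grams need no tokenizer, which makes the approach convenient for Japanese text, at the cost of capturing surface overlap only instead of meaning.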
- Python 3.12+
- `uv` for dependency management (recommended)
- Local emulators running: change into `emulator/` and use `./start-emulators.sh` (or `docker compose up -d`).
- Hugging Face cache: run the CLI downloader (safe to rerun).

      uv run python -m search_ja_persona.cli download-dataset \
        --dataset-name nvidia/Nemotron-Personas-Japan \
        --split train \
        --cache-dir .cache

  The cached parquet shards sit under `~/.cache/huggingface`. You can also copy them into `datasets/Nemotron-Personas-Japan/data/` as this repo already demonstrates.

- Optional: regenerate the bundled 1k sample parquet.

      uv run python -m scripts.generate_qa_sample --limit 1000
The repository ships with `qa_samples/qa_sample.parquet`. Regenerate it whenever you need a fresh slice or after updating the source dataset:
- Ensure the Hugging Face cache already contains `nvidia/Nemotron-Personas-Japan`.
  - Run the `download-dataset` command above, or allow the script to fall back to an existing `.arrow` shard in the cache directory.
- Execute the helper script (idempotent; overwrites the existing parquet):

      uv run python -m scripts.generate_qa_sample --limit 1000

  The script writes up to 1,000 rows into `qa_samples/qa_sample.parquet` using the persona text fields defined in `search_ja_persona/persona_fields.py`.

Override the default count with `--limit` when needed (for example, `--limit 2000`).
Ingest every shard in `datasets/Nemotron-Personas-Japan/data/` using a SentenceTransformer preset. Adjust `--batch-size` to match available memory; leaving `--limit` unset consumes all rows.
uv run python -m search_ja_persona.cli index \
--dataset datasets/Nemotron-Personas-Japan/data \
--batch-size 512 \
--embedder mini-lm \
--persona-fields all \
--qdrant-host localhost --qdrant-port 6333 \
--es-host localhost --es-port 9200 \
--neo4j-host localhost --neo4j-port 7474
Limit ingestion to the bundled sample parquet. The `--limit` guard ensures only the first 1,000 personas are processed even if you regenerate the sample with more rows.
uv run python -m search_ja_persona.cli index \
--dataset qa_samples/qa_sample.parquet \
--batch-size 128 \
--limit 1000 \
--embedder mini-lm \
--persona-fields all
Tip: `just qa-index embedder="mini-lm" persona_fields="all"` wraps the same command.
After every indexing run, `.cache/index_metadata.json` records the embedder preset, dimensions, persona fields, and collection/index names. Subsequent `search` runs reuse this metadata when you omit `--embedder` or `--persona-fields`.
Once indexing completes, issue free-text queries with combined vector plus keyword retrieval.
uv run python -m search_ja_persona.cli search \
--query "care manager with elder care experience" \
--limit 5 \
--format table \
--verbose
- `--format table` renders a Rich table; `--format json` prints structured JSON.
- `--verbose` surfaces per-backend hit statistics alongside the unified result list.
- To reuse the last indexed embedder or persona field set, omit `--embedder` and `--persona-fields` (the CLI will read `.cache/index_metadata.json`).
- `uv run python -m search_ja_persona.cli clear-emulators` drops the Qdrant collection, the Elasticsearch index, and the Neo4j persona nodes, then deletes the cached metadata (asks for confirmation).
- `just test` runs the full pytest suite (emulator integration tests are skipped unless the emulators are up and the dataset cache is populated).
- Ensure `./start-emulators.sh` completed and ports 6333, 9200, 7474 are reachable.
- Hugging Face downloads require authentication when the dataset is gated; pass `--token` to `download-dataset` if needed.
- If you switch embedder presets or persona field subsets, the CLI prompts to reset existing indexes so vector dimensions stay aligned across services.
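A quick way to confirm the emulator ports are reachable is a plain TCP connection test. The script below is a standalone sketch using only the standard library and is not part of the CLI; the `port_open` helper is hypothetical, and the host and ports are the defaults used in the commands above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False


# Default ports from the indexing command; adjust if your emulators differ.
SERVICES = {"Qdrant": 6333, "Elasticsearch": 9200, "Neo4j": 7474}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name:<14} localhost:{port} {state}")
```

An open port only shows that something is listening; it does not prove the service has finished starting, so a failed index run right after startup may still warrant a short wait and retry.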
With the pipeline indexed, you can explore prompts against the million-persona corpus or the 1k QA slice by swapping the dataset path and `--limit` flag in the commands above.