Description
I generated BI dataset CSVs using the ldbc_snb_datagen_spark repo. Data generation succeeded, but the CSVs cannot be directly imported into Neo4j using neo4j-admin database import because the required Neo4j-style header annotations (:ID, :START_ID, :END_ID) are missing.
🔄 Steps to Reproduce
Clone repo and install dependencies:
git clone https://github.com/ldbc/ldbc_snb_datagen_spark.git
cd ldbc_snb_datagen_spark/scripts
./install-dependencies.sh
Build:
./build.sh
Get Spark:
./get-spark-to-home.sh
Generate BI dataset:
cd ~/ldbc_snb_datagen_spark/tools
./run.py -- --format csv --format-options raw=true,labels=true,header=true,quoteAll=true --scale-factor 1 --mode bi
Data is generated under:
~/ldbc_snb_datagen_spark/tools/out/graphs/csv/bi/composite-merged-fk/
Try Neo4j import:
$NEO4J_HOME/bin/neo4j-admin database import full neo4j \
  --overwrite-destination \
  --delimiter="|" \
  --nodes=Person=/ldbc_snb_datagen_spark/tools/out/graphs/csv/bi/composite-merged-fk/initial_snapshot/dynamic/Person/*.csv \
  --relationships=KNOWS=/ldbc_snb_datagen_spark/tools/out/graphs/csv/bi/composite-merged-fk/initial_snapshot/dynamic/Person_knows_Person/*.csv
Import fails with error:
Caused by: org.neo4j.internal.batchimport.input.HeaderException:
Missing header of type START_ID, among entries [creationDate, Person1Id, Person2Id]
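To confirm which column names the importer actually sees, the header row of every generated part file can be printed. This is only a small diagnostic sketch: `print_headers` is a hypothetical helper name, not something shipped with the datagen, and it assumes each part file carries its own header row (as with header=true above).

```shell
#!/bin/sh
# print_headers DIR
# Print "<file>: <first line>" for every CSV found under DIR, sorted by path.
# Hypothetical helper for inspecting the generated headers.
print_headers() {
  find "$1" -type f -name '*.csv' | sort | while IFS= read -r f; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
  done
}

# Example (path taken from the steps above):
# print_headers ~/ldbc_snb_datagen_spark/tools/out/graphs/csv/bi/composite-merged-fk/initial_snapshot/dynamic/Person_knows_Person
```

For the relationship files this should show the same three plain names that appear in the HeaderException message.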
✅ Expected Behavior
Generated CSV files for the Neo4j import format should include proper headers, e.g.:
Node file (Person):
PersonId:ID(Person)|firstName:string|lastName:string|...
Relationship file (KNOWS):
Person1Id:START_ID(Person)|Person2Id:END_ID(Person)|creationDate:long
This would let users import directly with neo4j-admin, without manually creating header files.
❌ Actual Behavior
The generated CSVs contain plain column names only, e.g.:
Person_knows_Person:
creationDate|Person1Id|Person2Id
Person:
PersonId|firstName|lastName|...
neo4j-admin cannot import these files because no column is marked as :ID, :START_ID, or :END_ID.
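As a possible interim workaround, the header row of each part file could be rewritten in place before running the import. The sketch below is not part of the datagen: `annotate_header` is a hypothetical helper, and it assumes the `|` delimiter, a header row in every part file, and the column names shown in this issue. The `Person` ID space is only an illustration. Quotes around header names (from quoteAll=true) are dropped; data rows are left untouched.

```shell
#!/bin/sh
# annotate_header FILE OLD=NEW [OLD=NEW ...]
# Replace whole column names in the header row of a '|'-delimited CSV.
# Hypothetical helper; mappings must not contain spaces.
annotate_header() {
  ah_file=$1
  shift
  ah_tmp=$(mktemp) || return 1
  awk -v maps="$*" '
    BEGIN {
      FS = OFS = "|"
      # Build the rename table from the space-separated OLD=NEW pairs.
      n = split(maps, pairs, " ")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        rename[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
      }
    }
    NR == 1 {
      for (c = 1; c <= NF; c++) {
        name = $c
        gsub(/^"|"$/, "", name)   # strip surrounding quotes, if any
        $c = (name in rename) ? rename[name] : name
      }
    }
    { print }
  ' "$ah_file" > "$ah_tmp" && cat "$ah_tmp" > "$ah_file"
  ah_status=$?
  rm -f "$ah_tmp"
  return $ah_status
}

# Example (run once per part file; directory elided here):
# for f in .../Person_knows_Person/*.csv; do
#   annotate_header "$f" 'Person1Id=:START_ID(Person)' 'Person2Id=:END_ID(Person)'
# done
# for f in .../Person/*.csv; do
#   annotate_header "$f" 'PersonId=PersonId:ID(Person)'
# done
```

Columns that are not listed keep their plain names, so Neo4j would import them as string properties unless a type suffix is mapped as well.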
🔧 Environment
Repo: ldbc_snb_datagen_spark
Commit:
Neo4j version: 5.26.0
JDK: 21 (runtime), switched to 11 for benchmark driver
OS: Ubuntu 22.04 ARM64 (AWS Graviton)
🙏 Request
Is this a bug in the CSV export logic for --format csv with --mode bi?
Or is the expectation that users must provide Neo4j-specific header files manually?
If headers are intentionally excluded, could the docs clarify the steps needed to adapt the CSVs for Neo4j import?