LSMTree compaction creates duplicate timestamped indexes that are not cleaned up #2701

@tae898

Description

When creating indexes on large datasets (33.8M records), ArcadeDB's LSMTree compaction process creates multiple timestamped duplicate indexes that persist in the database instead of being cleaned up after compaction completes.

Steps to Reproduce

  1. Import a large dataset (e.g., MovieLens ml-latest with 33,832,163 ratings)
  2. Create indexes on the imported data:
CREATE INDEX ON Movie (movieId) UNIQUE
CREATE INDEX ON Rating (userId) NOTUNIQUE
CREATE INDEX ON Rating (movieId) NOTUNIQUE
CREATE INDEX ON Link (movieId) UNIQUE
CREATE INDEX ON Tag (movieId) NOTUNIQUE
  3. Query the schema to see all indexes:
SELECT name, typeName, properties, unique, automatic
FROM schema:indexes
ORDER BY typeName, name
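
One way to compare the output of this query against the expected count is to tally the rows per type. The following is a minimal sketch that assumes the rows have already been fetched into plain dictionaries with `name` and `typeName` keys. The helper `count_indexes_by_type` and the sample rows are illustrative only and are not part of any ArcadeDB API; whether the timestamped entries report the same `typeName` as the main index is also an assumption.

```python
from collections import Counter


def count_indexes_by_type(rows):
    """Return {typeName: number of indexes} for rows shaped like schema:indexes output."""
    return Counter(row["typeName"] for row in rows)


# Hypothetical rows; the index names are the ones quoted in this report.
rows = [
    {"name": "Movie[movieId]", "typeName": "Movie"},
    {"name": "Movie_0_172987397898984", "typeName": "Movie"},
    {"name": "Movie_1_172987421520553", "typeName": "Movie"},
]

counts = count_indexes_by_type(rows)
# With one CREATE INDEX per type, any count above 1 is worth inspecting.
suspicious = {type_name: n for type_name, n in counts.items() if n > 1}
print(suspicious)
```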

Expected Behavior

Expected 5 indexes total (one per CREATE INDEX command).

Actual Behavior

Found 80 indexes instead of 5 - with 15+ timestamped duplicates per table:

  • Movie[movieId] (expected)
  • Movie_0_172987397898984 (duplicate)
  • Movie_1_172987421520553 (duplicate)
  • Movie_2_172987445142122 (duplicate)
  • ... (13 more duplicates)

All duplicates are marked as automatic=true.

Analysis

Based on source code review:

  1. LSMTreeIndexMutable.java (line 168):
public LSMTreeIndexCompacted createNewForCompaction() {
    final String newName = componentName.substring(0, last_) + "_" + System.nanoTime();
    return new LSMTreeIndexCompacted(..., newName, ...);
}
  2. LSMTreeIndex.java (line 548):
protected LSMTreeIndexMutable splitIndex(...) {
    final String newName = mutable.getName().substring(0, last_) + "_" + System.nanoTime();
    final LSMTreeIndexMutable newMutableIndex = new LSMTreeIndexMutable(..., newName, ...);
}

These timestamped index files are created during compaction but appear not to be properly cleaned up after compaction completes.

Impact

  • Functional: ✅ Queries work correctly using the main indexes
  • Performance: ⚠️ Duplicates don't affect query speed but waste disk space
  • Storage: ❌ 16x storage overhead for index files

Environment

  • Dataset: MovieLens ml-latest (33,832,163 ratings, 86,538 movies, 2,328,316 tags, 9,742 links)
  • ArcadeDB: Python bindings via arcadedb_embedded
  • JVM Heap: 8GB
  • Database: Embedded mode

Logs

During index creation on large dataset:

⚠️ Index creation failed: Command failed: com.arcadedb.exception.NeedRetryException:
Cannot create a new index while asynchronous tasks are running (LSMTreeIndexCompactor)
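
Since the error is a NeedRetryException, a client-side retry with a short wait between attempts may be enough to get past a running compaction. The sketch below is a generic helper, not an ArcadeDB feature. It assumes the error raised in Python carries the Java exception class name in its message, as in the log above; the name `retry_on_need_retry` and its parameters are hypothetical.

```python
import time


def retry_on_need_retry(action, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call action(); retry only while the error text mentions NeedRetryException."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            retryable = "NeedRetryException" in str(exc)
            if not retryable or attempt == attempts:
                # Unrelated error, or out of attempts: let the caller see it.
                raise
            # Linear backoff: wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
```

Here `action` would be whatever callable issues the CREATE INDEX command in the calling code. The `sleep` parameter exists only so the wait can be replaced in tests.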

LSMTree compaction logs show:

LSMTreeIndex 'Movie[movieId]' compacted 50 pages, remaining 0 pages
(totalKeys=289037 totalValues=2251732)

Questions

  1. Are timestamped index files intended to be temporary during compaction?
  2. Should they be automatically cleaned up after compaction completes?
  3. Is there a configuration to control compaction cleanup behavior?

Suggested Fix

After compaction completes, cleanup logic should:

  1. Identify timestamped index files matching pattern {indexName}_\d+
  2. Remove them from schema if they're marked as temporary/compaction artifacts
  3. Delete the corresponding physical files
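
Step 1 can be sketched on the client side with a regular expression. The pattern below is an assumption inferred from the names seen in this report (for example Movie_0_172987397898984), that is, a prefix followed by a small integer and a System.nanoTime() value; it is not taken from the ArcadeDB source, and `split_timestamped` is a hypothetical helper name.

```python
import re

# Assumed shape, inferred from names such as Movie_0_172987397898984:
# <prefix>_<small integer>_<System.nanoTime() value>
TIMESTAMPED = re.compile(r"^(?P<prefix>.+)_(?P<ordinal>\d+)_(?P<stamp>\d+)$")


def split_timestamped(names):
    """Partition index names into (timestamped, other) using the assumed pattern."""
    timestamped, other = [], []
    for name in names:
        (timestamped if TIMESTAMPED.match(name) else other).append(name)
    return timestamped, other


names = ["Movie[movieId]", "Movie_0_172987397898984", "Movie_1_172987421520553"]
timestamped, other = split_timestamped(names)
print(timestamped)
print(other)
```

A name-only match cannot tell a compaction artifact from a user-created index that happens to follow the same shape, which is why step 2 asks for an explicit temporary marker in the schema.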

Workaround

Users can manually drop timestamped indexes:

DROP INDEX Movie_0_172987397898984;
-- Repeat for all timestamped duplicates
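
The repetition can be scripted. The sketch below only builds the statements in the form shown above and leaves execution to the caller; `drop_statements` is a hypothetical helper, and it deliberately rejects names that contain characters other than letters, digits, and underscores, because quoting rules for such names are not covered here.

```python
import re

# Only plain identifiers are accepted, so no quoting is needed in the statement.
PLAIN_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def drop_statements(index_names):
    """Build one DROP INDEX statement per name, refusing names that need quoting."""
    statements = []
    for name in index_names:
        if not PLAIN_IDENTIFIER.fullmatch(name):
            raise ValueError(f"refusing to build a statement for {name!r}")
        statements.append(f"DROP INDEX {name};")
    return statements


for statement in drop_statements(["Movie_0_172987397898984", "Movie_1_172987421520553"]):
    print(statement)
```

The list of names should be reviewed by hand before any of the generated statements are run.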

However, this requires knowing which indexes are duplicates vs. legitimate user-created indexes.

Tested on 25.10.1-SNAPSHOT
