Understanding Parallel Writes #802
Hi, TL;DR: why is it not possible to open a Repository from a process in a multiprocessing setting?

I am coming from this discussion where I received the recommendation to use Icechunk. After a few hours of refactoring I again got a sync and a threading version running, which is also a lot cleaner thanks to Icechunk's git-like method of updating the datacube. I really like the approach, I think it is a great concept, so thank you a lot for your work! :D

Now I am (again) in a deadlock when trying to use multiprocessing:

```python
from concurrent.futures import ProcessPoolExecutor

import icechunk


class DownloadAndLoadHandler:
    def __init__(self, storage: icechunk.Storage):  # further init args elided
        self.repo = icechunk.Repository.open_or_create(storage)
        ...


def _task(i):
    # each worker process builds its own handler, and hence its own Repository
    storage = icechunk.local_filesystem_storage("arcticdem_32m.zarr")
    h = DownloadAndLoadHandler(storage)
    res = h.do_something(i)
    return i, res


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(_task, range(3)))
    print(results)
```

Since the documentation about Distributed Writes states that one should pass a session to each process instead of creating new ones, this seems to be the expected behavior. But this got me wondering: why must the sessions be created and handled on the main process? Since I don't have control over how users will use their distributed computing setup, I can't assume access to the main thread, or at least I don't want to enforce it through my library design. Hence, the example in the docs won't work for me as-is.

I pushed the current status of my project: https://github.com/relativityhd/smart-geocubes/tree/66651de032437a4ec6ff32a4f67226475242f75f
Quick clarification question: in your intended production environment, will the data be stored in cloud object storage or local file storage?
There are fundamentally two different modes for distributed writes in Icechunk:

- **Cooperative**: one coordinating process creates a single writable session, shares it with the workers, merges their changes back, and commits once.
- **Uncooperative**: each worker independently opens the repo, makes its own session, and commits, retrying on conflict.

I think you want uncooperative. Here's an example of uncooperative mode based on zarr-developers/zarr-python#2868.

```python
import multiprocessing as mp

import icechunk as ic
import zarr


def get_storage():
    # Ironically, local storage is not safe w.r.t. race conditions on commit
    # due to limitations of object_store
    # storage = ic.local_filesystem_storage("data.icechunk")
    storage = ic.s3_storage(bucket="icechunk-test", prefix="zarr_issue_2868", from_env=True)
    return storage


def worker(i):
    print(f"Started worker {i}")
    storage = get_storage()
    repo = ic.Repository.open(storage)
    # keep trying until the commit succeeds
    while True:
        try:
            session = repo.writable_session("main")
            z = zarr.open(session.store, mode="r+")
            print(f"Opened store for {i} | {dict(z.attrs)}")
            # append this worker's id to the shared "done" attribute
            a = z.attrs.get("done", [])
            a.append(i)
            z.attrs["done"] = a
            session.commit(f"wrote from worker {i}")
            break
        except ic.ConflictError:
            # another worker committed first; start over from the new tip
            print(f"Conflict for {i}, retrying")


def main():
    storage = get_storage()
    repo = ic.Repository.create(storage)
    session = repo.writable_session("main")
    zarr.create(
        shape=(10, 10),
        chunks=(5, 5),
        store=session.store,
        overwrite=True,
    )
    session.commit("initialized dataset")
    p1 = mp.Process(target=worker, args=(1,))
    p2 = mp.Process(target=worker, args=(2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    session = repo.readonly_session(branch="main")
    z = zarr.open(session.store, mode="r")
    print(z.attrs["done"])
    print(list(repo.ancestry(branch="main")))


if __name__ == "__main__":
    main()
```

This outputs the final `done` attribute (containing both worker ids) followed by the commit ancestry of `main`.
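As an aside on the retry loop: it is plain optimistic concurrency, and the version-control API offers a lighter variant, assuming `Session.rebase` and `ConflictDetector` behave as in the Icechunk docs. Note that in this particular example both workers edit the same `done` attribute, so a plain `ConflictDetector` would still fail and you would fall back to the full retry; for disjoint writes it avoids redoing the work.

```python
# hypothetical variant of the commit step inside worker()
try:
    session.commit(f"wrote from worker {i}")
except ic.ConflictError:
    # replay this session's changes on top of the new branch tip;
    # ConflictDetector fails the rebase if both sides touched the same nodes
    session.rebase(ic.ConflictDetector())
    session.commit(f"wrote from worker {i}, rebased")
```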
This is a bit of a red herring, I think. You definitely do not want to attempt to create a Repo from scratch in the same place from multiple uncoordinated processes. Instead, create the repo once from the master process (as in my example above) and then open it from the child processes.
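Concretely, for the handler in the original post that would mean a split along these lines (hypothetical names; the create step runs once up front, workers only open):

```python
import icechunk


def init_datacube(storage: icechunk.Storage) -> None:
    # called exactly once, before any workers are spawned
    icechunk.Repository.open_or_create(storage)


class DownloadAndLoadHandler:
    def __init__(self, storage: icechunk.Storage):
        # workers never race to create the repo, they only open it
        self.repo = icechunk.Repository.open(storage)
```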
Hello, looking at the docs, I have tried a branching approach where at each write I would create a new branch for my dataset, write to the branch, and then try to merge it into main. However, it looks like Icechunk only currently supports [...]. Now the issue here is that by the time [...]. Wondering if anyone has had any luck executing a workflow like that? Thanks a lot for any information!
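For what it's worth, my understanding of the attempt described above, sketched with the branch APIs (`lookup_branch`, `create_branch`, `reset_branch`); treat the final fast-forward step as an assumption, since Icechunk has no true branch merge:

```python
import icechunk as ic

repo = ic.Repository.open(ic.local_filesystem_storage("data.icechunk"))

# branch off the current tip of main for this write
main_tip = repo.lookup_branch("main")
repo.create_branch("write-0001", snapshot_id=main_tip)

session = repo.writable_session("write-0001")
# ... write to session.store ...
session.commit("write on side branch")

# there is no merge; the closest thing is fast-forwarding main to the branch
# tip, which is only safe if main still points at the snapshot we branched from
repo.reset_branch("main", snapshot_id=repo.lookup_branch("write-0001"))
```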
@relativityhd adding `mp.set_start_method('forkserver')` at the start of `main` solved it for me. Could you try it?
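Applied to the script from the original post, that looks roughly like the sketch below (`_task` is the function from the question; the comment on why `fork` can deadlock is my best guess, not something confirmed in this thread):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# _task as defined in the original post

if __name__ == "__main__":
    # "forkserver" spawns each worker from a clean helper process instead of
    # fork()-ing the parent, which can deadlock when the parent holds locks
    # (e.g. inside native extensions) at fork time
    mp.set_start_method("forkserver")
    with ProcessPoolExecutor(max_workers=3) as executor:
        results = list(executor.map(_task, range(3)))
    print(results)
```

An equivalent, more local alternative is passing `mp_context=mp.get_context("forkserver")` to `ProcessPoolExecutor`, which avoids changing the global start method for the whole program.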