-
Hi, All. I am trying to run kbatch jobs for an analysis I've been developing using a dask cluster on the MSPC Hub. Everything was going well (and, strangely, the current code I have even ran for a long time the first time I tried it), but then I started getting a cryptic string of error output that has taken a while to pick apart. I believe I have finally traced it down to the following MRE (using the real dependency I'm working with, called FyeldGenerator):

1.) open a bash terminal

2.) create test_module.py, containing only:

# import the dependency
import FyeldGenerator

3.) launch an interactive Python session

4.) run the following code, interactively:

import dask_gateway
from dask.distributed import PipInstall

# get client
gateway = dask_gateway.Gateway()
cluster = gateway.new_cluster()
cluster.scale(1)  # problem of course occurs with >1 worker, too
client = cluster.get_client()

# install dependency on worker(s)
plugin = PipInstall(packages=["FyeldGenerator"])
client.register_worker_plugin(plugin)

# check that worker install(s) succeed(s)
def test_clust_fn():
    import FyeldGenerator
    return FyeldGenerator.__file__

out = client.submit(test_clust_fn)
client.gather(out)
# correct path output suggests pkg is successfully installed

# upload the test module
client.upload_file('test_module.py')

5.) get the following error (NOTE: full traceback copy-pasted at the bottom of this message):
This raises two crucial questions for me:
I apologize if there's something obvious that I'm doing wrong or missing! I've looked through all the relevant docs I could think of, but I may just have gotten something confused? Thanks in advance!

full traceback:

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[9], line 1
----> 1 client.upload_file('test_module.py')
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/client.py:3750, in Client.upload_file(self, filename, **kwargs)
3742 results = await asyncio.gather(
3743 self.register_scheduler_plugin(
3744 SchedulerUploadFile(filename), name=name
3745 ),
3746 self.register_worker_plugin(UploadFile(filename), name=name),
3747 )
3748 return results[1] # Results from workers upload
-> 3750 return self.sync(_)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/utils.py:351, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
349 return future
350 else:
--> 351 return sync(
352 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
353 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/utils.py:418, in sync(loop, func, callback_timeout, *args, **kwargs)
416 if error:
417 typ, exc, tb = error
--> 418 raise exc.with_traceback(tb)
419 else:
420 return result
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/utils.py:391, in sync.<locals>.f()
389 future = wait_for(future, callback_timeout)
390 future = asyncio.ensure_future(future)
--> 391 result = yield future
392 except Exception:
393 error = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.11/site-packages/tornado/gen.py:767, in Runner.run(self)
765 try:
766 try:
--> 767 value = future.result()
768 except Exception as e:
769 # Save the exception for later. It's important that
770 # gen.throw() not be called inside this try/except block
771 # because that makes sys.exc_info behave unexpectedly.
772 exc: Optional[Exception] = e
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/client.py:3742, in Client.upload_file.<locals>._()
3741 async def _():
-> 3742 results = await asyncio.gather(
3743 self.register_scheduler_plugin(
3744 SchedulerUploadFile(filename), name=name
3745 ),
3746 self.register_worker_plugin(UploadFile(filename), name=name),
3747 )
3748 return results[1]
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/client.py:4778, in Client._register_scheduler_plugin(self, plugin, name, idempotent)
4777 async def _register_scheduler_plugin(self, plugin, name, idempotent=False):
-> 4778 return await self.scheduler.register_scheduler_plugin(
4779 plugin=dumps(plugin),
4780 name=name,
4781 idempotent=idempotent,
4782 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/core.py:1359, in PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc(**kwargs)
1357 prev_name, comm.name = comm.name, "ConnectionPool." + key
1358 try:
-> 1359 return await send_recv(comm=comm, op=key, **kwargs)
1360 finally:
1361 self.pool.reuse(self.addr, comm)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/core.py:1143, in send_recv(comm, reply, serializers, deserializers, **kwargs)
1141 _, exc, tb = clean_exception(**response)
1142 assert exc
-> 1143 raise exc.with_traceback(tb)
1144 else:
1145 raise Exception(response["exception_text"])
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/core.py:922, in _handle_comm()
920 result = handler(**msg)
921 if inspect.iscoroutine(result):
--> 922 result = await result
923 elif inspect.isawaitable(result):
924 raise RuntimeError(
925 f"Comm handler returned unknown awaitable. Expected coroutine, instead got {type(result)}"
926 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/scheduler.py:5699, in register_scheduler_plugin()
5697 result = plugin.start(self)
5698 if inspect.isawaitable(result):
-> 5699 await result
5701 self.add_plugin(plugin, name=name, idempotent=idempotent)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/diagnostics/plugin.py:326, in start()
325 async def start(self, scheduler: Scheduler) -> None:
--> 326 await scheduler.upload_file(self.filename, self.data)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/core.py:523, in upload_file()
521 except Exception as e:
522 logger.exception(e)
--> 523 raise e
525 return {"status": "OK", "nbytes": len(data)}
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/core.py:519, in upload_file()
517 if load:
518 try:
--> 519 import_file(out_filename)
520 cache_loads.data.clear()
521 except Exception as e:
File /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/utils.py:1100, in import_file()
1098 for name in names_to_import:
1099 logger.info("Reload module %s from %s file", name, ext)
-> 1100 loaded.append(importlib.reload(importlib.import_module(name)))
1101 finally:
1102 if tmp_python_path is not None:
File /srv/conda/envs/notebook/lib/python3.11/importlib/__init__.py:126, in import_module()
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
File <frozen importlib._bootstrap>:1204, in _gcd_import()
File <frozen importlib._bootstrap>:1176, in _find_and_load()
File <frozen importlib._bootstrap>:1147, in _find_and_load_unlocked()
File <frozen importlib._bootstrap>:690, in _load_unlocked()
File <frozen importlib._bootstrap_external>:940, in exec_module()
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed()
File /tmp/dask-scratch-space/scheduler-q136p4zn/test_module.py:1
ModuleNotFoundError: No module named 'FyeldGenerator'
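A quick way to double-check whether the package is importable everywhere on the cluster is something like the following sketch (reusing the client from the MRE above; has_fyeld is just an illustrative helper, not part of my actual code):

import importlib.util

def has_fyeld():
    # True if the process running this function can import FyeldGenerator
    return importlib.util.find_spec("FyeldGenerator") is not None

client.run_on_scheduler(has_fyeld)  # check the scheduler's environment
client.run(has_fyeld)               # check every worker (dict keyed by worker address)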
-
In https://docs.dask.org/en/stable/futures.html?highlight=upload_file#distributed.Client.upload_file, the docs describe the file being uploaded to the scheduler as well as the workers.
This one I'm not too sure about... Dynamically installing / loading Python modules is always a bit tricky. Can you check where that code is executing? I see /tmp/dask-scratch-space/scheduler-q136p4zn/test_module.py in the traceback.
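For example, something along these lines (just a sketch, reusing your existing client object) would report the host, working directory, and interpreter for the scheduler and each worker:

import os
import socket
import sys

def where_am_i():
    # report where this function actually runs
    return {"host": socket.gethostname(), "cwd": os.getcwd(), "exe": sys.executable}

client.run_on_scheduler(where_am_i)  # the scheduler process
client.run(where_am_i)               # every worker, keyed by address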
-
Hi, Tom. Thanks! This is a huge help. For reference, here is the upload_file docstring as installed on the Hub:
>>> dask.distributed.Client.upload_file?
Signature: dask.distributed.Client.upload_file(self, filename, **kwargs)
Docstring:
Upload local package to workers
This sends a local file up to all worker nodes. This file is placed
into the working directory of the running worker, see config option
``temporary-directory`` (defaults to :py:func:`tempfile.gettempdir`).
This directory will be added to the Python's system path so any .py,
.egg or .zip files will be importable.
Parameters
----------
filename : string
Filename of .py, .egg or .zip file to send to workers
**kwargs : dict
Optional keyword arguments for the function
Examples
--------
>>> client.upload_file('mylibrary.egg') # doctest: +SKIP
>>> from mylibrary import myfunc # doctest: +SKIP
>>> L = client.map(myfunc, seq) # doctest: +SKIP
File: /srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/client.py
Type:      function

I wondered if the more explicit docs on the dask website were introduced in a version between what's currently on the MSPC Hub and the latest release. At any rate, I suspect that the version issue may be the problem, because I get an identical error and traceback even when I provide

I tried to update by running

/srv/conda/envs/notebook/lib/python3.11/site-packages/distributed/client.py:1388: VersionMismatchWarning: Mismatched versions found
+---------+----------+-----------+---------+
| Package | Client | Scheduler | Workers |
+---------+----------+-----------+---------+
| dask | 2023.7.0 | 2023.5.0 | None |
+---------+----------+-----------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

I have tried running

Curious to hear what I should do next!
To answer your question about where the code is executing, here is what I see.

on the main node:

>>> import os, sys
>>> os.getcwd()
'/home/jovyan/avoided_forest_conversion/analysis'
>>> sys.path
['/srv/conda/envs/notebook/bin',
'/srv/conda/envs/notebook/lib/python311.zip',
'/srv/conda/envs/notebook/lib/python3.11',
'/srv/conda/envs/notebook/lib/python3.11/lib-dynload',
'',
'/srv/conda/envs/notebook/lib/python3.11/site-packages']
>>> sys.executable
'/srv/conda/envs/notebook/bin/python3.11'

on a worker:

>>> def test_clust_fn():
...     import os, sys
...     return (os.path, sys.path, sys.executable)
>>> out = client.submit(test_clust_fn)
>>> client.gather(out)
(<module 'posixpath' (frozen)>,
['/tmp/dask-scratch-space/worker-n44nje0_',
'/tmp/dask-scratch-space',
'/srv/conda/envs/notebook/bin',
'/srv/conda/envs/notebook/lib/python311.zip',
'/srv/conda/envs/notebook/lib/python3.11',
'/srv/conda/envs/notebook/lib/python3.11/lib-dynload',
'/srv/conda/envs/notebook/lib/python3.11/site-packages'],
 '/srv/conda/envs/notebook/bin/python3.11')
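In case it's useful, here's a sketch of how I can also dump the full package comparison across the client, scheduler, and workers (same client object as above):

# collect version info from the client, the scheduler, and every worker
versions = client.get_versions()
versions["client"]["packages"]["dask"]      # e.g. the client-side dask version
versions["scheduler"]["packages"]["dask"]   # e.g. the scheduler-side dask version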
Hi, Tom.
Thanks so much! This all makes sense. Sounds like I'm bumping into an awkward dependency limitation of running analysis on MSPC. That's an inevitable consequence of relying on a free and general-purpose platform for a specific project.
I considered moving onto our own compute, and that certainly remains an option for later on. However, I stepped back even further and realized that this all really boiled down to needing just one function from one package (FyeldGenerator). I took a look at the source code and it is remarkably short and sweet. Thus, in the end, I just cannibalized that single function directly into my scripts, removing the need for pip inst…
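Roughly, the resulting pattern looks like this (just a sketch; my_field is a hypothetical stand-in, not the real FyeldGenerator code). Because the function is defined client-side, Dask serializes it and ships it to the workers with each task, so nothing extra needs to be installed on the cluster:

import numpy as np

def my_field(shape, seed=0):
    # hypothetical stand-in for the single vendored function;
    # defined locally, so it travels to the workers inside the task graph
    rng = np.random.default_rng(seed)
    return rng.normal(size=shape)

fut = client.submit(my_field, (256, 256), seed=42)
field = client.gather(fut)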