Description
On EC2 and GCP VM instances, which have a private address yyy.yyy.yyy.yyy and a public address xxx.xxx.xxx.xxx, addprocs only works with tunnel=true (ssh forwarding between the master and worker processes). This prevents worker processes located in EC2 and GCP, respectively, from communicating with each other, because each one only sees the private address of its peer's host, as reported by the worker ("julia_worker:port#yyy.yyy.yyy.yyy"). On these instances, getipaddr() always returns the private address, and the public address is not even listed by getipaddrs(). As a result, adding --bind-to="xxx.xxx.xxx.xxx" to exeflags, to force the worker to report the public address to the master process (i.e., "julia_worker:port#xxx.xxx.xxx.xxx") and so make connections with tunnel=false possible, fails with a "no ports available" error.
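The observation about getipaddrs() can be checked directly on an instance. The following is only a minimal sketch: is_local_address is a hypothetical helper (it is not part of Sockets or Distributed), and 203.0.113.7 is a documentation-range placeholder standing in for the instance's public address.

```julia
using Sockets

# Hypothetical helper (not part of Sockets): true if `addr` is assigned
# to one of the network interfaces visible to this Julia process.
is_local_address(addr::AbstractString) = parse(IPAddr, addr) in getipaddrs()

# Placeholder for the public address xxx.xxx.xxx.xxx of the instance.
public_addr = "203.0.113.7"

# Addresses Julia can see locally; on EC2/GCP this is expected to list
# only private addresses, according to the report above.
println("local addresses: ", getipaddrs())
println("public address visible locally? ", is_local_address(public_addr))
```

If the second line prints false on the worker host, the worker presumably has no local interface to bind to for that address, which would be consistent with the "no ports available" error described above.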
With CloudClusters.jl, a package we are developing, we are investigating multicluster computation involving clusters of VM instances at an IaaS provider (currently EC2 or GCP), where two clusters may be at different providers and need to exchange data to synchronize their local computations (inter-cluster communication).
The solution that has worked for us is the following modification to managers.jl, in the else clause of an if condition testing tunnel=true:
from: (s, bind_addr) = connect_to_worker(bind_addr, port)
to: (s, bind_addr) = connect_to_worker(pubhost, port)
So, for tunnel=false connections, the manager connects to the worker using pubhost, i.e., the public address that the master process uses to reach the worker host, instead of bind_addr, which may point to a private address in some cases, as on EC2 and GCP instances, for the reasons described above.
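To make the effect of the change concrete, here is a small standalone sketch of the address choice. It is not the actual managers.jl code: parse_worker_announcement and connect_host are hypothetical names introduced only for illustration, and the addresses are placeholders.

```julia
# Hypothetical helper: split a worker announcement of the form
# "julia_worker:port#address" into the reported address and port.
function parse_worker_announcement(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.*)$", line)
    m === nothing && error("unexpected worker announcement: $line")
    return (bind_addr = String(m.captures[2]), port = parse(UInt16, m.captures[1]))
end

# Hypothetical helper: host the master would connect to when tunnel=false.
# use_pubhost=false mimics the current behaviour, true the proposed one.
connect_host(pubhost, bind_addr; use_pubhost::Bool) = use_pubhost ? pubhost : bind_addr

info = parse_worker_announcement("julia_worker:9009#10.0.0.5")  # private address
pubhost = "203.0.113.7"                                         # placeholder public address

println(connect_host(pubhost, info.bind_addr; use_pubhost = false))  # private address
println(connect_host(pubhost, info.bind_addr; use_pubhost = true))   # public address
```

With use_pubhost = true, only the port from the announcement is still needed; the address part is ignored.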
This change seems safe, since the purpose of the public address is to make the worker host reachable from other hosts. However, it seems to leave unused the address returned by the worker to the master ("julia_worker:port#yyy.yyy.yyy.yyy").