addprocs on EC2 and GCP instances does not work with tunnel=false, which prevents communication between two processes respectively instantiated on EC2 and GCP. #133

@decarvalhojunior-fh

For EC2 and GCP VM instances with private address yyy.yyy.yyy.yyy and public address xxx.xxx.xxx.xxx, addprocs only works with tunnel=true (ssh forwarding between the master and worker processes). This prevents a worker on EC2 and a worker on GCP from communicating with each other, because each sees only the private address of its peer's host, as reported by the workers ("julia_worker:port#yyy.yyy.yyy.yyy"). On EC2 and GCP instances, getipaddr() in a worker always returns the private address, and the public address is not even listed by getipaddrs(). As a result, adding --bind-to="xxx.xxx.xxx.xxx" to exeflags, in order to force the workers to report the public address to the master process (i.e., "julia_worker:port#xxx.xxx.xxx.xxx") and make connections with tunnel=false possible, fails with a "no ports available" error.
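To make the symptom easier to check, here is a minimal sketch of the two observations above: reading the host out of a worker announcement string, and testing whether a given address is among the local interface addresses. The helper names parse_worker_announcement and is_local_address are hypothetical (they are not part of Distributed), and the addresses 10.0.0.5 and 203.0.113.7 are made-up stand-ins for yyy.yyy.yyy.yyy and xxx.xxx.xxx.xxx.

```julia
using Sockets

# Hypothetical helper (not part of Distributed): split a worker announcement
# of the form "julia_worker:port#host" into its host and port.
function parse_worker_announcement(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && error("unrecognized worker announcement: $line")
    return (host = String(m[2]), port = parse(Int, m[1]))
end

# Hypothetical helper: true when `addr` is one of the given local addresses.
# On a real worker, pass getipaddrs() as `local_addrs`.
function is_local_address(addr::AbstractString, local_addrs)
    return parse(IPAddr, addr) in local_addrs
end

# Made-up example values standing in for the private and public addresses.
info = parse_worker_announcement("julia_worker:9009#10.0.0.5")
local_addrs = IPAddr[ip"10.0.0.5"]   # stand-in for getipaddrs() on the worker

is_local_address(info.host, local_addrs)      # private address: listed
is_local_address("203.0.113.7", local_addrs)  # public address: not listed
```

If the second check returns false on the worker with the real getipaddrs(), that matches the situation described above, where the public address is not available for the worker to bind to.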

With CloudClusters.jl, a package we are developing, we are investigating multicluster computation involving clusters of VM instances at an IaaS provider (currently EC2 or GCP), where two clusters may be at different providers and need to exchange data to synchronize their local computations (inter-cluster communication).

The solution that has worked for us is the following modification to managers.jl, in the else clause of the if condition that tests tunnel=true:

from: (s, bind_addr) = connect_to_worker(bind_addr, port)
to: (s, bind_addr) = connect_to_worker(pubhost, port)

So, for tunnel=false connections, the manager connects to the worker using pubhost, i.e., the public address that the master process uses to reach the worker host, instead of bind_addr, which may point to a private address in some cases, as on EC2 and GCP instances, for the reasons explained above.
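The effect of the change on the address used for a direct (tunnel=false) connection can be sketched as follows. This is a simplified model, not the actual code in managers.jl: the WorkerEndpoint type, the direct_target function, and the use_pubhost keyword are hypothetical names introduced for illustration, and the addresses are made-up stand-ins for xxx.xxx.xxx.xxx and yyy.yyy.yyy.yyy.

```julia
# Hypothetical model of the host chosen for a tunnel=false connection.
# It only illustrates the one-line change; it is not the Distributed code.
struct WorkerEndpoint
    pubhost::String    # public address the master used to reach the host
    bind_addr::String  # address the worker reported ("julia_worker:port#...")
    port::Int
end

# use_pubhost=false models the current behavior (connect to bind_addr);
# use_pubhost=true models the proposed behavior (connect to pubhost).
function direct_target(ep::WorkerEndpoint; use_pubhost::Bool)
    host = use_pubhost ? ep.pubhost : ep.bind_addr
    return (host, ep.port)
end

# Made-up example values.
ep = WorkerEndpoint("203.0.113.7", "10.0.0.5", 9009)

current  = direct_target(ep; use_pubhost=false)  # ("10.0.0.5", 9009)
proposed = direct_target(ep; use_pubhost=true)   # ("203.0.113.7", 9009)
```

Under this model, the current behavior yields a private address that a process at another provider cannot reach, while the proposed behavior yields the public one.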

This update seems safe, since the purpose of the public address is to reach the worker host from other hosts. However, it seems to leave unused the address that the worker reports to the master ("julia_worker:port#yyy.yyy.yyy.yyy").
