Skip to content

Default interfaces and other things #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 25, 2024
Merged

Conversation

JamesWrigley
Copy link
Collaborator

This PR has a few changes, but the important ones are:

Default worker interface

We now try to automatically pick the fastest interface for the worker to listen on. This is done by using a minimal version of NetworkInterfaceControllers.jl (thanks @JBlaschke ❤️) and a Linux-only check for the negotiated link speed. In the future we could add support for passing an interface name/CIDR blocks/interface type to addprocs() for explicit selection.

Unfortunately testing the functionality is a bit tricky, for now there's just a sanity test in the tests and I've tested it manually on two clusters with both ethernet and infiniband networks.

Passing environment variables

addprocs(::SSHManager) now passes all JULIA_* environment variables by default. I found out we didn't do this already when I added a worker on a node with a different microarchitecture and saw segfaults on the local node from the pkgimages being compiled without all the archs in JULIA_CPU_TARGET... For context, these are the environment variables our Julia module sets by default:

julia> filter(startswith("JULIA_"), keys(ENV))
Set{String} with 5 elements:
  "JULIA_REVISE_POLL"
  "JULIA_CPU_TARGET"
  "JULIA_CONDAPKG_BACKEND"
  "JULIA_DEPOT_PATH"
  "JULIA_PYTHONCALL_EXE"

All of them are important for a good out-of-the-box experience. On the other hand, admittedly this change is very much biased towards adding workers on a cluster with a shared filesystem. Personally I think it's a good default, but alternatively we could support a JULIA_DISTRIBUTEDNEXT_PASS_ALL_VARS env var for admins to set that will enable/disable passing everything by default.

@JamesWrigley JamesWrigley self-assigned this Nov 22, 2024
@codecov-commenter
Copy link

codecov-commenter commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 93.58974% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.83%. Comparing base (2e52ffe) to head (915eb61).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
src/network_interfaces.jl 95.08% 3 Missing ⚠️
src/cluster.jl 92.30% 1 Missing ⚠️
src/managers.jl 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
+ Coverage   86.37%   86.83%   +0.45%     
==========================================
  Files          10       11       +1     
  Lines        1982     2051      +69     
==========================================
+ Hits         1712     1781      +69     
  Misses        270      270              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jpsamaroo jpsamaroo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for putting this together! Just a few small tweaks.

- Make LocalProcess.bind_port an Int so it doesn't print in hex.
- Use @error when launching workers so we can show the full backtrace if it
  fails.
`Base.active_project()` is the public API for this.
It's otherwise very annoying and error-prone to set these with every call to
`addprocs()`.
@JamesWrigley JamesWrigley merged commit 13ce55b into master Nov 25, 2024
11 checks passed
@JamesWrigley JamesWrigley deleted the default-interface branch November 25, 2024 20:25
@JamesWrigley JamesWrigley mentioned this pull request Nov 28, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants