Combi dyn split -- non-uniform distributions of full grids onto process groups #42

Closed · wants to merge 53 commits

Commits
a46b08b  Changes to compile on Hazel Hen (Mar 1, 2017)
ec35b44  Added username in Makefile (Mar 1, 2017)
99dba59  Added compile.sh file (Mar 1, 2017)
7b10a30  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (Mar 1, 2017)
4f4d291  Delete .gitlab-ci.yml (Mar 15, 2017)
987d5c7  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (Apr 3, 2017)
e74dca1  fix to manager.cpp (Apr 4, 2017)
56d9d82  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (Apr 4, 2017)
358ed68  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (Apr 4, 2017)
007bbcf  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (May 9, 2017)
80b3859  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-hh-ft (obersteiner, May 18, 2017)
d3b4edf  Removed duplicate code (WorldSEnder, Jun 18, 2017)
ecd1ad8  Reading proc vector (WorldSEnder, Jun 18, 2017)
81f826a  Removed more duplicate code (WorldSEnder, Jun 18, 2017)
526ae8e  Updated example & fixed gitignore (WorldSEnder, Jun 18, 2017)
777adc0  vector nature of nproc is now honored, except in recover and reduce (WorldSEnder, Jun 22, 2017)
217d568  non uniform reduce comm (WorldSEnder, Jul 3, 2017)
0e71ae0  Fixing assertion errors (WorldSEnder, Jul 4, 2017)
c676db6  Fixed group managers being assigned wrongly (WorldSEnder, Jul 4, 2017)
f8defed  Slowly running through combi example to see what's working (WorldSEnder, Jul 4, 2017)
1ec050a  parallelization vectors for non-uniform group sizes (WorldSEnder, Jul 11, 2017)
880852a  Running all the code (WorldSEnder, Jul 11, 2017)
9564b74  outputting timers.json in out directory (WorldSEnder, Jul 11, 2017)
9947790  team comms (WorldSEnder, Aug 19, 2017)
2ddc282  logging communicator structure in init + bug fixes (WorldSEnder, Aug 30, 2017)
c1fccbe  "stable" identifier in debug log (WorldSEnder, Aug 30, 2017)
105079e  stash pop (WorldSEnder, Nov 22, 2017)
d91b6dc  improved speed of subspace assignment (WorldSEnder, Nov 22, 2017)
750dd64  fixed local comm create (WorldSEnder, Nov 22, 2017)
5136ce0  Predictable ordering in team comm via ordering of team coords (WorldSEnder, Dec 5, 2017)
2249987  using subarray types for team master (WorldSEnder, Dec 27, 2017)
23a3d8f  distributedTeamGather & Scatter (WorldSEnder, Jan 3, 2018)
60920f3  bug fixes (WorldSEnder, Jan 4, 2018)
79e6118  temp commit (WorldSEnder, Jan 5, 2018)
2fe7724  setting stat attributes, and finishing all tasks (WorldSEnder, Jan 5, 2018)
7b0a5d4  freeing team data types (WorldSEnder, Jan 5, 2018)
dd78133  fixed team compositions for large nonuniform teams (WorldSEnder, Jan 6, 2018)
48663e6  fixed communication from manager to masters (WorldSEnder, Jan 6, 2018)
f4c7999  correct subspace sizes when the smallest group is not size=1 (WorldSEnder, Jan 6, 2018)
151a18c  fixing empty subspaces (WorldSEnder, Jan 6, 2018)
f54a4fe  async distributedTeamGather/Scatter (WorldSEnder, Jan 6, 2018)
9fd5984  can not serialize a communicator. (WorldSEnder, Apr 12, 2018)
71c0eae  fixing a few type issues (WorldSEnder, Apr 12, 2018)
47f1ab2  silencing wrongly diagnosed "unused" warning (WorldSEnder, Apr 12, 2018)
7ebe698  fixing discrepancy between size_t and "mpi size" = int (WorldSEnder, Apr 12, 2018)
965061c  Prefering CommunicatorType over MPI_Comm (WorldSEnder, Apr 12, 2018)
85ff9af  Using a low-level MPI_Comm here (WorldSEnder, Apr 12, 2018)
e898948  starting to cut down code-duplication (WorldSEnder, Apr 12, 2018)
1a4e679  fixing gene_distributed example (WorldSEnder, Apr 12, 2018)
f95d6a8  Merge remote-tracking branch 'origin/combi_gene_faults' into combi-dy… (WorldSEnder, Apr 24, 2018)
90e3533  ditched initWorld in favor of MPIInitHelper (WorldSEnder, Apr 24, 2018)
2554213  not forcing the MPI System to be initialized in registerUniformSG (WorldSEnder, May 3, 2018)
57daf08  added BOOST_TEST_HOSTFILE to specify a custom hostfile for tests (WorldSEnder, May 4, 2018)
7 changes: 4 additions & 3 deletions .gitignore

@@ -3,9 +3,10 @@


 #getting rid of weverything without a . in examples folder and all subfolders (executables on linux)
-*/examples/*
-!*/examples/*/
-!*/examples/*.*
+*/examples/**
+!*/examples/**/*/
+!*/examples/**/*.*
+*/examples/**/out/

 #ignoring c++ build files
 *.o
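For context on the glob change: in gitignore patterns a single `*` does not cross directory separators, so `*/examples/*` only covered entries directly inside an examples/ folder. The `**` form matches any number of path components, and the new `*/examples/**/out/` line keeps generated output directories out of the tree. The paths below are hypothetical, purely to illustrate how the rules combine:

# hypothetical paths, for illustration only:
#   distributedcombigrid/examples/combi_example/combi_example     -> ignored (no dot, i.e. a built executable)
#   distributedcombigrid/examples/combi_example/TaskExample.hpp   -> re-included by !*/examples/**/*.*
#   distributedcombigrid/examples/combi_example/out/timers.json   -> ignored, since */examples/**/out/ excludes its parent directory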
15 changes: 11 additions & 4 deletions SConstruct

@@ -111,8 +111,8 @@ vars.Add(BoolVariable("USE_OCL", "Enable OpenCL support (only actually enabled i
 vars.Add(BoolVariable("USE_CUDA", "Enable CUDA support (you might need to provide an 'CUDA_TOOLKIT_PATH')", False))
 vars.Add("OCL_INCLUDE_PATH", "Set path to the OpenCL header files (parent directory of CL/)")
 vars.Add("OCL_LIBRARY_PATH", "Set path to the OpenCL library")
-vars.Add("BOOST_INCLUDE_PATH", "Set path to the Boost header files", "/usr/include")
-vars.Add("BOOST_LIBRARY_PATH", "Set path to the Boost library", "/usr/lib/x86_64-linux-gnu")
+vars.Add("BOOST_INCLUDE_PATH", "Set path to the Boost header files", "/zhome/academic/HLRS/ipv/ipvalf/hlrs-tools/boost_1_58_0/")
+vars.Add("BOOST_LIBRARY_PATH", "Set path to the Boost library", "/zhome/academic/HLRS/ipv/ipvalf/hlrs-tools/boost_1_58_0/stage/lib")
 vars.Add(BoolVariable("COMPILE_BOOST_TESTS",
     "Compile the test cases written using Boost Test", True))
 vars.Add(BoolVariable("COMPILE_BOOST_PERFORMANCE_TESTS",
@@ -140,6 +140,7 @@ vars.Add(BoolVariable("PRINT_INSTRUCTIONS", "Print instructions for installing S
 vars.Add('GLPK_INCLUDE_PATH', 'Specifies the location of the glpk header files.', '/usr/include')
 vars.Add('GLPK_LIBRARY_PATH', 'Specifies the location of the glpk library.', '/usr/lib/x86_64-linux-gnu')
 vars.Add("TEST_PROCESS_COUNT", "How many processes are used for parallel test cases", "9")
+vars.Add("BOOST_TEST_HOSTFILE", "Specifies a hostfile to use for parallel boost tests", "")


 # create temporary environment to check which system and compiler we should use
@@ -372,8 +373,14 @@ if env["RUN_PYTHON_TESTS"] and env["SG_PYTHON"]:

 if env["COMPILE_BOOST_TESTS"]:
     proc_count = int(env["TEST_PROCESS_COUNT"])
-    run_cmd = "mpiexec -n %s " % proc_count if proc_count > 1 else ""
-    builder = Builder(action=run_cmd + "./$SOURCE")
+    hostfile = str(env["BOOST_TEST_HOSTFILE"])
+    run_cmd = ["mpiexec"]
+    if proc_count > 1:
+        run_cmd += ["-n", "{}".format(proc_count)]
+    if hostfile:
+        run_cmd += ["--hostfile", env.File(hostfile).abspath]
+    run_cmd += ["$SOURCE"]
+    builder = Builder(action=[run_cmd])
     env.Append(BUILDERS={"BoostTest" : builder})

 # Building the modules
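Note the behavioral changes in the test runner: mpiexec is now always invoked (previously it was skipped for a single process), the command is built as an argument list rather than one concatenated string, and a relative hostfile path is resolved to an absolute one via env.File(...).abspath so it survives SCons directory changes. A hypothetical invocation, assuming an Open MPI-style hostfile (the "slots" syntax varies between MPI implementations):

# hosts.txt (hypothetical; Open MPI syntax)
node01 slots=4
node02 slots=5

scons COMPILE_BOOST_TESTS=1 TEST_PROCESS_COUNT=9 BOOST_TEST_HOSTFILE=hosts.txt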
6 changes: 6 additions & 0 deletions compile.sh

@@ -0,0 +1,6 @@
+#!/bin/bash
+
+USRNAME="ipvalf"
+scons -j 16 OPT=1 VERBOSE=1 SG_ALL=0 SG_DISTRIBUTEDCOMBIGRID=1 CXX=CC COMPILE_BOOST_TESTS=0 RUN_BOOST_TESTS=0 BOOST_LIBRARY_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/boost_1_58_0/stage/lib BOOST_INCLUDE_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/boost_1_58_0 GLPK_LIBRARY_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/glpk/lib GLPK_INCLUDE_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/glpk/include BUILD_STATICLIB=1
+
+scons -j 16 OPT=1 VERBOSE=1 SG_ALL=0 SG_DISTRIBUTEDCOMBIGRID=1 CXX=CC COMPILE_BOOST_TESTS=0 RUN_BOOST_TESTS=0 BOOST_LIBRARY_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/boost_1_58_0/stage/lib BOOST_INCLUDE_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/boost_1_58_0 GLPK_LIBRARY_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/glpk/lib GLPK_INCLUDE_PATH=/zhome/academic/HLRS/ipv/$USRNAME/hlrs-tools/glpk/include BUILD_STATICLIB=0
2 changes: 1 addition & 1 deletion distributedcombigrid/SConscript

@@ -9,7 +9,7 @@ Import("*")

 moduleDependencies = []

-additionalDependencies = ["boost_serialization","glpk"]
+additionalDependencies = ["boost_serialization", "boost_mpi", "glpk"]

 module = ModuleHelper.Module(moduleDependencies, additionalDependencies)

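The new boost_mpi link dependency fits the map-valued task member introduced below: Boost.MPI layers on Boost.Serialization, so standard containers can be shipped between ranks directly. A minimal sketch of that capability (my own illustration under those assumptions, not code from this PR):

#include <cstddef>
#include <map>
#include <vector>
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/serialization/map.hpp>
#include <boost/serialization/vector.hpp>

// Run with at least 2 ranks, e.g.: mpiexec -n 2 ./a.out
int main(int argc, char** argv) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator world;

  // a map shaped like the parallelization table used elsewhere in this PR
  std::map<std::size_t, std::vector<int>> pByProc;
  if (world.rank() == 0) {
    pByProc = {{1, {1, 1, 1}}, {2, {1, 2, 1}}};
    world.send(1, /*tag=*/0, pByProc);  // serialized via Boost.Serialization
  } else if (world.rank() == 1) {
    world.recv(0, /*tag=*/0, pByProc);
  }
}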
27 changes: 16 additions & 11 deletions distributedcombigrid/examples/combi_example/TaskExample.hpp

@@ -8,6 +8,9 @@
 #ifndef TASKEXAMPLE_HPP_
 #define TASKEXAMPLE_HPP_

+#include <map>
+#include <boost/serialization/map.hpp>
+
 #include "sgpp/distributedcombigrid/fullgrid/DistributedFullGrid.hpp"
 #include "sgpp/distributedcombigrid/task/Task.hpp"

@@ -21,12 +24,13 @@ class TaskExample: public Task {
    */
   TaskExample(DimType dim, LevelVector& l, std::vector<bool>& boundary,
               real coeff, LoadModel* loadModel, real dt,
-              size_t nsteps, IndexVector p = IndexVector(0),FaultCriterion *faultCrit = (new StaticFaults({0,IndexVector(0),IndexVector(0)})) ) :
+              size_t nsteps, std::map<size_t, CartRankCoords> pByProc,
+              FaultCriterion *faultCrit = (new StaticFaults({0,IndexVector(0),IndexVector(0)})) ) :
     Task(dim, l, boundary, coeff, loadModel, faultCrit), dt_(dt), nsteps_(
-        nsteps), p_(p), initialized_(false), stepsTotal_(0), dfg_(NULL) {
+        nsteps), p_( std::move(pByProc) ), initialized_(false), stepsTotal_(0), dfg_(NULL) {
   }

-  void init(CommunicatorType lcomm, std::vector<IndexVector> decomposition = std::vector<IndexVector>()){
+  void init(CommunicatorType lcomm, std::vector<IndexVector> decomposition = std::vector<IndexVector>()) override {
     assert(!initialized_);
     assert(dfg_ == NULL);

@@ -48,7 +52,7 @@ class TaskExample: public Task {
     IndexVector p(dim, 1);
     const LevelVector& l = this->getLevelVector();

-    if (p_.size() == 0) {
+    if (p_[np].size() == 0) {
       // compute domain decomposition
       IndexType prod_p(1);

@@ -72,7 +76,8 @@ class TaskExample: public Task {
         prod_p *= p[k];
       }
     } else {
-      p = p_;
+      auto& pBySize = p_[np];
+      std::transform(pBySize.begin(), pBySize.end(), p.begin(), [](int i){ return static_cast<size_t>(i); });
     }

     if (lrank == 0) {
@@ -172,6 +177,11 @@ class TaskExample: public Task {

   }

+  ~TaskExample() {
+    if (dfg_ != NULL)
+      delete dfg_;
+  }
+
 protected:
   /* if there are local variables that have to be initialized at construction
    * you have to do it here. the worker processes will create the task using
@@ -182,18 +192,13 @@ class TaskExample: public Task {
       initialized_(false), stepsTotal_(1), dfg_(NULL) {
   }

-  ~TaskExample() {
-    if (dfg_ != NULL)
-      delete dfg_;
-  }
-
 private:
   friend class boost::serialization::access;

   // new variables that are set by manager. need to be added to serialize
   real dt_;
   size_t nsteps_;
-  IndexVector p_;
+  std::map<size_t, CartRankCoords> p_;

   // pure local variables that exist only on the worker processes
   bool initialized_;
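The structural change: instead of one fixed parallelization vector, each task now carries a map from group size to parallelization vector and picks the matching entry at init time based on the size np of the local communicator it was assigned to. A minimal sketch of that lookup, assuming CartRankCoords is a vector of per-dimension process counts stored as ints (as the static_cast in the transform above suggests); the fallback here is deliberately simpler than the example's real decomposition heuristic:

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the library alias used in the diff.
using CartRankCoords = std::vector<int>;  // process count per dimension

// Mirrors the p_[np] lookup in TaskExample::init: each group size np gets
// its own parallelization vector. Fallback (a simplification of the real
// heuristic): put all np processes into the first dimension.
CartRankCoords selectParallelization(
    std::map<std::size_t, CartRankCoords>& pByProc,
    std::size_t np, std::size_t dim) {
  auto it = pByProc.find(np);
  if (it != pByProc.end() && !it->second.empty()) return it->second;
  CartRankCoords p(dim, 1);
  p[0] = static_cast<int>(np);
  return p;
}

int main() {
  // one entry per distinct group size, as parsed from ctparam's "ct.p"
  std::map<std::size_t, CartRankCoords> pByProc{{1, {1, 1, 1}},
                                                {2, {1, 2, 1}}};
  for (std::size_t np : {std::size_t{1}, std::size_t{2}, std::size_t{4}}) {
    CartRankCoords p = selectParallelization(pByProc, np, 3);
    // invariant: the entries of p must multiply up to the group size np
    int prod = std::accumulate(p.begin(), p.end(), 1, std::multiplies<int>{});
    std::cout << "np=" << np << " -> prod=" << prod << '\n';
  }
}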
63 changes: 47 additions & 16 deletions distributedcombigrid/examples/combi_example/combi_example.cpp

@@ -7,8 +7,10 @@
 #include <mpi.h>
 #include <vector>
 #include <string>
+#include <map>
+#include <numeric>
 #include <boost/property_tree/ptree.hpp>
-#include <boost/property_tree/ini_parser.hpp>
+#include <boost/property_tree/json_parser.hpp>
 #include <boost/serialization/export.hpp>

 // compulsory includes for basic functionality
@@ -28,12 +30,27 @@
 #include "TaskExample.hpp"

 using namespace combigrid;
+using boost::property_tree::ptree;

 // this is necessary for correct function of task serialization
 BOOST_CLASS_EXPORT(TaskExample)
 BOOST_CLASS_EXPORT(StaticFaults)
 BOOST_CLASS_EXPORT(WeibullFaults)
 BOOST_CLASS_EXPORT(FaultCriterion)
+
+template <typename T>
+std::vector<T> get_as_vector(ptree const& pt, ptree::key_type const& key)
+{
+  std::vector<T> result;
+  result.reserve( pt.size() );
+
+  for(const auto& child : pt.get_child(key)) {
+    const ptree& childPt = child.second;
+    result.push_back( childPt.get_value<T>() );
+  }
+  return result;
+}

 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);
@@ -44,15 +61,30 @@ int main(int argc, char** argv) {

   // read in parameter file
   boost::property_tree::ptree cfg;
-  boost::property_tree::ini_parser::read_ini("ctparam", cfg);
+  boost::property_tree::json_parser::read_json("ctparam", cfg);

   // number of process groups and number of processes per group
   size_t ngroup = cfg.get<size_t>("manager.ngroup");
-  size_t nprocs = cfg.get<size_t>("manager.nprocs");
+  std::vector<size_t> nprocs = get_as_vector<size_t>( cfg, "manager.nprocs" );
+
+  DimType dim = cfg.get<DimType>("ct.dim");
+  std::map<size_t, CartRankCoords> pByNProcs;
+  {
+    const ptree& ps = cfg.get_child("ct.p");
+    for(const auto& pConfig : ps) {
+      IndexVector pIdx( dim );
+      pConfig.second.get_value<std::string>() >> pIdx;
+
+      CartRankCoords p( dim );
+      std::transform(pIdx.begin(), pIdx.end(), p.begin(), [](size_t i) { return static_cast<int>(i); });
+      IndexType procCount = std::accumulate( p.begin(), p.end(), 1, std::multiplies<IndexType>{} );
+      pByNProcs[procCount] = std::move( p );
+    }
+  }

   // divide the MPI processes into process group and initialize the
   // corresponding communicators
-  theMPISystem()->init( ngroup, nprocs );
+  theMPISystem()->configure().withGroups( ngroup, nprocs ).withParallelization( pByNProcs ).init();

   // this code is only executed by the manager process
   WORLD_MANAGER_EXCLUSIVE_SECTION {
@@ -61,25 +93,22 @@ int main(int argc, char** argv) {
    */
     ProcessGroupManagerContainer pgroups;
     for (size_t i = 0; i < ngroup; ++i) {
-      int pgroupRootID(i);
+      int pgroupRootID = theMPISystem()->getGroupBaseWorldRank(i);
       pgroups.emplace_back(
-          std::make_shared< ProcessGroupManager > ( pgroupRootID )
+          std::make_shared< ProcessGroupManager > ( i )
      );
     }

     // create load model
     LoadModel* loadmodel = new LinearLoadModel();

     /* read in parameters from ctparam */
-    DimType dim = cfg.get<DimType>("ct.dim");
     LevelVector lmin(dim), lmax(dim), leval(dim);
-    IndexVector p(dim);
     combigrid::real dt;
     size_t nsteps, ncombi;
     cfg.get<std::string>("ct.lmin") >> lmin;
     cfg.get<std::string>("ct.lmax") >> lmax;
     cfg.get<std::string>("ct.leval") >> leval;
-    cfg.get<std::string>("ct.p") >> p;
     ncombi = cfg.get<size_t>("ct.ncombi");
     dt = cfg.get<combigrid::real>("application.dt");
     nsteps = cfg.get<size_t>("application.nsteps");
@@ -88,10 +117,9 @@ int main(int argc, char** argv) {
     std::vector<bool> boundary(dim, true);

     // check whether parallelization vector p agrees with nprocs
-    IndexType checkProcs = 1;
-    for (auto k : p)
-      checkProcs *= k;
-    assert(checkProcs == IndexType(nprocs));
+    for(const auto& nproc : nprocs) {
+      assert(pByNProcs.find(nproc) != pByNProcs.end() && "Need parallelization for every proc size");
+    }

     /* generate a list of levelvectors and coefficients
      * CombiMinMaxScheme will create a classical combination scheme.
@@ -113,7 +141,7 @@ int main(int argc, char** argv) {
     std::vector<int> taskIDs;
     for (size_t i = 0; i < levels.size(); i++) {
       Task* t = new TaskExample(dim, levels[i], boundary, coeffs[i],
-                                loadmodel, dt, nsteps, p);
+                                loadmodel, dt, nsteps, pByNProcs);
       tasks.push_back(t);
       taskIDs.push_back( t->getID() );
     }
@@ -130,14 +158,17 @@ int main(int argc, char** argv) {

     std::cout << "set up component grids and run until first combination point"
               << std::endl;

     /* distribute task according to load model and start computation for
      * the first time */

+    std::cerr << "run first" << std::endl;
     Stats::startEvent("manager run first");
     manager.runfirst();
     Stats::stopEvent("manager run first");
+    std::cerr << "run first complete" << std::endl;

     for (size_t i = 0; i < ncombi; ++i) {
+      std::cerr << "ncombi: " << i << std::endl;
       Stats::startEvent("combine");
       manager.combine();
       Stats::stopEvent("combine");
@@ -176,7 +207,7 @@ int main(int argc, char** argv) {
   Stats::finalize();

   /* write stats to json file for postprocessing */
-  Stats::write( "timers.json" );
+  Stats::write( "out/timers.json" );

   MPI_Finalize();
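The switch from INI to JSON is what enables per-group values: an INI key holds a single scalar, while a JSON array such as "nprocs": [1, 2] parses into a ptree node with unnamed children, which get_as_vector above walks in order. A standalone sketch of the same pattern (assuming only Boost.PropertyTree; the inline JSON stands in for the ctparam file):

#include <cstddef>
#include <iostream>
#include <sstream>
#include <vector>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

using boost::property_tree::ptree;

// Same pattern as get_as_vector above: a JSON array becomes a ptree node
// whose children carry empty keys; collecting their values recovers the list.
template <typename T>
std::vector<T> get_as_vector(const ptree& pt, const ptree::path_type& key) {
  std::vector<T> result;
  for (const auto& child : pt.get_child(key))
    result.push_back(child.second.get_value<T>());
  return result;
}

int main() {
  // inline stand-in for the ctparam file
  std::istringstream in(R"({ "manager": { "ngroup": 2, "nprocs": [1, 2] } })");
  ptree cfg;
  boost::property_tree::json_parser::read_json(in, cfg);

  auto nprocs = get_as_vector<std::size_t>(cfg, "manager.nprocs");
  for (std::size_t n : nprocs) std::cout << n << '\n';  // prints 1 then 2
}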
33 changes: 18 additions & 15 deletions distributedcombigrid/examples/combi_example/ctparam

@@ -1,15 +1,18 @@
-[ct]
-dim = 2
-lmin = 3 3
-lmax = 10 10
-leval = 5 5
-p = 1 2
-ncombi = 10
-
-[application]
-dt = 1e-3
-nsteps = 100
-
-[manager]
-ngroup = 2
-nprocs = 2
+{
+  "ct": {
+    "dim": 3,
+    "lmin": "3 3 3",
+    "lmax": "6 6 6",
+    "leval": "5 5 5",
+    "p": ["1 1 1", "1 2 1"],
+    "ncombi": 10
+  },
+  "application": {
+    "dt": 1e-3,
+    "nsteps": 10
+  },
+  "manager": {
+    "ngroup": 2,
+    "nprocs": [1, 2]
+  }
+}
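A consistency note on the new format, worked through for this file: each entry of "p" must multiply up to the corresponding entry of "nprocs", which is exactly what the manager-side assert checks. Here "1 1 1" gives 1*1*1 = 1 and "1 2 1" gives 1*2*1 = 2, matching "nprocs": [1, 2]. Assuming the manager occupies one extra world rank beyond the two groups (an assumption based on how this framework usually launches, not stated in the diff), the run needs 1 + 2 + 1 = 4 ranks:

mpiexec -n 4 ./combi_example   # 1 + 2 workers + 1 manager rank (rank count is an assumption)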