
Parallel computation with OpenMP and MPI

OpenMP
MPI
Concepts

Parallelization is a key feature of modern numerical libraries, since many cores are now widely available, even on laptops. odeint currently supports parallelization with OpenMP and MPI, as described in the following sections. However, it should be made clear from the beginning that the difficulty of efficiently distributing ODE integration over many cores/machines lies in parallelizing the system function, which remains the user's responsibility. Simply using a parallel odeint backend without parallelizing the system function will bring you almost no performance gains.

odeint's OpenMP support is implemented as an external backend, which needs to be manually included. Depending on the compiler, some additional flags may be needed, e.g. -fopenmp for GCC.

#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

In the simplest OpenMP parallelization approach we use a standard vector as the state type:

typedef std::vector< double > state_type;

We initialize the state with some random data:

size_t N = 131101;
state_type x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );

Now we have to configure the stepper to use the OpenMP backend. This is done by explicitly providing the openmp_range_algebra as a template parameter to the stepper. This algebra requires the state type to be a model of Random Access Range and will be accessed from multiple threads.

typedef runge_kutta4<
                  state_type , double ,
                  state_type , double ,
                  openmp_range_algebra
                > stepper_type;

In addition to providing the stepper with OpenMP parallelization, we also need a parallelized system function to exploit the available cores. Here this is shown for a simple one-dimensional chain of phase oscillators with nearest-neighbor coupling:

struct phase_chain
{
    phase_chain( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t i = 1 ; i < N - 1 ; ++i)
        {
            dxdt[i] = coupling_func( x[i+1] - x[i] ) +
                      coupling_func( x[i-1] - x[i] );
        }
        dxdt[0  ] = coupling_func( x[1  ] - x[0  ] );
        dxdt[N-1] = coupling_func( x[N-2] - x[N-1] );
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

[Note] Note

In the OpenMP backends the system function will always be called sequentially from the thread used to start the integration.

Finally, we perform the integration by using one of the integrate functions from odeint. As you can see, the parallelization is completely hidden in the stepper and the system function. OpenMP will take care of distributing the work among the threads and join them automatically.

integrate_n_steps( stepper_type() , phase_chain( 1.2 ) ,
                   x , 0.0 , 0.01 , 100 );

After integrating, the data can be accessed immediately and processed further. Note that you can specify the OpenMP scheduling by calling omp_set_schedule at the beginning of your program:

int chunk_size = N/omp_get_max_threads();
omp_set_schedule( omp_sched_static , chunk_size );

See openmp/phase_chain.cpp for the complete example.

Split state

For advanced cases odeint offers another approach to using OpenMP that allows for more exact control of the parallelization. For example, for odd-sized data where OpenMP's thread boundaries don't match cache lines and hurt performance, it might be advisable to copy the data from the contiguous vector<T> into separate, individually aligned vectors. For this, odeint provides the openmp_state<T> type, essentially an alias for vector<vector<T>>.

Here, the initialization is done with a vector<double>, but then we use odeint's split function to fill an openmp_state. The splitting is done such that the sizes of the individual regions differ by at most 1, to make the computation as uniform as possible.

const size_t N = 131101;
vector<double> x( N );
boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
boost::random::mt19937 engine( 0 );
generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
const size_t blocks = omp_get_max_threads();
state_type x_split( blocks );
split( x , x_split );

Of course, the system function has to be changed to deal with the openmp_state. Note that each sub-region of the state is computed in a single task, but at the borders read access to the neighbouring regions is required.

struct phase_chain_omp_state
{
    phase_chain_omp_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t N = x.size();
        #pragma omp parallel for schedule(runtime)
        for(size_t n = 0 ; n < N ; ++n)
        {
            const size_t M = x[n].size();
            for(size_t m = 1 ; m < M-1 ; ++m)
            {
                dxdt[n][m] = coupling_func( x[n][m+1] - x[n][m] ) +
                             coupling_func( x[n][m-1] - x[n][m] );
            }
            dxdt[n][0] = coupling_func( x[n][1] - x[n][0] );
            if( n > 0 )
            {
                dxdt[n][0] += coupling_func( x[n-1].back() - x[n].front() );
            }
            dxdt[n][M-1] = coupling_func( x[n][M-2] - x[n][M-1] );
            if( n < N-1 )
            {
                dxdt[n][M-1] += coupling_func( x[n+1].front() - x[n].back() );
            }
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

Using the openmp_state<T> state type automatically selects openmp_algebra, which executes odeint's internal computations in parallel regions. Hence, no manual configuration of the stepper is necessary. At the end of the integration, we use unsplit to concatenate the sub-regions back into a single vector.

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_omp_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );

[Note] Note

You don't actually need to use openmp_state<T> for advanced use cases: openmp_algebra is simply an alias for openmp_nested_algebra<range_algebra>, which supports any model of Random Access Range as the outer, parallel state type and uses the given algebra on its elements.
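
For example, a stepper can be configured with openmp_nested_algebra directly on top of a plain nested vector. A minimal sketch, assuming the chunks are std::vector<double> (the chunked_* typedef names are only illustrative, not part of the original example):

typedef std::vector< std::vector<double> > chunked_state_type;
typedef runge_kutta4<
                  chunked_state_type , double ,
                  chunked_state_type , double ,
                  openmp_nested_algebra< range_algebra >
                > chunked_stepper_type;

Since split/unsplit are only provided for openmp_state (see the Splitter concept below), filling and joining the chunks would then be left to the user.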

See openmp/phase_chain_omp_state.cpp for the complete example.

MPI

To expand the parallel computation across multiple machines we can use MPI.

The system function implementation is similar to the OpenMP variant with split data. The main difference is that, while OpenMP uses a fork/join model where everything not explicitly parallelized runs only in the main thread, in MPI's model each node enters the main() method independently, diverging based on its rank and synchronizing through message passing and explicit barriers.

odeint's MPI support is implemented as an external backend, too. Depending on the MPI implementation, the code might need to be compiled with a wrapper such as mpic++.

#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/mpi/mpi.hpp>

Instead of reading another thread's data, we asynchronously send and receive the relevant data from neighbouring nodes, performing some computation in the interim to hide the latency.

struct phase_chain_mpi_state
{
    phase_chain_mpi_state( double gamma = 0.5 )
    : m_gamma( gamma ) { }

    void operator()( const state_type &x , state_type &dxdt , double /* t */ ) const
    {
        const size_t M = x().size();
        const bool have_left = x.world.rank() > 0,
                   have_right = x.world.rank() < x.world.size()-1;
        double x_left, x_right;
        boost::mpi::request r_left, r_right;
        if( have_left )
        {
            x.world.isend( x.world.rank()-1 , 0 , x().front() ); // send to x_right
            r_left = x.world.irecv( x.world.rank()-1 , 0 , x_left ); // receive from x().back()
        }
        if( have_right )
        {
            x.world.isend( x.world.rank()+1 , 0 , x().back() ); // send to x_left
            r_right = x.world.irecv( x.world.rank()+1 , 0 , x_right ); // receive from x().front()
        }
        for(size_t m = 1 ; m < M-1 ; ++m)
        {
            dxdt()[m] = coupling_func( x()[m+1] - x()[m] ) +
                        coupling_func( x()[m-1] - x()[m] );
        }
        dxdt()[0] = coupling_func( x()[1] - x()[0] );
        if( have_left )
        {
            r_left.wait();
            dxdt()[0] += coupling_func( x_left - x().front() );
        }
        dxdt()[M-1] = coupling_func( x()[M-2] - x()[M-1] );
        if( have_right )
        {
            r_right.wait();
            dxdt()[M-1] += coupling_func( x_right - x().back() );
        }
    }

    double coupling_func( double x ) const
    {
        return sin( x ) - m_gamma * ( 1.0 - cos( x ) );
    }

    double m_gamma;
};

Analogous to openmp_state<T>, we use mpi_state< InnerState<T> >, which automatically selects mpi_nested_algebra and the appropriate MPI-oblivious inner algebra (since our inner state is a vector, the inner algebra will be range_algebra as in the OpenMP example).

typedef mpi_state< vector<double> > state_type;

In the main program we construct a communicator, which tells us the size of the cluster and the current node's rank within it. We generate the input data on the master node only, avoiding unnecessary work on the other nodes. Instead of simply copying chunks, split acts as an MPI collective function here and sends/receives regions from the master to each slave. The input argument is ignored on the slaves, but the master node receives a region in its output and will participate in the computation.

boost::mpi::environment env( argc , argv );
boost::mpi::communicator world;

const size_t N = 131101;
vector<double> x;
if( world.rank() == 0 )
{
    x.resize( N );
    boost::random::uniform_real_distribution<double> distribution( 0.0 , 2.0*pi );
    boost::random::mt19937 engine( 0 );
    generate( x.begin() , x.end() , boost::bind( distribution , engine ) );
}

state_type x_split( world );
split( x , x_split );

Now that x_split contains (only) the local chunk for each node, we start the integration.

To print the result on the master node, we send the processed data back using unsplit.

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_mpi_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 );
unsplit( x_split , x );

[Note] Note

mpi_nested_algebra::for_eachN doesn't use any MPI constructs; it simply calls the inner algebra on the local chunk. The system function is not guarded by any barriers either, so if you don't place any manually (for example in parameter-study cases where the elements are completely independent), you might see the nodes diverging and returning from this call at different times.
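
If you do want the nodes to proceed in lockstep, one way is to place an explicit barrier in an observer. A minimal sketch, reusing the mpi_state-based state_type from above (the barrier_observer name is only illustrative, not part of the original example):

struct barrier_observer
{
    void operator()( const state_type &x , double /* t */ ) const
    {
        x.world.barrier(); // synchronize all nodes after every step
    }
};

integrate_n_steps( runge_kutta4<state_type>() , phase_chain_mpi_state( 1.2 ) ,
                   x_split , 0.0 , 0.01 , 100 , barrier_observer() );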

See mpi/phase_chain.cpp for the complete example.

Concepts

MPI State

As used by mpi_nested_algebra.

Notation

  InnerState    The inner state type
  State         The MPI-state type
  state         Object of type State
  world         Object of type boost::mpi::communicator

Valid Expressions

  Name                                              Expression     Type                       Semantics
  Construct a state with a communicator             State(world)   State                      Constructs the State.
  Construct a state with the default communicator   State()        State                      Constructs the State.
  Get the current node's inner state                state()        InnerState                 Returns a (const) reference.
  Get the communicator                              state.world    boost::mpi::communicator   See Boost.MPI.

Models
  • mpi_state<InnerState>

OpenMP Split State

As used by openmp_nested_algebra, essentially a Random Access Container with ValueType = InnerState.

Notation

  InnerState    The inner state type
  State         The split state type
  state         Object of type State

Valid Expressions

  Name                             Expression     Type         Semantics
  Construct a state for n chunks   State(n)       State        Constructs the underlying vector.
  Get a chunk                      state[i]       InnerState   Accesses the underlying vector.
  Get the number of chunks         state.size()   size_type    Returns the size of the underlying vector.

Models
  • openmp_state<ValueType> with InnerState = vector<ValueType>
Splitter

Notation

  Container1    The continuous-data container type
  x             Object of type Container1
  Container2    The chunked-data container type
  y             Object of type Container2

Valid Expressions

  Name                                      Expression      Type   Semantics
  Copy chunks of input to output elements   split(x, y)     void   Calls split_impl<Container1, Container2>::split(x, y), which splits x into y.size() chunks.
  Join chunks of input elements to output   unsplit(y, x)   void   Calls unsplit_impl<Container2, Container1>::unsplit(y, x); assumes x already has the correct size (the sum of all y[i].size()) and does not resize x.

Models
  • defined for Container1 = Boost.Range and Container2 = openmp_state
  • and for Container2 = mpi_state

To implement splitters for containers incompatible with Boost.Range, specialize the split_impl and unsplit_impl types:

template< class Container1, class Container2 , class Enabler = void >
struct split_impl {
    static void split( const Container1 &from , Container2 &to );
};

template< class Container2, class Container1 , class Enabler = void >
struct unsplit_impl {
    static void unsplit( const Container2 &from , Container1 &to );
};
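
As an illustration, here is a minimal sketch of such a specialization for a hypothetical my_buffer type that is not a Boost.Range (my_buffer and its data member are made up for this example; only split_impl, unsplit_impl and openmp_state come from odeint):

#include <algorithm>
#include <cstddef>
#include <vector>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>

struct my_buffer { std::vector<double> data; };  // hypothetical container without Boost.Range support

namespace boost { namespace numeric { namespace odeint {

template<>
struct split_impl< my_buffer , openmp_state<double> >
{
    static void split( const my_buffer &from , openmp_state<double> &to )
    {
        // distribute from.data over to.size() chunks whose sizes differ by at most 1
        const std::size_t parts = to.size() , n = from.data.size();
        std::size_t offset = 0;
        for( std::size_t i = 0 ; i < parts ; ++i )
        {
            const std::size_t len = n / parts + ( i < n % parts ? 1 : 0 );
            to[i].assign( from.data.begin() + offset , from.data.begin() + offset + len );
            offset += len;
        }
    }
};

template<>
struct unsplit_impl< openmp_state<double> , my_buffer >
{
    static void unsplit( const openmp_state<double> &from , my_buffer &to )
    {
        // concatenate the chunks back; to.data must already have the correct size
        std::vector<double>::iterator out = to.data.begin();
        for( std::size_t i = 0 ; i < from.size() ; ++i )
            out = std::copy( from[i].begin() , from[i].end() , out );
    }
};

} } }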
