 ### Using CUDA (or OpenMP, TBB, ...) via Thrust

Phase oscillator ensemble
Large oscillator chains
Parameter studies

Modern graphic cards (graphic processing units - GPUs) can be used to speed up the performance of time consuming algorithms by means of massive parallelization. They are designed to execute many operations in parallel. odeint can utilize the power of GPUs by means of CUDA and Thrust, which is a STL-like interface for the native CUDA API.

Important Thrust also supports parallelization using OpenMP and Intel Threading Building Blocks (TBB). You can switch between CUDA, OpenMP and TBB parallelizations by a simple compiler switch. Hence, this also provides an easy way to get basic OpenMP parallelization into odeint. The examples discussed below are focused on GPU parallelization, though.

To use odeint with CUDA a few points have to be taken into account. First of all, the problem has to be well chosen. It makes absolutely no sense to try to parallelize the code for a three dimensional system, it is simply too small and not worth the effort. One single function call (kernel execution) on the GPU is slow but you can do the operation on a huge set of data with only one call. We have experienced that the vector size over which is parallelized should be of the order of 106 to make full use of the GPU. Secondly, you have to use Thrust's algorithms and functors when implementing the rhs the ODE. This might be tricky since it involves some kind of functional programming knowledge.

Typical applications for CUDA and odeint are large systems, like lattices or discretizations of PDE, and parameter studies. We introduce now three examples which show how the power of GPUs can be used in combination with odeint.

Important The full power of CUDA is only available for really large systems where the number of coupled ordinary differential equations is of order N=106 or larger. For smaller systems the CPU is usually much faster. You can also integrate an ensemble of different uncoupled ODEs in parallel as shown in the last example.

#### Phase oscillator ensemble

The first example is the phase oscillator ensemble from the previous section:

​k / dt = ω​k + ε / N Σ​j sin( φ​j - φ​k ).

It has a phase transition at ε = 2 in the limit of infinite numbers of oscillators N. In the case of finite N this transition is smeared out but still clearly visible.

Thrust and CUDA are perfectly suited for such kinds of problems where one needs a large number of particles (oscillators). We start by defining the state type which is a `thrust::device_vector`. The content of this vector lives on the GPU. If you are not familiar with this we recommend reading the Getting started section on the Thrust website.

```//change this to float if your device does not support double computation
typedef double value_type;

//change this to host_vector< ... > of you want to run on CPU
typedef thrust::device_vector< value_type > state_type;
// typedef thrust::host_vector< value_type > state_type;
```

Thrust follows a functional programming approach. If you want to perform a calculation on the GPU you usually have to call a global function like `thrust::for_each`, `thrust::reduce`, ... with an appropriate local functor which performs the basic operation. An example is

```struct add_two
{
template< class T >
__host__ __device__
void operator()( T &t ) const
{
t += T( 2 );
}
};

// ...

thrust::for_each( x.begin() , x.end() , add_two() );
```

This code generically adds two to every element in the container `x`.

For the purpose of integrating the phase oscillator ensemble we need

• to calculate the system function, hence the r.h.s. of the ODE.
• this involves computing the mean field of the oscillator example, i.e. the values of R and θ

The mean field is calculated in a class `mean_field_calculator`

```struct mean_field_calculator
{
struct sin_functor : public thrust::unary_function< value_type , value_type >
{
__host__ __device__
value_type operator()( value_type x) const
{
return sin( x );
}
};

struct cos_functor : public thrust::unary_function< value_type , value_type >
{
__host__ __device__
value_type operator()( value_type x) const
{
return cos( x );
}
};

static std::pair< value_type , value_type > get_mean( const state_type &x )
{
value_type sin_sum = thrust::reduce(
thrust::make_transform_iterator( x.begin() , sin_functor() ) ,
thrust::make_transform_iterator( x.end() , sin_functor() ) );
value_type cos_sum = thrust::reduce(
thrust::make_transform_iterator( x.begin() , cos_functor() ) ,
thrust::make_transform_iterator( x.end() , cos_functor() ) );

cos_sum /= value_type( x.size() );
sin_sum /= value_type( x.size() );

value_type K = sqrt( cos_sum * cos_sum + sin_sum * sin_sum );
value_type Theta = atan2( sin_sum , cos_sum );

return std::make_pair( K , Theta );
}
};
```

Inside this class two member structures `sin_functor` and `cos_functor` are defined. They compute the sine and the cosine of a value and they are used within a transform iterator to calculate the sum of sin(φ​k) and cos(φ​k). The classifiers `__host__` and `__device__` are CUDA specific and define a function or operator which can be executed on the GPU as well as on the CPU. The line

```value_type sin_sum = thrust::reduce(
thrust::make_transform_iterator( x.begin() , sin_functor() ) ,
thrust::make_transform_iterator( x.end() , sin_functor() ) );
```

performs the calculation of this sine-sum on the GPU (or on the CPU, depending on your thrust configuration).

The system function is defined via

```class phase_oscillator_ensemble
{

public:

struct sys_functor
{
value_type m_K , m_Theta , m_epsilon;

sys_functor( value_type K , value_type Theta , value_type epsilon )
: m_K( K ) , m_Theta( Theta ) , m_epsilon( epsilon ) { }

template< class Tuple >
__host__ __device__
void operator()( Tuple t )
{
thrust::get<2>(t) = thrust::get<1>(t) + m_epsilon * m_K * sin( m_Theta - thrust::get<0>(t) );
}
};

// ...

void operator() ( const state_type &x , state_type &dxdt , const value_type dt ) const
{
std::pair< value_type , value_type > mean_field = mean_field_calculator::get_mean( x );

thrust::for_each(
thrust::make_zip_iterator( thrust::make_tuple( x.begin() , m_omega.begin() , dxdt.begin() ) ),
thrust::make_zip_iterator( thrust::make_tuple( x.end() , m_omega.end() , dxdt.end()) ) ,
sys_functor( mean_field.first , mean_field.second , m_epsilon )
);
}

// ...
};
```

This class is used within the `do_step` and `integrate` method. It defines a member structure `sys_functor` for the r.h.s. of each individual oscillator and the `operator()` for the use in the steppers and integrators of odeint. The functor computes first the mean field of φ​k and secondly calculates the whole r.h.s. of the ODE using this mean field. Note, how nicely `thrust::tuple` and `thrust::zip_iterator` play together.

Now we are ready to put everything together. All we have to do for making odeint ready for using the GPU is to parametrize the stepper with the `state_type` and `value_type`:

```typedef runge_kutta4< state_type , value_type , state_type , value_type > stepper_type;
```

Note We have specifically define four template parameters because we have to override the default parameter value `double` with `value_type` to ensure our programs runs properly if we use `float` as fundamental data type.

You can also use a controlled or dense output stepper, e.g.

```typedef runge_kutta_dopri5< state_type , value_type , state_type , value_type > stepper_type;
```

Then, it is straightforward to integrate the phase ensemble by creating an instance of the rhs class and using an integration function:

```phase_oscillator_ensemble ensemble( N , 1.0 );
```

```size_t steps1 = integrate_const( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , boost::ref( ensemble ) , x , 0.0 , t_transients , dt );
```

We have to use `boost::ref` here in order to pass the rhs class as reference and not by value. This ensures that the natural frequencies of each oscillator are not copied when calling `integrate_const`. In the full example the performance and results of the Runge-Kutta-4 and the Dopri5 solver are compared.

The full example can be found at phase_oscillator_example.cu.

#### Large oscillator chains

The next example is a large, one-dimensional chain of nearest-neighbor coupled phase oscillators with the following equations of motion:

d φ​k / dt = ω​k + sin( φ​k+1 - φ​k ) + sin( φ​k - φ​k-1)

In principle we can use all the techniques from the previous phase oscillator ensemble example, but we have to take special care about the coupling of the oscillators. To efficiently implement the coupling you can use a very elegant way employing Thrust's permutation iterator. A permutation iterator behaves like a normal iterator on a vector but it does not iterate along the usual order of the elements. It rather iterates along some permutation of the elements defined by some index map. To realize the nearest neighbor coupling we create one permutation iterator which travels one step behind a usual iterator and another permutation iterator which travels one step in front. The full system class is:

```//change this to host_vector< ... > if you want to run on CPU
typedef thrust::device_vector< value_type > state_type;
typedef thrust::device_vector< size_t > index_vector_type;
//typedef thrust::host_vector< value_type > state_type;
//typedef thrust::host_vector< size_t > index_vector_type;

class phase_oscillators
{

public:

struct sys_functor
{
template< class Tuple >
__host__ __device__
void operator()( Tuple t )  // this functor works on tuples of values
{
// first, unpack the tuple into value, neighbors and omega
const value_type phi = thrust::get<0>(t);
const value_type phi_left = thrust::get<1>(t);  // left neighbor
const value_type phi_right = thrust::get<2>(t); // right neighbor
const value_type omega = thrust::get<3>(t);
// the dynamical equation
thrust::get<4>(t) = omega + sin( phi_right - phi ) + sin( phi - phi_left );
}
};

phase_oscillators( const state_type &omega )
: m_omega( omega ) , m_N( omega.size() ) , m_prev( omega.size() ) , m_next( omega.size() )
{
// build indices pointing to left and right neighbours
thrust::counting_iterator<size_t> c( 0 );
thrust::copy( c , c+m_N-1 , m_prev.begin()+1 );
m_prev = 0; // m_prev = { 0 , 0 , 1 , 2 , 3 , ... , N-1 }

thrust::copy( c+1 , c+m_N , m_next.begin() );
m_next[m_N-1] = m_N-1; // m_next = { 1 , 2 , 3 , ... , N-1 , N-1 }
}

void operator() ( const state_type &x , state_type &dxdt , const value_type dt )
{
thrust::for_each(
thrust::make_zip_iterator(
thrust::make_tuple(
x.begin() ,
thrust::make_permutation_iterator( x.begin() , m_prev.begin() ) ,
thrust::make_permutation_iterator( x.begin() , m_next.begin() ) ,
m_omega.begin() ,
dxdt.begin()
) ),
thrust::make_zip_iterator(
thrust::make_tuple(
x.end() ,
thrust::make_permutation_iterator( x.begin() , m_prev.end() ) ,
thrust::make_permutation_iterator( x.begin() , m_next.end() ) ,
m_omega.end() ,
dxdt.end()) ) ,
sys_functor()
);
}

private:

const state_type &m_omega;
const size_t m_N;
index_vector_type m_prev;
index_vector_type m_next;
};
```

Note, how easy you can obtain the value for the left and right neighboring oscillator in the system functor using the permutation iterators. But, the call of the `thrust::for_each` function looks relatively complicated. Every term of the r.h.s. of the ODE is resembled by one iterator packed in exactly the same way as it is unpacked in the functor above.

Now we put everything together. We create random initial conditions and decreasing frequencies such that we should get synchronization. We copy the frequencies and the initial conditions onto the device and finally initialize and perform the integration. As result we simply write out the current state, hence the phase of each oscillator.

```// create initial conditions and omegas on host:
vector< value_type > x_host( N );
vector< value_type > omega_host( N );
for( size_t i=0 ; i<N ; ++i )
{
x_host[i] = 2.0 * pi * drand48();
omega_host[i] = ( N - i ) * epsilon; // decreasing frequencies
}

// copy to device
state_type x = x_host;
state_type omega = omega_host;

// create stepper
runge_kutta4< state_type , value_type , state_type , value_type > stepper;

// create phase oscillator system function
phase_oscillators sys( omega );

// integrate
integrate_const( stepper , sys , x , 0.0 , 10.0 , dt );

thrust::copy( x.begin() , x.end() , std::ostream_iterator< value_type >( std::cout , "\n" ) );
std::cout << std::endl;
```

The full example can be found at phase_oscillator_chain.cu.

#### Parameter studies

Another important use case for Thrust and CUDA are parameter studies of relatively small systems. Consider for example the three-dimensional Lorenz system from the chaotic systems example in the previous section which has three parameters. If you want to study the behavior of this system for different parameters you usually have to integrate the system for many parameter values. Using thrust and odeint you can do this integration in parallel, hence you integrate a whole ensemble of Lorenz systems where each individual realization has a different parameter value.

In the following we will show how you can use Thrust to integrate the above mentioned ensemble of Lorenz systems. We will vary only the parameter β but it is straightforward to vary other parameters or even two or all three parameters. Furthermore, we will use the largest Lyapunov exponent to quantify the behavior of the system (chaoticity).

We start by defining the range of the parameters we want to study. The state_type is again a ```thrust::device_vector< value_type >```.

```vector< value_type > beta_host( N );
const value_type beta_min = 0.0 , beta_max = 56.0;
for( size_t i=0 ; i<N ; ++i )
beta_host[i] = beta_min + value_type( i ) * ( beta_max - beta_min ) / value_type( N - 1 );

state_type beta = beta_host;
```

The next thing we have to implement is the Lorenz system without perturbations. Later, a system with perturbations is also implemented in order to calculate the Lyapunov exponent. We will use an ansatz where each device function calculates one particular realization of the Lorenz ensemble

```struct lorenz_system
{
struct lorenz_functor
{
template< class T >
__host__ __device__
void operator()( T t ) const
{
// unpack the parameter we want to vary and the Lorenz variables
value_type R = thrust::get< 3 >( t );
value_type x = thrust::get< 0 >( t );
value_type y = thrust::get< 1 >( t );
value_type z = thrust::get< 2 >( t );
thrust::get< 4 >( t ) = sigma * ( y - x );
thrust::get< 5 >( t ) = R * x - y - x * z;
thrust::get< 6 >( t ) = -b * z + x * y ;

}
};

lorenz_system( size_t N , const state_type &beta )
: m_N( N ) , m_beta( beta ) { }

template< class State , class Deriv >
void operator()(  const State &x , Deriv &dxdt , value_type t ) const
{
thrust::for_each(
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( x ) ,
boost::begin( x ) + m_N ,
boost::begin( x ) + 2 * m_N ,
m_beta.begin() ,
boost::begin( dxdt ) ,
boost::begin( dxdt ) + m_N ,
boost::begin( dxdt ) + 2 * m_N  ) ) ,
thrust::make_zip_iterator( thrust::make_tuple(
boost::begin( x ) + m_N ,
boost::begin( x ) + 2 * m_N ,
boost::begin( x ) + 3 * m_N ,
m_beta.begin() ,
boost::begin( dxdt ) + m_N ,
boost::begin( dxdt ) + 2 * m_N ,
boost::begin( dxdt ) + 3 * m_N  ) ) ,
lorenz_functor() );
}
size_t m_N;
const state_type &m_beta;
};
```

As `state_type` a `thrust::device_vector` or a Boost.Range of a `device_vector` is used. The length of the state is 3N where N is the number of systems. The system is encoded into this vector such that all x components come first, then every y components and finally every z components. Implementing the device function is then a simple task, you only have to decompose the tuple originating from the zip iterators.

Besides the system without perturbations we furthermore need to calculate the system including linearized equations governing the time evolution of small perturbations. Using the method from above this is straightforward, with a small difficulty that Thrust's tuples have a maximal arity of 10. But this is only a small problem since we can create a zip iterator packed with zip iterators. So the top level zip iterator contains one zip iterator for the state, one normal iterator for the parameter, and one zip iterator for the derivative. Accessing the elements of this tuple in the system function is then straightforward, you unpack the tuple with `thrust::get<>()`. We will not show the code here, it is to large. It can be found here and is easy to understand.

Furthermore, we need an observer which determines the norm of the perturbations, normalizes them and averages the logarithm of the norm. The device functor which is used within this observer is defined

```struct lyap_functor
{
template< class T >
__host__ __device__
void operator()( T t ) const
{
value_type &dx = thrust::get< 0 >( t );
value_type &dy = thrust::get< 1 >( t );
value_type &dz = thrust::get< 2 >( t );
value_type norm = sqrt( dx * dx + dy * dy + dz * dz );
dx /= norm;
dy /= norm;
dz /= norm;
thrust::get< 3 >( t ) += log( norm );
}
};
```

Note, that this functor manipulates the state, i.e. the perturbations.

Now we complete the whole code to calculate the Lyapunov exponents. First, we have to define a state vector. This vector contains 6N entries, the state x,y,z and its perturbations dx,dy,dz. We initialize them such that x=y=z=10, dx=1, and dy=dz=0. We define a stepper type, a controlled Runge-Kutta Dormand-Prince 5 stepper. We start with some integration to overcome the transient behavior. For this, we do not involve the perturbation and run the algorithm only on the state x,y,z without any observer. Note, how Boost.Range is used for partial integration of the state vector without perturbations (the first half of the whole state). After the transient, the full system with perturbations is integrated and the Lyapunov exponents are calculated and written to `stdout`.

```state_type x( 6 * N );

// initialize x,y,z
thrust::fill( x.begin() , x.begin() + 3 * N , 10.0 );

// initial dx
thrust::fill( x.begin() + 3 * N , x.begin() + 4 * N , 1.0 );

// initialize dy,dz
thrust::fill( x.begin() + 4 * N , x.end() , 0.0 );

// create error stepper, can be used with make_controlled or make_dense_output
typedef runge_kutta_dopri5< state_type , value_type , state_type , value_type > stepper_type;

lorenz_system lorenz( N , beta );
lorenz_perturbation_system lorenz_perturbation( N , beta );
lyap_observer obs( N , 1 );

// calculate transients
integrate_adaptive( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , lorenz , std::make_pair( x.begin() , x.begin() + 3 * N ) , 0.0 , 10.0 , dt );

// calculate the Lyapunov exponents -- the main loop
double t = 0.0;
while( t < 10000.0 )
{
integrate_adaptive( make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) , lorenz_perturbation , x , t , t + 1.0 , 0.1 );
t += 1.0;
obs( x , t );
}

vector< value_type > lyap( N );
obs.fill_lyap( lyap );

for( size_t i=0 ; i<N ; ++i )
cout << beta_host[i] << "\t" << lyap[i] << "\n";
```

The full example can be found at lorenz_parameters.cu.