vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
wielding.h
1/************************************************************************/
2/* */
3/* vspline - a set of generic tools for creation and evaluation */
4/* of uniform b-splines */
5/* */
6/* Copyright 2015 - 2023 by Kay F. Jahnke */
7/* */
8/* The git repository for this software is at */
9/* */
10/* https://bitbucket.org/kfj/vspline */
11/* */
12/* Please direct questions, bug reports, and contributions to */
13/* */
14/* kfjahnke+vspline@gmail.com */
15/* */
16/* Permission is hereby granted, free of charge, to any person */
17/* obtaining a copy of this software and associated documentation */
18/* files (the "Software"), to deal in the Software without */
19/* restriction, including without limitation the rights to use, */
20/* copy, modify, merge, publish, distribute, sublicense, and/or */
21/* sell copies of the Software, and to permit persons to whom the */
22/* Software is furnished to do so, subject to the following */
23/* conditions: */
24/* */
25/* The above copyright notice and this permission notice shall be */
26/* included in all copies or substantial portions of the */
27/* Software. */
28/* */
29/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND */
30/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES */
31/* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND */
32/* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT */
33/* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, */
34/* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING */
35/* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR */
36/* OTHER DEALINGS IN THE SOFTWARE. */
37/* */
38/************************************************************************/
39
40/*! \file wielding.h
41
42 \brief Implementation of vspline::transform
43
 44 wielding.h provides code to process all 1D subarrays of nD views.
45 This is similar to using vigra::Navigator, which also iterates over
46 1D subarrays of nD arrays. Here, this access is hand-coded to have
47 complete control over the process, and to work with range-based
48 code rather than the iterator-based approach vigra uses.
49
50 The code is structured so that separable aspects of the process
51 are coded as separate entities:
52
53 the top-level object in the wielding code is class wield.
54 class wield offers several methods taking information about
55 the data which are to be processed, and std::functions defining
 56 the specific processing which is intended for the 1D subarrays.
57 When one of wield's top level methods is called, it iterates
58 over the 1D subarrays, calling the std::function for each subarray
59 in turn - the std::function is used as a callback function.
60
61 Once inside the callback function, what's now seen is a specific
62 1D subarray (or a pair of them, when two arrays are processed
63 in sync), plus any additional information specifically needed
64 by the callback function, like the starting index in the nD
65 array, which is needed for index-based transforms.
66
67 The callback 'functions' passed to the wield object in this body
68 of code are actually functors. They are set up to 'contain' an
69 adapted vspline::unary_functor, which is capable of processing
70 data contained in the arrays.
71
72 If vectorization is not used, the processing is trivial: it 'collapses'
73 to a simple traversal of the 1D subarray(s), using the unvectorized
74 evaluation code in the vspline::unary_functor. But the whole point
75 of 'aggregation' is to feed the *vectorized* evaluation code:
76
77 Here, the data are reworked to be suited for vectorized processing.
78 This is done by copying incoming data into a small buffer, using
79 techniques like SIMD gathering, SIMD loads and possibly Vc-provided
80 deinterleaving, then processing the buffer with vectorized code,
81 and finally writing the result back to target memory using the
82 reverse operations: SIMD scatters or stores, or Vc's interleaving
83 code. The 'magic' is that all of this is transparent to calling
84 code: to the caller it's merely a call into code processing arrays
85 of data, and all the complex buffering and unbuffering is done
86 in a 'black box', encapsulated in class wield and the callback
87 functions.
88
89 The functions handling individual 1D subarrays of data are natural
90 candidates as 'joblets' to be used by several worker threads. With
91 my new multithreading code introduced in March 2019, multithreading
92 can use this granularity efficiently with an arbitrary number of
93 workers. The multithreading is now done directly by class 'wield'
 94 in its top-level methods and follows the 'standard' pattern of
95 setting up the 'payload' as a lambda with reference capture,
96 parcelling out 'joblets' via a vspline::atomic. This ensures
97 granularity at the level of individual 1D subarrays (like, lines
98 of an image) and next to no signalling overhead. As an added
99 benefit, the set of currently active threads will co-operate on
100 a reasonably small area of memory, making cache hits likely.
101
102 If Vc is used, the code provides specialized routines for cases
103 where Vc can speed things up. Without Vc, this code will not be
104 compiled (it's inside #ifdef USE_VC ... #endif preprocessor
105 statements). Without Vc, the code will still be vectorized by
106 a technique I call 'goading': The data are repackaged into small
107 SoAs with vector-friendly array sizes and the expectation is that
108 the compiler will recognize that the resulting inner loops are
109 candidates for autovectorization. Using this technique has the
110 advantage that - if the compiler 'gets it' - code will be generated
111 for every target the *compiler* can produce autovectorized code for,
112 rather than being limited to what Vc covers. And since the Vc types
113 may mystify the compiler, not using them may also allow the compiler
114 to optimize the code better. The 'goading' is done by using a 'mock'
115 SIMD type (vspline::simd_type, see simd_type.h for more information).
116 The actual SIMD or pseudo-SIMD data types used by the wielding code
117 are not fixed, though - what's used is inferred from the functor
118 passed to the wielding code, and the idea is to widen the feeding
119 spectrum easily to other vectorized data types. If there is no
120 specialized code for these types (like the Vc code for Vc data),
121 there are only very few requirements for these types and adapting
122 to new variants should be simple. TODO: concretize interface
123
124 After the aggregation code, wielding.h provides three functions
125 using the mechanism described above to process arrays of data.
 126 These functions (index_, coupled_ and generate_wield) take care
127 of setting up and calling into the wield objects. They are used in
128 turn to implement 'transform' routines, which are the top-level
 129 code that user code calls. These top-level routines take care of
130 argument checking and presenting the arguments to the wielding
131 code in the form it needs. That code is in transform.h.
132
133 So by now, the use of the term 'wielding' should be obvious.
134 We have a 'tool', namely the vspline::unary_functor, and we have
135 data on which we intend to use the unary_functor. What's left to
136 do? Wielding the tool! And since this operation can be factored
137 out, I've done so and labeled it the 'wielding' code. There is
138 another place in vspline which also provides 'wielding' code:
139 it's the code in filter.h, which is used to 'wield' specific
140 digital filters (like convolution or b-spline prefiltering),
141 applying them to arrays of data. The requirements there are
142 quite different from the requirements here, so these two bodies
143 of wielding code are separate, but the design method is the same:
 144 we use two conceptual entities, the tool and its use.
145
146 The 'magic' in vspline's wielding code is the automatic
147 multithreading and vectorization, which is done transparently and
148 makes the code fast. But seen from the outside, by a caller of
149 one of the 'transform' functions, all the complexity is hidden.
150 And, at the same time, if code is needed for targets which can't
151 use vector code or multithreading, enabling or disabling these
 152 capabilities is as simple as passing a preprocessor definition
 153 to the compiler. A sketch of typical caller-side code follows this comment.
154*/
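
/* For orientation - this is an illustrative note, not library code: seen
   from the caller's side, all of the machinery in this file is hidden
   behind the 'transform' family of functions defined in transform.h
   (which includes this header). A typical use, assuming the API
   documented there, looks roughly like this:

   // the 'tool': a functor derived from vspline::unary_functor. The
   // single eval template covers both scalar and vectorized operation.

   struct times_three
   : public vspline::unary_functor < float , float >
   {
     template < class IN , class OUT >
     void eval ( const IN & in , OUT & out ) const
     {
       out = in * 3.0f ;
     }
   } ;

   // 'wielding' the tool: apply it to every element of 'a', depositing
   // the result in 'b'. Multithreading and vectorization happen inside.

   vigra::MultiArray < 2 , float > a ( vigra::Shape2 ( 1024 , 1024 ) ) ;
   vigra::MultiArray < 2 , float > b ( vigra::Shape2 ( 1024 , 1024 ) ) ;

   vspline::transform ( times_three() , a , b ) ;
*/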
155
156#ifndef VSPLINE_WIELDING_H
157
158#include <atomic>
159#include "interleave.h"
160#include "vspline.h"
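
// The following is not part of the library logic: it's a minimal,
// self-contained sketch of the 'goading' technique described in the
// header comment above. Data are handled in small, fixed-size chunks
// with simple unit-stride inner loops, in the expectation that the
// compiler's autovectorizer turns those loops into SIMD code. The chunk
// size of 8 and all names in this namespace are arbitrary choices made
// only for illustration; vspline's actual 'mock' SIMD type is
// vspline::simd_type (see simd_type.h).

namespace wielding_goading_sketch
{
  enum { chunk = 8 } ; // assumed lane count, for illustration only

  // process one full chunk: a typical autovectorization candidate

  inline void scale_chunk ( const float * in , float * out , float factor )
  {
    for ( int e = 0 ; e < chunk ; e++ )
      out [ e ] = in [ e ] * factor ;
  }

  // process a whole unstrided line: a 'peeling' run over full chunks,
  // then a scalar loop mopping up the leftovers - the same structure
  // the aggregators further down use.

  inline void scale_line ( const float * in , float * out ,
                           long length , float factor )
  {
    long aggregates = length / chunk ;
    long leftover = length - aggregates * chunk ;

    for ( long a = 0 ; a < aggregates ; a++ )
    {
      scale_chunk ( in , out , factor ) ;
      in += chunk ;
      out += chunk ;
    }

    for ( long r = 0 ; r < leftover ; r++ )
      out [ r ] = in [ r ] * factor ;
  }
}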
161
162namespace wielding
163{
164/// indexed_aggregator receives the start coordinate and processing axis
165/// along with the data to process; this is meant for index-transforms.
166/// The coordinate is updated for every call to the 'inner' functor
167/// so that the inner functor has the current coordinate as input.
168/// The code in this template will only be used for vectorized operation;
169/// without vectorization, only the specialization for vsize == 1 below
170/// is used.
171
172template < size_t vsz , typename ic_type , class functor_type ,
173 typename = std::enable_if < ( vsz > 1 ) > >
174struct indexed_aggregator
175{
176 // extract the functor's i/o type system
177
178 typedef typename functor_type::in_type in_type ;
179 typedef typename functor_type::in_ele_type in_ele_type ;
180 typedef typename functor_type::in_v in_v ;
181 typedef typename functor_type::in_ele_v in_ele_v ;
182
183 typedef typename functor_type::out_type out_type ;
184 typedef typename functor_type::out_ele_type out_ele_type ;
185 typedef typename functor_type::out_v out_v ;
186 typedef typename functor_type::out_ele_v out_ele_v ;
187
188 enum { dim_in = functor_type::dim_in } ;
189 enum { dim_out = functor_type::dim_out } ;
190
191 // note how we use the functor's in_type as the coordinate type,
192 // rather than using a TinyVector of some integral type. This way
193 // we have the index already in the type needed by the functor and
194 // arithmetic on the coordinate uses this type as well.
195
196 const functor_type functor ;
197
198 indexed_aggregator ( const functor_type & _functor )
199 : functor ( _functor )
200 { } ;
201
202#ifdef USE_VC
203
204 // helper function to determine if the lane count is a multiple
205 // of the hardware vector size. Used only with Vc. This function
206 // is used to initialize a const static bool where direct
207 // initialization produces a warning if hsize is zero.
208
209 static bool is_n_hsize()
210 {
 211 if ( vspline::vector_traits < out_ele_type > :: hsize == 0 )
 212 return false ;
 213 const int div_by = vspline::vector_traits < out_ele_type > :: hsize ;
 214 return ( vsz % div_by == 0 ) ;
215 }
216
217#endif
218
219 // note how 'crd' is of in_type, which depends on the functor,
220 // while the actual call passes an integral type. If in_type
221 // is real, this overload is nevertheless picked and the argument
222 // converted to the real coordinate type.
223
224 void operator() ( in_type crd ,
225 int axis ,
226 out_type * trg ,
227 ic_type stride ,
228 ic_type length )
229 {
230 auto aggregates = length / vsz ;
231 auto leftover = length - aggregates * vsz ;
232
233 // the buffer and the nD coordinate are created as the data types
234 // which the functor expects.
235
236 out_v buffer ;
237 in_v md_crd ;
238
239 // initialize the vectorized coordinate. This coordinate will
240 // remain constant except for the component indexing the
241 // processing axis, which will be counted up as we go along.
242 // This makes the index calculations very efficient: for one
243 // vectorized evaluation, we only need a single vectorized
244 // addition where the vectorized coordinate is increased by
245 // vsize.
246
247 for ( int d = 0 ; d < dim_in ; d++ )
248 {
249 if ( d != axis )
250 md_crd[d] = crd[d] ;
251 else
252 {
253 for ( int e = 0 ; e < vsz ; e++ )
254 md_crd[d][e] = crd[d] + e ;
255 }
256 }
257
258#ifdef USE_VC
259
260 // flag which is true if vsz is a multiple of the hardware
261 // vector size for out_ele_type. This flag will activate the use
262 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
263 // If this is unwanted, the easiest way to deactivate that code
264 // is by setting this flag to false. Then, all access which can't
265 // use straight SIMD store operations will use scatters.
266
267 static const bool out_n_vecsz ( is_n_hsize() ) ;
268
269#else
270
271 static const bool out_n_vecsz = false ;
272
273#endif
274
275 // process a bunch of coordinates: apply the 'inner' functor,
276 // then write result to memory using 'fluff'.
277
278 // flag used to dispatch to either of the unstrided bunch/fluff
279 // overloads:
280
281 typedef typename std::integral_constant < bool , dim_out == 1 > use_store_t ;
282
283 // if the stride is 1, we can use specialized 'fluff' variants,
284 // provided the data are single-channel (or the vector width
285 // is a multiple of the hardware vector width when Vc is used).
286 // All other cases are handled with the variant of 'fluff'
287 // taking a stride.
288
289 if ( stride == 1 && ( dim_out == 1 || out_n_vecsz ) )
290 {
291 for ( ic_type a = 0 ; a < aggregates ; a++ )
292 {
293 functor ( md_crd , buffer ) ;
294 fluff ( buffer , trg , use_store_t() ) ;
295 trg += vsz ;
296 md_crd[axis] += vsz ;
297 }
298 }
299 else
300 {
301 for ( ic_type a = 0 ; a < aggregates ; a++ )
302 {
303 functor ( md_crd , buffer ) ;
304 fluff ( buffer , trg , stride ) ;
305 trg += vsz * stride ;
306 md_crd[axis] += vsz ;
307 }
308 }
309
310 // peeling is done, any leftovers are processed one-by-one
311
312 crd[axis] += aggregates * vsz ;
313
314 for ( ic_type r = 0 ; r < leftover ; r++ )
315 {
316 functor ( crd , *trg ) ;
317 trg += stride ;
318 crd[axis]++ ;
319 }
320 }
321} ; // struct indexed_aggregator
322
323/// specialization for vsz == 1. Here the data are simply
324/// processed one by one in a loop, without vectorization.
325
326template < typename ic_type , class functor_type >
327struct indexed_aggregator < 1 , ic_type , functor_type >
328{
329 const functor_type functor ;
330
331 indexed_aggregator ( const functor_type & _functor )
332 : functor ( _functor )
333 { } ;
334
335 // note how we use the functor's in_type as the coordinate type,
336 // rather than using a TinyVector of some integral type. This way
337 // we have the index already in the type needed by the functor and
338 // arithmetic on the coordinate uses this type as well.
339
340 typedef typename functor_type::in_type sd_coordinate_type ;
341
342 void operator() ( sd_coordinate_type crd ,
343 int axis ,
344 typename functor_type::out_type * trg ,
345 ic_type stride ,
346 ic_type length )
347 {
348 for ( ic_type r = 0 ; r < length ; r++ )
349 {
350 functor ( crd , *trg ) ;
351 trg += stride ;
352 crd[axis]++ ;
353 }
354 }
355} ;
356
357/// indexed_reductor is used for reductions and has no output. The actual
358/// reduction is handled by the functor: each thread has its own copy of
359/// the functor, which does its own part of the reduction, and 'offloads'
360/// its result to some mutex-protected receptacle when it is destructed,
361/// see the 'reduce' functions in transform.h for a more detailed explanation
362/// and an example of such a functor (a minimal sketch of the offloading pattern also follows the vsz == 1 specialization of this class below).
363/// indexed_reductor processes discrete coordinates, whereas yield_reductor
364/// (the next class down) processes values. This variant works just like
365/// an indexed_aggregator, except that it produces no output for the
366/// coordinates fed to the functor; the functor itself holds state (the
367/// reduction) and is also responsible for offloading per-thread results
368/// when the worker threads terminate.
369/// This class holds a copy of the functor, and each thread has an instance
370/// of this class, ensuring that each worker thread can reduce its share of
371/// the work load independently.
372
373template < size_t vsz , typename ic_type , class functor_type ,
374 typename = std::enable_if < ( vsz > 1 ) > >
375struct indexed_reductor
376{
377 // extract the functor's type system
378
379 typedef typename functor_type::in_type in_type ;
380 typedef typename functor_type::in_ele_type in_ele_type ;
381 typedef typename functor_type::in_v in_v ;
382 typedef typename functor_type::in_ele_v in_ele_v ;
383
384 enum { dim_in = functor_type::dim_in } ;
385
386 functor_type functor ;
387
388 // get the coordinate type the functor expects
389
390 typedef typename functor_type::in_type crd_type ;
391
392 // the c'tor copy-constructs member 'functor'
393
394 indexed_reductor ( const functor_type & _functor )
395 : functor ( _functor )
396 { } ;
397
398 void operator() ( in_type crd ,
399 int axis ,
400 ic_type length )
401 {
402 auto aggregates = length / vsz ;
403 auto leftover = length - aggregates * vsz ;
404
405 // the nD coordinate is created as the data type
406 // which the functor expects.
407
408 in_v md_crd ;
409
410 // initialize the vectorized coordinate. This coordinate will
411 // remain constant except for the component indexing the
412 // processing axis, which will be counted up as we go along.
413 // This makes the index calculations very efficient: for one
414 // vectorized evaluation, we only need a single vectorized
415 // addition where the vectorized coordinate is increased by
416 // vsize.
417
418 for ( int d = 0 ; d < dim_in ; d++ )
419 {
420 if ( d != axis )
421 md_crd[d] = crd[d] ;
422 else
423 {
424 for ( int e = 0 ; e < vsz ; e++ )
425 md_crd[d][e] = crd[d] + e ;
426 }
427 }
428
429 // process a bunch of coordinates: apply the 'inner' functor.
430
431 for ( ic_type a = 0 ; a < aggregates ; a++ )
432 {
433 functor ( md_crd ) ;
434 md_crd[axis] += vsz ;
435 }
436
437 // peeling is done, any leftovers are processed one-by-one
438
439 crd[axis] += aggregates * vsz ;
440
441 for ( ic_type r = 0 ; r < leftover ; r++ )
442 {
443 functor ( crd ) ;
444 crd[axis]++ ;
445 }
446 }
447} ; // struct indexed_reductor
448
449/// specialization for vsz == 1. Here the data are simply
450/// processed one by one in a loop, without vectorization.
451
452template < typename ic_type , class functor_type >
453struct indexed_reductor < 1 , ic_type , functor_type >
454{
455 const functor_type functor ;
456
457 indexed_reductor ( const functor_type & _functor )
458 : functor ( _functor )
459 { } ;
460
461 // note how we use the functor's in_type as the coordinate type,
462 // rather than using a TinyVector of some integral type. This way
463 // we have the index already in the type needed by the functor and
464 // arithmetic on the coordinate uses this type as well.
465
466 typedef typename functor_type::in_type sd_coordinate_type ;
467
468 void operator() ( sd_coordinate_type crd ,
469 int axis ,
470 ic_type length )
471 {
472 for ( ic_type r = 0 ; r < length ; r++ )
473 {
474 functor ( crd ) ;
475 crd[axis]++ ;
476 }
477 }
478} ;
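
// Not part of the library: a minimal sketch of the reduction protocol the
// two classes above rely on. Each worker thread copy-constructs its own
// instance of the functor, accumulates into a plain member, and the
// destructor 'offloads' the per-thread result into a shared receptacle -
// here a std::atomic stands in for the mutex-protected receptacle
// mentioned above. A functor usable with indexed_reductor would
// additionally provide the vspline::unary_functor type system and a
// vectorized operator(); both are omitted here for brevity.

struct count_sketch
{
  std::atomic < long > & collected ; // shared receptacle
  long own_count ;                   // per-thread partial result

  count_sketch ( std::atomic < long > & _collected )
  : collected ( _collected ) , own_count ( 0 )
  { } ;

  // copies share the receptacle, but start with a fresh partial result

  count_sketch ( const count_sketch & other )
  : collected ( other.collected ) , own_count ( 0 )
  { } ;

  // the per-element 'reduction': here, simply counting invocations

  void operator() ( long )
  {
    own_count++ ;
  }

  // offload the per-thread result when the worker's copy is destructed

  ~count_sketch()
  {
    collected += own_count ;
  }
} ;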
479
480/// an aggregator to reduce arrays. This is like using indexed_reductor
481/// with a functor gathering from an array, but due to the use of 'bunch'
482/// this class is faster for certain array types, because it can use
483/// load/shuffle operations instead of always gathering.
484
485template < size_t vsz , typename ic_type , class functor_type ,
486 typename = std::enable_if < ( vsz > 1 ) > >
487struct yield_reductor
488{
489 typedef typename functor_type::in_type in_type ;
490 typedef typename functor_type::in_ele_type in_ele_type ;
491
492 enum { dim_in = functor_type::dim_in } ;
493
494 functor_type functor ;
495
496 // get the data types the functor expects
497
498 typedef typename functor_type::in_v in_v ;
499
500 yield_reductor ( const functor_type & _functor )
501 : functor ( _functor )
502 { } ;
503
504 void operator() ( const in_type * src ,
505 ic_type in_stride ,
506 ic_type length
507 )
508 {
509 auto aggregates = length / vsz ;
510 auto leftover = length - aggregates * vsz ;
511
512 in_v in_buffer ;
513
514 // first we perform a peeling run, processing data vectorized
515 // as long as there are enough data to fill the vectorized
516 // buffers (md_XXX_data_type)
517
518#ifdef USE_VC
519
520 // flags which are true if vsz is a multiple of the hardware
521 // vector size for the elementary types involved. This works like
522 // an opt-in: even if dim_in or dim_out are not 1, if these flags
523 // are true, specialized load/store variants are called. If, then,
524 // use_load_t or use_store_t are std::false_type, we'll end up in
525 // the specialized Vc code using InterleavedMemoryWrapper.
526
 527 static const bool in_n_vecsz
 528 = ( vspline::vector_traits < in_ele_type > :: hsize > 0 )
 529 && ( vsz % vspline::vector_traits < in_ele_type > :: hsize == 0 ) ;
 530
531#else
532
533 static const bool in_n_vecsz = false ;
534
535#endif
536
537 // used to dispatch to either of the unstrided bunch overloads;
538
539 typedef typename std::integral_constant < bool , dim_in == 1 > use_load_t ;
540
541 // depending on whether the input is strided or not,
542 // and on the vector size and number of channels,
543 // we pick different overloads of 'bunch'. The
544 // overloads without stride may use InterleavedMemoryWrapper,
545 // or, for single-channel data, SIMD load operations,
546 // which is most efficient. We can only pick the variants
547 // using InterleavedMemoryWrapper if vsz is a multiple of
548 // the hardware SIMD register size, hence the rather complex
549 // conditionals. But the complexity is rewarded with optimal
 550 // performance.
551
552 for ( ic_type a = 0 ; a < aggregates ; a++ )
553 {
554 if ( in_stride == 1
555 && ( dim_in == 1 || in_n_vecsz ) )
556 {
557 bunch ( src , in_buffer , use_load_t() ) ;
558 src += vsz ;
559 functor ( in_buffer ) ;
560 }
561 else
562 {
563 bunch ( src , in_buffer , in_stride ) ;
564 src += in_stride * vsz ;
565 functor ( in_buffer ) ;
566 }
567 }
568
569 // peeling is done, we mop up the remainder with scalar code
570
571 for ( ic_type r = 0 ; r < leftover ; r++ )
572 {
573 functor ( *src ) ;
574 src += in_stride ;
575 }
576 }
577} ; // struct yield_reductor
578
579/// specialization for vsz == 1. Here the data are simply
580/// processed one by one in a loop, without vectorization.
581
582template < typename ic_type , class functor_type >
583struct yield_reductor < 1 , ic_type , functor_type >
584{
585 const functor_type functor ;
586
587 yield_reductor ( const functor_type & _functor )
588 : functor ( _functor )
589 { } ;
590
591 void operator() ( const typename functor_type::in_type * src ,
592 ic_type in_stride ,
593 ic_type length
594 )
595 {
596 for ( ic_type r = 0 ; r < length ; r++ )
597 {
598 functor ( *src ) ;
599 src += in_stride ;
600 }
601 }
602} ;
603
604/// generate_aggregator is very similar to indexed_aggregator, but
605/// instead of managing and passing a coordinate to the functor, the
606/// functor now manages the argument side of the operation: it acts
607/// as a generator. To make this possible, the generator has to hold
608/// run-time modifiable state and can't be const like the functors
609/// used in the other aggregators, where the functors are 'pure' in
610/// a functional programming sense.
611/// A 'generator' functor to be used with this body of code is expected
612/// to behave in a certain fashion:
613/// - all of its state which stays constant and is shared by all invocations
614/// has to be present after construction.
615/// - the generator is trivially copyable
616/// - copying the generator produces copies holding the same shared state
617/// - the generator has a 'reset' routine taking a coordinate. This
618/// routine initializes state pertaining to a single 1D subarray
619/// of data to be processed in a worker thread. A minimal sketch of this protocol follows the vsz == 1 specialization of this class below.
620
621template < size_t _vsize , typename ic_type , class functor_type ,
622 typename = std::enable_if < ( _vsize > 1 ) > >
623struct generate_aggregator
624{
625 static const size_t vsize = _vsize ;
626
627 // extract the generator's output type system. This is the same
628 // system as is used by vspline::unary_functor, minus, of course,
629 // the input types.
630
631 typedef typename functor_type::out_type out_type ;
632 typedef typename functor_type::out_ele_type out_ele_type ;
633 typedef typename functor_type::out_nd_ele_type out_nd_ele_type ;
634 typedef typename functor_type::out_v out_v ;
635 typedef typename functor_type::out_ele_v out_ele_v ;
636 typedef typename functor_type::out_nd_ele_v out_nd_ele_v ;
637
638 enum { channels = functor_type::channels } ;
639
640 // functor is a generator and carries mutable state, so it's not const
641
642 functor_type functor ;
643
644 // get the coordinate type the functor expects
645
646 typedef typename functor_type::shape_type crd_type ;
647
648 // the c'tor copy-constructs member 'functor', which will again be
649 // copy-constructed in the single threads, providing a separate
650 // instance for each thread.
651
652 generate_aggregator ( const functor_type & _functor )
653 : functor ( _functor )
654 { } ;
655
656 // variant code producing a full line of data in one go
 657 // this may go later; there seems to be no gain to be had from this.
658
659#ifdef USE_BUFFERED_GENERATION
660
661 void operator() ( crd_type crd ,
662 int axis ,
663 out_type * trg ,
664 ic_type stride ,
665 ic_type length )
666 {
667 // We need an nD equivalent of 'trg' to use 'fluff'
668
669 out_nd_ele_type * & nd_trg
670 = reinterpret_cast < out_nd_ele_type * & > ( trg ) ;
671
672 auto aggregates = length / vsize ;
673 auto leftover = length - aggregates * vsize ;
674
675 // reset the functor to start from a new initial coordinate.
676
677 functor.reset ( crd , aggregates ) ;
678
679 // the buffer is created as the data type which the functor expects.
680 // since the functor is a generator, there is no input for it.
681
682 vigra::MultiArray < 1 , out_ele_v >
683 vbuffer ( aggregates * channels ) ;
684
685 vigra::MultiArray < 1 , out_type > rest ( leftover ) ;
686
687 functor.eval ( vbuffer , rest ) ;
688
689#ifdef USE_VC
690
691 // flag which is true if vsize is a multiple of the hardware
692 // vector size for out_ele_type. This flag will activate the use
693 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
694 // If this is unwanted, the easiest way to deactivate that code
695 // is by setting this flag to false. Then, all access which can't
696 // use straight SIMD store operations will use scatters.
697
 698 static const bool out_n_vecsz
 699 = ( vspline::vector_traits < out_ele_type > :: hsize > 0 )
 700 && ( vsize % vspline::vector_traits < out_ele_type > :: hsize == 0 ) ;
 701
702#else
703
704 static const bool out_n_vecsz = false ;
705
706#endif
707
708 // generate a set of data: call the 'inner' functor,
709 // then write result to memory using 'fluff'.
710
711 // flag used to dispatch to either of the unstrided bunch/fluff overloads:
712
713 typedef typename std::integral_constant
714 < bool , channels == 1 > use_store_t ;
715
716 // if the stride is 1, we can use specialized 'fluff' variants,
717 // provided the data are single-channel (or the vector width
718 // is a multiple of the hardware vector width when Vc is used).
719 // All other cases are handled with the variant of 'fluff'
720 // taking a stride.
721
722 // TODO: would be nice to simply have a MultiArrayView of
723 // aggregates * out_v, but that crashes
724 // hence the detour via the nD type and storing (and reloading)
725 // individual vectors
726
727 // We need an nD equivalent of 'vr' to use 'fluff'
728
729 out_v vr ;
730 out_nd_ele_v & ndvr = reinterpret_cast < out_nd_ele_v & > ( vr ) ;
731
732 if ( stride == 1 && ( channels == 1 || out_n_vecsz ) )
733 {
734 for ( ic_type a = 0 ; a < aggregates ; a++ )
735 {
736 for ( size_t ch = 0 ; ch < channels ; ch++ )
737 ndvr[ch] = vbuffer [ a * channels + ch ] ;
738 fluff ( ndvr , nd_trg , use_store_t() ) ;
739 trg += vsize ;
740 }
741 }
742 else
743 {
744 for ( ic_type a = 0 ; a < aggregates ; a++ )
745 {
746 for ( size_t ch = 0 ; ch < channels ; ch++ )
747 ndvr[ch] = vbuffer [ a * channels + ch ] ;
748 fluff ( ndvr , nd_trg , stride ) ;
749 trg += vsize * stride ;
750 }
751 }
752
753 // peeling is done, any leftovers are processed one-by-one
754
755 for ( ic_type r = 0 ; r < leftover ; r++ )
756 {
757 *trg = rest[r] ;
758 trg += stride ;
759 }
760 }
761
762#else // USE_BUFFERED_GENERATION
763
764 void operator() ( crd_type crd ,
765 int axis ,
766 out_type * trg ,
767 ic_type stride ,
768 ic_type length )
769 {
770 // We need an nD equivalent of 'trg' to use 'fluff'
771
772 out_nd_ele_type * & nd_trg
773 = reinterpret_cast < out_nd_ele_type * & > ( trg ) ;
774
775 // set up the vectorizable extent and the remainder
776
777 auto aggregates = length / vsize ;
778 auto leftover = length - aggregates * vsize ;
779
780 // reset the functor to start from a new initial coordinate.
781
782 functor.reset ( crd , aggregates ) ;
783
784 // buffer 'vr' is created as the data type which the functor expects.
785 // since the functor is a generator, there is no input for it.
786 // We also need an nD equivalent to use 'fluff'
787
788 out_v vr ;
789 out_nd_ele_v & ndvr = reinterpret_cast < out_nd_ele_v & > ( vr ) ;
790
791#ifdef USE_VC
792
793 // flag which is true if vsize is a multiple of the hardware
794 // vector size for out_ele_type. This flag will activate the use
795 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
796 // If this is unwanted, the easiest way to deactivate that code
797 // is by setting this flag to false. Then, all access which can't
798 // use straight SIMD store operations will use scatters.
799
 800 static const bool out_n_vecsz
 801 = ( vspline::vector_traits < out_ele_type > :: hsize > 0 )
 802 && ( vsize % vspline::vector_traits < out_ele_type > :: hsize == 0 ) ;
 803
804#else
805
806 static const bool out_n_vecsz = false ;
807
808#endif
809
810 // generate a set of data: call the 'inner' functor,
811 // then write result to memory using 'fluff'.
812
813 // flag used to dispatch to either of the unstrided bunch/fluff overloads:
814
815 typedef typename std::integral_constant
816 < bool , channels == 1 > use_store_t ;
817
818 // if the stride is 1, we can use specialized 'fluff' variants,
819 // provided the data are single-channel (or the vector width
820 // is a multiple of the hardware vector width when Vc is used).
821 // All other cases are handled with the variant of 'fluff'
822 // taking a stride.
823
824 if ( stride == 1 && ( channels == 1 || out_n_vecsz ) )
825 {
826 for ( ic_type a = 0 ; a < aggregates ; a++ )
827 {
828 functor.eval ( vr ) ;
829 fluff ( ndvr , nd_trg , use_store_t() ) ;
830 trg += vsize ;
831 }
832 }
833 else
834 {
835 for ( ic_type a = 0 ; a < aggregates ; a++ )
836 {
837 functor.eval ( vr ) ;
838 fluff ( ndvr , nd_trg , stride ) ;
839 trg += vsize * stride ;
840 }
841 }
842
843 // peeling is done, any leftovers are processed one-by-one
844
845 for ( ic_type r = 0 ; r < leftover ; r++ )
846 {
847 functor.eval ( *trg ) ;
848 trg += stride ;
849 }
850 }
851
852#endif // USE_BUFFERED_GENERATION
853
854} ; // struct generate_aggregator
855
856/// specialization for vsz == 1. Here the data are simply
857/// processed one by one in a loop, without vectorization.
858
859template < typename ic_type , class functor_type >
860struct generate_aggregator < 1 , ic_type , functor_type >
861{
862 static const size_t vsize = 1 ;
863 functor_type functor ;
864
865 generate_aggregator ( const functor_type & _functor )
866 : functor ( _functor )
867 { } ;
868
869 typedef typename functor_type::shape_type crd_type ;
870 typedef typename functor_type::out_type out_type ;
871
872#ifdef USE_BUFFERED_GENERATION
873
874 void operator() ( crd_type crd ,
875 int axis ,
876 out_type * trg ,
877 ic_type stride ,
878 ic_type length )
879 {
880 functor.reset ( crd , 0 ) ;
881 vigra::MultiArray < 1 , out_type > result ( length ) ;
882 functor.eval ( result ) ;
883
884 for ( ic_type r = 0 ; r < length ; r++ )
885 {
886 *trg = result [ r ] ;
887 trg += stride ;
888 }
889 }
890
891#else
892
 893 void operator() ( crd_type crd ,
 894 int axis ,
895 out_type * trg ,
896 ic_type stride ,
897 ic_type length )
898 {
899 functor.reset ( crd , 0 ) ;
900
901 for ( ic_type r = 0 ; r < length ; r++ )
902 {
903 functor.eval ( *trg ) ;
904 trg += stride ;
905 }
906 }
907
908#endif
909} ;
910
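// Not part of the library: a minimal sketch of the 'generator' protocol
// described before class generate_aggregator, reduced to its bare shape:
// shared state is fixed at construction, copies share it, and 'reset'
// re-anchors the per-line state, after which 'eval' produces one value
// per call. A generator usable with generate_aggregator would
// additionally provide the vspline::unary_functor-style output type
// system and a vectorized eval overload; these are omitted, and the
// plain 1D 'long' coordinate is an assumption made only for this
// illustration.

struct ramp_sketch
{
  const double step ; // shared, constant state
  double current ;    // per-line state, set up by 'reset'

  ramp_sketch ( double _step )
  : step ( _step ) , current ( 0.0 )
  { } ;

  // start a new 1D subarray at coordinate 'start'. The second argument is
  // the number of vectorized steps the caller intends to take; this
  // sketch has no use for it.

  void reset ( long start , long /* aggregates */ )
  {
    current = start * step ;
  }

  // produce the next value

  void eval ( double & result )
  {
    result = current ;
    current += step ;
  }
} ;
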
911/// an aggregator for separate - possibly different - source and target.
912/// If source and target are in fact different, the inner functor will
913/// read data from source, process them and then write them to target.
914/// If source and target are the same, the operation will be in-place,
915/// but not explicitly so. vspline uses this style of two-argument functor,
916/// and this is the aggregator we use for vspline's array-based transforms.
917/// The code in this template will only be used for vectorized operation;
918/// if vectorization is not used, only the specialization for vsize == 1
919/// below is used.
920
921template < size_t vsz , typename ic_type , class functor_type ,
922 typename = std::enable_if < ( vsz > 1 ) > >
923struct coupled_aggregator
924{
925 typedef typename functor_type::in_type in_type ;
926 typedef typename functor_type::in_ele_type in_ele_type ;
927
928 enum { dim_in = functor_type::dim_in } ;
929 enum { dim_out = functor_type::dim_out } ;
930
931 typedef typename functor_type::out_type out_type ;
932 typedef typename functor_type::out_ele_type out_ele_type ;
933
934 const functor_type functor ;
935
936 // get the data types the functor expects
937
938 typedef typename functor_type::in_v in_v ;
939 typedef typename functor_type::out_v out_v ;
940
941 coupled_aggregator ( const functor_type & _functor )
942 : functor ( _functor )
943 { } ;
944
945 void operator() ( const in_type * src ,
946 ic_type in_stride ,
947 out_type * trg ,
948 ic_type out_stride ,
949 ic_type length
950 )
951 {
952 auto aggregates = length / vsz ;
953 auto leftover = length - aggregates * vsz ;
954 const bool is_apply = ( (void*) src == (void*) trg ) ;
955
956 in_v in_buffer ;
957 out_v out_buffer ;
958
959 // first we perform a peeling run, processing data vectorized
960 // as long as there are enough data to fill the vectorized
961 // buffers (md_XXX_data_type)
962
963#ifdef USE_VC
964
965 // flags which are true if vsz is a multiple of the hardware
966 // vector size for the elementary types involved. This works like
967 // an opt-in: even if dim_in or dim_out are not 1, if these flags
968 // are true, specialized load/store variants are called. If, then,
969 // use_load_t or use_store_t are std::false_type, we'll end up in
970 // the specialized Vc code using InterleavedMemoryWrapper.
971
 972 static const bool in_n_vecsz
 973 = ( vspline::vector_traits < in_ele_type > :: hsize > 0 )
 974 && ( vsz % vspline::vector_traits < in_ele_type > :: hsize == 0 ) ;
 975
 976 static const bool out_n_vecsz
 977 = ( vspline::vector_traits < out_ele_type > :: hsize > 0 )
 978 && ( vsz % vspline::vector_traits < out_ele_type > :: hsize == 0 ) ;
 979
980#else
981
982 static const bool in_n_vecsz = false ;
983 static const bool out_n_vecsz = false ;
984
985#endif
986
987 // used to dispatch to either of the unstrided bunch/fluff overloads;
988 // see also the remarks coming with use_store_t in the routine above.
989
990 typedef typename std::integral_constant < bool , dim_in == 1 > use_load_t ;
991
992 typedef typename std::integral_constant < bool , dim_out == 1 > use_store_t ;
993
994 // depending on whether the input/output is strided or not,
995 // and on the vector size and number of channels,
996 // we pick different overloads of 'bunch' and fluff'. The
997 // overloads without stride may use InterleavedMemoryWrapper,
998 // or, for single-channel data, SIMD load/store operations,
999 // which is most efficient. We can only pick the variants
1000 // using InterleavedMemoryWrapper if vsz is a multiple of
1001 // the hardware SIMD register size, hence the rather complex
1002 // conditionals. But the complexity is rewarded with optimal
 1003 // performance.
1004
1005 if ( in_stride == 1
1006 && ( dim_in == 1 || in_n_vecsz ) )
1007 {
1008 if ( out_stride == 1
1009 && ( dim_out == 1 || out_n_vecsz ) )
1010 {
1011 for ( ic_type a = 0 ; a < aggregates ; a++ )
1012 {
1013 bunch ( src , in_buffer , use_load_t() ) ;
1014 src += vsz ;
1015 functor ( in_buffer , out_buffer ) ;
1016 fluff ( out_buffer , trg , use_store_t() ) ;
1017 trg += vsz ;
1018 }
1019 }
1020 else
1021 {
1022 for ( ic_type a = 0 ; a < aggregates ; a++ )
1023 {
1024 bunch ( src , in_buffer , use_load_t() ) ;
1025 src += vsz ;
1026 functor ( in_buffer , out_buffer ) ;
1027 fluff ( out_buffer , trg , out_stride ) ;
1028 trg += out_stride * vsz ;
1029 }
1030 }
1031 }
1032 else
1033 {
1034 if ( out_stride == 1
1035 && ( dim_out == 1 || out_n_vecsz ) )
1036 {
1037 for ( ic_type a = 0 ; a < aggregates ; a++ )
1038 {
1039 bunch ( src , in_buffer , in_stride ) ;
1040 src += in_stride * vsz ;
1041 functor ( in_buffer , out_buffer ) ;
1042 fluff ( out_buffer , trg , use_store_t() ) ;
1043 trg += vsz ;
1044 }
1045 }
1046 else
1047 {
1048 // this is the 'generic' case:
1049 for ( ic_type a = 0 ; a < aggregates ; a++ )
1050 {
1051 bunch ( src , in_buffer , in_stride ) ;
1052 src += in_stride * vsz ;
1053 functor ( in_buffer , out_buffer ) ;
1054 fluff ( out_buffer , trg , out_stride ) ;
1055 trg += out_stride * vsz ;
1056 }
1057 }
1058 }
1059
1060 // peeling is done, we mop up the remainder with scalar code
1061 // KFJ 2022-05-19 initially I coded so that an apply would have
1062 // to take care not to write to out and read in subsequently,
1063 // but I think the code should rather be defensive and avoid
1064 // the problem without user code having to be aware of it.
1065 // hence the test for equality of src and trg.
1066
1067 if ( leftover )
1068 {
1069 if ( is_apply )
1070 {
1071 // this is an 'apply', avoid write-before-read
1072 out_type help ;
1073 for ( ic_type r = 0 ; r < leftover ; r++ )
1074 {
1075 functor ( *src , help ) ;
1076 *trg = help ;
1077 src += in_stride ;
1078 trg += out_stride ;
1079 }
1080 }
1081 else
1082 {
1083 // this is not an 'apply'
1084 for ( ic_type r = 0 ; r < leftover ; r++ )
1085 {
1086 functor ( *src , *trg ) ;
1087 src += in_stride ;
1088 trg += out_stride ;
1089 }
1090 }
1091 }
1092 }
1093} ; // struct coupled_aggregator
1094
1095/// specialization for vsz == 1. Here the data are simply
1096/// processed one by one in a loop, without vectorization.
1097
1098template < typename ic_type , class functor_type >
1099struct coupled_aggregator < 1 , ic_type , functor_type >
1100{
1101 const functor_type functor ;
1102
1103 coupled_aggregator ( const functor_type & _functor )
1104 : functor ( _functor )
1105 { } ;
1106
1107 void operator() ( const typename functor_type::in_type * src ,
1108 ic_type in_stride ,
1109 typename functor_type::out_type * trg ,
1110 ic_type out_stride ,
1111 ic_type length
1112 )
1113 {
1114 if ( (void*)src == (void*)trg )
1115 {
1116 // this is an 'apply', avoid write-before-read
1117 typename functor_type::out_type help ;
1118 for ( ic_type r = 0 ; r < length ; r++ )
1119 {
1120 functor ( *src , help ) ;
1121 *trg = help ;
1122 src += in_stride ;
1123 trg += out_stride ;
1124 }
1125 }
1126 else
1127 {
1128 // this is not an 'apply'
1129 for ( ic_type r = 0 ; r < length ; r++ )
1130 {
1131 functor ( *src , *trg ) ;
1132 src += in_stride ;
1133 trg += out_stride ;
1134 }
1135 }
1136 }
1137} ;
1138
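// Not part of the library: a stripped-down sketch of the multithreading
// pattern which class wield (below) uses - a 'payload' lambda capturing
// its context by reference, and an atomic counter parcelling out joblet
// indexes. vspline's own code hands such a lambda to vspline::multithread,
// which runs it in several worker threads and obtains indexes via
// vspline::fetch_ascending (or fetch_range_ascending); here the payload
// is merely run once, single-threaded, to keep the sketch free of
// dependencies.

inline void payload_pattern_sketch ( std::ptrdiff_t nr_joblets )
{
  // the atomic parcelling out joblet indexes to however many workers
  // execute the payload concurrently

  std::atomic < std::ptrdiff_t > remaining ( nr_joblets ) ;

  // the payload, capturing its context by reference

  auto worker = [&]()
  {
    while ( true )
    {
      std::ptrdiff_t i = --remaining ; // grab the next joblet index
      if ( i < 0 )
        break ;                        // all joblets have been taken

      // ... a real payload would now process 1D subarray number i ...
    }
  } ;

  worker() ;
}
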
1139/// reimplementation of wield using the new 'neutral' multithread.
1140/// The workers now all receive the same task to process one line
1141/// at a time until all lines are processed. This simplifies the code;
1142/// the wield object directly calls 'multithread' in its operator().
1143/// And it improves performance, presumably because tail-end idling
1144/// is reduced: all active threads have data to process until the last
1145/// line has been picked up by an aggregator. So tail-end idling is
1146/// in the order of magnitude of a line's worth, in contrast to half
1147/// a worker's share of the data in the previous implementation.
1148/// The current implementation does away with specialized partitioning
1149/// code (at least for the time being); it looks like the performance
1150/// is decent throughout, even without exploiting locality by
1151/// partitioning to tiles.
1152
1153template < int dimension , class in_type , class out_type = in_type >
1154struct wield
1155{
1156 typedef vigra::MultiArrayView < dimension , in_type > in_view_type ;
1157 typedef vigra::MultiArrayView < dimension , out_type > out_view_type ;
1158 typedef typename in_view_type::difference_type_1 index_type ;
1159 typedef typename in_view_type::difference_type shape_type ;
1160
1161 // wielding, using two arrays. It's assumed that both arrays have
1162 // the same shape. The coupled_aggregator takes a pointer and stride
1163 // for each array.
1164 // Note how the first view is taken by const&, indicating that
1165 // it can not be modified. Only the second view, the target of
1166 // the operation, is non-const.
1167
1168// void operator() ( const in_view_type & in_view ,
1169// out_view_type & out_view ,
1170// coupled_aggregator_type func ,
1171// int axis = 0 ,
1172// int njobs = vspline::default_njobs ,
1173// vspline::atomic < bool > * p_cancel = 0
1174// )
1175// {
1176// assert ( in_view.shape() == out_view.shape() ) ;
1177//
1178// auto in_stride = in_view.stride ( axis ) ;
1179// auto slice1 = in_view.bindAt ( axis , 0 ) ;
1180//
1181// auto out_stride = out_view.stride ( axis ) ;
1182// auto slice2 = out_view.bindAt ( axis , 0 ) ;
1183//
1184// auto in_it = slice1.begin() ;
1185// auto out_it = slice2.begin() ;
1186//
1187// auto length = in_view.shape ( axis ) ;
1188// auto nr_indexes = slice1.size() ;
1189// vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1190//
1191// // we create the workers' code as a lambda pulling in the current
1192// // context by reference. The code is quite simple:
1193// // - decrement 'nlines'
1194// // - if nlines is now less than zero, terminate
1195// // - otherwise, call the aggregator function with arguments
1196// // pertaining to the line
1197//
1198// std::function < void() > worker =
1199// [&]()
1200// {
1201// std::ptrdiff_t i ;
1202//
1203// while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1204// {
1205// if ( p_cancel && p_cancel->load() )
1206// break ;
1207//
1208// func ( & ( in_it [ i ] ) ,
1209// in_stride ,
1210// & ( out_it [ i ] ) ,
1211// out_stride ,
1212// length ) ;
1213// }
1214// } ;
1215//
1216// // with the worker code fixed, we just call multithread:
1217//
1218// vspline::multithread ( worker , njobs ) ;
1219// }
1220//
1221// // overload of operator() which will work with an object
1222// // of type indexed_aggregator for the std::function it expects. This
1223// // object presents the nD index into the target array as input to its
1224// // inner functor, which produces the output from this nD index, rather
1225// // than looking at the array (which is only written to).
1226// // The view coming in is non-const and will receive the result data.
1227// // The aggregator is taken as a std::function of this type:
1228//
1229// void operator() ( out_view_type & out_view ,
1230// indexed_aggregator_type func ,
1231// int axis = 0 ,
1232// int njobs = vspline::default_njobs ,
1233// vspline::atomic < bool > * p_cancel = 0
1234// )
1235// {
1236// auto out_stride = out_view.stride ( axis ) ;
1237// auto slice = out_view.bindAt ( axis , 0 ) ;
1238//
1239// auto out_it = slice.begin() ;
1240// std::ptrdiff_t nr_indexes = slice.size() ;
1241// vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1242// auto length = out_view.shape ( axis ) ;
1243//
1244// // we iterate over the coordinates in slice_shape. This produces
1245// // nD indexes into the view's subarray from 'begin' to 'end', so we
1246// // need to offset the indexes with 'begin' to receive indexes
1247// // into the view itself.
1248//
1249// auto slice_shape = out_view.shape() ; // shape of whole array
1250// slice_shape[axis] = 1 ; // shape of slice with start positions
1251//
1252// typedef vigra::MultiCoordinateIterator
1253// < out_view_type::actual_dimension > mci_type ;
1254//
1255// mci_type it ( slice_shape ) ;
1256//
1257// std::function < void() > worker =
1258// [&]()
1259// {
1260// std::ptrdiff_t i ;
1261//
1262// while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1263// {
1264// if ( p_cancel && p_cancel->load() )
1265// break ;
1266//
1267// func ( it [ i ] ,
1268// axis ,
1269// & ( out_it [ i ] ) ,
1270// out_stride ,
1271// length ) ;
1272// }
1273// } ;
1274//
1275// vspline::multithread ( worker , njobs ) ;
1276// }
1277//
1278
1279#ifndef WIELDING_SEGMENT_SIZE
1280#define WIELDING_SEGMENT_SIZE 0
1281#endif
1282
1283 // variation of the coupled and index wielding code above splitting the
1284 // array(s) into segments along the processing axis. The benefit isn't
1285 // immediately obvious, but there are situations where using this code
1286 // makes a significant difference, namely when the functor relies on
1287 // memory access (which is typically the case for b-spline evaluation)
1288 // and following the evaluation order as implied by the structure of the
1289 // array(s) goes 'against the grain' of the functor's memory access. This
1290 // happens, for example, when the functor uses geometric transformations:
1291 // if the lines of the target are derived from, say, columns of the
 1292 // original data, access to the interpolator's memory is widely scattered
1293 // through coefficient space. To an extent, caching helps, but with long
1294 // lines the cache capacity is exceeded. This is precisely where cutting
1295 // the lines into segments helps: the scattered access is shortened, and
1296 // there are fewer cache misses, at the cost of more handling overhead
1297 // caused by the extra level of complexity - which is minimal.
1298 // The problem with this approach is finding a way of fixing the segment
1299 // size optimally for a given memory access pattern. If memory access is
1300 // not encumbered by geometric transformations, there is no problem in the
1301 // first place, and using segments is slightly detrimental. If there are
1302 // transformations, it's not easy to find the optimal segment size, because
1303 // this depends on the functor. With b-splines, for example, the degree of
1304 // the spline matters, because with rising degree, the memory footprint of
1305 // individual evaluations grows. And geometric transformations are a varied
1306 // bunch and one can at best hope to find heuristic values for the segment
1307 // size.
1308 // I tentatively recommend using WIELDING_SEGMENT_SIZE 512; not #defining
1309 // a value results in falling back to unsegmented code, which should be
1310 // optimal if there are no geometric transformations.
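  // As a worked example of the bookkeeping below (numbers chosen only for
  // illustration): for a target with 1000 lines of 2048 samples each and a
  // segment size of 512, nsegments = 2048 / 512 = 4 and
  // nr_indexes = 1000 * 4 = 4000 joblets; a joblet index i then maps to
  // segment s = i / 1000 and line j = i % 1000, just as the payload
  // lambda computes it.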
1311
1312 template < size_t vsz , typename ... types >
1313 void operator() ( const in_view_type & in_view ,
1314 out_view_type & out_view ,
 1315 coupled_aggregator < vsz , types ... > func ,
 1316 int axis = 0 ,
1317 int njobs = vspline::default_njobs ,
1318 vspline::atomic < bool > * p_cancel = 0 ,
1319 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1320 )
1321 {
1322 assert ( in_view.shape() == out_view.shape() ) ;
1323
1324 // per default, fall back to not using segments
1325
1326 if ( segment_size <= 0 )
1327 segment_size = in_view.shape ( axis ) ;
1328
1329 // extract the strides for input and output
1330
1331 auto in_stride = in_view.stride ( axis ) ;
1332 auto out_stride = out_view.stride ( axis ) ;
1333
1334 // create slices holding the start positions of the lines
1335
1336 auto slice1 = in_view.bindAt ( axis , 0 ) ;
1337 auto slice2 = out_view.bindAt ( axis , 0 ) ;
1338
1339 // and iterators over these slices
1340
1341 auto in_it = slice1.begin() ;
1342 auto out_it = slice2.begin() ;
1343
1344 // get the line length and the number of lines
1345
1346 auto length = in_view.shape ( axis ) ;
1347 auto nr_lines = slice1.size() ;
1348
1349 // get the number of line segments
1350
1351 std::ptrdiff_t nsegments = length / segment_size ;
1352 if ( length % segment_size )
1353 nsegments++ ;
1354
1355 // calculate the total number of joblet indexes
1356
1357 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1358
1359 // set up the atomic to share out the joblet indexes
1360 // to the worker threads
1361
1362 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1363
1364 // set up the payload code for 'multithread'
1365
1366 auto worker =
1367 [&]()
1368 {
1369 std::ptrdiff_t joblet_index ;
1370
1371 while ( vspline::fetch_ascending ( indexes , nr_indexes , joblet_index ) )
1372 {
1373 // terminate early on cancellation request
1374
1375 if ( p_cancel && p_cancel->load() )
1376 break ;
1377
1378 // glean segment and line index from joblet index
1379
1380 auto s = joblet_index / nr_lines ;
1381 auto j = joblet_index % nr_lines ;
1382
1383 // use these indexes to calculate corresponding addresses
1384
1385 auto in_start_address =
1386 & ( in_it [ j ] ) + in_stride * s * segment_size ;
1387
1388 auto out_start_address =
1389 & ( out_it [ j ] ) + out_stride * s * segment_size ;
1390
1391 // the last segment may be less than segment_size long
1392
1393 auto segment_length =
1394 std::min ( segment_size , length - s * segment_size ) ;
1395
1396 // now call the coupled aggregator to process the current segment
1397
1398 func ( in_start_address ,
1399 in_stride ,
1400 out_start_address ,
1401 out_stride ,
1402 segment_length ) ;
1403 }
1404 } ;
1405
1406 // with the atomic distributing joblet indexes and the payload code
1407 // established, we call multithread to invoke worker threads to invoke
1408 // the payload code
1409
1410 vspline::multithread ( worker , njobs ) ;
1411 }
1412
1413 // variant feeding indexes as input to the functor
1414
1415 template < size_t vsz , typename ... types >
1416 void operator() ( out_view_type & out_view ,
 1417 indexed_aggregator < vsz , types ... > func ,
 1418 int axis = 0 ,
1419 int njobs = vspline::default_njobs ,
1420 vspline::atomic < bool > * p_cancel = 0 ,
1421 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1422 )
1423 {
1424 if ( segment_size <= 0 )
1425 segment_size = out_view.shape ( axis ) ;
1426
1427 auto out_stride = out_view.stride ( axis ) ;
1428 auto slice = out_view.bindAt ( axis , 0 ) ;
1429
1430 auto out_it = slice.begin() ;
1431 std::ptrdiff_t nr_lines = slice.size() ;
1432 auto length = out_view.shape ( axis ) ;
1433
1434 auto slice_shape = out_view.shape() ; // shape of whole array
1435 slice_shape[axis] = 1 ; // shape of slice with start positions
1436
1437 typedef vigra::MultiCoordinateIterator
1438 < out_view_type::actual_dimension > mci_type ;
1439
1440 mci_type it ( slice_shape ) ;
1441
1442 std::ptrdiff_t nsegments = length / segment_size ;
1443 if ( length % segment_size )
1444 nsegments++ ;
1445
1446 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1447 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1448
1449 auto worker =
1450 [&]()
1451 {
1452 std::ptrdiff_t i ;
1453
1454 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1455 {
1456 if ( p_cancel && p_cancel->load() )
1457 break ;
1458
1459 auto s = i / nr_lines ;
1460 auto j = i % nr_lines ;
1461
1462 auto start_index = it [ j ] ;
1463 start_index [ axis ] += s * segment_size ;
1464
1465 auto start_address = & ( out_view [ start_index ] ) ;
1466
1467 auto segment_length =
1468 std::min ( segment_size , length - s * segment_size ) ;
1469
1470 func ( start_index ,
1471 axis ,
1472 start_address ,
1473 out_stride ,
1474 segment_length ) ;
1475 }
1476 } ;
1477
1478 vspline::multithread ( worker , njobs ) ;
1479 }
1480
1481 // variant feeding indexes as input to a reduction functor. The
1482 // worker threads create per-thread copies of the functor to accrete
 1483 // per-thread reductions; the functor's destructor is responsible
1484 // for pooling the per-thread results.
1485
1486 template < size_t vsz , typename ... types >
1487 void operator() ( shape_type in_shape ,
 1488 indexed_reductor < vsz , types ... > func ,
 1489 int axis = 0 ,
1490 int njobs = vspline::default_njobs ,
1491 vspline::atomic < bool > * p_cancel = 0 ,
1492 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1493 )
1494 {
1495 typedef indexed_reductor < vsz , types ... > func_t ;
1496
1497 // per default, fall back to not using segments
1498
1499 if ( segment_size <= 0 )
1500 segment_size = in_shape [ axis ] ;
1501
1502 // get the line length and the number of lines
1503
1504 auto length = in_shape [ axis ] ;
1505 auto nr_lines = prod ( in_shape ) / length ;
1506
1507 auto slice_shape = in_shape ;
1508 slice_shape[axis] = 1 ;
1509
1510 typedef vigra::MultiCoordinateIterator < dimension > mci_type ;
1511
1512 mci_type it ( slice_shape ) ;
1513
1514 // get the number of line segments
1515
1516 std::ptrdiff_t nsegments = length / segment_size ;
1517 if ( length % segment_size )
1518 nsegments++ ;
1519
1520 // calculate the total number of joblet indexes
1521
1522 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1523
1524 // set up the atomic to share out the joblet indexes
1525 // to the worker threads
1526
1527 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1528
1529 // set up the payload code for 'multithread'
1530
1531 auto worker =
1532 [&]()
1533 {
1534 std::ptrdiff_t i ;
1535 func_t w_func ( func ) ; // create per-thread copy of 'func'
1536
1537 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1538 {
1539 if ( p_cancel && p_cancel->load() )
1540 break ;
1541
1542 auto s = i / nr_lines ;
1543 auto j = i % nr_lines ;
1544
1545 auto start_index = it [ j ] ;
1546 start_index [ axis ] += s * segment_size ;
1547
1548 auto segment_length =
1549 std::min ( segment_size , length - s * segment_size ) ;
1550
 1551 w_func ( start_index , axis , segment_length ) ;
1552 }
1553 // when 'worker' ends, w_func goes out of scope and is destructed.
 1554 // Its destructor is responsible for pooling the per-thread
1555 // reduction results.
1556 } ;
1557
1558 vspline::multithread ( worker , njobs ) ;
1559 }
1560
1561 template < size_t vsz , typename ... types >
1562 void operator() ( const in_view_type & in_view ,
 1563 yield_reductor < vsz , types ... > func ,
 1564 int axis = 0 ,
1565 int njobs = vspline::default_njobs ,
1566 vspline::atomic < bool > * p_cancel = 0 ,
1567 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1568 )
1569 {
1570 typedef yield_reductor < vsz , types ... > func_t ;
1571
1572 // per default, fall back to not using segments
1573
1574 if ( segment_size <= 0 )
1575 segment_size = in_view.shape ( axis ) ;
1576
1577 // extract the strides for input and output
1578
1579 auto in_stride = in_view.stride ( axis ) ;
1580
1581 // create slices holding the start positions of the lines
1582
1583 auto slice1 = in_view.bindAt ( axis , 0 ) ;
1584
1585 // and iterators over these slices
1586
1587 auto in_it = slice1.begin() ;
1588
1589 // get the line length and the number of lines
1590
1591 auto length = in_view.shape ( axis ) ;
1592 auto nr_lines = slice1.size() ;
1593
1594 // get the number of line segments
1595
1596 std::ptrdiff_t nsegments = length / segment_size ;
1597 if ( length % segment_size )
1598 nsegments++ ;
1599
1600 // calculate the total number of joblet indexes
1601
1602 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1603
1604 // set up the atomic to share out the joblet indexes
1605 // to the worker threads
1606
1607 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1608
1609 // set up the payload code for 'multithread'
1610
1611 auto worker =
1612 [&]()
1613 {
1614 std::ptrdiff_t joblet_index ;
1615 func_t w_func ( func ) ; // create per-thread copy of 'func'
1616
1617 while ( vspline::fetch_ascending ( indexes , nr_indexes , joblet_index ) )
1618 {
1619 // terminate early on cancellation request
1620
1621 if ( p_cancel && p_cancel->load() )
1622 break ;
1623
1624 // glean segment and line index from joblet index
1625
1626 auto s = joblet_index / nr_lines ;
1627 auto j = joblet_index % nr_lines ;
1628
1629 // use these indexes to calculate corresponding addresses
1630
1631 auto in_start_address =
1632 & ( in_it [ j ] ) + in_stride * s * segment_size ;
1633
1634 // the last segment may be less than segment_size long
1635
1636 auto segment_length =
1637 std::min ( segment_size , length - s * segment_size ) ;
1638
1639 // now call the coupled aggregator to process the current segment
1640
1641 w_func ( in_start_address ,
1642 in_stride ,
1643 segment_length ) ;
1644 }
1645 // when 'worker' ends, w_func goes out of scope and is destructed.
 1646 // Its destructor is responsible for pooling the per-thread
1647 // reduction results.
1648 } ;
1649
1650 // with the atomic distributing joblet indexes and the payload code
1651 // established, we call multithread to invoke worker threads to invoke
1652 // the payload code
1653
1654 vspline::multithread ( worker , njobs ) ;
1655 }
1656
1657 // use a generator to produce data. As the aggregator for this use
1658 // has the same call signature as an indexed aggregator, we use a named
1659 // method here and may do so for the other top-level methods as well.
1660
1661 template < size_t vsz , typename ... types >
1662 void generate ( out_view_type & out_view ,
 1663 generate_aggregator < vsz , types ... > func ,
 1664 int axis = 0 ,
1665 int njobs = vspline::default_njobs ,
1666 vspline::atomic < bool > * p_cancel = 0
1667 )
1668 {
1669 auto out_stride = out_view.stride ( axis ) ;
1670 auto slice = out_view.bindAt ( axis , 0 ) ;
1671
1672 auto out_it = slice.begin() ;
1673 std::ptrdiff_t nr_indexes = slice.size() ;
1674 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1675 auto length = out_view.shape ( axis ) ;
1676
1677 auto slice_shape = out_view.shape() ; // shape of whole array
1678 slice_shape[axis] = 1 ; // shape of slice with start positions
1679
1680 // iterator yielding start indexes
1681
1682 typedef vigra::MultiCoordinateIterator
1683 < out_view_type::actual_dimension > mci_type ;
1684
1685 mci_type it ( slice_shape ) ;
1686
1687 auto worker =
1688 [&]()
1689 {
1690 // create thread-specific copy of generate_aggregator. This is
1691 // necessary because a generate_aggregator carries mutable state
 1692 // which is modified with each call to its operator()
1693
1694 auto w_func = func ;
1695
1696 std::ptrdiff_t i ;
1697
1698 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1699 {
1700 if ( p_cancel && p_cancel->load() )
1701 break ;
1702
1703 w_func ( it [ i ] ,
1704 axis ,
1705 & ( out_it [ i ] ) ,
1706 out_stride ,
1707 length ) ;
1708 }
1709 } ;
1710
1711 vspline::multithread ( worker , njobs ) ;
1712 }
1713} ;
1714
1715template < class in_type , class out_type >
1716struct wield < 1 , in_type , out_type >
1717{
1718 enum { dimension = 1 } ;
1719
1720 typedef vigra::MultiArrayView < dimension , in_type > in_view_type ;
1721 typedef vigra::MultiArrayView < dimension , out_type > out_view_type ;
1722 typedef typename in_view_type::difference_type shape_type ;
1723 typedef typename in_view_type::difference_type_1 index_type ;
1724
1725 template < size_t vsz , typename ... types >
1726 void operator() ( const in_view_type & in_view ,
 1727 out_view_type & out_view ,
 1728 coupled_aggregator < vsz , types ... > func ,
1729 int axis = 0 ,
1730 int njobs = vspline::default_njobs ,
1731 vspline::atomic < bool > * p_cancel = 0
1732 )
1733 {
1734 auto stride1 = in_view.stride ( axis ) ;
1735 auto length = in_view.shape ( axis ) ;
1736 auto stride2 = out_view.stride ( axis ) ;
1737
1738 assert ( in_view.shape() == out_view.shape() ) ;
1739
1740 auto nr_indexes = in_view.shape ( axis ) ;
1741 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1742 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1743
1744 auto worker =
1745 [&]()
1746 {
1747 std::ptrdiff_t lo , hi ;
1748
 1749 while ( vspline::fetch_range_ascending
 1750 ( indexes , batch_size , nr_indexes , lo , hi ) )
1751 {
1752 if ( p_cancel && p_cancel->load() )
1753 break ;
1754
1755 func ( & ( in_view [ lo ] ) ,
1756 stride1 ,
1757 & ( out_view [ lo ] ) ,
1758 stride2 ,
1759 hi - lo ) ;
1760 }
1761 } ;
1762
1763 // with the worker code fixed, we just call multithread:
1764
1765 vspline::multithread ( worker , njobs ) ;
1766
1767 }
1768
1769 template < size_t vsz , typename ... types >
 1770 void operator() ( out_view_type & view ,
 1771 indexed_aggregator < vsz , types ... > func ,
 1772 int axis = 0 ,
1773 int njobs = vspline::default_njobs ,
1774 vspline::atomic < bool > * p_cancel = 0
1775 )
1776 {
1777 std::ptrdiff_t stride = view.stride ( axis ) ;
1778 std::ptrdiff_t nr_indexes = view.shape ( axis ) ;
1779 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1780 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1781
1782 auto worker =
1783 [&]()
1784 {
1785 std::ptrdiff_t lo , hi ;
1786
 1787 while ( vspline::fetch_range_ascending
 1788 ( indexes , batch_size , nr_indexes , lo , hi ) )
1789 {
1790 if ( p_cancel && p_cancel->load() )
1791 break ;
1792
1793 // note: we're 1D; creating a shape_type is only 'technical'
1794 shape_type _lo ( lo ) ;
1795
1796 func ( _lo ,
1797 axis ,
1798 & ( view [ lo ] ) ,
1799 stride ,
1800 hi - lo ) ;
1801 }
1802 } ;
1803
1804 // with the worker code fixed, we just call multithread:
1805
1806 vspline::multithread ( worker , njobs ) ;
1807 }
1808
1809 // TODO test 1D variants of reductors
1810
1811 template < size_t vsz , typename ... types >
 1812 void operator() ( shape_type & shape ,
 1813 indexed_reductor < vsz , types ... > func ,
1814 int axis = 0 ,
1815 int njobs = vspline::default_njobs ,
1816 vspline::atomic < bool > * p_cancel = 0
1817 )
1818 {
1819 typedef indexed_reductor < vsz , types ... > func_t ;
1820 std::ptrdiff_t nr_indexes = shape [ axis ] ;
1821 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1822 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1823
1824 auto worker =
1825 [&]()
1826 {
1827 std::ptrdiff_t lo , hi ;
1828 func_t w_func ( func ) ;
1829
 1830 while ( vspline::fetch_range_ascending
 1831 ( indexes , batch_size , nr_indexes , lo , hi ) )
1832 {
1833 if ( p_cancel && p_cancel->load() )
1834 break ;
1835
1836 // note: we're 1D; creating a shape_type is only 'technical'
1837 shape_type _lo ( lo ) ;
1838
1839 w_func ( _lo , axis , hi - lo ) ;
1840 }
1841 } ;
1842
1843 // with the worker code fixed, we just call multithread:
1844
1845 vspline::multithread ( worker , njobs ) ;
1846 }
1847
1848 template < size_t vsz , typename ... types >
 1849 void operator() ( const in_view_type & in_view ,
 1850 yield_reductor < vsz , types ... > func ,
1851 int axis = 0 ,
1852 int njobs = vspline::default_njobs ,
1853 vspline::atomic < bool > * p_cancel = 0
1854 )
1855 {
1856 typedef yield_reductor < vsz , types ... > func_t ;
1857
1858 auto stride1 = in_view.stride ( axis ) ;
1859 auto length = in_view.shape ( axis ) ;
1860
1861 auto nr_indexes = in_view.shape ( axis ) ;
1862 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1863 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1864
1865 auto worker =
1866 [&]()
1867 {
1868 std::ptrdiff_t lo , hi ;
1869 func_t w_func ( func ) ;
1870
 1871 while ( vspline::fetch_range_ascending
 1872 ( indexes , batch_size , nr_indexes , lo , hi ) )
1873 {
1874 if ( p_cancel && p_cancel->load() )
1875 break ;
1876
1877 w_func ( & ( in_view [ lo ] ) , stride1 , hi - lo ) ;
1878 }
1879 } ;
1880
1881 // with the worker code fixed, we just call multithread:
1882
1883 vspline::multithread ( worker , njobs ) ;
1884 }
1885
1886 template < size_t vsz , typename ... types >
 1887 void generate ( in_view_type & view ,
 1888 generate_aggregator < vsz , types ... > func ,
1889 int axis = 0 ,
1890 int njobs = vspline::default_njobs ,
1891 vspline::atomic < bool > * p_cancel = 0
1892 )
1893 {
1894 std::ptrdiff_t stride = view.stride ( axis ) ;
1895 std::ptrdiff_t nr_indexes = view.shape ( axis ) ;
1896 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1897 // batch_size must be a multiple of vsize to help the generator
1898 std::ptrdiff_t batch_size = 1024 % vsz
1899 ? ( 1 + 1024 / vsz ) * vsz
1900 : 1024 ;
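      // for illustration: with vsz == 8, 1024 % 8 == 0 and batch_size stays
      // at 1024; with vsz == 7, 1024 % 7 == 2, so batch_size becomes
      // ( 1 + 1024 / 7 ) * 7 == 147 * 7 == 1029, the next multiple of vsz
      // at or above 1024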
1901
1902 auto worker =
1903 [&]()
1904 {
1905 // create thread-specific copy of generate_aggregator. This is
1906 // necessary because a generate_aggregator carries mutable state
 1907 // which is modified with each call to its operator()
1908
1909 auto w_func = func ;
1910
1911 std::ptrdiff_t lo , hi ;
1912
 1913 while ( vspline::fetch_range_ascending
 1914 ( indexes , batch_size , nr_indexes , lo , hi ) )
1915 {
1916 if ( p_cancel && p_cancel->load() )
1917 break ;
1918
1919 // see comment in generator code, which currently expects
1920 // to start at coordinate 0
1921
1922 // note: we're 1D; creating a shape_type is only 'technical'
1923 shape_type _lo ( lo ) ;
1924
1925 w_func ( _lo ,
1926 axis ,
1927 & ( view [ lo ] ) ,
1928 stride ,
1929 hi - lo ) ;
1930 }
1931 } ;
1932
1933 // with the worker code fixed, we just call multithread:
1934
1935 vspline::multithread ( worker , njobs ) ;
1936 }
1937
1938} ;
1939
1940/// vs_adapter wraps a vspline::unary_functor to produce a functor which is
1941/// compatible with the wielding code. This is necessary because vspline's
1942/// unary_functors take 'naked' arguments if the data are 1D, while the
1943/// wielding code always passes TinyVectors. The operation of this wrapper
1944/// class should not have a run-time effect; it simply converts references.
1945/// The wrapped functor is only used via operator(), so this is what we provide.
1946/// While it would be nice to simply pass through the unwrapped unary_functor,
1947/// this would force us to deal with the distinction between data in TinyVectors
1948/// and 'naked' fundamentals deeper down in the code; here is a good central
1949/// place where we can route to uniform access via TinyVectors - possibly with
1950/// only one element.
1951/// By inheriting from inner_type, we provide all of inner_type's type system
1952/// which we don't explicitly override.
1953/// Rest assured: the reinterpret_cast is safe. If the data are single-channel,
1954/// the containerized version takes up the same memory as the uncontainerized
1955/// version of the datum. Multi-channel data are containerized anyway.
1956
1957template < class inner_type >
1958struct vs_adapter
1959: public inner_type
1960{
1961 using typename inner_type::in_ele_v ;
1962 using typename inner_type::out_ele_v ;
1963 using typename inner_type::in_ele_type ;
1964 using typename inner_type::out_ele_type ;
1965
1966 typedef typename inner_type::in_nd_ele_type in_type ;
1967 typedef typename inner_type::out_nd_ele_type out_type ;
1968 typedef typename inner_type::in_nd_ele_v in_v ;
1969 typedef typename inner_type::out_nd_ele_v out_v ;
1970
1971 vs_adapter ( const inner_type & _inner )
1972 : inner_type ( _inner )
1973 { } ;
1974
1975 /// operator() overload for unvectorized arguments
1976
1977 void operator() ( const in_type & in ,
1978 out_type & out ) const
1979 {
1980 inner_type::eval
1981 ( reinterpret_cast < const typename inner_type::in_type & > ( in ) ,
1982 reinterpret_cast < typename inner_type::out_type & > ( out ) ) ;
1983 }
1984
1985 /// vectorized evaluation function. This is enabled only if vsize > 1
1986
1987 template < typename = std::enable_if < ( inner_type::vsize > 1 ) > >
1988 void operator() ( const in_v & in ,
1989 out_v & out ) const
1990 {
1991 inner_type::eval
1992 ( reinterpret_cast < const typename inner_type::in_v & > ( in ) ,
1993 reinterpret_cast < typename inner_type::out_v & > ( out ) ) ;
1994 }
1995} ;
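To illustrate how the wrapper is meant to be used, here is a hypothetical sketch (not part of the file): 'double_it' is a made-up functor, and the sketch assumes a vspline::unary_functor with single-channel float input and output, so that in_nd_ele_type and out_nd_ele_type are TinyVectors with one element.

#include <vspline/vspline.h>

struct double_it
: public vspline::unary_functor < float , float >
{
  template < class IN , class OUT >
  void eval ( const IN & in , OUT & out ) const
  {
    out = in + in ; // works for single and simdized values alike
  }
} ;

void adapter_demo()
{
  double_it f ;
  wielding::vs_adapter < double_it > af ( f ) ;

  // the adapter presents the 'containerized' interface the wielding code
  // expects: single-channel data are passed as TinyVectors of one element

  vigra::TinyVector < float , 1 > in ( 3.0f ) , out ( 0.0f ) ;
  af ( in , out ) ; // forwards to double_it::eval on the 'naked' floats
}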
1996
1997/// same procedure for a vspline::sink_type
1998
1999template < class sink_type >
2000struct vs_sink_adapter
2001: public sink_type
2002{
2003 using typename sink_type::in_ele_v ;
2004
2005 typedef typename sink_type::in_nd_ele_type in_type ;
2006 typedef typename sink_type::in_nd_ele_v in_v ;
2007
2008 vs_sink_adapter ( const sink_type & _sink )
2009 : sink_type ( _sink )
2010 { } ;
2011
2012 /// operator() overload for unvectorized arguments
2013
2014 void operator() ( const in_type & in ) const
2015 {
2016 (*((sink_type*)(this)))
2017 ( reinterpret_cast < const typename sink_type::in_type & > ( in ) ) ;
2018 }
2019
2020 /// vectorized evaluation function. This is enabled only if vsize > 1
2021
2022 template < typename = std::enable_if < ( sink_type::vsize > 1 ) > >
2023 void operator() ( const in_v & in ) const
2024 {
2025 (*((sink_type*)(this)))
2026 ( reinterpret_cast < const typename sink_type::in_v & > ( in ) ) ;
2027 }
2028} ;
2029
2030/// index_wield uses vspline's 'multithread' function to invoke
2031/// an index-transformation functor for all indexes into an array.
2032/// We use functors which are vector-capable;
2033/// typically they will be derived from vspline::unary_functor.
2034/// index_wield internally uses a 'wield' object to invoke
2035/// the functor on the chunks of data.
2036
2037// after 'output', I added an additional argument pointing to a
2038// vspline::atomic<bool>. The atomic pointed to is checked by the
2039// worker code, and if it is found to be set, the operation is aborted.
2040// With this mechanism, calling code can keep a handle on the progress
2041// of the multithreaded operation and cancel at least those parts of
2042// it which have not yet started. With the introduction of finer
2043// granularity in the new multithreading code, the cancellation
2044// flag is now also checked when starting on a new 1D subset of the data.
2045// If these 'lines' aren't 'very' long, the effect of cancellation is
2046// reasonably quick.
2047// Per default, a null pointer is passed, which disables the check
2048// for cancellation, so the interface is stable. The same change was
2049// applied to the other transform variants.
2050
2051template < class functor_type , int dimension >
2052void index_wield ( const functor_type functor ,
2053 vigra::MultiArrayView < dimension ,
2054 typename functor_type::out_type
2055 > * output ,
2056 int njobs = vspline::default_njobs ,
2057 vspline::atomic < bool > * p_cancel = 0
2058 )
2059{
2060 typedef typename functor_type::out_type out_type ;
2061
2062 wield < dimension , out_type , out_type > wld ;
2063
2064 indexed_aggregator < functor_type::vsize ,
2065 int , // std::ptrdiff_t ,
2066 functor_type > agg ( functor ) ;
2067
2068 wld ( *output , agg , 0 , njobs , p_cancel ) ;
2069}
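For orientation, a hypothetical usage sketch: index_wield is internal plumbing, and user code normally reaches it through vspline's index-based transform (assumed here in the form vspline::transform ( functor , output ) described in vspline's documentation), which adapts the functor and delegates to index_wield. The functor 'coordinate_sum' is made up for illustration.

#include <vspline/vspline.h>

// hypothetical functor: yields the sum of a 2D coordinate's components

struct coordinate_sum
: public vspline::unary_functor < vigra::TinyVector < float , 2 > , float >
{
  template < class IN , class OUT >
  void eval ( const IN & in , OUT & out ) const
  {
    out = in[0] + in[1] ; // valid for scalar and simdized coordinates alike
  }
} ;

void fill_with_coordinate_sums()
{
  vigra::MultiArray < 2 , float > target ( vigra::Shape2 ( 1920 , 1080 ) ) ;

  // the index-based transform feeds every discrete coordinate of 'target'
  // to the functor and stores the result at that coordinate; internally
  // this ends up in index_wield

  vspline::transform ( coordinate_sum() , target ) ;
}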
2070
2071// index_reduce is used for reductions. The functor passed to this function
2072// will be copied for each worker thread, and the copies are fed coordinates,
2073// so the functor needs operator() overloads to take both single and vectorized
2074// coordinates. The functor copies will typically accumulate their part of the
2075// reduction, and 'offload' their partial result when they are destructed.
2076// With this construction, there is no need for inter-thread coordination
2077// of the reduction process; only the final offloading round needs coordination,
2078// which can be provided by e.g. a lock guard or adding to an atomic.
2079
2080template < class functor_type , int dimension >
2081void index_reduce ( const functor_type & functor ,
2082 vigra::TinyVector < long , dimension > shape ,
2083 int njobs = vspline::default_njobs ,
2084 vspline::atomic < bool > * p_cancel = 0
2085 )
2086{
2087 wield < dimension , long , long > wld ;
2088
2089 indexed_reductor < functor_type::vsize ,
2090 int , // std::ptrdiff_t ,
2091 functor_type > agg ( functor ) ;
2092
2093 wld ( shape , agg , 0 , njobs , p_cancel ) ;
2094}
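The accumulate-and-offload scheme described above can be pictured with a small, hypothetical sketch. It is simplified on purpose: only the copy/destructor mechanics and a scalar operator() are shown; a real reduction functor for index_reduce would additionally provide the usual vspline functor interface (in_type, a vectorized overload, a vsize constant).

#include <mutex>

struct sum_reductor
{
  double sum = 0.0 ;     // per-copy partial result
  double * p_total ;     // shared grand total
  std::mutex * p_mutex ; // guards the grand total

  sum_reductor ( double * _p_total , std::mutex * _p_mutex )
  : p_total ( _p_total ) , p_mutex ( _p_mutex ) { }

  // each worker thread gets its own copy; the copies share the pooling target

  sum_reductor ( const sum_reductor & other )
  : sum ( 0.0 ) , p_total ( other.p_total ) , p_mutex ( other.p_mutex ) { }

  // accumulation needs no synchronization, it only touches thread-local state

  void operator() ( double value )
  {
    sum += value ;
  }

  // the only coordinated step: pool the partial result on destruction

  ~sum_reductor()
  {
    std::lock_guard < std::mutex > lk ( *p_mutex ) ;
    *p_total += sum ;
  }
} ;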
2095
2096// equivalent function to reduce an array
2097
2098template < class functor_type , int dimension >
2099void value_reduce ( const functor_type & functor ,
2100 const vigra::MultiArrayView < dimension ,
2101 typename functor_type::in_type
2102 > * input ,
2103 int njobs = vspline::default_njobs ,
2104 vspline::atomic < bool > * p_cancel = 0
2105 )
2106{
2108
2109 yield_reductor < functor_type::vsize ,
2110 int , // std::ptrdiff_t ,
2111 functor_type > agg ( functor ) ;
2112
2113 wld ( *input , agg , 0 , njobs , p_cancel ) ;
2114}
2115
2116/// coupled_wield processes two arrays. The first array is taken as input,
2117/// the second for output. Both arrays must have the same dimensionality
2118/// and shape. Their data types have to be the same as the 'in_type' and
2119/// the 'out_type' of the functor which was passed in.
2120
2121template < class functor_type , int dimension >
2122void coupled_wield ( const functor_type functor ,
2123 const vigra::MultiArrayView < dimension ,
2124 typename functor_type::in_type
2125 > * input ,
2126 vigra::MultiArrayView < dimension ,
2127 typename functor_type::out_type
2128 > * output ,
2129 int njobs = vspline::default_njobs ,
2130 vspline::atomic < bool > * p_cancel = 0
2131 )
2132{
2133 typedef typename functor_type::in_type in_type ;
2134 typedef typename functor_type::out_type out_type ;
2135
2136 wield < dimension , in_type , out_type > wld ;
2137
2138 coupled_aggregator < functor_type::vsize ,
2139 int , // std::ptrdiff_t ,
2140 functor_type > agg ( functor ) ;
2141
2142 wld ( *input , *output , agg , 0 , njobs , p_cancel ) ;
2143}
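Again a hypothetical usage sketch for orientation: coupled_wield is normally reached through vspline's array-based transform (assumed here in the form vspline::transform ( functor , input , output ) described in vspline's documentation), which wraps the functor with vs_adapter and then calls coupled_wield. 'times_two' is made up for illustration.

#include <vspline/vspline.h>

struct times_two
: public vspline::unary_functor < float , float >
{
  template < class IN , class OUT >
  void eval ( const IN & in , OUT & out ) const
  {
    out = in + in ; // doubles the value, scalar or simdized
  }
} ;

void double_array()
{
  vigra::MultiArray < 2 , float > source ( vigra::Shape2 ( 640 , 480 ) ) ;
  vigra::MultiArray < 2 , float > target ( vigra::Shape2 ( 640 , 480 ) ) ;

  // both arrays have the same shape; every value of 'source' is doubled
  // and written to the corresponding position in 'target'

  vspline::transform ( times_two() , source , target ) ;
}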
2144
2145/// generate_wield uses a generator function to produce data. Inside vspline,
2146/// this is used for grid_eval, which can produce performance gains by
2147/// precalculating frequently reused b-spline evaluation weights. The
2148/// generator holds these weights in readily vectorized form, shared for
2149/// all worker threads.
2150
2151template < class functor_type , unsigned int dimension >
2152void generate_wield ( const functor_type functor ,
2153 vigra::MultiArrayView < dimension ,
2154 typename functor_type::out_type
2155 > & output ,
2156 int njobs = vspline::default_njobs ,
2157 vspline::atomic < bool > * p_cancel = 0
2158 )
2159{
2160 typedef typename functor_type::out_type out_type ;
2161
2162 wield < dimension , out_type , out_type > wld ;
2163
2164 generate_aggregator < functor_type::vsize ,
2165 int , // std::ptrdiff_t ,
2166 functor_type > agg ( functor ) ;
2167
2168 wld.generate ( output , agg , 0 , njobs , p_cancel ) ;
2169}
2170
2171} ; // namespace wielding
2172
2173#define VSPLINE_WIELDING_H
2174#endif