vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
wielding.h File Reference

Implementation of vspline::transform. More...

#include <atomic>
#include "interleave.h"
#include "vspline.h"


Classes

struct  wielding::indexed_aggregator< vsz, ic_type, functor_type, typename >
 indexed_aggregator receives the start coordinate and processing axis along with the data to process; it is meant for index-transforms. The coordinate is updated for every call to the 'inner' functor, so the inner functor always receives the current coordinate as input. The code in this template is only used for vectorized operation; without vectorization, only the specialization for vsize == 1 below is used. More...
 
struct  wielding::indexed_aggregator< 1, ic_type, functor_type >
 specialization for vsz == 1. Here the data are simply processed one by one in a loop, without vectorization. More...
 
struct  wielding::indexed_reductor< vsz, ic_type, functor_type, typename >
 indexed_reductor is used for reductions and has no output. The actual reduction is handled by the functor: each thread has its own copy of the functor, which does its own part of the reduction and 'offloads' its result to some mutex-protected receptacle when it is destructed; see the 'reduce' functions in transform.h for a more detailed explanation and an example of such a functor. indexed_reductor processes discrete coordinates, whereas yield_reductor (the next class down) processes values. This variant works just like an indexed_aggregator, except that it produces no output - at least not for every coordinate fed to the functor; the functor itself holds state (the reduction) and is also responsible for offloading per-thread results when the worker threads terminate. This class holds a copy of the functor, and each thread has an instance of this class, ensuring that each worker thread can reduce its share of the work load independently. More...
 
struct  wielding::indexed_reductor< 1, ic_type, functor_type >
 specialization for vsz == 1. Here the data are simply processed one by one in a loop, without vectorization. More...
 
struct  wielding::yield_reductor< vsz, ic_type, functor_type, typename >
 an aggregator to reduce arrays. This is like using indexed_reductor with a functor gathering from an array, but due to the use of 'bunch' this class is faster for certain array types, because it can use load/shuffle operations instead of always gathering. More...
 
struct  wielding::yield_reductor< 1, ic_type, functor_type >
 specialization for vsz == 1. Here the data are simply processed one by one in a loop, without vectorization. More...
 
struct  wielding::generate_aggregator< _vsize, ic_type, functor_type, typename >
 generate_aggregator is very similar to indexed_aggregator, but instead of managing and passing a coordinate to the functor, the functor now manages the argument side of the operation: it acts as a generator. To make this possible, the generator has to hold run-time modifiable state and can't be const like the functors used in the other aggregators, where the functors are 'pure' in a functional programming sense. A 'generator' functor to be used with this body of code is expected to behave in a certain fashion: More...
 
struct  wielding::generate_aggregator< 1, ic_type, functor_type >
 specialization for vsz == 1. Here the data are simply processed one by one in a loop, without vectorization. More...
 
struct  wielding::coupled_aggregator< vsz, ic_type, functor_type, typename >
 an aggregator for separate - possibly different - source and target. If source and target are in fact different, the inner functor will read data from source, process them and then write them to target. If source and target are the same, the operation will be in-place, but not explicitly so. vspline uses this style of two-argument functor, and this is the aggregator we use for vspline's array-based transforms. The code in this template will only be used for vectorized operation. If vectorization is not used, only the specialization for vsize == 1 below is used. More...
 
struct  wielding::coupled_aggregator< 1, ic_type, functor_type >
 specialization for vsz == 1. Here the data are simply processed one by one in a loop, without vectorization. More...
 
struct  wielding::wield< dimension, in_type, out_type >
 reimplementation of wield using the new 'neutral' multithread. The workers now all receive the same task to process one line at a time until all lines are processed. This simplifies the code; the wield object directly calls 'multithread' in its operator(). And it improves performance, presumably because tail-end idling is reduced: all active threads have data to process until the last line has been picked up by an aggregator. So tail-end idling is in the order of magnitude of a line's worth, in contrast to half a worker's share of the data in the previous implementation. The current implementation does away with specialized partitioning code (at least for the time being); it looks like the performance is decent throughout, even without exploiting locality by partitioning to tiles. More...
 
struct  wielding::wield< 1, in_type, out_type >
 
struct  wielding::vs_adapter< inner_type >
 vs_adapter wraps a vspline::unary_functor to produce a functor which is compatible with the wielding code. This is necessary because vspline's unary_functors take 'naked' arguments if the data are 1D, while the wielding code always passes TinyVectors. The operation of this wrapper class should not have a run-time effect; it simply converts references. The wrapped functor is only used via operator(), so this is what we provide. While it would be nice to simply pass through the unwrapped unary_functor, this would force us to deal with the distinction between data in TinyVectors and 'naked' fundamentals deeper down in the code, and here is a good central place where we can route to uniform access via TinyVectors - possibly with only one element. By inheriting from inner_type, we provide all of inner_type's type system which we don't explicitly override. Rest assured: the reinterpret_cast is safe. If the data are single-channel, the containerized version takes up the same memory as the uncontainerized version of the datum. Multi-channel data are containerized anyway. More...
 
struct  wielding::vs_sink_adapter< sink_type >
 same procedure for a vspline::sink_type. More...
 

Namespaces

namespace  wielding
 

Macros

#define VSPLINE_WIELDING_H
 

Functions

template<class functor_type , int dimension>
void wielding::index_wield (const functor_type functor, vigra::MultiArrayView< dimension, typename functor_type::out_type > *output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
 index_wield uses vspline's 'multithread' function to invoke an index-transformation functor for all indexes into an array. We use functors which are vector-capable; typically they will be derived from vspline::unary_functor. index_wield internally uses a 'wield' object to invoke the functor on the chunks of data. More...
 
template<class functor_type , int dimension>
void wielding::index_reduce (const functor_type &functor, vigra::TinyVector< long, dimension > shape, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
 
template<class functor_type , int dimension>
void wielding::value_reduce (const functor_type &functor, const vigra::MultiArrayView< dimension, typename functor_type::in_type > *input, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
 
template<class functor_type , int dimension>
void wielding::coupled_wield (const functor_type functor, const vigra::MultiArrayView< dimension, typename functor_type::in_type > *input, vigra::MultiArrayView< dimension, typename functor_type::out_type > *output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
 coupled_wield processes two arrays. The first array is used as input, the second as output. Both arrays must have the same dimensionality and shape. Their data types have to be the same as the 'in_type' and the 'out_type' of the functor which was passed in. More...
 
template<class functor_type , unsigned int dimension>
void wielding::generate_wield (const functor_type functor, vigra::MultiArrayView< dimension, typename functor_type::out_type > &output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
 generate_wield uses a generator function to produce data. Inside vspline, this is used for grid_eval, which can produce performance gains by precalculating frequently reused b-spline evaluation weights. The generator holds these weights in readily vectorized form, shared for all worker threads. More...
 

Detailed Description

Implementation of vspline::transform.

wielding.h provides code to process all 1D subarrays of nD views. This is similar to using vigra::Navigator, which also iterates over 1D subarrays of nD arrays. Here, this access is hand-coded to have complete control over the process, and to work with range-based code rather than the iterator-based approach vigra uses.

The code is structured so that separable aspects of the process are coded as separate entities:

The top-level object in the wielding code is class wield. class wield offers several methods taking information about the data which are to be processed, and std::functions defining the specific processing intended for the 1D subarrays. When one of wield's top-level methods is called, it iterates over the 1D subarrays, calling the std::function for each subarray in turn - the std::function is used as a callback function.
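
The callback pattern described above can be illustrated with a small stand-in (this is not vspline's actual class wield, just a hypothetical sketch using a flat 2D array of floats):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: iterate over the 1D subarrays (rows) of a flat 2D array and hand
// each row to a callback, mirroring how class wield invokes its std::function
// once per subarray.
inline void wield_rows ( std::vector < float > & data ,
                         std::size_t width ,
                         const std::function
                           < void ( float * , std::size_t ) > & f )
{
  std::size_t height = data.size() / width ;
  for ( std::size_t y = 0 ; y < height ; y++ )
    f ( data.data() + y * width , width ) ; // callback sees one 1D subarray
}
```

A caller might pass a lambda doubling each element of a row; wield_rows then applies it to every row of the array in turn.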

Once inside the callback function, what's now seen is a specific 1D subarray (or a pair of them, when two arrays are processed in sync), plus any additional information specifically needed by the callback function, like the starting index in the nD array, which is needed for index-based transforms.

The callback 'functions' passed to the wield object in this body of code are actually functors. They are set up to 'contain' an adapted vspline::unary_functor, which is capable of processing data contained in the arrays.

If vectorization is not used, the processing is trivial: it 'collapses' to a simple traversal of the 1D subarray(s), using the unvectorized evaluation code in the vspline::unary_functor. But the whole point of 'aggregation' is to feed the vectorized evaluation code:

Here, the data are reworked to be suited for vectorized processing. This is done by copying incoming data into a small buffer, using techniques like SIMD gathering, SIMD loads and possibly Vc-provided deinterleaving, then processing the buffer with vectorized code, and finally writing the result back to target memory using the reverse operations: SIMD scatters or stores, or Vc's interleaving code. The 'magic' is that all of this is transparent to calling code: to the caller it's merely a call into code processing arrays of data, and all the complex buffering and unbuffering is done in a 'black box', encapsulated in class wield and the callback functions.
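
The gather/process/scatter pattern can be sketched as follows. This is a minimal illustration with hypothetical names (vspline's real aggregators are templates over functor and coordinate types); 'vsize' stands in for the SIMD vector width:

```cpp
#include <cstddef>

constexpr std::size_t vsize = 8 ; // stand-in for the SIMD vector width

// Sketch of the buffering pattern: gather strided input into a small
// contiguous buffer, process the buffer with a loop the compiler can
// vectorize, then scatter the results back to (possibly strided) memory.
inline void buffered_scale ( const float * src , std::ptrdiff_t src_stride ,
                             float * dst , std::ptrdiff_t dst_stride ,
                             float factor )
{
  float buffer [ vsize ] ;
  for ( std::size_t i = 0 ; i < vsize ; i++ )   // 'gather' into the buffer
    buffer [ i ] = src [ i * src_stride ] ;
  for ( std::size_t i = 0 ; i < vsize ; i++ )   // vectorizable processing
    buffer [ i ] *= factor ;
  for ( std::size_t i = 0 ; i < vsize ; i++ )   // 'scatter' to the target
    dst [ i * dst_stride ] = buffer [ i ] ;
}
```

To calling code, only the function call is visible; the buffering is hidden inside, which is the point made above.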

The functions handling individual 1D subarrays of data are natural candidates as 'joblets' to be used by several worker threads. With my new multithreading code introduced in March 2019, multithreading can use this granularity efficiently with an arbitrary number of workers. The multithreading is now done directly by class 'wield' in its top-level methods and follows the 'standard' pattern of setting up the 'payload' as a lambda with reference capture, parcelling out 'joblets' via a vspline::atomic. This ensures granularity at the level of individual 1D subarrays (like lines of an image) and next to no signalling overhead. As an added benefit, the set of currently active threads will co-operate on a reasonably small area of memory, making cache hits likely.
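
The 'standard' pattern can be sketched with std::thread and std::atomic (vspline's real code uses its own 'multithread' function and vspline::atomic; the names below are hypothetical):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: worker threads fetch line indices from a shared atomic counter
// until all lines are taken. Granularity is one 1D subarray per fetch, and
// signalling overhead is a single fetch_add per line.
inline void process_lines ( std::vector < int > & hits , int nthreads )
{
  std::atomic < std::size_t > line_index { 0 } ;
  auto payload = [ & ] ()
  {
    std::size_t line ;
    while ( ( line = line_index.fetch_add ( 1 ) ) < hits.size() )
      hits [ line ] += 1 ;          // stand-in for processing one line
  } ;
  std::vector < std::thread > pool ;
  for ( int i = 0 ; i < nthreads ; i++ )
    pool.emplace_back ( payload ) ;
  for ( auto & t : pool )
    t.join() ;
}
```

Since each line index is handed out exactly once, every line is processed by exactly one worker, with no per-line locking.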

If Vc is used, the code provides specialized routines for cases where Vc can speed things up; without Vc, these routines are not compiled (they are inside #ifdef USE_VC ... #endif preprocessor statements). Even without Vc, the code will still be vectorized, by a technique I call 'goading': the data are repackaged into small SoAs with vector-friendly array sizes, in the expectation that the compiler will recognize the resulting inner loops as candidates for autovectorization. This technique has the advantage that - if the compiler 'gets it' - code will be generated for every target the compiler can produce autovectorized code for, rather than being limited to what Vc covers. And since the Vc types may mystify the compiler, not using them may also allow the compiler to optimize the code better. The 'goading' is done by using a 'mock' SIMD type (vspline::simd_type, see simd_type.h for more information). The actual SIMD or pseudo-SIMD data types used by the wielding code are not fixed, though - what's used is inferred from the functor passed to the wielding code, and the idea is to make it easy to widen the feeding spectrum to other vectorized data types. If there is no specialized code for these types (like the Vc code for Vc data), there are only a few requirements on them, and adapting to new variants should be simple. TODO: concretize interface
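
The 'goading' idea can be illustrated with a minimal stand-in (modelled loosely on vspline::simd_type, but not its actual interface): a small fixed-size array type whose elementwise operators are plain loops, written in the hope that the compiler autovectorizes them.

```cpp
#include <cstddef>

// Mock SIMD type: a small SoA-friendly container with elementwise
// arithmetic expressed as simple fixed-length loops - prime candidates
// for compiler autovectorization.
template < typename T , std::size_t N >
struct mock_simd
{
  T v [ N ] ;
  mock_simd & operator+= ( const mock_simd & rhs )
  {
    for ( std::size_t i = 0 ; i < N ; i++ ) // autovectorization candidate
      v [ i ] += rhs.v [ i ] ;
    return * this ;
  }
} ;
```

With a fixed, vector-friendly N (say 8 or 16 floats), an optimizing compiler can typically turn the loop into SIMD instructions for whatever target it compiles for.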

After the aggregation code, wielding.h provides three functions using the mechanism described above to process arrays of data. These functions (index_, coupled_ and generate_wield) take care of setting up and calling into the wield objects. They are used in turn to implement the 'transform' routines, which are the top-level routines user code calls. These top-level routines take care of argument checking and of presenting the arguments to the wielding code in the form it needs. That code is in transform.h.
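
The layering can be sketched like this, with hypothetical names and std::vector standing in for vigra::MultiArrayView (vspline's real 'transform' additionally multithreads and vectorizes the traversal):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a wielding routine: apply the functor elementwise,
// reading from 'in' and writing to 'out'.
template < typename F >
void coupled_wield_sketch ( F f , const std::vector < float > & in ,
                            std::vector < float > & out )
{
  for ( std::size_t i = 0 ; i < in.size() ; i++ )
    out [ i ] = f ( in [ i ] ) ;
}

// Stand-in for a top-level 'transform' routine: check the arguments,
// then delegate the traversal to the wielding code.
template < typename F >
void transform_sketch ( F f , const std::vector < float > & in ,
                        std::vector < float > & out )
{
  assert ( in.size() == out.size() ) ; // argument checking at top level
  coupled_wield_sketch ( f , in , out ) ;
}
```

The division of labour is the point: the top-level routine validates and adapts the arguments, while the wielding routine only ever sees data in the form it expects.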

So by now, the use of the term 'wielding' should be obvious. We have a 'tool', namely the vspline::unary_functor, and we have data on which we intend to use it. What's left to do? Wielding the tool! And since this operation can be factored out, I've done so and labeled it the 'wielding' code. There is another place in vspline which also provides 'wielding' code: the code in filter.h, which is used to 'wield' specific digital filters (like convolution or b-spline prefiltering), applying them to arrays of data. The requirements there are quite different from the requirements here, so these two bodies of wielding code are separate, but the design method is the same: we use two conceptual entities, the tool and its use.

The 'magic' in vspline's wielding code is the automatic multithreading and vectorization, which is done transparently and makes the code fast. But seen from the outside, by a caller of one of the 'transform' functions, all the complexity is hidden. And, at the same time, if code is needed for targets which can't use vector code or multithreading, enabling or disabling these capabilities is as simple as passing a preprocessor definition to the compiler.

Definition in file wielding.h.

Macro Definition Documentation

◆ VSPLINE_WIELDING_H

#define VSPLINE_WIELDING_H

Definition at line 2173 of file wielding.h.