vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
wielding.h
Go to the documentation of this file.
1/************************************************************************/
2/* */
3/* vspline - a set of generic tools for creation and evaluation */
4/* of uniform b-splines */
5/* */
6/* Copyright 2015 - 2023 by Kay F. Jahnke */
7/* */
8/* The git repository for this software is at */
9/* */
10/* https://bitbucket.org/kfj/vspline */
11/* */
12/* Please direct questions, bug reports, and contributions to */
13/* */
14/* kfjahnke+vspline@gmail.com */
15/* */
16/* Permission is hereby granted, free of charge, to any person */
17/* obtaining a copy of this software and associated documentation */
18/* files (the "Software"), to deal in the Software without */
19/* restriction, including without limitation the rights to use, */
20/* copy, modify, merge, publish, distribute, sublicense, and/or */
21/* sell copies of the Software, and to permit persons to whom the */
22/* Software is furnished to do so, subject to the following */
23/* conditions: */
24/* */
25/* The above copyright notice and this permission notice shall be */
26/* included in all copies or substantial portions of the */
27/* Software. */
28/* */
29/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND */
30/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES */
31/* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND */
32/* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT */
33/* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, */
34/* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING */
35/* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR */
36/* OTHER DEALINGS IN THE SOFTWARE. */
37/* */
38/************************************************************************/
39
40/*! \file wielding.h
41
42 \brief Implementation of vspline::transform
43
44 wielding.h provides code to process all 1D subbarrays of nD views.
45 This is similar to using vigra::Navigator, which also iterates over
46 1D subarrays of nD arrays. Here, this access is hand-coded to have
47 complete control over the process, and to work with range-based
48 code rather than the iterator-based approach vigra uses.
49
50 The code is structured so that separable aspects of the process
51 are coded as separate entities:
52
53 the top-level object in the wielding code is class wield.
54 class wield offers several methods taking information about
55 the data which are to be processed, and std::functions defining
56 the specific processing which is intended for the 1D subbarrays.
57 When one of wield's top level methods is called, it iterates
58 over the 1D subarrays, calling the std::function for each subarray
59 in turn - the std::function is used as a callback function.
60
61 Once inside the callback function, what's now seen is a specific
62 1D subarray (or a pair of them, when two arrays are processed
63 in sync), plus any additional information specifically needed
64 by the callback function, like the starting index in the nD
65 array, which is needed for index-based transforms.
66
67 The callback 'functions' passed to the wield object in this body
68 of code are actually functors. They are set up to 'contain' an
69 adapted vspline::unary_functor, which is capable of processing
70 data contained in the arrays.
71
72 If vectorization is not used, the processing is trivial: it 'collapses'
73 to a simple traversal of the 1D subarray(s), using the unvectorized
74 evaluation code in the vspline::unary_functor. But the whole point
75 of 'aggregation' is to feed the *vectorized* evaluation code:
76
77 Here, the data are reworked to be suited for vectorized processing.
78 This is done by copying incoming data into a small buffer, using
79 techniques like SIMD gathering, SIMD loads and possibly Vc-provided
80 deinterleaving, then processing the buffer with vectorized code,
81 and finally writing the result back to target memory using the
82 reverse operations: SIMD scatters or stores, or Vc's interleaving
83 code. The 'magic' is that all of this is transparent to calling
84 code: to the caller it's merely a call into code processing arrays
85 of data, and all the complex buffering and unbuffering is done
86 in a 'black box', encapsulated in class wield and the callback
87 functions.
88
89 The functions handling individual 1D subarrays of data are natural
90 candidates as 'joblets' to be used by several worker threads. With
91 my new multithreading code introduced in March 2019, multithreading
92 can use this granularity efficiently with an arbitrary number of
93 workers. The multithreading is now done directly by class 'wield'
94 in it's top-level methods and follows the 'standard' pattern of
95 setting up the 'payload' as a lambda with reference capture,
96 parcelling out 'joblets' via a vspline::atomic. This ensures
97 granularity at the level of individual 1D subarrays (like, lines
98 of an image) and next to no signalling overhead. As an added
99 benefit, the set of currently active threads will co-operate on
100 a reasonably small area of memory, making cache hits likely.
101
102 If Vc is used, the code provides specialized routines for cases
103 where Vc can speed things up. Without Vc, this code will not be
104 compiled (it's inside #ifdef USE_VC ... #endif preprocessor
105 statements). Without Vc, the code will still be vectorized by
106 a technique I call 'goading': The data are repackaged into small
107 SoAs with vector-friendly array sizes and the expectation is that
108 the compiler will recognize that the resulting inner loops are
109 candidates for autovectorization. Using this technique has the
110 advantage that - if the compiler 'gets it' - code will be generated
111 for every target the *compiler* can produce autovectorized code for,
112 rather than being limited to what Vc covers. And since the Vc types
113 may mystify the compiler, not using them may also allow the compiler
114 to optimize the code better. The 'goading' is done by using a 'mock'
115 SIMD type (vspline::simd_type, see simd_type.h for more information).
116 The actual SIMD or pseudo-SIMD data types used by the wielding code
117 are not fixed, though - what's used is inferred from the functor
118 passed to the wielding code, and the idea is to widen the feeding
119 spectrum easily to other vectorized data types. If there is no
120 specialized code for these types (like the Vc code for Vc data),
121 there are only very few requirements for these types and adapting
122 to new variants should be simple. TODO: concretize interface
123
124 After the aggregation code, wielding.h provides three functions
125 using the mechanism described above to process arrays of data.
126 These functions (index_, coupled_ and generate_wield ) take care
127 of setting up and calling into the wield objects. They are used in
128 turn to implement 'transform' routines, which are the top-level
129 code user code calls. These top-level routines take care of
130 argument checking and presenting the arguments to the wielding
131 code in the form it needs. That code is in transform.h.
132
133 So by now, the use of the term 'wielding' should be obvious.
134 We have a 'tool', namely the vspline::unary_functor, and we have
135 data on which we intend to use the unary_functor. What's left to
136 do? Wielding the tool! And since this operation can be factored
137 out, I've done so and labeled it the 'wielding' code. There is
138 another place in vspline which also provides 'wielding' code:
139 it's the code in filter.h, which is used to 'wield' specific
140 digital filters (like convolution or b-spline prefiltering),
141 applying them to arrays of data. The requirements there are
142 quite different from the requirements here, so these two bodies
143 of wielding code are separate, but the design method is the same:
144 we use two conceptual entities, the tool and it's use.
145
146 The 'magic' in vspline's wielding code is the automatic
147 multithreading and vectorization, which is done transparently and
148 makes the code fast. But seen from the outside, by a caller of
149 one of the 'transform' functions, all the complexity is hidden.
150 And, at the same time, if code is needed for targets which can't
151 use vector code or multithreading, enabling or disabling these
152 capabilites is as simple as passing a preprocessor definition
153 to the compiler.
154*/
155
156#ifndef VSPLINE_WIELDING_H
157
158#include <atomic>
159#include "interleave.h"
160#include "vspline.h"
161
162namespace wielding
163{
164/// indexed_aggregator receives the start coordinate and processing axis
165/// along with the data to process, this is meant for index-transforms.
166/// The coordinate is updated for every call to the 'inner' functor
167/// so that the inner functor has the current coordinate as input.
168/// The code in this template will only be used for vectorized operation,
169/// without vectorization, only the specialization for vsize == 1 below
170/// is used.
171
172template < size_t vsz , typename ic_type , class functor_type ,
173 typename = std::enable_if < ( vsz > 1 ) > >
175{
176 // extract the functor's i/o type system
177
178 typedef typename functor_type::in_type in_type ;
179 typedef typename functor_type::in_ele_type in_ele_type ;
180 typedef typename functor_type::in_v in_v ;
181 typedef typename functor_type::in_ele_v in_ele_v ;
182
183 typedef typename functor_type::out_type out_type ;
184 typedef typename functor_type::out_ele_type out_ele_type ;
185 typedef typename functor_type::out_v out_v ;
186 typedef typename functor_type::out_ele_v out_ele_v ;
187
188 enum { dim_in = functor_type::dim_in } ;
189 enum { dim_out = functor_type::dim_out } ;
190
191 // note how we use the functor's in_type as the coordinate type,
192 // rather than using a TinyVector of some integral type. This way
193 // we have the index already in the type needed by the functor and
194 // arithmetic on the coordinate uses this type as well.
195
196 const functor_type functor ;
197
198 indexed_aggregator ( const functor_type & _functor )
199 : functor ( _functor )
200 { } ;
201
202#ifdef USE_VC
203
204 // helper function to determine if the lane count is a multiple
205 // of the hardware vector size. Used only with Vc. This function
206 // is used to initialize a const static bool where direct
207 // initialization produces a warning if hsize is zero.
208
209 static bool is_n_hsize()
210 {
212 return false ;
214 return ( vsz % div_by == 0 ) ;
215 }
216
217#endif
218
219 // note how 'crd' is of in_type, which depends on the functor,
220 // while the actual call passes an integral type. If in_type
221 // is real, this overload is nevertheless picked and the argument
222 // converted to the real coordinate type.
223
224 void operator() ( in_type crd ,
225 int axis ,
226 out_type * trg ,
227 ic_type stride ,
228 ic_type length )
229 {
230 auto aggregates = length / vsz ;
231 auto leftover = length - aggregates * vsz ;
232
233 // the buffer and the nD coordinate are created as the data types
234 // which the functor expects.
235
236 out_v buffer ;
237 in_v md_crd ;
238
239 // initialize the vectorized coordinate. This coordinate will
240 // remain constant except for the component indexing the
241 // processing axis, which will be counted up as we go along.
242 // This makes the index calculations very efficient: for one
243 // vectorized evaluation, we only need a single vectorized
244 // addition where the vectorized coordinate is increased by
245 // vsize.
246
247 for ( int d = 0 ; d < dim_in ; d++ )
248 {
249 if ( d != axis )
250 md_crd[d] = crd[d] ;
251 else
252 {
253 for ( int e = 0 ; e < vsz ; e++ )
254 md_crd[d][e] = crd[d] + e ;
255 }
256 }
257
258#ifdef USE_VC
259
260 // flag which is true if vsz is a multiple of the hardware
261 // vector size for out_ele_type. This flag will activate the use
262 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
263 // If this is unwanted, the easiest way to deactivate that code
264 // is by setting this flag to false. Then, all access which can't
265 // use straight SIMD store operations will use scatters.
266
267 static const bool out_n_vecsz ( is_n_hsize() ) ;
268
269#else
270
271 static const bool out_n_vecsz = false ;
272
273#endif
274
275 // process a bunch of coordinates: apply the 'inner' functor,
276 // then write result to memory using 'fluff'.
277
278 // flag used to dispatch to either of the unstrided bunch/fluff
279 // overloads:
280
281 typedef typename std::integral_constant < bool , dim_out == 1 > use_store_t ;
282
283 // if the stride is 1, we can use specialized 'fluff' variants,
284 // provided the data are single-channel (or the vector width
285 // is a multiple of the hardware vector width when Vc is used).
286 // All other cases are handled with the variant of 'fluff'
287 // taking a stride.
288
289 if ( stride == 1 && ( dim_out == 1 || out_n_vecsz ) )
290 {
291 for ( ic_type a = 0 ; a < aggregates ; a++ )
292 {
293 functor ( md_crd , buffer ) ;
294 fluff ( buffer , trg , use_store_t() ) ;
295 trg += vsz ;
296 md_crd[axis] += vsz ;
297 }
298 }
299 else
300 {
301 for ( ic_type a = 0 ; a < aggregates ; a++ )
302 {
303 functor ( md_crd , buffer ) ;
304 fluff ( buffer , trg , stride ) ;
305 trg += vsz * stride ;
306 md_crd[axis] += vsz ;
307 }
308 }
309
310 // peeling is done, any leftovers are processed one-by-one
311
312 crd[axis] += aggregates * vsz ;
313
314 for ( ic_type r = 0 ; r < leftover ; r++ )
315 {
316 functor ( crd , *trg ) ;
317 trg += stride ;
318 crd[axis]++ ;
319 }
320 }
321} ; // struct indexed_aggregator
322
323/// specialization for vsz == 1. Here the data are simply
324/// processed one by one in a loop, without vectorization.
325
326template < typename ic_type , class functor_type >
328{
329 const functor_type functor ;
330
331 indexed_aggregator ( const functor_type & _functor )
332 : functor ( _functor )
333 { } ;
334
335 // note how we use the functor's in_type as the coordinate type,
336 // rather than using a TinyVector of some integral type. This way
337 // we have the index already in the type needed by the functor and
338 // arithmetic on the coordinate uses this type as well.
339
340 typedef typename functor_type::in_type sd_coordinate_type ;
341
342 void operator() ( sd_coordinate_type crd ,
343 int axis ,
344 typename functor_type::out_type * trg ,
345 ic_type stride ,
346 ic_type length )
347 {
348 for ( ic_type r = 0 ; r < length ; r++ )
349 {
350 functor ( crd , *trg ) ;
351 trg += stride ;
352 crd[axis]++ ;
353 }
354 }
355} ;
356
357/// indexed_reductor is used for reductions and has no output. The actual
358/// reduction is handled by the functor: each thread has it's own copy of
359/// the functor, which does it's own part of the reduction, and 'offloads'
360/// it's result to some mutex-protected receptacle when it's destructed,
361/// see the 'reduce' functions in transform.h for a more detailed explanation
362/// and an example of such a functor.
363/// idexed_reductor processes discrete coordinates, whereas yield_reductor
364/// (the next class down) processes values. This variant works just like
365/// an indexed_aggregator, only that it produces no output - at least not
366/// for every coordinate fed to the functor, the functor itself does hold
367/// state (the reduction) and is also responsible for offloading per-thread
368/// results when the worker threads terminate.
369/// This class holds a copy of the functor, and each thread has an instance
370/// of this class, ensuring that each worker thread can reduce it's share of
371/// the work load independently.
372
373template < size_t vsz , typename ic_type , class functor_type ,
374 typename = std::enable_if < ( vsz > 1 ) > >
376{
377 // extract the functor's type system
378
379 typedef typename functor_type::in_type in_type ;
380 typedef typename functor_type::in_ele_type in_ele_type ;
381 typedef typename functor_type::in_v in_v ;
382 typedef typename functor_type::in_ele_v in_ele_v ;
383
384 enum { dim_in = functor_type::dim_in } ;
385
386 functor_type functor ;
387
388 // get the coordinate type the functor expects
389
390 typedef typename functor_type::in_type crd_type ;
391
392 // the c'tor copy-constructs member 'functor'
393
394 indexed_reductor ( const functor_type & _functor )
395 : functor ( _functor )
396 { } ;
397
398 void operator() ( in_type crd ,
399 int axis ,
400 ic_type length )
401 {
402 auto aggregates = length / vsz ;
403 auto leftover = length - aggregates * vsz ;
404
405 // the nD coordinate is created as the data type
406 // which the functor expects.
407
408 in_v md_crd ;
409
410 // initialize the vectorized coordinate. This coordinate will
411 // remain constant except for the component indexing the
412 // processing axis, which will be counted up as we go along.
413 // This makes the index calculations very efficient: for one
414 // vectorized evaluation, we only need a single vectorized
415 // addition where the vectorized coordinate is increased by
416 // vsize.
417
418 for ( int d = 0 ; d < dim_in ; d++ )
419 {
420 if ( d != axis )
421 md_crd[d] = crd[d] ;
422 else
423 {
424 for ( int e = 0 ; e < vsz ; e++ )
425 md_crd[d][e] = crd[d] + e ;
426 }
427 }
428
429 // process a bunch of coordinates: apply the 'inner' functor.
430
431 for ( ic_type a = 0 ; a < aggregates ; a++ )
432 {
433 functor ( md_crd ) ;
434 md_crd[axis] += vsz ;
435 }
436
437 // peeling is done, any leftovers are processed one-by-one
438
439 crd[axis] += aggregates * vsz ;
440
441 for ( ic_type r = 0 ; r < leftover ; r++ )
442 {
443 functor ( crd ) ;
444 crd[axis]++ ;
445 }
446 }
447} ; // struct indexed_reductor
448
449/// specialization for vsz == 1. Here the data are simply
450/// processed one by one in a loop, without vectorization.
451
452template < typename ic_type , class functor_type >
454{
455 const functor_type functor ;
456
457 indexed_reductor ( const functor_type & _functor )
458 : functor ( _functor )
459 { } ;
460
461 // note how we use the functor's in_type as the coordinate type,
462 // rather than using a TinyVector of some integral type. This way
463 // we have the index already in the type needed by the functor and
464 // arithmetic on the coordinate uses this type as well.
465
466 typedef typename functor_type::in_type sd_coordinate_type ;
467
468 void operator() ( sd_coordinate_type crd ,
469 int axis ,
470 ic_type length )
471 {
472 for ( ic_type r = 0 ; r < length ; r++ )
473 {
474 functor ( crd ) ;
475 crd[axis]++ ;
476 }
477 }
478} ;
479
480/// an aggregator to reduce arrays. This is like using indexed_reductor
481/// with a functor gathering from an array, but due to the use of 'bunch'
482/// this class is faster for certain array types, because it can use
483/// load/shuffle operations instead of always gathering.
484
485template < size_t vsz , typename ic_type , class functor_type ,
486 typename = std::enable_if < ( vsz > 1 ) > >
488{
489 typedef typename functor_type::in_type in_type ;
490 typedef typename functor_type::in_ele_type in_ele_type ;
491
492 enum { dim_in = functor_type::dim_in } ;
493
494 functor_type functor ;
495
496 // get the data types the functor expects
497
498 typedef typename functor_type::in_v in_v ;
499
500 yield_reductor ( const functor_type & _functor )
501 : functor ( _functor )
502 { } ;
503
504 void operator() ( const in_type * src ,
505 ic_type in_stride ,
506 ic_type length
507 )
508 {
509 auto aggregates = length / vsz ;
510 auto leftover = length - aggregates * vsz ;
511
512 in_v in_buffer ;
513
514 // first we perform a peeling run, processing data vectorized
515 // as long as there are enough data to fill the vectorized
516 // buffers (md_XXX_data_type)
517
518#ifdef USE_VC
519
520 // flags which are true if vsz is a multiple of the hardware
521 // vector size for the elementary types involved. This works like
522 // an opt-in: even if dim_in or dim_out are not 1, if these flags
523 // are true, specialized load/store variants are called. If, then,
524 // use_load_t or use_store_t are std::false_type, we'll end up in
525 // the specialized Vc code using InterleavedMemoryWrapper.
526
527 static const bool in_n_vecsz
530
531#else
532
533 static const bool in_n_vecsz = false ;
534
535#endif
536
537 // used to dispatch to either of the unstrided bunch overloads;
538
539 typedef typename std::integral_constant < bool , dim_in == 1 > use_load_t ;
540
541 // depending on whether the input is strided or not,
542 // and on the vector size and number of channels,
543 // we pick different overloads of 'bunch'. The
544 // overloads without stride may use InterleavedMemoryWrapper,
545 // or, for single-channel data, SIMD load operations,
546 // which is most efficient. We can only pick the variants
547 // using InterleavedMemoryWrapper if vsz is a multiple of
548 // the hardware SIMD register size, hence the rather complex
549 // conditionals. But the complexity is rewarded with optimal
550 // peformance.
551
552 for ( ic_type a = 0 ; a < aggregates ; a++ )
553 {
554 if ( in_stride == 1
555 && ( dim_in == 1 || in_n_vecsz ) )
556 {
557 bunch ( src , in_buffer , use_load_t() ) ;
558 src += vsz ;
559 functor ( in_buffer ) ;
560 }
561 else
562 {
563 bunch ( src , in_buffer , in_stride ) ;
564 src += in_stride * vsz ;
565 functor ( in_buffer ) ;
566 }
567 }
568
569 // peeling is done, we mop up the remainder with scalar code
570
571 for ( ic_type r = 0 ; r < leftover ; r++ )
572 {
573 functor ( *src ) ;
574 src += in_stride ;
575 }
576 }
577} ; // struct yield_reductor
578
579/// specialization for vsz == 1. Here the data are simply
580/// processed one by one in a loop, without vectorization.
581
582template < typename ic_type , class functor_type >
583struct yield_reductor < 1 , ic_type , functor_type >
584{
585 const functor_type functor ;
586
587 yield_reductor ( const functor_type & _functor )
588 : functor ( _functor )
589 { } ;
590
591 void operator() ( const typename functor_type::in_type * src ,
592 ic_type in_stride ,
593 ic_type length
594 )
595 {
596 for ( ic_type r = 0 ; r < length ; r++ )
597 {
598 functor ( *src ) ;
599 src += in_stride ;
600 }
601 }
602} ;
603
604/// generate_aggregator is very similar to indexed_aggregator, but
605/// instead of managing and passing a coordinate to the functor, the
606/// functor now manages the argument side of the operation: it acts
607/// as a generator. To make this possible, the generator has to hold
608/// run-time modifiable state and can't be const like the functors
609/// used in the other aggregators, where the functors are 'pure' in
610/// a functional programming sense.
611/// A 'generator' functor to be used with this body of code is expected
612/// to behave in a certain fashion:
613/// - all of it's state which stays constant and shared by all invocations
614/// has to be present after construction.
615/// - the generator is trivially copyable
616/// - copying the generator produces copies hoding the same shared state
617/// - the generator has a 'reset' routine taking a coordinate. This
618/// routine initializes state pertaining to a single 1D subarray
619/// of data to be processed in a worker thread.
620
621template < size_t _vsize , typename ic_type , class functor_type ,
622 typename = std::enable_if < ( _vsize > 1 ) > >
624{
625 static const size_t vsize = _vsize ;
626
627 // extract the generator's output type system. This is the same
628 // system as is used by vspline::unary_functor, minus, of course,
629 // the input types.
630
631 typedef typename functor_type::out_type out_type ;
632 typedef typename functor_type::out_ele_type out_ele_type ;
633 typedef typename functor_type::out_nd_ele_type out_nd_ele_type ;
634 typedef typename functor_type::out_v out_v ;
635 typedef typename functor_type::out_ele_v out_ele_v ;
636 typedef typename functor_type::out_nd_ele_v out_nd_ele_v ;
637
638 enum { channels = functor_type::channels } ;
639
640 // functor is a generator and carries mutable state, so it's not const
641
642 functor_type functor ;
643
644 // get the coordinate type the functor expects
645
646 typedef typename functor_type::shape_type crd_type ;
647
648 // the c'tor copy-constructs member 'functor', which will again be
649 // copy-constructed in the single threads, providing a separate
650 // instance for each thread.
651
652 generate_aggregator ( const functor_type & _functor )
653 : functor ( _functor )
654 { } ;
655
656 // variant code producing a full line of data in one go
657 // this may go later, there seems to be no gain to be had from this.
658
659#ifdef USE_BUFFERED_GENERATION
660
661 void operator() ( crd_type crd ,
662 int axis ,
663 out_type * trg ,
664 ic_type stride ,
665 ic_type length )
666 {
667 // We need an nD equivalent of 'trg' to use 'fluff'
668
669 out_nd_ele_type * & nd_trg
670 = reinterpret_cast < out_nd_ele_type * & > ( trg ) ;
671
672 auto aggregates = length / vsize ;
673 auto leftover = length - aggregates * vsize ;
674
675 // reset the functor to start from a new initial coordinate.
676
677 functor.reset ( crd , aggregates ) ;
678
679 // the buffer is created as the data type which the functor expects.
680 // since the functor is a generator, there is no input for it.
681
682 vigra::MultiArray < 1 , out_ele_v >
683 vbuffer ( aggregates * channels ) ;
684
685 vigra::MultiArray < 1 , out_type > rest ( leftover ) ;
686
687 functor.eval ( vbuffer , rest ) ;
688
689#ifdef USE_VC
690
691 // flag which is true if vsize is a multiple of the hardware
692 // vector size for out_ele_type. This flag will activate the use
693 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
694 // If this is unwanted, the easiest way to deactivate that code
695 // is by setting this flag to false. Then, all access which can't
696 // use straight SIMD store operations will use scatters.
697
698 static const bool out_n_vecsz
701
702#else
703
704 static const bool out_n_vecsz = false ;
705
706#endif
707
708 // generate a set of data: call the 'inner' functor,
709 // then write result to memory using 'fluff'.
710
711 // flag used to dispatch to either of the unstrided bunch/fluff overloads:
712
713 typedef typename std::integral_constant
714 < bool , channels == 1 > use_store_t ;
715
716 // if the stride is 1, we can use specialized 'fluff' variants,
717 // provided the data are single-channel (or the vector width
718 // is a multiple of the hardware vector width when Vc is used).
719 // All other cases are handled with the variant of 'fluff'
720 // taking a stride.
721
722 // TODO: would be nice to simply have a MultiArrayView of
723 // aggregates * out_v, but that crashes
724 // hence the detour via the nD type and storing (and reloading)
725 // individual vectors
726
727 // We need an nD equivalent of 'vr' to use 'fluff'
728
729 out_v vr ;
730 out_nd_ele_v & ndvr = reinterpret_cast < out_nd_ele_v & > ( vr ) ;
731
732 if ( stride == 1 && ( channels == 1 || out_n_vecsz ) )
733 {
734 for ( ic_type a = 0 ; a < aggregates ; a++ )
735 {
736 for ( size_t ch = 0 ; ch < channels ; ch++ )
737 ndvr[ch] = vbuffer [ a * channels + ch ] ;
738 fluff ( ndvr , nd_trg , use_store_t() ) ;
739 trg += vsize ;
740 }
741 }
742 else
743 {
744 for ( ic_type a = 0 ; a < aggregates ; a++ )
745 {
746 for ( size_t ch = 0 ; ch < channels ; ch++ )
747 ndvr[ch] = vbuffer [ a * channels + ch ] ;
748 fluff ( ndvr , nd_trg , stride ) ;
749 trg += vsize * stride ;
750 }
751 }
752
753 // peeling is done, any leftovers are processed one-by-one
754
755 for ( ic_type r = 0 ; r < leftover ; r++ )
756 {
757 *trg = rest[r] ;
758 trg += stride ;
759 }
760 }
761
762#else // USE_BUFFERED_GENERATION
763
764 void operator() ( crd_type crd ,
765 int axis ,
766 out_type * trg ,
767 ic_type stride ,
768 ic_type length )
769 {
770 // We need an nD equivalent of 'trg' to use 'fluff'
771
772 out_nd_ele_type * & nd_trg
773 = reinterpret_cast < out_nd_ele_type * & > ( trg ) ;
774
775 // set up the vectorizable extent and the remainder
776
777 auto aggregates = length / vsize ;
778 auto leftover = length - aggregates * vsize ;
779
780 // reset the functor to start from a new initial coordinate.
781
782 functor.reset ( crd , aggregates ) ;
783
784 // buffer 'vr' is created as the data type which the functor expects.
785 // since the functor is a generator, there is no input for it.
786 // We also need an nD equivalent to use 'fluff'
787
788 out_v vr ;
789 out_nd_ele_v & ndvr = reinterpret_cast < out_nd_ele_v & > ( vr ) ;
790
791#ifdef USE_VC
792
793 // flag which is true if vsize is a multiple of the hardware
794 // vector size for out_ele_type. This flag will activate the use
795 // of specialized memory access code (Vc::InterleavedMemoryWrapper)
796 // If this is unwanted, the easiest way to deactivate that code
797 // is by setting this flag to false. Then, all access which can't
798 // use straight SIMD store operations will use scatters.
799
800 static const bool out_n_vecsz
803
804#else
805
806 static const bool out_n_vecsz = false ;
807
808#endif
809
810 // generate a set of data: call the 'inner' functor,
811 // then write result to memory using 'fluff'.
812
813 // flag used to dispatch to either of the unstrided bunch/fluff overloads:
814
815 typedef typename std::integral_constant
816 < bool , channels == 1 > use_store_t ;
817
818 // if the stride is 1, we can use specialized 'fluff' variants,
819 // provided the data are single-channel (or the vector width
820 // is a multiple of the hardware vector width when Vc is used).
821 // All other cases are handled with the variant of 'fluff'
822 // taking a stride.
823
824 if ( stride == 1 && ( channels == 1 || out_n_vecsz ) )
825 {
826 for ( ic_type a = 0 ; a < aggregates ; a++ )
827 {
828 functor.eval ( vr ) ;
829 fluff ( ndvr , nd_trg , use_store_t() ) ;
830 trg += vsize ;
831 }
832 }
833 else
834 {
835 for ( ic_type a = 0 ; a < aggregates ; a++ )
836 {
837 functor.eval ( vr ) ;
838 fluff ( ndvr , nd_trg , stride ) ;
839 trg += vsize * stride ;
840 }
841 }
842
843 // peeling is done, any leftovers are processed one-by-one
844
845 for ( ic_type r = 0 ; r < leftover ; r++ )
846 {
847 functor.eval ( *trg ) ;
848 trg += stride ;
849 }
850 }
851
852#endif // USE_BUFFERED_GENERATION
853
854} ; // struct generate_aggregator
855
856/// specialization for vsz == 1. Here the data are simply
857/// processed one by one in a loop, without vectorization.
858
859template < typename ic_type , class functor_type >
860struct generate_aggregator < 1 , ic_type , functor_type >
861{
862 static const size_t vsize = 1 ;
863 functor_type functor ;
864
865 generate_aggregator ( const functor_type & _functor )
866 : functor ( _functor )
867 { } ;
868
869 typedef typename functor_type::shape_type crd_type ;
870 typedef typename functor_type::out_type out_type ;
871
872#ifdef USE_BUFFERED_GENERATION
873
874 void operator() ( crd_type crd ,
875 int axis ,
876 out_type * trg ,
877 ic_type stride ,
878 ic_type length )
879 {
880 functor.reset ( crd , 0 ) ;
881 vigra::MultiArray < 1 , out_type > result ( length ) ;
882 functor.eval ( result ) ;
883
884 for ( ic_type r = 0 ; r < length ; r++ )
885 {
886 *trg = result [ r ] ;
887 trg += stride ;
888 }
889 }
890
891#else
892
894 int axis ,
895 out_type * trg ,
896 ic_type stride ,
897 ic_type length )
898 {
899 functor.reset ( crd , 0 ) ;
900
901 for ( ic_type r = 0 ; r < length ; r++ )
902 {
903 functor.eval ( *trg ) ;
904 trg += stride ;
905 }
906 }
907
908#endif
909} ;
910
911/// an aggregator for separate - possibly different - source and target.
912/// If source and target are in fact different, the inner functor will
913/// read data from source, process them and then write them to target.
914/// If source and target are the same, the operation will be in-place,
915/// but not explicitly so. vspline uses this style of two-argument functor,
916/// and this is the aggregator we use for vspline's array-based transforms.
917/// The code in this template will only be used for vectorized operation,
918/// If vectorization is not used, only the specialization for vsize == 1
919/// below is used.
920
921template < size_t vsz , typename ic_type , class functor_type ,
922 typename = std::enable_if < ( vsz > 1 ) > >
924{
925 typedef typename functor_type::in_type in_type ;
926 typedef typename functor_type::in_ele_type in_ele_type ;
927
928 enum { dim_in = functor_type::dim_in } ;
929 enum { dim_out = functor_type::dim_out } ;
930
931 typedef typename functor_type::out_type out_type ;
932 typedef typename functor_type::out_ele_type out_ele_type ;
933
934 const functor_type functor ;
935
936 // get the data types the functor expects
937
938 typedef typename functor_type::in_v in_v ;
939 typedef typename functor_type::out_v out_v ;
940
941 coupled_aggregator ( const functor_type & _functor )
942 : functor ( _functor )
943 { } ;
944
945 void operator() ( const in_type * src ,
946 ic_type in_stride ,
947 out_type * trg ,
948 ic_type out_stride ,
949 ic_type length
950 )
951 {
952 auto aggregates = length / vsz ;
953 auto leftover = length - aggregates * vsz ;
954 const bool is_apply = ( (void*) src == (void*) trg ) ;
955
956 in_v in_buffer ;
957 out_v out_buffer ;
958
959 // first we perform a peeling run, processing data vectorized
960 // as long as there are enough data to fill the vectorized
961 // buffers (md_XXX_data_type)
962
963#ifdef USE_VC
964
965 // flags which are true if vsz is a multiple of the hardware
966 // vector size for the elementary types involved. This works like
967 // an opt-in: even if dim_in or dim_out are not 1, if these flags
968 // are true, specialized load/store variants are called. If, then,
969 // use_load_t or use_store_t are std::false_type, we'll end up in
970 // the specialized Vc code using InterleavedMemoryWrapper.
971
972 static const bool in_n_vecsz
975
976 static const bool out_n_vecsz
979
980#else
981
982 static const bool in_n_vecsz = false ;
983 static const bool out_n_vecsz = false ;
984
985#endif
986
987 // used to dispatch to either of the unstrided bunch/fluff overloads;
988 // see also the remarks coming with use_store_t in the routine above.
989
990 typedef typename std::integral_constant < bool , dim_in == 1 > use_load_t ;
991
992 typedef typename std::integral_constant < bool , dim_out == 1 > use_store_t ;
993
994 // depending on whether the input/output is strided or not,
995 // and on the vector size and number of channels,
996 // we pick different overloads of 'bunch' and fluff'. The
997 // overloads without stride may use InterleavedMemoryWrapper,
998 // or, for single-channel data, SIMD load/store operations,
999 // which is most efficient. We can only pick the variants
1000 // using InterleavedMemoryWrapper if vsz is a multiple of
1001 // the hardware SIMD register size, hence the rather complex
1002 // conditionals. But the complexity is rewarded with optimal
1003 // peformance.
1004
1005 if ( in_stride == 1
1006 && ( dim_in == 1 || in_n_vecsz ) )
1007 {
1008 if ( out_stride == 1
1009 && ( dim_out == 1 || out_n_vecsz ) )
1010 {
1011 for ( ic_type a = 0 ; a < aggregates ; a++ )
1012 {
1013 bunch ( src , in_buffer , use_load_t() ) ;
1014 src += vsz ;
1015 functor ( in_buffer , out_buffer ) ;
1016 fluff ( out_buffer , trg , use_store_t() ) ;
1017 trg += vsz ;
1018 }
1019 }
1020 else
1021 {
1022 for ( ic_type a = 0 ; a < aggregates ; a++ )
1023 {
1024 bunch ( src , in_buffer , use_load_t() ) ;
1025 src += vsz ;
1026 functor ( in_buffer , out_buffer ) ;
1027 fluff ( out_buffer , trg , out_stride ) ;
1028 trg += out_stride * vsz ;
1029 }
1030 }
1031 }
1032 else
1033 {
1034 if ( out_stride == 1
1035 && ( dim_out == 1 || out_n_vecsz ) )
1036 {
1037 for ( ic_type a = 0 ; a < aggregates ; a++ )
1038 {
1039 bunch ( src , in_buffer , in_stride ) ;
1040 src += in_stride * vsz ;
1041 functor ( in_buffer , out_buffer ) ;
1042 fluff ( out_buffer , trg , use_store_t() ) ;
1043 trg += vsz ;
1044 }
1045 }
1046 else
1047 {
1048 // this is the 'generic' case:
1049 for ( ic_type a = 0 ; a < aggregates ; a++ )
1050 {
1051 bunch ( src , in_buffer , in_stride ) ;
1052 src += in_stride * vsz ;
1053 functor ( in_buffer , out_buffer ) ;
1054 fluff ( out_buffer , trg , out_stride ) ;
1055 trg += out_stride * vsz ;
1056 }
1057 }
1058 }
1059
1060 // peeling is done, we mop up the remainder with scalar code
1061 // KFJ 2022-05-19 initially I coded so that an apply would have
1062 // to take care not to write to out and read in subsequently,
1063 // but I think the code should rather be defensive and avoid
1064 // the problem without user code having to be aware of it.
1065 // hence the test for equality of src and trg.
1066
1067 if ( leftover )
1068 {
1069 if ( is_apply )
1070 {
1071 // this is an 'apply', avoid write-before-read
1072 out_type help ;
1073 for ( ic_type r = 0 ; r < leftover ; r++ )
1074 {
1075 functor ( *src , help ) ;
1076 *trg = help ;
1077 src += in_stride ;
1078 trg += out_stride ;
1079 }
1080 }
1081 else
1082 {
1083 // this is not an 'apply'
1084 for ( ic_type r = 0 ; r < leftover ; r++ )
1085 {
1086 functor ( *src , *trg ) ;
1087 src += in_stride ;
1088 trg += out_stride ;
1089 }
1090 }
1091 }
1092 }
1093} ; // struct coupled_aggregator
1094
1095/// specialization for vsz == 1. Here the data are simply
1096/// processed one by one in a loop, without vectorization.
1097
1098template < typename ic_type , class functor_type >
1099struct coupled_aggregator < 1 , ic_type , functor_type >
1100{
1101 const functor_type functor ;
1102
1103 coupled_aggregator ( const functor_type & _functor )
1104 : functor ( _functor )
1105 { } ;
1106
1107 void operator() ( const typename functor_type::in_type * src ,
1108 ic_type in_stride ,
1109 typename functor_type::out_type * trg ,
1110 ic_type out_stride ,
1111 ic_type length
1112 )
1113 {
1114 if ( (void*)src == (void*)trg )
1115 {
1116 // this is an 'apply', avoid write-before-read
1117 typename functor_type::out_type help ;
1118 for ( ic_type r = 0 ; r < length ; r++ )
1119 {
1120 functor ( *src , help ) ;
1121 *trg = help ;
1122 src += in_stride ;
1123 trg += out_stride ;
1124 }
1125 }
1126 else
1127 {
1128 // this is not an 'apply'
1129 for ( ic_type r = 0 ; r < length ; r++ )
1130 {
1131 functor ( *src , *trg ) ;
1132 src += in_stride ;
1133 trg += out_stride ;
1134 }
1135 }
1136 }
1137} ;
1138
1139/// reimplementation of wield using the new 'neutral' multithread.
1140/// The workers now all receive the same task to process one line
1141/// at a time until all lines are processed. This simplifies the code;
1142/// the wield object directly calls 'multithread' in it's operator().
1143/// And it improves performance, presumably because tail-end idling
1144/// is reduced: all active threads have data to process until the last
1145/// line has been picked up by an aggregator. So tail-end idling is
1146/// in the order of magnitude of a line's worth, in contrast to half
1147/// a worker's share of the data in the previous implementation.
1148/// The current implementation does away with specialized partitioning
1149/// code (at least for the time being); it looks like the performance
1150/// is decent throughout, even without exploiting locality by
1151/// partitioning to tiles.
1152
1153template < int dimension , class in_type , class out_type = in_type >
1154struct wield
1155{
1156 typedef vigra::MultiArrayView < dimension , in_type > in_view_type ;
1157 typedef vigra::MultiArrayView < dimension , out_type > out_view_type ;
1158 typedef typename in_view_type::difference_type_1 index_type ;
1159 typedef typename in_view_type::difference_type shape_type ;
1160
1161 // wielding, using two arrays. It's assumed that both arrays have
1162 // the same shape. The coupled_aggregator takes a pointer and stride
1163 // for each array.
1164 // Note how the first view is taken by const&, indicating that
1165 // it can not be modified. Only the second view, the target of
1166 // the operation, is non-const.
1167
1168// void operator() ( const in_view_type & in_view ,
1169// out_view_type & out_view ,
1170// coupled_aggregator_type func ,
1171// int axis = 0 ,
1172// int njobs = vspline::default_njobs ,
1173// vspline::atomic < bool > * p_cancel = 0
1174// )
1175// {
1176// assert ( in_view.shape() == out_view.shape() ) ;
1177//
1178// auto in_stride = in_view.stride ( axis ) ;
1179// auto slice1 = in_view.bindAt ( axis , 0 ) ;
1180//
1181// auto out_stride = out_view.stride ( axis ) ;
1182// auto slice2 = out_view.bindAt ( axis , 0 ) ;
1183//
1184// auto in_it = slice1.begin() ;
1185// auto out_it = slice2.begin() ;
1186//
1187// auto length = in_view.shape ( axis ) ;
1188// auto nr_indexes = slice1.size() ;
1189// vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1190//
1191// // we create the workers' code as a lambda pulling in the current
1192// // context by reference. The code is quite simple:
1193// // - decrement 'nlines'
1194// // - if nlines is now less than zero, terminate
1195// // - otherwise, call the aggregator function with arguments
1196// // pertaining to the line
1197//
1198// std::function < void() > worker =
1199// [&]()
1200// {
1201// std::ptrdiff_t i ;
1202//
1203// while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1204// {
1205// if ( p_cancel && p_cancel->load() )
1206// break ;
1207//
1208// func ( & ( in_it [ i ] ) ,
1209// in_stride ,
1210// & ( out_it [ i ] ) ,
1211// out_stride ,
1212// length ) ;
1213// }
1214// } ;
1215//
1216// // with the worker code fixed, we just call multithread:
1217//
1218// vspline::multithread ( worker , njobs ) ;
1219// }
1220//
1221// // overload of operator() which will work with an object
1222// // of type indexed_aggregator for the std::function it expects. This
1223// // object presents the nD index into the target array as input to its'
1224// // inner functor, which produces the output from this nD index, rather
1225// // than looking at the array (which is only written to).
1226// // The view coming in is non-const and will receive the result data.
1227// // The aggregator is taken as a std::function of this type:
1228//
1229// void operator() ( out_view_type & out_view ,
1230// indexed_aggregator_type func ,
1231// int axis = 0 ,
1232// int njobs = vspline::default_njobs ,
1233// vspline::atomic < bool > * p_cancel = 0
1234// )
1235// {
1236// auto out_stride = out_view.stride ( axis ) ;
1237// auto slice = out_view.bindAt ( axis , 0 ) ;
1238//
1239// auto out_it = slice.begin() ;
1240// std::ptrdiff_t nr_indexes = slice.size() ;
1241// vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1242// auto length = out_view.shape ( axis ) ;
1243//
1244// // we iterate over the coordinates in slice_shape. This produces
1245// // nD indexes into the view's subarray from 'begin' to 'end', so we
1246// // need to offset the indexes with 'begin' to receive indexes
1247// // into the view itself.
1248//
1249// auto slice_shape = out_view.shape() ; // shape of whole array
1250// slice_shape[axis] = 1 ; // shape of slice with start positions
1251//
1252// typedef vigra::MultiCoordinateIterator
1253// < out_view_type::actual_dimension > mci_type ;
1254//
1255// mci_type it ( slice_shape ) ;
1256//
1257// std::function < void() > worker =
1258// [&]()
1259// {
1260// std::ptrdiff_t i ;
1261//
1262// while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1263// {
1264// if ( p_cancel && p_cancel->load() )
1265// break ;
1266//
1267// func ( it [ i ] ,
1268// axis ,
1269// & ( out_it [ i ] ) ,
1270// out_stride ,
1271// length ) ;
1272// }
1273// } ;
1274//
1275// vspline::multithread ( worker , njobs ) ;
1276// }
1277//
1278
1279#ifndef WIELDING_SEGMENT_SIZE
1280#define WIELDING_SEGMENT_SIZE 0
1281#endif
1282
1283 // variation of the coupled and index wielding code above splitting the
1284 // array(s) into segments along the processing axis. The benefit isn't
1285 // immediately obvious, but there are situations where using this code
1286 // makes a significant difference, namely when the functor relies on
1287 // memory access (which is typically the case for b-spline evaluation)
1288 // and following the evaluation order as implied by the structure of the
1289 // array(s) goes 'against the grain' of the functor's memory access. This
1290 // happens, for example, when the functor uses geometric transformations:
1291 // if the lines of the target are derived from, say, columns of the
1292 // original data, access to the interpolators's memory is widely scattered
1293 // through coefficient space. To an extent, caching helps, but with long
1294 // lines the cache capacity is exceeded. This is precisely where cutting
1295 // the lines into segments helps: the scattered access is shortened, and
1296 // there are fewer cache misses, at the cost of more handling overhead
1297 // caused by the extra level of complexity - which is minimal.
1298 // The problem with this approach is finding a way of fixing the segment
1299 // size optimally for a given memory access pattern. If memory access is
1300 // not encumbered by geometric transformations, there is no problem in the
1301 // first place, and using segments is slightly detrimental. If there are
1302 // transformations, it's not easy to find the optimal segment size, because
1303 // this depends on the functor. With b-splines, for example, the degree of
1304 // the spline matters, because with rising degree, the memory footprint of
1305 // individual evaluations grows. And geometric transformations are a varied
1306 // bunch and one can at best hope to find heuristic values for the segment
1307 // size.
1308 // I tentatively recommend using WIELDING_SEGMENT_SIZE 512; not #defining
1309 // a value results in falling back to unsegmented code, which should be
1310 // optimal if there are no geometric transformations.
1311
1312 template < size_t vsz , typename ... types >
1313 void operator() ( const in_view_type & in_view ,
1314 out_view_type & out_view ,
1316 int axis = 0 ,
1317 int njobs = vspline::default_njobs ,
1318 vspline::atomic < bool > * p_cancel = 0 ,
1319 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1320 )
1321 {
1322 assert ( in_view.shape() == out_view.shape() ) ;
1323
1324 // per default, fall back to not using segments
1325
1326 if ( segment_size <= 0 )
1327 segment_size = in_view.shape ( axis ) ;
1328
1329 // extract the strides for input and output
1330
1331 auto in_stride = in_view.stride ( axis ) ;
1332 auto out_stride = out_view.stride ( axis ) ;
1333
1334 // create slices holding the start positions of the lines
1335
1336 auto slice1 = in_view.bindAt ( axis , 0 ) ;
1337 auto slice2 = out_view.bindAt ( axis , 0 ) ;
1338
1339 // and iterators over these slices
1340
1341 auto in_it = slice1.begin() ;
1342 auto out_it = slice2.begin() ;
1343
1344 // get the line length and the number of lines
1345
1346 auto length = in_view.shape ( axis ) ;
1347 auto nr_lines = slice1.size() ;
1348
1349 // get the number of line segments
1350
1351 std::ptrdiff_t nsegments = length / segment_size ;
1352 if ( length % segment_size )
1353 nsegments++ ;
1354
1355 // calculate the total number of joblet indexes
1356
1357 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1358
1359 // set up the atomic to share out the joblet indexes
1360 // to the worker threads
1361
1362 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1363
1364 // set up the payload code for 'multithread'
1365
1366 auto worker =
1367 [&]()
1368 {
1369 std::ptrdiff_t joblet_index ;
1370
1371 while ( vspline::fetch_ascending ( indexes , nr_indexes , joblet_index ) )
1372 {
1373 // terminate early on cancellation request
1374
1375 if ( p_cancel && p_cancel->load() )
1376 break ;
1377
1378 // glean segment and line index from joblet index
1379
1380 auto s = joblet_index / nr_lines ;
1381 auto j = joblet_index % nr_lines ;
1382
1383 // use these indexes to calculate corresponding addresses
1384
1385 auto in_start_address =
1386 & ( in_it [ j ] ) + in_stride * s * segment_size ;
1387
1388 auto out_start_address =
1389 & ( out_it [ j ] ) + out_stride * s * segment_size ;
1390
1391 // the last segment may be less than segment_size long
1392
1393 auto segment_length =
1394 std::min ( segment_size , length - s * segment_size ) ;
1395
1396 // now call the coupled aggregator to process the current segment
1397
1398 func ( in_start_address ,
1399 in_stride ,
1400 out_start_address ,
1401 out_stride ,
1402 segment_length ) ;
1403 }
1404 } ;
1405
1406 // with the atomic distributing joblet indexes and the payload code
1407 // established, we call multithread to invoke worker threads to invoke
1408 // the payload code
1409
1410 vspline::multithread ( worker , njobs ) ;
1411 }
1412
1413 // variant feeding indexes as input to the functor
1414
1415 template < size_t vsz , typename ... types >
1416 void operator() ( out_view_type & out_view ,
1418 int axis = 0 ,
1419 int njobs = vspline::default_njobs ,
1420 vspline::atomic < bool > * p_cancel = 0 ,
1421 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1422 )
1423 {
1424 if ( segment_size <= 0 )
1425 segment_size = out_view.shape ( axis ) ;
1426
1427 auto out_stride = out_view.stride ( axis ) ;
1428 auto slice = out_view.bindAt ( axis , 0 ) ;
1429
1430 auto out_it = slice.begin() ;
1431 std::ptrdiff_t nr_lines = slice.size() ;
1432 auto length = out_view.shape ( axis ) ;
1433
1434 auto slice_shape = out_view.shape() ; // shape of whole array
1435 slice_shape[axis] = 1 ; // shape of slice with start positions
1436
1437 typedef vigra::MultiCoordinateIterator
1438 < out_view_type::actual_dimension > mci_type ;
1439
1440 mci_type it ( slice_shape ) ;
1441
1442 std::ptrdiff_t nsegments = length / segment_size ;
1443 if ( length % segment_size )
1444 nsegments++ ;
1445
1446 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1447 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1448
1449 auto worker =
1450 [&]()
1451 {
1452 std::ptrdiff_t i ;
1453
1454 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1455 {
1456 if ( p_cancel && p_cancel->load() )
1457 break ;
1458
1459 auto s = i / nr_lines ;
1460 auto j = i % nr_lines ;
1461
1462 auto start_index = it [ j ] ;
1463 start_index [ axis ] += s * segment_size ;
1464
1465 auto start_address = & ( out_view [ start_index ] ) ;
1466
1467 auto segment_length =
1468 std::min ( segment_size , length - s * segment_size ) ;
1469
1470 func ( start_index ,
1471 axis ,
1472 start_address ,
1473 out_stride ,
1474 segment_length ) ;
1475 }
1476 } ;
1477
1478 vspline::multithread ( worker , njobs ) ;
1479 }
1480
1481 // variant feeding indexes as input to a reduction functor. The
1482 // worker threads create per-thread copies of the functor to accrete
1483 // per-thread reductions, the functor's destructor is responsible
1484 // for pooling the per-thread results.
1485
1486 template < size_t vsz , typename ... types >
1487 void operator() ( shape_type in_shape ,
1489 int axis = 0 ,
1490 int njobs = vspline::default_njobs ,
1491 vspline::atomic < bool > * p_cancel = 0 ,
1492 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1493 )
1494 {
1495 typedef indexed_reductor < vsz , types ... > func_t ;
1496
1497 // per default, fall back to not using segments
1498
1499 if ( segment_size <= 0 )
1500 segment_size = in_shape [ axis ] ;
1501
1502 // get the line length and the number of lines
1503
1504 auto length = in_shape [ axis ] ;
1505 auto nr_lines = prod ( in_shape ) / length ;
1506
1507 auto slice_shape = in_shape ;
1508 slice_shape[axis] = 1 ;
1509
1510 typedef vigra::MultiCoordinateIterator < dimension > mci_type ;
1511
1512 mci_type it ( slice_shape ) ;
1513
1514 // get the number of line segments
1515
1516 std::ptrdiff_t nsegments = length / segment_size ;
1517 if ( length % segment_size )
1518 nsegments++ ;
1519
1520 // calculate the total number of joblet indexes
1521
1522 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1523
1524 // set up the atomic to share out the joblet indexes
1525 // to the worker threads
1526
1527 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1528
1529 // set up the payload code for 'multithread'
1530
1531 auto worker =
1532 [&]()
1533 {
1534 std::ptrdiff_t i ;
1535 func_t w_func ( func ) ; // create per-thread copy of 'func'
1536
1537 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1538 {
1539 if ( p_cancel && p_cancel->load() )
1540 break ;
1541
1542 auto s = i / nr_lines ;
1543 auto j = i % nr_lines ;
1544
1545 auto start_index = it [ j ] ;
1546 start_index [ axis ] += s * segment_size ;
1547
1548 auto segment_length =
1549 std::min ( segment_size , length - s * segment_size ) ;
1550
1551 w_func ( it [ i ] , axis , length ) ;
1552 }
1553 // when 'worker' ends, w_func goes out of scope and is destructed.
1554 // It's destructor is responsible for pooling the per-thread
1555 // reduction results.
1556 } ;
1557
1558 vspline::multithread ( worker , njobs ) ;
1559 }
1560
1561 template < size_t vsz , typename ... types >
1562 void operator() ( const in_view_type & in_view ,
1564 int axis = 0 ,
1565 int njobs = vspline::default_njobs ,
1566 vspline::atomic < bool > * p_cancel = 0 ,
1567 std::ptrdiff_t segment_size = WIELDING_SEGMENT_SIZE
1568 )
1569 {
1570 typedef yield_reductor < vsz , types ... > func_t ;
1571
1572 // per default, fall back to not using segments
1573
1574 if ( segment_size <= 0 )
1575 segment_size = in_view.shape ( axis ) ;
1576
1577 // extract the strides for input and output
1578
1579 auto in_stride = in_view.stride ( axis ) ;
1580
1581 // create slices holding the start positions of the lines
1582
1583 auto slice1 = in_view.bindAt ( axis , 0 ) ;
1584
1585 // and iterators over these slices
1586
1587 auto in_it = slice1.begin() ;
1588
1589 // get the line length and the number of lines
1590
1591 auto length = in_view.shape ( axis ) ;
1592 auto nr_lines = slice1.size() ;
1593
1594 // get the number of line segments
1595
1596 std::ptrdiff_t nsegments = length / segment_size ;
1597 if ( length % segment_size )
1598 nsegments++ ;
1599
1600 // calculate the total number of joblet indexes
1601
1602 std::ptrdiff_t nr_indexes = nr_lines * nsegments ;
1603
1604 // set up the atomic to share out the joblet indexes
1605 // to the worker threads
1606
1607 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1608
1609 // set up the payload code for 'multithread'
1610
1611 auto worker =
1612 [&]()
1613 {
1614 std::ptrdiff_t joblet_index ;
1615 func_t w_func ( func ) ; // create per-thread copy of 'func'
1616
1617 while ( vspline::fetch_ascending ( indexes , nr_indexes , joblet_index ) )
1618 {
1619 // terminate early on cancellation request
1620
1621 if ( p_cancel && p_cancel->load() )
1622 break ;
1623
1624 // glean segment and line index from joblet index
1625
1626 auto s = joblet_index / nr_lines ;
1627 auto j = joblet_index % nr_lines ;
1628
1629 // use these indexes to calculate corresponding addresses
1630
1631 auto in_start_address =
1632 & ( in_it [ j ] ) + in_stride * s * segment_size ;
1633
1634 // the last segment may be less than segment_size long
1635
1636 auto segment_length =
1637 std::min ( segment_size , length - s * segment_size ) ;
1638
1639 // now call the coupled aggregator to process the current segment
1640
1641 w_func ( in_start_address ,
1642 in_stride ,
1643 segment_length ) ;
1644 }
1645 // when 'worker' ends, w_func goes out of scope and is destructed.
1646 // It's destructor is responsible for pooling the per-thread
1647 // reduction results.
1648 } ;
1649
1650 // with the atomic distributing joblet indexes and the payload code
1651 // established, we call multithread to invoke worker threads to invoke
1652 // the payload code
1653
1654 vspline::multithread ( worker , njobs ) ;
1655 }
1656
1657 // use a generator to produce data. As the aggregator for this use
1658 // has the same call signature as an indexed aggregator, we use a named
1659 // method here and may do so for the other top-level methods as well.
1660
1661 template < size_t vsz , typename ... types >
1662 void generate ( out_view_type & out_view ,
1664 int axis = 0 ,
1665 int njobs = vspline::default_njobs ,
1666 vspline::atomic < bool > * p_cancel = 0
1667 )
1668 {
1669 auto out_stride = out_view.stride ( axis ) ;
1670 auto slice = out_view.bindAt ( axis , 0 ) ;
1671
1672 auto out_it = slice.begin() ;
1673 std::ptrdiff_t nr_indexes = slice.size() ;
1674 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1675 auto length = out_view.shape ( axis ) ;
1676
1677 auto slice_shape = out_view.shape() ; // shape of whole array
1678 slice_shape[axis] = 1 ; // shape of slice with start positions
1679
1680 // iterator yielding start indexes
1681
1682 typedef vigra::MultiCoordinateIterator
1683 < out_view_type::actual_dimension > mci_type ;
1684
1685 mci_type it ( slice_shape ) ;
1686
1687 auto worker =
1688 [&]()
1689 {
1690 // create thread-specific copy of generate_aggregator. This is
1691 // necessary because a generate_aggregator carries mutable state
1692 // which is modified with each call to it's operator()
1693
1694 auto w_func = func ;
1695
1696 std::ptrdiff_t i ;
1697
1698 while ( vspline::fetch_ascending ( indexes , nr_indexes , i ) )
1699 {
1700 if ( p_cancel && p_cancel->load() )
1701 break ;
1702
1703 w_func ( it [ i ] ,
1704 axis ,
1705 & ( out_it [ i ] ) ,
1706 out_stride ,
1707 length ) ;
1708 }
1709 } ;
1710
1711 vspline::multithread ( worker , njobs ) ;
1712 }
1713} ;
1714
1715template < class in_type , class out_type >
1716struct wield < 1 , in_type , out_type >
1717{
1718 enum { dimension = 1 } ;
1719
1720 typedef vigra::MultiArrayView < dimension , in_type > in_view_type ;
1721 typedef vigra::MultiArrayView < dimension , out_type > out_view_type ;
1722 typedef typename in_view_type::difference_type shape_type ;
1723 typedef typename in_view_type::difference_type_1 index_type ;
1724
1725 template < size_t vsz , typename ... types >
1726 void operator() ( const in_view_type & in_view ,
1727 out_view_type & out_view ,
1729 int axis = 0 ,
1730 int njobs = vspline::default_njobs ,
1731 vspline::atomic < bool > * p_cancel = 0
1732 )
1733 {
1734 auto stride1 = in_view.stride ( axis ) ;
1735 auto length = in_view.shape ( axis ) ;
1736 auto stride2 = out_view.stride ( axis ) ;
1737
1738 assert ( in_view.shape() == out_view.shape() ) ;
1739
1740 auto nr_indexes = in_view.shape ( axis ) ;
1741 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1742 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1743
1744 auto worker =
1745 [&]()
1746 {
1747 std::ptrdiff_t lo , hi ;
1748
1750 ( indexes , batch_size , nr_indexes , lo , hi ) )
1751 {
1752 if ( p_cancel && p_cancel->load() )
1753 break ;
1754
1755 func ( & ( in_view [ lo ] ) ,
1756 stride1 ,
1757 & ( out_view [ lo ] ) ,
1758 stride2 ,
1759 hi - lo ) ;
1760 }
1761 } ;
1762
1763 // with the worker code fixed, we just call multithread:
1764
1765 vspline::multithread ( worker , njobs ) ;
1766
1767 }
1768
1769 template < size_t vsz , typename ... types >
1772 int axis = 0 ,
1773 int njobs = vspline::default_njobs ,
1774 vspline::atomic < bool > * p_cancel = 0
1775 )
1776 {
1777 std::ptrdiff_t stride = view.stride ( axis ) ;
1778 std::ptrdiff_t nr_indexes = view.shape ( axis ) ;
1779 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1780 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1781
1782 auto worker =
1783 [&]()
1784 {
1785 std::ptrdiff_t lo , hi ;
1786
1788 ( indexes , batch_size , nr_indexes , lo , hi ) )
1789 {
1790 if ( p_cancel && p_cancel->load() )
1791 break ;
1792
1793 // note: we're 1D; creating a shape_type is only 'technical'
1794 shape_type _lo ( lo ) ;
1795
1796 func ( _lo ,
1797 axis ,
1798 & ( view [ lo ] ) ,
1799 stride ,
1800 hi - lo ) ;
1801 }
1802 } ;
1803
1804 // with the worker code fixed, we just call multithread:
1805
1806 vspline::multithread ( worker , njobs ) ;
1807 }
1808
1809 // TODO test 1D variants of reductors
1810
1811 template < size_t vsz , typename ... types >
1812 void operator() ( shape_type & shape ,
1814 int axis = 0 ,
1815 int njobs = vspline::default_njobs ,
1816 vspline::atomic < bool > * p_cancel = 0
1817 )
1818 {
1819 typedef indexed_reductor < vsz , types ... > func_t ;
1820 std::ptrdiff_t nr_indexes = shape [ axis ] ;
1821 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1822 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1823
1824 auto worker =
1825 [&]()
1826 {
1827 std::ptrdiff_t lo , hi ;
1828 func_t w_func ( func ) ;
1829
1831 ( indexes , batch_size , nr_indexes , lo , hi ) )
1832 {
1833 if ( p_cancel && p_cancel->load() )
1834 break ;
1835
1836 // note: we're 1D; creating a shape_type is only 'technical'
1837 shape_type _lo ( lo ) ;
1838
1839 w_func ( _lo , axis , hi - lo ) ;
1840 }
1841 } ;
1842
1843 // with the worker code fixed, we just call multithread:
1844
1845 vspline::multithread ( worker , njobs ) ;
1846 }
1847
1848 template < size_t vsz , typename ... types >
1849 void operator() ( const in_view_type & in_view ,
1851 int axis = 0 ,
1852 int njobs = vspline::default_njobs ,
1853 vspline::atomic < bool > * p_cancel = 0
1854 )
1855 {
1856 typedef yield_reductor < vsz , types ... > func_t ;
1857
1858 auto stride1 = in_view.stride ( axis ) ;
1859 auto length = in_view.shape ( axis ) ;
1860
1861 auto nr_indexes = in_view.shape ( axis ) ;
1862 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1863 std::ptrdiff_t batch_size = 1024 ; // TODO optimize
1864
1865 auto worker =
1866 [&]()
1867 {
1868 std::ptrdiff_t lo , hi ;
1869 func_t w_func ( func ) ;
1870
1872 ( indexes , batch_size , nr_indexes , lo , hi ) )
1873 {
1874 if ( p_cancel && p_cancel->load() )
1875 break ;
1876
1877 w_func ( & ( in_view [ lo ] ) , stride1 , hi - lo ) ;
1878 }
1879 } ;
1880
1881 // with the worker code fixed, we just call multithread:
1882
1883 vspline::multithread ( worker , njobs ) ;
1884 }
1885
1886 template < size_t vsz , typename ... types >
1887 void generate ( in_view_type & view ,
1889 int axis = 0 ,
1890 int njobs = vspline::default_njobs ,
1891 vspline::atomic < bool > * p_cancel = 0
1892 )
1893 {
1894 std::ptrdiff_t stride = view.stride ( axis ) ;
1895 std::ptrdiff_t nr_indexes = view.shape ( axis ) ;
1896 vspline::atomic < std::ptrdiff_t > indexes ( nr_indexes ) ;
1897 // batch_size must be a multiple of vsize to help the generator
1898 std::ptrdiff_t batch_size = 1024 % vsz
1899 ? ( 1 + 1024 / vsz ) * vsz
1900 : 1024 ;
1901
1902 auto worker =
1903 [&]()
1904 {
1905 // create thread-specific copy of generate_aggregator. This is
1906 // necessary because a generate_aggregator carries mutable state
1907 // which is modified with each call to it's operator()
1908
1909 auto w_func = func ;
1910
1911 std::ptrdiff_t lo , hi ;
1912
1914 ( indexes , batch_size , nr_indexes , lo , hi ) )
1915 {
1916 if ( p_cancel && p_cancel->load() )
1917 break ;
1918
1919 // see comment in generator code, which currently expects
1920 // to start at coordinate 0
1921
1922 // note: we're 1D; creating a shape_type is only 'technical'
1923 shape_type _lo ( lo ) ;
1924
1925 w_func ( _lo ,
1926 axis ,
1927 & ( view [ lo ] ) ,
1928 stride ,
1929 hi - lo ) ;
1930 }
1931 } ;
1932
1933 // with the worker code fixed, we just call multithread:
1934
1935 vspline::multithread ( worker , njobs ) ;
1936 }
1937
1938} ;
1939
1940/// vs_adapter wraps a vspline::unary_functor to produce a functor which is
1941/// compatible with the wielding code. This is necessary, because vspline's
1942/// unary_functors take 'naked' arguments if the data are 1D, while the
1943/// wielding code always passes TinyVectors. The operation of this wrapper
1944/// class should not have a run-time effect; it's simply converting references.
1945/// the wrapped functor is only used via operator(), so this is what we provide.
1946/// While it would be nice to simply pass through the unwrapped unary_functor,
1947/// this would force us to deal with the distinction between data in TinyVectors
1948/// and 'naked' fundamentals deeper down in the code, and here is a good central
1949/// place where we can route to uniform access via TinyVectors - possibly with
1950/// only one element.
1951/// By inheriting from inner_type, we provide all of inner_type's type system
1952/// which we don't explicitly override.
1953/// Rest assured: the reinterpret_cast is safe. If the data are single-channel,
1954/// the containerized version takes up the same meory as the uncontainerized
1955/// version of the datum. multi-channel data are containerized anyway.
1956
1957template < class inner_type >
1959: public inner_type
1960{
1961 using typename inner_type::in_ele_v ;
1962 using typename inner_type::out_ele_v ;
1963 using typename inner_type::in_ele_type ;
1964 using typename inner_type::out_ele_type ;
1965
1966 typedef typename inner_type::in_nd_ele_type in_type ;
1967 typedef typename inner_type::out_nd_ele_type out_type ;
1968 typedef typename inner_type::in_nd_ele_v in_v ;
1969 typedef typename inner_type::out_nd_ele_v out_v ;
1970
1971 vs_adapter ( const inner_type & _inner )
1972 : inner_type ( _inner )
1973 { } ;
1974
1975 /// operator() overload for unvectorized arguments
1976
1977 void operator() ( const in_type & in ,
1978 out_type & out ) const
1979 {
1980 inner_type::eval
1981 ( reinterpret_cast < const typename inner_type::in_type & > ( in ) ,
1982 reinterpret_cast < typename inner_type::out_type & > ( out ) ) ;
1983 }
1984
1985 /// vectorized evaluation function. This is enabled only if vsize > 1
1986
1987 template < typename = std::enable_if < ( inner_type::vsize > 1 ) > >
1988 void operator() ( const in_v & in ,
1989 out_v & out ) const
1990 {
1991 inner_type::eval
1992 ( reinterpret_cast < const typename inner_type::in_v & > ( in ) ,
1993 reinterpret_cast < typename inner_type::out_v & > ( out ) ) ;
1994 }
1995} ;
1996
1997/// same procedure for a vspline::sink_type
1998
1999template < class sink_type >
2001: public sink_type
2002{
2003 using typename sink_type::in_ele_v ;
2004
2005 typedef typename sink_type::in_nd_ele_type in_type ;
2006 typedef typename sink_type::in_nd_ele_v in_v ;
2007
2008 vs_sink_adapter ( const sink_type & _sink )
2009 : sink_type ( _sink )
2010 { } ;
2011
2012 /// operator() overload for unvectorized arguments
2013
2014 void operator() ( const in_type & in ) const
2015 {
2016 (*((sink_type*)(this)))
2017 ( reinterpret_cast < const typename sink_type::in_type & > ( in ) ) ;
2018 }
2019
2020 /// vectorized evaluation function. This is enabled only if vsize > 1
2021
2022 template < typename = std::enable_if < ( sink_type::vsize > 1 ) > >
2023 void operator() ( const in_v & in ) const
2024 {
2025 (*((sink_type*)(this)))
2026 ( reinterpret_cast < const typename sink_type::in_v & > ( in ) ) ;
2027 }
2028} ;
2029
2030/// index_wield uses vspline's 'multithread' function to invoke
2031/// an index-transformation functor for all indexes into an array,
2032/// We use functors which are vector-capable,
2033/// typically they will be derived from vspline::unary_functor.
2034/// index_wield internally uses a 'wield' object to invoke
2035/// the functor on the chunks of data.
2036
2037// after 'output', I added an additional argument pointing to a
2038// vspline::atomic<bool>. The atomic pointed to is checked on
2039// function entry, and if found false, the operation is aborted.
2040// With this mechanism, calling code can keep a handle on the progress
2041// of the multithreaded operation and cancel at least those parts of
2042// it which have not yet started. with the introduction of finer
2043// granularity with the new multithreading code, the cancellation
2044// flag is now also checked on starting on a new 1D subset of the data.
2045// If these 'lines' aren't 'very' long, the effect of cancellation is
2046// reasonably quick.
2047// Per default, a null pointer is passed, which disables the check
2048// for cancellation, so the interface is stable. The same change was
2049// applied to the other transform variants.
2050
2051template < class functor_type , int dimension >
2052void index_wield ( const functor_type functor ,
2053 vigra::MultiArrayView < dimension ,
2054 typename functor_type::out_type
2055 > * output ,
2056 int njobs = vspline::default_njobs ,
2057 vspline::atomic < bool > * p_cancel = 0
2058 )
2059{
2060 typedef typename functor_type::out_type out_type ;
2061
2063
2065 int , // std::ptrdiff_t ,
2066 functor_type > agg ( functor ) ;
2067
2068 wld ( *output , agg , 0 , njobs , p_cancel ) ;
2069}
2070
2071// index_reduce is used for reductions. The functor passed to this function
2072// Will be copied for each working thread, and the copies are fed coordinates,
2073// so the functor needs operator() overloads to take both single and vectorized
2074// coordinates. The functor copies will typically accumulate their part of the
2075// reduction, and 'offload' their partial result when they are destructed.
2076// With this construction, there is no need for inter-thread coordination
2077// of the reduction process, only the final offloading round needs coordination,
2078// which can be provided by e.g. a lock guard or adding to an atomic.
2079
2080template < class functor_type , int dimension >
2081void index_reduce ( const functor_type & functor ,
2082 vigra::TinyVector < long , dimension > shape ,
2083 int njobs = vspline::default_njobs ,
2084 vspline::atomic < bool > * p_cancel = 0
2085 )
2086{
2088
2090 int , // std::ptrdiff_t ,
2091 functor_type > agg ( functor ) ;
2092
2093 wld ( shape , agg , 0 , njobs , p_cancel ) ;
2094}
2095
2096// equivalent function to reduce an array
2097
2098template < class functor_type , int dimension >
2099void value_reduce ( const functor_type & functor ,
2100 const vigra::MultiArrayView < dimension ,
2101 typename functor_type::in_type
2102 > * input ,
2103 int njobs = vspline::default_njobs ,
2104 vspline::atomic < bool > * p_cancel = 0
2105 )
2106{
2108
2110 int , // std::ptrdiff_t ,
2111 functor_type > agg ( functor ) ;
2112
2113 wld ( *input , agg , 0 , njobs , p_cancel ) ;
2114}
2115
2116/// coupled_wield processes two arrays. The first array is taken as input,
2117/// the second for output. Both arrays must have the same dimensionality
2118/// and shape. Their data types have to be the same as the 'in_type' and
2119/// the 'out_type' of the functor which was passed in.
2120
2121template < class functor_type , int dimension >
2122void coupled_wield ( const functor_type functor ,
2123 const vigra::MultiArrayView < dimension ,
2124 typename functor_type::in_type
2125 > * input ,
2126 vigra::MultiArrayView < dimension ,
2127 typename functor_type::out_type
2128 > * output ,
2129 int njobs = vspline::default_njobs ,
2130 vspline::atomic < bool > * p_cancel = 0
2131 )
2132{
2133 typedef typename functor_type::in_type in_type ;
2134 typedef typename functor_type::out_type out_type ;
2135
2137
2139 int , // std::ptrdiff_t ,
2140 functor_type > agg ( functor ) ;
2141
2142 wld ( *input , *output , agg , 0 , njobs , p_cancel ) ;
2143}
2144
2145/// generate_wield uses a generator function to produce data. Inside vspline,
2146/// this is used for grid_eval, which can produce performance gains by
2147/// precalculating frequently reused b-spline evaluation weights. The
2148/// generator holds these weights in readily vectorized form, shared for
2149/// all worker threads.
2150
2151template < class functor_type , unsigned int dimension >
2152void generate_wield ( const functor_type functor ,
2153 vigra::MultiArrayView < dimension ,
2154 typename functor_type::out_type
2155 > & output ,
2156 int njobs = vspline::default_njobs ,
2157 vspline::atomic < bool > * p_cancel = 0
2158 )
2159{
2160 typedef typename functor_type::out_type out_type ;
2161
2163
2165 int , // std::ptrdiff_t ,
2166 functor_type > agg ( functor ) ;
2167
2168 wld.generate ( output , agg , 0 , njobs , p_cancel ) ;
2169}
2170
2171} ; // namespace wielding
2172
2173#define VSPLINE_WIELDING_H
2174#endif
@ vsize
Definition: eval.cc:96
Implementation of 'bunch' and 'fluff'.
const int default_njobs
Definition: multithread.h:220
bool fetch_ascending(vspline::atomic< index_t > &source, const index_t &total, index_t &index)
fetch_ascending counts up from zero to total-1, which is more efficient if the indexes are used to ad...
Definition: multithread.h:284
int multithread(std::function< void() > payload, std::size_t nr_workers=default_njobs)
multithread uses a thread pool of worker threads to perform a multithreaded operation....
Definition: multithread.h:412
bool fetch_range_ascending(vspline::atomic< index_t > &source, const index_t &count, const index_t &total, index_t &low, index_t &high)
fetch_range_ascending also uses an atomic initialized to the total number of indexes to be distribute...
Definition: multithread.h:336
std::atomic< T > atomic
Definition: multithread.h:224
void index_reduce(const functor_type &functor, vigra::TinyVector< long, dimension > shape, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
Definition: wielding.h:2081
void bunch(const vigra::TinyVector< ele_type, chn > *const &src, vigra::TinyVector< vspline::vc_simd_type< ele_type, vsz >, chn > &trg, const ic_type &stride)
bunch picks up data from interleaved, strided memory and stores them in a data type representing a pa...
Definition: interleave.h:220
void coupled_wield(const functor_type functor, const vigra::MultiArrayView< dimension, typename functor_type::in_type > *input, vigra::MultiArrayView< dimension, typename functor_type::out_type > *output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
coupled_wield processes two arrays. The first array is taken as input, the second for output....
Definition: wielding.h:2122
void generate_wield(const functor_type functor, vigra::MultiArrayView< dimension, typename functor_type::out_type > &output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
generate_wield uses a generator function to produce data. Inside vspline, this is used for grid_eval,...
Definition: wielding.h:2152
int ic_type
Definition: interleave.h:98
void index_wield(const functor_type functor, vigra::MultiArrayView< dimension, typename functor_type::out_type > *output, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
index_wield uses vspline's 'multithread' function to invoke an index-transformation functor for all i...
Definition: wielding.h:2052
void value_reduce(const functor_type &functor, const vigra::MultiArrayView< dimension, typename functor_type::in_type > *input, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
Definition: wielding.h:2099
void fluff(const vigra::TinyVector< vspline::vc_simd_type< ele_type, vsz >, chn > &src, vigra::TinyVector< ele_type, chn > *const &trg, const ic_type &stride)
reverse operation: a package of vectorized data is written to interleaved, strided memory....
Definition: interleave.h:267
with the definition of 'simd_traits', we can proceed to implement 'vector_traits': struct vector_trai...
Definition: vector.h:344
an aggregator for separate - possibly different - source and target. If source and target are in fact...
Definition: wielding.h:924
functor_type::out_ele_type out_ele_type
Definition: wielding.h:932
functor_type::out_type out_type
Definition: wielding.h:931
functor_type::out_v out_v
Definition: wielding.h:939
functor_type::in_v in_v
Definition: wielding.h:938
coupled_aggregator(const functor_type &_functor)
Definition: wielding.h:941
functor_type::in_type in_type
Definition: wielding.h:925
const functor_type functor
Definition: wielding.h:934
functor_type::in_ele_type in_ele_type
Definition: wielding.h:926
void operator()(const in_type *src, ic_type in_stride, out_type *trg, ic_type out_stride, ic_type length)
Definition: wielding.h:945
generate_aggregator is very similar to indexed_aggregator, but instead of managing and passing a coor...
Definition: wielding.h:624
functor_type::shape_type crd_type
Definition: wielding.h:646
functor_type::out_nd_ele_v out_nd_ele_v
Definition: wielding.h:636
functor_type::out_nd_ele_type out_nd_ele_type
Definition: wielding.h:633
functor_type::out_v out_v
Definition: wielding.h:634
static const size_t vsize
Definition: wielding.h:625
functor_type::out_ele_type out_ele_type
Definition: wielding.h:632
generate_aggregator(const functor_type &_functor)
Definition: wielding.h:652
functor_type::out_ele_v out_ele_v
Definition: wielding.h:635
functor_type::out_type out_type
Definition: wielding.h:631
void operator()(crd_type crd, int axis, out_type *trg, ic_type stride, ic_type length)
Definition: wielding.h:764
specialization for vsz == 1. Here the data are simply processed one by one in a loop,...
Definition: wielding.h:328
indexed_aggregator(const functor_type &_functor)
Definition: wielding.h:331
indexed_aggregator receives the start coordinate and processing axis along with the data to process,...
Definition: wielding.h:175
functor_type::in_v in_v
Definition: wielding.h:180
static bool is_n_hsize()
Definition: wielding.h:209
indexed_aggregator(const functor_type &_functor)
Definition: wielding.h:198
functor_type::out_ele_type out_ele_type
Definition: wielding.h:184
functor_type::out_type out_type
Definition: wielding.h:183
functor_type::in_ele_v in_ele_v
Definition: wielding.h:181
functor_type::out_ele_v out_ele_v
Definition: wielding.h:186
const functor_type functor
Definition: wielding.h:196
functor_type::out_v out_v
Definition: wielding.h:185
functor_type::in_ele_type in_ele_type
Definition: wielding.h:179
functor_type::in_type in_type
Definition: wielding.h:178
specialization for vsz == 1. Here the data are simply processed one by one in a loop,...
Definition: wielding.h:454
indexed_reductor(const functor_type &_functor)
Definition: wielding.h:457
indexed_reductor is used for reductions and has no output. The actual reduction is handled by the fun...
Definition: wielding.h:376
functor_type::in_v in_v
Definition: wielding.h:381
functor_type::in_ele_type in_ele_type
Definition: wielding.h:380
functor_type::in_type in_type
Definition: wielding.h:379
functor_type::in_ele_v in_ele_v
Definition: wielding.h:382
indexed_reductor(const functor_type &_functor)
Definition: wielding.h:394
functor_type::in_type crd_type
Definition: wielding.h:390
vs_adapter wraps a vspline::unary_functor to produce a functor which is compatible with the wielding ...
Definition: wielding.h:1960
inner_type::in_nd_ele_type in_type
Definition: wielding.h:1966
inner_type::in_nd_ele_v in_v
Definition: wielding.h:1968
void operator()(const in_type &in, out_type &out) const
operator() overload for unvectorized arguments
Definition: wielding.h:1977
inner_type::out_nd_ele_v out_v
Definition: wielding.h:1969
vs_adapter(const inner_type &_inner)
Definition: wielding.h:1971
inner_type::out_nd_ele_type out_type
Definition: wielding.h:1967
same procedure for a vspline::sink_type
Definition: wielding.h:2002
sink_type::in_nd_ele_type in_type
Definition: wielding.h:2005
vs_sink_adapter(const sink_type &_sink)
Definition: wielding.h:2008
void operator()(const in_type &in) const
operator() overload for unvectorized arguments
Definition: wielding.h:2014
sink_type::in_nd_ele_v in_v
Definition: wielding.h:2006
vigra::MultiArrayView< dimension, in_type > in_view_type
Definition: wielding.h:1720
in_view_type::difference_type shape_type
Definition: wielding.h:1722
in_view_type::difference_type_1 index_type
Definition: wielding.h:1723
void generate(in_view_type &view, generate_aggregator< vsz, types ... > func, int axis=0, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
Definition: wielding.h:1887
vigra::MultiArrayView< dimension, out_type > out_view_type
Definition: wielding.h:1721
reimplementation of wield using the new 'neutral' multithread. The workers now all receive the same t...
Definition: wielding.h:1155
void generate(out_view_type &out_view, generate_aggregator< vsz, types ... > func, int axis=0, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0)
Definition: wielding.h:1662
in_view_type::difference_type shape_type
Definition: wielding.h:1159
void operator()(const in_view_type &in_view, out_view_type &out_view, coupled_aggregator< vsz, types ... > func, int axis=0, int njobs=vspline::default_njobs, vspline::atomic< bool > *p_cancel=0, std::ptrdiff_t segment_size=WIELDING_SEGMENT_SIZE)
Definition: wielding.h:1313
vigra::MultiArrayView< dimension, in_type > in_view_type
Definition: wielding.h:1156
vigra::MultiArrayView< dimension, out_type > out_view_type
Definition: wielding.h:1157
in_view_type::difference_type_1 index_type
Definition: wielding.h:1158
yield_reductor(const functor_type &_functor)
Definition: wielding.h:587
an aggregator to reduce arrays. This is like using indexed_reductor with a functor gathering from an ...
Definition: wielding.h:488
void operator()(const in_type *src, ic_type in_stride, ic_type length)
Definition: wielding.h:504
functor_type::in_ele_type in_ele_type
Definition: wielding.h:490
functor_type::in_type in_type
Definition: wielding.h:489
yield_reductor(const functor_type &_functor)
Definition: wielding.h:500
functor_type::in_v in_v
Definition: wielding.h:498
functor_type functor
Definition: wielding.h:494
includes all headers from vspline (most of them indirectly)