vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
filter.h
1/************************************************************************/
2/* */
3/* vspline - a set of generic tools for creation and evaluation */
4/* of uniform b-splines */
5/* */
6/* Copyright 2015 - 2023 by Kay F. Jahnke */
7/* */
8/* The git repository for this software is at */
9/* */
10/* https://bitbucket.org/kfj/vspline */
11/* */
12/* Please direct questions, bug reports, and contributions to */
13/* */
14/* kfjahnke+vspline@gmail.com */
15/* */
16/* Permission is hereby granted, free of charge, to any person */
17/* obtaining a copy of this software and associated documentation */
18/* files (the "Software"), to deal in the Software without */
19/* restriction, including without limitation the rights to use, */
20/* copy, modify, merge, publish, distribute, sublicense, and/or */
21/* sell copies of the Software, and to permit persons to whom the */
22/* Software is furnished to do so, subject to the following */
23/* conditions: */
24/* */
25/* The above copyright notice and this permission notice shall be */
26/* included in all copies or substantial portions of the */
27/* Software. */
28/* */
29/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND */
30/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES */
31/* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND */
32/* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT */
33/* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, */
34/* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING */
35/* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR */
36/* OTHER DEALINGS IN THE SOFTWARE. */
37/* */
38/************************************************************************/
39
40/*! \file filter.h
41
42 \brief generic implementation of separable filtering for nD arrays
43
44 This body of code provides the application of separable filters,
45 not the filters themselves. Most essential for vspline is the use for
46 b-spline prefiltering, see prefilter.h for the implementation of
47 the specific filter.
48
49 Another type of filter used by and provided by vspline is general
50 separable convolution, which, inside vspline, is used for reconstruction
51 of the original data from b-spline coefficients, see convolve.h.
52
53 The code in this file is what I call 'wielding' code. Its function is
54 to present the data in such a fashion that the code needed for the actual
55 filter is a reasonably trivial 1D operation. 'Presenting' the data is
56 a complex operation in vspline: the data are distributed to a set of
57 worker threads, and they are restructured so that they can be processed
58 by the processor's vector units. All of this is optional and transparent
59 to the calling code. The 'wielding' code in this file is structurally
60 similar to the code in transform.h, but here we use specific buffering
61 operations which would make no sense there: for separable filtering,
62 we have to preserve the adjacency of the data along the processing axis
63 and present it to the filter, which is unnecessary for transformations,
64 where each value can be processed in isolation.
65
66 Most of the functionality in this file is in namespace detail, signalling
67 that it is not meant to be called from outside. Class buffer_handling and
68 the data types it uses to interface with nD memory are the exception,
69 since they are meant to be inherited/used to implement specific filters.
70
71 At the bottom of the file there's a free function template called
72 'filter'. This is what other code will normally call.
73*/
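As a rough illustration of what 'separable' means here (a sketch only, not
vspline code): an nD filter which factors into 1D filters is applied by
running the 1D operation along each axis in turn. A minimal 2D example with
a 3-tap box filter, leaving the margins untreated for brevity:

// illustration only: a separable box blur done as two 1D passes
#include <vector>
#include <cstddef>

void separable_box_blur ( std::vector < float > & data ,
                          std::size_t w , std::size_t h )
{
  std::vector < float > tmp ( data ) ;

  // pass 1: 1D box filter along axis 0 (within each row)
  for ( std::size_t y = 0 ; y < h ; y++ )
    for ( std::size_t x = 1 ; x + 1 < w ; x++ )
      tmp [ y * w + x ] = (   data [ y * w + x - 1 ]
                            + data [ y * w + x ]
                            + data [ y * w + x + 1 ] ) / 3.0f ;

  // pass 2: the same 1D filter along axis 1 (within each column)
  for ( std::size_t y = 1 ; y + 1 < h ; y++ )
    for ( std::size_t x = 0 ; x < w ; x++ )
      data [ y * w + x ] = (   tmp [ ( y - 1 ) * w + x ]
                             + tmp [ y * w + x ]
                             + tmp [ ( y + 1 ) * w + x ] ) / 3.0f ;
}

vspline applies its 1D filters axis by axis in just this way, but adds the
buffering, multithreading and vectorization which the 'wielding' code below
is about.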
74
75#include <vector>
76#include "common.h"
77
78#ifndef VSPLINE_FILTER_H
79#define VSPLINE_FILTER_H
80
81namespace vspline
82{
83/// class 'bundle' holds all information needed to access a set of
84/// vsize 1D subarrays of an nD array. This is the data structure
85/// we use to tell the buffering and unbuffering code which data
86/// we want it to put into the buffer or distribute back out. The
87/// buffer itself holds the data in compact form, ready for vector
88/// code to access them at maximum speed.
89
90template < class dtype , // data type
91 size_t vsize > // vector width
92struct bundle
93{
94 dtype * data ; // data base address
95 const std::ptrdiff_t * idx ; // pointer to gather/scatter indexes
96 std::ptrdiff_t stride ; // stride in units of dtype
97 unsigned long z ; // number of repetitions
98
99 bundle ( dtype * _data ,
100 const std::ptrdiff_t * _idx ,
101 std::ptrdiff_t _stride ,
102 unsigned long _z )
103 : data ( _data ) ,
104 idx ( _idx ) ,
105 stride ( _stride ) ,
106 z (_z )
107 { } ;
108} ;
109
110/// move bundle data to compact memory
111
112template < class stype , // source data type
113 class ttype , // target data type
114 size_t vsize > // vector width
115void move ( const bundle < stype , vsize > & src ,
116 ttype * trg ,
117 std::ptrdiff_t trg_stride
118 )
119{
120 auto z = src.z ;
121 auto ps = src.data ;
122
123 while ( z-- ) // repeat z times:
124 {
125 for ( size_t i = 0 ; i < vsize ; i++ )
126 {
127 // load from source, store to target, using ith index
128 trg [ i ] = ttype ( ps [ src.idx [ i ] ] ) ;
129 }
130 ps += src.stride ; // apply stride to source
131 trg += trg_stride ; // and target
132 }
133}
134
135// nearly the same, but takes ni as runtime parameter, effectively
136// limiting the transfer to the first ni offsets in the bundle.
137// This will only be used rarely, so performance is less of an issue.
138
139template < class stype , // source data type
140 class ttype , // target data type
141 size_t vsize > // vector width
142void move ( const bundle < stype , vsize > & src ,
143 ttype * trg ,
144 std::ptrdiff_t trg_stride ,
145 int ni )
146{
147 auto z = src.z ;
148 auto ps = src.data ;
149
150 while ( z-- ) // repeat z times:
151 {
152 for ( int i = 0 ; i < ni ; i++ )
153 {
154 // load from source, store to target, using ith index
155 trg [ i ] = ttype ( ps [ src.idx [ i ] ] ) ;
156 }
157 trg += trg_stride ;
158 ps += src.stride ; // apply stride to source
159 }
160}
161
162/// move data from compact memory to bundle
163
164template < class stype , // source data type
165 class ttype , // target data type
166 size_t vsize > // vector width
167void move ( const stype * src ,
168 std::ptrdiff_t src_stride ,
169 const bundle < ttype , vsize > & trg )
170{
171 auto z = trg.z ;
172 auto pt = trg.data ;
173
174 while ( z-- ) // repeat z times:
175 {
176 for ( size_t i = 0 ; i < vsize ; i++ )
177 {
178 // load from source, store to target, using ith index
179 pt [ trg.idx [ i ] ] = ttype ( src [ i ] ) ;
180 }
181 src += src_stride ;
182 pt += trg.stride ; // apply stride to target
183 }
184}
185
186// nearly the same, but takes ni as runtime parameter, effectively
187// limiting the transfer to the first ni offsets in the bundle.
188
189template < class stype , // source data type
190 class ttype , // target data type
191 size_t vsize > // vector width
192void move ( const stype * src ,
193 std::ptrdiff_t src_stride ,
194 const bundle < ttype , vsize > & trg ,
195 int ni )
196{
197 auto z = trg.z ;
198 auto pt = trg.data ;
199
200 while ( z-- ) // repeat z times:
201 {
202 for ( int i = 0 ; i < ni ; i++ )
203 {
204 // load from source, store to target, using ith index
205 pt [ trg.idx [ i ] ] = ttype ( src [ i ] ) ;
206 }
207 src += src_stride ;
208 pt += trg.stride ; // apply stride to target
209 }
210}
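A minimal usage sketch of how 'bundle' and the 'move' overloads above
cooperate (illustration only, not part of the header; it assumes the
definitions above are in scope):

void bundle_move_demo()
{
  // two 1D subarrays of length 5 in a row-major 2x5 float array;
  // along the processing axis the stride is 1, and the subarrays
  // start 5 elements apart
  float data [ 2 ] [ 5 ] = { { 0 , 1 , 2 , 3 , 4 } ,
                             { 5 , 6 , 7 , 8 , 9 } } ;

  // gather/scatter indexes: offsets of the subarrays' first elements
  const std::ptrdiff_t idx [ 2 ] = { 0 , 5 } ;

  // the bundle: base address, indexes, stride along the axis, length
  vspline::bundle < float , 2 > b ( & ( data [ 0 ] [ 0 ] ) , idx , 1 , 5 ) ;

  // buffer receiving the data in compact, interleaved form
  float buffer [ 5 ] [ 2 ] ;

  // gather: afterwards buffer[k] holds { data[0][k] , data[1][k] }
  vspline::move ( b , & ( buffer [ 0 ] [ 0 ] ) , 2 ) ;

  // ... a 1D filter would now process the buffer ...

  // scatter: write the (here, unmodified) buffer back out to 'data'
  vspline::move ( & ( buffer [ 0 ] [ 0 ] ) , 2 , b ) ;
}

The interleaved buffer layout is what lets a filter process vsize subarrays
in lockstep, one position along the axis at a time.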
211
212/// buffer_handling provides services needed for interfacing
213/// with a buffer of simdized/goading data. The init() routine
214/// receives two views: one to a buffer accepting incoming data,
215/// and one to a buffer providing results. Currently, all filters
216/// used in vspline operate in-place, but the two-argument form
217/// leaves room to manoeuvre.
218/// get() and put() receive 'bundle' arguments which are used
219/// to transfer incoming data to the view defined in in_window,
220/// and to transfer result data from the view defined in
221/// out_window back to target memory.
222
223template < template < typename , size_t > class _vtype ,
224 typename _dtype ,
225 size_t _vsize >
226struct buffer_handling
227{
228protected:
229
230 enum { vsize = _vsize } ;
231 typedef _dtype dtype ;
232 typedef _vtype < dtype , vsize > vtype ;
233
234 vigra::MultiArrayView < 1 , vtype > in_window ;
235 vigra::MultiArrayView < 1 , vtype > out_window ;
236
237 void init ( vigra::MultiArrayView < 1 , vtype > & _in_window ,
238 vigra::MultiArrayView < 1 , vtype > & _out_window )
239 {
240 in_window = _in_window ;
241 out_window = _out_window ;
242 }
243
244 // get and put receive 'bundle' objects, currently the code
245 // uses a template for the arguments but we might fix it to
246 // bundles only
247
248 // note the use of 'offset' which is needed for situations
250 // when input/output consists of several arrays
250
251 // why not use ni all the time? I suspect
252 // fixed-width moves are faster. TODO test
253
254 // KFJ 2018-02-20 added parameter for the buffer's stride from
255 // one datum to the next, expressed in units of 'dtype'. This
256 // is needed if the buffer contains SimdArrays which hold
257 // padding and therefore are larger than vsize dtype. This
258 // only occurs for certain vsize values, not for the default as
259 // per common.h, so it went unnoticed for some time.
260
261 static const std::ptrdiff_t bf_stride = sizeof(vtype) / sizeof(dtype) ;
262
263public:
264
265 /// fetch data from 'source' into the buffer 'in_window'
266
267 template < class tractor >
268 void get ( const tractor & src ,
269 std::ptrdiff_t offset = 0 ,
270 int ni = vsize ) const
271 {
272 if ( ni == vsize ) // fixed-width move
273
274 move ( src ,
275 (dtype*) ( in_window.data() + offset ) ,
276 bf_stride
277 ) ;
278
279 else // reduced width move
280
281 move ( src ,
282 (dtype*) ( in_window.data() + offset ) ,
283 bf_stride ,
284 ni
285 ) ;
286 }
287
288 /// deposit result data from 'out_window' into target memory
289
290 template < class tractor >
291 void put ( const tractor & trg ,
292 std::ptrdiff_t offset = 0 ,
293 int ni = vsize ) const
294 {
295 if ( ni == vsize )
296 move ( (const dtype *) ( out_window.data() + offset ) ,
297 bf_stride ,
298 trg
299 ) ;
300 else
301 move ( (const dtype *) ( out_window.data() + offset ) ,
302 bf_stride ,
303 trg ,
304 ni
305 ) ;
306 }
307
308} ;
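Why bf_stride is computed as sizeof(vtype) / sizeof(dtype) rather than simply
being vsize can be shown with a hypothetical, hand-rolled vector type (a
sketch only):

// a hypothetical 3-lane vector type which pads to 16 bytes for alignment
struct alignas ( 16 ) padded_v3 { float lane [ 3 ] ; } ;

// stepping from one such datum to the next in a buffer covers four floats,
// although only three of them are payload: bf_stride would be 4, vsize 3
static_assert ( sizeof ( padded_v3 ) / sizeof ( float ) == 4 ,
                "stride in dtype units exceeds the vector width" ) ;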
309
310namespace detail
311{
312/// 'present' feeds 'raw' data to a filter and returns the filtered
313/// data. In order to perform this task with maximum efficiency,
314/// the actual code is quite involved.
315///
316/// we have two variants of the routine, one for 'stacks' of several
317/// arrays (vpresent) and one for single arrays (present).
318///
319/// The routine used in vspline is 'present', 'vpresent' is for special
320/// cases. present splits the data into 'bundles' of 1D subarrays
321/// collinear to the processing axis. These bundles are fed to the
322/// 'handler', which copies them into a buffer, performs the actual
323/// filtering, and then writes them back to target memory.
324///
325/// Using 'vpresent', incoming data are taken as std::vectors of
326/// source_view_type. The incoming arrays have to have the same extent
327/// in every dimension *except* the processing axis. While the actual
328/// process of extracting parts of the data for processing is slightly
329/// more involved, it is analogous to first concatenating all source
330/// arrays into a single array, stacking along the processing axis.
331/// The combined array is then split up into 1D subarrays collinear
332/// to the processing axis, and sets of these subarrays are passed to
333/// the handler by calling its 'get' method. The set of 1D subarrays
334/// is coded as a 'bundle', which describes such a set by a combination
335/// of base address and a set of gather/scatter indexes.
336///
337/// Once the data have been accepted by the handler, the handler's
338/// operator() is called, which results in the handler filtering the
339/// data (or whatever else it might do). Next, the processed data are
340/// taken back from the handler by calling its 'put' routine. The put
341/// routine also receives a 'bundle' parameter, resulting in the
342/// processed data being distributed back into a multidimensional
343/// array (or a set of them, like the input).
344///
345/// This mechanism sounds complicated, but buffering the data for
346/// processing (which oftentimes has to look at the data several times)
347/// is usually more efficient than operating on the data in their
348/// in-array locations, which are often widely distributed, making
349/// the memory access slow. On top of the memory efficiency gain,
350/// there is another aspect: by choosing the bundle size wisely,
351/// the buffered data can be processed by vector code. Even if the
352/// data aren't explicit SIMD vectors (which is an option), the
353/// simple fact that they 'fit' allows the optimizer to autovectorize
354/// the code, a technique which I call 'goading': You present the
355/// data in vector-friendly guise and thereby lure the optimizer to
356/// do the right thing. Another aspect of buffering is that the
357/// buffer can use a specific data type best suited to the arithmetic
358/// operation at hand which may be different from the source and target
359/// data. This is especially useful if incoming data are of an integral
360/// type: operating directly on integers would spoil the data, but if
361/// the buffer is set up to contain a real type, the data are lifted
362/// to it on arrival in the buffer and processed with float maths.
363/// A drawback to this method of dealing with integral data is the fact
364/// that, when filtering nD data along several axes, intermediate
365/// results are stored back to the integral type after processing along
366/// each axis, accruing quantization errors with each pass. If this is
367/// an issue - e.g. with high-dimensional data or insufficient dynamic
368/// range - please consider moving to a real data type before filtering.
369///
370/// Note that this code operates on arrays of fundamentals. The code
371/// calling this routine will have element-expanded data which weren't
372/// fundamentals in the first place. This expansion helps automatic
373/// vectorization, and for explicit vectorization with Vc it is even
374/// necessary.
375///
376/// Also note that this routine operates in single-threaded code:
377/// It's invoked via vspline::multithread, and each worker thread will
378/// perform its own call to 'present'. This is why the first argument
379/// is a pointer to a vspline::atomic acting as a 'ticket dispenser' for
380/// the indexes of 1D subarrays to be processed, and why the other
381/// arguments come in as pointers, where I'd usually pass by reference.
382
383// we start out with 'present', which takes plain MultiArrayViews
384// for input and output. This is the simpler case. 'vpresent', which
385// follows, is structurally similar but has to deal with the extra
386// complication of processing 'stacks' instead of single arrays.
387
388template < typename source_view_type ,
389 typename target_view_type ,
390 typename stripe_handler_type >
391void present ( vspline::atomic < std::ptrdiff_t > * p_tickets ,
392 const source_view_type * p_source ,
393 target_view_type * p_target ,
394 const typename stripe_handler_type::arg_type * p_args ,
395 int axis )
396{
397 enum { dimension = source_view_type::actual_dimension } ;
398 enum { vsize = stripe_handler_type::vsize } ;
399
400 // get references to the source and target views
401
402 const source_view_type & source ( *p_source ) ;
403 target_view_type & target ( *p_target ) ;
404
405 // get the total length of the axis we want to process
406
407 std::ptrdiff_t count = source.shape ( axis ) ;
408
409 // set up the 'stripe handler' which holds the buffer(s)
410 // and calls the filter. It's important that this does not
411 // happen until now when we've landed in the code executed
412 // by the worker threads, since the stripe_handler holds
413 // state which must not be shared between threads.
414
415 stripe_handler_type handler ( *p_args , count ) ;
416
417 // get the data types we're dealing with. These are fundamentals,
418 // since the arrays have been element-expanded for processing.
419
420 typedef typename source_view_type::value_type stype ;
421 typedef typename target_view_type::value_type ttype ;
422
423 // take a slice from the source array orthogonal to the processing
424 // axis. That's where the starting points of the 1D subarrays are. Then
425 // obtain a MultiCoordinateIterator over the indexes in the slice.
426 // We use nD indexes rather than directly using offsets because these
427 // indexes will be reused when the filtered data are written to the
428 // destination memory, where strides may be different. So the offsets
429 // for the way 'in' and the way 'out' are calculated from the same
430 // indexes just before they are used, by multiplying with the
431 // appropriate strides and summing up.
432
433 auto sample_slice = source.bindAt ( axis , 0 ) ;
434
435 vigra::MultiCoordinateIterator < dimension - 1 >
436 sliter ( sample_slice.shape() ) ;
437
438 // shape_type can hold an nD index into the slice, just what
439 // sliter refers to.
440
441 typedef vigra::TinyVector < std::ptrdiff_t , dimension - 1 > shape_type ;
442
443 // set of indexes for one run. Note the initialization with
444 // the first index, guarding against 'narrow stripes'.
445
446 vigra::TinyVector < shape_type , vsize > indexes { *sliter } ;
447
448 // set of offsets into the source slice which will be used for
449 // gather/scatter. These will be equivalent to the indexes above,
450 // 'condensed' by applying the stride and summing up.
451
452 vigra::TinyVector < std::ptrdiff_t , vsize > offsets ;
453
454 // now we keep fetching batches of vsize 1D subarrays until the
455 // atomic which p_tickets points to yields no more indexes
456
457 std::ptrdiff_t lo , hi ;
458 std::ptrdiff_t batch_size = vsize ;
459 std::ptrdiff_t total_size = sample_slice.size() ;
460
461 while ( vspline::fetch_range_ascending
462 ( *p_tickets , batch_size , total_size , lo , hi ) )
463 {
464 std::ptrdiff_t n_fetch = hi - lo ;
465
466 for ( std::ptrdiff_t i = 0 , f = lo ; i < n_fetch ; ++i , ++f )
467 {
468 indexes [ i ] = sliter [ f ] ;
469 }
470
471 // process the source array
472
473 auto stride = source.stride ( axis ) ;
474 auto size = source.shape ( axis ) ;
475
476 auto source_slice = source.bindAt ( axis , 0 ) ;
477 auto source_base_adress = source_slice.data() ;
478
479 // obtain a set of offsets from the set of indexes by 'condensing'
480 // the nD index into an offset - by applying the strides and summing up
481
482 for ( int e = 0 ; e < vsize ; e++ )
483 {
484 offsets[e] = sum ( source_slice.stride() * indexes[e] ) ;
485 }
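 // (for illustration: with hypothetical slice strides of (100, 5000) and
 //  an index of (3, 2), the offset comes out as 3 * 100 + 2 * 5000 = 20300)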
486
487 // form a 'bundle' to pass the data to the handler
488
489 bundle < stype , vsize > bi ( source_base_adress ,
490 offsets.data() ,
491 stride ,
492 size ) ;
493
494 // now use the bundle to move the data to the handler's buffer
495
496 handler.get ( bi , 0 , n_fetch ) ;
497
498 // now call the handler's operator(). This performs the actual
499 // filtering of the data
500
501 handler() ;
502
503 // now empty out the buffer to the target array, using pretty much
504 // the same set of operations as was used for fetching the data
505 // from source. Note how the offsets are recalculated, now using the
506 // target slice's strides.
507
508 stride = target.stride ( axis ) ;
509 size = target.shape ( axis ) ;
510
511 auto target_slice = target.bindAt ( axis , 0 ) ;
512 auto target_base_adress = target_slice.data() ;
513
514 for ( int e = 0 ; e < vsize ; e++ )
515 offsets[e] = sum ( target_slice.stride() * indexes[e] ) ;
516
517 bundle < ttype , vsize > bo ( target_base_adress ,
518 offsets.data() ,
519 stride ,
520 size ) ;
521
522 handler.put ( bo , 0 , n_fetch ) ;
523 }
524}
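The 'goading' technique referred to in the comments above can be sketched
independently of vspline's buffer types; this is only an illustration of the
principle, not code from the library:

// sketch of 'goading': the inner loop has a fixed, compile-time trip count
// and no loop-carried dependencies, so the optimizer can autovectorize it
// even though it is written as scalar code
template < size_t vsize >
void scale_lanes ( float ( & lanes ) [ vsize ] , float factor )
{
  for ( size_t e = 0 ; e < vsize ; e++ )
    lanes [ e ] *= factor ;
}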
525
526/// vpresent is a variant of 'present' processing 'stacks' of arrays.
527/// See 'present' for discussion. This variant of 'present' will rarely
528/// be used. Having it does no harm but if you study the code, you may
529/// safely ignore it unless you are actually using single-axis filtering
530/// of stacks of arrays. The code is structurally similar to 'present',
531/// with the extra complication of processing stacks instead of single
532/// arrays.
533
534template < typename source_view_type ,
535 typename target_view_type ,
536 typename stripe_handler_type >
537void vpresent ( vspline::atomic < std::ptrdiff_t > * p_tickets ,
538 const std::vector<source_view_type> * p_source ,
539 std::vector<target_view_type> * p_target ,
540 const typename stripe_handler_type::arg_type * p_args ,
541 int axis )
542{
543 enum { dimension = source_view_type::actual_dimension } ;
544 enum { vsize = stripe_handler_type::vsize } ;
545
546 // get references to the std::vectors holding source and target views
547
548 const std::vector<source_view_type> & source ( *p_source ) ;
549 std::vector<target_view_type> & target ( *p_target ) ;
550
551 // get the total length of the axis we want to process
552
553 std::ptrdiff_t count = 0 ;
554 for ( auto & e : source )
555 count += e.shape ( axis ) ;
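 // (for illustration: stacking three arrays of shapes (100, 50), (80, 50)
 //  and (120, 50) along axis 0 yields count = 100 + 80 + 120 = 300)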
556
557 // set up the 'stripe handler' which holds the buffer(s)
558 // and calls the filter. It's important that this does not
559 // happen until now when we've landed in the code executed
560 // by the worker threads, since the stripe_handler holds
561 // state which must not be shared between threads.
562
563 stripe_handler_type handler ( *p_args , count ) ;
564
565 // get the data types we're dealing with. These are fundamentals,
566 // since the arrays have been element-expanded for processing.
567
568 typedef typename source_view_type::value_type stype ;
569 typedef typename target_view_type::value_type ttype ;
570
571 // take a slice from the first source array orthogonal to the processing
572 // axis. That's where the starting points of the 1D subarrays are. Then
573 // obtain a MultiCoordinateIterator over the indexes in the slice
574
575 auto sample_slice = source[0].bindAt ( axis , 0 ) ;
576
577 vigra::MultiCoordinateIterator < dimension - 1 >
578 sliter ( sample_slice.shape() ) ;
579
580 // shape_type can hold an nD index into the slice, just what
581 // sliter refers to.
582
583 typedef vigra::TinyVector < std::ptrdiff_t , dimension - 1 > shape_type ;
584
585 // set of indexes for one run. Note the initialization with
586 // the first index, guarding against 'narrow stripes'.
587
588 vigra::TinyVector < shape_type , vsize > indexes { *sliter } ;
589
590 // set of offsets into the source slice which will be used for
591 // gather/scatter. These will be equivalent to the indexes above,
592 // 'condensed' by applying the stride and summing up.
593
594 vigra::TinyVector < std::ptrdiff_t , vsize > offsets ;
595
596 // now we keep fetching batches of vsize 1D subarrays until the
597 // atomic which p_tickets points to yields no more indexes
598
599 std::ptrdiff_t lo , hi ;
600 std::ptrdiff_t batch_size = vsize ;
601 std::ptrdiff_t total_size = sample_slice.size() ;
602
603 while ( vspline::fetch_range_ascending
604 ( *p_tickets , batch_size , total_size , lo , hi ) )
605 {
606 std::ptrdiff_t n_fetch = hi - lo ;
607
608 for ( std::ptrdiff_t i = 0 , f = lo ; i < n_fetch ; ++i , ++f )
609 {
610 indexes [ i ] = sliter [ f ] ;
611 }
612
613 // iterate over the input arrays, loading data into the buffer
614 // from all arrays in turn, using the same set of indexes.
615 // 'progress' holds the part of 'count' that has been transferred
616 // already.
617
618 std::ptrdiff_t progress = 0 ;
619
620 // now iterate over the source arrays. While the set of nD indexes
621 // used is the same for each stack member, the offsets may be different,
622 // as they are calculated using specific strides for each stack member.
623
624 for ( auto & input : source )
625 {
626 auto source_stride = input.stride ( axis ) ;
627 auto part_size = input.shape ( axis ) ;
628 auto slice = input.bindAt ( axis , 0 ) ;
629 auto source_base_adress = slice.data() ;
630
631 // obtain a set of offsets from the set of indexes by 'condensing'
632 // the nD index into an offset - by applying the strides and summing up
633
634 for ( int e = 0 ; e < vsize ; e++ )
635 {
636 offsets[e] = sum ( slice.stride() * indexes[e] ) ;
637 }
638
639 // form a 'bundle' to pass the data to the handler
640
641 bundle < stype , vsize > bi ( source_base_adress ,
642 offsets.data() ,
643 source_stride ,
644 part_size ) ;
645
646 // now use the bundle to fill part_size entries in the handler's
647 // buffer, starting at 'progress'. 'progress' records how many
648 // sets of values have already been pushed into the buffer
649 // then carry on with the next input array, if any
650
651 handler.get ( bi , progress , n_fetch ) ;
652 progress += part_size ;
653 }
654
655 // data from all stacks have been transferred to the buffer.
656 // now call the handler's operator(). This performs the actual
657 // filtering of the data
658
659 handler() ;
660
661 // now empty out the buffer to the std::vector of target arrays,
662 // using pretty much the same set of operations as was used for
663 // fetching the data from source.
664
665 progress = 0 ;
666
667 for ( auto & output : target )
668 {
669 auto target_stride = output.stride ( axis ) ;
670 auto part_size = output.shape ( axis ) ;
671 auto slice = output.bindAt ( axis , 0 ) ;
672 auto target_base_adress = slice.data() ;
673
674 for ( int e = 0 ; e < vsize ; e++ )
675 offsets[e] = sum ( slice.stride() * indexes[e] ) ;
676
677 bundle < ttype , vsize > bo ( target_base_adress ,
678 offsets.data() ,
679 target_stride ,
680 part_size ) ;
681
682 handler.put ( bo , progress , n_fetch ) ;
683 progress += part_size ;
684 }
685 }
686}
687
688/// struct separable_filter is the central object used for 'wielding'
689/// filters. The filters themselves are defined as 1D operations, which
690/// is sufficient for a separable filter: the 1D operation is applied
691/// to each axis in turn. If the *data* themselves are 1D, this is
692/// inefficient if the run of data is very long: we'd end up with a
693/// single thread processing the data without vectorization. So for this
694/// special case, we use a bit of trickery: long runs of 1D data are
695/// folded up, processed as 2D (with multithreading and vectorization)
696/// and the result of this operation, which isn't correct everywhere,
697/// is 'mended' where it is wrong. If the data are nD, we process them
698/// by buffering chunks collinear to the processing axis and applying
699/// the 1D filter to these chunks. 'Chunks' isn't quite the right word
700/// to use here - what we're buffering are 'bundles' of 1D subarrays,
701/// where a bundle holds as many 1D subarrays as a SIMD vector is wide.
702/// this makes it possible to process the buffered data with vectorized
703/// code. While most of the time the buffering will simply copy data into
704/// and out of the buffer, we use a distinct data type for the buffer
705/// which makes sure that arithmetic can be performed in floating point
706/// and with sufficient precision to do the data justice. With this
707/// provision we can safely process arrays of integral type. Such data
708/// are 'promoted' to this type when they are buffered and converted to
709/// the result type afterwards. Of course there will be quantization
710/// errors if the data are converted to an integral result type; it's
711/// best to use a real result type.
712/// The type for arithmetic operations inside the filter is fixed via
713/// stripe_handler_type, which takes a template argument '_math_ele_type'.
714/// This way, the arithmetic type is distributed consistently.
715/// Also note that an integral target type will receive the data via a
716/// simple type conversion and not with saturation arithmetics. If this
717/// is an issue, filter to a real-typed target and process separately.
718/// A good way of using integral data is to have integral input
719/// and real-typed output. Promoting the integral data to a real type
720/// preserves them precisely, and the 'exact' result is then stored in
721/// floating point. With such a scheme, raw data (like image data,
722/// which are often 8 or 16 bit integers) can be 'sucked in' without
723/// need for previous conversion, producing filtered data in, say, float
724/// for further processing.
725
726template < typename input_array_type ,
727 typename output_array_type ,
728 typename stripe_handler_type >
729struct separable_filter
730{
731 enum { dimension = input_array_type::actual_dimension } ;
732 static_assert ( dimension == output_array_type::actual_dimension ,
733 "separable_filter: input and output array type must have the same dimension" ) ;
734
735 typedef typename input_array_type::value_type in_value_type ;
736 typedef typename output_array_type::value_type out_value_type ;
737
738 enum { channels = vigra::ExpandElementResult < in_value_type > :: size } ;
739 static_assert ( channels
740 == vigra::ExpandElementResult < out_value_type > :: size ,
741 "separable_filter: input and output data type must have the same number of channels" ) ;
742
743 typedef typename vigra::ExpandElementResult < in_value_type >
744 :: type in_ele_type ;
745
746 typedef typename vigra::ExpandElementResult < out_value_type >
747 :: type out_ele_type ;
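 // (for illustration: with in_value_type = vigra::RGBValue<unsigned char>
 //  and out_value_type = vigra::TinyVector<float,3>, channels is 3,
 //  in_ele_type is unsigned char and out_ele_type is float)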
748
749 typedef std::integral_constant < bool , dimension == 1 > is_1d_type ;
750 typedef std::integral_constant < bool , channels == 1 > is_1_channel_type ;
751
752 /// this is the standard entry point to the separable filter code
753 /// for processing *all* axes of an array. first we use a dispatch
754 /// to separate processing of 1D data from processing of nD data.
755
756 template < class filter_args > // may be single argument or a std::vector
757 void operator() ( const input_array_type & input ,
758 output_array_type & output ,
759 const filter_args & handler_args ,
760 int njobs = vspline::default_njobs ) const
761 {
762 // we use a dispatch depending on whether data are 1D or nD arrays
763
765 input , output , handler_args , njobs ) ;
766 }
767
768 // _on_dimension differentiates between 1D and nD data. We don't
769 // look at the arguments - they are simply forwarded to either
770 // _process_1d or _process_nd.
771
772 template < typename ... types >
773 void _on_dimension ( std::true_type , // 1D data
774 types ... args ) const
775 {
776 // data are 1D. unpack the variadic content and call
777 // the specialized method
778
779 _process_1d ( args ... ) ;
780 }
781
782 template < typename ... types >
783 void _on_dimension ( std::false_type , // nD data
784 types ... args ) const
785 {
786 // data are nD. unpack the variadic content and call
787 // the code for nD processing.
788
789 _process_nd ( args ... ) ;
790 }
791
792 /// specialized processing of 1D input/output.
793 /// We have established that the data are 1D.
794 /// we have received a std::vector of handler arguments.
795 /// It has to contain precisely one element which we unpack
796 /// and use to call the overload below.
797
798 template < typename in_vt , typename out_vt >
799 void _process_1d ( const in_vt & input ,
800 out_vt & output ,
801 const std::vector
802 < typename stripe_handler_type::arg_type >
803 & handler_args ,
804 int njobs ) const
805 {
806 assert ( handler_args.size() == 1 ) ;
807 _process_1d ( input , output , handler_args[0] , njobs ) ;
808 }
809
810 /// specialized processing of 1D input/output.
811 /// We have established that the data are 1D and we have
812 /// a single handler argument.
813 /// This routine may come as a surprise and it's quite long
814 /// and complex. The strategy is this:
815 /// - if the data are 'quite short', simply run a 1D filter
816 /// directly on the data, without any multithreading or
817 /// vectorization. If the user has specified 'zero tolerance',
818 /// do the same.
819 /// - otherwise employ 'fake 2D processing': pretend the
820 /// data are 2D, filter them with 2D code (which offers
821 /// multithreading and vectorization) and then 'mend'
822 /// the result, which is wrong in parts due to the
823 /// inappropriate processing.
824 /// expect 'fake 2D processing' to kick in for buffer sizes
825 /// somewhere in the low thousands, to give you a rough idea.
826 /// All data paths in this routine make sure that the maths
827 /// are done in math_type, there won't be storing of
828 /// intermediate values to a lesser type. If the user
829 /// has specified 'zero tolerance' and the output type is not
830 /// the same as math_type, we have a worst-case scenario where
831 /// the entire length of data is buffered in math_type and the
832 /// operation is single-threaded and unvectorized, but this
833 /// should rarely happen and requires the user to explicitly
834 /// override the defaults. If the data are too short for fake
835 /// 2D processing, the operation will also fail to multithread
836 /// or vectorize.
837
838 // TODO: establish the cost of powering up the multithreaded data
839 // processing to set a lower limit for data sizes which should be
840 // processed with several threads: the overhead for small data sets
841 // might make multithreading futile.
842
843 // call receiving an axis is routed to overload below - this
844 // overload here is needed for symmetry with _process_nd
845 // TODO: needing this seems slightly dodgy...
846
847 template < typename in_vt , typename out_vt >
848 void _process_1d ( const in_vt & input ,
849 out_vt & output ,
850 int axis ,
851 const typename stripe_handler_type::arg_type
852 & handler_args ,
853 int njobs ) const
854 {
855 _process_1d ( input , output , handler_args , njobs ) ;
856 }
857
858 template < typename in_vt , typename out_vt >
859 void _process_1d ( const in_vt & input ,
860 out_vt & output ,
861 const typename stripe_handler_type::arg_type
862 & handler_args ,
863 int njobs ) const
864 {
865 typedef typename in_vt::value_type in_value_type ;
866 typedef typename out_vt::value_type out_value_type ;
867
868 // we'll need to access the 'raw' filter. To specify its type in
869 // agreement with 'stripe_handler_type', we glean math_ele_type
870 // from there and construct math_type from it.
871
872 typedef typename stripe_handler_type::math_ele_type math_ele_type ;
873 typedef vspline::canonical_type < math_ele_type , channels > math_type ;
874
875 // obtain a raw filter capable of processing math_type
876
877 auto raw_filter = stripe_handler_type::template
878 get_raw_filter < math_type > ( handler_args ) ;
879
880 // right now, we only need the filter's support width, but we
881 // may use the filter further down.
882
883 const int bands = channels ;
884 int runup = raw_filter.get_support_width() ;
885
886 // if we can multithread, start out with as many lanes
887 // as the desired number of threads
888
889 int lanes = njobs ;
890 enum { vsize = stripe_handler_type::vsize } ;
891
892 // the number of lanes is multiplied by the
893 // number of elements a vector-friendly type can handle
894
895 lanes *= vsize ;
896
897 // the absolute minimum to successfully run the fake 2D filter is this:
898 // TODO we might raise the threshold, min_length, here
899
900 int min_length = 4 * runup * lanes + 2 * runup ;
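 // (for illustration: with a hypothetical runup of 4, njobs of 8 and vsize
 //  of 8, lanes is 64 and min_length is 4 * 4 * 64 + 2 * 4 = 1032, which is
 //  why fake 2D processing 'kicks in' for lengths in the low thousands)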
901
902 // runup == INT_MAX signals that fake 2D processing is inappropriate.
903 // if input is too short to bother with fake 2D, just single-lane it
904
905 if ( runup == INT_MAX || input.shape(0) < min_length )
906 {
907 lanes = 1 ;
908 }
909 else
910 {
911 // input is larger than the absolute minimum, maybe we can even increase
912 // the number of lanes some more? we'd like to do this if the input is
913 // very large, since we use buffering and don't want the buffers to become
914 // overly large. But the smaller the run along the split x axis, the more
915 // incorrect margin values we have to mend, so we need a compromise.
916 // assume a 'good' length for input: some length where further splitting
917 // would not be wanted anymore. TODO: do some testing, find a good value
918
919 int good_length = 64 * runup * lanes + 2 * runup ;
920
921 int split = 1 ;
922
923 // suppose we split input.shape(0) into ( 2 * split ) parts: is it still larger
924 // than this 'good' length? If not, leave split factor as it is.
925
926 while ( input.shape(0) / ( 2 * split ) >= good_length )
927 {
928 // if yes, double split factor, try again
929 split *= 2 ;
930 }
931
932 lanes *= split ; // increase number of lanes by additional split
933 }
934
935 // if there's only one lane we fall back to single-threaded
936 // operation, using a 'raw' filter directly processing the
937 // input - either producing the output straight away or,
938 // intermediately, its representation in math_type.
939
940 if ( lanes == 1 )
941 {
942 // we look at the data first: if out_value_type is the same type
943 // as math_type, we can use the raw filter directly on input and
944 // output. This is also possible if the filter is single-pass,
945 // because a single-pass filter does not need to store intermediate
946 // results - so convolution is okay, but b-spline prefiltering
947 // is not.
948 if ( std::is_same < out_value_type , math_type > :: value
949 || stripe_handler_type::is_single_pass )
950 {
951 auto raw_filter = stripe_handler_type::template
952 get_raw_filter < in_value_type ,
953 out_value_type ,
954 math_type > ( handler_args ) ;
955
956 raw_filter.solve ( input , output ) ;
957 }
958 else
959 {
960 // we can't use the easy option above. So we'll have to create
961 // a buffer of math_type, use that as target, run the filter
962 // and copy the result to output. This is potentially expensive:
963 // the worst case is that we have to create a buffer which is
964 // larger than the whole input signal (if math_type's size is
965 // larger than in_value_type's) - and on top, the operation is
966 // single-threaded and unvectorized. This should rarely happen
967 // for long signals. Mathematically, we're definitely on the
968 // safe side, provided the user hasn't chosen an unsuitable
969 // math_type.
970
971 vigra::MultiArray < 1 , math_type > buffer ( input.shape() ) ;
972
973 auto raw_filter = stripe_handler_type::template
974 get_raw_filter < in_value_type ,
975 math_type ,
976 math_type > ( handler_args ) ;
977
978 raw_filter.solve ( input , buffer ) ;
979
980 auto trg = output.begin() ;
981 for ( auto const & src : buffer )
982 {
983 *trg = out_value_type ( src ) ;
984 ++trg ;
985 }
986 }
987 return ; // return directly. we're done
988 }
989
990 // the input qualifies for fake 2D processing.
991 // we want as many chunks as we have lanes. There may be some data left
992 // beyond the chunks (tail_size of value_type)
993
994 int core_size = input.shape(0) ;
995 int chunk_size = core_size / lanes ;
996 core_size = lanes * chunk_size ;
997 int tail_size = input.shape(0) - core_size ;
998
999 // just doublecheck
1000
1001 assert ( core_size + tail_size == input.shape(0) ) ;
1002
1003 // now here's the strategy: we treat the data as if they were 2D. This will
1004 // introduce errors along the 'vertical' margins, since there the 2D treatment
1005 // will start with some boundary condition along the x axis instead of looking
1006 // at the neighbouring line where the actual continuation is.
1007
1008 // first we deal with the very beginning and end of the signal. This requires
1009 // special treatment, because here we want the boundary conditions to take
1010 // effect. So we copy the beginning and end of the signal to a buffer, being
1011 // generous with how many data we pick. The resulting buffer will have an
1012 // unusable part in the middle, where tail follows head, but since we've made
1013 // sure that this location is surrounded by enough 'runup' data, the effect
1014 // will only be detectable at +/- runup from the point where tail follows head.
1015 // The beginning of head and the end of tail are at the beginning and end
1016 // of the buffer, though, so that applying the boundary condition will
1017 // have the desired effect. What we'll actually use of the buffer is not
1018 // the central bit with the effects of the clash of head and tail, but
1019 // only the bits at the ends which aren't affected because they are far enough
1020 // away. Another way of looking at this operation is that we 'cut out' a large
1021 // central section of the data and process the remainder, ignoring the cut-out
1022 // part. Then we only use that part of the result which is 'far enough' away
1023 // from the cut to be unaffected by it.
1024
1025 // note how this code fixes a bug in my initial implementation, which produced
1026 // erroneous results with periodic splines, because the boundary condition
1027 // was not properly honoured.
1028
1029 // calculate the sizes of the parts of the signal we'll put into the buffer
1030 int front = 2 * runup ;
1031 int back = tail_size + 2 * runup ;
1032 int total = front + back ;
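 // (for illustration: with runup = 4 and tail_size = 3, front is 8 and back
 //  is 11, so the buffer holds the first 8 and the last 11 samples; after
 //  filtering, only the first 4 and the last 7 buffer values are used)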
1033
1034 // create the buffer and copy the beginning and end of the signal into it.
1035 // Note how the data are converted to math_type to do the filtering
1036
1037 vigra::MultiArray < 1 , math_type > head_and_tail ( total ) ;
1038
1039 auto target_it = head_and_tail.begin() ;
1040 auto source_it = input.begin() ;
1041 for ( int i = 0 ; i < front ; i++ )
1042 {
1043 *target_it = math_type ( *source_it ) ;
1044 ++target_it ;
1045 ++source_it ;
1046 }
1047 source_it = input.end() - back ;
1048 for ( int i = 0 ; i < back ; i++ )
1049 {
1050 *target_it = math_type ( *source_it ) ;
1051 ++target_it ;
1052 ++source_it ;
1053 }
1054
1055 // this buffer is submitted to the 'raw' filter. After the call, the buffer
1056 // has usable data for the very beginning and end of the signal.
1057
1058 raw_filter.solve ( head_and_tail , head_and_tail ) ;
1059
1060 // set up two MultiArrayViews corresponding to the portions of the data
1061 // we copied into the buffer. The first bit of 'head' and the last bit
1062 // of 'tail' hold valid data and will be used further down.
1063
1064 vigra::MultiArrayView < 1 , math_type > head
1065 ( vigra::Shape1 ( front ) , head_and_tail.data() ) ;
1066
1067 vigra::MultiArrayView < 1 , math_type > tail
1068 ( vigra::Shape1 ( back ) , head_and_tail.data() + front ) ;
1069
1070 // head now has runup correct values at the beginning, succeeded by runup
1071 // invalid values, and tail has tail_size + runup correct values at the end,
1072 // preceded by runup values which aren't usable.
1073
1074 // now we create a fake 2D view to the margin of the data. Note how we let
1075 // the view begin 2 * runup before the end of the first line, capturing the
1076 // 'wraparound' right in the middle of the view.
1077
1078 // The fake 2D views hold enough runup on either side of the usable
1079 // data, so we use boundary conditions which are fast to compute instead of
1080 // futilely using the boundary conditions pertaining to the original data,
1081 // which would only have an effect on the runup data which do not end up
1082 // in the final result at all. We still end up wasting a few cycles, because
1083 // the filter itself will surround the data with some extrapolated values
1084 // (as many as is deemed appropriate for the filter's support), but at least
1085 // the extrapolation won't be computationally expensive. Getting rid of these
1086 // extra computations is probably more expensive than accepting this small
1087 // amount of wasted CPU time.
1088
1089 typename stripe_handler_type::arg_type
1090 handler_args_with_bc_guess ( handler_args ) ;
1091
1092 handler_args_with_bc_guess.bc = vspline::GUESS ;
1093
1094 // KFJ 2018-02-11 both here, and a bit further down, where 'margin_target'
1095 // is set up, I had forgotten to multiply the offset which is added to
1096 // *.data() with the appropriate stride, resulting in memory errors where
1097 // the stride wasn't 1. since this rarely happens, I did not notice it
1098 // until now.
1099
1100 vigra::MultiArrayView < 2 , in_value_type >
1101
1102 fake_2d_margin ( vigra::Shape2 ( 4 * runup ,
1103 lanes - 1 ) ,
1104
1105 vigra::Shape2 ( input.stride(0) ,
1106 input.stride(0)
1107 * chunk_size ) ,
1108
1109 input.data() + input.stride(0)
1110 * ( chunk_size - 2 * runup )
1111 ) ;
1112
1113 // again we create a buffer and filter into the buffer
1114
1115 vigra::MultiArray < 2 , out_value_type >
1116 margin_buffer ( fake_2d_margin.shape() ) ;
1117
1118 separable_filter < vigra::MultiArrayView < 2 , in_value_type > ,
1119 vigra::MultiArrayView < 2 , out_value_type > ,
1120 stripe_handler_type >()
1121 ( fake_2d_margin , margin_buffer , 0 , handler_args_with_bc_guess , njobs ) ;
1122
1123 // now we have filtered data for the margins in margin_buffer,
1124 // of which the central half is usable, the remainder being runup data
1125 // which we'll ignore. Here's a view to the central half:
1126
1127 vigra::MultiArrayView < 2 , out_value_type > margin
1128 = margin_buffer.subarray ( vigra::Shape2 ( runup , 0 ) ,
1129 vigra::Shape2 ( 3 * runup , lanes - 1 ) ) ;
1130
1131 // we create a view to the target array's margin which we intend
1132 // to overwrite, but the data will only be copied in from margin
1133 // after the treatment of the core.
1134
1135 vigra::MultiArrayView < 2 , out_value_type >
1136
1137 margin_target ( vigra::Shape2 ( 2 * runup ,
1138 lanes - 1 ) ,
1139
1140 vigra::Shape2 ( output.stride(0) ,
1141 output.stride(0)
1142 * chunk_size ) ,
1143
1144 output.data() + output.stride(0)
1145 * ( chunk_size - runup )
1146 ) ;
1147
1148 // next we 'fake' a 2D array from input and filter it to output, this may
1149 // be an in-place operation, since we've extracted all margin information
1150 // earlier and deposited what we need in buffers.
1151
1152 vigra::MultiArrayView < 2 , in_value_type >
1153 fake_2d_source ( vigra::Shape2 ( chunk_size , lanes ) ,
1154 vigra::Shape2 ( input.stride(0) , input.stride(0) * chunk_size ) ,
1155 input.data() ) ;
1156
1157 vigra::MultiArrayView < 2 , out_value_type >
1158 fake_2d_target ( vigra::Shape2 ( chunk_size , lanes ) ,
1159 vigra::Shape2 ( output.stride(0) , output.stride(0) * chunk_size ) ,
1160 output.data() ) ;
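 // (for illustration: a 1D signal of 1024 values with stride 1, split into
 //  lanes = 8 chunks of chunk_size = 128, is viewed as a 2D array of shape
 //  (128, 8) with strides (1, 128) - the same memory, without any copying)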
1161
1162 // now we filter the fake 2D source to the fake 2D target
1163
1164 separable_filter < vigra::MultiArrayView < 2 , in_value_type > ,
1165 vigra::MultiArrayView < 2 , out_value_type > ,
1166 stripe_handler_type >()
1167 ( fake_2d_source , fake_2d_target , 0 , handler_args_with_bc_guess , njobs ) ;
1168
1169 // we now have filtered data in target, but the stripes along the margins
1170 // in x-direction (1 runup wide) are wrong, because we applied whatever
1171 // boundary conditions inherent to the filter, while the data in fact
1172 // continued from one line end to the next one's beginning.
1173 // this is why we have the data in 'margin', and we now copy them to the
1174 // relevant section of 'target'
1175
1176 margin_target = margin ;
1177
1178 // finally we have to fix the first and last few values, which weren't
1179 // touched by the margin operation (due to margin's offset and length)
1180 // note how we move back from 'math_type' to 'out_value_type'.
1181
1182 for ( int i = 0 ; i < runup ; i++ )
1183 output[i] = out_value_type ( head[i] ) ;
1184
1185 int j = tail.size() - tail_size - runup ;
1186 for ( int i = output.size() - tail_size - runup ;
1187 i < output.size() ; i++ , j++ )
1188 output[i] = out_value_type ( tail[j] ) ;
1189
1190 } // end of first _process_1d() overload
1191
1192 /// specialized processing of nD input/output. We have established
1193 /// that the data are nD. Now we process the axes in turn, passing
1194 /// the per-axis handler args.
1195
1196 void _process_nd ( const input_array_type & input ,
1197 output_array_type & output ,
1198 const std::vector
1199 < typename stripe_handler_type::arg_type >
1200 & handler_args ,
1201 int njobs ) const
1202 {
1203 _process_nd ( input , output , 0 ,
1204 handler_args [ 0 ] , njobs ) ;
1205
1206 for ( int axis = 1 ; axis < dimension ; axis++ )
1207 {
1208 // note the different argument signature here: the first argument
1209 // is now 'output', because the run for axis 0 above has deposited
1210 // its output there and that's where we need to pick it up for
1211 // filtering along the other axes - now we're operating in-place.
1212
1213 _process_nd ( output , output , axis ,
1214 handler_args [ axis ] , njobs ) ;
1215 }
1216 }
1217
1218 // in the next section we have code processing nD data along
1219 // a specific axis. The code starts with an operator() overload
1220 // meant to be called from 'outside'. This is meant for cases
1221 // where filtering needs to be done differently for different
1222 // axes. After that we have the actual processing code.
1223
1224 /// this operator() overload for single-axis processing takes
1225 /// plain arrays for input and output, they may be either 1D or nD.
1226 /// again we use _on_dimension, now with a different argument
1227 /// signature (we have 'axis' now). As _on_dimension is a
1228 /// variadic template, we 'reemerge' in the right place.
1229
1230 void operator() ( const input_array_type & input ,
1231 output_array_type & output ,
1232 int axis ,
1233 const typename stripe_handler_type::arg_type
1234 & handler_args ,
1235 int njobs = vspline::default_njobs ) const
1236 {
1237 _on_dimension ( is_1d_type() ,
1238 input , output , axis ,
1239 handler_args , njobs ) ;
1240 }
1241
1242 /// processing of nD input/output. we have established that the data
1243 /// are nD. now we look at the data type. If the data are multi-channel,
1244 /// they need to be element-expanded for further processing, which
1245 /// is done by '_on_expand'. Note that there are two variants of
1246 /// _on_expand, one for plain arrays and one for stacks of arrays,
1247 /// which are coded after the operator() overload taking stacks of
1248 /// arrays further down.
1249
1250 template < typename ... types >
1251 void _process_nd ( types ... args ) const
1252 {
1253 // we're all set for processing a single axis of data.
1254 // now we have a dispatch on is_1_channel_type, because if the
1255 // data are multi-channel, we want to element-expand the arrays.
1256
1257 _on_expand ( is_1_channel_type() , args ... ) ;
1258 }
1259
1260 /// variant of _on_expand for single arrays. this overload is called
1261 /// if the data are multi-channel. we element-expand the arrays, then
1262 /// call the single-channel overload below
1263
1264 // KFJ 2020-11-19 made this overload into a template of in_t, out_t,
1265 // because it can be called with in_t being the same as input_array_type,
1266 // or in_t being output_array_type - the latter happens for axes >= 1,
1267 // where the input is the output of filtering axis 0 and already of
1268 // the same type as the output.
1269
1270 template < typename in_t , typename out_t >
1271 void _on_expand ( std::false_type , // is_1_channel_type() ,
1272 const in_t & input ,
1273 out_t & output ,
1274 int axis ,
1275 const typename stripe_handler_type::arg_type
1276 & handler_args ,
1277 int njobs ) const
1278 {
1279 auto source = input.expandElements ( 0 ) ;
1280 auto target = output.expandElements ( 0 ) ;
1281
1282 // with the element-expanded data at hand, we can now delegate to
1283 // the routine below, which deals with single-channel data
1284
1285 _on_expand ( std::true_type() , // the expanded arrays are single-channel
1286 source ,
1287 target ,
1288 axis + 1 , // processing axis is now one higher
1289 handler_args ,
1290 njobs ) ;
1291 }
1292
1293 /// Variant of _on_expand for single arrays. This is the single-channel
1294 /// overload. The arrays now hold fundamentals, either because that was
1295 /// their original data type or because they have been element-expanded.
1296 /// Now we finally get to do the filtering.
1297 /// Note how we introduce input and output as templates, since we can't
1298 /// be sure of their type: they may or may not have been element-expanded.
1299
1300 // reimplementation of _on_expand using the new multithreading logic.
1301 // The original implementation assigned subsets of the workload to
1302 // individual worker threads. This new version assigns the same
1303 // task to all workers: fetch batches of lines and process them
1304 // until there are no more. The new version provides better locality,
1305 // granularity and code simplicity.
1306
1307 template < typename in_t , typename out_t >
1308 void _on_expand ( std::true_type , // is_1_channel_type() ,
1309 const in_t & input ,
1310 out_t & output ,
1311 int axis ,
1312 const typename stripe_handler_type::arg_type
1313 & handler_args ,
1314 int njobs ) const
1315 {
1316 // size is the size of a 1-element thick slice perpendicular to
1317 // the processing axis. We set up a vspline::atomic holding that number,
1318 // which will be accessed by the worker threads in batches of vsize
1319 // to be buffered, filtered and written out to the target array.
1320 // think of the vspline::atomic as a ticket dispenser giving out tickets
1321 // with successive numbers pertaining to lines needing processing.
1322
1323 auto size = input.size() / input.shape ( axis ) ;
1324 vspline::atomic < std::ptrdiff_t > tickets ( size ) ;
1325
1326 // we'll use this worker code, which simply calls 'present' with
1327 // the arguments it needs, including, as first argument, a pointer to
1328 // the vspline::atomic above which yields indexes of 1D subarrays. Using
1329 // a lambda with capture-by-reference here is a neat way of producing the
1330 // std::function<void()> which 'multithread' expects
1331
1332 std::function < void() > worker =
1333 [&]()
1334 {
1335 present < in_t , out_t , stripe_handler_type >
1336 ( &tickets , &input , &output , &handler_args , axis ) ;
1337 } ;
1338
1339 // finally we use multithread() to set up njobs worker threads which
1340 // all run the same code, fetching batches of 1D subarrays until they
1341 // have all been processed.
1342
1343 vspline::multithread ( worker , njobs ) ;
1344 }
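The 'ticket dispenser' idea can be sketched with a plain std::atomic; this is
hypothetical code, not the fetch logic vspline actually uses:

#include <atomic>
#include <cstddef>

// returns true and a half-open range [ lo , hi ) of still-unclaimed line
// indexes, or false once the dispenser is exhausted
bool claim_batch ( std::atomic < std::ptrdiff_t > & tickets ,
                   std::ptrdiff_t batch_size ,
                   std::ptrdiff_t & lo ,
                   std::ptrdiff_t & hi )
{
  std::ptrdiff_t remaining = tickets.fetch_sub ( batch_size ) ;
  if ( remaining <= 0 )
    return false ;
  hi = remaining ;
  lo = remaining > batch_size ? remaining - batch_size : 0 ;
  return true ;
}

Each worker would call such a routine in a loop, processing the hi - lo
subarrays of every claimed batch; batches near the end of the run may be
narrower than batch_size, which is why 'present' and 'vpresent' pass n_fetch
on to the handler's get() and put().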
1345
1346 /// this operator() overload for single-axis processing takes
1347 /// std::vectors ('stacks') of arrays. this is only supported
1348 /// for nD data. This is a rarely-used variant; throughout vspline
1349 /// there isn't currently any place where this routine is called,
1350 /// but it's needed for some special applications. If you are studying
1351 /// the code, you may safely disregard the remainder of the code in
1352 /// this class definition; the two _on_expand variants below are
1353 /// also for the special case of 'stacks' of arrays.
1354
1355 // With a bit of adapter code, this path could be used for
1356 // processing vigra's 'chunked arrays': for every axis, put
1357 // all sequences of chunks collinear to that axis into a
1358 // std::vector (as MultiArrayViews, not the data themselves),
1359 // then pass these stacks to this routine. TODO: try
1360 // As long as one sequence of chunks fits into memory, the
1361 // process should be efficient, allowing filtering of very large
1362 // data sets.
1363
1364 void operator() ( const std::vector<input_array_type> & input ,
1365 std::vector<output_array_type> & output ,
1366 int axis ,
1367 const typename stripe_handler_type::arg_type
1368 & handler_args ,
1369 int njobs = vspline::default_njobs ) const
1370 {
1371 static_assert ( ! is_1d_type() ,
1372 "processing of stacked 1D arrays is not supported" ) ;
1373
1374 _process_nd ( input , output , axis ,
1375 handler_args , njobs ) ;
1376 }
1377
1378 /// variant of _on_expand for stacks of arrays.
1379 /// this overload is called if the data are multi-channel.
1380 /// we element-expand the arrays, then call the single-channel
1381 /// overload below
1382
1383 template < typename in_vt , typename out_vt >
1384 void _on_expand ( std::false_type , // is_1_channel_type() ,
1385 const std::vector < in_vt > & input ,
1386 std::vector < out_vt > & output ,
1387 int axis ,
1388 const typename stripe_handler_type::arg_type
1389 & handler_args ,
1390 int njobs ) const
1391 {
1392 typedef vigra::MultiArrayView < dimension + 1 , in_ele_type >
1393 e_input_array_type ;
1394
1395 typedef vigra::MultiArrayView < dimension + 1 , out_ele_type >
1396 e_output_array_type ;
1397
1398 // note how we expand the element channel to dimension 0, to make sure
1399 // that adjacent 1D subarrays will be processed together.
1400
1401 std::vector<e_input_array_type> source ;
1402 for ( auto & e : input )
1403 source.push_back ( e.expandElements ( 0 ) ) ;
1404
1405 std::vector<e_output_array_type> target ;
1406 for ( auto & e : output )
1407 target.push_back ( e.expandElements ( 0 ) ) ;
1408
1409 // with the element-expanded data at hand, we can now delegate to
1410 // the routine below, which deals with single-channel data
1411
1412 _on_expand ( std::true_type() , // the expanded arrays are single-channel
1413 source ,
1414 target ,
1415 axis + 1 , // processing axis is now one higher
1416 handler_args ,
1417 njobs ) ;
1418 }
1419
1420 /// variant of _on_expand for stacks of arrays
1421 /// this is the single-channel overload. The arrays now hold fundamentals,
1422 /// either because that was their original data type or because they have
1423 /// been element-expanded. We end up in this routine for all processing
1424 /// of stacks and finally get to do the filtering.
1425
1426 // reimplementation of _on_expand using the new multithreading logic.
1427 // The original implementation assigned subsets of the workload to
1428 // individual worker threads. This new version assigns the same
1429 // task to all workers: fetch batches of lines and process them
1430 // until there are no more. The new version provides better locality,
1431 // granularity and code simplicity.
1432
1433 template < typename in_vt , typename out_vt >
1434 void _on_expand ( std::true_type , // is_1_channel_type() ,
1435 const std::vector < in_vt > & input ,
1436 std::vector < out_vt > & output ,
1437 int axis ,
1438 const typename stripe_handler_type::arg_type
1439 & handler_args ,
1440 int njobs ) const
1441 {
1442 // size is the size of a 1-element thick slice perpendicular to the
1443 // processing axis - in other words, the number of 1D subarrays to be
1444 // filtered. We set up a vspline::atomic holding that number; the workers
1445 // draw line indexes from it in batches of vsize, then buffer, filter and
1446 // write out the corresponding lines. Think of the vspline::atomic as a
1447 // ticket dispenser handing out tickets numbering the lines to process.
1448
1449 auto size = input[0].size() / input[0].shape ( axis ) ;
1450 vspline::atomic < std::ptrdiff_t > tickets ( size ) ;
1451
1452 // we'll use this worker code, which simply calls 'vpresent' with
1453 // the arguments it needs, including, as first argument, a pointer to
1454 // the vspline::atomic above which yields indexes of 1D subarrays. Using
1455 // a lambda with capture-by-reference here is a neat way of producing
1456 // the std::function<void()> which 'multithread' expects
1457
1458 std::function < void() > worker =
1459 [&]()
1460 {
1461 vpresent < in_vt , out_vt , stripe_handler_type >
1462 ( &tickets , &input , &output , &handler_args , axis ) ;
1463 } ;
1464
1465 // finally we use multithread() to set up njobs worker threads which
1466 // all run the same code, fetching batches of 1D subarrays until they
1467 // have all been processed.
1468
1469 vspline::multithread ( worker , njobs ) ;
1470 }
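// [illustrative sketch, not part of filter.h] The 'ticket dispenser' scheme
// used above, reduced to plain std::atomic and std::thread. vspline's own
// vspline::multithread and its range-fetching helpers implement the same
// idea with more refinement; this stripped-down version only shows the
// shape of the pattern: every worker runs the same code and repeatedly
// claims a batch of line indexes from a shared counter until it is
// exhausted. 'process_line' is a hypothetical stand-in for buffering,
// filtering and writing back one 1D subarray.

#include <atomic>
#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

void ticket_sketch ( std::ptrdiff_t nr_lines ,
                     std::ptrdiff_t batch_size ,
                     int njobs ,
                     void (*process_line) ( std::ptrdiff_t ) )
{
  std::atomic < std::ptrdiff_t > tickets ( nr_lines ) ;

  auto worker = [&]()
  {
    while ( true )
    {
      // fetch_sub returns the value before the subtraction, so the
      // batch claimed by this worker is the half-open range [low, high)
      std::ptrdiff_t high = tickets.fetch_sub ( batch_size ) ;
      if ( high <= 0 )
        break ;
      std::ptrdiff_t low = std::max ( std::ptrdiff_t ( 0 ) ,
                                      high - batch_size ) ;
      for ( std::ptrdiff_t line = low ; line < high ; ++line )
        process_line ( line ) ;
    }
  } ;

  // launch njobs threads all running the same worker code, then wait
  std::vector < std::thread > pool ;
  for ( int i = 0 ; i < njobs ; i++ )
    pool.emplace_back ( worker ) ;
  for ( auto & t : pool )
    t.join() ;
}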
1471
1472} ; // struct separable_filter
1473
1474} ; // namespace detail
1475
1476/// vspline::filter is the common entry point for filter operations
1477/// in vspline. This routine does not do any processing itself; its
1478/// purpose is to convert its arguments to 'canonical' format and
1479/// then call the actual filter code in namespace detail.
1480/// It also determines the type used for arithmetic operations.
1481/// The type specification for input and output ensures that only
1482/// arrays with the same dimensionality are accepted, and a static
1483/// assertion makes sure the numbers of channels match. Canonical
1484/// form means that the input and output value types are either
1485/// fundamental (for single-channel data) or TinyVectors of a
1486/// fundamental data type. This way, input and output are presented
1487/// in a neutral form, ignoring all specifics of in_type and
1488/// out_type, which may be TinyVectors, complex, RGBValue etc.
1489
1490template < typename in_type ,
1491 typename out_type ,
1492 unsigned int D ,
1493 class filter_type ,
1494 typename ... types >
1495void filter ( const vigra::MultiArrayView < D , in_type > & input ,
1496 vigra::MultiArrayView < D , out_type > & output ,
1497 types ... args )
1498{
1499 // find out the elementary (fundamental) type of in_type and out_type
1500 // by using vigra's ExpandElementResult mechanism.
1501
1502 typedef typename vigra::ExpandElementResult < in_type >
1503 :: type in_ele_type ;
1504
1505 typedef typename vigra::ExpandElementResult < out_type >
1506 :: type out_ele_type ;
1507
1508 // get the number of channels and make sure it's consistent
1509
1510 enum { channels = vigra::ExpandElementResult < in_type > :: size } ;
1511
1512 static_assert ( channels
1513 == vigra::ExpandElementResult < out_type > :: size ,
1514 "separable_filter: input and output data type must have the same number of channels" ) ;
1515
1516 // produce the canonical types for both data types and arrays
1517
1518 typedef canonical_type < in_type > canonical_in_value_type ;
1519 typedef vigra::MultiArrayView < D , canonical_in_value_type > cn_in_type ;
1520
1521 typedef canonical_type < out_type > canonical_out_value_type ;
1522 typedef vigra::MultiArrayView < D , canonical_out_value_type > cn_out_type ;
1523
1524 // call separable_filter with arrays reinterpreted as canonical types,
1525 // and all other arguments unchecked and unchanged.
1526
1527 detail::separable_filter < cn_in_type ,
1528 cn_out_type ,
1529 filter_type >()
1530 ( reinterpret_cast < const cn_in_type & > ( input ) ,
1531 reinterpret_cast < cn_out_type & > ( output ) ,
1532 args ... ) ;
1533}
1534
1535} ; // namespace vspline
1536
1537#endif // VSPLINE_FILTER_H
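// [illustrative sketch, not part of filter.h] What 'canonical form' means
// for a few concrete pixel types, assuming vspline::canonical_type behaves
// as described in the comment on vspline::filter above: single-channel
// types map to their fundamental type, multi-channel types map to a
// TinyVector of the fundamental type. The umbrella include is an
// assumption about how the header-only library is installed.

#include <vspline/vspline.h>
#include <vigra/rgbvalue.hxx>
#include <type_traits>

void canonical_sketch()
{
  static_assert (
    std::is_same < vspline::canonical_type < float > , float > :: value ,
    "single-channel data stay fundamental" ) ;

  static_assert (
    std::is_same < vspline::canonical_type < vigra::RGBValue < float > > ,
                   vigra::TinyVector < float , 3 > > :: value ,
    "multi-channel data become TinyVectors of the fundamental type" ) ;
}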