/**********************************************************************/
/*                                                                    */
/*    vspline - a set of generic tools for creation and evaluation   */
/*              of uniform b-splines                                  */
/*                                                                    */
/*    Copyright 2015 - 2023 by Kay F. Jahnke                          */
/*                                                                    */
/*    The git repository for this software is at                      */
/*                                                                    */
/*    https://bitbucket.org/kfj/vspline                               */
/*                                                                    */
/*    Please direct questions, bug reports, and contributions to      */
/*                                                                    */
/*    kfjahnke+vspline@gmail.com                                      */
/*                                                                    */
/*    Permission is hereby granted, free of charge, to any person     */
/*    obtaining a copy of this software and associated documentation  */
/*    files (the "Software"), to deal in the Software without         */
/*    restriction, including without limitation the rights to use,    */
/*    copy, modify, merge, publish, distribute, sublicense, and/or    */
/*    sell copies of the Software, and to permit persons to whom the  */
/*    Software is furnished to do so, subject to the following        */
/*    conditions:                                                     */
/*                                                                    */
/*    The above copyright notice and this permission notice shall be  */
/*    included in all copies or substantial portions of the           */
/*    Software.                                                       */
/*                                                                    */
/*    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND  */
/*    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES */
/*    OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND        */
/*    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT     */
/*    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,    */
/*    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING    */
/*    FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR   */
/*    OTHER DEALINGS IN THE SOFTWARE.                                 */
/*                                                                    */
/**********************************************************************/

/*! \file interleave.h

    \brief Implementation of 'bunch' and 'fluff'

    The two function templates 'bunch' and 'fluff' provide code to
    access interleaved memory holding 'xel' data: stuff like pixels,
    or coordinates - data types which consist of several equally-typed
    fundamentals. 'bunch' fetches data from interleaved memory and
    deposits them in a set of vectors, and 'fluff' does the reverse
    operation. There are single-channel variants routing to the
    more efficient simple load/store operation. Strided data will
    be handled correctly, but the specialized library code which
    realizes the access as a load/shuffle or shuffle/store does
    require unstrided data, so it will only be used if the stride
    is one (measured in 'xel' units) - strides larger than one will
    be routed to the less specialized code.

    De/interleaving is a common operation, and speeding it up
    usually pays off. The most basic approach used here is 'goading':
    the memory access is coded as a small loop, hoping the compiler
    will 'get it' and autovectorize the operation. Mileage will vary.
    'One step up' is the use of 'regular gather/scatter' - a gather
    or scatter operation with fixed indices. This may still route to
    'goading' code if the current ISA does not provide gather/scatter.
    The best performance will usually arise from routing to dedicated
    de/interleaving code, like Vc's InterleavedMemoryWrapper or
    highway's StoreInterleaved function templates.

    Because the access to interleaved memory is a recognizably
    separate operation, I have factored out the code to this header.
    The code is used extensively by wielding.h.
*/

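// A quick usage sketch (illustration only - the concrete types and the
// vector width are assumptions, not mandated by this header): with a
// simdized vector width of eight, 'bunch' deinterleaves eight RGB pixels
// into three vectors of eight floats each, and 'fluff' writes them back:
//
//   typedef vigra::TinyVector < float , 3 > pixel_type ;
//   typedef vspline::simdized_type < float , 8 > float_v ;
//   typedef vigra::TinyVector < float_v , 3 > pixel_v ;
//
//   void example ( pixel_type * buffer )    // interleaved RGB data
//   {
//     pixel_v px ;
//     wielding::bunch ( buffer , px , 1 ) ;  // deinterleave, stride 1
//     px [ 0 ] += 1.0f ;                     // modify the 'red' vector
//     wielding::fluff ( px , buffer , 1 ) ;  // re-interleave
//   }
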
#ifndef INTERLEAVE_H
#define INTERLEAVE_H

#ifdef USE_HWY

// if we have highway, we can use some of the specialized functions
// to access interleaved memory.

#include "hwy_simd_type.h"

namespace vspline
{
  using hwy::HWY_NAMESPACE::StoreInterleaved2 ;
  using hwy::HWY_NAMESPACE::StoreInterleaved3 ;
  using hwy::HWY_NAMESPACE::StoreInterleaved4 ;
  using hwy::HWY_NAMESPACE::LoadInterleaved2 ;
  using hwy::HWY_NAMESPACE::LoadInterleaved3 ;
  using hwy::HWY_NAMESPACE::LoadInterleaved4 ;
} ;

#endif

namespace wielding
{

typedef int ic_type ;

#ifdef USE_VC

// if we have Vc, we'll use Vc::InterleavedMemoryWrapper if possible.
// This takes some coding effort to get the routing right.

namespace detail
{

// Here we have some collateral code to use Vc's InterleavedMemoryWrapper.
// This is a specialized way of accessing interleaved but unstrided data,
// which uses several SIMD loads, then reshuffles the data. This should
// be quicker than using a set of gather operations.

// fetch of interleaved, but unstrided data located at _data
// into a TinyVector of vspline::simdized_types using InterleavedMemoryWrapper.
// uses SimdArrays containing K full hardware SIMD Vectors

template < typename T , size_t N , size_t K , size_t ... seq >
void fetch ( vigra::TinyVector
             < vspline::simdized_type < T , K * Vc::Vector<T>::size() > , N > & v ,
             const vigra::TinyVector < T , N > * _data ,
             const size_t & sz ,
             Vc::index_sequence < seq ... > )
{
  const Vc::InterleavedMemoryWrapper < const vigra::TinyVector < T , N > ,
                                       Vc::Vector<T> > data ( _data ) ;

  // as_v1_type is a type holding K Vc::Vector<T> in a TinyVector.
  // we reinterpret the incoming reference to have as_v1_type
  // as value_type - instead of an equally-sized SimdArray. With
  // this interpretation of the data we can use the
  // InterleavedMemoryWrapper, which operates on Vc::Vectors
  // only.

  // KFJ 2018-02-20 given VS as the size of a Vc::Vector<T>, I had initially
  // coded as if a SimdArray<T,VS*K> had a size of VS*K, so just as much as
  // K Vc::Vector<T> occupy. This is not necessarily so, the SimdArray may
  // be larger. Hence this additional bit of size arithmetics to make the
  // reinterpret_cast below succeed for all K, which calculates the number
  // of Vc::Vectors, nv, which occupy the same space as the SimdArray

  enum { nv =   sizeof ( vspline::simdized_type < T , K * Vc::Vector < T > :: size() > )
              / sizeof ( Vc::Vector < T > ) } ;
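
  // a worked numeric example (hypothetical figures, assuming
  // Vc::Vector<float> has eight lanes): with T = float and K = 3,
  // the SimdArray holds 24 floats, but Vc may pad it to a larger
  // size - if sizeof corresponds to 32 floats, nv comes out as 4,
  // so the cast to as_vn_type below covers the SimdArray's full
  // extent, not just its 24 'payload' lanes.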

  typedef typename vigra::TinyVector < Vc::Vector < T > , nv > as_v1_type ;
  typedef typename vigra::TinyVector < as_v1_type , N > as_vn_type ;

  as_vn_type & as_vn = reinterpret_cast < as_vn_type & > ( v ) ;

  // we fill the SimdArrays in as_vn round-robin. Note the use of
  // Vc::tie - this makes the transition effortless.

  for ( size_t k = 0 ; k < K ; k++ )
  {
    Vc::tie ( as_vn [ seq ] [ k ] ... )
      = ( data [ sz + k * Vc::Vector<T>::size() ] ) ;
  }
}

template < typename T , size_t N , size_t K , size_t ... seq >
void stash ( const vigra::TinyVector
             < vspline::simdized_type < T , K * Vc::Vector<T>::size() > , N > & v ,
             vigra::TinyVector < T , N > * _data ,
             const size_t & sz ,
             Vc::index_sequence < seq ... > )
{
  Vc::InterleavedMemoryWrapper < vigra::TinyVector < T , N > ,
                                 Vc::Vector<T> > data ( _data ) ;

  // we reinterpret the incoming reference to have as_v1_type
  // as value_type, just as in 'fetch' above - see the comment there
  // on the size arithmetics which produce nv.

  enum { nv =   sizeof ( vspline::simdized_type < T , K * Vc::Vector < T > :: size() > )
              / sizeof ( Vc::Vector < T > ) } ;

  typedef typename vigra::TinyVector < Vc::Vector < T > , nv > as_v1_type ;
  typedef typename vigra::TinyVector < as_v1_type , N > as_vn_type ;

  const as_vn_type & as_vn = reinterpret_cast < const as_vn_type & > ( v ) ;

  // we unpack the SimdArrays in as_vn round-robin. Note, again, the use
  // of Vc::tie - I found no other way to assign to data[...] at all.

  for ( size_t k = 0 ; k < K ; k++ )
  {
    data [ sz + k * Vc::Vector<T>::size() ]
      = Vc::tie ( as_vn [ seq ] [ k ] ... ) ;
  }
}

} ; // end of namespace detail

// here we have the versions of bunch and fluff using specialized
// Vc operations to access the buffer. These routines take Vc data
// types, and they are only present if USE_VC is defined at all.
// Further down we have less specific signatures which will be chosen
// if either Vc is not used at all or if the data types passed are
// not Vc types.

/// bunch picks up data from interleaved, strided memory and stores
/// them in a data type representing a package of vector data.

/// The first overload of 'bunch' uses a gather operation to obtain
/// the data from memory. This overload is used if the source data
/// are strided and are therefore not contiguous in memory. It's
/// also used if unstrided data are multi-channel and the vector width
/// is not a multiple of the hardware vector width, because I haven't
/// fully implemented using Vc::InterleavedMemoryWrapper for SimdArrays.
/// This first routine can be used for all situations; the two overloads
/// below are optimizations, increasing performance for specific
/// cases.

template < typename ele_type , int chn , std::size_t vsz >
void bunch ( const vigra::TinyVector < ele_type , chn > * const & src ,
             vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , chn > & trg ,
             const ic_type & stride )
{
  typedef typename vspline::vc_simd_type < ele_type , vsz > :: index_type index_type ;
  index_type ix = index_type::IndexesFromZero() * stride * chn ;
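
  // e.g. for three-channel data with a stride of two, ix comes out
  // as ( 0 , 6 , 12 , ... ) - the gather below then picks channel d
  // from every second xel, starting at the d-th fundamental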

  for ( int d = 0 ; d < chn ; d++ )
    trg[d].gather ( ((ele_type*)src) + d , ix ) ;
}

/// overload for unstrided single-channel data.
/// here we can use a SIMD load; the implementation is very
/// straightforward, and the performance gain is large.

template < typename ele_type , std::size_t vsz >
void bunch ( const vigra::TinyVector < ele_type , 1 > * const & src ,
             vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , 1 > & trg ,
             std::true_type
           )
{
  trg[0].load ( (const ele_type*) src ) ;
}

/// the third overload, which is only enabled if vsz is a multiple
/// of the SIMD vector capacity, delegates to detail::fetch, which
/// handles the data acquisition with a Vc::InterleavedMemoryWrapper.
/// This overload is only for unstrided multichannel data.

template < typename ele_type , int chn , std::size_t vsz >
typename std::enable_if < vsz % Vc::Vector<ele_type>::size() == 0 > :: type
bunch ( const vigra::TinyVector < ele_type , chn > * const & src ,
        vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , chn > & trg ,
        std::false_type
      )
{
  enum { K = vsz / Vc::Vector<ele_type>::size() } ;
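
  // e.g. with AVX2 and ele_type float, Vc::Vector<float>::size()
  // is eight, so a vsz of 32 yields K = 4: detail::fetch processes
  // four full hardware vectors per channel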

  detail::fetch < ele_type , chn , K >
    ( trg , src , 0 , Vc::make_index_sequence<chn>() ) ;
}

/// reverse operation: a package of vectorized data is written to
/// interleaved, strided memory. We have the same sequence
/// of overloads as for 'bunch'.

template < typename ele_type , int chn , std::size_t vsz >
void fluff ( const vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , chn > & src ,
             vigra::TinyVector < ele_type , chn > * const & trg ,
             const ic_type & stride )
{
  typedef typename vspline::vc_simd_type < ele_type , vsz > :: index_type index_type ;
  index_type ix = index_type::IndexesFromZero() * stride * chn ;

  for ( int d = 0 ; d < chn ; d++ )
    src[d].scatter ( ((ele_type*)trg) + d , ix ) ;
}

template < typename ele_type , std::size_t vsz >
void fluff ( const vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , 1 > & src ,
             vigra::TinyVector < ele_type , 1 > * const & trg ,
             std::true_type
           )
{
  src[0].store ( (ele_type*) trg ) ;
}

template < typename ele_type , int chn , std::size_t vsz >
typename std::enable_if < vsz % Vc::Vector<ele_type>::size() == 0 > :: type
fluff ( const vigra::TinyVector < vspline::vc_simd_type < ele_type , vsz > , chn > & src ,
        vigra::TinyVector < ele_type , chn > * const & trg ,
        std::false_type
      )
{
  enum { K = vsz / Vc::Vector<ele_type>::size() } ;

  detail::stash < ele_type , chn , K >
    ( src , trg , 0 , Vc::make_index_sequence<chn>() ) ;
}

#endif // USE_VC

// when not processing Vc data, bunch and fluff use rgather/rscatter
// if available, and otherwise 'goading' for buffering and unbuffering.
// If the data are single-channel and unstrided, SIMD load/store
// operations are used, which are fastest:

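// as an illustration of 'goading' (a sketch, not the code used here -
// the SIMD types' rgather/rscatter members encapsulate the equivalent
// operation): the fallback spells out the strided access per channel
// as a plain loop and leaves vectorization to the compiler:
//
//   for ( std::size_t e = 0 ; e < vsz ; e++ )
//     trg [ ch ] [ e ] = p_src [ e * estride ] ;
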
// data are unstrided and single-channel, issue a SIMD load operation

template < typename target_type , typename ele_type >
void bunch ( const vigra::TinyVector < ele_type , 1 > * const & src ,
             target_type & trg ,
             std::true_type
           )
{
  trg[0].load ( reinterpret_cast < const ele_type * const > ( src ) ) ;
}

// data are unstrided and single-channel, issue a SIMD store

template < typename ele_type , typename source_type >
void fluff ( const source_type & src ,
             vigra::TinyVector < ele_type , 1 > * const & trg ,
             std::true_type
           )
{
  src[0].store ( reinterpret_cast < ele_type * const > ( trg ) ) ;
}

template < typename target_type , typename ele_type , int chn >
void _bunch ( const vigra::TinyVector < ele_type , chn > * const & src ,
              target_type & trg ,
              const ic_type & stride )
{
  const ele_type * p_src = reinterpret_cast < const ele_type * const > ( src ) ;
  std::size_t estride = stride * chn ;
  for ( int ch = 0 ; ch < chn ; ch++ )
  {
    trg[ch].rgather ( p_src , estride ) ;
    ++p_src ;
  }
}

template < typename ele_type , typename source_type , int chn >
void _fluff ( const source_type & src ,
              vigra::TinyVector < ele_type , chn > * const & trg ,
              const ic_type & stride
            )
{
  ele_type * p_trg = reinterpret_cast < ele_type * const > ( trg ) ;
  std::size_t estride = stride * chn ;
  for ( int ch = 0 ; ch < chn ; ch++ )
  {
    src[ch].rscatter ( p_trg , estride ) ;
    ++p_trg ;
  }
}

template < typename target_type , typename ele_type , int chn >
void bunch ( const vigra::TinyVector < ele_type , chn > * const & src ,
             target_type & trg ,
             const ic_type & stride )
{
  _bunch ( src , trg , stride ) ;
}

template < typename ele_type , typename source_type , int chn >
void fluff ( const source_type & src ,
             vigra::TinyVector < ele_type , chn > * const & trg ,
             const ic_type & stride
           )
{
  _fluff ( src , trg , stride ) ;
}

#ifdef USE_HWY

// for the time being, highway can interleave 2-, 3-, and 4-channel
// data. We route the code with overloads, because there is no
// generalized code for an arbitrary number of channels.
// Code 'fluffing' xel data with more than four channels will route
// to one of the templates above, using gather/scatter or goading.

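// in the overloads below, vsz lanes are processed one hardware vector
// at a time: L() is the hardware vector's lane count, yield ( i )
// extracts the i-th hardware vector from a simdized datum, and
// take ( i , v ) deposits one. n counts lanes, i counts vectors.
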
template < typename T , std::size_t vsz >
void fluff ( const vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 2 > & src ,
             vigra::TinyVector < T , 2 > * const & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    T * p_trg = (T*) trg ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += src[0].L() )
    {
      StoreInterleaved2 ( src[0].yield ( i ) ,
                          src[1].yield ( i ) ,
                          D() ,
                          p_trg ) ;
      p_trg += 2 * src[0].L() ;
    }
  }
  else
  {
    _fluff ( src , trg , stride ) ;
  }
}

template < typename T , std::size_t vsz >
void fluff ( const vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 3 > & src ,
             vigra::TinyVector < T , 3 > * const & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    T * p_trg = (T*) trg ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += src[0].L() )
    {
      StoreInterleaved3 ( src[0].yield ( i ) ,
                          src[1].yield ( i ) ,
                          src[2].yield ( i ) ,
                          D() ,
                          p_trg ) ;
      p_trg += 3 * src[0].L() ;
    }
  }
  else
  {
    _fluff ( src , trg , stride ) ;
  }
}

template < typename T , std::size_t vsz >
void fluff ( const vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 4 > & src ,
             vigra::TinyVector < T , 4 > * const & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    T * p_trg = (T*) trg ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += src[0].L() )
    {
      StoreInterleaved4 ( src[0].yield ( i ) ,
                          src[1].yield ( i ) ,
                          src[2].yield ( i ) ,
                          src[3].yield ( i ) ,
                          D() ,
                          p_trg ) ;
      p_trg += 4 * src[0].L() ;
    }
  }
  else
  {
    _fluff ( src , trg , stride ) ;
  }
}

template < typename T , std::size_t vsz >
void bunch ( const vigra::TinyVector < T , 2 > * const & src ,
             vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 2 > & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: vec_t vec_t ;
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    const T * p_src = (const T*) src ;
    vec_t c0 , c1 ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += trg[0].L() )
    {
      LoadInterleaved2 ( D() , p_src , c0 , c1 ) ;
      trg[0].take ( i , c0 ) ;
      trg[1].take ( i , c1 ) ;
      p_src += 2 * trg[0].L() ;
    }
  }
  else
  {
    _bunch ( src , trg , stride ) ;
  }
}

template < typename T , std::size_t vsz >
void bunch ( const vigra::TinyVector < T , 3 > * const & src ,
             vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 3 > & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: vec_t vec_t ;
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    const T * p_src = (const T*) src ;
    vec_t c0 , c1 , c2 ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += trg[0].L() )
    {
      LoadInterleaved3 ( D() , p_src , c0 , c1 , c2 ) ;
      trg[0].take ( i , c0 ) ;
      trg[1].take ( i , c1 ) ;
      trg[2].take ( i , c2 ) ;
      p_src += 3 * trg[0].L() ;
    }
  }
  else
  {
    _bunch ( src , trg , stride ) ;
  }
}

template < typename T , std::size_t vsz >
void bunch ( const vigra::TinyVector < T , 4 > * const & src ,
             vigra::TinyVector
               < vspline::hwy_simd_type < T , vsz > , 4 > & trg ,
             const ic_type & stride )
{
  typedef typename vspline::hwy_simd_type < T , vsz > :: vec_t vec_t ;
  typedef typename vspline::hwy_simd_type < T , vsz > :: D D ;
  if ( stride == 1 )
  {
    const T * p_src = (const T*) src ;
    vec_t c0 , c1 , c2 , c3 ;
    for ( std::size_t n = 0 , i = 0 ; n < vsz ; ++i , n += trg[0].L() )
    {
      LoadInterleaved4 ( D() , p_src , c0 , c1 , c2 , c3 ) ;
      trg[0].take ( i , c0 ) ;
      trg[1].take ( i , c1 ) ;
      trg[2].take ( i , c2 ) ;
      trg[3].take ( i , c3 ) ;
      p_src += 4 * trg[0].L() ;
    }
  }
  else
  {
    _bunch ( src , trg , stride ) ;
  }
}

#endif // USE_HWY

} ; // namespace wielding

#endif // #ifndef INTERLEAVE_H