vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
hwy_simd_type.h
Go to the documentation of this file.
1/************************************************************************/
2/* */
3/* vspline - a set of generic tools for creation and evaluation */
4/* of uniform b-splines */
5/* */
6/* Copyright 2015 - 2023 by Kay F. Jahnke */
7/* */
8/* The git repository for this software is at */
9/* */
10/* https://bitbucket.org/kfj/vspline */
11/* */
12/* Please direct questions, bug reports, and contributions to */
13/* */
14/* kfjahnke+vspline@gmail.com */
15/* */
16/* Permission is hereby granted, free of charge, to any person */
17/* obtaining a copy of this software and associated documentation */
18/* files (the "Software"), to deal in the Software without */
19/* restriction, including without limitation the rights to use, */
20/* copy, modify, merge, publish, distribute, sublicense, and/or */
21/* sell copies of the Software, and to permit persons to whom the */
22/* Software is furnished to do so, subject to the following */
23/* conditions: */
24/* */
25/* The above copyright notice and this permission notice shall be */
26/* included in all copies or substantial portions of the */
27/* Software. */
28/* */
29/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND */
30/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES */
31/* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND */
32/* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT */
33/* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, */
34/* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING */
35/* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR */
36/* OTHER DEALINGS IN THE SOFTWARE. */
37/* */
38/************************************************************************/
39
40/*! \file hwy_simd_type.h
41
42 \brief SIMD type using highway
43
44 This is a new, tentative implementation of vspline::simd_type
45 using highway (https://github.com/google/highway). highway
46 provides code to work with hardware SIMD in a portable way,
47 but it's still very close to the hardware, and does not
48 provide support for vectors larger than the hardware's register
49 size. vspline::simd_type, on the other hand, is a fixed-size
50 construct which may well exceed the hardware size. The 'goading'
51 implementation of vspline::simd_type uses small loops over a
52 POD C vector to implement the functionality - hoping that the
53 compiler will 'get it' and autovectorize the code. This
54 implementation is also based on a POD C vector, but the
55 functionality is implemented (wherever this seems feasible or
56 sensible) by using highway SIMD code. In a way it's enforcing
57 by explicit code what 'ordinary' vspline::simd_type hopes to
58 get from the compiler via autovectorization, and since the
59 compiler's 'insight' into the code is limited, the explicit
60 approach tends to come out on top, producing SIMD binary more
61 often (and in more efficient variants) than the goading
62 approach.
63
64 Some of the functionality is implemented by simple goading
65 routines. This is either because this is deemed acceptable
66 (e.g. printing a hwy_simd_type to the console is not in any way
67 time critical, nor can it benefit from SIMD code) - or
68 because I haven't yet tackled writing 'proper' SIMD code for
69 the functionality in question - for example, type conversions
70 are still done with goading. This state of affairs also reflects
71 my implementation strategy: I started out with the 'ordinary'
72 vspline::simd_type and replaced more and more of the goading
73 code by 'proper' SIMD code.
74
75 'Backing' the SIMD vectors like that is only one way of handling
76 the SIMD types in the background, but has the advantage of, first,
77 being compatible with the goading code (so one can 'go over the
78 memory' or 'fall back to scalar') and, second, being general, so
79 that both sized and sizeless vectors can be implemented with the
80 same code. The disadvantage is that the compiler may not find all
81 opportunities for keeping the SIMD code 'afloat' in a set of
82 registers, but may at times resort to actually creating and using
83 the underlying POD C array, rather than optimizing it away.
84
85 Nevertheless, this implementation seems to tend towards 'proper'
86 SIMD code rather than towards the goading implementation - first
87 tests showed the code took up to 30% longer than code done with
88 Vc on an AVX2 machine, whereas the goading code often takes twice
89 or thrice as long, so it seems to be a path worth persuing.
90*/
91
92#ifndef VSPLINE_HWY_SIMD_TYPE_H
93#define VSPLINE_HWY_SIMD_TYPE_H
94
95#include <iostream>
96#include <functional>
97#include <type_traits>
98#include <assert.h>
99
100#ifndef HWY_SIMD_TYPE_H
101#define HWY_SIMD_TYPE_H
102
103#include <hwy/highway.h>
104#include <hwy/contrib/math/math-inl.h>
105#include <hwy/aligned_allocator.h>
106#include <hwy/print-inl.h>
107#ifndef HWY_HAVE_ATAN2
108#include "hwy_atan2.h"
109#endif
110
111// lux uses it's own dispatching mechanism, but this code should also
112// cooperate with highway's multi-platform strategy
113
115
116namespace HWY_NAMESPACE {
117
118namespace hn = hwy::HWY_NAMESPACE ;
119
120/// mask type for hwy_simd_type. This is a type which holds a set of masks
121/// stored in uint8_t, as the highway mask storing function provides.
122/// So this type is memory-backed, just like hwy_simd_type. Template arguments
123/// are the corresponding hwy_simd_type's tag type and it's lane count.
124/// highway is strict about which vectors and masks can interoperate,
125/// and only allows 'direct' interoperation if the types involved
126/// 'match' in size. Masks pertaining to vectors of differently-sized T
127/// aren't directly interoperable because they don't have the same
128/// lane count. One requires k masks of one type and k * 2 ^ i of the
129/// other. Here, we follow a different paradigm: The top-level objects
130/// we're dealing with have a fixed 'vsize', the number of lanes they
131/// hold. This should be a power of two. The paradigm is that objects
132/// with equal vsize should be interoperable, no matter what lane count
133/// the hardware vectors have which are used to implement their
134/// functionality. This makes user code simpler: users pick a vsize
135/// which they use for a body of code, all vector-like objects use the
136/// common vsize, and the implementation of the vector-like objects
137/// takes care of 'rolling out' the operations to hardware vectors.
138/// At times this produces what I call 'friction' - if the underlying
139/// hardware vectors and masks are not directly compatible, code is
140/// needed to interoperate them, and this code can at times be slow.
141/// So the recommendation for users is to avoid 'friction' by avoiding
142/// mixing differently-sized types, but with the given paradigm, this
143/// is a matter of performance tuning rather than imposing constraints
144/// on code structure. Some of the 'friction' might be mitigated by
145/// additional code using highway's up- and down-scaling routines,
146/// but for now the code rather uses 'goading' with small loops over
147/// the backing memory, relying on the compiler to handle this
148/// efficiently.
149
150template < typename D , std::size_t _vsize >
151struct HWY_ALIGN mchunk_t
152{
153 typedef typename hn::TFromD < D > T ;
154 typedef typename hn::Vec < D > vec_t ;
155 typedef T value_type ;
156
157 static const std::size_t vsize = _vsize ;
158
159 // pessimistic estimate: we can be certain that vsize bytes
160 // will suffice: that would be enough even if there were only
161 // one lane per vector. The advantage of this size is that we
162 // can avoid some calculations to figure out offsets, the
163 // disadvantage is quite high memory use - up to eight times
164 // as much as a tightly packed set of bits would consume.
165 // This is less of an issue than one might think, though,
166 // because the compiler may be able to optimize this memory
167 // away, so it doesn't become manifest.
168
169 static const std::size_t mask_bytes = vsize ;
170
171private:
172
173 HWY_ALIGN uint8_t inner [ mask_bytes ] ;
174
175public:
176
177// if we're not using sizeless vectors, the number of lanes is constexpr,
178// which can help the compiler produce more efficient binary.
179
180#ifdef HWY_HAVE_SCALABLE
181 std::size_t L() const
182 {
183 return Lanes ( D() ) ;
184 }
185#else
186 static constexpr std::size_t L()
187 {
188 return Lanes ( D() ) ;
189 }
190#endif
191
192 // direct access to the data for cheating
193
194 uint8_t * data()
195 {
196 return inner ;
197 }
198
199 const uint8_t * data() const
200 {
201 return inner ;
202 }
203
204 // The 'underlying' hardware mask type. This is entirely determined
205 // by the tag type, D, of the vector this mask pertains to.
206
207 typedef hn::Mask < D > vmask_type ;
208
209 // access to the memory 'as' masks
210 // we load the mask from the memory position which would apply
211 // if there were only one lane per vector. This will certainly
212 // be a valid choice, and as long as we're consistent and don't
213 // mind wasting some space, it's okay. We might even store one
214 // byte per lane. The gaps between i/o positions simply remain
215 // unused and even undefined - we 'waste' memory for the sake of
216 // making access code as efficient as possible.
217
218 vmask_type yield ( const std::size_t & i ) const
219 {
220 return hn::LoadMaskBits( D() , inner + i * L() ) ;
221 }
222
223 void take ( const std::size_t & i , const vmask_type & rhs )
224 {
225 hn::StoreMaskBits( D() , rhs , inner + i * L() ) ;
226 }
227
228 // SaveToBytes 'offloads' the mask to memory holding uint8_t,
229 // so that each bit in the 'flattened' mask corresponds to
230 // one byte in the memory. true mask bits set the corresponding
231 // memory byte to 0xFF, false mask bits set it to 0x00.
232
233 void SaveToBytes ( uint8_t * p_trg ) const
234 {
235 std::size_t n_lanes = Lanes ( D() ) ;
236 std::size_t n_mask = vsize / n_lanes ;
237 for ( std::size_t i = 0 ; i < n_mask ; i++ )
238 {
239 std::size_t ofs = i * n_lanes ;
240 uint8_t bit = 1 ;
241 for ( std::size_t k = 0 ; k < n_lanes ; k++ )
242 {
243 std::size_t byte = k / 8 ;
244 if ( inner [ ofs + byte ] & bit )
245 *p_trg = 0xff ;
246 else
247 *p_trg = 0x00 ;
248 ++p_trg ;
249 bit <<= 1 ;
250 if ( bit == 0 )
251 bit = 1 ;
252 }
253 }
254 }
255
256 // reverse operation: this loads the mask from bytes in memory.
257
258 void LoadFromBytes ( const uint8_t * p_trg )
259 {
260 std::size_t n_lanes = Lanes ( D() ) ;
261 std::size_t n_mask = vsize / n_lanes ;
262 for ( std::size_t i = 0 ; i < n_mask ; i++ )
263 {
264 std::size_t ofs = i * n_lanes ;
265 uint8_t bit = 1 ;
266 for ( std::size_t k = 0 ; k < n_lanes ; k++ )
267 {
268 std::size_t byte = k / 8 ;
269 if ( *p_trg )
270 inner [ ofs + byte ] |= bit ;
271 else
272 inner [ ofs + byte ] &= ~bit ;
273 ++p_trg ;
274 bit <<= 1 ;
275 if ( bit == 0 )
276 bit = 1 ;
277 }
278 }
279 }
280
281 // transfer moves masking information from one mchunk_t to another.
282 // If both mchunk_t have compatible backing memory, this routine is
283 // futile - the operation can be achieved by simply copying the
284 // backing memory (inner) - but otherwise, using this routine works
285 // like first using SaveToBytes, then LoadFromBytes - but it does so
286 // without needing the buffer.
287
288 template < typename D1 , typename D2 >
289 void transfer ( const mchunk_t < D1 , vsize > & in_mask ,
290 mchunk_t < D2 , vsize > & out_mask )
291 {
292 const std::size_t in_n_lanes = Lanes ( D1() ) ;
293 const std::size_t out_n_lanes = Lanes ( D2() ) ;
294 std::size_t in_m = 0 ;
295 std::size_t out_m = 0 ;
296 std::size_t in_ofs = 0 ;
297 std::size_t out_ofs = 0 ;
298 std::size_t in_l = 0 ;
299 std::size_t out_l = 0 ;
300 uint8_t in_bit = 1 ;
301 uint8_t out_bit = 1 ;
302
303 const uint8_t * p_in = in_mask.data() ;
304 uint8_t * p_out = out_mask.data() ;
305
306 for ( std::size_t e = 0 ; e < vsize ; e++ )
307 {
308 if ( p_in [ in_ofs ] & in_bit )
309 p_out [ out_ofs ] |= out_bit ;
310 else
311 p_out [ out_ofs ] &= ~out_bit ;
312
313 if ( ++in_l == in_n_lanes )
314 {
315 in_l = 0 ;
316 ++in_m ;
317 in_ofs = in_m * in_n_lanes ;
318 in_bit = 1 ;
319 }
320 else
321 {
322 in_bit <<= 1 ;
323 if ( in_bit == 0 )
324 {
325 ++in_ofs ;
326 in_bit = 1 ;
327 }
328 }
329
330 if ( ++out_l == out_n_lanes )
331 {
332 out_l = 0 ;
333 ++out_m ;
334 out_ofs = out_m * out_n_lanes ;
335 out_bit = 1 ;
336 }
337 else
338 {
339 out_bit <<= 1 ;
340 if ( out_bit == 0 )
341 {
342 ++out_ofs ;
343 out_bit = 1 ;
344 }
345 }
346 }
347 }
348
349 // mask construction and assignment
350
351 mchunk_t() = default ;
352 mchunk_t ( const mchunk_t & ) = default ;
353 mchunk_t & operator= ( const mchunk_t & ) = default ;
354
355 // create an all-true or all-false mask
356
357 mchunk_t ( bool v )
358 {
359 vmask_type m = FirstN ( D() , v ? vsize : 0 ) ;
360 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
361 {
362 take ( i , m ) ;
363 }
364 }
365
366 // augmented assignments are defined to allow boolean arithmetic
367 // with masks. The augmented assigments are subsequently used to
368 // define the corresponding binary operators.
369
370 #define OPEQ_FUNC(OP,OPFN) \
371 mchunk_t & OP ( const mchunk_t & rhs ) \
372 { \
373 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
374 take ( i , OPFN ( yield ( i ) , rhs.yield ( i ) ) ) ; \
375 return *this ; \
376 }
377
378 OPEQ_FUNC(operator&=,hn::And)
379 OPEQ_FUNC(operator|=,hn::Or)
380 OPEQ_FUNC(operator^=,hn::Xor)
381
382 #undef OPEQ_FUNC
383
384 // assignment from another mchunk_t of the same type is covered.
385 // assignment from equally-sized, but differently-typed mchunk_t
386 // needs more logic: equal vsize does not mean equal use of the
387 // backing memory. this is only equal if sizeof(T) is the same,
388 // so there are two variants:
389 // the first one is used if the backing memory has the same layout,
390 // the second one if the layout differs. The dispatch is below.
391
392private:
393
394 template < typename D2 >
395 void _assign ( const mchunk_t < D2 , vsize > & rhs , std::true_type )
396 {
397 // identical layout. We can simply pretend rhs is of equal type
398
399 const auto & trhs ( reinterpret_cast < const mchunk_t & > ( rhs ) ) ;
400 *this = trhs ;
401 }
402
403 template < typename D2 >
404 void _assign ( const mchunk_t < D2 , vsize > & rhs , std::false_type )
405 {
406 // different layout. This requires 'transfer' and is potentially
407 // slow, so best avoided - but it provides the logic to make
408 // objects of equal vsize interoperable.
409
410 transfer ( rhs , *this ) ;
411 }
412
413public:
414
415 // assignment from an mchunk_t which represents masks of a different
416 // type. This top-level routine checks whether the masks pertain
417 // to equally-sized data types, in which case the 'backing' memory
418 // has identical layout. It then dispatches to the appropriate
419 // variant of _assign, above
420
421 template < typename D2 >
422 mchunk_t & operator= ( const mchunk_t < D2 , vsize > & rhs )
423 {
424 typedef typename
425 std::conditional < sizeof ( T ) == sizeof ( hn::TFromD < D2 > ) ,
426 std::true_type ,
427 std::false_type > :: type eq_t ;
428
429 _assign ( rhs , eq_t() ) ;
430 return *this ;
431 }
432
433 // We use the operator= template above to produce a corresponding c'tor
434
435 template < typename D2 >
437 {
438 *this = rhs ;
439 }
440
441 // next we have the binary operators, which delegate to the
442 // augmented assignments
443
444 #define OP_FUNC(OPFUNC,OPEQ) \
445 template < typename D2 > \
446 mchunk_t OPFUNC ( const mchunk_t < D2 , vsize > & rhs ) const \
447 { \
448 mchunk_t help ( *this ) ; \
449 help OPEQ rhs ; \
450 return help ; \
451 }
452
453 OP_FUNC(operator&,&=)
454 OP_FUNC(operator|,|=)
455 OP_FUNC(operator^,^=)
456 OP_FUNC(operator&&,&=)
457 OP_FUNC(operator||,|=)
458
459 #undef OP_FUNC
460
461 // the only unary operator for masks is the inversion, user may
462 // use unary ! or ~.
463
464 #define OP_FUNC(OPFUNC,OP) \
465 mchunk_t OPFUNC() const \
466 { \
467 mchunk_t help ; \
468 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
469 help.take ( i , OP ( yield ( i ) ) ) ; \
470 return help ; \
471 }
472
473 OP_FUNC(operator!,hn::Not)
474 OP_FUNC(operator~,hn::Not)
475
476 #undef OP_FUNC
477
478 // finally, reductions for masks.
479
480 bool none_of() const
481 {
482 vmask_type help ;
483 bool result = true ;
484 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
485 {
486 help = yield ( i ) ;
487 result &= hn::AllFalse ( D() , help ) ;
488 }
489 return result ;
490 }
491
492 bool any_of() const
493 {
494 return ! none_of() ;
495 }
496
497 bool all_of() const
498 {
499 vmask_type help ;
500 bool result = true ;
501 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
502 {
503 help = yield ( i ) ;
504 result &= hn::AllTrue ( D() , help ) ;
505 }
506 return result ;
507 }
508
509 // echo the mask to a std::ostream
510
511 friend std::ostream & operator<< ( std::ostream & osr ,
512 mchunk_t it )
513 {
514 uint8_t buffer [ vsize ] ;
515 it.SaveToBytes ( buffer ) ;
516 osr << "(" ;
517 for ( std::size_t i = 0 ; i < vsize ; i++ )
518 osr << ( buffer [ i ] ? "1" : "0" ) ;
519 osr << ")" ;
520 return osr ;
521 }
522
523} ;
524
525// free versions of reductions. It's often necessary to determine whether
526// a mask is completely full or empty, or has at least some non-false
527// members. The code might be extended to test arbitrary vectors rather
528// than only masks. As it stands, to apply the functions to an
529// arbitrary vector, use a construct like 'any_of ( v == 0 )' instead of
530// 'any_of ( v )'.
531
532template < typename D , std::size_t N >
533bool any_of ( const mchunk_t<D,N> & arg )
534{
535 return arg.any_of() ;
536}
537
538template < typename D , std::size_t N >
539bool all_of ( const mchunk_t<D,N> & arg )
540{
541 return arg.all_of() ;
542}
543
544template < typename D , std::size_t N >
545bool none_of ( const mchunk_t<D,N> & arg )
546{
547 return arg.none_of() ;
548}
549
550/// class template hwy_simd_type provides a fixed-size container type for
551/// small-ish sets of fundamentals which are stored in a POD C vector.
552/// This implementation uses highway to code the loops more efficiently.
553/// It mimicks Vc::SimdArray, just like vspline::simd_type does, and
554/// The code is derived from vspline::simd_array, changing the workhorse
555/// code from simple loops to the use of highway functions.
556/// The resulting type, with it's 'container-typical' interface, slots
557/// in well with the higher-level constructs used in vspline/lux and,
558/// at the same time, 'contains' the SIMD implementation in this class,
559/// so that it's use doesn't need to be known outside.
560/// As an arithmetic type, hwy_simd_type provides many mathematical operators
561/// and some functions - most of them are realized by calling corresponding
562/// highway functions, but some (still) rely on loops, either because they
563/// aren't performance-critical or because there is no highway code to be
564/// had for the purpose. Some methods are (currently) exclusive to this
565/// class, but may be ported to other SIMD interface classes; apart from
566/// the original 'goading' class vspline::simd_type, there is also an
567/// implementation using std::simd in pv/vspline/std_simd_type.h
568/// The lane count for a hwy_simd_type in this body of code should be a
569/// power of two, and it should be at least as large as the hardware lane
570/// count of the smallest fundamental used in vectorized form. To cover
571/// all eventualities, the hardware lane count of a vector of unsigned
572/// char (uint8_t) is a good choice. This choice is to avoid that hwy_simd_type
573/// objects of small T remain partly empty when a given small vsize is
574/// chosen to cater for vectors with larger T. At times, this will lead
575/// to overly high register pressure, and the overall performance may
576/// benefit from allowing partially filled hwy_simd_type via a smaller vsize,
577/// which is feasible because hwy_simd_type uses highway vectors with
578/// CappedTag.
579
580// forward declaration of class template hwy_simd_type
581
582template < typename _value_type ,
583 std::size_t _vsize >
584struct HWY_ALIGN hwy_simd_type ;
585
586// next we have conversion functions. I decided to code them as free
587// functions, which makes formulation easier because both source and
588// target can be template arguments. Conversion with highway is quite
589// involved because there is no generic definition, instead there are
590// a bunch of conversions which highway can do (as documented in the
591// quick reference) but the set is not complete. So we have to code
592// so that the available ones will be used and the other ones are
593// relized by 'goading' - going over the backing arrays with a loop.
594// To avoid repetition, we use macros for three cases: conversion with
595// ConvertTo, DemoteTo and PromoteTo. All conversions which are not
596// explicitly coded will fall back to goading.
597
598// catch-all template for conversions, uses goading
599
600template < typename src_t , typename trg_t , std::size_t vsize >
603{
604 auto p_src = src.data() ;
605 auto p_trg = trg.data() ;
606 for ( std::size_t i = 0 ; i < vsize ; i++ )
607 p_trg[i] = p_src[i] ;
608}
609
610// the remainder of the conversions uses three highway functions
611// only, so we write macros for the three types of conversion to
612// make it easier to see the big picture
613
614#define PROMOTE(SRC,TRG) \
615template < std::size_t vsize > \
616void convert ( const hwy_simd_type < SRC , vsize > & src , \
617 hwy_simd_type < TRG , vsize > & trg ) \
618{ \
619 typedef hn::CappedTag < TRG , vsize > D ; \
620 typedef hn::Rebind < SRC , D > ud_t ; \
621\
622 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += trg.L() ) \
623 { \
624 auto underfilled = src.template dt_yield < ud_t > ( i ) ; \
625 auto promoted = hn::PromoteTo ( D() , underfilled ) ; \
626 trg.take ( i , promoted ) ; \
627 } \
628}
629
630#define CONVERT(SRC,TRG) \
631template < std::size_t vsize > \
632void convert ( const hwy_simd_type < SRC , vsize > & src , \
633 hwy_simd_type < TRG , vsize > & trg ) \
634{ \
635 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += trg.L() ) \
636 trg.take ( i , hn::ConvertTo ( hn::CappedTag < TRG , vsize > () , \
637 src.yield ( i ) ) ) ; \
638}
639
640#define DEMOTE(SRC,TRG) \
641template < std::size_t vsize > \
642void convert ( const hwy_simd_type < SRC , vsize > & src , \
643 hwy_simd_type < TRG , vsize > & trg ) \
644{ \
645 typedef hn::CappedTag < SRC , vsize > X ; \
646 typedef hn::Rebind < TRG , X > ud_t ; \
647 \
648 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += hn::Lanes ( ud_t() ) ) \
649 { \
650 const auto demoted = hn::DemoteTo ( ud_t() , src.yield ( i ) ) ; \
651 trg.template dt_take < ud_t > ( i , demoted ) ; \
652 } \
653}
654
655// I've 'scraped' the highway quick reference for the possible
656// type combinations to be used with PromoteTo, DemoteTo and
657// ConvertTo. I ignore half floats for now. Here's the 'scrape':
658
659// PromoteTo: (bf16,f32) (f16,f32) (f32,f64) (i16,i32) (i32,i64)
660// (i8,i16) (i8,i32) (u16,i32) (u16,u32) (u32,u64)
661// (u8,i16), (u8,i32) (u8,u16) (u8,u32)
662//
663// DemoteTo: (f32,bf16) (f32,f16) (f64,f32) (f64,i32) (i16,i8)
664// (i16,u8) (i32,i16) (i32,i8) (i32,u16) (i32,u8)
665// (i64,i16) (i64,i32) (i64,i8) (i64,u16) (i64,u32)
666// (i64,u8) (u16,i8) (u16,u8) (u32,i16) (u32,i8)
667// (u32,u16) (u32,u8) (u64,i16) (u64,i32) (u64,i8)
668// (u64,u16) (u64,u32) (u64,u8)
669//
670// ConvertTo: (f32,i32) (f64,i64) (i32,f32) (i64,f64)
671
672// Converted to my macros, here are the specialized functions:
673
674// PromoteTo
675
676// PROMOTE(bfloat16,float)
677// PROMOTE(float16,float)
678PROMOTE(float,double)
679PROMOTE(short,int)
680PROMOTE(int,long)
681PROMOTE(signed char,short)
682PROMOTE(signed char,int)
683PROMOTE(unsigned short,int)
684PROMOTE(unsigned short,unsigned int)
685PROMOTE(unsigned int,unsigned long)
686PROMOTE(unsigned char,short)
687PROMOTE(unsigned char,int)
688PROMOTE(unsigned char,unsigned short)
689PROMOTE(unsigned char,unsigned int)
690
691// DemoteTo
692
693// DEMOTE(float,bfloat16)
694// DEMOTE(float,float16)
695DEMOTE(double,float)
696DEMOTE(double,int)
697DEMOTE(short,signed char)
698DEMOTE(short,unsigned char)
699DEMOTE(int,short)
700DEMOTE(int,signed char)
701DEMOTE(int,unsigned short)
702DEMOTE(int,unsigned char)
703DEMOTE(long,short)
704DEMOTE(long,int)
705DEMOTE(long,signed char)
706DEMOTE(long,unsigned short)
707DEMOTE(long,unsigned int)
708DEMOTE(long,unsigned char)
709DEMOTE(unsigned short,signed char)
710DEMOTE(unsigned short,unsigned char)
711DEMOTE(unsigned int,short)
712DEMOTE(unsigned int,signed char)
713DEMOTE(unsigned int,unsigned short)
714DEMOTE(unsigned int,unsigned char)
715DEMOTE(unsigned long,short)
716DEMOTE(unsigned long,int)
717DEMOTE(unsigned long,signed char)
718DEMOTE(unsigned long,unsigned short)
719DEMOTE(unsigned long,unsigned int)
720DEMOTE(unsigned long,unsigned char)
721
722// ConvertTo
723
724CONVERT(float,int)
725CONVERT(double,long)
726CONVERT(int,float)
727CONVERT(long,double)
728
729#undef PROMOTE
730#undef DEMOTE
731#undef CONVERT
732
733// template for conversions from double to other types which are
734// not covered by the overloads above. For now we ignore half floats
735// and code as if float and double were the only fp types.
736// These conversions use three steps, which may add up to more time
737// than 'goading' would use, so this should be TODO tested.
738
739template < typename T , std::size_t vsize >
742{
743 static_assert ( std::is_integral < T > :: value , "int only...!" ) ;
745 hwy_simd_type < int , vsize > i_src = l_src ;
746 convert ( i_src , trg ) ;
747}
748
749template < typename T , std::size_t vsize >
752{
753 static_assert ( std::is_integral < T > :: value , "int only...!" ) ;
755 hwy_simd_type < long , vsize > l_src = i_src ;
756 convert ( l_src , trg ) ;
757}
758
759// conversion to and from vspline::simd_type of equal T
760
761template < typename T , std::size_t vsize >
764{
765 src.store ( trg.data() ) ;
766}
767
768template < typename T , std::size_t vsize >
771{
772 trg.load ( src.data() ) ;
773}
774
775// conversion to and from vspline::simd_type of different T
776// This uses goading, because we can't be sure that src_t can be
777// handled by highway.
778
779template < typename src_t , typename trg_t , std::size_t vsize >
782{
783 auto p_trg = trg.data() ;
784 for ( std::size_t i = 0 ; i < vsize ; i++ )
785 p_trg[i] = src[i] ;
786}
787
788template < typename src_t , typename trg_t , std::size_t vsize >
791{
792 auto p_src = src.data() ;
793 for ( std::size_t i = 0 ; i < vsize ; i++ )
794 trg[i] = p_src[i] ;
795}
796
797// now comes the template class hwy_simd_type which implements the core
798// of the functionality, the SIMD data type 'standing in' for
799// vspline::simd_type when USE_HWY is defined.
800
801template < typename _value_type ,
802 std::size_t _vsize >
803struct HWY_ALIGN hwy_simd_type
804{
805 typedef std::size_t size_type ;
806
807 typedef _value_type value_type ;
808 typedef value_type T ;
809
810 // A typical choice of vsize would be 'as many as there are
811 // bytes in a hardware vector'. This insures that all hwy_simd_type
812 // are interoperable without having to use half- or quarter-filled
813 // vectors for size compatibility.
814
815 static const size_type vsize = _vsize ;
816 static const int ivsize = _vsize ; // finessing for g++
817 static const int vbytes = sizeof ( value_type ) * vsize ;
818
819 // we make sure vsize is a power of two - simd_traits in vector.h
820 // does route non-power-of-two sizes to vspline::simd_type, but
821 // user code may not use simd_traits. Best to be safe!
822
823 static_assert ( ( vsize & ( vsize - 1 ) ) == 0 ,
824 "only use powers of two as lane count for highway-based hwy_simd_type" ) ;
825
826 // we use CappedTag to make sure that interfacing with memory
827 // will never exceed the bounds of the 'backing' memory. This
828 // may waste some register space: if T were, e.g. uint8_t and
829 // vsize 32, while a register could hold 64 bytes, half the
830 // register would remain empty.
831
832 typedef hn::CappedTag < value_type , vsize > D ;
833 typedef typename hn::Vec < D > vec_t ;
834
835 // just to make sure the class does the right thing
836
837 void info() const
838 {
839 std::cout << "value_type has "
840 << sizeof(value_type) << " bytes" << std::endl ;
841 std::cout << "hwy_simd_type has "
842 << sizeof(hwy_simd_type) << " bytes" << std::endl ;
843 std::cout << "HWY_MAX_BYTES: "
844 << HWY_MAX_BYTES << " bytes" << std::endl ;
845 std::cout << "Lane count is "
846 << L() << std::endl ;
847 std::cout << "MaxLanes is "
848 << MaxLanes(D()) << std::endl ;
849 std::cout << "vsize is "
850 << vsize << " value_type" << std::endl ;
851 }
852
853#ifdef HWY_HAVE_SCALABLE
854
855 std::size_t L() const
856 {
857 return Lanes ( D() ) ;
858 }
859
860#else
861
862 static constexpr std::size_t L()
863 {
864 return Lanes ( D() ) ;
865 }
866
867#endif
868
869 // derive types used for masks and index vectors.
870
871 typedef hn::Vec < hn::RebindToSigned < D > > hw_index_type ;
872 typedef hn::DFromV < hw_index_type > DI ;
873 typedef hn::TFromD < DI > TI ;
875
876 static_assert ( std::is_same < hw_index_type ,
877 typename index_type::vec_t
878 > :: value ,
879 "index type mismatch" ) ;
880
881 // definition of the type for masks. Since hwy_simd_type holds the
882 // equivalent of potentially several vectors, the mask type has
883 // to hold the equivalent of as many masks.
884
887
888private:
889
890 // the 'backing' memory: storage of data is in a simple C array.
891 // This array is private, and the only access to it is via member
892 // functions.
893
894 HWY_ALIGN value_type inner [ vsize ] ;
895
896public:
897
898 // provide the size as a constexpr. This is possible because
899 // the size is indeed known at compile time, even though the
900 // underlying vectors may be sizeless - possibly capped.
901
902 static constexpr size_type size()
903 {
904 return vsize ;
905 }
906
907 // 'back door' for cheating
908
909 T * data()
910 {
911 return inner ;
912 }
913
914 const T * data() const
915 {
916 return inner ;
917 }
918
919 // interface to the 'backing' memory 'as' highway vectors.
920 // This is a key function. I have opted to use operator[] for access
921 // to individual lanes, in keeping with standard container semantics,
922 // And to avoid the vector types 'leaking out'. If user code wants
923 // to use the vectorized interface, it should do so via yield and
924 // take, as does the code inside this class.
925
926
927 vec_t yield ( const std::size_t & i ) const
928 {
929 return hn::Load ( D() , inner + i * L() ) ;
930 }
931
932 void take ( const std::size_t & i , const vec_t & rhs )
933 {
934 hn::Store ( rhs , D() , inner + i * L() ) ;
935 }
936
937 // dt_yield and dt_take are variants which are used to load
938 // and store under-filled vectors, which happens during type
939 // conversions to/from simd_types with a differently-sized
940 // T, involving promotion/demotion.
941
942 template < typename DT >
943 hn::Vec < DT > dt_yield ( const std::size_t & i ) const
944 {
945 return hn::Load ( DT() , inner + i * Lanes ( DT() ) ) ;
946 }
947
948 template < typename DT >
949 void dt_take ( const std::size_t & i , const hn::Vec < DT > & rhs )
950 {
951 hn::Store ( rhs , DT() , inner + i * Lanes ( DT() ) ) ;
952 }
953
954/*
955
956 // broadcast functions to help with functionality which is not
957 // available ready-made, and to help rolling out vector code to
958 // chunks, which need the operation repeated over the set of
959 // constituent vectors. The functions ending on _vf are vector
960 // functions and applied to the constituent vectors, the functions
961 // ending in plain _f are scalar functions and they are rolled
962 // out over the array of T, 'inner'.
963
964 typedef std::function < T() > gen_f ;
965 typedef std::function < T ( const T & ) > mod_f ;
966 typedef std::function < T ( const T & , const T & ) > bin_f ;
967
968 hwy_simd_type & broadcast ( gen_f f )
969 {
970 for ( std::size_t i = 0 ; i < size() ; i++ )
971 {
972 inner[i] = f() ;
973 }
974 return *this ;
975 }
976
977 hwy_simd_type & broadcast ( mod_f f )
978 {
979 for ( std::size_t i = 0 ; i < size() ; i++ )
980 {
981 inner[i] = f ( inner[i] ) ;
982 }
983 return *this ;
984 }
985
986 hwy_simd_type & broadcast ( bin_f f , const hwy_simd_type & rhs )
987 {
988 for ( std::size_t i = 0 ; i < size() ; i++ )
989 {
990 inner[i] = f ( inner[i] , rhs.inner[i] ) ;
991 }
992 return *this ;
993 }
994
995 typedef std::function < vec_t() > gen_vf ;
996 typedef std::function < vec_t ( const vec_t & ) > mod_vf ;
997 typedef std::function < vec_t ( const vec_t & , const vec_t & ) > bin_vf ;
998
999 // broadcast a vector generator function
1000
1001 hwy_simd_type & broadcast ( gen_vf f )
1002 {
1003 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1004 take ( i , f() ) ;
1005 return *this ;
1006 }
1007
1008 // broadcast a vector modulator
1009
1010 hwy_simd_type & broadcast ( mod_vf f )
1011 {
1012 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1013 take ( i , f ( yield ( i ) ) ) ;
1014 return *this ;
1015 }
1016
1017 // broadcast a vector binary function
1018
1019 hwy_simd_type & broadcast ( bin_vf f , const hwy_simd_type & rhs )
1020 {
1021 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1022 take ( i , f ( yield ( i ) , rhs.yield ( i ) ) ) ;
1023 return *this ;
1024 }
1025
1026*/
1027
1028 // c'tor from T, using hn::Set to provide a vector as initializer
1029
1030 hwy_simd_type ( const T & x )
1031 {
1032 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1033 take ( i , hn::Set ( D() , x ) ) ;
1034 }
1035
1036 hwy_simd_type() = default ;
1037
1038 // assignment from another chunk or a T
1039
1040 hwy_simd_type & operator= ( const hwy_simd_type & rhs )
1041 {
1042 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1043 take ( i , rhs.yield ( i ) ) ;
1044 return *this ;
1045 }
1046
1047 hwy_simd_type & operator= ( const hwy_simd_type && rhs )
1048 {
1049 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1050 take ( i , rhs.yield ( i ) ) ;
1051 return *this ;
1052 }
1053
1054 hwy_simd_type & operator= ( const T & rhs )
1055 {
1056 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1057 take ( i , hn::Set ( D() , rhs ) ) ;
1058 return *this ;
1059 }
1060
1062 {
1063 *this = x ;
1064 }
1065
1067 {
1068 *this = x ;
1069 }
1070
1071 // operator[] is mapped to ordinary element access to the backing
1072 // memory. It's assumed that user code will avoid using it for
1073 // performance-critical code, but it's 'nice to have' and easily
1074 // coded:
1075
1076 value_type & operator[] ( const size_type & i )
1077 {
1078 return inner[i] ;
1079 }
1080
1081 value_type operator[] ( const size_type & i ) const
1082 {
1083 return inner[i] ;
1084 }
1085
1086 // for conversions between different simd_types and to and from
1087 // vspline::simd_type, we 'break out' to free functions.
1088
1089 // conversion from one hwy_simd_type to another
1090
1091 template < typename U >
1092 hwy_simd_type & operator= ( const hwy_simd_type < U , vsize > & rhs )
1093 {
1094 convert ( rhs , *this ) ;
1095 return *this ;
1096 }
1097
1098 // assignment from a vspline::simd_type on the rhs
1099
1100 template < typename U >
1102 {
1103 convert ( rhs , *this ) ;
1104 return *this ;
1105 }
1106
1107 // conversion to a vspline::simd_type
1108
1109 template < typename U >
1111 {
1113 convert ( *this , result ) ;
1114 return result ;
1115 }
1116
1117 template < typename U >
1119 {
1120 *this = rhs ;
1121 }
1122
1123 template < typename U >
1125 {
1126 *this = rhs ;
1127 }
1128
1129 // implement copysign, isnegative, isfinite, isnan, setQnan, setZero
1130 // These functions are present in Vc::SimdArray. They are not currently
1131 // provided in other hwy_simd_type variants, I used them to code a direct
1132 // port of Vc's atan2 function.
1133 // For lux, this function is performance-critical, so I made several
1134 // attempts at porting it. A 'straight' port of the Vc code is possible
1135 // with these added functions, but it operates at the hwy_simd_type level,
1136 // which does not seem to optimize well. So I 'translated' the code
1137 // to use highway at the vec_t level - see hwy_atan2.h
1138
1140 const hwy_simd_type & sign_source )
1141 {
1142 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += value.L() )
1143 {
1144 value.take ( i , hn::CopySign ( value.yield ( i ) ,
1145 sign_source.yield ( i ) ) ) ;
1146 }
1147 return value ;
1148 }
1149
1150 // there might be a more efficient way to do this, but for now:
1151
1152 static mask_type isnegative ( const hwy_simd_type & rhs )
1153 {
1154 return ( rhs < value_type(0) ) ;
1155 }
1156
1157 static mask_type isfinite ( const hwy_simd_type & rhs )
1158 {
1159 mask_type result ;
1160 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += result.L() )
1161 {
1162 result.take ( i , hn::IsFinite ( rhs.yield ( i ) ) ) ;
1163 }
1164 return result ;
1165 }
1166
1167 static mask_type isnan ( const hwy_simd_type & rhs )
1168 {
1169 mask_type result ;
1170 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += result.L() )
1171 {
1172 result.take ( i , hn::IsNaN ( rhs.yield ( i ) ) ) ;
1173 }
1174 return result ;
1175 }
1176
1178 {
1179 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1180 {
1181 take ( i , IfThenElse ( m.yield ( i ) ,
1182 NaN ( D() ) ,
1183 yield ( i ) ) ) ;
1184 }
1185 return *this ;
1186 }
1187
1189 {
1190 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1191 {
1192 take ( i , IfThenElse ( m.yield ( i ) ,
1193 Zero ( D() ) ,
1194 yield ( i ) ) ) ;
1195 }
1196 return *this ;
1197 }
1198
1199 // produce a hwy_simd_type filled with T rising from zero
1200
1201 static const hwy_simd_type iota()
1202 {
1203 hwy_simd_type result ;
1204 auto v = hn::Iota ( D() , 0 ) ;
1205 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += result.L() )
1206 {
1207 result.take ( i , v ) ;
1208 v = hn::Add ( v , hn::Set ( D() , T ( Lanes ( D() ) ) ) ) ;
1209 }
1210 return result ;
1211 }
1212
1213 // mimick Vc's IndexesFromZero. This function produces an index
1214 // vector filled with indexes starting with zero. Because hwy
1215 // uses as many bits for the index as value_type has, we make
1216 // sure the indexes can fit.
1217
1219 {
1220 typedef typename index_type::value_type IT ;
1221 static const IT ceiling = std::numeric_limits < IT > :: max() ;
1222 static_assert ( ( vsize - 1 ) <= std::size_t ( ceiling ) ,
1223 "value_type too small" ) ;
1224
1225 return index_type::iota() ;
1226 }
1227
1228 // variant which starts from a different starting point and
1229 // optionally uses steps other than one. This is handy to
1230 // generate gather/scatter indexes to access strided data.
1231 // to avoid trouble, we look at the maximum index we can produce,
1232 // and add an assertion to make sure the indexes we expect will
1233 // fit the range.
1234
1235 static const index_type IndexesFrom ( const std::size_t & start ,
1236 const std::size_t & step = 1 )
1237 {
1238 typedef typename index_type::value_type IT ;
1239 static const IT ceiling = std::numeric_limits < IT > :: max() ;
1240
1241 assert ( start + ( vsize - 1 ) * step
1242 <= std::size_t ( ceiling ) ) ;
1243
1244 return ( index_type::iota() * IT(step) ) + IT(start) ;
1245 }
1246
1247 // functions Zero and One produce hwy_simd_type objects filled with
1248 // 0, or 1, respectively
1249
1250 static const hwy_simd_type Zero()
1251 {
1252 return hwy_simd_type ( value_type ( 0 ) ) ;
1253 }
1254
1255 static const hwy_simd_type One()
1256 {
1257 return hwy_simd_type ( value_type ( 1 ) ) ;
1258 }
1259
1260 // echo the vector to a std::ostream, read it from an istream
1261 // this also goes 'over the memory' because it's not deemed
1262 // performance-critical
1263
1264 friend std::ostream & operator<< ( std::ostream & osr ,
1265 hwy_simd_type it )
1266 {
1267 std::size_t l = it.L() ;
1268 osr << "(" ;
1269 for ( size_type i = 0 ; i < vsize - 1 ; i++ )
1270 osr << it [ i ] << ( i % l == l - 1 ? " | " : ", " ) ;
1271 osr << it [ vsize - 1 ] << ")" ;
1272 return osr ;
1273 }
1274
1275 friend std::istream & operator>> ( std::istream & isr ,
1276 hwy_simd_type it )
1277 {
1278 for ( size_type i = 0 ; i < vsize ; i++ )
1279 isr >> it [ i ] ;
1280 return isr ;
1281 }
1282
1283 // memory access functions, which load and store vector data.
1284 // We start out with functions transporting data from memory into
1285 // the hwy_simd_type. Some of these operations have corresponding
1286 // c'tors which use the member function to initialize 'inner'.
1287
1288 // load from unaligned memory. We're defensive here, if the
1289 // calling code is sure that the memory is appropriately aligned,
1290 // it can use load_aligned (below)
1291
1292 void load ( const value_type * const & p_src )
1293 {
1294 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1295 take ( i , hn::LoadU ( D() , p_src + i * L() ) ) ;
1296 }
1297
1298 void load_aligned ( const value_type * const & p_src )
1299 {
1300 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1301 take ( i , hn::Load ( D() , p_src + i * L() ) ) ;
1302 }
1303
1304 // now the reverse operations, storing to memory
1305
1306 void store ( value_type * const p_trg ) const
1307 {
1308 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1309 hn::StoreU ( yield ( i ) , D() , p_trg + i * Lanes ( D() ) ) ;
1310 }
1311
1312 void store_aligned ( value_type * const & p_trg ) const
1313 {
1314 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1315 hn::Store ( yield ( i ) , D() , p_trg + i * Lanes ( D() ) ) ;
1316 }
1317
1318// on AVX512, there do not seem to be any gather/scatter operations
1319// for short or byte values, so I use goading to implement them.
1320
1321#ifdef FLV_AVX512f
1322
1323 // to gather larger-than-short data, use hwy g/s
1324
1325 void _gather ( const value_type * const & p_src ,
1326 const index_type & indexes ,
1327 std::false_type )
1328 {
1329 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1330 take ( i , hn::GatherIndex ( D() , p_src ,
1331 indexes.yield ( i ) ) ) ;
1332 }
1333
1334 void _scatter ( value_type * const & p_trg ,
1335 const index_type & indexes ,
1336 std::false_type ) const
1337 {
1338 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1339 hn::ScatterIndex ( yield ( i ) , D() , p_trg ,
1340 indexes.yield ( i ) ) ;
1341 }
1342
1343 // to gather/scatter shorts or bytes, use goading
1344
1345 void _gather ( const value_type * const & p_src ,
1346 const index_type & indexes ,
1347 std::true_type )
1348 {
1349 for ( std::size_t i = 0 ; i < vsize ; i++ )
1350 inner [ i ] = p_src [ indexes [ i ] ] ;
1351 }
1352
1353 void _scatter ( value_type * const & p_trg ,
1354 const index_type & indexes ,
1355 std::true_type ) const
1356 {
1357 for ( std::size_t i = 0 ; i < vsize ; i++ )
1358 p_trg [ indexes [ i ] ] = inner [ i ] ;
1359 }
1360
1361 // these are the dispatching routines testing for small value_type
1362
1363 void gather ( const value_type * const & p_src ,
1364 const index_type & indexes )
1365 {
1366 typedef std::integral_constant
1367 < bool , sizeof ( value_type ) <= 2 > is_small_t ;
1368
1369 _gather ( p_src , indexes , std::false_type() ) ; // is_small_t() ) ;
1370 }
1371
1372 void scatter ( value_type * const & p_trg ,
1373 const index_type & indexes ) const
1374 {
1375 typedef std::integral_constant
1376 < bool , sizeof ( value_type ) <= 2 > is_small_t ;
1377
1378 _scatter ( p_trg , indexes , is_small_t() ) ;
1379 }
1380
1381#else // #ifdef FLV_AVX512f
1382
1383 // other ISAs seem to have no problem scattering all value_types
1384
1385 // gather with 'proper' index type, which is derived by type inference
1386 // from the tag type D which defines the underlying vector type. This
1387 // type is quite specific in highway: it is a vector of signed integers
1388 // with the same number of bits as the fundamental type of the vector
1389 // that is indexed. Calling code should obtain index_type from hwy_simd_type
1390 // (it's public), but it may provide indexes in other forms, which are
1391 // routed to the template further down.
1392
1393 void gather ( const value_type * const & p_src ,
1394 const index_type & indexes )
1395 {
1396 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1397 take ( i , hn::GatherIndex ( D() , p_src ,
1398 indexes.yield ( i ) ) ) ;
1399 }
1400
1401 void scatter ( value_type * const & p_trg ,
1402 const index_type & indexes ) const
1403 {
1404 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1405 hn::ScatterIndex ( yield ( i ) , D() , p_trg ,
1406 indexes.yield ( i ) ) ;
1407 }
1408
1409#endif
1410
1411 // goading implementation with arbitrary index type. This will be used
1412 // if the indexes aren't precisely index_type, but some other entity
1413 // providing operator[]. The loop construct may well be autovectorized,
1414 // but this can't be guaranteed. It's recommended to use the proper
1415 // index_type wherever possible.
1416
1417 template < typename index_t >
1418 void gather ( const value_type * const & p_src ,
1419 const index_t & indexes )
1420 {
1421 for ( std::size_t i = 0 ; i < vsize ; i++ )
1422 inner [ i ] = p_src [ indexes [ i ] ] ;
1423 }
1424
1425 template < typename index_t >
1426 void scatter ( value_type * const & p_trg ,
1427 const index_t & indexes ) const
1428 {
1429 for ( std::size_t i = 0 ; i < vsize ; i++ )
1430 p_trg [ indexes [ i ] ] = inner [ i ] ;
1431 }
1432
1433 // c'tor from pointer and indexes, uses gather
1434
1435 template < typename index_t >
1436 hwy_simd_type ( const value_type * const & p_src ,
1437 const index_t & indexes )
1438 {
1439 gather ( p_src , indexes ) ;
1440 }
1441
1442 // 'regular' gather and scatter, accessing strided memory so that the
1443 // first address visited is p_src/p_trg, and successive addresses are
1444 // 'step' apart - in units of T. IndexesFrom generates the 'proper'
1445 // index_type for 'real' g/s, so if the ISA supports it the operation
1446 // should be reasonably fast. It's used to implement de/interleaving
1447 // for scenarios which are not covered by the Load/StoreInterleaved
1448 // family of functions.
1449
1450 void rgather ( const value_type * const & p_src ,
1451 const std::size_t & step )
1452 {
1453 auto indexes = IndexesFrom ( 0 , step ) ;
1454 gather ( p_src , indexes ) ;
1455 }
1456
1457 void rscatter ( value_type * const & p_trg ,
1458 const std::size_t & step ) const
1459 {
1460 auto indexes = IndexesFrom ( 0 , step ) ;
1461 scatter ( p_trg , indexes ) ;
1462 }
1463
1464// // use 'indexes' to perform a gather from the data held in 'inner'
1465// // and return the result of the gather operation.
1466//
1467// template < typename index_type >
1468// hwy_simd_type shuffle ( index_type indexes )
1469// {
1470// hwy_simd_type result ;
1471// result.rgather ( inner , indexes ) ;
1472// return result ;
1473// }
1474
1475 // apply functions from namespace hn to each vector in the hwy_simd_type
1476 // or to each corresponding set of vectors going up to three for fma.
1477 // This method works well on some platforms with some functions, but
1478 // altogether the results aren't satisfactory - it's more of an
1479 // emergency measure if proper vector code can't be had. But it's
1480 // a quick way to get code up and running which requires certain
1481 // functions, which allows for rapid prototyping and leaving the
1482 // proper implementation for later.
1483
1484 #define BROADCAST_HWY_FUNC(FUNC,HFUNC) \
1485 friend hwy_simd_type FUNC ( const hwy_simd_type & arg ) \
1486 { \
1487 hwy_simd_type result ; \
1488 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += arg.L() ) \
1489 result.take ( i , hn::HFUNC ( D() , arg.yield ( i ) ) ) ; \
1490 return result ; \
1491 }
1492
1493 BROADCAST_HWY_FUNC(log,Log)
1494 BROADCAST_HWY_FUNC(exp,Exp)
1495 BROADCAST_HWY_FUNC(sin,Sin)
1496 BROADCAST_HWY_FUNC(cos,Cos)
1497
1498 // hn function not (yet) available for tan, rolling out to sin/cos
1499
1500// BROADCAST_HWY_FUNC(tan,Tan)
1501
1502 friend hwy_simd_type tan ( const hwy_simd_type & arg )
1503 {
1504 hwy_simd_type result ;
1505 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += arg.L() )
1506 result.take ( i ,
1507 hn::Div ( hn::Sin ( D() , arg.yield ( i ) ) ,
1508 hn::Cos ( D() , arg.yield ( i ) ) ) ) ;
1509 return result ;
1510 }
1511
1512 // goading version
1513
1514// friend hwy_simd_type tan ( hwy_simd_type arg )
1515// {
1516// static const mod_f f = [](const T & x)
1517// { return T ( std::tan ( x ) ) ; } ;
1518// return arg.broadcast ( f ) ;
1519// }
1520
1521 BROADCAST_HWY_FUNC(asin,Asin)
1522 BROADCAST_HWY_FUNC(acos,Acos)
1524
1525 #undef BROADCAST_HWY_FUNC
1526
1527 #define BROADCAST_HWY_FUNC(FUNC,HFUNC) \
1528 friend hwy_simd_type FUNC ( const hwy_simd_type & arg ) \
1529 { \
1530 hwy_simd_type result ; \
1531 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += arg.L() ) \
1532 result.take ( i , hn::HFUNC ( arg.yield ( i ) ) ) ; \
1533 return result ; \
1534 }
1535
1536 BROADCAST_HWY_FUNC(abs,Abs)
1537 BROADCAST_HWY_FUNC(trunc,Trunc)
1538 BROADCAST_HWY_FUNC(round,Round)
1539 BROADCAST_HWY_FUNC(floor,Floor)
1540 BROADCAST_HWY_FUNC(ceil,Ceil)
1541 BROADCAST_HWY_FUNC(sqrt,Sqrt)
1542
1543 #undef BROADCAST_HWY_FUNC
1544
1545 #define BROADCAST_HWY_FUNC2(FUNC,HFUNC) \
1546 friend hwy_simd_type FUNC ( const hwy_simd_type & arg1 , \
1547 const hwy_simd_type & arg2 ) \
1548 { \
1549 hwy_simd_type result ; \
1550 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += arg1.L() ) \
1551 result.take ( i , hn::HFUNC ( arg1.yield ( i ) , \
1552 arg2.yield ( i ) ) ) ; \
1553 return result ; \
1554 }
1555
1556 // hn function not available for atan2, pow
1557
1558// BROADCAST_HWY_FUNC2(pow,Pow)
1559
1560 // implementation of atan2 relying on a highway function Atan2.
1561 // The implementation of that is a port from Vc, in hwy_atan2.h
1562
1563 friend hwy_simd_type atan2 ( const hwy_simd_type & y , const hwy_simd_type & x )
1564 {
1565 hwy_simd_type result ;
1566 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += x.L() )
1567 result.take ( i , hn::Atan2 ( D() , y.yield(i) , x.yield(i) ) ) ;
1568 return result ;
1569 }
1570
1571 // goading version
1572
1573// friend hwy_simd_type atan2 ( hwy_simd_type y , const hwy_simd_type & x )
1574// {
1575// static const bin_f f = [](const T & y, const T & x)
1576// { return T ( std::atan2 ( y , x ) ) ; } ;
1577// return y.broadcast ( f , x ) ;
1578// }
1579
1580 friend hwy_simd_type pow ( const hwy_simd_type & base , const hwy_simd_type & exponent )
1581 {
1582 hwy_simd_type result ;
1583 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += base.L() )
1584 result.take ( i ,
1585 hn::Exp ( D() ,
1586 hn::Mul ( exponent.yield ( i ) ,
1587 hn::Log ( D() , base.yield ( i ) ) )
1588 )
1589 ) ;
1590 return result ;
1591 }
1592
1593#undef BROADCAST_HWY_FUNC2
1594
1595 // three-argument-functions are not currently used.
1596
1597 // macro used for the parameter 'CONSTRAINT' in the definitions
1598 // further down. Some operations are only allowed for integral types
1599 // or boolans. This might be enforced by enable_if, here we use a
1600 // static_assert with a clear error message. I found that highway
1601 // is quite relaxed about enforcing integer data - things like
1602 // Xor and Not are allowed even for floating point data and
1603 // presumably work on the bare bits.
1604 // TODO: might relax constraints by using 'std::is_convertible'
1605
1606 #define INTEGRAL_ONLY \
1607 static_assert ( std::is_integral < value_type > :: value , \
1608 "this operation is only allowed for integral types" ) ;
1609
1610 // augmented assignment with a chunk as rhs and with a T as rhs.
1611 // The first one is actually defined wider - the rhs can be any
1612 // type which can yield an object suitable as rhs to OPFN, and
1613 // the second variant fills a vector with the scalar and uses that
1614 // as the rhs to OPFN. This may not be optimal in every case, one
1615 // might consider using the scalar as rhs if OPFN has such an
1616 // overload.
1617
1618 #define OPEQ_FUNC(OP,OPFN,CONSTRAINT) \
1619 hwy_simd_type & OP ( const hwy_simd_type & rhs ) \
1620 { CONSTRAINT \
1621 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1622 take ( i , OPFN ( yield ( i ) , rhs.yield ( i ) ) ) ; \
1623 return *this ; \
1624 } \
1625 hwy_simd_type & OP ( const T & rhs ) \
1626 { CONSTRAINT \
1627 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1628 take ( i , OPFN ( yield ( i ) , hn::Set ( D() , rhs ) ) ) ; \
1629 return *this ; \
1630 }
1631
1632 OPEQ_FUNC(operator+=,hn::Add,)
1633 OPEQ_FUNC(operator-=,hn::Sub,)
1634 OPEQ_FUNC(operator*=,hn::Mul,)
1635
1636 // Div and Mod are not available for all types, see further down
1637
1638 OPEQ_FUNC(operator&=,hn::And,)
1639 OPEQ_FUNC(operator|=,hn::Or,)
1640 OPEQ_FUNC(operator^=,hn::Xor,)
1641
1642 // these definitions of left and right shift may not be
1643 // optimal for scalar rhs, which is provided by creating
1644 // a vec_t from the scalar and broadcasting that.
1645
1646 OPEQ_FUNC(operator<<=,hn::Shl,INTEGRAL_ONLY)
1647 OPEQ_FUNC(operator>>=,hn::Shr,INTEGRAL_ONLY)
1648
1649 #undef OPEQ_FUNC
1650
1651 // integer division is rolled out to scalar, but float data
1652 // will use hn::Div
1653
1654 template < typename rhs_t >
1655 hwy_simd_type & div ( const rhs_t & rhs , std::false_type )
1656 {
1657 auto * p_r = rhs.data() ;
1658 for ( std::size_t i = 0 ; i < size() ; i++ )
1659 inner [ i ] /= p_r [ i ] ;
1660 return *this ;
1661 }
1662
1663 hwy_simd_type & div ( const T & rhs , std::false_type )
1664 {
1665 for ( std::size_t i = 0 ; i < size() ; i++ )
1666 inner [ i ] /= rhs ;
1667 return *this ;
1668 }
1669
1670 // float division is handled with vector code:
1671
1672 template < typename rhs_t >
1673 hwy_simd_type & div ( const rhs_t & rhs , std::true_type )
1674 {
1675 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1676 take ( i , hn::Div ( yield ( i ) , rhs.yield ( i ) ) ) ;
1677 return *this ;
1678 }
1679
1680 hwy_simd_type & div ( const T & rhs , std::true_type )
1681 {
1682 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
1683 take ( i , hn::Div ( yield ( i ) , hn::Set ( D() , rhs ) ) ) ;
1684 return *this ;
1685 }
1686
1687 template < typename rhs_t >
1688 hwy_simd_type & operator/= ( const rhs_t & rhs )
1689 {
1690 typedef typename std::is_floating_point < T > :: type is_float_t ;
1691 return div ( rhs , is_float_t() ) ;
1692 }
1693
1694 // modulo is rolled out unconditionally, because there is no hwy::Mod
1695
1696 template < typename rhs_t >
1697 hwy_simd_type & operator%= ( const rhs_t & rhs )
1698 {
1699 auto * p_r = rhs.data() ;
1700 for ( std::size_t i = 0 ; i < size() ; i++ )
1701 inner [ i ] %= p_r [ i ] ;
1702 return *this ;
1703 }
1704
1705 hwy_simd_type & operator%= ( const T & rhs )
1706 {
1707 auto * p_r = rhs.data() ;
1708 for ( std::size_t i = 0 ; i < size() ; i++ )
1709 inner [ i ] %= rhs ;
1710 return *this ;
1711 }
1712
1713// we use a simple scheme for type promotion: the promoted type
1714// of two values should be the same as the type we would receive
1715// when adding the two values. That's standard C semantics, but
1716// it won't widen the result type to avoid overflow or increase
1717// precision - such conversions have to be made by user code if
1718// necessary.
1719
1720#define C_PROMOTE(A,B) \
1721typename std::conditional \
1722 < std::is_same < A , B > :: value , \
1723 A , \
1724 decltype ( std::declval < A > () \
1725 + std::declval < B > () ) \
1726 > :: type
1727
1728 // binary operators and left and right scalar operations with
1729 // value_type, unary operators -, ! and ~
1730
1731#define OP_FUNC(OPFUNC,OPEQ,CONSTRAINT) \
1732 template < typename RHST , \
1733 typename = typename std::enable_if \
1734 < std::is_fundamental < RHST > :: value \
1735 > :: type \
1736 > \
1737 hwy_simd_type < C_PROMOTE ( T , RHST ) , vsize > \
1738 OPFUNC ( hwy_simd_type < RHST , vsize > rhs ) const \
1739 { \
1740 CONSTRAINT \
1741 hwy_simd_type < C_PROMOTE ( T , RHST ) , vsize > help ( *this ) ; \
1742 help OPEQ rhs ; \
1743 return help ; \
1744 } \
1745 template < typename RHST , \
1746 typename = typename std::enable_if \
1747 < std::is_fundamental < RHST > :: value \
1748 > :: type \
1749 > \
1750 hwy_simd_type < C_PROMOTE ( T , RHST ) , vsize > \
1751 OPFUNC ( RHST rhs ) const \
1752 { \
1753 CONSTRAINT \
1754 hwy_simd_type < C_PROMOTE ( T , RHST ) , vsize > help ( *this ) ; \
1755 help OPEQ rhs ; \
1756 return help ; \
1757 } \
1758 template < typename LHST , \
1759 typename = typename std::enable_if \
1760 < std::is_fundamental < LHST > :: value \
1761 > :: type \
1762 > \
1763 friend hwy_simd_type < C_PROMOTE ( LHST , T ) , vsize > \
1764 OPFUNC ( LHST lhs , hwy_simd_type rhs ) \
1765 { \
1766 CONSTRAINT \
1767 hwy_simd_type < C_PROMOTE ( LHST , T ) , vsize > help ( lhs ) ; \
1768 help OPEQ rhs ; \
1769 return help ; \
1770 }
1771
1772 // binary operators and left and right scalar operations with
1773 // value_type, unary operators -, ! and ~
1774
1775 // #define OP_FUNC(OPFUNC,OPEQ,CONSTRAINT) \
1776 // hwy_simd_type OPFUNC ( const hwy_simd_type & rhs ) const \
1777 // { \
1778 // CONSTRAINT \
1779 // hwy_simd_type help ( *this ) ; \
1780 // help OPEQ rhs ; \
1781 // return help ; \
1782 // } \
1783 // hwy_simd_type OPFUNC ( const T & rhs ) const \
1784 // { \
1785 // CONSTRAINT \
1786 // hwy_simd_type help ( *this ) ; \
1787 // help OPEQ rhs ; \
1788 // return help ; \
1789 // } \
1790 // friend hwy_simd_type OPFUNC ( const value_type & lhs , \
1791 // const hwy_simd_type & rhs ) \
1792 // { \
1793 // CONSTRAINT \
1794 // hwy_simd_type help ( lhs ) ; \
1795 // help OPEQ rhs ; \
1796 // return help ; \
1797 // }
1798
1799 OP_FUNC(operator+,+=,)
1800 OP_FUNC(operator-,-=,)
1801 OP_FUNC(operator*,*=,)
1802 OP_FUNC(operator/,/=,)
1803
1804 OP_FUNC(operator%,%=,INTEGRAL_ONLY)
1805 OP_FUNC(operator&,&=,INTEGRAL_ONLY)
1806 OP_FUNC(operator|,|=,INTEGRAL_ONLY)
1807 OP_FUNC(operator^,^=,)
1808 OP_FUNC(operator<<,<<=,INTEGRAL_ONLY)
1809 OP_FUNC(operator>>,>>=,INTEGRAL_ONLY)
1810
1811 #undef OP_FUNC
1812
1813 // for unary operators, relying on the operators to be defined
1814 // for now - should use hwy functions instead
1815
1816 #define OP_FUNC(OPFUNC,OP,CONSTRAINT) \
1817 hwy_simd_type OPFUNC() const \
1818 { \
1819 hwy_simd_type help ; \
1820 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1821 help.take ( i , OP ( yield ( i ) ) ) ; \
1822 return help ; \
1823 }
1824
1825 OP_FUNC(operator-,hn::Neg,)
1826 OP_FUNC(operator~,hn::Not,)
1827
1828 #undef OP_FUNC
1829
1830 // provide methods to produce a mask on comparing a vector
1831 // with another vector or a value_type.
1832
1833 #define COMPARE_FUNC(OP,OPFUNC) \
1834 friend mask_type OPFUNC ( const hwy_simd_type & lhs , \
1835 const hwy_simd_type & rhs ) \
1836 { \
1837 mask_type m ; \
1838 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += lhs.L() ) \
1839 m.take ( i , OP ( lhs.yield ( i ) , rhs.yield ( i ) ) ) ; \
1840 return m ; \
1841 } \
1842 friend mask_type OPFUNC ( const hwy_simd_type & lhs , \
1843 const value_type & rhs ) \
1844 { \
1845 mask_type m ; \
1846 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += lhs.L() ) \
1847 m.take ( i , OP ( lhs.yield ( i ) , hn::Set ( D() , rhs ) ) ) ; \
1848 return m ; \
1849 } \
1850 friend mask_type OPFUNC ( const value_type & lhs , \
1851 const hwy_simd_type & rhs ) \
1852 { \
1853 mask_type m ; \
1854 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += rhs.L() ) \
1855 m.take ( i , OP ( hn::Set ( D() , lhs ) , rhs.yield ( i ) ) ) ; \
1856 return m ; \
1857 }
1858
1859 COMPARE_FUNC(Lt,operator<) ;
1860 COMPARE_FUNC(Le,operator<=) ;
1861 COMPARE_FUNC(Gt,operator>) ;
1862 COMPARE_FUNC(Ge,operator>=) ;
1863 COMPARE_FUNC(Eq,operator==) ;
1864 COMPARE_FUNC(Ne,operator!=) ;
1865
1866 #undef COMPARE_FUNC
1867
1868 // next we define a masked vector as an object holding two members:
1869 // one mask type, determining which of the vector's elements will
1870 // be 'open' to an effect, and one reference to a vector, which will
1871 // be affected by the operation.
1872 // The resulting object will only be viable as long as the referred-to
1873 // vector is alive - it's meant as a construct to be processed
1874 // in the same scope, as the lhs of an assignment, typically using
1875 // notation introduced by Vc: a vector's operator() is overloaded
1876 // to produce a masked_type when called with a mask_type object, and
1877 // the resulting masked_type object is then assigned to.
1878 // Note that this does not have any effect on those values in 'whither'
1879 // for which the mask is false. They remain unchanged.
1880
1882 {
1883 const mask_type whether ; // if the mask is true at whether[i]
1884 hwy_simd_type & whither ; // whither[i] will be assigned to
1885
1886 std::size_t L() const
1887 {
1888 return whither.L() ;
1889 }
1890
1891 masked_type ( const mask_type & _whether ,
1892 hwy_simd_type & _whither )
1893 : whether ( _whether ) ,
1894 whither ( _whither )
1895 { }
1896
1897 template < typename D2 , std::size_t N2 >
1899 hwy_simd_type & _whither )
1900 : whether ( _whether ) ,
1901 whither ( _whither )
1902 {
1903// whether = _whether ;
1904 }
1905
1906 // for the masked vector, we define the complete set of assignments:
1907
1908 hwy_simd_type & operator= ( const value_type & rhs ) \
1909 { \
1910 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1911 { \
1912 auto m = whether.yield ( i ) ; \
1913 auto v = whither.yield ( i ) ; \
1914 auto vr = hn::Set ( D() , rhs ) ; \
1915 whither.take ( i , hn::IfThenElse ( m , vr , v ) ) ; \
1916 } \
1917 return whither ; \
1918 } \
1919 hwy_simd_type & operator= ( const hwy_simd_type & rhs ) \
1920 { \
1921 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1922 { \
1923 auto m = whether.yield ( i ) ; \
1924 auto v = whither.yield ( i ) ; \
1925 auto vr = rhs.yield ( i ) ; \
1926 whither.take ( i , hn::IfThenElse ( m , vr , v ) ) ; \
1927 } \
1928 return whither ; \
1929 }
1930
1931 // most operators can be rolled out over vec_t
1932
1933 #define OPEQ_FUNC(OPFUNC,OP,CONSTRAINT) \
1934 hwy_simd_type & OPFUNC ( const value_type & rhs ) \
1935 { \
1936 CONSTRAINT \
1937 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1938 { \
1939 auto m = whether.yield ( i ) ; \
1940 auto v = whither.yield ( i ) ; \
1941 auto vr = hn::OP ( v , hn::Set ( D() , rhs ) ) ; \
1942 whither.take ( i , hn::IfThenElse ( m , vr , v ) ) ; \
1943 } \
1944 return whither ; \
1945 } \
1946 hwy_simd_type & OPFUNC ( const hwy_simd_type & rhs ) \
1947 { \
1948 CONSTRAINT \
1949 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() ) \
1950 { \
1951 auto m = whether.yield ( i ) ; \
1952 auto v = whither.yield ( i ) ; \
1953 auto vr = hn::OP ( v , rhs.yield ( i ) ) ; \
1954 whither.take ( i , hn::IfThenElse ( m , vr , v ) ) ; \
1955 } \
1956 return whither ; \
1957 }
1958
1959 OPEQ_FUNC(operator+=,Add,)
1960 OPEQ_FUNC(operator-=,Sub,)
1961 OPEQ_FUNC(operator*=,Mul,)
1962 OPEQ_FUNC(operator&=,And,INTEGRAL_ONLY)
1963 OPEQ_FUNC(operator|=,Or,INTEGRAL_ONLY)
1964 OPEQ_FUNC(operator^=,Xor,INTEGRAL_ONLY)
1965 OPEQ_FUNC(operator<<=,Shl,INTEGRAL_ONLY)
1966 OPEQ_FUNC(operator>>=,Shr,INTEGRAL_ONLY)
1967
1968 #undef OPEQ_FUNC
1969
1970 // some operators are implemented on the hwy_simd_type level:
1971 // first, obtain a hwy_simd_type by applying OPEQ rhs to whither. This
1972 // intermediate result would be appropriate for an all-true mask.
1973 // Then apply this by calling operator= on *this: this picks the
1974 // masked assignment above (aka operator=), which only copies from
1975 // the all-true intermediate result where the mask is true.
1976
1977 #define OPEQ_FUNC(OPFUNC,OPEQ,CONSTRAINT) \
1978 hwy_simd_type & OPFUNC ( const value_type & rhs ) \
1979 { \
1980 CONSTRAINT \
1981 hwy_simd_type mrhs ( whither ) ; \
1982 mrhs OPEQ rhs ; \
1983 return ( *this = mrhs ) ; \
1984 } \
1985 hwy_simd_type & OPFUNC ( const hwy_simd_type & rhs ) \
1986 { \
1987 CONSTRAINT \
1988 hwy_simd_type mrhs ( whither ) ; \
1989 mrhs OPEQ rhs ; \
1990 return ( *this = mrhs ) ; \
1991 }
1992
1993 OPEQ_FUNC(operator%=,%=,INTEGRAL_ONLY)
1994 OPEQ_FUNC(operator/=,/=,)
1995
1996 #undef OPEQ_FUNC
1997
1998 } ;
1999
2000 #undef INTEGRAL_ONLY
2001
2002 // mimicking Vc, we define operator() with a mask_type argument
2003 // to produce a masked_type object, which can be used later on to
2004 // masked-assign to the referred-to vector. With this definition
2005 // we can use the same syntax Vc uses, e.g. v1 ( v1 > v2 ) = v3
2006 // This helps write code which compiles with Vc and without,
2007 // because this idiom is 'very Vc'.
2008
2009 masked_type operator() ( const mask_type & mask )
2010 {
2011 return masked_type ( mask , *this ) ;
2012 }
2013
2014 // member functions at_least and at_most. These functions provide the
2015 // same functionality as max, or min, respectively. Given hwy_simd_type X
2016 // and some threshold Y, X.at_least ( Y ) == max ( X , Y )
2017 // Having the functionality as a member function makes it easy to
2018 // implement, e.g., min as: min ( X , Y ) { return X.at_most ( Y ) ; }
2019
2020 #define CLAMP(FNAME,REL) \
2021 hwy_simd_type FNAME ( const T & threshold ) const \
2022 { \
2023 return (*this) ( *this REL threshold ) = threshold ; \
2024 } \
2025 hwy_simd_type FNAME ( const hwy_simd_type & threshold ) const \
2026 { \
2027 return (*this) ( *this REL threshold ) = threshold ; \
2028 }
2029
2030 CLAMP(at_least,<)
2031 CLAMP(at_most,>)
2032
2033 #undef CLAMP
2034
2035 // sum of vector elements. Note that there is no type promotion; the
2036 // summation is done to value_type. Caller must make sure that overflow
2037 // is not a problem.
2038
2040 {
2041 vec_t s ( hn::Zero ( D() ) ) ;
2042 for ( std::size_t n = 0 , i = 0 ; n < vsize ; ++i , n += L() )
2043 s += yield ( i ) ;
2044 return hn::GetLane ( hn::SumOfLanes ( D() , s ) ) ;
2045 }
2046} ;
2047
2048} ; // namespace HWY_NAMESPACE
2049
2050HWY_AFTER_NAMESPACE(); // at file scope
2051
2052#endif // HWY_SIMD_TYPE_H
2053
2054namespace vspline
2055{
2056template < typename T , std::size_t N >
2058
2059// for highway data, vspline needs an allocator, which is in turn
2060// required by vigra::MultiArray to allocate vector-aligned storage.
2061// Initially I coded the allocation using aligned_alloc, but this
2062// function is not available on msys2, so I'm now using highway
2063// functions.
2064
2065template < typename T >
2067: public std::allocator < T >
2068{
2069 typedef std::allocator < T > base_t ;
2070 using typename base_t::pointer ;
2071 pointer allocate ( std::size_t n )
2072 {
2073 return (pointer) hwy::AllocateAlignedBytes
2074 ( n * sizeof(T) , nullptr , nullptr ) ;
2075 }
2076 void deallocate ( T* p , std::size_t n )
2077 {
2078 hwy::FreeAlignedBytes ( p , nullptr , nullptr ) ;
2079 }
2080 using base_t::base_t ;
2081} ;
2082
2083// Im fixing this allocator via std::allocator_traits, but it
2084// does not seem to be picked for all allocations - I had, e.g.
2085// a std::vector of hwy_simd_type which contained unaligned
2086// memory and caused a crash (only with c++11, 17 is okay)
2087// - I worked around it in lux, but the problem is not solved.
2088
2089template < typename T , std::size_t N >
2091{
2093} ;
2094} ;
2095
2096
2097#ifndef HWY_SIMD_ALLOCATOR
2098#define HWY_SIMD_ALLOCATOR
2099
2100namespace std
2101{
2102
2103template < typename T , std::size_t N >
2104struct allocator_traits < vspline::hwy_simd_type < T , N > >
2105{
2108} ;
2109
2110} ;
2111
2112#endif // HWY_SIMD_ALLOCATOR
2113
2114#ifndef VSPLINE_VECTOR_NBYTES
2115
2116// this is tentative, but an informed guess, because the data handled
2117// by the rendering code are single precision float or int at most,
2118// but the smallest data are unsigned char.
2119// A full hardware vector of unsigned char amounts to HWY_MAX_BYTES
2120// lanes (one lane <=> one byte), and a full hwy_simd_type of float
2121// needs four times as many bytes (one lane <=> four bytes), hence
2122// the choice of 4*HWY_MAX_BYTES.
2123// But TODO for very large HWY_MAX_BYTES this may become very large,
2124// and HWY_MAX_BYTES is a loose upper bound, so it may well exceed
2125// what the current CPU actually needs. This may become a problem
2126// with, e.g. SVE.
2127
2128#define VSPLINE_VECTOR_NBYTES (4*HWY_MAX_BYTES)
2129
2130#endif
2131
2132#endif // #define VSPLINE_HWY_SIMD_TYPE_H
class template simd_type provides a fixed-size container type for small sets of fundamentals which ar...
Definition: simd_type.h:147
void load(const value_type *const p_src)
Definition: simd_type.h:376
void store(value_type *const p_trg) const
Definition: simd_type.h:407
@ vsize
Definition: eval.cc:96
HWY_AFTER_NAMESPACE()
#define DEMOTE(SRC, TRG)
#define OP_FUNC(OPFUNC, OPEQ)
#define BROADCAST_HWY_FUNC(FUNC, HFUNC)
#define OPEQ_FUNC(OP, OPFN)
HWY_BEFORE_NAMESPACE()
#define CONVERT(SRC, TRG)
#define CLAMP(FNAME, REL)
#define INTEGRAL_ONLY
#define PROMOTE(SRC, TRG)
void convert(const hwy_simd_type< src_t, vsize > &src, hwy_simd_type< trg_t, vsize > &trg)
bool none_of(const mchunk_t< D, N > &arg)
bool any_of(const mchunk_t< D, N > &arg)
bool all_of(const mchunk_t< D, N > &arg)
struct HWY_ALIGN hwy_simd_type
class template hwy_simd_type provides a fixed-size container type for small-ish sets of fundamentals ...
HWY_INLINE V Atan2(const D d, V y, V x)
Definition: hwy_atan2.h:172
HWY_INLINE V Atan(const D d, V y, V x)
Definition: basis.h:79
masked_type(const mask_type &_whether, hwy_simd_type &_whither)
masked_type(const mchunk_t< D2, N2 > &_whether, hwy_simd_type &_whither)
hwy_simd_type(const vspline::simd_type< U, vsize > &rhs)
static mask_type isnegative(const hwy_simd_type &rhs)
hwy_simd_type & setQnan(const mask_type &m)
void load_aligned(const value_type *const &p_src)
static constexpr std::size_t L()
friend hwy_simd_type pow(const hwy_simd_type &base, const hwy_simd_type &exponent)
void take(const std::size_t &i, const vec_t &rhs)
COMPARE_FUNC(Ne, operator!=)
void gather(const value_type *const &p_src, const index_type &indexes)
vec_t yield(const std::size_t &i) const
hwy_simd_type< TI, vsize > index_type
static const index_type IndexesFromZero()
static mask_type isnan(const hwy_simd_type &rhs)
COMPARE_FUNC(Le, operator<=)
static mask_type isfinite(const hwy_simd_type &rhs)
void dt_take(const std::size_t &i, const hn::Vec< DT > &rhs)
void rscatter(value_type *const &p_trg, const std::size_t &step) const
mchunk_t< D, vsize > MaskType
hn::CappedTag< value_type, vsize > D
hn::Vec< hn::RebindToSigned< D > > hw_index_type
hwy_simd_type(const hwy_simd_type &&x)
static const hwy_simd_type One()
void store(value_type *const p_trg) const
static const hwy_simd_type Zero()
hwy_simd_type & div(const T &rhs, std::true_type)
hn::Vec< DT > dt_yield(const std::size_t &i) const
void gather(const value_type *const &p_src, const index_t &indexes)
void scatter(value_type *const &p_trg, const index_type &indexes) const
hwy_simd_type & div(const rhs_t &rhs, std::false_type)
void scatter(value_type *const &p_trg, const index_t &indexes) const
hwy_simd_type & div(const T &rhs, std::false_type)
friend hwy_simd_type atan2(const hwy_simd_type &y, const hwy_simd_type &x)
mchunk_t< D, vsize > mask_type
hwy_simd_type & div(const rhs_t &rhs, std::true_type)
void store_aligned(value_type *const &p_trg) const
static constexpr size_type size()
static const index_type IndexesFrom(const std::size_t &start, const std::size_t &step=1)
void rgather(const value_type *const &p_src, const std::size_t &step)
COMPARE_FUNC(Ge, operator>=)
hwy_simd_type(const value_type *const &p_src, const index_t &indexes)
hwy_simd_type(const hwy_simd_type &x)
hwy_simd_type & setZero(const mask_type &m)
void load(const value_type *const &p_src)
static const hwy_simd_type iota()
hn::DFromV< hw_index_type > DI
hwy_simd_type(const hwy_simd_type< U, vsize > &rhs)
COMPARE_FUNC(Eq, operator==)
static hwy_simd_type copysign(hwy_simd_type value, const hwy_simd_type &sign_source)
mask type for hwy_simd_type. This is a type which holds a set of masks stored in uint8_t,...
void take(const std::size_t &i, const vmask_type &rhs)
vmask_type yield(const std::size_t &i) const
static constexpr std::size_t L()
const uint8_t * data() const
mchunk_t(const mchunk_t< D2, vsize > &rhs)
void LoadFromBytes(const uint8_t *p_trg)
hn::Mask< D > vmask_type
void transfer(const mchunk_t< D1, vsize > &in_mask, mchunk_t< D2, vsize > &out_mask)
mchunk_t(const mchunk_t &)=default
void SaveToBytes(uint8_t *p_trg) const
vspline::simd_allocator< vspline::hwy_simd_type< T, N > > allocator_type
simd_allocator< hwy_simd_type< T, N > > type
vspline creates vigra::MultiArrays of vectorized types. As long as the vectorized types are Vc::SimdA...
Definition: common.h:267
void deallocate(T *p, std::size_t n)
std::allocator< T > base_t
pointer allocate(std::size_t n)