vspline 1.1.0
Generic C++11 Code for Uniform B-Splines
|
vspline is a header-only generic C++ library for the creation and use of uniform B-splines. It aims to be as comprehensive as feasibly possible, yet at the same time producing code which performs well, so that it can be used in production. vspline is CPU-based, it does not use the GPU.
vspline's main focus is interpolation of bulk nD raster data. It was developed originally to be used for image processing software. In image processing, oftentimes large amounts of pixels need to be submitted to identical operations, suggesting a functional approach. vspline offers functional programming elements to implement such programs.
vspline was developed on a Linux system using clang++ and g++. It has not been tested much with other systems or compilers. The code uses the C++11 standard, or C++17 with the std::simd back-end.
vspline's companion program lux, which uses vspline heavily, is now running on several platforms and demonstrates that the code is quite portable.
vspline relies on one other library:
Optionally, vspline can make use of these SIMD libraries:
I find VIGRA indispensible, omitting it from vspline is not really an option. Use of the SIMD libraries is optional, though, and may be activated by #defining 'USE_VC', 'USE_HWY' or 'USE_STDSIMD', respectively. This should be done by passing -D USE_... to the compiler; defining USE_... only for parts of a project may or may not work. Please note that vspline uses Vc's 1.4 branch, not the master branch. 1.4 is the branch which is likely distributed by package managers, if your system does not provide it, build Vc from source - if you check out Vc from github, make sure you pick the 1.4 branch. Vc is now in 'maintenance mode' and only recommended for intel/AMD CPUs with up to AVX2. The preferred SIMD back-end is now highway.
I have made an attempt to generalize the code so that it can handle
On the evaluation side I provide
To produce maximum perfomance, vspline also has a fair amount of collateral code, and some of this code may be helpful beyond vspline:
many examples, ample comments
*note that the filtering code and the transform-type routines multithread and vectorize automatically.
Note that vspline takes it's claim to being 'generic' seriously: when template arguments are offered to produce variant code, you can depend on actually getting code for every allowed value, rather than having to implement the special cases yourself and then introducing them to some generic dispatch mechanism. So you want to create a 4-D b-spline over five-channel long double data? No problem. And spline degree is even a run-time parameter and not a template argument, making the code more flexible than code which requires that you specify the spline degree at compile-time.
vspline is available as a debian package, so the easiest way to install vspline is to install the vspline package, if it is provided by your package manager. This will also take care of the (few) dependencies and install the examples. ubuntu should provide a vspline package from 18.04 on. As of this writing, vspline is still under development, and the master branch in the git repository may offer new/different functionality to the code in the package. While I try to keep the 'higher-level' functions' and objects' interface consistent, the signatures may still change, especially in their default arguments, so switching from the packaged version to the master branch, or upgrading to a newer packaged version may require changes to the calling code. With Vc having gone into 'maintenance mode' I had to find a new SIMD library to handle newer CPUs, and I chose highway. To be able to keep using my SIMD code, I wrote an interface to highway which uses the same API, and this led to a standardization of SIMD functionality, resulting in four distinc back-ends, 'goading', std::simd, Vc and highway. The use of Vc or highway requires the installation of the corresponding library code.
If you're installing manually, note that vspline is header-only, so it's sufficient to place the headers where your code can access them. They must all reside in the same folder. VIGRA (and optionally Vc and highway) are supposed to be installed in locations where they can be found so that includes along the lines of #include <vigra/...> succeed.
While your distro's packages may be sufficient to get vspline up and running, you may need newer versions of VIGRA, Vc and highway. At the time of this writing the latest versions commonly available were Vc 1.4.2 and VIGRA 1.11.0. Vc 0.x.x will not work with vspline, and only Vc's 1.3 and 1.4 branches are compatible with vspline. Using Vc only makes sense if you are aiming for maximum performance; vspline's code autovectorizes well and oftentimes performs just about as well without Vc, but if you are writing more complex pipelines, the gains from using explicit vectorization tend to increase. Some components of vspline, like prefiltering and the genralized digital filter code, do not benefit from Vc use. For non-intel/AMD CPUs and intel/AMD CPUs with AVX512, highway is a better choice. As of this writing, highway 1.0.4 is the latest version and works well with vspline.
To compile software using vspline, I preferably use clang++. I found that clang++ often produces faster code than other compilers. For some applications, though, g++ takes the lead. You'll have to find out which works best for you.
-pthread is needed to run the automatic multithreading used for most of vspline's bulk data processing, like prefiltering or the transform routines. It's possible to disable multithreading by passing -D VSPLINE_SINGLETHREAD, in which case you can omit -pthread:
optimization is essential to get decent performance, but if you just work with a few values or have trouble debugging optimzed code, you can omit it. I have found using -Ofast usually works as well.
vspline uses the C++11 standard and will not compile with lower standard versions.
vigraimpex (VIGRA's image import/export library) is used by some of the examples, for these you'll need an additional -lvigraimpex
Please note that an executable using vectorization produced for your system may likely not work on a machine with another CPU. It's best to compile for the intended target architecture explicitly using -march... 'Not work' in this context means that it may as well crash due to an illegal instruction or wrong alignment. If you're compiling for the same system you're on, you can use -march=native (for intel/AMD CPUs). If you don't specify the architecture for the compilation, you'll likely get some lowest-common-denominator binary (like, SSE) which will not exploit your hardware optimally, but run on a wider range of systems. If you intend to build software which can run on several vector ISAs, there is a simple trick: build a separate object file for each ISA where you use a preprocessor statement which #defines vspline to be some ISA-specific value, then use a dispatcher class calling ISA-specific variants via a pure virtual base class. In lux, vspline's companion program, this technique is used and you can study it in more detail.
If you tell the compiler to do so, autovectorization will still be used even without Vc being present, and if you compile for a given target architecture, the binary may contain vector instructions specific to the architecture and will crash on targets not supporting the relevant ISA.
If you don't want to use clang++, g++ will also work. In my tests I found that clang++ produced faster code, but this may or may not be the case for specific compiler versions or taget architectures. You'll have to try it out.
vspline is free software, licensed under this license:
vspline - a set of generic tools for creation and evaluation of uniform b-splines Copyright 2015 - 2023 by Kay F. Jahnke Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This section lines out the strategies used in vspline to get from original data to an interpolated result.
When you are using vspline, your information goes through several stages. Starting out with your original data, the first step is to 'prefilter' these data to obtain 'b-spline coefficients'. vspline keeps these coefficients in a 'bspline' object. Next you obtain an 'evaluator' using these coefficients. This is a functor; you call it with (real) coordinates and receive interpolated values. While you can produce your results by repeated calls to the evaluator, vspline also offers mass evaluation, a typical example is a 'classic' remap, where you provide an array full of coordinates and receive an array full of interpolated values. Using vspline's mass evaluation methods is a good idea, because this is done very efficiently using multithreading and hardware vectorization. I'll go through these stages in detail.
'direct' interpolation uses some arithmetic construct to derive an interpolated value from several data points in the 'vicinity' of the interpolation locus. The value of this arithmetic construct collapses to the data point itself if it coincides with the interpolation locus. b-splines are not direct interpolators: the interpolated value is not directly derived from an arithmetic construct involving the data points, instead it is derived from a set of coefficients which have been produced from the data points by 'prefiltering' them. This implies a sequence of operations which has to be followed in order to use b-spline interpolation:
This two-step process has one distinct disadvantage: the coefficients have to be held in memory until all evaluations of the spline are done. While there is no way around storing the coefficients, storing the original data is not necessary to evaluate the spline. So if the original data aren't used somewhere 'further down the line', they can be discarded as soon as the coefficients are known. Discarding the original data can even be done while calculating the coefficients, and it is possible to replace the original data with the corresponding coefficients in the process of prefiltering: the operation can be done 'in place'. vspline can exploit situations where the original data won't be used after the coefficients have been produced, reducing storage requirements to just above what is needed for the original data. This can be done in two ways:
Now you might ask why a bspline object shouldn't be created by somehow transforming the raw data. This is due to the fact that vspline holds more coefficients than original data: for efficient evaluation without needs for bounds checking, a small 'frame' of additional coefficients is put around the 'core' coefficients. This frame is handled by the bspline object, and the space in memory holding the 'core' coefficients is a 'subarray' of the total amount of coefficients held by the bspline object. It is possible to 'adopt' storage holding original data with a bspline object, but this can only be done if the raw data are already surrounded by a suitably-sized frame of additional storage to accept the 'frame' after prefiltering. If you set up your data so that the 'frame' is present, you can make the spline 'adopt' the data when it is created, but you might as well create the bspline object first and populate it's core afterwards, instead of creating, populating and adpoting the data.
When you set up a bspline object, you instantiate class bspline with the data type it should use for coefficients. Oftentimes this will be the same data type as your original data. But you can use a different type for the coefficients, as long as the types are convertible. When using a different type, you can still use the overload of 'prefilter' 'sucking in' the original data: they will be converted in the process. This is especially useful if your source data are of a type with small dynamic range (like 8 bit pixels), which aren't well-suited for arithmetic operations. While you get most precise results with real data types (like float) for b-spline coefficients, you are free to operate with integral b-spline coefficients: per default, all internal calculations (prefiltering and evaluation) are done with suitably 'promoted' data types, so when using integral coefficients, you only suffer from quantization errors. You have freedom of choice here: your original data, the b-spline coefficients, the type used for internal calculations and the result type of the evaluation may all be different, while there are sensible defaults if you don't want to concern yourself with the details. The only constraint is that arithmetic has to be done with a real data type. The down side of this flexibility is parametrization: while the defaults will do 'the sensible thing', you are offered the choice to specify all types which are involved as template arguments. This can be difficult to decipher.
Note that prefiltering is highly optimized: the operation is multithreaded automatically, and it is also vectorized, provided that the data types you're using can be handled by your processor's vector units. Even when you aren't using a SIMD back-end, prefiltering will use the CPU's vector units via a process I call 'goading': The data are presented in small aggregates of vector-friendly size, and with the filter code being reasonably trivial, the operation is auto-vectorized if permitted by the given compiler flags (like -mavx2). Let me stress this point: (auto)vectorization will not happen (beyond some lowest common denominator level) if you don't specify to the compiler to use architecture-specific instructions, so the binary code will limp along using SSE instructions even on an AVX512-capable machine.
prefiltering also takes care of filling in the 'frame' of additional coefficients providing 'support' to evaluate the spline even at marginal locations, where the coefficients from the 'core' alone would not suffice. If you have a bspline object with data in it's 'core', you can tell vspline to fill in the 'frame' by calling 'brace' on the bspline object. The brace function will not modify the coeffcients in any way. Splines which do not require prefiltering (degree-0 and degree-1 splines) still need to be braced - for such splines you can call either 'brace' or 'prefilter', the effect will be the same, because vspline looks at the spline's degree and only does what is needed.
Whichever way of producing the coefficients you may use, eventually you'll have a 'bspline' object holding the coefficients in a form which can be used by vspline's evaluation mechanism. Evaluation involves two steps:
Yet again you might ask why there isn't simply an evaluation method in class bspline itself. This is due to the fact that evaluation can be done in different ways, which are specified when the 'evaluator' object is created. And when I say 'evaluator' here, I use the term for a whole set of objects capable of evaluating a b-spline, where the capabilities may go beyond what class vspline::evaluator provides. Consider, for example, the factory functions make_evaluator() and make_safe_evaluator() in eval.h. They create functors which contain vspline::evaluator objects specialized for the task at hand, and extra arithmetic on the input for 'safe' evaluators - and they return the functors as type-erased objects in a common type.
Using a two-step approach for evaluation also makes sure that no code is duplicated: the evaluator contains everything that is known at and will not change after it's creation, which is a good part of the evaluation process. The actual invocation with specific arguments only adds the location-specific aspect. This makes the process very efficient.
Once an evaluator is ready, it can be used without further specialization: it's immutable after creation and constitutes a 'pure' functor in a functional-programming sense. It is an object which can be passed around (it's trivially copyable) even between several threads. Creating an evaluator is a cheap operation, so the strategy is to create an evaluator specific to the task at hand and to use this specific evaluator for the specific task only. Evaluator objects are also lightweight, so having many evaluators does not use much memory. Evaluators refer to the coefficients, and having an evaluator for a given b-spline object does not automatically keep the coefficients 'alive'. If you destroy the bspline object, evaluators referring to the coefficients hold dangling pointers and using them will produce undefined behaviour. It's your responsibility to keep the bspline object alive until the coefficients won't be used anymore.
vspline uses RAII, and the way to create evaluators fits this scheme. Typically you'd create the evaluator in some inner scope just before it is used, letting it perish just afterwards when leaving the scope. Since you need the bspline object to create an evaluator, it's natural to have the bspline object residing in an outer scope.
There is a third aspect to the strategy used by vspline. While the obvious way of using an evaluator would be to write a loop in which the evaluator is called repeatedly, this is only a good idea if performance is not an issue. If, on the other hand, you have large amounts of data to process, you want to employ methods like multithreading and vectorization to get better performance. You are of course free to use whatever multithreading and vectorization code you like, but you might as well use what vspline offers: I call it 'wielding' code. In an evaluation context, it means applying functors to large raster data sets.
vspline offers code for the purpose which automatically multithreads and vectorizes the operation of evaluating a b-spline at many locations. You can pass an array full of coordinates and receive an array of values, an operation called 'remapping'. vspline even goes further and uses an abstraction of this process: it has code to feed large amounts of data to a functor, where the effect is that of processing all data in turn, while the operation is in fact multithreaded and vectorized automatically. The result is stored in a (possibly multidimensional) array. A 'classic' remap is merely a special case of this procedure, which, in vspline, is done by vspline::transform().
When coding geometric transformations, the task can often be reduced to formulating a function which, given a 'target' coordinate - a coordinate into the result array - can produce a 'source' coordinate, namely the locus of the interpolation. For such situations, vspline offers an overload of vspline::transform which feeds the functor with discrete target coordinates. If the functor is merely an evaluator, the result will be a 'restoration' of the original data. But if you use a functor transforming the 'incoming' coordinate and feeding the transformed coordinate to en evaluator, the result will be a geometrically transformed image, volume, etc...
Creating such functors is actually quite simple, and the examples have code to show how it's done. Have a look at gsm.cc, which produces an image holding the gradient squared magnitude of an input image. There, the target pixels are created one by one in a loop. Now look at gsm2.cc. The result is the same, but here it is obtained by coding the 'interesting' bit as a functor derived from vspline::unary_functor and the coordinate-fed overload of vspline::transform is used. You can see that the coding is quite straightforward. What is not immediately apparent is the fact that doing it this way is also much faster: the operation is automatically multithreaded and, possibly, vectorized. For a single image, the difference is not a big deal, but when you're producing images for real-time display, every millisecond counts.
My use of functors in vspline isn't as straightforward as using, say, std::function. This is due to the fact that the functors used in vspline can process both unvectorized and vectorized data. The relevant 'eval' routines provide the dual functionality either as two distinct overloads or as a template if the operation can be coded as one. This is why I had to code my own version of the subset of functional programming constructs used in vspline, rather than using some ready-made expression template code.
There is one last step which I'll mention briefly. Suppose you have a bspline object and you want to recreate your original data from the coefficients it holds. vspline offers functions called 'restore' for the purpose. Restoration will only be possible within arithmetic precision - if you are using integral data, you may suffer from quantization errors, with real data, the restored data will be very close to the original data. Restoration is done with a separable convolution, which is multithreaded and vectorized automatically, just like prefiltering or transformations, so it's also very fast.
vspline's prefiltering code, which is simply a forward-backward recursive n-pole filter, and the convolution code used for restoration, can be used to apply other filters than the b-spline prefilter or restoration, with the same benefits of multithreading and vectorization, requiring little more than the specification of the filter itself, like the convolution kernel to use - see fir.cc and iir.cc in the examples. And vspline's multithreading is also commodified so that you can use it for your own purposes with just a few lines of code, see a simple example in eval.cc, where the speed test is run with a thread pool to get a measurement of the overall evaluation performance.
How about processing 1D data like sounds or simple time series? vspline has specialized code for the purpose. While simply iterating over the 1D sequence is trivial, it's slow. So vspline's specialized 1D code uses mathematical tricks to process the 1D data with multiple threads and vectorization just as images or volumes. These specializations exist for prefiltering, transformations and restoration, which are therefore just as fast as for nD data, and, as far as the filters are concerned, even faster, since the filter has to be applied along one axis only. As a user, again you don't have to 'do' anything, the 1D code is used automatically if your data are 1D. The mathematical trickery only fails if you require 'zero tolerance' for the coefficient generation, which forces the data to be prefiltered en bloc - an option which only makes sense in rare circumstances.
vspline will accept high-dimensional data, but when dimensionality gets very high, vspline's use of 'braced' coefficient arrays has some drawbacks: with increasing dimensionality, the 'surface' of an array grows very large, and the 'brace' around the coefficients is just such a surface, possibly several units wide. So with high-D splines, the braced coefficient array is (potentially much) larger than the knot point data. You've been warned ;)
Another important specialization is code for degree-0 and degree-1 b-splines. These are old acquaintances in disguise: a degree-0 b-spline is simply another name for a nearest-neigbour interpolation, and it's degree-1 cousin encodes linear interpolation. While the 'general-purpose' b-spline evaluation code can be used to evaluate degree-0 and degree-1 b-splines, doing so wastes some potential for optimization. vspline's evaluator objects can be 'specialized' for degree-0 and degree-1 splines, and when you're using the factory functions like make_evaluator, this is done automatically. With the optimizations for low-degree splines and 1D data, I feel that the 'problem' at hand - bspline processing - is dealt with exhaustively, leaving no use scenario with second-best solutions but providing good performance throughout.
A recent addition to vspline is use of 'alternative basis functions'. Including such code in vspline widens it's scope to related interpolation methods using other than the b-spline basis functions. This is still experimental, so I ask you to refer to the comments in eval.h for now, but the code is in active use in lux, and it's been working well there for quite some time now. What may still evolve further is 'harnessing' the code to make it easier to use. Adding these capabilities was done by coding a base class 'above' class evaluator and making class evaluator a specialization, so the interface remained unchanged.
The same technique of generalization was employed for class bspline itself, which now inherits from a base class which handles 'metadata'. This is used to create type-erased splines which do not offer information about the type of the coefficients they hold. This is also work in progress, and such type-erased splines will become part of vspline 'proper' as their coding consolidates.
vspline uses vigra to handle data. There are two vigra data types which are used throughout: vigra::MultiArrayView is used for multidimensional arrays. It's a thin wrapper around the three parameters needed to describe arbitrary n-dimensional arrays of data in memory: a pointer to some 'base' location coinciding with the coordinate origin, a shape and strides. If your code does not use vigra MultiArrayViews, it's easy to create them for the data you have: vigra offers a constructor for MultiArrayViews taking these three parameters. Very similar is vigra's MultiArray, which is is a MultiArrayView owning and allocating the data it refers to.
The other vigra data type used throughout vspline is vigra::TinyVector, a small fixed-size container type used to hold things like multidimensional coordinates or pixels. This type is also just a wrapper around a small 1D C array. It's zero overhead and contains nothing else, but offers lots of functionality like arithmetic operations. I recommend looking into vigra's documentation to get an idea about these data types, even if you only wrap your extant data in them to interface with vspline. vspline follows vigra's default axis ordering scheme: the fastest-varying index is first, so coordinates are (x,y,z...). This is important to keep in mind when evaluating a b-spline. Always pass the x coordinate first. Coordinates, strides and shapes are given in units of to the MultiArrayView's value_type. So if you have a MultiArrayView of, say, float RGB pixels, a stride of one means an actual offset of ( 3 * sizeof ( float ) ) in bytes. This is important to note, since there is no standard for expressing strides - some systems use bytes, some use the elementary type. vigra uses the array's value_type for the unit step.
If you stick with the high-level code, using class bspline or the transform functions, most of the parametrization is easy. Here are a few quick examples what you can do. This is really just to give you a first idea - there is example code covering most features of vspline where things are covered in more detail, with plenty of comments. The code in this text is also there, see quickstart.cc.
Let's suppose you have data in a 2D vigra MultiArray 'a'. vspline can handle integer and real data like float and double, and also their 'aggregates', meaning data types like pixels or vigra's TinyVectors. But for now, let's assume you have plain float data. Creating the bspline object is easy:
The memory needed to hold the coefficients is allocated when the bspline object is constructed.
Obviously many things have been done by default here: The default spline degree was used - it's 3, for a cubic spline. Also, boundary treatment mode 'MIRROR' was used per default. The spline is 'braced' so that it can be evaluated with vspline's evaluation routines, and the process is automatically partitioned and run in parallel by a thread pool. The only mandatory template arguments are the value type and the number of dimensions, which have to be known at compile time.
While the sequence of operations indicated here looks a bit verbose (why not create the bspline object by a call like bspl(a) ?), in 'real' code you would use bspl.core straight away as the space to contain your data - you might get the data from a file or by some other process or do something like this where the bspline object provides the array and you interface it via a view to it's 'core':
This is a common idiom, because it reflects a common mode of operation where you don't need the original, unfiltered data any more after creating the spline, so the prefiltering is done in-place overwriting the original data. If you do need the original data later, you can also use a third idiom:
Here, the bspline object is first created with the appropriate 'core' size, and prefilter() is called with an array matching the bspline's core. This results in my_data being read into the bspline object during the first pass of the prefiltering process.
There are more ways of setting up a bspline object, please refer to class bspline's constructor. Of course you are also free to directly use vspline's lower-level routines to create a set of coefficients. The lowest level of filtering routine is simply a forward-backward recursive filter with a set of arbitrary poles. This code is in prefilter.h.
If you have b-spline coefficients handy which were already prefiltered, or which you intend to use as b-spline coefficients, regardless of how they were indeed created, keep in mind that after filling in the bspline object's 'core' you have to call 'brace' to provide support for evaluations close to the spline's borders. One common scenario is that you use the original data as coefficients and omit prefiltering. The resulting spline will lack some high-frequency content, but this may be irrelevant or even desirable. Just make sure you still call 'brace' after filling the 'core'.
Next you may want to evaluate the spline from the first example at some pair of coordinates x, y. Evaluation of the spline can be done using vspline's 'evaluator' objects. Using the highest level of access, these objects are set up with a bspline object and, after being set up, provide methods to evaluate the spline at given coordinates. Technically, evaluator objects are functors which don't have mutable state (all state is created at creation time and remains constant afterwards), so they are thread-safe and 'pure' in a functional-programming sense. The evaluation is done by calling the evaluator's eval() member function, which takes it's first argument (the coordinate) as a const reference and writes the result to it's second argument, which is a reference to a variable capable of holding the result.
Again, some things have happened by default. The evaluator was constructed from a bspline object, making sure that the evaluator is compatible. Using the factory functions make_evaluator and make_safe_evaluator, this is done automatically, and the resulting evaluator object can be stored with 'auto' type specification.
Class evaluator does provide operator(). So do all the other functors vspline can generate. This is for convenience, so you can use vspline's unary_functors like a 'normal' function:
vspline offers a limited set of functional programming constructs - as of this writing, it provides just those constructs which it uses itself, usually in factory functions. You can find the functional constructs in unary_functor.h. While it's tempting to implement more along the lines of expression templates, I have tried to keep things limited to a comfortable minimum. Most of the time user code may remain ignorant of the functional programming aspects - the functional constructs obtained from the factory functions can just be assigned to auto variables; it's usually not necessary to make these types explicit.
What about the remap function? The little introduction demonstrated how you can evaluate the spline at a single location. Most of the time, though, you'll require evaluation at many coordinates. This is what remap does. Instead of a single coordinate, you pass a whole vigra::MultiArrayView full of coordinates to it - and another MultiArrayView of the same dimension and shape to accept the results of evaluating the spline at every coordinate in the first array. Here's a simple example, using the same array 'a' as above:
This is an 'ad-hoc' remap, passing source data as an array. You can also set up a bspline object and perform a 'transform' using an evaluator for this bspline object, with the same effect:
This routine has wider scope: while in this example, ev is a b-spline evaluator, ev's type can be any functor capable of yielding a value of the type held in 'target_array' for a value held in 'coordinate_array'. Here, you'd typically use an object derived from class vspline::unary_functor, and vspline::evaluator is in fact derived from this base class. A unary_functor's input and output can be any data type suitable for processing with vspline, you're not limited to things which can be thought of as 'coordinates' etc.
This generalization of remap is named 'transform' and is similar to vigra's point operator code, but uses vspline's automatic multithreading and vectorization to make it very efficient. There's a variation of it where the 'coordinate array' and the 'target array' are the same, effectively performing an in-place transformation, which is useful for things like coordinate transformations or colour space manipulations. This variation is called vspline::apply.
There is one variation of transform(). This overload doesn't take a 'coordinate array', but instead feeds the unary_functor with discrete coordinates of the target location that is being filled in. It's probably easiest to understand this variant if you start out thinking of feeding the previous transform() with an array which contains discrete indices. In 2D, this array would contain
(0,0) , (1,0) , (2,0) ... (0,1) , (1,1) , (2,1) ... ...
Note, again, that vspline follow's vigra's default axis ordering scheme: the fastest-varying index, usually thought of as the x axis, comes first. So why would you set up such an array, if it merely contains the coordinates of every cell? You might as well create these values on-the-fly and omit the coordinate array. This is precisely what the second variant of transform does:
If you use this variant of transform directly with a vspline::evaluator, it will reproduce your original data - within arithmetic precision of the evaluation. While this is one way to restore the original data, there's also a (more efficient) routine called 'restore'.
Class vspline::unary_functor is coded to make it easy to implement functors for things like image processing pipelines. For more complex operations, you'd code a functor representing your processing pipeline - often by delegating to 'inner' objects also derived from vspline::unary_functor - and finally use transform() to bulk-process your data with this functor. This is about as efficient as it gets, since the data are only accessed once, and vspline's 'wielding' code does the tedious work of multithreading, deinterleaving and interleaving for you, while you are left to concentrate on the interesting bit, namely writing the processing pipeline code.
vspline::unary_functors are reasonably straightforward to set up; you can start out with scalar code, and you'll see that writing vectorized code with a SIMD back-end isn't too hard either - if your code doesn't need conditionals, you can often even get away with using the same code for vectorized and unvectorized operation by simply making 'eval' a function template. Please refer to the examples. vspline offers some functional programming constructs for functor combination, like feeding one functor's output as input to the next (vspline::chain) or translating coordinates to a different range (vspline::domain). There are also some constructs which help with vectorizing code by providing idioms which work for both scalar and vector code. Using such constructs oftentimes makes writing specialized vector code unnecessary.
And that's about it - vspline aims to provide all possible variants of b-splines, code to create and evaluate them and to do so for arbitrarily shaped and strided nD arrays of data. If you dig deeper into the code base, you'll find that you can stray off the default path, but there should rarely be any need not to use the high-level objects 'bspline' and 'evaluator' and the transform-like functions. The 'next step up' is writing functors inheriting from vspline::unary_functor and using them with vspline's transform-like functions.
While one might argue that the remap/transform routines I present shouldn't be lumped together with the 'proper' b-spline code, I feel that I can only make them really fast by tightly coupling them with the b-spline code. And the hardware can be exploited fully only by processing several values at once (by multithreading and vectorization). Nevertheless, I felt that the code to process nD arrays of 'xel' data with multithreaded SIMD code is useful outside of a b-spline context, and I factored it out into a new library, zimt. I intend to delegate this aspect of vspline to zimt later on - as of this writing, zimt is still experimental, even though there is a working proof-of-concept for the 'delegating back' in the 'zimt' branch of lux.
While performance will vary from system to system and between different compile(r)s, I'll quote some measurements from my own system. I include benchmarking code (like roundtrip.cc in the examples folder). Here are some measurements done with "roundtrip", working on a full HD (1920*1080) RGB image, using single precision floats internally - the figures are averages of 32 runs:
testing bc code MIRROR spline degree 3 using SIMD emulation avg 32 x prefilter:............................ 11.0 ms avg 32 x transform with ready-made bspline:.... 27.2 ms avg 32 x classic remap:........................ 39.9 ms avg 32 x restore original data: ............... 10.8 ms testing bc code MIRROR spline degree 3 using Vc avg 32 x prefilter:............................ 9.2 ms avg 32 x transform with ready-made bspline:.... 19.0 ms avg 32 x classic remap:........................ 32.2 ms avg 32 x restore original data: ............... 10.7 ms difference original data/restored data:
As can be seen from these test results, using Vc on my system speeds evaluation up a good deal. Using highway results in similar speed-up. When it comes to prefiltering, a lot of time is spent buffering data to make them available for fast vector processing. The time spent on actual calculations is much less. Therefore prefiltering for higher-degree splines doesn't take much more time:
testing bc code MIRROR spline degree 5 using Vc avg 32 x prefilter:............................ 10.3 ms testing bc code MIRROR spline degree 7 using Vc avg 32 x prefilter:............................ 10.7 ms
Using double precision arithmetics, vectorization doesn't help so much. Note that it is entirely possible to work in long double, but since these operations can't be vectorized in hardware, this is 'slow'. vspline will - by default - use vector code for all operations - if hardware vectorization can't be used, vspline will use code to emulate the hardware vectorization, and use data types which are compatible with 'proper' SIMD data. This way, having stages in a processing pipeline using types which can't be vectorized in hardware is no problem and will not force the whole pipeline to be run in scalar code. To have vspline use scalar code, you can fix the vectorization width at 1, and at times this may even produce faster code. Again you'll have to try out what best suits your needs.
You can probably do everything vspline does with other software - there are several freely available implementations of b-spline interpolation and remap/transform routines. What I wanted to create was an implementation which was as general as possible and at the same time as fast as possible, and, on top of that, comprehensive and easy to use.
These demands are not easy to satisfy at the same time, but I feel that my design comes close. While generality is achieved by generic programming, speed needs exploitation of hardware features, and merely relying on the compiler is not enough. The largest speedup I saw was from multithreading the code. This may seem like a trivial observation, but my design is influenced by it: in order to efficiently multithread, the problem has to be partitioned so that it can be processed by independent threads. You can see the partitioning both in prefiltering and later in the transform routines. vspline uses a simple 'ticket distribution scheme': all threads access an atomic integral variable from which they 'pull numbers', which - due to the source being an atomic - are guaranteed to be unique and in a defined sequence. The individual threads all run the same 'payload code' which 'makes sense' of the acquired numbers and 'grabs' a share of the work load, like a line from an image. I also call this mode of operation 'atomic surfing'. It has very low overhead.
The other speedup method is data-parallel processing. This is often thought to be the domain of GPUs, but modern CPUs also offer it in the form of vector units. I chose implementing data-parallel processing on the CPU, as it offers tight integration with unvectorized CPU code. It's almost familiar terrain, and the way from writing conventional CPU code to vector unit code is not too far. Using horizontal vectorization does require some rethinking, though - mainly a conceptual shift from an AoS to an SoA approach. vspline doesn't use vertical vectorization at all, so the code may look odd to someone looking for vector representations of, say, pixels: instead of finding SIMD vectors with three elements, there are structures of three SIMD vectors of 'vsize' elements each. I chose to code so that vectorization is manifest in the program's structure. The specific mode of vectorization - emulation or explicit vectorization - can be chosen at compile time: while I implemented the vectorized code, I noticed that vectorization is, to a high degree, something that expresses itself in the code's structure: the data have to be 'presented' as SoA of vector-friendly size. If this is done, use of explicit vector code may not even be necessary: the structure of the data flow implies vectorization, and if the implicit vectorization is expressed in a way the compiler can 'understand', it will result in vector code produced by autovectorization.
To use vectorized evaluation efficiently, incoming data have to be presented to the evaluation code in vectorized form, but usually they will come from interleaved memory. Keeping the data in interleaved memory is even desirable, because it preserves locality, and usually processing accesses all parts of a value (i.e. all three channels of an RGB value) at once. After the evaluation is complete, data have to be stored again to interleaved memory. The deinterleaving and interleaving operations take time and the best strategy is to load once from interleaved memory, perform all necessary operations on deinterleaved, vectorized data and finally store the result back to interleaved memory. The sequence of operations performed on the vectorized data constitute a processing pipeline, and some data access code will feed the pipeline and dispose of it's result. vspline's unary_functor class is designed to occupy the niche of pipeline code, while remap, apply and transform provide the feed-and-dispose code - a task which I like to call 'wielding'. So with the framework of these routines, setting up vectorized processing pipelines becomes easy, since all the boilerplate code is there already, and only the point operations/operations on single vectors need to be provided by deriving from unary_functor.
Using all these techniques together makes vspline fast. The target I was roughly aiming at was to achieve frame rates of ca. 50 fps in RGB and full HD, producing the images via transform from a precalculated coordinate array. On my system, I have reached that goal - my transform times are around 20 msec (for a cubic spline), and with memory access etc. I come up to frame rates close to what I was aiming at. My main testing ground is lux, my image and panorama viewer. Here I can often take the spline degree up to two (a quadratic spline) and still have smooth animation in full HD. Using lux has another benefit: it makes it possible to immediately see the results of vspline's operation. If anything is amiss, it'll likely be visible.
Even without using an exlicit SIMD back-end, the code is certainly fast enough for most purposes. This way, vigra becomes the only dependency.
There is a large amount of literature on b-splines available online. Here's a pick:
http://bigwww.epfl.ch/thevenaz/interpolation/
http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/B-spline/bspline-basis.html
http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/B-spline/bspline-ex-1.html