`std::valarray`

I'm not sure how common this concept is, but it's something I've used for a couple of years now for Fun and Profit. A while back I wanted to write my own implementation of `std::valarray` that could use SIMD instructions to improve performance. Given a compiler with SIMD intrinsics (I used gcc at the time), there is a fairly straightforward direct implementation: you implement the mathematical operations by looping over the data, loading it into SIMD registers, and performing your computation. Now, for simple code such as

```cpp
a = b + c;
```

this is fine. But what about a more complex expression? To use a trivial example, how about just evaluating a quadratic over a large amount of data?
```cpp
y = a * x * x + b * x + c;
```

In these more complex expressions the trivial implementation falls flat on its face, and your program ends up wasting a ton of time copying data between memory and registers unnecessarily. If we consider our data in tabular form...

| a | x | x | b | x | c |
|-----|-----|-----|-----|-----|-----|
| a_1 | x_1 | x_1 | b_1 | x_1 | c_1 |
| a_2 | x_2 | x_2 | b_2 | x_2 | c_2 |
| a_3 | x_3 | x_3 | b_3 | x_3 | c_3 |
| ... | ... | ... | ... | ... | ... |

the trivial implementation is evaluating the expression by computing down pairs of columns, storing the intermediate values in memory each time. Ideally what we want to do is perform the computation in groups of rows, allowing us to maintain intermediate values in registers and cutting down memory use to a bare minimum.

My solution to this is what I termed 'pipelining'. Instead of evaluating expressions directly, we delay computation: the expression is used to construct a functor which describes the calculation we want to perform, thus creating a pipeline. The computation then occurs only on demand (i.e., when you attempt to take its value), at which point data can be fed into the pipeline in chunks. Each of these SIMD-register-sized chunks can then be fully evaluated before moving on to the next.

The basis of the code is to encapsulate raw data in an object like:

```cpp
template <class T, unsigned int BlockSize>
struct _vec_data : public _vec_expr_comp<T, BlockSize> {
    const unsigned int size;
    _vec_block<T, BlockSize> *const data;

    _vec_data(unsigned int s)
        : size(s), data(new _vec_block<T, BlockSize>[s]) {}
    // Copies share the underlying storage rather than duplicating it.
    _vec_data(const _vec_data& o) : size(o.size), data(o.data) {}

    _vec_block<T, BlockSize> val(unsigned int i) { return data[i]; }
};
```

where `_vec_block` is a simple structure containing a single block of data (sized to match your SIMD registers).
Then, you implement your operator nodes:

```cpp
template <class T, unsigned int BlockSize, class Op, class P1, class P2>
struct _vec_expr_bin : public _vec_expr_comp<T, BlockSize> {
    const Op oper;
    const P1 param1;
    const P2 param2;

    _vec_expr_bin(P1 p1, P2 p2) : oper(), param1(p1), param2(p2) {}

    _vec_block<T, BlockSize> val(unsigned int i) {
        return oper(param1.val(i), param2.val(i));
    }
};

template <class T, unsigned int BlockSize, class Op, class P>
struct _vec_expr_un : public _vec_expr_comp<T, BlockSize> {
    const Op oper;
    const P param;

    _vec_expr_un(P p) : oper(), param(p) {}

    _vec_block<T, BlockSize> val(unsigned int i) {
        return oper(param.val(i));
    }
};
```

`_vec_expr_comp` is where the magic happens. It's also going to be a very large class, so I won't write the whole thing out:

```cpp
template <class T, unsigned int BlockSize>
struct _vec_expr_comp {
    ...
    // Declared as friends so they work as free binary/unary operators
    // and are found via argument-dependent lookup on derived classes.
    template <class P1, class P2>
    friend _vec_expr_bin<T, BlockSize, VecAddOp, P1, P2>
    operator + (const P1& p1, const P2& p2) {
        return _vec_expr_bin<T, BlockSize, VecAddOp, P1, P2>(p1, p2);
    }

    template <class P>
    friend _vec_expr_un<T, BlockSize, VecUnaryMinus, P>
    operator - (const P& p) {
        return _vec_expr_un<T, BlockSize, VecUnaryMinus, P>(p);
    }
    ...
};
```

Repeat those operators for every one you want to implement (and take similar steps for any other mathematical functions you'd like to support). Finally, we move on to the actual valarray-style class:

```cpp
template <class T>
class valarray : public _vec_expr_comp<T, machine::BlockSize> {
    ...
    template <class E>
    valarray& operator = (const E& vexpr) {
        // Do a test to make sure E is derived from _vec_expr_comp if you want.
        for (unsigned int i = 0; i < size() / machine::BlockSize; i++) {
            _vec_block<T, machine::BlockSize> block = vexpr.val(i);
            for (unsigned int j = 0; j < machine::BlockSize; j++)
                data[i * machine::BlockSize + j] = block.data[j];
        }
        return *this;
    }
    ...
};
```

There are other evaluation operators that need to be implemented, such as `*=` and the like, plus a cast operator on `_vec_expr_comp` so that intermediate expressions can be treated as instances of the valarray class.
Obviously, the `machine` structure or namespace holds the machine-specific code, and I won't elaborate on that. Also, you need to be a little more careful than I've been in these examples when you reach the end of your data, as your last calculation may not be register-sized; in that case you may need to take precautions to avoid mishaps such as accidental divide-by-zero errors. And note that you can't produce an exact valarray implementation with this technique, but you can create an almost-alike implementation. Slices are tricky to get right!

The end result, as stated above, is that we are able to perform computations one sub-chunk at a time, letting us manage memory far more efficiently. This isn't my actual code, but on my 2GHz P4 I've found that my pipelined/vectorized valarray is between 4 and 12 times faster than the stock gcc valarray, depending on the specific expression.

Anyway, I hope you found this interesting! It was certainly a fun challenge trying to figure out how to do it.

-jep.

P.S. Don't forget to run with your compiler optimisations on high when you do this, and spread `inline` keywords liberally; the technique needs heavy function inlining to be efficient.

( X-posting in **cpp** )