Pipelining numerical computation
Apr. 11th, 2006 @ 11:49 am
How do you efficiently vectorize std::valarray?

I'm not sure how common this concept is, but it's something I've used for a couple of years now for Fun and Profit. A while back I wanted to write my own implementation of std::valarray that could use SIMD instructions to improve performance. Given a compiler with SIMD intrinsics (I used gcc at the time) there is a fairly straightforward direct implementation: you simply implement the mathematical operations by looping over the data, loading it into SIMD registers and performing your computation. For simple code such as

a = b + c;
this is fine. But what about a more complex expression? To use a trivial example, how about just evaluating a quadratic over a large amount of data?
y = a * x * x + b * x + c;
In these more complex expressions the trivial implementation falls flat on its face: every operator produces a full-sized temporary array, so your program ends up wasting a ton of time copying data between memory and registers unnecessarily. If we consider our data in tabular form...

a1 x1 x1 b1 x1 c1
a2 x2 x2 b2 x2 c2
a3 x3 x3 b3 x3 c3
...

the trivial implementation is evaluating the expression by computing down pairs of columns, storing the intermediate values in memory each time. Ideally what we want to do is perform the computation in groups of rows, allowing us to keep intermediate values in registers and cut memory traffic to a bare minimum.
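To make that concrete, the trivial implementation of a single operator looks roughly like this (a minimal sketch, not my actual code; simd_load, simd_add and simd_store stand in for whatever intrinsic wrappers your platform provides):

// Trivial SIMD operator+: every operator allocates a full-sized
// temporary and streams it through memory.  Evaluating
// a * x * x + b * x + c this way materialises a temporary per
// operator -- five in all for the quadratic.
template <class T>
valarray<T> operator + (const valarray<T>& lhs, const valarray<T>& rhs)
{
  valarray<T> result(lhs.size());

  for (unsigned int i = 0; i < lhs.size(); i += BlockSize)
    simd_store(&result[i],
               simd_add(simd_load(&lhs[i]), simd_load(&rhs[i])));

  return result;
}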

My solution to this is what I termed 'pipelining'. Instead of directly evaluating the data, we delay computation: the expression is used to construct a functor which describes the calculation we want to perform, creating a pipeline. The computation then occurs only on demand (i.e. when you attempt to take its value), at which point data can be fed into the pipeline in chunks. Each of these SIMD-register-sized chunks can then be fully evaluated before moving on to the next.

The basis of the code is to encapsulate raw data in an object like:

template <class T, unsigned int BlockSize>
struct _vec_data : public _vec_expr_comp<T, BlockSize>
{
  const unsigned int size;
  _vec_block<T, BlockSize> *const data;

  _vec_data(unsigned int s) :
    size(s), data(new _vec_block<T, BlockSize>[s])
  {}

  _vec_data(const _vec_data& o) :
    size(o.size), data(o.data)
  {}

  _vec_block<T, BlockSize> val(unsigned int i)
  {
    return data[i];
  }
};
Where _vec_block is a simple structure containing a single block of data (sized for your SIMD registers).
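A minimal sketch of what _vec_block might look like with gcc's vector extensions, specialising the SSE-sized float case (the union gives the same block both a 'memory' view and a SIMD-register view):

// Generic block: plain memory.
template <class T, unsigned int BlockSize>
struct _vec_block
{
  T data[BlockSize];
};

// Specialisation for four packed floats (one SSE register).  The
// union lets the block be read as an ordinary array or as a magic
// variable that's really a SIMD register.
typedef float v4sf __attribute__ ((vector_size (16)));

template <>
struct _vec_block<float, 4>
{
  union {
    float data[4];  // normal 'memory' view
    v4sf  simd;     // SIMD register view
  };
};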

Then you implement your expression node types, one binary and one unary:

template <class T, unsigned int BlockSize, class Op, class P1, class P2>
struct _vec_expr_bin : public _vec_expr_comp<T, BlockSize>
{
  const Op oper;
  const P1 param1;
  const P2 param2;

  _vec_expr_bin(P1 p1, P2 p2) :
    oper(), param1(p1), param2(p2)
  {}

  _vec_block<T, BlockSize> val(unsigned int i)
  {
    return oper(param1.val(i), param2.val(i));
  }
};

template <class T, unsigned int BlockSize, class Op, class P>
struct _vec_expr_un : public _vec_expr_comp<T, BlockSize>
{
  const Op oper;
  const P param;

  _vec_expr_un(P p) :
    oper(), param(p)
  {}

  _vec_block<T, BlockSize> val(unsigned int i)
  {
    return oper(param.val(i));
  }
};
_vec_expr_comp is where the magic occurs. It's also going to be a very large class, so I won't write the whole thing out. (Note the operators are friend functions: a binary operator needs two arguments, so it can't be an ordinary member.)
template <class T, unsigned int BlockSize>
struct _vec_expr_comp
{
  ...

  template<class P1, class P2>
  friend _vec_expr_bin<T, BlockSize, VecAddOp, P1, P2> operator + (const P1& p1, const P2& p2)
  {
    return _vec_expr_bin<T, BlockSize, VecAddOp, P1, P2>(p1, p2);
  }

  template<class P>
  friend _vec_expr_un<T, BlockSize, VecUnaryMinus, P> operator - (const P& p)
  {
    return _vec_expr_un<T, BlockSize, VecUnaryMinus, P>(p);
  }

  ...
};
Repeat those definitions for every operator you want to implement (take similar steps for other mathematical functions you'd like to support). The functors themselves, VecAddOp and friends, are thin wrappers around the machine-specific operations; a sketch follows, and after that we move on to the actual valarray-style class.
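Given the union-based _vec_block sketched earlier, with gcc's vector extensions the whole-block arithmetic is a one-liner (again a hedged sketch, specialised for the float/4 case):

// Add functor: one whole-block operation through the SIMD view.
// With optimisation on, this should boil down to a single addps.
struct VecAddOp
{
  _vec_block<float, 4> operator () (const _vec_block<float, 4>& a,
                                    const _vec_block<float, 4>& b) const
  {
    _vec_block<float, 4> r;
    r.simd = a.simd + b.simd;
    return r;
  }
};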
template <class T>
class valarray : public _vec_expr_comp<T, machine::BlockSize>
{
  ...

  template<class Expr>
  valarray& operator = (const Expr& vexpr)
  {
    // do a test to make sure Expr is derived from _vec_expr_comp if you want

    for (unsigned int i = 0; i < size() / machine::BlockSize; i++)
    {
      _vec_block<T, machine::BlockSize> block = vexpr.val(i);

      for (unsigned int j = 0; j < machine::BlockSize; j++)
        data[i * machine::BlockSize + j] = block.data[j];
    }

    return *this;
  }

  ...
};
There are other evaluation operators that need to be implemented, such as *= and similar operators, and a cast operator on _vec_expr_comp so that intermediate expressions can be treated as instances of the valarray class.
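For instance, *= can be written inside valarray purely in terms of the machinery above (a sketch, assuming an operator * defined the same way as operator + earlier):

  // Compound assignment: build a multiply expression against
  // ourselves and reuse the block-wise evaluation in operator =.
  template <class Expr>
  valarray& operator *= (const Expr& vexpr)
  {
    return *this = *this * vexpr;
  }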

Obviously, the machine structure or namespace holds the machine-specific code, and I won't elaborate on that. Also, you need to be a little more careful than I've been in these examples when you reach the end of your data, as your last chunk may not be register-sized; in that case you may need to take precautions to avoid mishaps such as accidental divide-by-zero errors. Finally, you can't produce an exact valarray implementation with this technique, but you can create an almost-alike implementation. Slices are tricky to get right!
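One way to take those precautions (a hedged sketch; padding with T(1) is just one safe filler choice, not part of the original design) is to round allocations up to a whole number of blocks and pad the tail:

// Round an element count up to a whole number of blocks.
inline unsigned int blocks_for(unsigned int n, unsigned int block)
{
  return (n + block - 1) / block;
}

// Fill the unused tail of the last block with a harmless value so
// that, e.g., a vectorized divide never traps on uninitialised data.
template <class T, unsigned int BlockSize>
void pad_tail(_vec_block<T, BlockSize>* blocks, unsigned int n_elems)
{
  const unsigned int total = blocks_for(n_elems, BlockSize) * BlockSize;

  for (unsigned int i = n_elems; i < total; i++)
    blocks[i / BlockSize].data[i % BlockSize] = T(1);
}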

The end result, as stated above, is that we can perform the computation one chunk at a time, keeping intermediate values in registers and managing memory far more efficiently. This isn't my actual code, but on my 2GHz P4 I've found that my pipelined/vectorized valarray is between 4 and 12 times faster than the stock gcc valarray, depending on the specific expression.

Anyway, I hope you found this interesting! It was certainly a fun challenge trying to figure out how to do it.

-jep.

P.S. Don't forget to run with your compiler optimisations on high when you do this and spread inline keywords liberally. It needs heavy function inlining to be efficient.

( X-posting in cpp )

Comments

From:xil
Date:April 11th, 2006 05:12 pm (UTC)
It is certainly an interesting method. Actually, it sounds very boost-ish to me; you might consider forming a generic C++-only (as in, non-machine-specific) implementation to go along with your machine-specific version, and submitting it to boost.

Outta curiosity, what was the type of x here?:
y = a * x * x + b * x + c;

I'm assuming a, b, c are numeric types, and y is the valarray. Would x be the source valarray? Or would it be a predefined, basically typeless variable, like the _1..._9 in boost::lambda? I would assume the former, since, if it were the latter, it'd make more sense to use a different operator than =.

From:ringzero
Date:April 11th, 2006 05:27 pm (UTC)
Hmm, maybe I will when I get the chance. My original implementation could do with being rewritten.

I was thinking x and y would be valarrays, and a, b and c would either be constants or valarrays. boost::lambda was some of the inspiration behind this (although more goes to a fiendishly clever library written by a friend of mine), but I'm not sure I follow what you mean by using a different operator.

From:xil
Date:April 11th, 2006 06:12 pm (UTC)
Nah, the way you've got it makes more sense with =.

Actually, I guess what I was saying (the whole thing with the different operator, and using placeholders) would be reimplementing boost::lambda, so that'd be kinda pointless. Your way makes more sense.

Good job. ^_^

From:ringzero
Date:April 11th, 2006 07:35 pm (UTC)
Thanks! I think the technique of using templates to construct custom functors is way underused. Sure, the functor stuff gets really ugly, but it makes the rest of your code really clean.

I'm a big believer in concentrating mess (never look in my closet).
From:neteraser
Date:April 11th, 2006 05:29 pm (UTC)
Could you post the produced asm code please :)

I just CAN'T believe it to be "addps"; it must be something like "addss" [x4], which is yet another gay code imp.

The only way to make it as fast as it really can be is to produce an operation in the unrolled constructor (not in operator() etc.):

template <...>
struct unrolled_operator {
  unrolled_operator(...) : unrolled_operator(...) {
    ... // add i-th
  }
};

But this worked only for IntelC 8.0 and still had a lot of problems.

So, the only solution is to use intrinsics. Just forget it: no way to vectorize it. Or is there...? :)

From:ringzero
Date:April 11th, 2006 07:33 pm (UTC)
Well, the actual version does use intrinsics: the functors VecAddOp and VecUnaryMinus would be implemented using vector intrinsics. I haven't used Intel since version 6, so I'm not sure about its intrinsics.

In gcc you play with your SIMD registers by using a union, one element of the union being your normal 'memory' array, and one being a magic variable that's really a SIMD register. The compiler automatically manages loading and unloading SIMD registers and register allocation as you use the union. In gcc that's what _vec_block is. Then VecAddOp and the other operators are just wrappers around the platform-specific intrinsics.

When the code compiles, the big chain of functors and intrinsic wrappers is inlined, leaving just a loop surrounding a chain of intrinsics representing your calculation. The compiler can then allocate registers sanely and produce the required SSE3/AltiVec/whatever code.

The produced assembly isn't particularly beautiful to look at, but it's much easier than writing the intrinsics out manually. :)

I'll dig out the code later once I fix my Debian install and post some example assembly listings.

From:aruslan
Date:April 11th, 2006 09:53 pm (UTC)
Khm :)
From:neteraser
Date:April 12th, 2006 08:49 am (UTC)
std::hello.

From:omnifarious
Date:April 17th, 2006 03:41 pm (UTC)

which is yet another gay code imp

I was unaware that code could either be lighthearted and carefree or have a sexual orientation. Please enlighten me on this concept. Or, it's quite possible you're this kid, who thinks everything is gay.


From:sdt
Date:April 11th, 2006 11:56 pm (UTC)
That sort of thing is a lot of fun. Well, a lot of fun if you're a C++ nurd like we are.

Seen blitz++? It's all about this:

http://www.oonumerics.org/blitz/

Twas even written by a Waterloo alumnus. MTL also does similar stuff: http://www.osl.iu.edu/research/mtl/

Sh actually works on a similar principle, except it collects/compiles the expressions at run-time, and thus can map them to different processors from the host compiler. http://libsh.org/wiki/index.php/How_Sh_Works

From:ringzero
Date:April 12th, 2006 12:49 pm (UTC)
Sh actually works on a similar principle, except it collects/compiles the expressions at run-time, and thus can map them to different processors from the host compiler.

Where'd you think I got the idea from in the first place? :P

From:sdt
Date:April 12th, 2006 03:02 pm (UTC)
Hah. Microsoft also has something called Accelerator: http://research.microsoft.com/research/pubs/view.aspx?type=technical%20report&id=1040

It's basically Sh-in-C#, although instead of using an explicit retained mode they implicitly store computations based on certain typed arrays and then later eval() them, along the same lines as your code (and blitz, etc.).

Using an explicit eval (or an explicit retained mode) also has the advantage that it all works over multiple assignment statements. But it's not as convenient in terms of "just looking like regular old code".

From:ringzero
Date:April 16th, 2006 07:21 pm (UTC)
I'm not sure what you mean by explicit retained mode. Elaborate?