Branchless Code

Posted by Thoughts and Ramblings on Tuesday, March 20, 2007

Apologies if this is technical, but since I learned something in this experiment, I thought I would write it up:

I remember going to a seminar a few weeks ago delivered by a faculty candidate which discussed the idea of how to get around the branch mis-prediction problem present in microprocessors. Anyway, her idea was to use predicated statements to make the instructions conditional on a comparison which may not have been completed yet. Essentially, when faced with a decision between two routes, the processor starts computation on both before it is even sure which it should take. Then, when the decision is known, it can proceed to throw out the results from the task it should not have started and only keep the results from the other. Her research mentioned using the processor’s internal register renaming system to prevent the results from the two colliding. There is a problem of both processor support as well as compiler support, as well as the fact that the active power in the processor is now increased. From the first two issues, I doubt we will ever see this come to light, especially when such things are already possible:

Recently, I rewrote a few C functions in FFmpeg’s H.264 video decompresser in Altivec. This way, instead of the processor working on one value at a time, it can work on 16, thus greatly increasing its speed. The only issue is the computation may take one route for some elements, and another route for others. So, how do you handle both situations when you can only take one at a time? The answer is to do both, and then use masks to keep a certain result.

So, to test if (x > y), just do something like: mask = vec_cmpgt(x, y); and multiple ifs, like (x > y && y > z) can be done like mask = vec_and(vec_cmpgt(x, y), vec_cmpgt(y, z)); Then, when doing your operation, simply keep the mask, and use it to determine if you commit the result. Simple, eh ;)

Anyway, as a result of writing this code, not only does it process more values at a time, but it also does it without any branches, thus not creating branch mis-predictions as well as doing memory operations in larger chunks, which is more efficient. Now the question is, if it were rewritten to simply use branchless code instead of being vectorized, would it have been faster, or is the branch predictor good enough in processors to have not made a difference? I suppose I just found an excellent test for that faculty candidates research.

Legacy Comments:

Alan Humpherys - Mar 22, 2007

The best place to implement this is within the hardware itself. The CPU designs I worked on at Amdahl in the late 80’s had this implemented in hardware. There were limitations, but we had a 6 stage execution pipeline where there were duplicates of registers and Arithmetic Units for 3 stages which gave sufficient time for the original conditional expression to complete evaluation so that the correct branch/nobranch sequence could be followed. As with all good things we did in the “dinosaur Mainframe” era, they will be “rediscovered” and put into a PC and the world will be amazed…. Virtualization… Been there - done that - 1986 (And we did it so well that the CIA approved our computers to run top level security clearance programs in one virtual mainframe, while potentially hostile code (obtained from inteligence operations) could be run in another virtual mainframe, and the 2 were guaranteed not to be able to interfere with each other… Something not even remotely possible on the pathetic PC hardware we are dealing with today.)

Greg Davies - May 3, 2008

I heard something about branchless programming before, but it was explained differently to me. Basically, boolean logic was altered by the compiler and linker to compute the address to jump to. That way you never do a comparison at all, just an unconditional jump.

Graham Booker - May 3, 2008

Eh, computing an address for a jump is really not much different from a branch. However, processors have branch prediction logic, so changing a branch to an address computation and a jump could potentially be much worse than a branch. The only case where such a change won’t cause a pipeline stall is when the address is computed well beforehand, in which case the branch wouldn’t stall either. There are actually no jumps in the code I mentioned above. It is essentially computing both paths from a branch, and then using boolean logic to determine which result is committed, and which is thrown away.