For profiling I use callgrind.
AFAIK it is the most accurate, as it simulates a processor and counts
executed instructions (and, with cache simulation enabled, estimates cycles).

As others pointed out, my playout code is fairly lightweight. In the
version that showed the 40%, it only checked whether a cross is empty. I
have since added a super-ko check, which cost about 10% in playouts per
second (pps). I'm now looking into doing full validity checks. It
currently does roughly 20-30k pps on one core (an HT sibling).
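
To make the two checks concrete, here is a minimal sketch in Python of an empty-cross test plus a positional super-ko test based on Zobrist hashing. It is not my actual playout code: the names (Position, violates_superko, the 9x9 size) are made up for illustration, and it assumes the caller already knows which stones a move captures.

```python
import random

SIZE = 9                      # assumed board size for this sketch
EMPTY, BLACK, WHITE = 0, 1, 2

# One fixed random 64-bit key per (cross, colour). XOR-ing the keys of all
# stones on the board gives a hash that identifies the whole position.
_rng = random.Random(0)
ZOBRIST = [[_rng.getrandbits(64) for _ in range(3)] for _ in range(SIZE * SIZE)]


class Position:
    """Hypothetical minimal board: stones, a running hash, and hash history."""

    def __init__(self):
        self.board = [EMPTY] * (SIZE * SIZE)
        self.hash = 0
        self.seen = {0}       # hashes of every position so far (empty board = 0)

    def is_empty(self, cross):
        # The cheapest possible legality test: is there no stone here?
        return self.board[cross] == EMPTY

    def hash_after(self, cross, colour, captured=()):
        # Hash of the position that would result from the move:
        # add the new stone, remove every captured stone.
        h = self.hash ^ ZOBRIST[cross][colour]
        for c in captured:
            h ^= ZOBRIST[c][self.board[c]]
        return h

    def violates_superko(self, cross, colour, captured=()):
        # Positional super-ko: the move may not recreate an earlier position.
        return self.hash_after(cross, colour, captured) in self.seen

    def play(self, cross, colour, captured=()):
        new_hash = self.hash_after(cross, colour, captured)
        for c in captured:
            self.board[c] = EMPTY
        self.board[cross] = colour
        self.hash = new_hash
        self.seen.add(new_hash)
```

The set lookup per candidate move is cheap, but it is still extra work on every move of every playout, which is one plausible reason such a check shows up in the pps numbers.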

> 40% sounds pretty high. Are you sure it's not an artefact of your profiling implementation?
> I prefer not to instrument, but to sample stack traces. You can do this using gdb by pressing Ctrl-C and then typing bt. Do this 10 times, and look for the parts of the stack that occur often.
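
The manual sampling described in the quote could be tallied with a small script. The sketch below is only an illustration in Python: it assumes each sample is the text of one gdb `bt` output pasted as a string, and that frame lines look like `#0  0x... in func (...)` or `#1  func (...)`. The helper name tally_frames is hypothetical.

```python
import re
from collections import Counter

# Matches a gdb backtrace frame line and captures the function name, e.g.
#   "#0  0x00005555 in is_superko (pos=...) at playout.c:42"
#   "#1  play_move (pos=...) at playout.c:80"
FRAME = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+)")


def tally_frames(samples):
    """Count in how many samples each function appears on the stack."""
    counts = Counter()
    for sample in samples:
        functions = set()     # count a function once per sample, even if recursive
        for line in sample.splitlines():
            match = FRAME.match(line.strip())
            if match:
                functions.add(match.group(1))
        counts.update(functions)
    return counts
```

Calling `tally_frames(samples).most_common()` then lists the functions that were on the stack in the most samples. With only 10 samples the numbers are coarse, so this is good for spotting big hot spots rather than small differences.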

