simd - Can multiple processes hide the latency of SSE instructions?
I'm in need of high-performance merging and came across "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" by Jatin Chhugani et al.
Their aim is to get the most performance out of a single CPU, and one part of their solution uses a bitonic sorting network at the SIMD level. To hide the latency of the min/max and shuffle operations, they perform 4 sorting networks simultaneously (though I think they mean interleaved). This gives a claimed 3.25x increase in performance.
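For reference, here is a minimal sketch of what interleaving independent bitonic networks can look like, in C with SSE intrinsics (x86 only). It assumes 32-bit float keys and implements the basic step of merging two ascending runs of 4 into an ascending run of 8; `NNET`, `merge4x4_interleaved`, `cmp_dist2` and `cmp_dist1` are hypothetical names, and this is not the paper's code. Each network is a chain of dependent min/max/shuffle steps, so the sketch issues one level for all `NNET` networks before any network moves to the next level, which gives the core independent instructions to execute while earlier results are still in flight.

```c
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_min_ps, _mm_max_ps, _mm_shuffle_ps */

/* Hypothetical: number of independent networks processed in lock-step. */
#define NNET 4

/* Comparator level at distance 2 within one register:
   lanes (0,2) and (1,3) are compared; mins go to lanes 0-1, maxes to lanes 2-3. */
static inline __m128 cmp_dist2(__m128 v) {
    __m128 t  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)); /* [v2 v3 v0 v1] */
    __m128 mn = _mm_min_ps(v, t);
    __m128 mx = _mm_max_ps(v, t);
    return _mm_shuffle_ps(mn, mx, _MM_SHUFFLE(1, 0, 1, 0));    /* [mn0 mn1 mx0 mx1] */
}

/* Comparator level at distance 1: lanes (0,1) and (2,3) are compared. */
static inline __m128 cmp_dist1(__m128 v) {
    __m128 t  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));   /* [v1 v0 v3 v2] */
    __m128 mn = _mm_min_ps(v, t);
    __m128 mx = _mm_max_ps(v, t);
    __m128 s  = _mm_shuffle_ps(mn, mx, _MM_SHUFFLE(2, 0, 2, 0)); /* [mn0 mn2 mx0 mx2] */
    return _mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 1, 2, 0));        /* [mn0 mx0 mn2 mx2] */
}

/* Merge NNET independent pairs of ascending 4-float runs.
   a and b hold NNET*4 floats (pair k at offset 4*k); out receives NNET*8 floats. */
void merge4x4_interleaved(const float *a, const float *b, float *out) {
    __m128 lo[NNET], hi[NNET];

    /* Level 1: compare a with reversed b. The 4 smallest values land in lo,
       the 4 largest in hi, and each of lo/hi is a bitonic sequence. */
    for (int k = 0; k < NNET; k++) {
        __m128 va = _mm_loadu_ps(a + 4 * k);
        __m128 vb = _mm_loadu_ps(b + 4 * k);
        vb = _mm_shuffle_ps(vb, vb, _MM_SHUFFLE(0, 1, 2, 3)); /* reverse lanes */
        lo[k] = _mm_min_ps(va, vb);
        hi[k] = _mm_max_ps(va, vb);
    }
    /* Level 2: the same step is issued for every network before any network
       advances, so neighbouring instructions do not depend on each other. */
    for (int k = 0; k < NNET; k++) {
        lo[k] = cmp_dist2(lo[k]);
        hi[k] = cmp_dist2(hi[k]);
    }
    /* Level 3: final comparators; lo and hi are now each sorted. */
    for (int k = 0; k < NNET; k++) {
        lo[k] = cmp_dist1(lo[k]);
        hi[k] = cmp_dist1(hi[k]);
    }
    for (int k = 0; k < NNET; k++) {
        _mm_storeu_ps(out + 8 * k, lo[k]);
        _mm_storeu_ps(out + 8 * k + 4, hi[k]);
    }
}
```

An out-of-order core can overlap independent chains on its own when they fit in its instruction window, so the explicit lock-step ordering mainly makes the independence obvious to the compiler and hardware; whether it pays off on a given CPU is something to measure.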
My problem is more relaxed: I have multiple pairs of arrays that need to be processed (read: they are independent), so I can run multiple processes and gain higher throughput.
But if I oversubscribe the cores, running more processes than there are available cores, will that hide the latency as well? That is, can the same effect be induced at a higher level? Or am I treading in the realm of hyperthreading here, where I'll never get past the limit of 2 processes sharing the same functional units in a CPU core?
I will of course try it, but changing the existing code is rather involved, so I'd like to hear the theories first.
No, threading is not an effective solution for pipeline bubbles. The granularity doesn't fit: a context switch takes hundreds of cycles, whereas the kind of stall caused by a naive implementation of bitonic sorting comes in 2-4 cycle pieces.
With that said, since it's not clear what your use case is or what the bottleneck will turn out to be, multiprocessing may still help. There's only one way to find out.
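One way to run that experiment is a small POSIX harness that forks a chosen number of worker processes, has each do a fixed number of independent merges, and times the whole batch. This is a minimal sketch under stated assumptions: `run_workers`, `worker` and `merge_ints` are hypothetical names, and `merge_ints` is a plain scalar merge standing in for the real SIMD merge, which would be swapped in for an actual measurement.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime with strict -std flags */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the real per-pair work: a scalar merge of
   two ascending int arrays. Replace with the actual SIMD merge when measuring. */
static void merge_ints(const int *a, size_t na, const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Body of one worker process: `npairs` independent merges of n + n elements.
   Returns 0 on success, nonzero on allocation failure or a wrong result. */
static int worker(size_t npairs, size_t n) {
    int *a = malloc(n * sizeof *a);
    int *b = malloc(n * sizeof *b);
    int *out = malloc(2 * n * sizeof *out);
    int rc = (a && b && out) ? 0 : 1;
    for (size_t p = 0; rc == 0 && p < npairs; p++) {
        for (size_t i = 0; i < n; i++) {   /* two interleaving ascending runs */
            a[i] = (int)(2 * i + p % 2);
            b[i] = (int)(2 * i + 1 - p % 2);
        }
        merge_ints(a, n, b, n, out);
        for (size_t i = 1; i < 2 * n; i++) /* check output so the work is not optimized away */
            if (out[i - 1] > out[i]) rc = 2;
    }
    free(a);
    free(b);
    free(out);
    return rc;
}

/* Fork `nproc` workers, wait for all of them, and return elapsed wall-clock
   seconds, or -1.0 if any fork or worker failed. */
double run_workers(int nproc, size_t npairs, size_t n) {
    struct timespec t0, t1;
    int started = 0, failed = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (; started < nproc; started++) {
        pid_t pid = fork();
        if (pid < 0) { failed = 1; break; }
        if (pid == 0) _exit(worker(npairs, n)); /* child: do the work, report status */
    }
    for (int i = 0; i < started; i++) {         /* reap every child that was started */
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (failed) return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To probe the oversubscription question, keep the total work fixed and vary only the process count, for example comparing `run_workers(c, P / c, n)` for `c` equal to the core count (`sysconf(_SC_NPROCESSORS_ONLN)` on Linux) and for `2 * c`. If the answer's reasoning holds, going beyond the number of hardware threads should buy little or nothing, since the OS scheduler cannot switch at the granularity of individual SIMD stalls.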