assembly - Optimal SIMD algorithm to rotate or transpose an array


I am working on a data structure where I have an array of 16 uint64. They are laid out like this in memory (each entry below represents a single uint64):

a0 a1 a2 a3 b0 b1 b2 b3 c0 c1 c2 c3 d0 d1 d2 d3 

The desired result is to transpose the array into this:

a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 

Rotating the array by 90 degrees is also an acceptable solution for the later loop:

d0 c0 b0 a0 d1 c1 b1 a1 d2 c2 b2 a2 d3 c3 b3 a3 
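To pin down the two target layouts, here is a minimal scalar sketch that treats the 16 values as a 4 x 4 row-major matrix. The function names and the `DIM` macro are hypothetical and not from the original post. It is only a reference to check a SIMD version against, not a fast implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIM ((size_t)4)  /* assumed: the 16 values form a 4 x 4 row-major matrix */

/* Transpose: out[r][c] = in[c][r], giving a0 b0 c0 d0 a1 b1 c1 d1 ... */
void transpose4x4_ref(const uint64_t in[16], uint64_t out[16])
{
    assert(in != out);  /* this reference version is not safe in place */
    for (size_t r = 0; r < DIM; ++r)
        for (size_t c = 0; c < DIM; ++c)
            out[r * DIM + c] = in[c * DIM + r];
}

/* Rotation: out[r][c] = in[DIM-1-c][r], giving d0 c0 b0 a0 d1 c1 b1 a1 ... */
void rotate4x4_ref(const uint64_t in[16], uint64_t out[16])
{
    assert(in != out);  /* this reference version is not safe in place */
    for (size_t r = 0; r < DIM; ++r)
        for (size_t c = 0; c < DIM; ++c)
            out[r * DIM + c] = in[(DIM - 1 - c) * DIM + r];
}
```

With the input 0..15, the first output row of the transpose is 0 4 8 12 and the first output row of the rotation is 12 8 4 0.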

I need this in order to operate on the array quickly at a later point (traversing it sequentially with another SIMD pass, 4 elements at a time).

So far, I have tried to "blend" the data by loading a 4 x 64-bit vector of the a's, bitmasking and shuffling the elements, OR'ing the result with the b's, and then repeating that for the c's and so on. Unfortunately, this takes 5 x 4 SIMD instructions per segment of 4 elements in the array (one load, one mask, one shuffle, one OR with the next element, and finally a store). It seems I should be able to do better.

I have AVX2 available and I am compiling with clang.

uint64_t a[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&a[ 0]); //0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&a[ 4]); //4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&a[ 8]); //8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&a[12]); //c d e f

I don't have hardware to test this on right now, but something like the following should do what you want:

__m256i tmp3, tmp2, tmp1, tmp0;
tmp0 = _mm256_unpacklo_epi64(row0, row1);            //0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1);            //1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3);            //8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3);            //9 d b f
//now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20);  //0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20);  //1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31);  //2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31);  //3 7 b f
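For completeness, here is a minimal sketch that wraps the unpack/permute sequence in a function with the loads and stores included. The function names, the runtime check, and the scalar fallback are assumptions added for illustration and are not part of the original answer. `__attribute__((target("avx2")))` and `__builtin_cpu_supports` are GCC/clang extensions, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: 4 loads, 4 unpacks, 4 lane permutes, 4 stores. */
__attribute__((target("avx2")))
static void transpose4x4_u64_avx2(const uint64_t *in, uint64_t *out)
{
    __m256i row0 = _mm256_loadu_si256((const __m256i *)(in + 0));
    __m256i row1 = _mm256_loadu_si256((const __m256i *)(in + 4));
    __m256i row2 = _mm256_loadu_si256((const __m256i *)(in + 8));
    __m256i row3 = _mm256_loadu_si256((const __m256i *)(in + 12));

    /* Interleave 64-bit elements within each 128-bit lane. */
    __m256i tmp0 = _mm256_unpacklo_epi64(row0, row1);
    __m256i tmp1 = _mm256_unpackhi_epi64(row0, row1);
    __m256i tmp2 = _mm256_unpacklo_epi64(row2, row3);
    __m256i tmp3 = _mm256_unpackhi_epi64(row2, row3);

    /* Combine matching 128-bit lanes into the transposed rows. */
    _mm256_storeu_si256((__m256i *)(out + 0),
                        _mm256_permute2x128_si256(tmp0, tmp2, 0x20));
    _mm256_storeu_si256((__m256i *)(out + 4),
                        _mm256_permute2x128_si256(tmp1, tmp3, 0x20));
    _mm256_storeu_si256((__m256i *)(out + 8),
                        _mm256_permute2x128_si256(tmp0, tmp2, 0x31));
    _mm256_storeu_si256((__m256i *)(out + 12),
                        _mm256_permute2x128_si256(tmp1, tmp3, 0x31));
}
#endif

/* Portable fallback, used when AVX2 is not available at run time. */
static void transpose4x4_u64_scalar(const uint64_t *in, uint64_t *out)
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r * 4 + c] = in[c * 4 + r];
}

/* Hypothetical entry point; in and out must not overlap. */
void transpose4x4_u64(const uint64_t *in, uint64_t *out)
{
    assert(in && out && in != out);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        transpose4x4_u64_avx2(in, out);
        return;
    }
#endif
    transpose4x4_u64_scalar(in, out);
}
```

The AVX2 path uses 8 shuffle instructions for all 16 elements, compared with the mask/shuffle/OR chain per group of 4 described in the question. If the build already targets AVX2 (for example with -mavx2), the runtime check and the fallback can be dropped.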

The intrinsic

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm)

selects 128-bit lanes from the two sources. You can read about it in the Intel Intrinsics Guide. There is also a version, _mm256_permute2f128_si256, which needs only AVX and acts in the floating-point domain. I used it to check that I had the correct control words.
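To see why 0x20 and 0x31 are the right control words, a scalar model of the lane selection may help. This is a sketch based on my reading of the pseudocode in the Intel guide. The `vec256_model` type and the function names are hypothetical, and the model is meant for checking control words on paper, not for production use.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for a 256-bit vector: four 64-bit elements, lowest first. */
typedef struct { uint64_t q[4]; } vec256_model;

/* Pick one 128-bit lane according to a 4-bit control field. */
static void select_lane(const vec256_model *a, const vec256_model *b,
                        unsigned ctrl, uint64_t dst[2])
{
    if (ctrl & 0x8) {              /* bit 3 set: the lane is zeroed */
        dst[0] = 0;
        dst[1] = 0;
        return;
    }
    const vec256_model *src = (ctrl & 0x2) ? b : a;  /* bit 1: source a or b */
    unsigned first = (ctrl & 0x1) ? 2u : 0u;         /* bit 0: low or high lane */
    dst[0] = src->q[first];
    dst[1] = src->q[first + 1];
}

/* Model of _mm256_permute2x128_si256(a, b, imm). */
vec256_model permute2x128_model(vec256_model a, vec256_model b, unsigned imm)
{
    assert(imm <= 0xFF);
    vec256_model r;
    select_lane(&a, &b, imm & 0xF, &r.q[0]);         /* imm[3:0] -> low lane  */
    select_lane(&a, &b, (imm >> 4) & 0xF, &r.q[2]);  /* imm[7:4] -> high lane */
    return r;
}
```

With tmp0 = 0 4 2 6 and tmp2 = 8 c a e, control word 0x20 takes the low lane of the first source and the low lane of the second (0 4 8 c), while 0x31 takes the two high lanes (2 6 a e).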

