C++ Performance for Large Complex Data Editing
Hello,
I'm looking to get more performance out of a critical section of code. The code loops over all elements of a vector and edits the imaginary part of complex values, about 4-5 GB of complex float data in total. To determine each new value I need to compute a few arctan2 (std::arg) and std::abs calls. I'm looking for any solution that gets me even a 20% improvement.

What I've tried so far (GCC 6.3.1, Linux, C++14):
- Tested the different optimization levels on my target; I'm currently compiling with "-O3 -ffast-math".
- Flattened my multidimensional data into a single preallocated 1D vector.
- Multithreaded the solution using the maximum threads available on my CPU (no possibility of using a GPU). It's a thread pool, so the threads are kicked off prior to the critical section, and each thread computes values on a different section of the vector to avoid locking.
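For concreteness, the per-element update described above (compute std::arg and std::abs, then write back a value rebuilt from them) could be sketched like this. Note that edit_phase and delta are hypothetical stand-ins, since the actual phase-term sum isn't shown in the post:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical per-element edit: keep the magnitude, shift the phase.
// `delta` stands in for the elided "phase terms sum".
std::complex<float> edit_phase(std::complex<float> v, float delta) {
    float mag   = std::abs(v); // one sqrt per element
    float phase = std::arg(v); // one atan2 per element
    return std::polar(mag, phase + delta);
}
```

Each element costs one sqrt plus one atan2 plus one sin/cos pair, which is why the loop is dominated by libm calls rather than memory traffic.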
Here is a simplified representation (assume an equal split; forgive any remaining bugs, this was hand-typed, but the real code works as intended):
// dim1 ~= 30k
// dim2 ~= 18k
std::vector<std::complex<float>> data(dim1 * dim2); // allocated and filled beforehand

// Critical section
for (int idx = 0; idx < dim1_edit_sections.size() - 1; idx++) {
    int dim1_section_start = dim1_edit_sections[idx];
    int dim1_section_end   = dim1_edit_sections[idx + 1];
    int split      = dim2 / num_threads;
    int dim2_start = 0;
    int dim2_end   = split;
    for (/* each thread in pool */) {
        // lambda function given to the thread; works on columns [dim2_start, dim2_end)
        for (int i = dim2_start; i < dim2_end; i++) {
            imag_term1 = std::arg(data[dim1_section_start * dim2 + i]) + ...;
            imag_term2 = std::arg(data[dim1_section_end * dim2 + i]) + ...;
            for (int j = dim1_section_start; j < dim1_section_end; j++) {
                imag_pt = std::arg(data[j * dim2 + i]) + ...;
                data[j * dim2 + i] = std::abs(data[j * dim2 + i]) + std::exp(/* phase terms sum */);
            }
        }
        // END lambda
        dim2_start = dim2_end;
        dim2_end   = dim2_start + split;
    }
}
// wait on all threads
// END critical section
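In case it helps to see the partitioning concretely, here is a minimal, self-contained sketch of the column split using half-open [begin, end) ranges, which avoids the off-by-one bookkeeping of inclusive bounds. The names process_columns and run_parallel are hypothetical, std::thread stands in for the real thread pool, and the per-element edit is a placeholder since the actual phase computation is elided:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real lambda body: touch every element in the
// assigned column slice [col_begin, col_end) across rows [row_begin, row_end).
void process_columns(std::vector<std::complex<float>>& data,
                     std::size_t dim2,
                     std::size_t col_begin, std::size_t col_end,
                     std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = col_begin; i < col_end; ++i)
        for (std::size_t j = row_begin; j < row_end; ++j)
            data[j * dim2 + i] += std::complex<float>(1.0f, 0.0f); // placeholder edit
}

// Split dim2 into num_threads disjoint slices; no locking is needed
// because the slices never overlap.
void run_parallel(std::vector<std::complex<float>>& data,
                  std::size_t dim1, std::size_t dim2, unsigned num_threads) {
    std::vector<std::thread> threads;
    std::size_t chunk = dim2 / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        // Last thread picks up the remainder when dim2 % num_threads != 0.
        std::size_t end = (t + 1 == num_threads) ? dim2 : begin + chunk;
        threads.emplace_back(process_columns, std::ref(data), dim2,
                             begin, end, std::size_t{0}, dim1);
    }
    for (auto& th : threads) th.join();
}
```

Since each column index belongs to exactly one thread and the inner j loop walks rows within that column, every element is written exactly once per pass.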