C++ Performance for Large Complex Data Editing

Hello,

I'm trying to get more performance out of a critical section of code. It loops over every element of a vector and edits the imaginary part of complex values, about 4-5 GB of complex float data in total. To determine each new value, I need to compute a few arctan2 (std::arg) and std::abs calls. I'm looking for any solution that gets even a 20% improvement.

What I've tried so far (GCC 6.3.1, Linux, C++14):
- Different optimization levels; I'm currently compiling with "-O3 -ffast-math".
- Flattening my multidimensional data into a single preallocated 1D vector.
- Multithreading with the maximum number of threads available on my CPU (no possibility of using a GPU). It's a thread pool, so the threads are started before the critical section, and each thread computes values on a different section of the vector to avoid locking.

Here is a simplified representation (assume an equal split; forgive any typos, it was hand-typed, but the real code works as intended):

// dim1 ~= 30k
// dim2 ~= 18k
std::vector<std::complex<float>> data(dim1*dim2); // allocated and filled before

// Critical section
for (int idx = 0; idx < (int)dim1_edit_sections.size() - 1; idx++) {
  int dim1_section_start = dim1_edit_sections[idx];
  int dim1_section_end = dim1_edit_sections[idx+1];

  int split = dim2/num_threads;
  int dim2_start = 0;
  int dim2_end   = split;
  for (auto& thread : pool) {  // pseudocode: one lambda dispatched per pool thread
    // lambda function given to thread
    for (int i = dim2_start; i < dim2_end; i++) {
      float imag_term1 = std::arg(data[dim1_section_start*dim2 + i]) + ... ;
      float imag_term2 = std::arg(data[dim1_section_end*dim2 + i]) + ...;

      for (int j = dim1_section_start; j < dim1_section_end; j++) {
        float imag_pt = std::arg(data[j*dim2 + i]) + ...;
        data[j*dim2 + i] = std::abs(data[j*dim2 + i]) + std::exp(phase terms sum);
      }
    }
    // END lambda.
    dim2_start = dim2_end;
    dim2_end   = dim2_start + split;
  }
}
// wait on all threads
// END critical section