Make two separate arrays for x and y coordinates of cockroaches. Don't use horizontal instructions in tight loops. Perform all the develop and debug in 32-bit mode (evidently GCC/Clang on server has -m32 key). x86-64 has RIP-relative addressing, there would be bottlneck when you go back to 32-bit build and try to backport from 64-bit one. First loop for each sweet is straitforward. Don't use `displacement(base,index,scale)` addressing. Always run all over the `base` part. Finally it may give about a couple of hundred ms. In second loop for each sweet you may use BSF instruction to get least significant bit offset (remeber, judge runs on Sandy Bridge arch). Use LEA to calculate simple index arithmetic. Here is mine:
size_type R = 0; asm ( "vbroadcastsd %[dmin], %%ymm5;"
It turns out, that there are possible two way to solve this problem (both M*N) in addition to naïve one (1400ms). 1.) Read all sites, then for each query point do find all the preliminary neighbours using float coordinates (it is almost two time faster in terms of membory bandwitdth, then if use double). This is a "float" sieve. Then for all sites passed "float" sieve perform similar "double" sieve. It get about 500ms (for me from 1.45s to 950ms). Test #16 is still hardest. Prefetch instructions in long runs gain up to 100ms. First loop (precalc of squared float distances) can be unfolded 4 times by query points (vbroadcastss from fixed memory location works fast in loop, you should not occupy dedicated ymm register for these). Second loop, where squared distances compared against least distance (+ eps of course), found in previous loop, can be unfolded 4 times (you can occupy all 32 bit of some register using bit shift (ror, rol) after vmovmskps). I think it allow to better utilize branch prediction mechanism. This approach can be beaten by simple test: cloud of sites in one corner in 10x10 square and cloud of query points in opposite aslo in 10x10 square both in general position (not matters much). Or wiser: query points placed on quarter of circle with center in one corner and radius of 20000, sites on another quarter of circle with center in the same corner and small radius or vice versa. Or even: two distant strait lines: one with query points and another with sites. 2.) You can solve the problem in single loop using just doubles with randomization. If your random numbers hit right sites, then you can probably make even fastest solution (but it is too improbably). I can't find counterexample for this randomized approach.
I am very interested in Progbeat's solution details. It seems (by memory consumed) he used approach very similar to mine. It would be great if we'll exchange our solutions somehow =).