Common Board @ Timus Online Judge

Orient asm tips [1] // Problem 1369. Cockroach Race 25 Jan 2018 00:38

Make two separate arrays for x and y coordinates of cockroaches.
Don't use horizontal instructions in tight loops.
Perform all the develop and debug in 32-bit mode (evidently GCC/Clang on server has -m32 key). x86-64 has RIP-relative addressing, there would be bottlneck when you go back to 32-bit build and try to backport from 64-bit one.
First loop for each sweet is straitforward. Don't use `displacement(base,index,scale)` addressing. Always run all over the `base` part. Finally it may give about a couple of hundred ms.
In second loop for each sweet you may use BSF instruction to get least significant bit offset (remeber, judge runs on Sandy Bridge arch). Use LEA to calculate simple index arithmetic. Here is mine:

            size_type R = 0;
            asm
            (
            "vbroadcastsd %[dmin], %%ymm5;"

            "dloop:"
            "vcmpgt_oqpd %[d](,%[i],8), %%ymm5, %%ymm6;"
            "vmovmskpd %%ymm6, %%eax;"
            "test %%eax, %%eax;"
            "jz rnext;"
            "rloop:"
            "bsf %%eax, %%edx;"
            "lea 1(%%edx,%[i]), %%edx;"
            "mov %%edx, %[results](,%[R],4);"
            "inc %[R];"
            "lea -1(%%eax), %%edx;"
            "and %%edx, %%eax;"
            "jnz rloop;"

            "rnext:"
            "add $4, %[i];"
            "cmp %[M], %[i];"
            "jl dloop;"
            : [R]"=b"(R)
            : [d]"m"(d), [dmin]"m"(dmin),
              "0"(R), [results]"m"(results),
              [M]"r"(M), [i]"c"(0)
            : "cc", "memory",
              "%eax", "%edx",
              "%ymm5", "%ymm6"
            );

Custom IO gives about 300+ ms.

I hope M*N solutions finally will loose their ACs someday (at least until hardware will be updated to AVX512-compatible).

Orient Re: asm tips // Problem 1369. Cockroach Race 11 Feb 2018 13:26

It turns out, that there are possible two way to solve this problem (both M*N) in addition to naïve one (1400ms).
1.) Read all sites, then for each query point do find all the preliminary neighbours using float coordinates (it is almost two time faster in terms of membory bandwitdth, then if use double). This is a "float" sieve. Then for all sites passed "float" sieve perform similar "double" sieve. It get about 500ms (for me from 1.45s to 950ms). Test #16 is still hardest. Prefetch instructions in long runs gain up to 100ms.
First loop (precalc of squared float distances) can be unfolded 4 times by query points (vbroadcastss from fixed memory location works fast in loop, you should not occupy dedicated ymm register for these). Second loop, where squared distances compared against least distance (+ eps of course), found in previous loop, can be unfolded 4 times (you can occupy all 32 bit of some register using bit shift (ror, rol) after vmovmskps). I think it allow to better utilize branch prediction mechanism.
This approach can be beaten by simple test: cloud of sites in one corner in 10x10 square and cloud of query points in opposite aslo in 10x10 square both in general position (not matters much). Or wiser: query points placed on quarter of circle with center in one corner and radius of 20000, sites on another quarter of circle with center in the same corner and small radius or vice versa. Or even: two distant strait lines: one with query points and another with sites.
2.) You can solve the problem in single loop using just doubles with randomization. If your random numbers hit right sites, then you can probably make even fastest solution (but it is too improbably). I can't find counterexample for this randomized approach.

I am very interested in Progbeat's solution details. It seems (by memory consumed) he used approach very similar to mine. It would be great if we'll exchange our solutions somehow =).

Bazeev Damir`~ Как вибирать из двух одинаковых по длине ответов? // Problem 1651. Shortest Subchain 11 Feb 2018 10:40

Dan Stefan I got WA on test 9, please give me some sample data, I'm desperate [6] // Problem 1043. Cover an Arc 29 May 2004 00:45

2ch Solution // Problem 1268. Little Chu 11 Feb 2018 01:49

Aditya Singh AC in 0.015 [1] // Problem 1118. Nontrivial Numbers 15 Apr 2017 05:24

iOli for n = 10, k = 2, ans = 89. How?? // Problem 1009. K-based Numbers 10 Feb 2018 08:18

Vitaliy Herasymiv Test 26 [2] // Problem 1866. Poetic Foot 15 Oct 2011 15:07

Bazeev Damir`~ WA 4 // Problem 1651. Shortest Subchain 9 Feb 2018 19:53

Shorekh Test 6 [1] // Problem 1131. Copying 20 Oct 2016 22:48

Adkham what about Ramanujan's 1729 // Problem 1349. Farm 8 Feb 2018 19:27

Adkham what is wrong with my code ??? // Problem 1110. Power 8 Feb 2018 02:23

2ch Solution is pretty easy // Problem 1109. Conference 7 Feb 2018 21:51

Bekzat what is the 30 test [2] // Problem 1820. Ural Steaks 25 Nov 2016 09:01

JohnDarkman TLE #21 // Problem 2102. Michael and Cryptography 6 Feb 2018 19:25

Khanhhuy_19 Example Testcase [1] // Problem 2018. The Debut Album 21 Jan 2018 23:51

BdE Test // Problem 1846. GCD 2010 5 Feb 2018 22:15

LordPhantom as of Feb 2018, what is test #4? // Problem 1120. Sum of Sequential Numbers 3 Feb 2018 01:39

Al.Cash Hint [1] // Problem 1661. Dodecahedron 3 May 2009 14:37

Artem Why dont work?C# // Problem 1068. Sum 2 Feb 2018 13:12

Valentina Runtime error - Test 2 // Problem 1987. Nested Segments 31 Jan 2018 23:23