Faster modular multiplication

#	User	Rating
1	tourist	3690
2	jiangly	3647
3	Benq	3581
4	orzdevinwang	3570
5	Geothermal	3569
5	cnnfls_csy	3569
7	Radewoosh	3509
8	ecnerwala	3486
9	jqdai0815	3474
10	gyh20	3447

#	User	Contrib.
1	maomao90	174
2	awoo	165
3	adamant	161
4	TheScrasse	160
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	orz	146
9	SecondThread	145
9	pajenegod	145

Note: This technique had been used before at https://codeforces.com/blog/entry/60851 (editorial code of problem F) and https://codeforces.com/problemset/submission/1017/41357847 .

While this solution is faster than using int64_t (because Codeforces machines are 32-bit), the time limit should be loose enough for solution that does not use this trick to pass. However, this trick may be useful if you want to be very fast or implement an unintended/suboptimal brute force solution. (the same thing can be said about fast input/output methods. On Codeforces, cin/cout is usually fast enough)

This is definitely faster than the usual return (long long)a * b % mod, but it might be slower than Montgomery multiplication.

Given a positive integer md, and two positive integers a and b in range [0, md-1], the following function will compute (int) ((long long) a * b % md): (the quotient of the division is stored in variable d)

int mul(int a, int b) {
  unsigned long long x = (long long) a * b;
  unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
  asm(
    "divl %4; \n\t"
    : "=a" (d), "=d" (m)
    : "d" (xh), "a" (xl), "r" (md)
  );
  return m;
}

x86 assembly has an instruction to divide a 64-bit integer by a 32-bit integer, provided that both the quotient and the remainder fits in 32 bits. (More details)

However, it's not a compiler bug that GCC does not optimize code like this

uint32_t f(uint64_t a, uint32_t b) { return a % b; }

to use that assembly instruction, because that would not work well (as required by the C++ standard) when the quotient exceeds 2^32. See also 64308 – Missed optimization: 64-bit divide used when 32-bit divide would work.

Unfortunately, I think there isn't any intrinsic function of GCC that provides the functionality, and it's necessary to use asm. (source)

Benchmark code: (you can copy-paste that to https://codeforces.com/problemset/customtest )

Code

#include <iostream>
#include <ctime>

using namespace std;

const int md = 1e9 + 7;

int mul_complex(int a, int b) {
  unsigned long long x = (long long) a * b;
  unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
  asm(
    "divl %4; \n\t"
    : "=a" (d), "=d" (m)
    : "d" (xh), "a" (xl), "r" (md)
  );
  return m;
}

int mul_simple(int a, int b){
  return ((long long)a*b)%md;
}

int main() {
    { // warmup
        int x = 34567, y= 4325;
        long long ans = 0;
        for(int itr = 1e7; itr; --itr){
          ans += mul_complex(x, y);
          ++x, ++y;
        }
    }

    int const N_ITER = 1e8;
    {
        auto start = clock();
        int x = 34567, y= 4325;
        long long ans = 0;
        for(int itr = N_ITER; itr; --itr){
          ans += mul_complex(x, y);
          ++x, ++y;
        }
        cout << ans << ' ' << double(clock() - start) / CLOCKS_PER_SEC << '\n';
    }
    {
        auto start = clock();
        int x = 34567, y= 4325;
        long long ans = 0;
        for(int itr = N_ITER; itr; --itr){
          ans += mul_simple(x, y);
          ++x, ++y;
        }
        cout << ans << ' ' << double(clock() - start) / CLOCKS_PER_SEC << '\n';
    }
}

static_assert(sizeof(size_t) == 4, "not 32 bit machine");

Because Codeforces caches the result, you may need to change the input or some whitespaces in the code to re-run the code. When I run it the result is ~= 0.45 vs 0.65, which is a 30% performance gain.

Comments (6)

Write comment?

madhur4127

4 years ago, # |

← Rev. 3 →

Can you share the results of the benchmark which showed performance gains?

On the contrary, I didn't observe any gain for 64-bit: http://quick-bench.com/SUfusGHbNvP8a961_8QrWYLqbJw

→ Reply

Xellos

4 years ago, # ^ |

Is that on a 32-bit machine?

gultai4ukr

+10

You used mul_complex function in both measurements. So you just compare it with itself. Nothing strange that you didn't see the difference :)

Urghh, I forgot to change it back before posting.

I saw x3.3 slowdown on 64-bit though, which makes this trick useful only for 32bit judges.

z4120

Interestingly, on 64 bit machines, the compiler actually optimize division into multiplication. Check out https://godbolt.org/z/GbSkHS .

Interesting read: Labor of Division (Episode I)

libdivide, optimized integer division may be able to do that on Codeforces's 32-bit machines. Or just hard code it for the 1e9+7 case (as it's very common).

Also, static_assert(sizeof(size_t) == 4, "not 32 bit machine"); fails on quick-bench, which shows that they use a 64-bit machine.

UPD: Benchmark code added to the post, which shows a 30% performance improvement on Codeforces machines.

z4120's blog