ymmparsa's blog

By ymmparsa, 11 months ago, In English

This blog focuses on the GCC compiler.

The most basic way to find a prime factorization for an integer $$$n$$$ is to test every possible divisor up to $$$\sqrt{n}$$$. This can be optimized by precomputing primes and only testing those. What else can we do? Note that the modulo operation is rather expensive. To optimize it, we can use something like the Montgomery multiplication, but it's too complicated for an "easy way". But also note that when the divisor is a compile-time constant, the compiler optimizes the modulo operation (or division, for that matter) to cheap multiplication and bit-shift operations. Let's take advantage of that:

#include <bits/stdc++.h>
using namespace std;

typedef uint32_t u32;

// The cold attribute tells the compiler that this function is unlikely to be
// called.  Without it, compilation will take much more time because the
// compiler tries to optimize the function calls.  Attributes are a GCC
// extension.
__attribute__((cold))
void factor_helper(vector<u32> &vec, u32 &x, u32 y)
{
	do {
		vec.push_back(y);
		x /= y;
	} while (x % y == 0);
}

vector<u32> factor(u32 x) {
	vector<u32> vec;
#define F(y) if (x%y == 0) { factor_helper(vec, x, y); }
	F(2)F(3)F(5)F(7)F(11)F(13)... /* all primes up to sqrt(1e9) */
#undef F
	if (x > 1)
		vec.push_back(x);
	return vec;
}

As you can see, it's pretty simple and the long line containing all primes can be generated with a few lines of code. The only problem that remains is its long compile time. The above code is tweaked in a way so that the compilation can be done within a few seconds with GCC (which has been achieved through trial and error, like how the parameters are passed, the cold attribute, etc.), so that's no longer an issue either. Also unsigned integers are used because they are faster in this case.

But how fast actually is this? With my tests it can factorize $$$10^6$$$ integers in less than 2 seconds! Feel free to test it for yourself (Be sure to use the latest version of g++ available on Codeforces for testing or actually using this).

  • Vote: I like it
  • +197
  • Vote: I do not like it

»
11 months ago, # |
  Vote: I like it +42 Vote: I do not like it

That sounds really crazy...

»
11 months ago, # |
Rev. 2   Vote: I like it +82 Vote: I do not like it

Spoiler
  • »
    »
    11 months ago, # ^ |
      Vote: I like it +8 Vote: I do not like it

    Same here.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    Here you go.

    WARNING: very large amount of text, open at your own risk!
  • »
    »
    11 months ago, # ^ |
    Rev. 4   Vote: I like it +40 Vote: I do not like it

    You can do without pasting enormous sequence of ints...

    code

    Runs in 2199s in custom run... Had to split in chunks of 800, because CF has limit of 900 on recursive templates.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Wow that's nice! Guess I should learn more about constexpr and templates. Although the original version runs in 1341ms...

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +32 Vote: I do not like it

        I think the difference in execution time might be due to different bounds. I tried out the suggestion by ToxicPie9 below and adjusted the bounds:

        code

        Runs 1419ms now.

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 4   Vote: I like it +75 Vote: I do not like it

      You can replace fact_p<L+1, R> with fact_p<L, (L+R)/2> and fact_p<(L+R)/2, R> to get $$$\log(P)$$$ depth so it fits into the limit of 900.

      i don't actually know how to write binary search. this is probably incorrect
»
11 months ago, # |
  Vote: I like it +31 Vote: I do not like it

constexpr + template are all you need

»
11 months ago, # |
Rev. 2   Vote: I like it +52 Vote: I do not like it

Here is my final version using C++ templates and constexpr (based on adamant's code here), so you don't have to paste 4000 prime numbers:

code

With standard optimizations (-O2), its performance matches the code in the blog. It uses templates and the always_inline attribute to generate code like a macro.

It took a lot of modifications to get a working one, and I needed some time to analyze why some versions are a lot slower than other ones. Below are some interesting technical facts, reader discretion is advised.

The key here is the noinline attribute (cold might also work), which prevents factor_helper from being inlined. Without it, not only would compiling time increase a lot, the runtime also increases by about 6 times!

When factor_helper is not inlined, The compiled code has a structure like this:

code

Here only x % p == 0 have optimized operations, but that's ok because divisions in factor_helper only happen $$$O(\log(x))$$$ times.

On the other hand, when factor_helper is inlined, The compiled code has a structure like this:

code

When push_back is also inlined, this results in a enormous (up to 470 kB) factor function, which is so large that cache misses start to have a significant effect. In adamant's case where he manually inlined factor_helper, cache and branch misses caused the same amount of instructions to take 5.5x time to run. Took us a while to figure out what was going on :|

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +80 Vote: I do not like it

    #pragma GCC unroll is faster to compile than C++ templates.

    vector<u32> factor(u32 x) {
        vector<u32> vec;
        #pragma GCC unroll 3401
        for (int i = 0; i < 3401; i++)
            if (x % primes[i] == 0) [[unlikely]]
                factor_helper(vec, x, primes[i]);
        if (x > 1)
            vec.push_back(x);
        return vec;
    }
    
    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      Wow I didn't know unroll can be used like this, thank you for the info. It also makes the code a lot shorter and cleaner.

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Have you checked it in CF custom test?

      I get compilation time out on G++20 and G++17 (64 bit), and on G++17 the runtime is 7878ms...

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +50 Vote: I do not like it

        I didn't get a compile timeout in the custom test, even though it took a long time to compile.

        • »
          »
          »
          »
          »
          11 months ago, # ^ |
          Rev. 3   Vote: I like it 0 Vote: I do not like it

          Ok, I found the problem. __attribute__((noinline)) must be near factor_helper for this to compile:

          code

          This runs in 1357ms.

»
10 months ago, # |
  Vote: I like it +49 Vote: I do not like it

Actually, you can do this even faster, b/c you don't need to check primes $$$p*p>x$$$. This is ~3-5 times faster than your solution. (using recursive templates)

template<int L = 0, int R = P, int M = (L + R) / 2>
constexpr void fact_p(vector<u32>& vec, u32 &x) {
        if constexpr (L + 1 < R) {
                fact_p<L, M>(vec, x);
                if (L+20<R && x < primes[M]*primes[M]) return;
                fact_p<M, R>(vec, x);
        } else if(x % primes[L] == 0) {
                factor_helper(vec, x, primes[L]);
        }
}