sslotin's blog

By sslotin, 2 years ago, In English

I've got an interesting math problem.

Here is a binary search variant that picks the element to compare against at random instead of always picking the middle one:

int lower_bound(int x) {
    int l = 0, r = n; // [l, r]
    while (l < r) {
        int m = l + rand() % (r - l); // <- note that r is never picked because there's no point
        if (a[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return l;
}

Assuming that the keys and queries are also drawn independently at random, as $$$n \to \infty$$$, how many times more comparisons does the random binary search perform compared to the normal one?

Edit: the binary search used to return a value instead of index. This is not correct if the the lower bound doesn't exist.

  • Vote: I like it
  • +148
  • Vote: I do not like it

| Write comment?
»
2 years ago, # |
Rev. 5   Vote: I like it +100 Vote: I do not like it

I will use $$$1$$$-indexing in what follows. Also I'll assume that we count the number of element comparisons (since comparing $$$l, r$$$ is not memory intensive, so it is not that important anyway). If we also count comparisons of $$$l, r$$$, it is twice of what we are counting here, minus 1, so it won't end up changing the ratio at all.

Firstly note that we can assume that the underlying array is $$$1, \ldots, n$$$ (since we can compress elements, and for the sake of simplicity we can assume that the array consists of distinct elements). I suspect that the result should remain the same even when there are duplicate elements.

Let

Unable to parse markup [type=CF_MATHJAX]

be the expected number of element comparisons this algorithm does when you have a query $$$q$$$ in an interval $$$[l, r]$$$.

Then note that

Unable to parse markup [type=CF_MATHJAX]

.

Using this, if

Unable to parse markup [type=CF_MATHJAX]

, we can get a recurrence:

Unable to parse markup [type=CF_MATHJAX]

.

We want

Unable to parse markup [type=CF_MATHJAX]

, and the recurrence on summing for $$$q = 1$$$ to $$$n$$$ becomes

Unable to parse markup [type=CF_MATHJAX]

.

Let

Unable to parse markup [type=CF_MATHJAX]

. We then have

Unable to parse markup [type=CF_MATHJAX]

.

So we have

Unable to parse markup [type=CF_MATHJAX]

. So the answer to the problem is asymptotically

Unable to parse markup [type=CF_MATHJAX]

.

Edit: I believe that the binary search given here works correctly if and only if

Unable to parse markup [type=CF_MATHJAX]

.
»
2 years ago, # |
Rev. 5   Vote: I like it 0 Vote: I do not like it

I thought it would be fun to solve it for a general probability distribution

Unable to parse markup [type=CF_MATHJAX]

over the choices, i.e., out of

Unable to parse markup [type=CF_MATHJAX]

,

Unable to parse markup [type=CF_MATHJAX]

is the probability of picking $$$i$$$. In the above problem,

Unable to parse markup [type=CF_MATHJAX]

and $$$0$$$ otherwise.

The equation for number of comparisons for

Unable to parse markup [type=CF_MATHJAX]

becomes

Unable to parse markup [type=CF_MATHJAX]

and the expected number of comparisons for $$$n$$$,

Unable to parse markup [type=CF_MATHJAX]

Simplifying gives

Unable to parse markup [type=CF_MATHJAX]

Unable to parse markup [type=CF_MATHJAX]

. Note that

Unable to parse markup [type=CF_MATHJAX]

is symmetric, i.e.,

Unable to parse markup [type=CF_MATHJAX]

.

[UPD: what follows is not quite rigorous as pointed out]

Using this we can prove that the normal binary search is optimal. We know that

Unable to parse markup [type=CF_MATHJAX]

, both are trivial bounds. Which

Unable to parse markup [type=CF_MATHJAX]

is concave upwards. Now we can easily prove that

Unable to parse markup [type=CF_MATHJAX]

for

Unable to parse markup [type=CF_MATHJAX]

is sub-optimal. Since $$$Q$$$ is symmetric,

Unable to parse markup [type=CF_MATHJAX]

for

Unable to parse markup [type=CF_MATHJAX]

.
  • »
    »
    2 years ago, # ^ |
    Rev. 7   Vote: I like it +19 Vote: I do not like it

    The claim that

    Unable to parse markup [type=CF_MATHJAX]

    is concave upwards sounds a bit dubious to me (it doesn't hold for arbitrary functions).

    However, it can be shown that for uniformly random queries (i.e., for integers uniformly chosen from $$$[0, n]$$$) for the corrected version of binary search in the original post ($$$r = n$$$ instead of

    Unable to parse markup [type=CF_MATHJAX]

    ), we have that

    Unable to parse markup [type=CF_MATHJAX]

    , where $$$f$$$ is the binary entropy function (naming as done in https://oeis.org/A003314, this property is proved in https://epubs.siam.org/doi/10.1137/0117001 — not open access but fairly simple to prove). This bound is sharp as well.

    It is a piecewise linear function with the breakpoints at powers of $$$2$$$, where its value is precisely $$$n \log_2 n$$$.

    • »
      »
      »
      2 years ago, # ^ |
      Rev. 2   Vote: I like it +13 Vote: I do not like it

      Here's a complete proof of why normal binary search is optimal (no matter which midpoint you pick). I'll rederive using the correct binary search, which is as follows:

      int lower_bound(int x) {
          int l = 0, r = n; // [l, r)
          while (l < r) {
              int m = rng(l, r); // random number in [l, r)
              if (a[m] >= x) r = m;
              else l = m + 1;
          }
          return l;
      }
      

      I will assume that the queries are uniformly random, from $$$[0, n]$$$ (rather than from $$$[0, n - 1]$$$, since there are $$$n + 1$$$ possible answers). I'll also revert back to $$$0$$$-based indexing.

      For a given $$$q$$$, the number of steps becomes

      Unable to parse markup [type=CF_MATHJAX]

      .

      Taking the average from $$$q = 0$$$ to $$$n$$$, we get

      Unable to parse markup [type=CF_MATHJAX]

      , where

      Unable to parse markup [type=CF_MATHJAX]

      . Taking advantage of the fact that

      Unable to parse markup [type=CF_MATHJAX]

      , we can write this as

      Unable to parse markup [type=CF_MATHJAX]

      We can lower bound this expression by

      Unable to parse markup [type=CF_MATHJAX]

      , where

      Unable to parse markup [type=CF_MATHJAX]

      . Note that this bound is achievable, by putting all probability mass at either

      Unable to parse markup [type=CF_MATHJAX]

      or at

      Unable to parse markup [type=CF_MATHJAX]

      .

      Let

      Unable to parse markup [type=CF_MATHJAX]

      . Then we have

      Unable to parse markup [type=CF_MATHJAX]

      .

      Lemma: If $$$f$$$ is convex on the integers, then we have

      Unable to parse markup [type=CF_MATHJAX]

      for $$$k \ge 0$$$.

      Proof: Use convexity to represent $$$a$$$ and $$$b$$$ as convex combinations of $$$a - k$$$ and

      Unable to parse markup [type=CF_MATHJAX]

      .

      Claim: $$$d(n)$$$ is convex over

      Unable to parse markup [type=CF_MATHJAX]

      .

      Proof: It is sufficient to show that

      Unable to parse markup [type=CF_MATHJAX]

      for $$$n \ge 1$$$. This is trivially true for $$$n = 1$$$. Now let's suppose that $$$d$$$ is convex on $$$[0, n]$$$ for some $$$n \ge 1$$$. Note that by the lemma, for all $$$1 \le i \le n + 1$$$,

      Unable to parse markup [type=CF_MATHJAX]

      . The inductive step is now straightforward, and we are done.

      Note that by the proof of the previous claim, we directly have that

      Unable to parse markup [type=CF_MATHJAX]

      is the midpoint of the range (and when there are two midpoints, any of them works), and we are thus done.

      Edit: from this recurrence, using induction it is also fairly straightforward to show that this function is piecewise linear, and

      Unable to parse markup [type=CF_MATHJAX]

      are the points where the slopes change. More precisely, we can show that if $$$n + 1$$$ is not a power of 2, then

      Unable to parse markup [type=CF_MATHJAX]

      .
»
2 years ago, # |
Rev. 2   Vote: I like it -21 Vote: I do not like it

My level is not very advanced but i know a lot about Generating Random Numbers, so if we want to pick a random number from l to r : so we will use this Equation => Start + rand() % (End — Start + 1).

but here m = l + rand() % (r — l), why didn't you add 1 between () ?

  • »
    »
    2 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Because picking the last element doesn't really do anything, as the comparison won't shrink the search range.

    Think of it as randomly selecting between the non-empty splits of an $$$n$$$-element array: there are exactly $$$(n-1)$$$ of them, not $$$n$$$.

    • »
      »
      »
      2 years ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      Ah. I think the culprit is that your lower_bound returns a value, not an index, and therefore assumes the last element is always larger than (or equal to) x. The usual lower_bound doesn't. Might be a good idea to add a comment about this to the post or something

    • »
      »
      »
      2 years ago, # ^ |
        Vote: I like it +11 Vote: I do not like it

      Better now, but lower_bound still never returns n if all elements are less than x.

      • »
        »
        »
        »
        2 years ago, # ^ |
          Vote: I like it +5 Vote: I do not like it

        Oh, for fuck's sake.

        Never thought that there would be a day I make two errors in a binary search.

        Especially after just writing a 20-page article on it.