Random Binary Search

→ Pay attention

Before contest
Codeforces Round 950 (Div. 3)
32:22:14
Register now »

→ Top rated

#	User	Rating
1	tourist	3757
2	jiangly	3647
3	Benq	3581
4	orzdevinwang	3570
5	Geothermal	3569
5	cnnfls_csy	3569
7	Radewoosh	3509
8	ecnerwala	3486
9	jqdai0815	3474
10	gyh20	3447

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	maomao90	171
2	awoo	165
3	adamant	164
4	TheScrasse	159
5	nor	155
5	maroonrk	155
7	-is-this-fft-	152
8	Petr	147
9	orz	146
10	pajenegod	145

View all →

→ Find user

→ Recent actions

Detailed →

sslotin's blog

Random Binary Search

By sslotin, 2 years ago, In English

I've got an interesting math problem.

Here is a binary search variant that picks the element to compare against at random instead of always picking the middle one:

int lower_bound(int x) {
    int l = 0, r = n; // [l, r]
    while (l < r) {
        int m = l + rand() % (r - l); // <- note that r is never picked because there's no point
        if (a[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return l;
}

Assuming that the keys and queries are also drawn independently at random, as $$$n \to \infty$$$, how many times more comparisons does the random binary search perform compared to the normal one?

Edit: the binary search used to return a value instead of index. This is not correct if the the lower bound doesn't exist.

+148

sslotin
2 years ago
10

Comments (9)

Show archived | Write comment?

nor

2 years ago, # |

← Rev. 5 →

+100

I will use $$$1$$$-indexing in what follows. Also I'll assume that we count the number of element comparisons (since comparing $$$l, r$$$ is not memory intensive, so it is not that important anyway). If we also count comparisons of $$$l, r$$$, it is twice of what we are counting here, minus 1, so it won't end up changing the ratio at all.

Firstly note that we can assume that the underlying array is $$$1, \ldots, n$$$ (since we can compress elements, and for the sake of simplicity we can assume that the array consists of distinct elements). I suspect that the result should remain the same even when there are duplicate elements.

Let

Unable to parse markup [type=CF_MATHJAX]

be the expected number of element comparisons this algorithm does when you have a query $$$q$$$ in an interval $$$[l, r]$$$.

Then note that

Unable to parse markup [type=CF_MATHJAX]

Using this, if

Unable to parse markup [type=CF_MATHJAX]

, we can get a recurrence:

Unable to parse markup [type=CF_MATHJAX]

We want

Unable to parse markup [type=CF_MATHJAX]

, and the recurrence on summing for $$$q = 1$$$ to $$$n$$$ becomes

Unable to parse markup [type=CF_MATHJAX]

Let

Unable to parse markup [type=CF_MATHJAX]

. We then have

Unable to parse markup [type=CF_MATHJAX]

So we have

Unable to parse markup [type=CF_MATHJAX]

. So the answer to the problem is asymptotically

Unable to parse markup [type=CF_MATHJAX]

Edit: I believe that the binary search given here works correctly if and only if

Unable to parse markup [type=CF_MATHJAX]

→ Reply

koderkushy

2 years ago, # |

← Rev. 5 →

I thought it would be fun to solve it for a general probability distribution

Unable to parse markup [type=CF_MATHJAX]

over the choices, i.e., out of

Unable to parse markup [type=CF_MATHJAX]

is the probability of picking $$$i$$$. In the above problem,

Unable to parse markup [type=CF_MATHJAX]

and $$$0$$$ otherwise.

The equation for number of comparisons for

Unable to parse markup [type=CF_MATHJAX]

becomes

Unable to parse markup [type=CF_MATHJAX]

and the expected number of comparisons for $$$n$$$,

Unable to parse markup [type=CF_MATHJAX]

Simplifying gives

Unable to parse markup [type=CF_MATHJAX]

. Note that

Unable to parse markup [type=CF_MATHJAX]

is symmetric, i.e.,

Unable to parse markup [type=CF_MATHJAX]

[UPD: what follows is not quite rigorous as pointed out]

Using this we can prove that the normal binary search is optimal. We know that

Unable to parse markup [type=CF_MATHJAX]

, both are trivial bounds. Which

Unable to parse markup [type=CF_MATHJAX]

is concave upwards. Now we can easily prove that

Unable to parse markup [type=CF_MATHJAX]

for

Unable to parse markup [type=CF_MATHJAX]

is sub-optimal. Since $$$Q$$$ is symmetric,

Unable to parse markup [type=CF_MATHJAX]

for

Unable to parse markup [type=CF_MATHJAX]

→ Reply

nor

2 years ago, # ^ |

← Rev. 7 →

+19

The claim that

Unable to parse markup [type=CF_MATHJAX]

is concave upwards sounds a bit dubious to me (it doesn't hold for arbitrary functions).

However, it can be shown that for uniformly random queries (i.e., for integers uniformly chosen from $$$[0, n]$$$) for the corrected version of binary search in the original post ($$$r = n$$$ instead of

Unable to parse markup [type=CF_MATHJAX]

), we have that

Unable to parse markup [type=CF_MATHJAX]

, where $$$f$$$ is the binary entropy function (naming as done in https://oeis.org/A003314, this property is proved in https://epubs.siam.org/doi/10.1137/0117001 — not open access but fairly simple to prove). This bound is sharp as well.

It is a piecewise linear function with the breakpoints at powers of $$$2$$$, where its value is precisely $$$n \log_2 n$$$.

→ Reply

nor

2 years ago, # ^ |

← Rev. 2 →

+13

Here's a complete proof of why normal binary search is optimal (no matter which midpoint you pick). I'll rederive using the correct binary search, which is as follows:

int lower_bound(int x) {
    int l = 0, r = n; // [l, r)
    while (l < r) {
        int m = rng(l, r); // random number in [l, r)
        if (a[m] >= x) r = m;
        else l = m + 1;
    }
    return l;
}

I will assume that the queries are uniformly random, from $$$[0, n]$$$ (rather than from $$$[0, n - 1]$$$, since there are $$$n + 1$$$ possible answers). I'll also revert back to $$$0$$$-based indexing.

For a given $$$q$$$, the number of steps becomes

Unable to parse markup [type=CF_MATHJAX]

Taking the average from $$$q = 0$$$ to $$$n$$$, we get

Unable to parse markup [type=CF_MATHJAX]

, where

Unable to parse markup [type=CF_MATHJAX]

. Taking advantage of the fact that

Unable to parse markup [type=CF_MATHJAX]

, we can write this as

Unable to parse markup [type=CF_MATHJAX]

We can lower bound this expression by

Unable to parse markup [type=CF_MATHJAX]

, where

Unable to parse markup [type=CF_MATHJAX]

. Note that this bound is achievable, by putting all probability mass at either

Unable to parse markup [type=CF_MATHJAX]

or at

Unable to parse markup [type=CF_MATHJAX]

Let

Unable to parse markup [type=CF_MATHJAX]

. Then we have

Unable to parse markup [type=CF_MATHJAX]

Lemma: If $$$f$$$ is convex on the integers, then we have

Unable to parse markup [type=CF_MATHJAX]

for $$$k \ge 0$$$.

Proof: Use convexity to represent $$$a$$$ and $$$b$$$ as convex combinations of $$$a - k$$$ and

Unable to parse markup [type=CF_MATHJAX]

Claim: $$$d(n)$$$ is convex over

Unable to parse markup [type=CF_MATHJAX]

Proof: It is sufficient to show that

Unable to parse markup [type=CF_MATHJAX]

for $$$n \ge 1$$$. This is trivially true for $$$n = 1$$$. Now let's suppose that $$$d$$$ is convex on $$$[0, n]$$$ for some $$$n \ge 1$$$. Note that by the lemma, for all $$$1 \le i \le n + 1$$$,

Unable to parse markup [type=CF_MATHJAX]

. The inductive step is now straightforward, and we are done.

Note that by the proof of the previous claim, we directly have that

Unable to parse markup [type=CF_MATHJAX]

is the midpoint of the range (and when there are two midpoints, any of them works), and we are thus done.

Edit: from this recurrence, using induction it is also fairly straightforward to show that this function is piecewise linear, and

Unable to parse markup [type=CF_MATHJAX]

are the points where the slopes change. More precisely, we can show that if $$$n + 1$$$ is not a power of 2, then

Unable to parse markup [type=CF_MATHJAX]

→ Reply

Mohammed.Amer.245

2 years ago, # |

← Rev. 2 →

-21

My level is not very advanced but i know a lot about Generating Random Numbers, so if we want to pick a random number from l to r : so we will use this Equation => Start + rand() % (End — Start + 1).

but here m = l + rand() % (r — l), why didn't you add 1 between () ?

→ Reply

sslotin

2 years ago, # ^ |

Because picking the last element doesn't really do anything, as the comparison won't shrink the search range.

Think of it as randomly selecting between the non-empty splits of an $$$n$$$-element array: there are exactly $$$(n-1)$$$ of them, not $$$n$$$.

→ Reply

purplesyringa

2 years ago, # ^ |

Ah. I think the culprit is that your lower_bound returns a value, not an index, and therefore assumes the last element is always larger than (or equal to) x. The usual lower_bound doesn't. Might be a good idea to add a comment about this to the post or something

→ Reply

purplesyringa

2 years ago, # ^ |

+11

Better now, but lower_bound still never returns n if all elements are less than x.

→ Reply

sslotin

2 years ago, # ^ |

Oh, for fuck's sake.

Never thought that there would be a day I make two errors in a binary search.

Especially after just writing a 20-page article on it.

→ Reply