I've got an interesting math problem.
Here is a binary search variant that picks the element to compare against at random instead of always picking the middle one:
int lower_bound(int x) {
int l = 0, r = n; // [l, r]
while (l < r) {
int m = l + rand() % (r - l); // <- note that r is never picked because there's no point
if (a[m] >= x)
r = m;
else
l = m + 1;
}
return l;
}
Assuming that the keys and queries are also drawn independently at random, as $$$n \to \infty$$$, how many times more comparisons does the random binary search perform compared to the normal one?
Edit: the binary search used to return a value instead of index. This is not correct if the the lower bound doesn't exist.
I will use $$$1$$$-indexing in what follows. Also I'll assume that we count the number of element comparisons (since comparing $$$l, r$$$ is not memory intensive, so it is not that important anyway). If we also count comparisons of $$$l, r$$$, it is twice of what we are counting here, minus 1, so it won't end up changing the ratio at all.
Firstly note that we can assume that the underlying array is $$$1, \ldots, n$$$ (since we can compress elements, and for the sake of simplicity we can assume that the array consists of distinct elements). I suspect that the result should remain the same even when there are duplicate elements.
Let
Unable to parse markup [type=CF_MATHJAX]
be the expected number of element comparisons this algorithm does when you have a query $$$q$$$ in an interval $$$[l, r]$$$.Then note that
Unable to parse markup [type=CF_MATHJAX]
.Using this, if
Unable to parse markup [type=CF_MATHJAX]
, we can get a recurrence:Unable to parse markup [type=CF_MATHJAX]
.We want
Unable to parse markup [type=CF_MATHJAX]
, and the recurrence on summing for $$$q = 1$$$ to $$$n$$$ becomesUnable to parse markup [type=CF_MATHJAX]
.Let
Unable to parse markup [type=CF_MATHJAX]
. We then haveUnable to parse markup [type=CF_MATHJAX]
.So we have
Unable to parse markup [type=CF_MATHJAX]
. So the answer to the problem is asymptoticallyUnable to parse markup [type=CF_MATHJAX]
.Edit: I believe that the binary search given here works correctly if and only if
Unable to parse markup [type=CF_MATHJAX]
.I thought it would be fun to solve it for a general probability distribution
Unable to parse markup [type=CF_MATHJAX]
over the choices, i.e., out ofUnable to parse markup [type=CF_MATHJAX]
,Unable to parse markup [type=CF_MATHJAX]
is the probability of picking $$$i$$$. In the above problem,Unable to parse markup [type=CF_MATHJAX]
and $$$0$$$ otherwise.The equation for number of comparisons for
Unable to parse markup [type=CF_MATHJAX]
becomesUnable to parse markup [type=CF_MATHJAX]
and the expected number of comparisons for $$$n$$$,Unable to parse markup [type=CF_MATHJAX]
Simplifying gives
Unable to parse markup [type=CF_MATHJAX]
Unable to parse markup [type=CF_MATHJAX]
. Note thatUnable to parse markup [type=CF_MATHJAX]
is symmetric, i.e.,Unable to parse markup [type=CF_MATHJAX]
.[UPD: what follows is not quite rigorous as pointed out]
Using this we can prove that the normal binary search is optimal. We know that
Unable to parse markup [type=CF_MATHJAX]
, both are trivial bounds. WhichUnable to parse markup [type=CF_MATHJAX]
is concave upwards. Now we can easily prove thatUnable to parse markup [type=CF_MATHJAX]
forUnable to parse markup [type=CF_MATHJAX]
is sub-optimal. Since $$$Q$$$ is symmetric,Unable to parse markup [type=CF_MATHJAX]
forUnable to parse markup [type=CF_MATHJAX]
.The claim that
Unable to parse markup [type=CF_MATHJAX]
is concave upwards sounds a bit dubious to me (it doesn't hold for arbitrary functions).However, it can be shown that for uniformly random queries (i.e., for integers uniformly chosen from $$$[0, n]$$$) for the corrected version of binary search in the original post ($$$r = n$$$ instead of
Unable to parse markup [type=CF_MATHJAX]
), we have thatUnable to parse markup [type=CF_MATHJAX]
, where $$$f$$$ is the binary entropy function (naming as done in https://oeis.org/A003314, this property is proved in https://epubs.siam.org/doi/10.1137/0117001 — not open access but fairly simple to prove). This bound is sharp as well.It is a piecewise linear function with the breakpoints at powers of $$$2$$$, where its value is precisely $$$n \log_2 n$$$.
Here's a complete proof of why normal binary search is optimal (no matter which midpoint you pick). I'll rederive using the correct binary search, which is as follows:
I will assume that the queries are uniformly random, from $$$[0, n]$$$ (rather than from $$$[0, n - 1]$$$, since there are $$$n + 1$$$ possible answers). I'll also revert back to $$$0$$$-based indexing.
For a given $$$q$$$, the number of steps becomes
Unable to parse markup [type=CF_MATHJAX]
.Taking the average from $$$q = 0$$$ to $$$n$$$, we get
Unable to parse markup [type=CF_MATHJAX]
, whereUnable to parse markup [type=CF_MATHJAX]
. Taking advantage of the fact thatUnable to parse markup [type=CF_MATHJAX]
, we can write this asUnable to parse markup [type=CF_MATHJAX]
We can lower bound this expression by
Unable to parse markup [type=CF_MATHJAX]
, whereUnable to parse markup [type=CF_MATHJAX]
. Note that this bound is achievable, by putting all probability mass at eitherUnable to parse markup [type=CF_MATHJAX]
or atUnable to parse markup [type=CF_MATHJAX]
.Let
Unable to parse markup [type=CF_MATHJAX]
. Then we haveUnable to parse markup [type=CF_MATHJAX]
.Lemma: If $$$f$$$ is convex on the integers, then we have
Unable to parse markup [type=CF_MATHJAX]
for $$$k \ge 0$$$.Proof: Use convexity to represent $$$a$$$ and $$$b$$$ as convex combinations of $$$a - k$$$ and
Unable to parse markup [type=CF_MATHJAX]
.Claim: $$$d(n)$$$ is convex over
Unable to parse markup [type=CF_MATHJAX]
.Proof: It is sufficient to show that
Unable to parse markup [type=CF_MATHJAX]
for $$$n \ge 1$$$. This is trivially true for $$$n = 1$$$. Now let's suppose that $$$d$$$ is convex on $$$[0, n]$$$ for some $$$n \ge 1$$$. Note that by the lemma, for all $$$1 \le i \le n + 1$$$,Unable to parse markup [type=CF_MATHJAX]
. The inductive step is now straightforward, and we are done.Note that by the proof of the previous claim, we directly have that
Unable to parse markup [type=CF_MATHJAX]
is the midpoint of the range (and when there are two midpoints, any of them works), and we are thus done.Edit: from this recurrence, using induction it is also fairly straightforward to show that this function is piecewise linear, and
Unable to parse markup [type=CF_MATHJAX]
are the points where the slopes change. More precisely, we can show that if $$$n + 1$$$ is not a power of 2, thenUnable to parse markup [type=CF_MATHJAX]
.My level is not very advanced but i know a lot about Generating Random Numbers, so if we want to pick a random number from l to r : so we will use this Equation => Start + rand() % (End — Start + 1).
but here m = l + rand() % (r — l), why didn't you add 1 between () ?
Because picking the last element doesn't really do anything, as the comparison won't shrink the search range.
Think of it as randomly selecting between the non-empty splits of an $$$n$$$-element array: there are exactly $$$(n-1)$$$ of them, not $$$n$$$.
Ah. I think the culprit is that your lower_bound returns a value, not an index, and therefore assumes the last element is always larger than (or equal to)
x
. The usual lower_bound doesn't. Might be a good idea to add a comment about this to the post or somethingBetter now, but
lower_bound
still never returnsn
if all elements are less thanx
.Oh, for fuck's sake.
Never thought that there would be a day I make two errors in a binary search.
Especially after just writing a 20-page article on it.