In many standard distributed search engines (not limited to text), the document
index is sharded across many machines. If we want to retrieve the top *m*
results from a given corpus, we need to query each shard for the local top *m*.
We need the local top *m* from each shard in case the global top *m* are all
contained on one shard.

This paper from my alma mater contributes an interesting statistical model to
reduce the number of results required from a given shard for an acceptable
probability that not all top *m* results will be present.

The idea comes from the fact that the documents are often randomly (and
uniformly) distributed across the shards. Therefore, it’s extremely unlikely
that the global top *m* are on the same shard. As the paper states, the
probability is .

For the derivation of the model you should really see the paper but I will highlight a few interesting points.

The probability that all of the global top *m* results will be contained in the
top *k* from each of the *n* nodes is given by *p(n,m,k)*. Their equation is
solvable with dynamic programming.

For some sample points, if a system has 64 shards we could request the top 9 results from each shard to return the top 100 globally with over 99% probability.

The authors then change the model to determine how much of the top *m* items
should be expected to be returned for a given *n, m* and *k*.

For example, a system with 64 shards retrieving the top 3 on each shard gives us
an expected global top *m* of 40.

We don’t usually need to have the global top *m* as long as we can most of them.
Our ranking functions are not perfect either! Therefore, we could trade the
quality of the results for … what?

Well, for a simple text search engine where the results from each shard is often just the document ID and the score there is little benefit to limiting the number of results returned per node. There’s not a lot of networking overhead here.

The authors were working on a question answering system based on passage retrieval where having a smaller initial result set is very beneficial.

One of the content-based image search systems I worked on used a two step ranking system. First many thousands of initial results were retrieved from the shards and then they were re-ranked with CPU intensive CV algorithms. I didn’t test this at the time, but I wonder how search latency could be improved if we didn’t retrieve so many candidate images. With less computation per query we could have also increased query throughput.

Perhaps some other systems could benefit from this approximate top *m* result
approach. What if we could avoid disk for most searches if we found the required
number of results in memory? (We’d have to be careful about what we put on disk
of course.)

What if, on the second ranking pass, we compute as many results as we can (in priority order) until a timeout. We could model the expected quality of the results.