Reservoir sampling

Problem Statement

Randomly choosing k items from a list S containing n items where n is too large or unknown.

Method: Reservoir Sampling

Steps:

keep the first item in memory
for the next item i (i > 1)
- keep the new item with probability 1/i
- ignore the new item with probability 1 – 1/i

It is easy to prove that when there are n items, each item is kept with probability 1/n.

Python Implementation

Suppose we have a long document but only keep 10 tokens from the document.

import random
SAMPLE_COUNT = 10

random.seed(2333)

doc = '''Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.'''

tokens = doc.split(' ')

sample_tokens = []
for idx, token in enumerate(tokens):
        # Generate the reservoir
        if idx < SAMPLE_COUNT:
                sample_tokens.append(token)
        else:
                # Randomly replace elements in the reservoir
                r = random.randint(0, idx)
                if r < SAMPLE_COUNT:
                        sample_tokens[r] = token

print(sample_tokens)

Leave a comment Cancel reply