Problem Statement
Randomly choosing k items from a list S containing n items where n is too large or unknown.
Method: Reservoir Sampling
Steps:
- keep the first item in memory
- for the next item i (i > 1)
- keep the new item with probability 1/i
- ignore the new item with probability 1 – 1/i
It is easy to prove that when there are n items, each item is kept with probability 1/n.
Python Implementation
Suppose we have a long document but only keep 10 tokens from the document.
import random
SAMPLE_COUNT = 10
random.seed(2333)
doc = '''Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.'''
tokens = doc.split(' ')
sample_tokens = []
for idx, token in enumerate(tokens):
# Generate the reservoir
if idx < SAMPLE_COUNT:
sample_tokens.append(token)
else:
# Randomly replace elements in the reservoir
r = random.randint(0, idx)
if r < SAMPLE_COUNT:
sample_tokens[r] = token
print(sample_tokens)