Reservoir sampling

Problem Statement

Randomly choosing k items from a list S containing n items where n is too large or unknown.

Method: Reservoir Sampling

Steps:

1. keep the first item in memory
2. for the next item i (i > 1)
• keep the new item with probability 1/i
• ignore the new item with probability 1 – 1/i

It is easy to prove that when there are n items, each item is kept with probability 1/n.

Python Implementation

Suppose we have a long document but only keep 10 tokens from the document.

``````import random
SAMPLE_COUNT = 10

random.seed(2333)

doc = '''Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.'''

tokens = doc.split(' ')

sample_tokens = []
for idx, token in enumerate(tokens):
# Generate the reservoir
if idx < SAMPLE_COUNT:
sample_tokens.append(token)
else:
# Randomly replace elements in the reservoir
r = random.randint(0, idx)
if r < SAMPLE_COUNT:
sample_tokens[r] = token

print(sample_tokens)``````