Randomly choosing k items from a list S containing n items where n is too large or unknown.
Method: Reservoir Sampling
- keep the first item in memory
- for the next item i (i > 1)
- keep the new item with probability 1/i
- ignore the new item with probability 1 – 1/i
It is easy to prove that when there are n items, each item is kept with probability 1/n.
Suppose we have a long document but only keep 10 tokens from the document.
import random SAMPLE_COUNT = 10 random.seed(2333) doc = '''Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.''' tokens = doc.split(' ') sample_tokens =  for idx, token in enumerate(tokens): # Generate the reservoir if idx < SAMPLE_COUNT: sample_tokens.append(token) else: # Randomly replace elements in the reservoir r = random.randint(0, idx) if r < SAMPLE_COUNT: sample_tokens[r] = token print(sample_tokens)