Reservoir sampling

Problem Statement

Randomly choosing k items from a list S containing n items where n is too large or unknown.

Method: Reservoir Sampling

Steps:

  1. keep the first item in memory
  2. for the next item i (i > 1)
    • keep the new item with probability 1/i
    • ignore the new item with probability 1 – 1/i

It is easy to prove that when there are n items, each item is kept with probability 1/n.

Python Implementation

Suppose we have a long document but only keep 10 tokens from the document.

import random
SAMPLE_COUNT = 10

random.seed(2333)

doc = '''Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.'''

tokens = doc.split(' ')

sample_tokens = []
for idx, token in enumerate(tokens):
        # Generate the reservoir
        if idx < SAMPLE_COUNT:
                sample_tokens.append(token)
        else:
                # Randomly replace elements in the reservoir
                r = random.randint(0, idx)
                if r < SAMPLE_COUNT:
                        sample_tokens[r] = token

print(sample_tokens)

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s