# Shuffle and Sampling - 隨機抽樣和洗牌
### 洗牌算法
- [Shuffle a given array - GeeksforGeeks](http://www.geeksforgeeks.org/shuffle-a-given-array/)
Given an array, write a program to generate a random permutation of array elements. This question is also asked as “shuffle a deck of cards” or “randomize a given array”.
### 題解
這里以 Fisher–Yates shuffle 算法為例,偽代碼如下:
~~~
To shuffle an array a of n elements (indices 0..n-1):
for i from 0 downto i do
j ← random integer such that 0 ≤ j ≤ i
exchange a[j] and a[i]
~~~
轉化為代碼為:
~~~
/*
* shuffle cards
*/
public static void shuffleCard(int[] cards) {
if (cards == null || cards.length == 0) return;
Random rand = new Random();
for (int i = 0; i < cards.length; i++) {
int k = rand.nextInt(i + 1); // 0~i (inclusive)
int temp = cards[i];
cards[i] = cards[k];
cards[k] = temp;
}
}
~~~
看了算法和代碼后讓我們來使用歸納法簡單證明下這個洗牌算法的正確性。我們要證明的問題是:**數組中每個元素在每個索引處出現的概率均相等。**
對于單個元素來講,以上算法顯然正確,因為交換后仍然只有一個元素。接下來我們不妨假設其遍歷到數組索引為`i-1`時滿足隨機排列特性,那么當遍歷到數組索引為`i`時,隨機數`k`為`i`的概率為`1/i`, 為`0~i-1`的概率為`(i-1)/i`. 接下來與索引為`i`的值交換,可以得知`card[i]`出現在索引`i`的位置的概率為`1/i`, 在其他索引位置的概率也為`1/i`; 而對于`card[i]`之前的元素,以索引`j`處的元素`card[j]`為例進行分析可知其在位置`j`的概率為`1/(i-1) * (i-1)/i = 1/i`, 具體含義為遍歷到索引`i-1`時`card[j]`位于索引`j`的概率(`1/(i-1)`)乘以遍歷到索引`i`時隨機數未選擇與索引`j`的數進行交換的概率(`(i-1)/i`).
需要注意的是前面的`j <= i-1`, 那么`card[j]`位于索引`i`的概率又是多少呢?要位于索引`i`,則隨機數`k`須為`i`, 這種概率為`1/i`.
綜上,以上算法可以實現完美洗牌(等概率)。
### Random sampling - 隨機抽樣
隨機抽樣也稱為水池抽樣,Randomly choosing a sample of k items from a list S containing n items. 大意是從大小為 n 的數組中隨機選出 m 個整數,要求每個元素被選中的概率相同。
### 題解
比較簡潔的有算法 Algorithm R, 偽代碼如下:
~~~
/*
S has items to sample, R will contain the result
*/
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i) // important: inclusive range
if j <= k
R[j] := S[i]
~~~
轉化為代碼為:
~~~
/*
* random sample
*/
public static int[] randomSample(int[] nums, int m) {
if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
int[] sample = new int[m];
for (int i = 0; i < m; i++) {
sample[i] = nums[i];
}
Random random = new Random();
for (int i = m; i < nums.length; i++) {
int k = random.nextInt(i + 1); // 0~i(inclusive)
if (k < m) {
sample[k] = nums[i];
}
}
return sample;
}
~~~
和洗牌算法類似,我們要證明的問題是:**數組中每個元素在最終采樣的數組中出現的概率均相等且為`m/n`.** 洗牌算法中是排列,而對于隨機抽樣則為組合。
維基百科上的證明相對容易懂一些,這里我稍微復述下。首先將數組前 m 個元素填充進新數組`sample`, 然后從`m`開始遍歷直至數組最后一個索引。隨機數`k`的范圍為`0~i`, 如果`k < m`,新數組的索引為 k 的元素則和原數組索引為`i`的元素交換;如果`k >= m`, 則不進行交換。`i == m`時,以原數組中第`j`個元素為例,它被`nums[m]`替換的概率為`1/(m+1)`, 也就是說保留在`sample`數組中的概率為`m/(m+1)`. 對與第`m+1`個元素`nums[m]`來說,其位于`sample`數組中的概率則為`m*1/(m+1)`(可替換 m 個不同的元素).
接下來仍然使用數學歸納法證明,若`i`遍歷到`r`時,其之前的元素出現的概率為`m/(r-1)`, 那么其之前的元素中任一元素`nums[j]`被替換的概率為`m/r * 1/m = 1/r`, 不被替換的概率則為`(r-1)/r`. 故元素`nums[j]`在`i`遍歷完`r`后仍然保留的概率為`m/(r-1) * (r-1)/r = m/r`. 而對于元素`nums[r]`來說,其要被替換至`sample`數組中的概率則為`m/r`(隨機數小于m 的個數為 m).
綜上,以上算法在遍歷完長度為 n 的數組后每個數出現在最終`sample`數組中的概率都為`m/n`.
### Implementation and Test case
**Talk is cheap, show me the code!**
### Java
~~~
import java.util.*;
import java.util.Random;
public class Probability {
public static void main(String[] args) {
int[] cards = new int[10];
for (int i = 0; i < 10; i++) {
cards[i] = i;
}
// 100000 times test
final int times = 100000;
final int m = 5;
int[][] count = new int[cards.length][cards.length];
int[][] count2 = new int[cards.length][m];
for (int i = 0; i < times; i++) {
shuffleCard(cards);
shuffleTest(cards, count);
int[] sample = randomSample(cards, m);
shuffleTest(sample, count2);
}
System.out.println("Shuffle cards");
shufflePrint(count);
System.out.println();
System.out.println("Random sample");
shufflePrint(count2);
}
/*
* shuffle cards
*/
public static void shuffleCard(int[] cards) {
if (cards == null || cards.length == 0) return;
Random rand = new Random();
for (int i = 0; i < cards.length; i++) {
int k = rand.nextInt(i + 1);
int temp = cards[i];
cards[i] = cards[k];
cards[k] = temp;
}
}
/*
* random sample
*/
public static int[] randomSample(int[] nums, int m) {
if (nums == null || nums.length == 0 || m <= 0) return new int[]{};
m = Math.min(m, nums.length);
int[] sample = new int[m];
for (int i = 0; i < m; i++) {
sample[i] = nums[i];
}
Random random = new Random();
for (int i = m; i < nums.length; i++) {
int k = random.nextInt(i + 1);
if (k < m) {
sample[k] = nums[i];
}
}
return sample;
}
/*
* nums[i] = j, num j appear in index i ==> count[j][i]
*/
public static void shuffleTest(int[] nums, int[][] count) {
if (nums == null || nums.length == 0) return;
for (int i = 0; i < nums.length; i++) {
count[nums[i]][i]++;
}
}
/*
* print shuffle test
*/
public static void shufflePrint(int[][] count) {
if (count == null || count.length == 0) return;
// print index
System.out.print(" ");
for (int i = 0; i < count[0].length; i++) {
System.out.printf("%-7d", i);
}
System.out.println();
// print num appear in index i in total
for (int i = 0; i < count.length; i++) {
System.out.print(i + ": ");
for (int j = 0; j < count[i].length; j++) {
System.out.printf("%-7d", count[i][j]);
}
System.out.println();
}
}
}
~~~
以十萬次試驗為例,左側是元素`i`, 列代表在相應索引位置出現的次數。可以看出分布還是比較隨機的。
~~~
Shuffle cards
0 1 2 3 4 5 6 7 8 9
0: 10033 9963 10043 9845 9932 10020 9964 10114 10043 10043
1: 9907 9951 9989 10071 10059 9966 10054 10023 10015 9965
2: 10042 10046 9893 10080 10050 9994 10024 9852 10098 9921
3: 10039 10023 10039 10024 9919 10057 10188 9916 9907 9888
4: 9944 9913 10196 10059 9838 10205 9899 9945 9850 10151
5: 10094 9971 10054 9958 10022 9922 10047 9978 9965 9989
6: 9995 10147 9824 10015 10023 9804 10050 10192 9939 10011
7: 9941 10131 9902 9920 10040 10121 10010 9928 9984 10023
8: 10010 9926 9883 10098 10083 10028 9801 9936 10200 10035
9: 9995 9929 10177 9930 10034 9883 9963 10116 9999 9974
Random sample
0 1 2 3 4
0: 9966 10026 10078 9966 9891
1: 9958 9806 10066 10022 10039
2: 9923 9936 9964 10051 10083
3: 10165 10088 10184 9928 9916
4: 9998 9990 9973 9931 9832
5: 10026 9932 9873 10085 10035
6: 9942 9972 9990 10030 10026
7: 9903 10153 9997 10051 10044
8: 10082 10066 9804 9899 10147
9: 10037 10031 10071 10037 9987
~~~
### Reference
- [Fisher–Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm) - 洗牌算法的詳述,比較簡潔的算法
- [Reservoir sampling ](https://en.wikipedia.org/wiki/Reservoir_sampling) - 水池抽樣算法
- [如何測試洗牌程序 | 酷 殼 - CoolShell.cn](http://coolshell.cn/articles/8593.html) - 借鑒了其中的一些測試方法
- 《計算機程序設計藝術》第二卷(半數值算法) - 3.4.2 隨機抽樣和洗牌
- 《編程珠璣》第十二章 - 抽樣問題
- Preface
- Part I - Basics
- Basics Data Structure
- String
- Linked List
- Binary Tree
- Huffman Compression
- Queue
- Heap
- Stack
- Set
- Map
- Graph
- Basics Sorting
- Bubble Sort
- Selection Sort
- Insertion Sort
- Merge Sort
- Quick Sort
- Heap Sort
- Bucket Sort
- Counting Sort
- Radix Sort
- Basics Algorithm
- Divide and Conquer
- Binary Search
- Math
- Greatest Common Divisor
- Prime
- Knapsack
- Probability
- Shuffle
- Basics Misc
- Bit Manipulation
- Part II - Coding
- String
- strStr
- Two Strings Are Anagrams
- Compare Strings
- Anagrams
- Longest Common Substring
- Rotate String
- Reverse Words in a String
- Valid Palindrome
- Longest Palindromic Substring
- Space Replacement
- Wildcard Matching
- Length of Last Word
- Count and Say
- Integer Array
- Remove Element
- Zero Sum Subarray
- Subarray Sum K
- Subarray Sum Closest
- Recover Rotated Sorted Array
- Product of Array Exclude Itself
- Partition Array
- First Missing Positive
- 2 Sum
- 3 Sum
- 3 Sum Closest
- Remove Duplicates from Sorted Array
- Remove Duplicates from Sorted Array II
- Merge Sorted Array
- Merge Sorted Array II
- Median
- Partition Array by Odd and Even
- Kth Largest Element
- Binary Search
- Binary Search
- Search Insert Position
- Search for a Range
- First Bad Version
- Search a 2D Matrix
- Search a 2D Matrix II
- Find Peak Element
- Search in Rotated Sorted Array
- Search in Rotated Sorted Array II
- Find Minimum in Rotated Sorted Array
- Find Minimum in Rotated Sorted Array II
- Median of two Sorted Arrays
- Sqrt x
- Wood Cut
- Math and Bit Manipulation
- Single Number
- Single Number II
- Single Number III
- O1 Check Power of 2
- Convert Integer A to Integer B
- Factorial Trailing Zeroes
- Unique Binary Search Trees
- Update Bits
- Fast Power
- Hash Function
- Count 1 in Binary
- Fibonacci
- A plus B Problem
- Print Numbers by Recursion
- Majority Number
- Majority Number II
- Majority Number III
- Digit Counts
- Ugly Number
- Plus One
- Linked List
- Remove Duplicates from Sorted List
- Remove Duplicates from Sorted List II
- Remove Duplicates from Unsorted List
- Partition List
- Two Lists Sum
- Two Lists Sum Advanced
- Remove Nth Node From End of List
- Linked List Cycle
- Linked List Cycle II
- Reverse Linked List
- Reverse Linked List II
- Merge Two Sorted Lists
- Merge k Sorted Lists
- Reorder List
- Copy List with Random Pointer
- Sort List
- Insertion Sort List
- Check if a singly linked list is palindrome
- Delete Node in the Middle of Singly Linked List
- Rotate List
- Swap Nodes in Pairs
- Remove Linked List Elements
- Binary Tree
- Binary Tree Preorder Traversal
- Binary Tree Inorder Traversal
- Binary Tree Postorder Traversal
- Binary Tree Level Order Traversal
- Binary Tree Level Order Traversal II
- Maximum Depth of Binary Tree
- Balanced Binary Tree
- Binary Tree Maximum Path Sum
- Lowest Common Ancestor
- Invert Binary Tree
- Diameter of a Binary Tree
- Construct Binary Tree from Preorder and Inorder Traversal
- Construct Binary Tree from Inorder and Postorder Traversal
- Subtree
- Binary Tree Zigzag Level Order Traversal
- Binary Tree Serialization
- Binary Search Tree
- Insert Node in a Binary Search Tree
- Validate Binary Search Tree
- Search Range in Binary Search Tree
- Convert Sorted Array to Binary Search Tree
- Convert Sorted List to Binary Search Tree
- Binary Search Tree Iterator
- Exhaustive Search
- Subsets
- Unique Subsets
- Permutations
- Unique Permutations
- Next Permutation
- Previous Permuation
- Unique Binary Search Trees II
- Permutation Index
- Permutation Index II
- Permutation Sequence
- Palindrome Partitioning
- Combinations
- Combination Sum
- Combination Sum II
- Minimum Depth of Binary Tree
- Word Search
- Dynamic Programming
- Triangle
- Backpack
- Backpack II
- Minimum Path Sum
- Unique Paths
- Unique Paths II
- Climbing Stairs
- Jump Game
- Word Break
- Longest Increasing Subsequence
- Palindrome Partitioning II
- Longest Common Subsequence
- Edit Distance
- Jump Game II
- Best Time to Buy and Sell Stock
- Best Time to Buy and Sell Stock II
- Best Time to Buy and Sell Stock III
- Best Time to Buy and Sell Stock IV
- Distinct Subsequences
- Interleaving String
- Maximum Subarray
- Maximum Subarray II
- Longest Increasing Continuous subsequence
- Longest Increasing Continuous subsequence II
- Graph
- Find the Connected Component in the Undirected Graph
- Route Between Two Nodes in Graph
- Topological Sorting
- Word Ladder
- Bipartial Graph Part I
- Data Structure
- Implement Queue by Two Stacks
- Min Stack
- Sliding Window Maximum
- Longest Words
- Heapify
- Problem Misc
- Nuts and Bolts Problem
- String to Integer
- Insert Interval
- Merge Intervals
- Minimum Subarray
- Matrix Zigzag Traversal
- Valid Sudoku
- Add Binary
- Reverse Integer
- Gray Code
- Find the Missing Number
- Minimum Window Substring
- Continuous Subarray Sum
- Continuous Subarray Sum II
- Longest Consecutive Sequence
- Part III - Contest
- Google APAC
- APAC 2015 Round B
- Problem A. Password Attacker
- Microsoft
- Microsoft 2015 April
- Problem A. Magic Box
- Problem B. Professor Q's Software
- Problem C. Islands Travel
- Problem D. Recruitment
- Microsoft 2015 April 2
- Problem A. Lucky Substrings
- Problem B. Numeric Keypad
- Problem C. Spring Outing
- Microsoft 2015 September 2
- Problem A. Farthest Point
- Appendix I Interview and Resume
- Interview
- Resume