|
| 1 | +Count-Min Sketch |
| 2 | +================ |
| 3 | + |
| 4 | +The Count–Min Sketch is a simple, space-efficient probabilistic data |
| 5 | +structure that is used to estimate frequencies of elements in data streams |
| 6 | +and can address the heavy hitters problem. It was presented in 2003 [1] by |
| 7 | +Graham Cormode and Shan Muthukrishnan and published in 2005 [2]. |
| 8 | + |
| 9 | +References |
| 10 | +---------- |
| 11 | +[1] Cormode, G., Muthukrishnan, S. |
| 12 | + What's hot and what's not: Tracking most frequent items dynamically |
| 13 | + Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on Principles |
| 14 | + of database systems, San Diego, California - June 09-11, 2003, |
| 15 | + pp. 296–306, ACM New York, NY. |
| 16 | + http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf |
| 17 | +[2] Cormode, G., Muthukrishnan, S. |
| 18 | + An Improved Data Stream Summary: The Count–Min Sketch and its Applications |
| 19 | + Journal of Algorithms, Vol. 55 (1), pp. 58–75. |
| 20 | + http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf |
| 21 | + |
| 22 | + |
| 23 | +This implementation uses the MurmurHash3 family of hash functions, |
| 24 | +which yield 32-bit hash values. Thus, the length of each counter array |
| 25 | +is expected to be less than or equal to 2^{32} - 1, since |
| 26 | +elements at indexes above this value cannot be addressed. |
| 27 | + |
| 28 | + |
| 29 | +.. code:: python |
| 30 | +
|
| 31 | + from pdsa.frequency.count_min_sketch import CountMinSketch |
| 32 | +
|
| 33 | + cms = CountMinSketch(5, 2000) |
| 34 | + cms.add("hello") |
| 35 | + cms.frequency("hello") |
| 36 | +
|
| 37 | +
|
| 38 | +
|
| 39 | +Build a sketch |
| 40 | +---------------- |
| 41 | + |
| 42 | +You can build a new sketch either by specifying its dimensions |
| 43 | +(number of counter arrays and their length), or from the expected |
| 44 | +overestimation deviation and the standard error probability. |
| 45 | + |
| 46 | + |
| 47 | +Build a sketch from its dimensions |
| 48 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 49 | + |
| 50 | +.. code:: python |
| 51 | +
|
| 52 | + from pdsa.frequency.count_min_sketch import CountMinSketch |
| 53 | +
|
| 54 | + cms = CountMinSketch(num_of_counters=5, length_of_counter=2000) |
| 55 | +
|
| 56 | +
|
| 57 | +Build a sketch from the expected errors |
| 58 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 59 | + |
| 60 | +In this case, the number of counter arrays and their length |
| 61 | +will be calculated according to the expected overestimation |
| 62 | +deviation and the requested error probability. |
| 63 | + |
| 64 | + |
| 65 | +.. code:: python |
| 66 | +
|
| 67 | + from pdsa.frequency.count_min_sketch import CountMinSketch |
| 68 | +
|
| 69 | + cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01) |
| 70 | +
|
| 71 | +
|
| 72 | +.. note:: |
| 73 | + |
| 74 | + The `deviation` is the error ε in answering a particular query. |
| 75 | + For example, if we expect 10^7 elements and allow a fixed |
| 76 | + overestimate of 10, the deviation is 10/10^7 = 10^{-6}. |
| 77 | + |
| 78 | + The `error` is the standard error δ (0 < error < 1). |
| 79 | + |
| 80 | + |
| 81 | +.. note:: |
| 82 | + |
| 83 | + The Count–Min Sketch is approximate and probabilistic at the same |
| 84 | + time, therefore two parameters, the error ε in answering a particular |
| 85 | + query and the error probability δ, affect the space and time |
| 86 | + requirements. It guarantees that the estimation error for frequencies |
| 87 | + will not exceed ε × n, where n is the total number of indexed elements, |
| 88 | + with probability at least 1 - δ. |
| 89 | + |
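The guarantee in the note above suggests how the two dimensions can be derived from ε and δ. The following is a minimal sketch that assumes the sizing given in the Count–Min Sketch paper [2]: counter length ⌈e/ε⌉ and ⌈ln(1/δ)⌉ counter arrays. The helper name `dimensions_from_errors` is hypothetical, and the exact rounding used by `create_from_expected_error` may differ.

```python
import math


def dimensions_from_errors(deviation, error):
    """Return (num_of_counters, length_of_counter) for the given ε and δ.

    Hypothetical helper: it follows the sizing from the Count-Min Sketch
    paper, which is not necessarily the library's own calculation.
    """
    if deviation <= 0:
        raise ValueError("deviation must be positive")
    if not 0 < error < 1:
        raise ValueError("error must be in the open interval (0, 1)")

    # Wider counter arrays mean fewer collisions, hence a smaller overestimate.
    length_of_counter = math.ceil(math.e / deviation)
    # More counter arrays lower the probability that every row collides badly.
    num_of_counters = math.ceil(math.log(1 / error))
    return num_of_counters, length_of_counter


depth, width = dimensions_from_errors(deviation=0.000001, error=0.01)
print(depth, width)
```

Under these assumptions, ε = 10^{-6} and δ = 0.01 give 5 counter arrays of length 2718282: a small ε dominates the memory footprint, while δ only adds rows logarithmically.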
| 90 | + |
| 91 | +Index an element into the sketch |
| 92 | +-------------------------------- |
| 93 | + |
| 94 | + |
| 95 | +.. code:: python |
| 96 | +
|
| 97 | + cms.add("hello") |
| 98 | +
|
| 99 | +
|
| 100 | +.. note:: |
| 101 | + |
| 102 | + Any element can be indexed into the sketch. Internally, |
| 103 | + the *repr()* of the Python object is used to calculate hash values for |
| 104 | + elements that are not integers, strings or bytes. |
| 105 | + |
| 106 | + |
| 107 | +Estimate frequency of the element |
| 108 | +--------------------------------------- |
| 109 | + |
| 110 | +.. code:: python |
| 111 | +
|
| 112 | + print(cms.frequency("hello")) |
| 113 | +
|
| 114 | +
|
| 115 | +.. warning:: |
| 116 | + |
| 117 | + The estimate is only an approximation; it may exceed the exact frequency. |
| 118 | + |
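To illustrate why an estimate can overshoot the true count but not fall below it, here is a minimal pure-Python sketch of the technique. It is not the library's implementation: the class `TinyCountMin` is hypothetical, it hashes the *repr()* of every element, and it uses salted BLAKE2b from `hashlib` in place of MurmurHash3 so that it runs with the standard library alone.

```python
import hashlib
from collections import Counter


class TinyCountMin:
    """Illustrative Count-Min Sketch (hypothetical, not the pdsa class)."""

    def __init__(self, num_of_counters, length_of_counter):
        self.depth = num_of_counters
        self.width = length_of_counter
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, element):
        # One hash function per row, obtained by salting with the row number.
        # Assumption: BLAKE2b stands in for MurmurHash3 here.
        digest = hashlib.blake2b(
            repr(element).encode("utf-8"),
            digest_size=4,  # 32-bit hash value
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, element):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Each row's counter holds the true count plus any collisions,
        # so the minimum over the rows is the tightest available estimate.
        return min(
            self.rows[row][self._index(row, element)]
            for row in range(self.depth)
        )


stream = ["a"] * 50 + ["b"] * 5 + [str(i) for i in range(500)]
exact = Counter(stream)

cms = TinyCountMin(num_of_counters=3, length_of_counter=64)
for item in stream:
    cms.add(item)

# Collisions can only add to a counter, so no estimate is below the truth.
assert all(cms.frequency(item) >= count for item, count in exact.items())
print(cms.frequency("a"), exact["a"])
```

The counter arrays are deliberately short here so that collisions occur; with longer arrays the estimates move closer to the exact counts.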
| 119 | + |
| 120 | + |
| 121 | +Size of the sketch in bytes |
| 122 | +---------------------------- |
| 123 | + |
| 124 | +.. code:: python |
| 125 | +
|
| 126 | + print(cms.sizeof()) |
| 127 | +
|
| 128 | +
|
| 129 | +Length of the sketch |
| 130 | +--------------------- |
| 131 | + |
| 132 | +.. code:: python |
| 133 | +
|
| 134 | + print(len(cms)) |