Commit 5fdcf55

Provide documentation for recently added Count/CM Sketches
impora: sntax fix
1 parent c41d40c commit 5fdcf55

5 files changed

+283
-0
lines changed

README.rst

Lines changed: 5 additions & 0 deletions
@@ -76,6 +76,11 @@ The latest documentation can be found at `<http://pdsa.readthedocs.io/en/latest/
7676
- `Probabilistic counter (Flajolet–Martin algorithm) <http://pdsa.readthedocs.io/en/latest/cardinality/probabilistic_counter.html>`_
7777
- `HyperLogLog <http://pdsa.readthedocs.io/en/latest/cardinality/hyperloglog.html>`_
7878

79+
**Frequency problem**
80+
81+
- `Count Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_sketch.html>`_
82+
- `Count-Min Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_min_sketch.html>`_
83+
7984
**Rank problem**
8085

8186
- `q-digest <http://pdsa.readthedocs.io/en/latest/rank/qdigest.html>`_
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
1+
Count-Min Sketch
2+
================
3+
4+
Count–Min Sketch is a simple space-efficient probabilistic data structure
5+
that is used to estimate frequencies of elements in data streams and can
6+
address the Heavy hitters problem. It was presented in 2003 [1] by
7+
Graham Cormode and Shan Muthukrishnan and published in 2005 [2].
8+
9+
References
10+
----------
11+
[1] Cormode, G., Muthukrishnan, S.
12+
What's hot and what's not: Tracking most frequent items dynamically
13+
Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on Principles
14+
of database systems, San Diego, California - June 09-11, 2003,
15+
pp. 296–306, ACM New York, NY.
16+
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf
17+
[2] Cormode, G., Muthukrishnan, S.
18+
An Improved Data Stream Summary: The Count–Min Sketch and its Applications
19+
Journal of Algorithms, Vol. 55 (1), pp. 58–75.
20+
http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
21+
22+
23+
This implementation uses the MurmurHash3 family of hash functions,
24+
which yield 32-bit hash values. Thus, the length of each counter array
25+
is expected to be less than or equal to (2^{32} - 1), since
26+
we cannot access elements with indexes above this value.
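
The link between a 32-bit hash value and the maximum counter length can be illustrated with a short sketch. It is not the library's code: ``hash32`` and ``counter_index`` are hypothetical helpers, and BLAKE2b from the standard library stands in for MurmurHash3, which is not part of the standard library.

```python
import hashlib

MAX_LENGTH = 2**32 - 1  # largest counter length a 32-bit hash can address


def hash32(element: bytes, seed: int) -> int:
    """Return a seeded 32-bit hash (stand-in for MurmurHash3, an assumption)."""
    digest = hashlib.blake2b(
        element, digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def counter_index(element: bytes, seed: int, length: int) -> int:
    """Map an element to a position inside a counter array of `length` cells."""
    if not 0 < length <= MAX_LENGTH:
        raise ValueError("length must be between 1 and 2^32 - 1")
    # The hash never exceeds 2^32 - 1, so longer arrays would have
    # cells that can never be reached.
    return hash32(element, seed) % length


print(counter_index(b"hello", seed=0, length=2000))
```

Each counter array would use its own seed, so the same element lands in an independent position in every array.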
27+
28+
29+
.. code:: python
30+
31+
from pdsa.frequency.count_min_sketch import CountMinSketch
32+
33+
cms = CountMinSketch(5, 2000)
34+
cms.add("hello")
35+
cms.frequency("hello")
36+
37+
38+
39+
Build a sketch
40+
----------------
41+
42+
You can build a new sketch either by specifying its dimensions
43+
(number of counter arrays and their length), or from the expected
44+
overestimation deviation and standard error probability.
45+
46+
47+
Build sketch from its dimensions
48+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
49+
50+
.. code:: python
51+
52+
from pdsa.frequency.count_min_sketch import CountMinSketch
53+
54+
cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)
55+
56+
57+
Build sketch from the expected errors
58+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
59+
60+
In this case, the number of counter arrays and their length
61+
are calculated from the expected overestimation deviation
62+
and the requested error probability.
63+
64+
65+
.. code:: python
66+
67+
from pdsa.frequency.count_min_sketch import CountMinSketch
68+
69+
cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)
70+
71+
72+
.. note::
73+
74+
The `deviation` is the error ε in answering a particular query.
75+
For example, if we expect 10^7 elements and allow a fixed
76+
overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
77+
78+
The `error` is the standard error δ (0 < error < 1).
79+
80+
81+
.. note::
82+
83+
The Count–Min Sketch is approximate and probabilistic at the same
84+
time, therefore two parameters, the error ε in answering a particular
85+
query and the error probability δ, affect the space and time
86+
requirements. In fact, it provides the guarantee that the estimation
87+
error for frequencies will not exceed ε × n, where n is the total number
88+
of indexed elements, with probability at least 1 – δ.
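
How ε and δ translate into dimensions can be sketched with the formulas from the Count–Min Sketch paper [2]: length ``ceil(e / ε)`` and ``ceil(ln(1 / δ))`` counter arrays. The helper ``dimensions_from_error`` is hypothetical, and whether ``create_from_expected_error`` uses exactly these formulas is an assumption.

```python
import math


def dimensions_from_error(deviation: float, error: float):
    """Return (num_of_counters, length_of_counter) for the given ε and δ.

    Uses the textbook Count-Min Sketch formulas; the library may
    round or bound the values differently (assumption).
    """
    if not 0 < deviation < 1:
        raise ValueError("deviation must be in (0, 1)")
    if not 0 < error < 1:
        raise ValueError("error must be in (0, 1)")
    length_of_counter = math.ceil(math.e / deviation)
    num_of_counters = math.ceil(math.log(1 / error))
    return num_of_counters, length_of_counter


# 10^7 expected elements with an allowed overestimate of 10 gives ε = 10^-6.
print(dimensions_from_error(deviation=0.000001, error=0.01))  # (5, 2718282)
```

A smaller ε makes each counter array longer, while a smaller δ adds counter arrays, so the space cost grows as ``(1 / ε) * ln(1 / δ)``.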
89+
90+
91+
Index element into the sketch
92+
------------------------------
93+
94+
95+
.. code:: python
96+
97+
cms.add("hello")
98+
99+
100+
.. note::
101+
102+
It is possible to index any element into the counter (internally,
103+
it uses *repr()* of the Python object to calculate hash values for
104+
elements that are not integers, strings or bytes).
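
The *repr()* fallback described in this note could look roughly like the sketch below. ``to_hashable_bytes`` is a hypothetical helper, and the handling of integers in particular is an assumption, not the library's actual conversion.

```python
def to_hashable_bytes(element) -> bytes:
    """Turn an arbitrary Python object into bytes that can be hashed."""
    if isinstance(element, bytes):
        return element
    if isinstance(element, str):
        return element.encode("utf-8")
    if isinstance(element, int):
        # Assumption: integers are hashed via their decimal representation.
        return str(element).encode("utf-8")
    # Fallback for everything else: the repr() string of the object.
    return repr(element).encode("utf-8")


print(to_hashable_bytes((1, "a")))  # b"(1, 'a')"
```

One consequence of a repr-based fallback is that objects relying on the default ``repr()``, which includes the memory address, are counted as different elements even when they compare equal.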
105+
106+
107+
Estimate frequency of the element
108+
---------------------------------------
109+
110+
.. code:: python
111+
112+
print(cms.frequency("hello"))
113+
114+
115+
.. warning::
116+
117+
It is only an approximation of the exact frequency.
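
Why the result is only an approximation can be seen in a minimal pure-Python Count–Min Sketch. This is an illustrative sketch, not the library's implementation: ``TinyCountMinSketch`` is a hypothetical class, and BLAKE2b stands in for MurmurHash3.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative Count-Min Sketch; not the pdsa implementation."""

    def __init__(self, num_of_counters: int, length_of_counter: int):
        self.num_of_counters = num_of_counters
        self.length_of_counter = length_of_counter
        self.counters = [
            [0] * length_of_counter for _ in range(num_of_counters)
        ]

    def _index(self, element: str, row: int) -> int:
        # Seeded 32-bit hash; the row number acts as the seed.
        digest = hashlib.blake2b(
            element.encode("utf-8"),
            digest_size=4,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.length_of_counter

    def add(self, element: str) -> None:
        # Increment one cell in every counter array.
        for row in range(self.num_of_counters):
            self.counters[row][self._index(element, row)] += 1

    def frequency(self, element: str) -> int:
        # Colliding elements can only inflate a cell, so the minimum
        # over all arrays is the least contaminated estimate.
        return min(
            self.counters[row][self._index(element, row)]
            for row in range(self.num_of_counters)
        )


cms = TinyCountMinSketch(5, 2000)
for _ in range(3):
    cms.add("hello")
print(cms.frequency("hello"))
```

Because every cell only ever grows, the estimate in this sketch is never below the true count. It exceeds the true count only when the element collides with other elements in every counter array.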
118+
119+
120+
121+
Size of the sketch in bytes
122+
----------------------------
123+
124+
.. code:: python
125+
126+
print(cms.sizeof())
127+
128+
129+
Length of the sketch
130+
---------------------
131+
132+
.. code:: python
133+
134+
print(len(cms))

docs/frequency/count_sketch.rst

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
1+
Count Sketch
2+
================
3+
4+
Count Sketch is a simple space-efficient probabilistic data structure
5+
that is used to estimate frequencies of elements in data streams and can
6+
address the Heavy hitters problem. It was proposed by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002.
7+
8+
References
9+
----------
10+
[1] Charikar, M., Chen, K., Farach-Colton, M.
11+
Finding Frequent Items in Data Streams
12+
Proceedings of the 29th International Colloquium on Automata, Languages and
13+
Programming, pp. 693–703, Springer, Heidelberg.
14+
https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf
15+
16+
17+
This implementation uses the MurmurHash3 family of hash functions,
18+
which yield 32-bit hash values. Thus, the length of each counter array
19+
is expected to be less than or equal to (2^{32} - 1), since
20+
we cannot access elements with indexes above this value.
21+
22+
23+
.. code:: python
24+
25+
from pdsa.frequency.count_sketch import CountSketch
26+
27+
cs = CountSketch(5, 2000)
28+
cs.add("hello")
29+
cs.frequency("hello")
30+
31+
32+
33+
Build a sketch
34+
----------------
35+
36+
You can build a new sketch either by specifying its dimensions
37+
(number of counter arrays and their length), or from the expected
38+
overestimation deviation and standard error probability.
39+
40+
41+
Build sketch from its dimensions
42+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
43+
44+
.. code:: python
45+
46+
from pdsa.frequency.count_sketch import CountSketch
47+
48+
cs = CountSketch(num_of_counters=5, length_of_counter=2000)
49+
50+
51+
Build sketch from the expected errors
52+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
53+
54+
In this case, the number of counter arrays and their length
55+
are calculated from the expected overestimation deviation
56+
and the requested error probability.
57+
58+
59+
.. code:: python
60+
61+
from pdsa.frequency.count_sketch import CountSketch
62+
63+
cs = CountSketch.create_from_expected_error(deviation=0.000001, error=0.01)
64+
65+
66+
.. note::
67+
68+
The `deviation` is the error ε in answering a particular query.
69+
For example, if we expect 10^7 elements and allow a fixed
70+
overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
71+
72+
The `error` is the standard error δ (0 < error < 1).
73+
74+
75+
.. note::
76+
77+
The Count Sketch is approximate and probabilistic at the same
78+
time, therefore two parameters, the error ε in answering a particular
79+
query and the error probability δ, affect the space and time
80+
requirements. In fact, it provides the guarantee that the estimation
81+
error for frequencies will not exceed ε × n, where n is the total number
82+
of indexed elements, with probability at least 1 – δ.
83+
84+
85+
Index element into the sketch
86+
------------------------------
87+
88+
89+
.. code:: python
90+
91+
cs.add("hello")
92+
93+
94+
.. note::
95+
96+
It is possible to index any element into the counter (internally,
97+
it uses *repr()* of the Python object to calculate hash values for
98+
elements that are not integers, strings or bytes).
99+
100+
101+
Estimate frequency of the element
102+
---------------------------------------
103+
104+
.. code:: python
105+
106+
print(cs.frequency("hello"))
107+
108+
109+
.. warning::
110+
111+
It is only an approximation of the exact frequency.
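
How a Count Sketch arrives at its approximate answer can be shown with a minimal pure-Python version of the algorithm from [1]: each counter array adds a hashed sign of +1 or -1, and the query takes the median of the signed cells. ``TinyCountSketch`` is a hypothetical class for illustration only, with BLAKE2b standing in for MurmurHash3.

```python
import hashlib
import statistics


class TinyCountSketch:
    """Illustrative Count Sketch; not the pdsa implementation."""

    def __init__(self, num_of_counters: int, length_of_counter: int):
        self.num_of_counters = num_of_counters
        self.length_of_counter = length_of_counter
        self.counters = [
            [0] * length_of_counter for _ in range(num_of_counters)
        ]

    def _position(self, element: str, row: int):
        # One seeded hash yields both the cell index and the sign.
        digest = hashlib.blake2b(
            element.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1
        index = (value >> 1) % self.length_of_counter
        return index, sign

    def add(self, element: str) -> None:
        for row in range(self.num_of_counters):
            index, sign = self._position(element, row)
            self.counters[row][index] += sign

    def frequency(self, element: str):
        estimates = []
        for row in range(self.num_of_counters):
            index, sign = self._position(element, row)
            # Multiplying by the sign again recovers the element's
            # own contribution; collisions add noise of either sign.
            estimates.append(sign * self.counters[row][index])
        return statistics.median(estimates)


cs = TinyCountSketch(5, 2000)
for _ in range(3):
    cs.add("hello")
print(cs.frequency("hello"))  # 3
```

Since colliding elements contribute with random signs, the estimate in this sketch can be either above or below the true count, unlike a Count–Min Sketch, which can only overestimate.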
112+
113+
114+
115+
Size of the sketch in bytes
116+
----------------------------
117+
118+
.. code:: python
119+
120+
print(cs.sizeof())
121+
122+
123+
Length of the sketch
124+
---------------------
125+
126+
.. code:: python
127+
128+
print(len(cs))

docs/frequency/index.rst

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
Frequency
2+
============
3+
4+
Many important problems in streaming applications that operate on large
5+
data streams are related to estimating the frequencies of elements,
6+
including determining the most frequent elements or detecting the trending
7+
ones over some period of time.
8+
9+
10+
11+
.. toctree::
12+
:maxdepth: 2
13+
14+
count_sketch
15+
count_min_sketch

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -39,5 +39,6 @@ GitHub repository: `<https://github.com/gakhov/pdsa>`_
3939

4040
quickstart
4141
cardinality/index
42+
frequency/index
4243
membership/index
4344
rank/index
