Provide documentation for recently added Count/CM Sketches

gakhov · gakhov · commit 5fdcf55ec9e5 · 2019-08-27T16:01:54.000+02:00
import: syntax fix
diff --git a/README.rst b/README.rst
@@ -76,6 +76,11 @@ The latest documentation can be found at `<http://pdsa.readthedocs.io/en/latest/
 - `Probabilistic counter (Flajolet–Martin algorithm) <http://pdsa.readthedocs.io/en/latest/cardinality/probabilistic_counter.html>`_
 - `HyperLogLog <http://pdsa.readthedocs.io/en/latest/cardinality/hyperloglog.html>`_
 
+**Frequency problem**
+
+- `Count Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_sketch.html>`_
+- `Count-Min Sketch <http://pdsa.readthedocs.io/en/latest/frequency/count_min_sketch.html>`_
+
 **Rank problem**
 
 - `q-digest <http://pdsa.readthedocs.io/en/latest/rank/qdigest.html>`_
diff --git a/docs/frequency/count_min_sketch.rst b/docs/frequency/count_min_sketch.rst
@@ -0,0 +1,134 @@
+Count-Min Sketch
+================
+
+Count–Min Sketch is a simple space-efficient probabilistic data structure
+that is used to estimate frequencies of elements in data streams and can
+address the Heavy hitters problem. It was presented in 2003 [1] by
+Graham Cormode and Shan Muthukrishnan and published in 2005 [2].
+
+References
+----------
+[1] Cormode, G., Muthukrishnan, S.
+    What's hot and what's not: Tracking most frequent items dynamically
+    Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART symposium on Principles
+    of database systems, San Diego, California - June 09-11, 2003,
+    pp. 296–306, ACM New York, NY.
+    http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CormodeM-hot.pdf
+[2] Cormode, G., Muthukrishnan, S.
+    An Improved Data Stream Summary: The Count–Min Sketch and its Applications
+    Journal of Algorithms, Vol. 55 (1), pp. 58–75.
+    http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
+
+
+This implementation uses the MurmurHash3 family of hash functions,
+which yield 32-bit hash values. Thus, the length of each counter array
+is expected to be less than or equal to (2^{32} - 1), since
+we cannot access elements with indexes above this value.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch(5, 2000)
+    cms.add("hello")
+    cms.frequency("hello")
+
+
+
+Build a sketch
+----------------
+
+You can build a new sketch either by specifying its dimensions
+(number of counter arrays and their length), or from the expected
+overestimation deviation and error probability.
+
+
+Build sketch from its dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch(num_of_counters=5, length_of_counter=2000)
+
+
+Build sketch from the expected errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case the number of counter arrays and their length
+will be calculated from the expected overestimation deviation
+and the requested error probability.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_min_sketch import CountMinSketch
+
+    cms = CountMinSketch.create_from_expected_error(deviation=0.000001, error=0.01)
+
+
+.. note::
+
+    The `deviation` is the error ε in answering a particular query.
+    For example, if we expect 10^7 elements and allow a fixed
+    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
+
+    The `error` is the error probability δ (0 < error < 1).
+
+
+.. note::
+
+    The Count–Min Sketch is approximate and probabilistic at the same
+    time, therefore two parameters, the error ε in answering a particular
+    query and the error probability δ, affect the space and time
+    requirements. In fact, it guarantees that the estimation
+    error for frequencies will not exceed ε × n, where n is the total
+    number of indexed elements, with probability at least 1 – δ.
+
+
+Index element into the sketch
+------------------------------
+
+
+.. code:: python
+
+    cms.add("hello")
+
+
+.. note::
+
+   Any element can be indexed into the sketch. Internally,
+   it uses *repr()* of the Python object to calculate hash values for
+   elements that are not integers, strings or bytes.
+
+
+Estimate frequency of the element
+---------------------------------------
+
+.. code:: python
+
+    print(cms.frequency("hello"))
+
+
+.. warning::
+
+   It is only an approximation of the exact frequency.
+
+
+
+Size of the sketch in bytes
+----------------------------
+
+.. code:: python
+
+    print(cms.sizeof())
+
+
+Length of the sketch
+---------------------
+
+.. code:: python
+
+    print(len(cms))
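
A minimal sketch of the structure that the Count-Min Sketch page above documents, showing why the estimate never falls below the true count when elements are only added. It is not the pdsa implementation: the class name `ToyCountMinSketch` is hypothetical, a salted BLAKE2b digest stands in for MurmurHash3, and the sizing formulas are the standard ones from the Count-Min Sketch paper [2], which pdsa may or may not use internally.

```python
import hashlib
import math


class ToyCountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical, not pdsa's code)."""

    def __init__(self, num_of_counters, length_of_counter):
        self.depth = num_of_counters      # number of counter arrays
        self.width = length_of_counter    # length of each counter array
        self.rows = [[0] * self.width for _ in range(self.depth)]

    @classmethod
    def from_expected_error(cls, deviation, error):
        # Assumption: textbook sizing, width = ceil(e / epsilon) and
        # depth = ceil(ln(1 / delta)).
        width = math.ceil(math.e / deviation)
        depth = math.ceil(math.log(1.0 / error))
        return cls(depth, width)

    def _index(self, row, element):
        # Assumption: salted BLAKE2b replaces MurmurHash3; the row number
        # is the salt, so every row behaves like a different hash function.
        data = repr(element).encode("utf-8")
        digest = hashlib.blake2b(
            data, digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, element):
        # Increment one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(row, element)] += 1

    def frequency(self, element):
        # Collisions can only inflate a counter, so the minimum over the
        # rows is the estimate closest to the true count.
        return min(
            self.rows[row][self._index(row, element)]
            for row in range(self.depth)
        )


toy = ToyCountMinSketch.from_expected_error(deviation=0.01, error=0.01)
for word in ["hello", "hello", "hello", "world"]:
    toy.add(word)
print(toy.depth, toy.width, toy.frequency("hello"))
```

Taking the minimum across rows is the design choice that gives the sketch its one-sided error: an element's own increments always reach its counters, and other elements can only add to them.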
diff --git a/docs/frequency/count_sketch.rst b/docs/frequency/count_sketch.rst
@@ -0,0 +1,128 @@
+Count Sketch
+================
+
+Count Sketch is a simple space-efficient probabilistic data structure
+that is used to estimate frequencies of elements in data streams and can
+address the Heavy hitters problem. It was proposed by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002.
+
+References
+----------
+[1] Charikar, M., Chen, K., Farach-Colton, M.
+    Finding Frequent Items in Data Streams
+    Proceedings of the 29th International Colloquium on Automata, Languages and
+    Programming, pp. 693–703, Springer, Heidelberg.
+    https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf
+
+
+This implementation uses the MurmurHash3 family of hash functions,
+which yield 32-bit hash values. Thus, the length of each counter array
+is expected to be less than or equal to (2^{32} - 1), since
+we cannot access elements with indexes above this value.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch(5, 2000)
+    cs.add("hello")
+    cs.frequency("hello")
+
+
+
+Build a sketch
+----------------
+
+You can build a new sketch either by specifying its dimensions
+(number of counter arrays and their length), or from the expected
+overestimation deviation and error probability.
+
+
+Build sketch from its dimensions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch(num_of_counters=5, length_of_counter=2000)
+
+
+Build sketch from the expected errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case the number of counter arrays and their length
+will be calculated from the expected overestimation deviation
+and the requested error probability.
+
+
+.. code:: python
+
+    from pdsa.frequency.count_sketch import CountSketch
+
+    cs = CountSketch.create_from_expected_error(deviation=0.000001, error=0.01)
+
+
+.. note::
+
+    The `deviation` is the error ε in answering a particular query.
+    For example, if we expect 10^7 elements and allow a fixed
+    overestimate of 10, the deviation is 10/10^7 = 10^{-6}.
+
+    The `error` is the error probability δ (0 < error < 1).
+
+
+.. note::
+
+    The Count Sketch is approximate and probabilistic at the same
+    time, therefore two parameters, the error ε in answering a particular
+    query and the error probability δ, affect the space and time
+    requirements. In fact, it guarantees that the estimation
+    error for frequencies will not exceed ε × n, where n is the total
+    number of indexed elements, with probability at least 1 – δ.
+
+
+Index element into the sketch
+------------------------------
+
+
+.. code:: python
+
+    cs.add("hello")
+
+
+.. note::
+
+   Any element can be indexed into the sketch. Internally,
+   it uses *repr()* of the Python object to calculate hash values for
+   elements that are not integers, strings or bytes.
+
+
+Estimate frequency of the element
+---------------------------------------
+
+.. code:: python
+
+    print(cs.frequency("hello"))
+
+
+.. warning::
+
+   It is only an approximation of the exact frequency.
+
+
+
+Size of the sketch in bytes
+----------------------------
+
+.. code:: python
+
+    print(cs.sizeof())
+
+
+Length of the sketch
+---------------------
+
+.. code:: python
+
+    print(len(cs))
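
A minimal sketch of how the Count Sketch documented above differs from Count-Min: each row also hashes the element to a sign of +1 or -1, and the estimate is the median of the signed counters rather than the minimum. It is not the pdsa implementation: the class name `ToyCountSketch` is hypothetical, and a salted BLAKE2b digest stands in for MurmurHash3, with the index and the sign taken from different bits of the same digest.

```python
import hashlib
import statistics


class ToyCountSketch:
    """Illustrative Count Sketch (hypothetical, not pdsa's code)."""

    def __init__(self, num_of_counters, length_of_counter):
        self.depth = num_of_counters      # number of counter arrays
        self.width = length_of_counter    # length of each counter array
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _hash(self, row, element):
        # Assumption: salted BLAKE2b replaces MurmurHash3. The lowest bit
        # of the digest picks the sign, the remaining bits pick the index.
        data = repr(element).encode("utf-8")
        digest = hashlib.blake2b(
            data, digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1
        index = (value >> 1) % self.width
        return index, sign

    def add(self, element):
        # Add the element's sign to one counter per row.
        for row in range(self.depth):
            index, sign = self._hash(row, element)
            self.rows[row][index] += sign

    def frequency(self, element):
        # Multiplying by the sign recovers the element's own contribution;
        # colliding elements add noise of either sign, so the median over
        # the rows is used instead of the minimum.
        estimates = []
        for row in range(self.depth):
            index, sign = self._hash(row, element)
            estimates.append(sign * self.rows[row][index])
        return statistics.median(estimates)


toy = ToyCountSketch(num_of_counters=5, length_of_counter=2000)
for _ in range(3):
    toy.add("hello")
print(toy.frequency("hello"))
```

Because collisions can push a counter in either direction, the estimate may be above or below the true count, unlike the one-sided error of Count-Min.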
diff --git a/docs/frequency/index.rst b/docs/frequency/index.rst
@@ -0,0 +1,15 @@
+Frequency
+============
+
+Many important problems in streaming applications that operate on large
+data streams are related to the estimation of the frequencies of elements,
+including determining the most frequent element or detecting the trending
+ones over some period of time.
+
+
+
+.. toctree::
+   :maxdepth: 2
+
+   count_sketch
+   count_min_sketch
diff --git a/docs/index.rst b/docs/index.rst
@@ -39,5 +39,6 @@ GitHub repository: `<https://github.com/gakhov/pdsa>`_
 
    quickstart
    cardinality/index
+   frequency/index
    membership/index
    rank/index