categories:
- oss
- kubernetes
- clients
description: Learn how to use approximate calculations with Redis.
linkTitle: Probabilistic data types
title: Probabilistic data types
weight: 45
---

Redis supports several
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
that let you calculate values approximately rather than exactly.
The types fall into two basic categories:

- [Set operations](#set-operations): These types let you calculate (approximately)
  the number of items in a set of distinct values, and whether or not a given value is
  a member of a set.
- [Numeric data calculations](#numeric-data): These types give you an approximation of
  statistics such as the percentile, rank, and frequency of numeric data points in a list.

To see why these approximate calculations would be useful, consider the task of
counting the number of distinct IP addresses that access a website in one day.

Assuming that you already have code that supplies you with each IP
address as a string, you could record the addresses in Redis using
…

The count of distinct IP addresses would probably be rounded to the
nearest thousand or more when the usage statistics are delivered, so
getting it exactly right is not important. It would be useful
if you could trade off some precision in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
set using the
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
data type, as described below.

In general, the probabilistic data types let you perform approximations with a
bounded degree of error that have much lower memory consumption or execution time
than the equivalent precise calculations.

## Set operations

Redis supports the following approximate set operations:

- [Membership](#set-membership): The Bloom filter and Cuckoo filter data types
  let you track whether or not a given item is a member of a set.
- [Cardinality](#set-cardinality): The HyperLogLog data type gives you an approximate
  value for the number of items in a set, also known as the *cardinality* of
  the set.

The sections below describe these operations in more detail.

### Set membership

[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
objects provide a set membership operation that lets you track whether or not a
particular item has been added to a set. These two types provide different
trade-offs for memory usage and speed, so you can select the best one for your
use case. Note that for both types, there is an asymmetry between presence and
absence of items in the set. If an item is reported as absent, then it is definitely
absent, but if it is reported as present, then there is a small chance it may really be
absent.

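To see where this asymmetry comes from, here is a minimal pure-Python sketch of
the idea behind a Bloom filter (an illustration only, not Redis's actual
implementation): each added item sets a few hash-derived bits in a shared bit
array, so a lookup can only be wrong when other items happen to have set all of
the queried bits.

```py
import hashlib

class ToyBloomFilter:
    """Illustrative Bloom filter: each added item sets `hashes` bits in a
    fixed-size bit array. A query reports 'present' only if all of its
    bits are set, so absence is always reported correctly."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive `hashes` bit positions from independently salted hashes.
        for salt in range(self.hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

toy = ToyBloomFilter()
toy.add("cameron")
print(toy.might_contain("cameron"))  # True: added items are always found
print(toy.might_contain("kaitlyn"))  # usually False, but a collision could make it True
```

Because only bits are stored, the filter can never prove an item was added, but
an unset bit does prove it was not.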

Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
a Bloom filter records the presence or absence of the
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
This gives a very compact representation of the
set's membership with a fixed memory size, regardless of how many items you
add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `bf()` method to access the Bloom filter commands.

```py
res1 = r.bf().madd("recorded_users", "andy", "cameron", "david", "michelle")
print(res1)  # >>> [1, 1, 1, 1]

res2 = r.bf().exists("recorded_users", "cameron")
print(res2)  # >>> 1

res3 = r.bf().exists("recorded_users", "kaitlyn")
print(res3)  # >>> 0
```

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `cf()` method to access the Cuckoo filter
commands.

```py
res4 = r.cf().add("other_users", "paolo")
print(res4)  # >>> 1

res5 = r.cf().add("other_users", "kaitlyn")
print(res5)  # >>> 1

res6 = r.cf().add("other_users", "rachel")
print(res6)  # >>> 1

res7 = r.cf().mexists("other_users", "paolo", "rachel", "andy")
print(res7)  # >>> [1, 1, 0]

res8 = r.cf().delete("other_users", "paolo")
print(res8)

res9 = r.cf().exists("other_users", "paolo")
print(res9)  # >>> 0
```

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
and also have better memory usage. Cuckoo filters are generally faster
at checking membership and also support the delete operation. See the
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
reference pages for more information and a comparison of the two types.

### Set cardinality

A HyperLogLog object doesn't support the set membership operation but
instead is specialized to calculate the cardinality of the set. You can
also merge two or more HyperLogLogs to find the cardinality of the
union of the sets they represent.

```py
res10 = r.pfadd("group:1", "andy", "cameron", "david")
print(res10)  # >>> 1

res11 = r.pfcount("group:1")
print(res11)  # >>> 3

res12 = r.pfadd("group:2", "kaitlyn", "michelle", "paolo", "rachel")
print(res12)  # >>> 1

res13 = r.pfcount("group:2")
print(res13)  # >>> 4

res14 = r.pfmerge("both_groups", "group:1", "group:2")
print(res14)  # >>> True

res15 = r.pfcount("both_groups")
print(res15)  # >>> 7
```

The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
1% standard error using at most 12KB of memory.

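The intuition behind this low memory usage can be sketched in a few lines of
Python. A HyperLogLog hashes each item and remembers only the longest run of
leading zero bits it has seen; a run of z zeros appears roughly once per 2^z
distinct items. This toy version keeps a single register, whereas the real
algorithm combines many registers with averaging and bias correction to reach
its accuracy guarantee.

```py
import hashlib

def leading_zeros(value, bits=32):
    """Count the leading zero bits in a `bits`-wide integer."""
    for i in range(bits):
        if value & (1 << (bits - 1 - i)):
            return i
    return bits

# Remember only the longest run of leading zeros over all item hashes.
max_zeros = 0
for item in ["andy", "cameron", "david", "michelle"]:
    h = int(hashlib.sha256(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
    max_zeros = max(max_zeros, leading_zeros(h))

# A run of z leading zeros occurs about once per 2^z distinct items,
# so 2^(max_zeros + 1) is a crude estimate of the distinct count.
estimate = 2 ** (max_zeros + 1)
print(estimate)
```

However many items you add, the only state kept is one small integer per
register, which is why the memory use stays bounded.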
## Numeric data

Redis supports several approximate statistical calculations
on numeric data sets:

- Frequency: The Count-min sketch data type lets you find the
  approximate frequency of a labeled item in a data stream.
- Percentiles: The t-digest data type estimates the percentile
  of a supplied value in a data stream.
- Ranking: The Top-K data type estimates the ranking of items
  by frequency in a data stream.
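
To give a flavor of the frequency estimation that Count-min sketch performs,
here is a minimal pure-Python sketch of the underlying idea (an illustration
only; in Redis you would use the `CMS.*` commands): each row of a small counter
table increments one hash-chosen counter per item, and a query takes the
minimum across rows, so an estimate can over-count after collisions but never
under-count.

```py
import hashlib

class ToyCountMinSketch:
    """Illustrative Count-min sketch: a depth x width table of counters.
    Each item increments one counter per row; the estimate is the minimum
    counter across rows, so collisions can only inflate it."""

    def __init__(self, width=16, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row picks that row's counter column.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = ToyCountMinSketch()
for word in ["redis", "redis", "bloom", "redis"]:
    cms.add(word)
print(cms.estimate("redis"))  # at least 3: a Count-min sketch never under-counts
```

The memory cost is fixed by the table dimensions, not by the number of distinct
items counted, which is the same trade-off the Redis data type makes.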