Commit f64d236

Merge pull request #1629 from redis/DOC-5225-python-prob-examples
DOC-5225 Python probabilistic data type examples

---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- oss
- kubernetes
- clients
description: Learn how to use approximate calculations with Redis.
linkTitle: Probabilistic data types
title: Probabilistic data types
weight: 45
---

Redis supports several
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
that let you calculate values approximately rather than exactly.
The types fall into two basic categories:

- [Set operations](#set-operations): These types let you calculate (approximately)
the number of distinct items in a set, and whether or not a given value is
a member of a set.
- [Statistics](#statistics): These types give you an approximation of
statistics such as the quantiles, ranks, and frequencies of numeric data points in
a list.

To see why these approximate calculations would be useful, consider the task of
counting the number of distinct IP addresses that access a website in one day.

Assuming that you already have code that supplies you with each IP
address as a string, you could record the addresses in Redis using
a [set]({{< relref "/develop/data-types/sets" >}}):

```py
r.sadd("ip_tracker", new_ip_address)
```

A set stores each address only once, so if the same address
appears again during the day, adding it again will not change
the set. At the end of the day, you could get the exact number of
distinct addresses using the `scard()` function:

```py
num_distinct_ips = r.scard("ip_tracker")
```

This approach is simple, effective, and precise, but if your website
is very busy, the `ip_tracker` set could become very large and consume
a lot of memory.

You would probably round the count of distinct IP addresses to the
nearest thousand or more when you report the usage statistics, so
getting it exactly right is not important. It would be useful
if you could trade off some accuracy in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
set using the [HyperLogLog](#set-cardinality) data type, as described below.

In general, the probabilistic data types let you perform approximations with a
bounded degree of error that have much lower memory consumption or execution
time than the equivalent precise calculations.

## Set operations

Redis supports the following approximate set operations:

- [Membership](#set-membership): The
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
data types let you track whether or not a given item is a member of a set.
- [Cardinality](#set-cardinality): The
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
data type gives you an approximate value for the number of items in a set, also
known as the *cardinality* of the set.

The sections below describe these operations in more detail.

### Set membership

[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
objects provide a set membership operation that lets you track whether or not a
particular item has been added to a set. The two types offer different
trade-offs between memory usage and speed, so you can select the best one for your
use case. Note that for both types, there is an asymmetry between the presence and
absence of items in the set. If an item is reported as absent, then it is definitely
absent, but if it is reported as present, then there is a small chance it may really be
absent.

Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
a Bloom filter records the presence or absence of the
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
This gives a very compact representation of the
set's membership with a fixed memory size, regardless of how many items you
add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `bf()` method to access the Bloom filter commands.

```py
res1 = r.bf().madd("recorded_users", "andy", "cameron", "david", "michelle")
print(res1) # >>> [1, 1, 1, 1]

res2 = r.bf().exists("recorded_users", "cameron")
print(res2) # >>> 1

res3 = r.bf().exists("recorded_users", "kaitlyn")
print(res3) # >>> 0
```

<!-- < clients-example home_prob_dts bloom Python >}}
< /clients-example >}} -->

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `cf()` method to access the Cuckoo filter
commands.

```py
res4 = r.cf().add("other_users", "paolo")
print(res4) # >>> 1

res5 = r.cf().add("other_users", "kaitlyn")
print(res5) # >>> 1

res6 = r.cf().add("other_users", "rachel")
print(res6) # >>> 1

res7 = r.cf().mexists("other_users", "paolo", "rachel", "andy")
print(res7) # >>> [1, 1, 0]

res8 = r.cf().delete("other_users", "paolo")
print(res8) # >>> 1

res9 = r.cf().exists("other_users", "paolo")
print(res9) # >>> 0
```

<!-- < clients-example home_prob_dts cuckoo Python >}}
< /clients-example >}} -->

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
and also have better memory usage. Cuckoo filters are generally faster
at checking membership and also support the delete operation. See the
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
reference pages for more information and a comparison of the two types.
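
Both filter types are created automatically with default settings the first
time you add an item, but you can also reserve a filter explicitly with a
chosen error rate or expected capacity to control its memory usage. The
snippet below is a minimal sketch rather than part of the example above;
the key names and sizing values are illustrative assumptions:

```py
# Reserve a Bloom filter with a 1% false-positive rate and an
# expected capacity of 1000 items (BF.RESERVE).
r.bf().create("users_filter", 0.01, 1000)

# Reserve a Cuckoo filter with an expected capacity of 1000
# items (CF.RESERVE).
r.cf().create("users_cuckoo", 1000)

# The reserved filters are then used exactly as shown above.
r.bf().add("users_filter", "andy")
print(r.bf().exists("users_filter", "andy")) # >>> 1
```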

### Set cardinality

A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
object calculates the cardinality of a set. As you add
items, the HyperLogLog tracks the number of distinct set members but
doesn't let you retrieve them or query which items have been added.
You can also merge two or more HyperLogLogs to find the cardinality of the
[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
represent.

```py
res10 = r.pfadd("group:1", "andy", "cameron", "david")
print(res10) # >>> 1

res11 = r.pfcount("group:1")
print(res11) # >>> 3

res12 = r.pfadd("group:2", "kaitlyn", "michelle", "paolo", "rachel")
print(res12) # >>> 1

res13 = r.pfcount("group:2")
print(res13) # >>> 4

res14 = r.pfmerge("both_groups", "group:1", "group:2")
print(res14) # >>> True

res15 = r.pfcount("both_groups")
print(res15) # >>> 7
```

<!-- < clients-example home_prob_dts hyperloglog Python >}}
< /clients-example >}} -->

The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
1% standard error using a maximum of 12KB of memory. This makes
them very useful for counting things like the total number of distinct
IP addresses that access a website or the total number of distinct
bank card numbers that make purchases within a day.
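
For example, you could rewrite the IP address tracker from the introduction
to use a HyperLogLog instead of a set. This is a minimal sketch under the
same assumptions as before (your own code supplies `new_ip_address`, and the
key name is illustrative):

```py
# Record each address in a HyperLogLog rather than a set.
# Adding the same address again doesn't change the count.
r.pfadd("ip_hll", new_ip_address)

# At the end of the day, get the approximate number of
# distinct addresses, using only a few KB of memory.
approx_distinct_ips = r.pfcount("ip_hll")
```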

## Statistics

Redis supports several approximate statistical calculations
on numeric data sets:

- [Frequency](#frequency): The
[Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
data type lets you find the approximate frequency of a labeled item in a data stream.
- [Quantiles](#quantiles): The
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
data type estimates the quantile of a query value in a data stream.
- [Ranking](#ranking): The
[Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
estimates the ranking of labeled items by frequency in a data stream.

The sections below describe these operations in more detail.

### Frequency

A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
(CMS) object keeps count of a set of related items represented by
string labels. The count is approximate, but you can specify
how close you want to keep the count to the true value (as a fraction)
and the acceptable probability of failing to keep it in this
desired range. For example, you can request that the count should
stay within 1% of the true value and have a 0.5% probability
of going outside this limit. The example below shows how to create
a Count-min sketch object, add data to it, and then query it.
Note that you must use the `cms()` method to access the Count-min
sketch commands.

```py
# Specify that you want to keep the counts within 0.01
# (1%) of the true value with a 0.005 (0.5%) chance
# of going outside this limit.
res16 = r.cms().initbyprob("items_sold", 0.01, 0.005)
print(res16) # >>> True

# The parameters for `incrby()` are two lists. The count
# for each item in the first list is incremented by the
# value at the same index in the second list.
res17 = r.cms().incrby(
    "items_sold",
    ["bread", "tea", "coffee", "beer"], # Items sold
    [300, 200, 200, 100]
)
print(res17) # >>> [300, 200, 200, 100]

res18 = r.cms().incrby(
    "items_sold",
    ["bread", "coffee"],
    [100, 150]
)
print(res18) # >>> [400, 350]

res19 = r.cms().query("items_sold", "bread", "tea", "coffee", "beer")
print(res19) # >>> [400, 200, 350, 100]
```

<!-- < clients-example home_prob_dts cms Python >}}
< /clients-example >}} -->

The advantage of using a CMS over keeping an exact count with a
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
is that a CMS has very low and fixed memory usage, even for
large numbers of items. Use CMS objects to keep daily counts of
items sold, accesses to individual web pages on your site, and
other similar statistics.
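
For comparison, the exact equivalent of this example would keep one counter
per item in a sorted set, where memory usage grows with the number of
distinct items. The following is a minimal sketch, with an illustrative
key name, rather than part of the example above:

```py
# Exact counts with a sorted set: every distinct item is stored
# as a member, so memory usage grows with the number of items.
r.zincrby("items_sold_exact", 300, "bread")
r.zincrby("items_sold_exact", 200, "tea")
r.zincrby("items_sold_exact", 100, "bread")

print(r.zscore("items_sold_exact", "bread")) # >>> 400.0
```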

### Quantiles

A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
below which a certain fraction of samples lie. For example, given
a set of measurements of people's heights, the quantile of 0.75 is
the height below which 75% of the measured heights lie.
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
to quantiles, except that the fraction is expressed as a percentage.

A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
object can estimate quantiles from a set of values added to it
without having to store each value in the set explicitly. This can
save a lot of memory when you have a large number of samples.

The example below shows how to add data samples to a t-digest
object and obtain some basic statistics, such as the minimum and
maximum values, the quantile of 0.75, and the
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
(CDF), which is effectively the inverse of the quantile function. It also
shows how to merge two or more t-digest objects to query the combined
data set. Note that you must use the `tdigest()` method to access the
t-digest commands.

```py
res20 = r.tdigest().create("male_heights")
print(res20) # >>> True

res21 = r.tdigest().add(
    "male_heights",
    [175.5, 181, 160.8, 152, 177, 196, 164]
)
print(res21) # >>> OK

res22 = r.tdigest().min("male_heights")
print(res22) # >>> 152.0

res23 = r.tdigest().max("male_heights")
print(res23) # >>> 196.0

res24 = r.tdigest().quantile("male_heights", 0.75)
print(res24) # >>> [181]

# Note that the CDF value for 181 is not exactly
# 0.75. Both values are estimates.
res25 = r.tdigest().cdf("male_heights", 181)
print(res25) # >>> [0.7857142857142857]

res26 = r.tdigest().create("female_heights")
print(res26) # >>> True

res27 = r.tdigest().add(
    "female_heights",
    [155.5, 161, 168.5, 170, 157.5, 163, 171]
)
print(res27) # >>> OK

res28 = r.tdigest().quantile("female_heights", 0.75)
print(res28) # >>> [170]

res29 = r.tdigest().merge(
    "all_heights", 2, "male_heights", "female_heights"
)
print(res29) # >>> OK

res30 = r.tdigest().quantile("all_heights", 0.75)
print(res30) # >>> [175.5]
```

<!-- < clients-example home_prob_dts tdigest Python >}}
< /clients-example >}} -->

A t-digest object also supports several other related commands, such
as querying by rank. See the
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
reference for more information.
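
For example, the rank of a value is (approximately) the number of stored
values that are less than it. This is a minimal sketch using the
`male_heights` object from above; it assumes the `rank()` and `byrank()`
methods are available in your version of redis-py, and the outputs shown
are estimates:

```py
# Approximate rank of the value 180: five of the stored
# heights are below it.
print(r.tdigest().rank("male_heights", 180)) # >>> [5]

# The approximate value at a given rank; rank 0 is the
# smallest stored value.
print(r.tdigest().byrank("male_heights", 0)) # >>> [152.0]
```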

### Ranking

A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
object estimates the rankings of different labeled items in a data
stream according to frequency. For example, you could use this to
track the top ten most frequently accessed pages on a website, or the
top five most popular items sold.

The example below adds several different items to a Top-K object
that tracks the top three items (this is the second parameter to
the `topk().reserve()` method). It also shows how to list the
top *k* items and query whether or not a given item is in the
list. Note that you must use the `topk()` method to access the
Top-K commands.

```py
# The `reserve()` method creates the Top-K object with
# the given key. The parameters are the number of items
# in the ranking and values for `width`, `depth`, and
# `decay`, described in the Top-K reference page.
res31 = r.topk().reserve("top_3_songs", 3, 7, 8, 0.9)
print(res31) # >>> True

# The parameters for `incrby()` are two lists. The count
# for each item in the first list is incremented by the
# value at the same index in the second list.
res32 = r.topk().incrby(
    "top_3_songs",
    [
        "Starfish Trooper",
        "Only one more time",
        "Rock me, Handel",
        "How will anyone know?",
        "Average lover",
        "Road to everywhere"
    ],
    [
        3000,
        1850,
        1325,
        3890,
        4098,
        770
    ]
)
print(res32)
# >>> [None, None, None, 'Rock me, Handel', 'Only one more time', None]

res33 = r.topk().list("top_3_songs")
print(res33)
# >>> ['Average lover', 'How will anyone know?', 'Starfish Trooper']

res34 = r.topk().query(
    "top_3_songs", "Starfish Trooper", "Road to everywhere"
)
print(res34) # >>> [1, 0]
```

<!-- < clients-example home_prob_dts topk Python >}}
< /clients-example >}} -->
