categories:
- oss
- kubernetes
- clients
description: Learn how to use approximate calculations with Redis.
linkTitle: Probabilistic data types
title: Probabilistic data types
weight: 45
---

Redis supports several
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
that let you calculate values approximately rather than exactly.
The types fall into two basic categories:

- [Set operations](#set-operations): These types let you calculate (approximately)
  the number of items in a set of distinct values, and whether or not a given value is
  a member of a set.
- [Numeric data calculations](#numeric-data): These types give you an approximation of
  statistics such as the percentile, rank, and frequency of numeric data points in a list.

To see why these approximate calculations would be useful, consider the task of
counting the number of distinct IP addresses that access a website in one day.

Assuming that you already have code that supplies you with each IP
address as a string, you could record the addresses in Redis using
…

The count of distinct IP addresses would probably be rounded to the
nearest thousand or more when the usage statistics are delivered, so
getting it exactly right is not important. It would be useful
if you could trade off some precision in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
set using the
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
data type, as described below.

In general, the probabilistic data types let you perform approximations with a
bounded degree of error that have much lower memory consumption or execution time
than the equivalent precise calculations.

## Set operations

Redis supports the following approximate set operations:

- [Membership](#set-membership): The Bloom filter and Cuckoo filter data types
  let you track whether or not a given item is a member of a set.
- [Cardinality](#set-cardinality): The HyperLogLog data type gives you an approximate
  value for the number of items in a set, also known as the *cardinality* of
  the set.

The sections below describe these operations in more detail.

### Set membership

[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
objects provide a set membership operation that lets you track whether or not a
particular item has been added to a set. These two types provide different
trade-offs for memory usage and speed, so you can select the best one for your
use case. Note that for both types, there is an asymmetry between presence and
absence of items in the set. If an item is reported as absent, then it is definitely
absent, but if it is reported as present, then there is a small chance it may really be
absent.

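To see where this asymmetry comes from, here is a minimal pure-Python sketch of
the idea behind a Bloom filter (an illustration only, not Redis's actual
implementation): each added item sets a few hash-derived bits in a shared bit
array, so a lookup can only be wrong when other items happen to have set all of
the queried bits.

```py
import hashlib

class ToyBloomFilter:
    """Illustrative Bloom filter: each added item sets `hashes` bits in a
    fixed-size bit array. A query reports 'present' only if all of its
    bits are set, so absence is always reported correctly."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive `hashes` bit positions from independently salted hashes.
        for salt in range(self.hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

toy = ToyBloomFilter()
toy.add("cameron")
print(toy.might_contain("cameron"))  # True: added items are always found
print(toy.might_contain("kaitlyn"))  # usually False, but a collision could make it True
```

Because only bits are stored, the filter can never prove an item was added, but
an unset bit does prove it was not.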

Instead of storing strings directly, like a [set]({{< relref "/develop/data-types/sets" >}}),
a Bloom filter records the presence or absence of the
[hash value](https://en.wikipedia.org/wiki/Hash_function) of a string.
This gives a very compact representation of the
set's membership with a fixed memory size, regardless of how many items you
add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `bf()` method to access the Bloom filter commands.

```py
res1 = r.bf().madd("recorded_users", "andy", "cameron", "david", "michelle")
print(res1)  # >>> [1, 1, 1, 1]

res2 = r.bf().exists("recorded_users", "cameron")
print(res2)  # >>> 1

res3 = r.bf().exists("recorded_users", "kaitlyn")
print(res3)  # >>> 0
```

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `cf()` method to access the Cuckoo filter
commands.

```py
res4 = r.cf().add("other_users", "paolo")
print(res4)  # >>> 1

res5 = r.cf().add("other_users", "kaitlyn")
print(res5)  # >>> 1

res6 = r.cf().add("other_users", "rachel")
print(res6)  # >>> 1

res7 = r.cf().mexists("other_users", "paolo", "rachel", "andy")
print(res7)  # >>> [1, 1, 0]

res8 = r.cf().delete("other_users", "paolo")
print(res8)

res9 = r.cf().exists("other_users", "paolo")
print(res9)  # >>> 0
```

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
and also have better memory usage. Cuckoo filters are generally faster
at checking membership and also support the delete operation. See the
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
reference pages for more information and a comparison of the two types.

### Set cardinality

A HyperLogLog object doesn't support the set membership operation but
instead is specialized to calculate the cardinality of the set. You can
also merge two or more HyperLogLogs to find the cardinality of the
union of the sets they represent.

```py
res10 = r.pfadd("group:1", "andy", "cameron", "david")
print(res10)  # >>> 1

res11 = r.pfcount("group:1")
print(res11)  # >>> 3

res12 = r.pfadd("group:2", "kaitlyn", "michelle", "paolo", "rachel")
print(res12)  # >>> 1

res13 = r.pfcount("group:2")
print(res13)  # >>> 4

res14 = r.pfmerge("both_groups", "group:1", "group:2")
print(res14)  # >>> True

res15 = r.pfcount("both_groups")
print(res15)  # >>> 7
```

The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
1% standard error using at most 12KB of memory.

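The intuition behind this low memory usage can be sketched in a few lines of
Python. A HyperLogLog hashes each item and remembers only the longest run of
leading zero bits it has seen; a run of z zeros appears roughly once per 2^z
distinct items. This toy version keeps a single register, whereas the real
algorithm combines many registers with averaging and bias correction to reach
its accuracy guarantee.

```py
import hashlib

def leading_zeros(value, bits=32):
    """Count the leading zero bits in a `bits`-wide integer."""
    for i in range(bits):
        if value & (1 << (bits - 1 - i)):
            return i
    return bits

# Remember only the longest run of leading zeros over all item hashes.
max_zeros = 0
for item in ["andy", "cameron", "david", "michelle"]:
    h = int(hashlib.sha256(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
    max_zeros = max(max_zeros, leading_zeros(h))

# A run of z leading zeros occurs about once per 2^z distinct items,
# so 2^(max_zeros + 1) is a crude estimate of the distinct count.
estimate = 2 ** (max_zeros + 1)
print(estimate)
```

However many items you add, the only state kept is one small integer per
register, which is why the memory use stays bounded.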
## Numeric data

Redis supports several approximate statistical calculations
on numeric data sets:

- Frequency: The Count-min sketch data type lets you find the
  approximate frequency of a labeled item in a data stream.
- Percentiles: The t-digest data type estimates the percentile
  of a supplied value in a data stream.
- Ranking: The Top-K data type estimates the ranking of items
  by frequency in a data stream.
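
To give a flavor of the frequency estimation that Count-min sketch performs,
here is a minimal pure-Python sketch of the underlying idea (an illustration
only; in Redis you would use the `CMS.*` commands): each row of a small counter
table increments one hash-chosen counter per item, and a query takes the
minimum across rows, so an estimate can over-count after collisions but never
under-count.

```py
import hashlib

class ToyCountMinSketch:
    """Illustrative Count-min sketch: a depth x width table of counters.
    Each item increments one counter per row; the estimate is the minimum
    counter across rows, so collisions can only inflate it."""

    def __init__(self, width=16, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted hash per row picks that row's counter column.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = ToyCountMinSketch()
for word in ["redis", "redis", "bloom", "redis"]:
    cms.add(word)
print(cms.estimate("redis"))  # at least 3: a Count-min sketch never under-counts
```

The memory cost is fixed by the table dimensions, not by the number of distinct
items counted, which is the same trade-off the Redis data type makes.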