@@ -52,28 +52,29 @@ is very busy, the `ip_tracker` set could become very large and consume
a lot of memory.

You would probably round the count of distinct IP addresses to the
- nearest thousand or more when you deliver the usage statistics, so
+ nearest thousand or more to deliver the usage statistics, so
getting it exactly right is not important. It would be useful
- if you could trade off some precision in exchange for lower memory
+ if you could trade off some accuracy in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
- set using the
- [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
- data type, as described below.
+ set using the [HyperLogLog](#set-cardinality) data type, as described below.

In general, the probabilistic data types let you perform approximations with a
- bounded degree of error that have much lower memory or execution time than
- the equivalent precise calculations.
+ bounded degree of error that have much lower memory consumption or execution
+ time than the equivalent precise calculations.

## Set operations

Redis supports the following approximate set operations:

- - [Membership](#set-membership): The Bloom filter and Cuckoo filter data types
-   let you track whether or not a given item is a member of a set.
- - [Cardinality](#set-cardinality): The HyperLogLog data type gives you an approximate
-   value for the number of items in a set, also known as the *cardinality* of
-   the set.
+ - [Membership](#set-membership): The
+   [Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
+   [Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
+   data types let you track whether or not a given item is a member of a set.
+ - [Cardinality](#set-cardinality): The
+   [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
+   data type gives you an approximate value for the number of items in a set, also
+   known as the *cardinality* of the set.

The sections below describe these operations in more detail.

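As a plain-Python sketch of the membership idea, the toy `SimpleBloomFilter` class below (an invented illustration, not how the Redis Bloom filter is implemented) shows why lookups can return false positives but never false negatives:

```py
import hashlib

class SimpleBloomFilter:
    """Toy Bloom filter: each added item sets k bit positions.
    Lookups can give false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k deterministic bit positions from salted digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = SimpleBloomFilter()
for name in ["andy", "cameron", "david", "michelle"]:
    bf.add(name)

print(bf.might_contain("andy"))     # >>> True
print(bf.might_contain("kaitlyn"))  # almost certainly False for a filter this sparse
```

A real Bloom filter sizes `num_bits` and `num_hashes` from the expected item count and target false-positive rate; the Redis commands handle that for you.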
@@ -98,7 +99,6 @@ add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `bf()` method to access the Bloom filter commands.

- <!--
```py
res1 = r.bf().madd("recorded_users", "andy", "cameron", "david", "michelle")
print(res1) # >>> [1, 1, 1, 1]
@@ -109,16 +109,15 @@ print(res2) # >>> 1
res3 = r.bf().exists("recorded_users", "kaitlyn")
print(res3) # >>> 0
```
- -->
- {{< clients-example home_prob_dts bloom Python >}}
- {{< /clients-example >}}
+
+ <!-- {{< clients-example home_prob_dts bloom Python >}}
+ {{< /clients-example >}} -->

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `cf()` method to access the Cuckoo filter
commands.

- <!--
```py
res4 = r.cf().add("other_users", "paolo")
print(res4) # >>> 1
@@ -138,9 +137,9 @@ print(res8)
res9 = r.cf().exists("other_users", "paolo")
print(res9) # >>> 0
```
- -->
- {{< clients-example home_prob_dts cuckoo Python >}}
- {{< /clients-example >}}
+
+ <!-- {{< clients-example home_prob_dts cuckoo Python >}}
+ {{< /clients-example >}} -->

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
@@ -157,9 +156,9 @@ object calculates the cardinality of a set. As you add
items, the HyperLogLog tracks the number of distinct set members but
doesn't let you retrieve them or query which items have been added.
You can also merge two or more HyperLogLogs to find the cardinality of the
- union of the sets they represent.
+ [union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
+ represent.

- <!--
```py
res10 = r.pfadd("group:1", "andy", "cameron", "david")
print(res10) # >>> 1
@@ -179,9 +178,9 @@ print(res14) # >>> True
res15 = r.pfcount("both_groups")
print(res15) # >>> 7
```
- -->
- {{< clients-example home_prob_dts hyperloglog Python >}}
- {{< /clients-example >}}
+
+ <!-- {{< clients-example home_prob_dts hyperloglog Python >}}
+ {{< /clients-example >}} -->

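To see why approximate distinct counting needs so little memory, here is a toy K-minimum-values estimator in plain Python. The `estimate_distinct()` helper is an invented illustration from the same family of techniques as HyperLogLog, not the algorithm Redis uses:

```py
import hashlib

def estimate_distinct(items, k=64):
    """Toy K-minimum-values cardinality estimator: hash every item
    to a number in [0, 1) and keep only the k smallest hashes.
    If the k-th smallest hash is h, roughly k distinct items landed
    in an interval of width h, so the count is about (k - 1) / h."""
    hashes = set()
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big") / 2**64)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)  # Few enough items to count exactly.
    return round((k - 1) / smallest[-1])

# 10,000 distinct "IP addresses"; repeats don't change the estimate
# because duplicate hashes collapse into the set.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]
print(estimate_distinct(ips * 2))  # close to 10000, from only 64 stored hashes
```

The estimate stays within a bounded relative error of the true count while the sketch stores a fixed number of hashes, no matter how many items are seen.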
The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
@@ -205,6 +204,8 @@ on numeric data sets:
[Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
estimates the ranking of labeled items by frequency in a data stream.

+ The sections below describe these operations in more detail.
+
### Frequency

A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
@@ -214,10 +215,41 @@ how close you want to keep the count to the true value (as a fraction)
and the acceptable probability of failing to keep it in this
desired range. For example, you can request that the count should
stay within 0.1% of the true value and have a 0.05% probability
- of going outside this limit.
+ of going outside this limit. The example below shows how to create
+ a Count-min sketch object, add data to it, and then query it.
+ Note that you must use the `cms()` method to access the Count-min
+ sketch commands.
+
+ ```py
+ # Specify that you want to keep the counts within 0.01
+ # (1%) of the true value, with a 0.005 (0.5%) chance
+ # of going outside this limit.
+ res16 = r.cms().initbyprob("items_sold", 0.01, 0.005)
+ print(res16) # >>> True
+
+ # The parameters for `incrby()` are two lists. The count
+ # for each item in the first list is incremented by the
+ # value at the same index in the second list.
+ res17 = r.cms().incrby(
+     "items_sold",
+     ["bread", "tea", "coffee", "beer"], # Items sold
+     [300, 200, 200, 100]
+ )
+ print(res17) # >>> [300, 200, 200, 100]
+
+ res18 = r.cms().incrby(
+     "items_sold",
+     ["bread", "coffee"],
+     [100, 150]
+ )
+ print(res18) # >>> [400, 350]
+
+ res19 = r.cms().query("items_sold", "bread", "tea", "coffee", "beer")
+ print(res19) # >>> [400, 200, 350, 100]
+ ```

- {{< clients-example home_prob_dts cms Python >}}
- {{< /clients-example >}}
+ <!-- {{< clients-example home_prob_dts cms Python >}}
+ {{< /clients-example >}} -->

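The counting mechanism itself can be sketched in plain Python. The toy `ToyCountMinSketch` class below is an invented illustration, not how Redis implements the data type: each item increments one counter per row, chosen by a per-row hash, and a query takes the minimum across rows, so collisions can inflate a count but never deflate it:

```py
import hashlib

class ToyCountMinSketch:
    def __init__(self, width=16, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # Each row uses its own salted hash to pick a counter.
        for salt in range(self.depth):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def incrby(self, item, amount):
        for row, col in enumerate(self._columns(item)):
            self.rows[row][col] += amount

    def query(self, item):
        # Collisions only ever inflate counters, so the minimum
        # across rows is the tightest available estimate.
        return min(self.rows[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = ToyCountMinSketch()
cms.incrby("bread", 300)
cms.incrby("tea", 200)
cms.incrby("bread", 100)
print(cms.query("bread"))  # at least 400; higher only if "tea" collides in every row
```

Increasing `width` reduces collisions (tighter counts) and increasing `depth` reduces the chance that every row collides at once, which is the same error/probability trade-off that `initbyprob()` configures.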
222
254
The advantage of using a CMS over keeping an exact count with a
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
@@ -231,7 +263,7 @@ other similar statistics.
A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
below which a certain fraction of samples lie. For example, with
a set of measurements of people's heights, the quantile of 0.75 is
- the value of height below which 75% of people's heights lie.
+ the value of height below which 75% of all people's heights lie.
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
to quantiles, except that the fraction is expressed as a percentage.
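The quantile definition can be computed exactly in plain Python. The `quantile()` helper below and its nearest-rank convention are an invented illustration (one of several common conventions); a t-digest approximates this value without storing every sample:

```py
import math

def quantile(samples, q):
    """Exact empirical quantile using the nearest-rank method:
    the smallest value v such that at least a fraction q of the
    samples are less than or equal to v."""
    ordered = sorted(samples)
    # Smallest rank r with r/n >= q, converted to a 0-based index.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Heights in centimeters; 75% of the samples lie at or below
# the 0.75 quantile.
heights = [175.5, 181, 160.8, 152, 177, 196, 164]
print(quantile(heights, 0.75))  # >>> 181
```

With all seven samples kept, the answer is exact; a t-digest trades that exactness for a small, fixed-size summary.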
@@ -246,10 +278,56 @@ maximum values, the quantile of 0.75, and the
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
(CDF), which is effectively the inverse of the quantile function. It also
shows how to merge two or more t-digest objects to query the combined
- data set.
+ data set. Note that you must use the `tdigest()` method to access the
+ t-digest commands.
+
+ ```py
+ res20 = r.tdigest().create("male_heights")
+ print(res20) # >>> True
+
+ res21 = r.tdigest().add(
+     "male_heights",
+     [175.5, 181, 160.8, 152, 177, 196, 164]
+ )
+ print(res21) # >>> OK
+
+ res22 = r.tdigest().min("male_heights")
+ print(res22) # >>> 152.0

- {{< clients-example home_prob_dts tdigest Python >}}
- {{< /clients-example >}}
+ res23 = r.tdigest().max("male_heights")
+ print(res23) # >>> 196.0
+
+ res24 = r.tdigest().quantile("male_heights", 0.75)
+ print(res24) # >>> [181]
+
+ # Note that the CDF value for 181 is not exactly
+ # 0.75. Both values are estimates.
+ res25 = r.tdigest().cdf("male_heights", 181)
+ print(res25) # >>> [0.7857142857142857]
+
+ res26 = r.tdigest().create("female_heights")
+ print(res26) # >>> True
+
+ res27 = r.tdigest().add(
+     "female_heights",
+     [155.5, 161, 168.5, 170, 157.5, 163, 171]
+ )
+ print(res27) # >>> OK
+
+ res28 = r.tdigest().quantile("female_heights", 0.75)
+ print(res28) # >>> [170]
+
+ res29 = r.tdigest().merge(
+     "all_heights", 2, "male_heights", "female_heights"
+ )
+ print(res29) # >>> OK
+
+ res30 = r.tdigest().quantile("all_heights", 0.75)
+ print(res30) # >>> [175.5]
+ ```
+
+ <!-- {{< clients-example home_prob_dts tdigest Python >}}
+ {{< /clients-example >}} -->

A t-digest object also supports several other related commands, such
as querying by rank. See the
@@ -268,7 +346,51 @@ The example below adds several different items to a Top-K object
that tracks the top three items (this is the second parameter to
the `topk().reserve()` method). It also shows how to list the
top *k* items and query whether or not a given item is in the
- list.
+ list. Note that you must use the `topk()` method to access the
+ Top-K commands.
+
+ ```py
+ # The `reserve()` method creates the Top-K object with
+ # the given key. The parameters are the number of items
+ # in the ranking and values for `width`, `depth`, and
+ # `decay`, described in the Top-K reference page.
+ res31 = r.topk().reserve("top_3_songs", 3, 7, 8, 0.9)
+ print(res31) # >>> True
+
+ # The parameters for `incrby()` are two lists. The count
+ # for each item in the first list is incremented by the
+ # value at the same index in the second list.
+ res32 = r.topk().incrby(
+     "top_3_songs",
+     [
+         "Starfish Trooper",
+         "Only one more time",
+         "Rock me, Handel",
+         "How will anyone know?",
+         "Average lover",
+         "Road to everywhere"
+     ],
+     [
+         3000,
+         1850,
+         1325,
+         3890,
+         4098,
+         770
+     ]
+ )
+ print(res32)
+ # >>> [None, None, None, 'Rock me, Handel', 'Only one more time', None]
+
+ res33 = r.topk().list("top_3_songs")
+ print(res33)
+ # >>> ['Average lover', 'How will anyone know?', 'Starfish Trooper']
+
+ res34 = r.topk().query(
+     "top_3_songs", "Starfish Trooper", "Road to everywhere"
+ )
+ print(res34) # >>> [1, 0]
+ ```

- {{< clients-example home_prob_dts topk Python >}}
- {{< /clients-example >}}
+ <!-- {{< clients-example home_prob_dts topk Python >}}
+ {{< /clients-example >}} -->