Commit 9c6e4c7

DOC-5226 started C# probabilistic data type examples page
1 parent f64d236 commit 9c6e4c7

1 file changed: content/develop/clients/dotnet (+229 −0 lines)
---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- oss
- kubernetes
- clients
description: Learn how to use approximate calculations with Redis.
linkTitle: Probabilistic data types
title: Probabilistic data types
weight: 45
---
Redis supports several
[probabilistic data types]({{< relref "/develop/data-types/probabilistic" >}})
that let you calculate values approximately rather than exactly.
The types fall into two basic categories:

- [Set operations](#set-operations): These types let you calculate (approximately)
the number of items in a set of distinct values, and whether or not a given value is
a member of a set.
- [Statistics](#statistics): These types give you an approximation of
statistics such as the quantiles, ranks, and frequencies of numeric data points in
a list.
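
The snippets on this page assume you already have a connection to Redis open.
A minimal setup sketch using StackExchange.Redis with the NRedisStack package
might look like the following (the connection string and the `db` variable name
are illustrative assumptions, not part of the tested examples):

```csharp
using NRedisStack;
using NRedisStack.RedisStackCommands; // provides BF(), CF(), CMS(), TDIGEST(), TOPK()
using StackExchange.Redis;

// Connect to a local Redis server and get a database handle.
ConnectionMultiplexer muxer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = muxer.GetDatabase();
```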
To see why these approximate calculations would be useful, consider the task of
counting the number of distinct IP addresses that access a website in one day.

Assuming that you already have code that supplies you with each IP
address as a string, you could record the addresses in Redis using
a [set]({{< relref "/develop/data-types/sets" >}}):

```csharp
db.SetAdd("ip_tracker", newIpAddress);
```

The set can only contain each value once, so if the same address
appears again during the day, the new instance will not change
the set. At the end of the day, you could get the exact number of
distinct addresses using the `SetLength()` method:

```csharp
long numDistinctIps = db.SetLength("ip_tracker");
```

This approach is simple, effective, and precise, but if your website
is very busy, the `ip_tracker` set could become very large and consume
a lot of memory.

You would probably round the count of distinct IP addresses to the
nearest thousand or more to deliver the usage statistics, so
getting it exactly right is not important. It would be useful
if you could trade off some accuracy in exchange for lower memory
consumption. The probabilistic data types provide exactly this kind of
trade-off. Specifically, you can count the approximate number of items in a
set using the [HyperLogLog](#set-cardinality) data type, as described below.

In general, the probabilistic data types let you perform approximations with a
bounded degree of error that have much lower memory consumption or execution
time than the equivalent precise calculations.

## Set operations

Redis supports the following approximate set operations:

- [Membership](#set-membership): The
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
data types let you track whether or not a given item is a member of a set.
- [Cardinality](#set-cardinality): The
[HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
data type gives you an approximate value for the number of items in a set, also
known as the *cardinality* of the set.

The sections below describe these operations in more detail.

### Set membership

[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
objects provide a set membership operation that lets you track whether or not a
particular item has been added to a set. These two types provide different
trade-offs for memory usage and speed, so you can select the best one for your
use case. Note that for both types, there is an asymmetry between presence and
absence of items in the set. If an item is reported as absent, then it is definitely
absent, but if it is reported as present, then there is a small chance it may really be
absent.

Instead of storing strings directly, as a [set]({{< relref "/develop/data-types/sets" >}}) does,
a Bloom filter records the presence or absence of the
[hash value](https://en.wikipedia.org/wiki/Hash_function) of each string.
This gives a very compact representation of the
set's membership with a fixed memory size, regardless of how many items you
add. The following example adds some names to a Bloom filter representing
a list of users and checks for the presence or absence of users in the list.
Note that you must use the `BF()` method to access the Bloom filter commands.

{{< clients-example home_prob_dts bloom "C#" >}}
{{< /clients-example >}}
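
For reference, a minimal sketch of the same idea using NRedisStack's `BF()`
commands (continuing from the connection setup above; the key name and user
names are illustrative assumptions):

```csharp
var bf = db.BF();

// Add some users to the filter. Add() returns false if the item was
// (probably) already present.
bf.Add("user_filter", "alice");
bf.Add("user_filter", "bob");

// A true result means the item is probably present (with a small
// false-positive chance); a false result means it is definitely absent.
bool aliceSeen = bf.Exists("user_filter", "alice");   // true
bool carolSeen = bf.Exists("user_filter", "carol");   // false
Console.WriteLine($"alice: {aliceSeen}, carol: {carolSeen}");
```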

A Cuckoo filter has similar features to a Bloom filter, but also supports
a deletion operation to remove hashes from a set, as shown in the example
below. Note that you must use the `CF()` method to access the Cuckoo filter
commands.

{{< clients-example home_prob_dts cuckoo "C#" >}}
{{< /clients-example >}}
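
Again as a rough guide, a minimal Cuckoo filter sketch with NRedisStack's
`CF()` commands (the key and item names are illustrative assumptions):

```csharp
var cf = db.CF();

// Add items to the filter.
cf.Add("product_filter", "widget");
cf.Add("product_filter", "gadget");

Console.WriteLine(cf.Exists("product_filter", "widget"));   // True

// Unlike a Bloom filter, a Cuckoo filter lets you delete an item's hash.
cf.Del("product_filter", "widget");
Console.WriteLine(cf.Exists("product_filter", "widget"));   // False
```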

Which of these two data types you choose depends on your use case.
Bloom filters are generally faster than Cuckoo filters when adding new items,
and also have better memory usage. Cuckoo filters are generally faster
at checking membership and also support the delete operation. See the
[Bloom filter]({{< relref "/develop/data-types/probabilistic/bloom-filter" >}}) and
[Cuckoo filter]({{< relref "/develop/data-types/probabilistic/cuckoo-filter" >}})
reference pages for more information and a comparison of the two types.

### Set cardinality

A [HyperLogLog]({{< relref "/develop/data-types/probabilistic/hyperloglogs" >}})
object calculates the cardinality of a set. As you add
items, the HyperLogLog tracks the number of distinct set members but
doesn't let you retrieve them or query which items have been added.
You can also merge two or more HyperLogLogs to find the cardinality of the
[union](https://en.wikipedia.org/wiki/Union_(set_theory)) of the sets they
represent.

{{< clients-example home_prob_dts hyperloglog "C#" >}}
{{< /clients-example >}}
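
As a minimal sketch of the same approach: the HyperLogLog commands are part of
core Redis, so in C# they are available directly on the StackExchange.Redis
`IDatabase` object (the key names below are illustrative assumptions):

```csharp
// Record visitors on two different days.
db.HyperLogLogAdd("visitors:mon",
    new RedisValue[] { "10.0.0.1", "10.0.0.2", "10.0.0.3" });
db.HyperLogLogAdd("visitors:tue",
    new RedisValue[] { "10.0.0.2", "10.0.0.4" });

// Approximate number of distinct visitors on Monday.
Console.WriteLine(db.HyperLogLogLength("visitors:mon"));   // ~3

// Merge the two counters to estimate the cardinality of the union.
db.HyperLogLogMerge("visitors:both", "visitors:mon", "visitors:tue");
Console.WriteLine(db.HyperLogLogLength("visitors:both"));  // ~4
```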

The main benefit that HyperLogLogs offer is their very low
memory usage. They can count up to 2^64 items with less than
1% standard error using a maximum of 12KB of memory. This makes
them very useful for counting things like the total number of distinct
IP addresses that access a website or the total number of distinct
bank card numbers that make purchases within a day.

## Statistics

Redis supports several approximate statistical calculations
on numeric data sets:

- [Frequency](#frequency): The
[Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
data type lets you find the approximate frequency of a labeled item in a data stream.
- [Quantiles](#quantiles): The
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
data type estimates the quantile of a query value in a data stream.
- [Ranking](#ranking): The
[Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}}) data type
estimates the ranking of labeled items by frequency in a data stream.

The sections below describe these operations in more detail.

### Frequency

A [Count-min sketch]({{< relref "/develop/data-types/probabilistic/count-min-sketch" >}})
(CMS) object keeps count of a set of related items represented by
string labels. The count is approximate, but you can specify
how close you want to keep the count to the true value (as a fraction)
and the acceptable probability of failing to keep it in this
desired range. For example, you can request that the count should
stay within 0.1% of the true value and have a 0.05% probability
of going outside this limit. The example below shows how to create
a Count-min sketch object, add data to it, and then query it.
Note that you must use the `CMS()` method to access the Count-min
sketch commands.

{{< clients-example home_prob_dts cms "C#" >}}
{{< /clients-example >}}
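
As a rough guide, a minimal sketch with NRedisStack's `CMS()` commands (the
key, page labels, and error parameters are illustrative assumptions):

```csharp
var cms = db.CMS();

// Keep counts within 0.1% of the true value, with a 0.05% chance
// of exceeding that error.
cms.InitByProb("page_hits", 0.001, 0.0005);

// Record some page views.
cms.IncrBy("page_hits", "/home", 1);
cms.IncrBy("page_hits", "/home", 1);
cms.IncrBy("page_hits", "/pricing", 1);

// Query the approximate counts for each label.
long[] counts = cms.Query("page_hits", "/home", "/pricing");
Console.WriteLine($"/home: {counts[0]}, /pricing: {counts[1]}");  // ~2, ~1
```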

The advantage of using a CMS over keeping an exact count with a
[sorted set]({{< relref "/develop/data-types/sorted-sets" >}})
is that a CMS has very low and fixed memory usage, even for
large numbers of items. Use CMS objects to keep daily counts of
items sold, accesses to individual web pages on your site, and
other similar statistics.

### Quantiles

A [quantile](https://en.wikipedia.org/wiki/Quantile) is the value
below which a certain fraction of samples lie. For example, with
a set of measurements of people's heights, the quantile of 0.75 is
the height below which 75% of all the measured heights lie.
[Percentiles](https://en.wikipedia.org/wiki/Percentile) are equivalent
to quantiles, except that the fraction is expressed as a percentage.

A [t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
object can estimate quantiles from a set of values added to it
without having to store each value in the set explicitly. This can
save a lot of memory when you have a large number of samples.

The example below shows how to add data samples to a t-digest
object and obtain some basic statistics, such as the minimum and
maximum values, the quantile of 0.75, and the
[cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
(CDF), which is effectively the inverse of the quantile function. It also
shows how to merge two or more t-digest objects to query the combined
data set. Note that you must use the `TDIGEST()` method to access the
t-digest commands.

{{< clients-example home_prob_dts tdigest "C#" >}}
{{< /clients-example >}}
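
As a rough guide, a minimal sketch with NRedisStack's `TDIGEST()` commands
(the key name and sample heights are illustrative assumptions, merging is
omitted here, and method signatures may vary slightly between NRedisStack
versions):

```csharp
var td = db.TDIGEST();

// Create a t-digest and add some sample values (heights in cm).
td.Create("heights");
td.Add("heights", 158.0, 162.5, 171.0, 175.5, 180.0, 185.5);

// Minimum and maximum of the samples seen so far.
Console.WriteLine(td.Min("heights"));   // 158
Console.WriteLine(td.Max("heights"));   // 185.5

// The value below which 75% of the samples lie.
double[] q = td.Quantile("heights", 0.75);
Console.WriteLine(q[0]);

// The fraction of samples at or below a given value (the CDF).
double[] cdf = td.CDF("heights", 175.5);
Console.WriteLine(cdf[0]);
```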

A t-digest object also supports several other related commands, such
as querying by rank. See the
[t-digest]({{< relref "/develop/data-types/probabilistic/t-digest" >}})
reference for more information.

### Ranking

A [Top-K]({{< relref "/develop/data-types/probabilistic/top-k" >}})
object estimates the rankings of different labeled items in a data
stream according to frequency. For example, you could use this to
track the top ten most frequently accessed pages on a website, or the
top five most popular items sold.

The example below adds several different items to a Top-K object
that tracks the top three items (this is the second parameter to
the `TOPK().Reserve()` method). It also shows how to list the
top *k* items and query whether or not a given item is in the
list. Note that you must use the `TOPK()` method to access the
Top-K commands.

{{< clients-example home_prob_dts topk "C#" >}}
{{< /clients-example >}}
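
As a rough guide, a minimal sketch with NRedisStack's `TOPK()` commands (the
key and product names are illustrative assumptions; the exact return types of
`Add()` and `List()` may differ between NRedisStack versions):

```csharp
var tk = db.TOPK();

// Track the top three items (the second parameter to Reserve()).
tk.Reserve("top_products", 3);

// Record sales; repeated items increase their estimated frequency.
tk.Add("top_products", "widget", "gadget", "widget", "sprocket",
       "widget", "gadget", "doohickey");

// List the current top-k items.
foreach (var item in tk.List("top_products"))
{
    Console.WriteLine(item);
}

// Check whether specific items are currently among the top k.
bool[] inTop = tk.Query("top_products", "widget", "doohickey");
Console.WriteLine($"widget: {inTop[0]}, doohickey: {inTop[1]}");
```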

0 commit comments

Comments
 (0)