feat(cache): a variant of sieve, with lazy op #13904
Conversation
It is no longer LRU and theoretically has better properties. For the sake of evaluation, I did not change the name. You can refer to the unit tests for a quick understanding. The impact on Databend performance and cache hit rate needs further evaluation.
This algorithm is storage-backend-friendly. In fact, we only need to maintain a "visited" linked hash map as the state. Since the state and storage are decoupled, we can consider refactoring our disk cache and implementing an S3 cache in the future.
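A minimal sketch of what that decoupling could look like; the `Storage` trait and all names here are hypothetical illustrations, not the PR's actual API:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical backing store: memory, local disk, or S3 could all
// implement this. Only the eviction state below must stay in memory.
trait Storage<K, V> {
    fn put(&mut self, key: K, value: V);
    fn get(&self, key: &K) -> Option<&V>;
    fn remove(&mut self, key: &K);
}

// The entire eviction state: keys in insertion order plus a visited flag.
struct VisitedState<K> {
    order: VecDeque<K>,        // oldest key at the front
    visited: HashMap<K, bool>, // key -> "accessed since insertion?"
}

impl<K: std::hash::Hash + Eq + Clone> VisitedState<K> {
    fn new() -> Self {
        Self { order: VecDeque::new(), visited: HashMap::new() }
    }

    // Called when a value is written to the backing storage.
    fn on_insert(&mut self, key: K) {
        self.order.push_back(key.clone());
        self.visited.insert(key, false);
    }

    // Called on a cache hit: unlike LRU, a hit only flips a flag and
    // never reorders the list, so hits stay cheap even if the cached
    // bytes live on a remote backend.
    fn on_hit(&mut self, key: &K) {
        if let Some(v) = self.visited.get_mut(key) {
            *v = true;
        }
    }
}
```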
This version lacks a hand pointer, so there is no protection for the frequently accessed parts. Personally, though, I think it is acceptable to evict and reload them. One possible optimization is to change "visited" into a counter with an upper bound, which would provide some protection. That would make it something of a middle ground between S3-FIFO and SIEVE, but further evaluation is still needed; the specific impact depends on the workload.
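A hypothetical sketch of that bounded-counter variant (the names and the bound of 3 are assumptions for illustration, roughly in the spirit of S3-FIFO's small frequency counters):

```rust
use std::collections::{HashMap, VecDeque};

const MAX_CREDIT: u8 = 3; // assumed upper bound, for illustration only

struct CountedState<K> {
    order: VecDeque<K>,     // insertion order, oldest at the front
    credit: HashMap<K, u8>, // 0 = evictable, >0 = protected
}

impl<K: std::hash::Hash + Eq + Clone> CountedState<K> {
    fn new() -> Self {
        Self { order: VecDeque::new(), credit: HashMap::new() }
    }

    fn on_insert(&mut self, key: K) {
        self.order.push_back(key.clone());
        self.credit.insert(key, 0);
    }

    // A hit earns one credit, saturating at the bound.
    fn on_hit(&mut self, key: &K) {
        if let Some(c) = self.credit.get_mut(key) {
            *c = (*c + 1).min(MAX_CREDIT);
        }
    }

    // Victim selection: scan from the oldest key, spending one credit per
    // protected key, until some key has no credit left. Frequently
    // accessed keys therefore survive several eviction rounds.
    fn evict(&mut self) -> Option<K> {
        if self.order.is_empty() {
            return None;
        }
        loop {
            let mut victim = None;
            for (i, key) in self.order.iter().enumerate() {
                let c = self.credit.get_mut(key).expect("state kept in sync");
                if *c == 0 {
                    victim = Some(i);
                    break;
                }
                *c -= 1; // demote instead of evicting
            }
            if let Some(i) = victim {
                let key = self.order.remove(i).expect("index is in range");
                self.credit.remove(&key);
                return Some(key);
            }
            // Every key had credit and was decremented, so retrying terminates.
        }
    }
}
```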
👍 Besides the missing hand pointer, is there any other difference from the original SIEVE? And about the 'SIEVE is not scan-resistant' thing mentioned in the SIEVE paper - any idea how that may affect us? Also, does eviction currently require an O(n) traversal?
No. But the hand does have a significant meaning, so we will try to compare only this PR and LRU.
If we frequently encounter large scans, then scan resistance will be a very important property: it means the elements we insert will soon no longer be accessed. However, the LRU we used previously is also not scan-resistant. We can try using probabilistic models or other methods to improve this further.
Currently, yes, so it is an O(n) operation. Perhaps we can use other techniques to accomplish this, since we only need to find some element that has not been accessed. One simple solution is to allow the elements in "visited" to be moved; then we only need a deque to maintain the order, ensuring that the key to be evicted is always at a known position, depending on when we perform the move.
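A hypothetical sketch of that deque idea: a hit only flips the flag, and all movement is deferred to eviction time, where visited keys are requeued at the back with the flag cleared. The victim is then always the current front, and since each key moves at most once per visit, eviction is amortized O(1):

```rust
use std::collections::{HashMap, VecDeque};

struct LazySieve<K> {
    order: VecDeque<K>,        // the deque: candidate victim at the front
    visited: HashMap<K, bool>, // visited flag per key
}

impl<K: std::hash::Hash + Eq + Clone> LazySieve<K> {
    fn new() -> Self {
        Self { order: VecDeque::new(), visited: HashMap::new() }
    }

    fn on_insert(&mut self, key: K) {
        self.order.push_back(key.clone());
        self.visited.insert(key, false);
    }

    // O(1) on the hit path: no movement, just a flag flip.
    fn on_hit(&mut self, key: &K) {
        if let Some(v) = self.visited.get_mut(key) {
            *v = true;
        }
    }

    // Amortized O(1): pop from the front; visited keys get a second
    // chance and are moved to the back, so no in-place scan is needed.
    fn evict(&mut self) -> Option<K> {
        while let Some(key) = self.order.pop_front() {
            if self.visited.get(&key).copied().unwrap_or(false) {
                self.visited.insert(key.clone(), false);
                self.order.push_back(key);
            } else {
                self.visited.remove(&key);
                return Some(key);
            }
        }
        None
    }
}
```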
Although there seems to be an improvement on the hits dataset, it performs similarly to LRU on some public traces and causes a decrease in throughput due to the O(n) traversal. I will try to make further modifications.
@PragmaTwice Cool work! One possible way to avoid the O(n) traversal ...
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
This optimization is inspired by SIEVE, which can reduce unnecessary element movements and has a potential filtering effect.
Simply put, we maintain the visited status of each key and try to evict those elements that were inserted a long time ago but have not been accessed yet.
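A compact, self-contained illustration of that behavior (hypothetical, using plain `std` collections rather than the PR's actual types): key `a` is the oldest but was accessed, so it survives; `b` is the oldest never-accessed key and is evicted first.

```rust
use std::collections::{HashMap, VecDeque};

fn main() {
    let mut order: VecDeque<&str> = VecDeque::new();
    let mut visited: HashMap<&str, bool> = HashMap::new();

    // Insert a, b, c in that order, then hit the oldest key "a".
    for k in ["a", "b", "c"] {
        order.push_back(k);
        visited.insert(k, false);
    }
    *visited.get_mut("a").unwrap() = true;

    // Evict once: "a" is visited, so its flag is cleared and it is
    // requeued; "b" is the oldest never-accessed key and is the victim.
    let victim = loop {
        let k = order.pop_front().unwrap();
        if visited[k] {
            visited.insert(k, false);
            order.push_back(k);
        } else {
            visited.remove(k);
            break k;
        }
    };
    assert_eq!(victim, "b");
    println!("evicted: {victim}");
}
```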