The open-source cache datasets were compiled from multiple sources, including Microsoft, CloudPhysics, Tencent, Alibaba, Twitter, and Meta production systems. We provide both plain text and oracleGeneral formats.
You can use the datasets to perform different tasks, including but not limited to:
- Evaluation: Testing your caching systems (Memcached, Redis, database buffer pools)
- Analysis: Gaining insights about production systems and observing access patterns (diurnal, weekly)
- Research: Designing and evaluating new distributed systems and databases
The datasets are stored in AWS S3. You can either download the traces to your local cluster or launch EC2 instances to perform the computation. Since the dataset is large, we recommend provisioning a cluster to run the computation. You can use mountpoint to mount the bucket on each node and distComp to launch computation jobs.
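As an illustration, here is a minimal sketch of downloading a single trace file with Python and boto3. The bucket and key names below are placeholders, not the real locations; substitute the actual S3 paths linked in the table below.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Placeholder bucket/key names; replace with the actual S3 location of the trace you need.
BUCKET = "example-cache-dataset-bucket"
KEY = "metakv/example_trace.oracleGeneral.zst"

# The datasets are public, so unsigned (anonymous) requests are sufficient.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file(BUCKET, KEY, "example_trace.oracleGeneral.zst")
```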
Cache Type | Dataset | Year | Time span (days) | # Trace | # Request (million) | Request (TB) | # Object (million) | Object (TB) | Source | Txt format | OracleGeneral format |
---|---|---|---|---|---|---|---|---|---|---|---|
Key-value | MetaKV | 2022 | 1 | 5 | 1,644 | 958 | 82 | 76 | Cachelib | S3, HF | HF |
Key-value | Twitter KV | 2020 | 7 | 54 | 195,441 | 106 | 10,650 | 6 | OSDI '20 | SNIA, HF | HF |
Object | MetaCDN | 2025 | 7 | 3 | 231 | 8,800 | 76 | 1,563 | Cachelib | S3, HF | HF |
Object | Wikimedia CDN | 2019 | 7 | 3 | 2,863 | 200 | 56 | 13 | Wikitech | Wiki, HF | HF |
Object | Tencent Photo | 2018 | 8 | 2 | 5,650 | 141 | 1,038 | 24 | ICS '18 | SNIA, HF | HF |
Object | IBM Docker | 2018 | 75 | 7 | 38 | 11 | - | 171 | FAST '18 | SNIA, HF | HF |
Block | Thesios (Google) | 2024 | 61 | 3 | 115 | 12,420 | - | - | ASPLOS '24 | Google Cloud, HF | HF |
Block | MetaStorage | 2023 | 5 | 5 | 14 | 48 | 7 | 30 | Cachelib | S3, HF | HF |
Block | Tencent CBS | 2020 | 8 | 4,030 | 33,690 | 1,091 | 551 | 66 | ATC '20 | SNIA, HF | HF |
Block | Alibaba Block | 2020 | 30 | 1,000 | 19,676 | 664 | 1,702 | 117 | IISWC '20 | Host, HF | HF |
Block | CloudPhysics | 2015 | 7 | 106 | 2,114 | 82 | 492 | 22 | FAST '15 | HF | HF |
Block | Microsoft Cambridge | 2007 | 7 | 13 | 410 | 10 | 74 | 3 | FAST '08 | SNIA, HF | HF |
Note
A more detailed description of each dataset can be found in the source link and the sections below.
We provide both a human-readable plain text format and the binary oracleGeneral format, which can be used directly with the libCacheSim platform. Each oracleGeneral record has the following layout:
struct {
    uint32_t timestamp;        // request time in seconds
    uint64_t obj_id;           // object ID
    uint32_t obj_size;         // object size in bytes
    int64_t next_access_vtime; // logical time of next access, -1 if no next access
}
All datasets are compressed with zstd. You can use `zstd -d` to decompress the data.
Note
libCacheSim can directly work with compressed data, so no decompression is needed if you use libCacheSim to run simulations.
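If you prefer to parse the binary format yourself, here is a minimal reader sketch in Python. It assumes each record follows the packed, little-endian 24-byte layout shown above (no struct padding); verify this against your files before relying on it.

```python
import io
import struct
import zstandard  # pip install zstandard

# One oracleGeneral record: uint32 timestamp, uint64 obj_id, uint32 obj_size,
# int64 next_access_vtime -- assumed packed little-endian, 24 bytes per record.
RECORD = struct.Struct("<IQIq")

def read_oracle_general(path):
    """Yield (timestamp, obj_id, obj_size, next_access_vtime) tuples from a .zst trace."""
    with open(path, "rb") as f:
        reader = io.BufferedReader(zstandard.ZstdDecompressor().stream_reader(f))
        while True:
            buf = reader.read(RECORD.size)
            if len(buf) < RECORD.size:
                break
            yield RECORD.unpack(buf)

# Example: count requests and unique objects in one trace file.
# n_req, objects = 0, set()
# for ts, obj_id, size, next_vtime in read_oracle_general("trace.oracleGeneral.zst"):
#     n_req += 1
#     objects.add(obj_id)
```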
This dataset contains traces from Meta Cachelib.
It includes two datasets collected at different times with different formats. The original release can be found here.

These traces were captured over 5 consecutive days from a Meta key-value cache cluster consisting of 500 hosts. Each host uses roughly 42 GB of DRAM and 930 GB of SSD for caching. The open-source traces were merged from multiple hosts, and the effective sampling ratio is around 1/100.
- `key`: anonymized requested object ID
- `op`: operation, `GET` or `SET`
- `size`: the size of the object, could be 0 if it is a cache miss
- `op_count`: number of operations in the current second
- `key_size`: size of the object ID
These traces were captured over 5 consecutive days from a Meta key-value cache cluster consisting of 8,000 hosts. Each host uses roughly 42 GB of DRAM and 930 GB of SSD for caching. The open-source traces were merged from multiple hosts, and the effective sampling ratio is around 1/125.
- `op_time`: the time of the request
- `key`: anonymized requested object ID
- `key_size`: size of the object ID
- `op`: operation, `GET`, `GET_LEASE`, `SET`, `DELETE`
- `op_count`: number of operations in the current second
- `size`: the size of the object, could be 0 if it is a cache miss
- `cache_hits`: the number of cache hits
- `ttl`: time-to-live in seconds
- `usecase`: identifies the tenant, i.e., the application using the distributed key-value cache
- `sub_usecase`: further categorizes the different traffic from the same usecase, but may be incomplete or inaccurate
This dataset contains traces from Twitter's in-memory key-value caching (Twemcache/Pelikan) clusters. The traces were collected from 54 clusters in March 2020, and each trace is one week long. Note that these cache clusters are first-level caches, so the object popularity in this dataset is highly skewed.
The details of the trace can be found in A large scale analysis of hundreds of in-memory cache clusters at Twitter.
The original traces are plain text structured as comma-separated columns. Each row represents one request in the following format.
- `timestamp`: the time when the cache receives the request, in seconds
- `anonymized key`: the original key with anonymization
- `key size`: the size of the key in bytes
- `value size`: the size of the value in bytes, could be 0 if it is a cache miss
- `client id`: the anonymized client (frontend service) that sends the request
- `operation`: one of get/gets/set/add/replace/cas/append/prepend/delete/incr/decr
- `TTL`: the time-to-live (TTL) of the object set by the client; it is 0 when the request is not a write request
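As an illustration (not part of the original release), here is a sketch of parsing the comma-separated rows after decompression and estimating the get miss ratio, using the convention above that a value size of 0 indicates a miss. The trace file name is hypothetical.

```python
import csv

# Column order follows the description above:
# timestamp, anonymized key, key size, value size, client id, operation, TTL
def iter_twitter_trace(path):
    with open(path, newline="") as f:
        for ts, key, key_size, value_size, client_id, op, ttl in csv.reader(f):
            yield int(ts), key, int(key_size), int(value_size), client_id, op, int(ttl)

def get_miss_ratio(path):
    """Estimate the get miss ratio, treating value size == 0 as a miss."""
    gets = misses = 0
    for ts, key, ksz, vsz, client, op, ttl in iter_twitter_trace(path):
        if op == "get":
            gets += 1
            misses += vsz == 0
    return misses / gets if gets else 0.0

# Example (hypothetical file name):
# print(get_miss_ratio("cluster52.txt"))
```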
This is a CDN request dataset collected in March 2023 and April 2025. The original release can be found here.
These traces were captured over 7 consecutive days from a selected CDN edge cluster. The cluster contains around 300 hosts, each using roughly 105 GB of DRAM and 3,577 GB of SSD for caching. The traffic factor is 1/7.08, and the scaled cache sizes are 15,150 MB of DRAM and 1,032,294 MB of SSD.
These traces were captured from three selected Meta CDN cache clusters (named nha, prn, and eag), each for 7 days.
Each cluster consists of thousands of hosts. Each host uses roughly 40 GB of DRAM and 1.8 TB of SSD for caching. The traffic factors and scaled cache sizes are:
- nha: 1/6.37, DRAM 6006 MB, NVM 272314 MB
- prn: 1/4.58, DRAM 8357 MB, NVM 375956 MB
- eag: 1/13.4, DRAM 2857 MB, NVM 129619 MB
I believe these CDN cache clusters are the edge clusters (rather than FNA clusters) given their sizes and compulsory miss ratios. Meta's CDN uses a two-layer hierarchy: the first layer consists of FNA clusters (also called MNA, Meta Network Appliance), which are small and deployed inside ISP networks. Cache misses from an FNA cluster go to a larger edge cluster deployed in an IXP data center.
- `timestamp`: the timestamp of the request
- `cacheKey`: anonymized object ID
- `OpType`: unknown, but it seems to only contain the value 1
- `objectSize`: the size of the object in bytes
- `responseSize`: the HTTP response size
- `responseHeaderSize`: the HTTP response header size
- `rangeStart`: the start offset of a range request, -1 if it is not a range request
- `rangeEnd`: the end offset of a range request, -1 if it is not a range request
- `TTL`: time-to-live in seconds
- `SamplingRate`: trace collection sampling ratio; should be ignored because sampled traces from many hosts are mixed
- `cache_hit`: value 1 indicates this request is a cache hit, 0 indicates a cache miss
- `item_value`: unknown, value is either 0 or 1
- `RequestHandler`: unknown
- `cdn_content_type_id`: anonymized content type ID, either an int or `-`
- `vip_type`: unknown
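A small helper sketch (my own, not from the original release) for deriving how many bytes a request actually asks for, treating `rangeStart`/`rangeEnd` of -1 as a full-object request and assuming the range bounds are inclusive byte offsets:

```python
def requested_bytes(object_size: int, range_start: int, range_end: int) -> int:
    """Bytes requested by one CDN request, under the assumptions stated above."""
    if range_start == -1 and range_end == -1:
        return object_size                 # not a range request: full object
    if range_end == -1:
        return object_size - range_start   # open-ended range: rest of the object
    return range_end - range_start + 1     # inclusive byte range
```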
This dataset contains traces from Wikimedia CDN infrastructure. It includes datasets collected at different times (2008, 2016, and 2019) with different formats. We omit the 2008 version as it doesn't have object size information and is incomplete.
The original release can be found here.
This dataset is a restricted snapshot of the `wmf.webrequest` table. It consists of 42 compressed files covering 21 days of data: 21 files contain upload (image) request data (`cache-u`), and 21 contain text pageview request data (`cache-t`). Each file covers exactly a 24-hour period.
Fields in the upload (`cache-u`) files:
- `relative_unix`: Seconds since dataset start
- `hashed_path_query`: Salted hash of request path and query
- `image_type`: Image type: `jpeg`, `png`, `gif`, `svg+xml`, `x-icon`, `tiff`, `webp`, `vnd.djvu`, `x-xcf`
- `response_size`: Response size in bytes
- `time_firstbyte`: Time to first byte in seconds (float)
Fields in the text (`cache-t`) files:
- `relative_unix`: Seconds since dataset start
- `hashed_host_path_query`: Salted hash of host, path, and query
- `response_size`: Response size in bytes
- `time_firstbyte`: Time to first byte in seconds
These are request traces collected from Wikimedia's upload cache infrastructure. The traces span two consecutive weeks and were captured from a single front-end cache node (`cp4006`) located in the ulsfo data center. Each node has roughly 96 GB of DRAM for memory caching and roughly 720 GB of SSD for disk caching. The trace was filtered to include only user traffic with 200 OK responses, with URLs hashed to preserve privacy. The effective sampling corresponds to requests routed through a single cache. The data includes anonymized object identifiers, response sizes, content types, and latency metrics, and has been used for performance evaluation of cache replacement policies.
- `hashed_host_and_path`: Salted hash of request host and path
- `uri_query`: Full requested URL with query parameters (plaintext)
- `content_type`: MIME type from HTTP Content-Type header
- `response_size`: Bytes sent until first byte of response
- `X_Cache`: CDN caching metadata and cache hierarchy
This is a photo CDN request dataset collected in 2016 over a span of 9 consecutive days at a sampling ratio of 1/100. It captures real-world production workloads from Tencent's large-scale photo storage system, QQPhoto, including cache hit results under an LRU policy with a total cache size of roughly 5 TB.
QQPhoto uses separate photo upload and download channels, and a two-tier cache is employed for photo download, consisting of outside caches and data center caches. The details of the trace can be found in Demystifying Cache Policies for Photo Stores at Scale: A Tencent Case Study.
- `timestamp`: Request time in `YYYYMMDDHHMMSS` format
- `photo_id`: Hexadecimal checksum of the requested photo
- `image_format`: `0` = jpg, `5` = webp
- `size_category`: Image size tier: `l`, `a`, `o`, `m`, `c`, `b` (increasing sizes)
- `return_size`: Bytes returned for the requested image
- `cache_hit`: `1` = cache hit, `0` = cache miss
- `terminal_type`: Device type: `PHONE` or `PC`
- `response_time`: Response time in milliseconds (`0` = <1 ms)
The size categories correspond to the following sizes:
- `l`: 33,136 bytes
- `a`: 3,263,749 bytes
- `o`: 4,925,317 bytes
- `m`: 6,043,467 bytes
- `c`: 6,050,183 bytes
- `b`: 8,387,821 bytes
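For convenience, here is a sketch of converting the `YYYYMMDDHHMMSS` timestamps into Unix timestamps. The timezone of the original collection is not documented, so UTC is assumed here; adjust if needed.

```python
from datetime import datetime, timezone

def parse_qqphoto_timestamp(ts: str) -> int:
    """Convert a YYYYMMDDHHMMSS string to a Unix timestamp (UTC assumed)."""
    return int(datetime.strptime(ts, "%Y%m%d%H%M%S")
               .replace(tzinfo=timezone.utc).timestamp())

# parse_qqphoto_timestamp("20160218093015") -> 1455787815
```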
This dataset contains Docker Registry traces from IBM's infrastructure, captured over 75 days (June 20 - September 2, 2017) across 7 availability zones. The traces capture production workload patterns for container image storage and retrieval; 208 GB of raw data were processed into 22.4 GB of traces containing over 38 million requests and over 180 TB of data transfer.
The details of the trace can be found in Improving Docker Registry Design based on Production Workload Analysis.
- `host`: Anonymized registry server
- `http.request.duration`: Response time in seconds
- `http.request.method`: HTTP method (GET, PUT, HEAD, PATCH, POST)
- `http.request.remoteaddr`: Anonymized client IP
- `http.request.uri`: Anonymized requested URL
- `http.request.useragent`: Docker client version
- `http.response.status`: HTTP response code
- `http.response.written`: Data received/sent in bytes
- `id`: Unique request identifier
- `timestamp`: Request arrival time (UTC)
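As an illustration, here is a sketch that tallies bytes transferred per HTTP method. It assumes each record is available as a JSON object keyed by the field names above, one object per line; adjust the loading step to however the released files are actually encoded.

```python
import json
from collections import defaultdict

def bytes_by_method(path):
    """Total http.response.written bytes per HTTP method (one JSON object per line assumed)."""
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            totals[record["http.request.method"]] += int(record.get("http.response.written", 0) or 0)
    return dict(totals)
```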
This dataset contains synthetically generated I/O traces that represent Google's storage cluster workloads. The traces are created using advanced synthesis techniques to produce realistic I/O patterns that closely match actual production disk behaviors observed in Google's data centers.
The synthetic traces are generated for multiple disk categories across different storage clusters, with detailed validation performed over multiple days. The primary dataset covers workloads from 2024/01/15 to 2024/03/15, focusing on the largest disk category in one storage cluster. The synthesis methodology has been validated across various disk types and storage clusters, demonstrating consistent accuracy. The sampling rate is 1/10,000.
The details of the trace can be found in Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples.
- `filename`: Local filename
- `file_offset`: File offset
- `application`: Application owner of the file
- `c_time`: File creation time
- `io_zone`: Warm or cold
- `redundancy_type`: Replicated or erasure-coded
- `op_type`: Read or write
- `service_class`: Request's priority: latency-sensitive, throughput-oriented, or other
- `from_flash_cache`: Whether the request is served from the flash cache
- `cache_hit`: Whether the request is served by the server's buffer cache
- `request_io_size_bytes`: Size of the request
- `disk_io_size_bytes`: Size of the disk operation
- `response_io_size_bytes`: Size of the response
- `start_time`: Request's arrival time at the server
- `disk_time`: Disk read time (for cache-miss reads)
- `latency`: Latency of the operation (from arrival time to response time at the server)
- `simulated_disk_start_time`: Start time of the disk read (for cache-miss reads)
- `simulated_latency`: Latency of the operation (adjusted by the trace reorganizer)
- Plain text: Google Cloud, HF
- OracleGeneral format: HF
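A small sketch (my own, not part of the release) for attributing where a read in this trace was served from, based on the `cache_hit` and `from_flash_cache` fields above; the exact field encodings (boolean vs. 0/1 flags) may differ, so adjust as needed.

```python
def read_served_by(record: dict) -> str:
    """Classify a read record as served by the buffer cache, the flash cache, or disk."""
    if int(record["cache_hit"]):
        return "buffer_cache"
    if int(record["from_flash_cache"]):
        return "flash_cache"
    return "disk"
```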
These traces were captured over 5 consecutive days from a Meta block storage cluster consisting of 3,000 hosts at a sampling ratio of 1/4000. Each host uses roughly 10 GB of DRAM and 380 GB of SSD for caching. The open-source traces were merged from multiple hosts, and the effective sampling ratio is around 1.
The original release can be found here.
- `op_time`: The time of the request
- `block_id`: The requested block ID
- `block_id_size`: The size of the requested block (in MB); this field is always 40
- `io_size`: The requested size; note that requests often do not ask for the full block
- `io_offset`: The start offset in the block
- `user_name`: Anonymized username (represents different use cases)
- `user_namespace`: Anonymized namespace; can be ignored, only one value
- `op_name`: Operation name, one of `getChunkData.NotInitialized`, `getChunkData.Permanent`, `putChunk.NotInitialized`, and `putChunk.Permanent`
- `op_count`: The number of requests in the same second
- `host_name`: Anonymized host name that serves the request
- `rs_shard_id`: Reed-Solomon shard ID
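Since `block_id_size` is fixed at 40 MB while `io_size` and `io_offset` describe the portion actually accessed, here is a sketch (assuming both fields are in bytes) of computing how much of a block each request touches:

```python
BLOCK_SIZE_BYTES = 40 * 1024 * 1024  # block_id_size is always 40 (MB) in this trace

def block_fraction_accessed(io_offset: int, io_size: int) -> float:
    """Fraction of the 40 MB block covered by one request (io_offset/io_size assumed in bytes)."""
    end = min(io_offset + io_size, BLOCK_SIZE_BYTES)
    return max(end - io_offset, 0) / BLOCK_SIZE_BYTES
```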
This dataset consists of 216 I/O trace files collected from a production cloud block storage (CBS) system over a 10-day period (October 1–10, 2018). The traces originate from a single warehouse (also known as a failure domain), covering I/O requests from 5,584 cloud virtual volumes (CVVs). These requests were ultimately redirected to a storage cluster comprising 40 storage nodes (i.e., disks).
These traces are well suited for per-volume analysis, i.e., studying the access patterns of individual CVVs by grouping requests by VolumeID. More details on the dataset can be found in OSCA: An Online-Model Based Cache Allocation Scheme in Cloud Block Storage Systems.
- `Timestamp`: Time the I/O request was issued, in Unix time (seconds since Jan 1, 1970)
- `Offset`: Starting offset of the I/O request from the beginning of the logical volume; given in sectors (1 sector = 512 bytes) in the original release, converted to bytes in our open-sourced traces
- `Size`: Transfer size of the I/O request, in sectors
- `IOType`: Type of operation: 0 for Read, 1 for Write
- `VolumeID`: Anonymized ID of the cloud virtual volume (CVV)
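Here is a sketch of normalizing one record to byte units, following the description above (`Offset` already converted to bytes in the open-sourced traces, `Size` still in sectors of 512 bytes):

```python
SECTOR_BYTES = 512

def request_byte_range(offset_bytes: int, size_sectors: int) -> tuple[int, int]:
    """Return the [start, end) byte range touched by one I/O request."""
    length = size_sectors * SECTOR_BYTES
    return offset_bytes, offset_bytes + length
```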
The dataset was collected from a production cluster of Alibaba Cloud's Elastic Block Storage (EBS) service, which provides virtual disk storage. A total of 1,000 virtual disks were randomly sampled from the cluster. All I/O activities to these disks were recorded for the entire month of January 2020.
The selected disks are Ultra Disk products, a cost-effective tier in Alibaba Cloud's block storage offerings. Ultra Disks are backed by a distributed storage system that ensures high reliability but with relatively lower random I/O performance compared to Standard SSD or Enhanced SSD products. Typical applications of Ultra Disks include OS hosting, web servers, and big data workloads. More details on the dataset can be found in An In-Depth Analysis of Cloud Block Storage Workloads in Large-Scale Production and on GitHub.
- `device_id`: ID of the virtual disk, remapped to the range 0 ∼ 999
- `opcode`: Operation type: R for Read, W for Write
- `offset`: Byte offset of the operation from the beginning of the disk
- `length`: Length of the I/O operation in bytes
- `timestamp`: Time the operation was received by the server, in microseconds since the Unix epoch
- `device_id`: ID of the virtual disk
- `capacity`: Capacity of the virtual disk in bytes
This block-level I/O trace dataset was collected from 106 virtual disks on VMware ESXi hypervisors in production environments for one week. The traces were recorded with VMware’s vscsiStats. Local sampling was used when full trace uploads and corresponding storage analysis weren't needed.
The traced VMs run Linux or Windows, with disk sizes from 8 GB to 34 TB (median 90 GB), memory up to 64 GB (median 6 GB), and up to 32 vCPUs (median 2).
The details of the trace can be found in Efficient MRC Construction with SHARDS.
- `timestamp`: The time the I/O request was issued, in microseconds
- `lbn`: The starting Logical Block Number (LBA) of the I/O request, in sectors
- `len`: The size of the I/O request, in sectors
- `cmd`: The SCSI command code indicating the access type (e.g., read or write)
- `ver`: A version field used to distinguish between VSCSI1 and VSCSI2 formats
This is a block-level I/O trace collected in February 2007. It captures activity from 36 volumes (comprising 179 disks) across 13 enterprise servers in a production data center over the course of one week, starting at 5 PM GMT on February 22, 2007.
Each server was configured with a RAID-1 boot volume using two internal disks, and one or more RAID-5 data volumes backed by co-located, rack-mounted DAS. All servers ran Windows Server 2003 SP2, with data stored on NTFS volumes and accessed via protocols such as CIFS and HTTP. I/O requests were recorded using the Event Tracing for Windows (ETW) tool. We note this dataset was collected to study power-saving strategies in enterprise storage systems.
Details of the trace can be found in Write Off-Loading: Practical Power Management for Enterprise Storage.
- `Timestamp`: The time the I/O was issued, recorded in “Windows filetime” format
- `Hostname`: The hostname of the server, which should match the hostname in the trace file name
- `DiskNumber`: The logical disk number, which should match the disk number in the trace file name
- `Type`: The type of I/O operation, either “Read” or “Write”
- `Offset`: The starting offset of the I/O request in bytes from the beginning of the logical disk
- `Size`: The transfer size of the I/O request in bytes
- `ResponseTime`: The time taken by the I/O to complete, measured in Windows filetime units
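Because both `Timestamp` and `ResponseTime` use Windows filetime units (100-nanosecond intervals, with timestamps counted from 1601-01-01 UTC), here is a small conversion sketch:

```python
# Windows filetime counts 100-nanosecond intervals; timestamps start at 1601-01-01 UTC.
FILETIME_TICKS_PER_SECOND = 10_000_000
SECONDS_1601_TO_1970 = 11_644_473_600

def filetime_to_unix(filetime: int) -> float:
    """Convert an absolute Windows filetime timestamp to Unix seconds."""
    return filetime / FILETIME_TICKS_PER_SECOND - SECONDS_1601_TO_1970

def filetime_duration_to_seconds(ticks: int) -> float:
    """Convert a duration in filetime units (e.g., ResponseTime) to seconds."""
    return ticks / FILETIME_TICKS_PER_SECOND
```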
Due to the large size of these datasets, we recommend using larger servers for computation, such as AWS EC2 VMs.
Using libCacheSim to read the dataset
Using libCacheSim to analyze and plot the trace
Using libCacheSim to run cache simulation
Note that some open-source datasets are not included in this release. These datasets often do not provide a clear description of how the data were collected.
- 2024/12/01: First version
If you have any questions, please join the Google Group or [Slack].
This work is licensed under the Creative Commons Attribution 4.0 International Public License (CC BY 4.0). To obtain a copy of this license, see LICENSE-CC-BY-4.0.txt in the archive, visit CreativeCommons.org or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
Important
Term
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
The original traces were collected by contributors from multiple institutes, including Meta, Twitter, CloudPhysics, Microsoft, Wikimedia, Alibaba, Tencent, and several others.
This collection and the converted traces are open-sourced by Juncheng Yang from the School of Engineering and Applied Sciences at Harvard University.
The storage of this dataset is sponsored by AWS under an open data agreement.
If you would like your paper to be featured here, please send a PR.
If you use these open-source datasets in your research, please cite the papers where the traces were originally released; BibTeX reference entries can be found in references.md.