Snakinya/MCPCorpus

MCPCorpus: A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis

MCPCorpus is a comprehensive dataset for analyzing the Model Context Protocol (MCP) ecosystem, containing ~14K MCP servers and 300 MCP clients with 20+ normalized metadata attributes.

📊 Dataset Overview

  • Scale: ~14K MCP servers + 300 MCP clients
  • Attributes: 20+ normalized metadata attributes per artifact
  • Sources: GitHub, community hubs, package managers
  • Applications: Ecosystem analysis, security research, interoperability studies

📁 Structure

MCPCorpus/
├── Crawler/                    # Data collection tools
│   ├── Servers/               # Server data
│   ├── Clients/               # Client data
│   ├── github_info_collector.py  # GitHub metadata collector
│   └── data_cleaner.py        # Data normalization
└── Website/                   # Web search interface
    ├── server.py              # Local web server
    ├── index.html             # Search interface
    ├── mcpso_servers_cleaned.json
    └── mcpso_clients_cleaned.json

🚀 Quick Start

Explore Dataset

cd Website
python server.py
# Open http://localhost:8000

Access Data Programmatically

import json
import pandas as pd

# Load datasets (paths are relative to the repository root)
with open('Crawler/Servers/mcpso_servers_cleaned.json', 'r', encoding='utf-8') as f:
    servers = json.load(f)
with open('Crawler/Clients/mcpso_clients_cleaned.json', 'r', encoding='utf-8') as f:
    clients = json.load(f)

# Convert to DataFrame
servers_df = pd.DataFrame(servers)
clients_df = pd.DataFrame(clients)
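
Once the records are loaded, a quick first check is how completely each metadata attribute is filled in. The sketch below is one possible way to do that, assuming each JSON file holds a list of flat records, as the `pd.DataFrame(...)` calls above imply. The `attribute_coverage` helper and the sample field names (`name`, `stars`, `license`) are hypothetical and only illustrate the idea; the real attribute names may differ.

```python
import pandas as pd


def attribute_coverage(records):
    """Return the share of records with a non-null value for each attribute."""
    df = pd.DataFrame(records)
    # notna() marks present values; the column mean is the fraction filled in.
    return df.notna().mean().sort_values(ascending=False)


# Hypothetical sample records; the real field names in the dataset may differ.
sample = [
    {"name": "server-a", "stars": 12, "license": "MIT"},
    {"name": "server-b", "stars": 3, "license": None},
    {"name": "server-c", "stars": None, "license": None},
]

coverage = attribute_coverage(sample)
print(coverage)
```

Calling `attribute_coverage(servers)` on the loaded list should give the same kind of per-attribute summary for the full dataset.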

Update Dataset (Optional)

# Collect new data
cd Crawler/Servers && python Server_request.py
cd ../Clients && python Client_request.py

# Add GitHub metadata
cd .. && python github_info_collector.py --token YOUR_GITHUB_TOKEN
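
To keep the token out of shell history, the collector command could be assembled from an environment variable instead. This is a minimal sketch, not part of the project: the `build_collector_command` helper and the `GITHUB_TOKEN` variable name are assumptions, and only the `--token` flag shown above is used.

```python
import os
import sys


def build_collector_command(env=None):
    """Build the github_info_collector.py command with a token read from the environment."""
    env = os.environ if env is None else env
    # GITHUB_TOKEN is an assumed variable name; the project does not define one.
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("Set GITHUB_TOKEN before running the collector")
    # --token is the only flag shown in the README; no other flags are assumed.
    return [sys.executable, "github_info_collector.py", "--token", token]


# Demonstration with a placeholder token rather than a real credential.
cmd = build_collector_command({"GITHUB_TOKEN": "ghp_example"})
print(cmd[1:])
```

Passing the returned list to `subprocess.run(cmd, cwd="Crawler", check=True)` from the repository root would then launch the collector in the same directory as the commands above.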

📚 Citation

If you use MCPCorpus in your research, please cite it as:

@misc{lin2025largescaleevolvabledatasetmodel,
      title={A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis}, 
      author={Zhiwei Lin and Bonan Ruan and Jiahao Liu and Weibo Zhao},
      year={2025},
      eprint={2506.23474},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.23474}, 
}
