GitHub - dcimring/edgar-scraper

Software Requirements Document: SEC EDGAR Crypto Filing Alert Service

1. Introduction

This document outlines the software requirements for a backend service designed to monitor the U.S. Securities and Exchange Commission (SEC) EDGAR system for new filings, identify those related to cryptocurrency or blockchain topics, and dispatch alert messages to a designated Telegram chat. The goal is to provide timely notifications to stakeholders interested in the intersection of public company disclosures and the crypto asset space.

2. Functional Requirements

The following are the core functionalities the system must provide:

FR-001: SEC EDGAR Filing Monitoring: The system shall periodically (every 5 minutes) check for new filings published on the SEC EDGAR system.
FR-002: Filing Content Retrieval: For each new filing identified, the system shall retrieve the full text or relevant sections of the filing.
FR-003: Crypto Content Detection & Summary Generation (LLM-based): The system shall utilize a Large Language Model (LLM) to:
- Determine if the retrieved filing content contains information related to crypto/cryptocurrency topics (e.g., Bitcoin, Ethereum, stablecoin, blockchain, digital assets, tokens, NFT, Web3, DLT, cryptographic, mining, DeFi, DAO).
- If crypto-related content is detected, generate a concise summary or snippet of the relevant crypto-related content from the filing.
FR-004: Alert Message Generation: If crypto-related content is detected, the system shall generate an alert message containing the following information:
- Company Name
- Form Type (e.g., 10-K, 10-Q, 8-K)
- Filing Date
- Direct Link to the SEC EDGAR filing
- The LLM-generated short snippet or summary of the relevant crypto-related content.
FR-005: Telegram Alert Dispatch: The generated alert message shall be sent to a pre-configured Telegram chat using the Telegram Bot API.
FR-006: Rate Limit Handling: The system shall gracefully handle API rate limits imposed by external services (SEC, Telegram, LLM) by implementing appropriate delays or back-off strategies.
FR-007: Request Retry Mechanism: The system shall implement a retry mechanism for failed API requests (e.g., network errors, temporary service unavailability) with exponential back-off.

3. Non-Functional Requirements

These requirements define the quality attributes of the system:

NFR-001: Performance - Latency: Alerts for new crypto-related filings shall be dispatched to Telegram within 10 minutes of the filing appearing on EDGAR.
NFR-002: Performance - Throughput: The system shall be capable of processing up to 1,000 new filings per 5-minute interval.
NFR-003: Reliability - Availability: The service shall maintain an uptime of 99.5%.
NFR-004: Reliability - Error Handling: The system shall log all errors, including failed API calls, parsing issues, LLM failures, and content detection failures, with sufficient detail for debugging.
NFR-005: Scalability: The architecture shall support horizontal scaling to accommodate an increase in filing volume or the number of monitored keywords/topics.
NFR-006: Security:
- API keys (e.g., Telegram Bot Token, LLM API Key) shall be stored securely (e.g., environment variables, secret management service).
- All external communication shall use HTTPS.
NFR-007: Maintainability: The codebase shall be modular, well-documented, and follow established coding standards to facilitate future enhancements and bug fixes.
NFR-008: Observability: The system shall provide metrics and logs to monitor its health, performance, and operational status (e.g., number of filings processed, alerts sent, errors).

4. Technical Architecture Overview

The system will follow a microservice-oriented or modular monolithic architecture, primarily event-driven for processing new filings.

Scheduler: Triggers the periodic check for new SEC filings.
SEC EDGAR Poller: Fetches new filing metadata and links.
Filing Downloader: Retrieves the full text of individual filings.
LLM Integration Module: Interacts with the Large Language Model for crypto content detection and summary generation.
Alert Generator: Formulates the alert message based on detected content and the LLM-generated summary.
Telegram Dispatcher: Sends the alert message to the Telegram chat.
Persistence Layer (Optional but Recommended): A database to store processed filing IDs to prevent re-processing and potentially store alert history.
Queueing System (Optional but Recommended): To decouple components and handle bursts of filings, ensuring resilience against processing delays.


graph TD
A[Scheduler: Every 5 min] --\> B{SEC EDGAR Poller};
B --\> C[Filing Metadata];
C --\> D[Filing Downloader];
D --\> E[Raw Filing Content];
E --\> F[LLM Integration Module];
F -- Crypto Detected & Summary --\> G[Alert Generator];
G --\> H[Telegram Dispatcher];
H --\> I[Telegram Chat];
F -- No Crypto --\> J[Log: No crypto content];
D -- Error/Retry --\> D;
H -- Error/Retry --\> H;
F -- Error/Retry --\> F;
B -- Persistence --\> K[Database: Processed Filing IDs];
F -- Persistence --\> K;

5. API Usage Details

5.1. SEC EDGAR API

The SEC provides public access to EDGAR data. The primary method for checking new filings is typically through the daily index files or RSS feeds.

Endpoint for New Filings:
- The SEC does not have a single "latest filings" API endpoint in the traditional sense. Instead, new filings are typically identified by checking daily index files or RSS feeds.
- Recommended Approach: Monitor the EDGAR RSS feed for new filings: https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=&company=&dateb=&owner=only&start=0&count=40&output=atom (adjust count as needed). This feed provides links to the filing details.
- Alternatively, access daily index files (e.g., https://www.sec.gov/Archives/edgar/daily-index/) which list all filings for a given day.
Filing Content Retrieval: Once a filing's accession number is identified from the RSS feed or index, the full text can be accessed directly from the SEC archives.
- Example URL Structure: https://www.sec.gov/Archives/edgar/data/{CIK}/{ACCESSION_NUMBER_NO_DASHES}/{DOCUMENT_NAME}.txt
- The DOCUMENT_NAME is usually the primary filing document (e.g., form-type.txt or primary_document.htm). Parsing HTML documents will require an HTML parser.
Rate Limits: The SEC does not publish strict rate limits but advises against excessive polling. A reasonable delay between requests (e.g., 100ms-500ms) and retry logic is crucial.

5.2. Telegram Bot API

The Telegram Bot API is used to send messages to a specific chat.

Base URL: https://api.telegram.org/bot<YOUR_BOT_TOKEN>/
Endpoint for Sending Messages: sendMessage
- Method: POST
- Parameters:
  - chat_id: Unique identifier for the target chat or username of the target channel (e.g., @my_channel).
  - text: Text of the message to be sent.
  - parse_mode (Optional): MarkdownV2 or HTML for rich text formatting.
  - disable_web_page_preview (Optional): Set to true if you don't want a link preview.
Example Request (Conceptual):


POST [https://api.telegram.org/botYOUR\_BOT\_TOKEN/sendMessage](https://api.telegram.org/botYOUR_BOT_TOKEN/sendMessage)
Content-Type: application/json

{
"chat\_id": "-1234567890", // Example chat ID
"text": "🚨 *New Crypto Filing Alert\!* 🚨\\n\\n*Company:* Example Corp\\n*Form:* 8-K\\n*Date:* 2023-10-26\\n*Link:* [https://www.sec.gov/example-filing](https://www.google.com/search?q=https://www.sec.gov/example-filing)\\n\\n*Snippet:* '...our strategic investment in *blockchain* technology and *digital assets*...'",
"parse\_mode": "MarkdownV2"
}

Rate Limits: Telegram has rate limits for bots (e.g., 30 messages per second to a single chat, 20 messages per minute to different chats). Implement a queue and token bucket algorithm if message volume is high.

5.3. Large Language Model (LLM) API

The system will integrate with a Large Language Model (e.g., Gemini API) for advanced content analysis and summary generation.

Endpoint for Text Generation/Analysis: https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${apiKey} (or similar, depending on the chosen LLM provider and model).
Method: POST
Parameters:
contents: An array of parts representing the conversation history, including the filing content to be analyzed.
generationConfig: Configuration for the generation, potentially including responseMimeType for structured output.
Example Request (Conceptual for Crypto Detection and Summary):


let chatHistory = [];
chatHistory.push({ role: "user", parts: [{ text: "Analyze the following SEC filing text for crypto-related content. If found, provide a concise summary (max 100 words) of the relevant sections. If no crypto content is found, respond with 'NO\_CRYPTO'.\\n\\nFiling Text: [Full text of SEC filing here]" }] });
const payload = { contents: chatHistory };
const apiKey = ""; // Will be provided by the environment
const apiUrl = `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${apiKey}`;
const response = await fetch(apiUrl, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload)
});
const result = await response.json();
// Process result.candidates[0].content.parts[0].text for summary or "NO\_CRYPTO"

Rate Limits: LLM APIs typically have rate limits based on requests per minute or tokens per minute. Implement robust retry mechanisms and potentially a queue for LLM requests.

6. Key Components/Services/Modules

Scheduler Module: Responsible for initiating the filing check process at regular intervals. Can be implemented using cron jobs, internal timers, or a cloud-based scheduler.
SEC EDGAR Client: A dedicated module for interacting with SEC EDGAR. It will handle:
Fetching RSS feeds or index files.
Parsing filing metadata (company name, form type, date, accession number).
Constructing URLs for full filing retrieval.
Downloading filing content (HTML/TXT).
Implementing rate limiting and retry logic for SEC requests.
LLM Integration Module: The core logic for interacting with the Large Language Model.
Text Preparation: Prepares filing content for LLM input (e.g., truncation if too long, prompt engineering).
API Call Management: Handles API calls to the LLM, including authentication, request formatting, and response parsing.
Error Handling & Retries: Manages LLM-specific errors and implements retry logic.
Content Interpretation: Interprets the LLM's response to determine crypto relevance and extract the summary.
Alert Formatting Module: Constructs the alert message string, ensuring proper formatting (e.g., Markdown for Telegram).
Telegram Client: A dedicated module for sending messages to Telegram.
Handles authentication with the Bot Token.
Implements rate limiting and retry logic for Telegram API calls.
Configuration Management: A system to manage sensitive information (API keys, chat IDs) and configurable parameters (polling interval, keyword list, LLM model parameters).
Logging and Monitoring: Integrates with a logging system (e.g., ELK stack, cloud logging) and a monitoring solution (e.g., Prometheus, Datadog) for operational insights.
Database (e.g., PostgreSQL, MongoDB):
To store the accession numbers of processed filings to avoid duplicate alerts.
Optionally, to store a history of alerts sent.

7. Optional Enhancements and Future Improvements

FE-001: Advanced NLP for Context (Further Refinement): While LLM is used, further refinement of prompts or fine-tuning could lead to even more precise contextual understanding and nuanced summaries.
FE-002: Customizable Keywords (LLM Prompting): Allow users to influence the LLM's detection by providing custom "hints" or examples in the prompt, rather than strict keyword lists.
FE-003: Multiple Alert Channels: Support additional alert channels beyond Telegram, such as email, Slack, or webhooks.
FE-004: User Interface: Develop a simple web interface for:
Viewing a history of alerts.
Configuring LLM parameters and alert channels.
Monitoring system health.
FE-005: Filing Type Filtering: Allow users to specify which SEC filing types (e.g., only 8-K, 10-Q) should be monitored.
FE-006: Sentiment Analysis: Analyze the sentiment of the crypto-related content (positive, negative, neutral) using the LLM and include it in the alert.
FE-007: Data Export: Provide functionality to export detected crypto filings and their summaries in structured formats (e.g., CSV, JSON).

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
aave-scraper-prd.md		aave-scraper-prd.md
aave_html_structure.html		aave_html_structure.html
coding_standards.md		coding_standards.md
gemini.md		gemini.md
initial_prompt.txt		initial_prompt.txt
main.py		main.py
requirements.txt		requirements.txt
sec-crypto-alert-prd-2.md		sec-crypto-alert-prd-2.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Software Requirements Document: SEC EDGAR Crypto Filing Alert Service

1. Introduction

2. Functional Requirements

3. Non-Functional Requirements

4. Technical Architecture Overview

5. API Usage Details

5.1. SEC EDGAR API

5.2. Telegram Bot API

5.3. Large Language Model (LLM) API

6. Key Components/Services/Modules

7. Optional Enhancements and Future Improvements

About

Uh oh!

Releases

Packages

Languages

dcimring/edgar-scraper

Folders and files

Latest commit

History

Repository files navigation

Software Requirements Document: SEC EDGAR Crypto Filing Alert Service

1. Introduction

2. Functional Requirements

3. Non-Functional Requirements

4. Technical Architecture Overview

5. API Usage Details

5.1. SEC EDGAR API

5.2. Telegram Bot API

5.3. Large Language Model (LLM) API

6. Key Components/Services/Modules

7. Optional Enhancements and Future Improvements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages