
feat: add safeguard for large files in readFileTool when maxReadFileLine is -1 #6174

Open · wants to merge 3 commits into base: main

Conversation

roomote (bot) commented Jul 24, 2025

This PR implements a safeguard to prevent consuming the entire context window when reading large files with maxReadFileLine set to -1.

Problem

When maxReadFileLine is set to -1, the readFileTool reads the full contents of a file. This can be problematic if the file is huge since it could consume the entire context window.

Solution

Added a basic safeguard that (sketched in the code example after this list):

  1. Checks if a file has more than 1000 lines when maxReadFileLine is -1
  2. Uses tiktoken to count tokens in the file content
  3. If token count exceeds 50,000 (approximately 50% of a typical 100k context window), automatically switches to partial read mode
  4. Reads only the first 2000 lines with an informative notice explaining why
  5. Includes a fallback for very large files (>5000 lines) when token counting fails
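A minimal sketch of this flow, under stated assumptions: the planFullRead and countTokens helper names and the constant values mirror the description above, not the actual code in src/core/tools/readFileTool.ts.

// Illustrative sketch only; names and thresholds follow the PR description,
// not the PR's actual implementation.
const LARGE_FILE_LINE_THRESHOLD = 1000 // only count tokens for files above this
const MAX_TOKEN_THRESHOLD = 50_000 // ~50% of an assumed 100k context window
const FALLBACK_MAX_LINES = 2000 // partial read size when the safeguard triggers
const VERY_LARGE_FILE_LINES = 5000 // fallback trigger when token counting fails

interface ReadPlan {
	linesToRead: number // -1 means read the whole file
	notice?: string
}

function planFullRead(content: string, totalLines: number, countTokens: (text: string) => number): ReadPlan {
	if (totalLines <= LARGE_FILE_LINE_THRESHOLD) {
		return { linesToRead: -1 } // small file: existing behavior is unchanged
	}
	try {
		const tokenCount = countTokens(content)
		if (tokenCount > MAX_TOKEN_THRESHOLD) {
			return {
				linesToRead: FALLBACK_MAX_LINES,
				notice: `~${tokenCount.toLocaleString()} tokens; showing only the first ${FALLBACK_MAX_LINES} lines`,
			}
		}
	} catch {
		// Token counting failed; fall back to a pure line-count heuristic.
		if (totalLines > VERY_LARGE_FILE_LINES) {
			return {
				linesToRead: FALLBACK_MAX_LINES,
				notice: `${totalLines} lines; showing only the first ${FALLBACK_MAX_LINES} lines`,
			}
		}
	}
	return { linesToRead: -1 }
}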

Changes

  • Modified src/core/tools/readFileTool.ts to add the safeguard logic
  • Added comprehensive tests in src/core/tools/__tests__/readFileTool.spec.ts

Testing

  • All existing tests pass
  • Added new test cases covering:
    • Files with high token count that trigger the safeguard
    • Files with low token count that do not trigger the safeguard
    • Files under the line threshold that skip token counting entirely
    • Very large files when token counting fails
    • Safeguard not applying when maxReadFileLine is not -1
    • Line ranges bypassing the safeguard
    • Boundary conditions for thresholds

Performance Considerations

  • Token counting is only performed for files over 1000 lines to minimize overhead
  • The safeguard preserves the existing behavior for smaller files
  • The thresholds are configurable constants that can be adjusted if needed

Important

Adds a safeguard in readFileTool to limit reading large files to 2000 lines when maxReadFileLine is -1, using token count and file size checks.

  • Behavior:
    • Adds safeguard in readFileTool to limit reading to 2000 lines if maxReadFileLine is -1 and file is large.
    • Uses tiktoken to count tokens; if over 50,000 tokens, applies safeguard.
    • For files >1MB, applies safeguard if token counting fails.
    • Provides notice in output when safeguard is applied.
  • Testing:
    • Adds tests in readFileTool.spec.ts for large files, token count thresholds, and safeguard application.
    • Tests include scenarios for small files, large files, and token counting failures.
  • Performance:
    • Token counting only for files >1000 lines to reduce overhead.
    • Configurable thresholds for file size and token count.

This description was created by Ellipsis for 89790bc.

feat: add safeguard for large files in readFileTool when maxReadFileLine is -1

- Add token counting check using tiktoken for files over 1000 lines
- Automatically switch to partial read (first 2000 lines) when token count exceeds 50k
- Add fallback safeguard for very large files (>5000 lines) when token counting fails
- Include informative notice explaining why partial read is being used
- Add comprehensive test coverage for all safeguard scenarios

This prevents consuming the entire context window when reading very large files.
roomote (bot) requested review from mrubens, cte and jr as code owners on July 24, 2025 15:36
dosubot (bot) added the size:L (This PR changes 100-499 lines, ignoring generated files) and enhancement (New feature or request) labels on Jul 24, 2025
cte (Collaborator) commented Jul 24, 2025

@roomote-agent This looks promising, but can we get access to the actual context window size of the current model via the reference to the Task object and use that instead of assuming a 100K window size? Additionally, 1,000 lines is a bit low; let's bump that up to 10K lines.

roomote (bot, author) commented Jul 24, 2025

Hi @cte! I see your feedback. I'll work on:

  1. Getting the actual context window size from the Task object instead of assuming 100K
  2. Increasing the line threshold from 1,000 to 10,000 lines

Let me check out the branch and make these changes.

const lineRangeAttr = ` lines="1-${totalLines}"`
// Handle normal file read with safeguard for large files
// Define thresholds for the safeguard
const LARGE_FILE_LINE_THRESHOLD = 1000 // Consider files with more than 1000 lines as "large"
Consider moving the safeguard threshold constants (LARGE_FILE_LINE_THRESHOLD, MAX_TOKEN_THRESHOLD, FALLBACK_MAX_LINES) to a shared configuration or constants module to avoid magic numbers and improve maintainability.

This comment was generated because it violated a code review rule: irule_tTqpIuNs8DV0QFGj.
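For illustration, such an extraction might look like the sketch below; the module path and the exact constant values are assumptions, not part of this PR.

// src/core/tools/constants.ts (hypothetical path)
// Centralizing the safeguard thresholds documents them in one place and avoids
// repeating magic numbers inside readFileTool.ts.
export const LARGE_FILE_LINE_THRESHOLD = 1000
export const MAX_TOKEN_THRESHOLD = 50_000
export const FALLBACK_MAX_LINES = 2000

// readFileTool.ts would then import them:
// import { LARGE_FILE_LINE_THRESHOLD, MAX_TOKEN_THRESHOLD, FALLBACK_MAX_LINES } from "./constants"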

if (tokenCount > MAX_TOKEN_THRESHOLD) {
shouldApplySafeguard = true
linesToRead = FALLBACK_MAX_LINES
safeguardNotice = `<notice>This file contains ${totalLines} lines and approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n`
The safeguard notice message is hardcoded in English. For internationalization, consider using the i18n function (e.g., t()) so that the message can be translated for all supported languages.

This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.
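A sketch of what that could look like, assuming an i18next-style setup; the import, key name, and interpolation parameters are guesses rather than the project's actual i18n wiring.

// Hypothetical i18n usage; the key name and parameters are assumptions.
import i18next from "i18next"

function buildSafeguardNotice(totalLines: number, tokenCount: number, maxLines: number): string {
	// i18next interpolates {{totalLines}}, {{tokenCount}} and {{maxLines}} from
	// the options object into the translated string for the active locale.
	return `<notice>${i18next.t("readFile.largeFileSafeguard", {
		totalLines,
		tokenCount: tokenCount.toLocaleString(),
		maxLines,
	})}</notice>\n`
}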

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap on Jul 24, 2025
cte (Collaborator) commented Jul 24, 2025

@roomote-agent Instead of line count let's use file size to detect files that potentially contain a lot of tokens; this should handle the case where there's a file with a huge amount of content on a single line.

- Access actual context window size from Task object via cline.api.getModel().info.contextWindow
- Increase LARGE_FILE_LINE_THRESHOLD from 1,000 to 10,000 lines
- Calculate MAX_TOKEN_THRESHOLD as 50% of actual context window instead of hardcoded 50K
- Update tests to reflect new thresholds and mock api.getModel() properly
roomote (bot, author) commented Jul 24, 2025

Thanks for the feedback @cte! I understand - switching from line count to file size makes sense to better handle files with large amounts of content on single lines. I'll update the implementation to use file size as the primary detection method for potentially token-heavy files. Working on this now.

roomote (bot, author) commented Jul 24, 2025

Hi @cte! I've successfully implemented both requested changes:

Changes made:

  1. Dynamic context window sizing: Now accessing the actual context window size from cline.api.getModel().info.contextWindow instead of assuming 100K
  2. Increased line threshold: Bumped LARGE_FILE_LINE_THRESHOLD from 1,000 to 10,000 lines as requested

📝 Implementation details:

  • The MAX_TOKEN_THRESHOLD is now calculated as 50% of the actual model's context window
  • Added fallback to 100K if context window info is not available
  • Updated all related tests to mock the api.getModel() method properly
  • All tests are passing locally

🔍 CI Status:

  • Most checks have passed ✅
  • The Windows unit test failure appears to be a pre-existing issue (Ubuntu tests passed fine)

The changes ensure that the safeguard adapts to different model context windows while being less aggressive about triggering for moderately-sized files.
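A minimal sketch of that calculation, using simplified stand-in types for the Task/ApiHandler shapes; only the cline.api.getModel().info.contextWindow access path is taken from the discussion above.

// Simplified stand-ins; the real Task/ApiHandler types in the codebase differ.
interface ModelInfo {
	contextWindow?: number
}

interface ApiHandler {
	getModel(): { info: ModelInfo }
}

const DEFAULT_CONTEXT_WINDOW = 100_000 // fallback when model info is unavailable

function computeMaxTokenThreshold(api: ApiHandler): number {
	const contextWindow = api.getModel().info.contextWindow ?? DEFAULT_CONTEXT_WINDOW
	// Let a single file consume at most half of the model's context window.
	return Math.floor(contextWindow * 0.5)
}

// e.g. const maxTokens = computeMaxTokenThreshold(cline.api)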

- Replace line count threshold with file size threshold (100KB)
- Files larger than 100KB trigger token count check
- Files larger than 1MB automatically apply safeguard if token counting fails
- Update tests to reflect new file size-based approach
- This better handles files with large amounts of content on single lines

As requested by @cte in PR comment
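A minimal sketch of the file-size gate this commit describes; the constant and helper names are illustrative, and only the 100KB and 1MB thresholds come from the commit message.

import { stat } from "fs/promises"

// Illustrative helper; thresholds follow the commit message above.
const TOKEN_CHECK_SIZE_BYTES = 100 * 1024 // >100KB: worth running the token count
const FORCE_SAFEGUARD_SIZE_BYTES = 1024 * 1024 // >1MB: apply safeguard even if counting fails

async function classifyFileBySize(filePath: string): Promise<{ checkTokens: boolean; forceSafeguard: boolean; sizeKB: number }> {
	const { size } = await stat(filePath)
	return {
		checkTokens: size > TOKEN_CHECK_SIZE_BYTES,
		forceSafeguard: size > FORCE_SAFEGUARD_SIZE_BYTES,
		sizeKB: Math.round(size / 1024),
	}
}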
roomote (bot, author) commented Jul 24, 2025

Hi @cte! I've successfully implemented the requested change to use file size instead of line count for detecting large files.

Summary of changes:

  1. Replaced line count threshold with file size threshold: Files larger than 100KB now trigger the token count check (instead of files with more than 10,000 lines)

  2. Added file size-based safeguard: Files larger than 1MB automatically apply the safeguard if token counting fails

  3. Updated notice messages: The safeguard notices now show file size in KB instead of line count

  4. Updated all tests: Modified test suite to use file size mocking and updated test descriptions accordingly

This approach better handles edge cases where files might have a huge amount of content on single lines, as you mentioned. The implementation still respects the dynamic context window calculation that was added in the previous commit.

All tests are passing locally, and CI checks are currently running. The changes have been pushed to this PR branch.

if (tokenCount > MAX_TOKEN_THRESHOLD) {
shouldApplySafeguard = true
linesToRead = FALLBACK_MAX_LINES
safeguardNotice = `<notice>This file is ${fileSizeKB}KB and contains approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n`
Consider using the i18n translation function for the safeguard notice message (e.g. on line 550) instead of hardcoded English text, to ensure consistency with other user‐facing messages.

This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.

dosubot (bot) added the size:XL label (This PR changes 500-999 lines, ignoring generated files) and removed the size:L label on Jul 24, 2025
daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap on Jul 24, 2025
Labels: enhancement (New feature or request), PR - Draft / In Progress, size:XL (This PR changes 500-999 lines, ignoring generated files)
Projects: Roo Code Roadmap · Status: PR [Draft / In Progress]
3 participants