-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: add safeguard for large files in readFileTool when maxReadFileLine is -1 #6174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ine is -1 - Add token counting check using tiktoken for files over 1000 lines - Automatically switch to partial read (first 2000 lines) when token count exceeds 50k - Add fallback safeguard for very large files (>5000 lines) when token counting fails - Include informative notice explaining why partial read is being used - Add comprehensive test coverage for all safeguard scenarios This prevents consuming the entire context window when reading very large files.
@roomote-agent This looks promising, but can we get access to the actual context window size of the current model via the reference to the Task object and use that instead of assuming a 100K window size? Additionally, 1,000 lines is a bit low; let's bump that up to 10K lines. |
Hi @cte! I see your feedback. Ill work on:
Let me check out the branch and make these changes. |
src/core/tools/readFileTool.ts
Outdated
const lineRangeAttr = ` lines="1-${totalLines}"` | ||
// Handle normal file read with safeguard for large files | ||
// Define thresholds for the safeguard | ||
const LARGE_FILE_LINE_THRESHOLD = 1000 // Consider files with more than 1000 lines as "large" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving the safeguard threshold constants (LARGE_FILE_LINE_THRESHOLD, MAX_TOKEN_THRESHOLD, FALLBACK_MAX_LINES) to a shared configuration or constants module to avoid magic numbers and improve maintainability.
This comment was generated because it violated a code review rule: irule_tTqpIuNs8DV0QFGj.
src/core/tools/readFileTool.ts
Outdated
if (tokenCount > MAX_TOKEN_THRESHOLD) { | ||
shouldApplySafeguard = true | ||
linesToRead = FALLBACK_MAX_LINES | ||
safeguardNotice = `<notice>This file contains ${totalLines} lines and approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safeguard notice message is hardcoded in English. For internationalization, consider using the i18n function (e.g., t()) so that the message can be translated for all supported languages.
This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.
@roomote-agent Instead of line count let's use file size to detect files that potentially contain a lot of tokens; this should handle the case where there's a file with a huge amount of content on a single line. |
- Access actual context window size from Task object via cline.api.getModel().info.contextWindow - Increase LARGE_FILE_LINE_THRESHOLD from 1,000 to 10,000 lines - Calculate MAX_TOKEN_THRESHOLD as 50% of actual context window instead of hardcoded 50K - Update tests to reflect new thresholds and mock api.getModel() properly
Thanks for the feedback @cte! I understand - switching from line count to file size makes sense to better handle files with large amounts of content on single lines. I'll update the implementation to use file size as the primary detection method for potentially token-heavy files. Working on this now. |
Hi @cte! Ive successfully implemented both requested changes: ✅ Changes made:
📝 Implementation details:
🔍 CI Status:
The changes ensure that the safeguard adapts to different model context windows while being less aggressive about triggering for moderately-sized files. |
- Replace line count threshold with file size threshold (100KB) - Files larger than 100KB trigger token count check - Files larger than 1MB automatically apply safeguard if token counting fails - Update tests to reflect new file size-based approach - This better handles files with large amounts of content on single lines As requested by @cte in PR comment
Hi @cte! I've successfully implemented the requested change to use file size instead of line count for detecting large files. Summary of changes:
This approach better handles edge cases where files might have a huge amount of content on single lines, as you mentioned. The implementation still respects the dynamic context window calculation that was added in the previous commit. All tests are passing locally, and CI checks are currently running. The changes have been pushed to this PR branch. |
if (tokenCount > MAX_TOKEN_THRESHOLD) { | ||
shouldApplySafeguard = true | ||
linesToRead = FALLBACK_MAX_LINES | ||
safeguardNotice = `<notice>This file is ${fileSizeKB}KB and contains approximately ${tokenCount.toLocaleString()} tokens, which could consume a significant portion of the context window. Showing only the first ${FALLBACK_MAX_LINES} lines to preserve context space. Use line_range if you need to read specific sections.</notice>\n` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using the i18n translation function for the safeguard notice message (e.g. on line 550) instead of hardcoded English text, to ensure consistency with other user‐facing messages.
This comment was generated because it violated a code review rule: irule_C0ez7Rji6ANcGkkX.
This PR implements a safeguard to prevent consuming the entire context window when reading large files with
maxReadFileLine
set to -1.Problem
When
maxReadFileLine
is set to -1, thereadFileTool
reads the full contents of a file. This can be problematic if the file is huge since it could consume the entire context window.Solution
Added a basic safeguard that:
maxReadFileLine
is -1Changes
src/core/tools/readFileTool.ts
to add the safeguard logicsrc/core/tools/__tests__/readFileTool.spec.ts
Testing
maxReadFileLine
is not -1Performance Considerations
Important
Adds a safeguard in
readFileTool
to limit reading large files to 2000 lines whenmaxReadFileLine
is -1, using token count and file size checks.readFileTool
to limit reading to 2000 lines ifmaxReadFileLine
is -1 and file is large.tiktoken
to count tokens; if over 50,000 tokens, applies safeguard.readFileTool.spec.ts
for large files, token count thresholds, and safeguard application.This description was created by
for 89790bc. You can customize this summary. It will automatically update as commits are pushed.