- 
                Notifications
    You must be signed in to change notification settings 
- Fork 45
Ensure queue middleware releases semaphore entries #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Thank you for this detailed proposal addressing the memory leak in the queue middleware. This is indeed a serious issue that warrants careful consideration and thorough review before implementation. Initial Questions & ConcernsBefore moving forward, I believe the following points need clarification: 
 Specific Technical Concerns1. Defer Execution GuaranteesThe proposal states the defer will "execute before the panic-recovery block," but this depends entirely on defer declaration order. Could you provide: 
 2. Observable Behavior Changes
 This needs more scrutiny: 
 3. Test ReliabilityWhile fixing tests to work without external data is positive, we should verify: 
 Recommended Next StepsGiven the severity of this change, I strongly recommend we take a measured approach: 
 Request for Additional InformationBefore approving this change, please provide: 
 Community Input NeededThis is a critical fix for a serious production issue, but it touches core request handling logic. I'm calling for: 
 Please take the time to review this carefully. Memory leaks are serious, but introducing subtle bugs in panic recovery could be equally problematic. Let's get this right rather than getting it fast. | 
| in 98e8b29 Addressed
 Outstanding
 | 
| Thank you for the detailed explanation and the work that’s been put into this PR. We can see that a lot of thought has gone into addressing the panic paths, defer ordering, race conditions, and testing coverage, and we appreciate the thoroughness of your approach. However, there are a few areas where we’re still a bit unclear about the full impact and flow of the changes, especially when it comes to how the different components interact. Root Cause Analysis & Defer HandlingIt seems that the primary issue was with cleanup being skipped during panic paths, which you've addressed by introducing a new defer statement to prevent resource leaks. While this is great, we’d appreciate a bit more context on how this affects the overall behavior of the system, particularly in terms of timing and resource management. It would be helpful to understand if there’s any potential for unexpected behavior in high-concurrency scenarios where the defer order might be critical. Race Conditions & Request HandlingThe claim that the  It’s also worth highlighting that the deletion timing seems a bit critical here—while it’s good to hear that you’ve confirmed it won’t cause nil derefs or issues with metrics, there’s a certain level of complexity in the flow that’s not entirely clear to us. For example, how does the early deletion of  Testing & MonitoringThe testing coverage looks robust, with the addition of targeted tests for cache-redirect, generic panic handling, and high-concurrency timeouts. However, we’d like to better understand the scope of the tests and whether all possible edge cases (especially with regard to concurrency) have been covered. It’s great to hear that the new tests pass while the old code fails, but we’re wondering if there’s a need for additional test cases to cover less common or extreme conditions. The inclusion of the  Alternative Approaches & Rollback PlanWe understand the decision to reject the context.WithCancel-based cleanup approach, given that handlers call  On the rollback plan, it’s good that you’ve mentioned the revert process, but we’re curious whether there are any other safety nets in place in case something unexpected happens post-deployment. Would there be any immediate impact to the overall system if the change were to be reverted, or would it be more of a low-risk operation to roll back? A bit more visibility into the potential risks here would be reassuring. Canary and Proof of ConceptWe’re glad to hear that the changes are already in canary/staging to capture baseline versus post-fix behavior. It would be helpful to understand the specific metrics or indicators that we should be looking at during this period to verify the fix’s effectiveness, as well as any known limitations in the current testing environments that might require further validation in production. Outstanding ItemsIt sounds like the community review period and extended review window are still pending, so it might be helpful to clarify if there are any specific concerns that the team is hoping to address through this extended period. Additionally, we understand that an external load test is required to validate the canary environment, but we’re unclear on whether this test is already in progress or if it’s something we should be coordinating for separately. Final ThoughtsOverall, we definitely see the value in the work being done here, and we appreciate the effort to address multiple areas (panic recovery, defer ordering, race conditions, testing, and metrics). However, the overall flow of the changes, particularly in terms of how the different elements—defer execution, cleanup timing, race conditions, and testing—work together, is still a bit hard to fully track. It might help to break down the sequence of events more clearly, especially regarding the timing of defer execution and the impact on concurrent requests. We’re looking forward to hearing more details about these interactions and getting additional clarity around the outstanding items. Once we have a better understanding, we’ll be in a better position to provide a more thorough review. Thanks again for the hard work here, and we’re eager to see this move forward with some of these questions addressed! | 
| lmao | 
Problem Statement
The queue middleware tracks in-flight requests in inCourse. When the
handler bails out through one of its panic/recover shortcuts (e.g.,
redirecting to catch a cache), the clean-up code never runs, so the entry
stays forever. Under load, the map grows unbounded, leading to memory
pressure and eventual restarts.
Proposed Solution
Move inCourse.Delete(reqNum) into its own defer that executes before
the panic-recovery block. This guarantees the tracking entry is removed
regardless of how the handler exits. Tests were updated to validate
behavior even when the nostr system isn’t seeded.
Benefits
Stops inCourse from leaking entries, keeping memory use stable.
No change in request semantics; roll-back or panic paths still behave
the same.
Tests now pass without external data, so CI and local runs succeed.
##Trade-offs
Slightly earlier deletion means the entry may disappear a moment before
the recover logic runs, but the panic is already delivering the final
response; no functionality depends on the entry afterward.
Performance Implications
Detoxifies performance: prevents slow memory growth. Added defer is
negligible.
If Not Implemented
Leaked entries accumulate, eventually consuming memory and requiring
service restarts. Tests remain flaky in fresh environments, reducing
confidence in future changes.