-
-
Couldn't load subscription status.
- Fork 12.7k
fix: exclude h3 subcategory links from h2 category extraction #5892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: exclude h3 subcategory links from h2 category extraction #5892
Conversation
|
Automated Quality Checks (from CONTRIBUTING minimum standards)
These checks are a best-effort automation and do not replace human review. |
WalkthroughCompute the TOC from Changes
Sequence Diagram(s)sequenceDiagram
participant Main as main
participant DOM as HTML DOM
participant ExtractCats as extractCategories
participant ExtractOne as extractCategory
Note over Main,DOM #DDEBF7: Start category extraction
Main->>DOM: Find `body #contents`
DOM-->>Main: contents node
Main->>ExtractCats: compute TOC: next `ul` under first following element (fallback NextFiltered("ul"))
ExtractCats->>ExtractOne: for each TOC item -> extractCategory(item)
alt links found
ExtractOne->>DOM: NextFilteredUntil("ul","h2, h3") -> gather links
DOM-->>ExtractOne: links list
ExtractOne-->>ExtractCats: return category with links
else no links (empty)
ExtractOne-->>ExtractCats: return errCategoryNoLinks
Note right of ExtractCats #FDE9D9: continue to next TOC item
else other error
ExtractOne-->>ExtractCats: return wrapped error
ExtractCats-->>Main: abort with error
end
ExtractCats-->>Main: aggregated categories map
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
main.go (1)
223-225: Replace string error comparison with sentinel error or error type.Comparing error messages using string equality is fragile and not idiomatic Go. If the error message in
extractCategorychanges, this check will silently break.Apply this diff to use a sentinel error:
+var errCategoryNoLinks = errors.New("category does not contain links") + func extractCategories(doc *goquery.Document) (map[string]Category, error) { categories := make(map[string]Category) var rootErr errorThen update the error handling:
category, err := extractCategory(doc, selector) if err != nil { - if err.Error() == "build a category: category does not contain links" { + if errors.Is(err, errCategoryNoLinks) { return true } rootErr = fmt.Errorf("extract category: %w", err) return false }And modify
extractCategoryto return the sentinel error:// FIXME: In this case we would have an empty category in main index.html with link to 404 page. if len(links) == 0 { - err = errors.New("category does not contain links") + err = errCategoryNoLinks return false }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
main.go(3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code
Files:
main.go
{**/*.go,.github/scripts/**/*.js}
📄 CodeRabbit inference engine (AGENTS.md)
Use American English spelling in code comments
Files:
main.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (2)
main.go (2)
251-251: Core fix looks correct.The change from
NextFilteredUntil("ul", "h2")toNextFilteredUntil("ul", "h2, h3")directly addresses the stated bug. By stopping at both h2 and h3 headers, the selector now correctly prevents h2 categories from including links that belong to h3 subcategories.
201-206: The TOC extraction change is intentional and part of the coordinated fix.The diff reveals all three changes are intentional parts of a single commit. The TOC extraction logic (lines 201-206) adds defensive handling: it first tries
Next().Find("ul").First()(for cases whereulis nested within the next sibling), then falls back to the originalNextFiltered("ul")(for cases whereulis directly the next sibling). This suggests the author discovered the HTML structure can vary.The main bug fix is clearly correct: changing
NextFilteredUntil("ul", "h2")toNextFilteredUntil("ul", "h2, h3")(line 251) stops traversal at both h2 and h3 headers, preventing h3 subcategory links from being included in h2 categories—exactly as described in the removed FIXME comment.One minor note: the error handling (lines 223-225) uses fragile string comparison (
err.Error() == "..."), which is generally considered an anti-pattern, though it is functional in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
main.go (1)
248-291: Add unit tests for the extractCategory and extractCategories functions.The code changes modify critical category extraction logic, but verification shows no explicit unit tests exist for
extractCategoryorextractCategories. The PR's "needs-coverage" label and the coding guideline requirement of ≥80% coverage for non-data packages both indicate this gap must be addressed.Add tests covering:
- Categories with h3 subcategories correctly exclude subcategory links
- H2 categories without h3 subcategories continue working
- Empty categories return
errCategoryNoLinkswithout error- The specific "Command Line" category bug case mentioned in the PR description
🧹 Nitpick comments (1)
main.go (1)
270-273: LGTM! Correct use of sentinel error for empty categories.Returning
errCategoryNoLinkswhen no links are found is appropriate. Combined with the error handling at lines 226-228, empty categories are now skipped entirely, which prevents the 404 page issue mentioned in the FIXME comment.Consider updating the FIXME comment to reflect that empty categories are now skipped:
-// FIXME: In this case we would have an empty category in main index.html with link to 404 page. +// Empty categories are skipped via errCategoryNoLinks to avoid creating 404 pages in index.html.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
main.go(5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code
Files:
main.go
{**/*.go,.github/scripts/**/*.js}
📄 CodeRabbit inference engine (AGENTS.md)
Use American English spelling in code comments
Files:
main.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (4)
main.go (4)
37-38: LGTM! Idiomatic sentinel error.The sentinel error declaration follows Go best practices and enables clear error handling for categories without links.
226-228: LGTM! Correct error handling for empty categories.Using
errors.Is()to detect the sentinel error and continuing iteration is the right approach. This allows categories without links to be skipped gracefully rather than aborting the entire extraction process.
254-254: LGTM! Core fix correctly excludes h3 subcategory links.The change from
NextFilteredUntil("ul", "h2")toNextFilteredUntil("ul", "h2, h3")correctly stops link extraction at both h2 and h3 headers, preventing subcategory links from being incorrectly included in parent categories. This directly addresses the documented bug mentioned in the PR description.
204-207: Verify whether the fallback selector is actually needed, or document its purpose with a clarifying comment.Based on the current README.md structure and goldmark HTML conversion, the first selector (
.Next().Find("ul").First()) should handle the actual structure where<h2 id="contents">is followed by a<details>wrapper containing the list. The second selector (.NextFiltered("ul")) appears to be defensive code for a different structure where<ul>directly follows the heading.Without test coverage showing the fallback path being used, it's unclear whether:
- This is intentional defensive programming for potential README format variations, or
- The fallback selector is dead code that can be removed
Consider either: (a) adding a comment explaining both selector paths and when each is used, as suggested in the original review, or (b) validating through tests whether the fallback is actually reachable and removing it if unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
category_extraction_test.go (1)
115-124: Consider guarding against index out of bounds in the validation loop.If the length check fails but the test continues (using
t.Errorfrather thant.Fatalf), the loop at lines 120-123 could panic when accessingexpectedLinks[i]iflen(category.Links) > len(expectedLinks).Consider one of these approaches:
Option 1: Use
t.Fatalfto stop earlyif len(category.Links) != 3 { - t.Errorf("Expected 3 links but got %d", len(category.Links)) + t.Fatalf("Expected 3 links but got %d", len(category.Links)) }Option 2: Guard the loop
expectedLinks := []string{"Tool A", "Tool B", "Tool C"} - for i, link := range category.Links { + for i := 0; i < min(len(category.Links), len(expectedLinks)); i++ { + link := category.Links[i] if link.Title != expectedLinks[i] { t.Errorf("Expected link %d to be '%s', got '%s'", i, expectedLinks[i], link.Title) } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
category_extraction_test.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code
Files:
category_extraction_test.go
{**/*.go,.github/scripts/**/*.js}
📄 CodeRabbit inference engine (AGENTS.md)
Use American English spelling in code comments
Files:
category_extraction_test.go
🧬 Code graph analysis (1)
category_extraction_test.go (1)
pkg/markdown/convert.go (1)
ToHTML(16-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (1)
category_extraction_test.go (1)
12-263: Excellent test coverage for the subcategory exclusion fix!The test suite comprehensively validates the bug fix and covers multiple scenarios:
- Core fix validation (excluding h3 subcategory links from h2 categories)
- Simple categories without subcategories
- Empty category error handling with sentinel error
errCategoryNoLinks- Categories containing only subcategories
- TOC-driven category extraction
- Skipping empty categories in batch extraction
The test structure is clear, follows Go conventions, and provides good debugging output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
category_extraction_test.go (5)
51-54: Derive heading anchors from the generated HTML instead of hardcoding them.Hardcoded IDs (e.g., "#test-category") couple tests to the current ID generator and can break with minor changes. Resolve by computing the anchor from the actual h2 element text.
Apply diffs like below (sample hunks shown; apply similarly to all anchor usages):
- category, err := extractCategory(doc, "#test-category") + category, err := extractCategory(doc, anchorForH2(t, doc, "Test Category"))- category, err := extractCategory(doc, "#simple-category") + category, err := extractCategory(doc, anchorForH2(t, doc, "Simple Category"))- _, err = extractCategory(doc, "#empty-category") + _, err = extractCategory(doc, anchorForH2(t, doc, "Empty Category"))- _, err = extractCategory(doc, "#parent-only-category") + _, err = extractCategory(doc, anchorForH2(t, doc, "Parent Only Category"))- catA, ok := categories["#category-a"] + aAnchor := anchorForH2(t, doc, "Category A") + catA, ok := categories[aAnchor]- catB, ok := categories["#category-b"] + bAnchor := anchorForH2(t, doc, "Category B") + catB, ok := categories[bAnchor]- if _, ok := categories["#good-category"]; !ok { + if _, ok := categories[anchorForH2(t, doc, "Good Category")]; !ok {- if _, ok := categories["#another-good-category"]; !ok { + if _, ok := categories[anchorForH2(t, doc, "Another Good Category")]; !ok {- if _, ok := categories["#empty-category"]; ok { + if _, ok := categories[anchorForH2(t, doc, "Empty Category")]; ok {Add this helper in the test file (outside the shown ranges):
func anchorForH2(t *testing.T, doc *goquery.Document, heading string) string { t.Helper() var out string doc.Find("h2").EachWithBreak(func(_ int, s *goquery.Selection) bool { if strings.TrimSpace(s.Text()) == heading { if id, ok := s.Attr("id"); ok { out = "#" + id } return false } return true }) if out == "" { t.Fatalf("anchor for heading %q not found", heading) } return out }Remember to import "strings".
Also applies to: 106-109, 148-155, 187-195, 243-253, 255-262, 316-326
41-49: DRY the test setup with a shared docFromMarkdown helper.The HTML conversion and goquery document creation are duplicated. Centralize for clarity and maintainability.
Apply this pattern (sample hunk for one test; replicate in others):
- html, err := markdown.ToHTML([]byte(testMarkdown)) - if err != nil { - t.Fatalf("Failed to convert markdown to HTML: %v", err) - } - - doc, err := goquery.NewDocumentFromReader(bytes.NewReader(html)) - if err != nil { - t.Fatalf("Failed to create goquery document: %v", err) - } + doc := docFromMarkdown(t, testMarkdown)Add the helper once in this file:
func docFromMarkdown(t *testing.T, md string) *goquery.Document { t.Helper() html, err := markdown.ToHTML([]byte(md)) if err != nil { t.Fatalf("Failed to convert markdown to HTML: %v", err) } doc, err := goquery.NewDocumentFromReader(bytes.NewReader(html)) if err != nil { t.Fatalf("Failed to create goquery document: %v", err) } return doc }Also applies to: 96-104, 138-146, 177-185, 224-232, 296-304
12-13: Enable test parallelization.These tests are independent; running them in parallel speeds up CI and surfaces hidden data races.
Insert at the top of each test:
func TestCategoryExtractionDoesNotIncludeSubcategories(t *testing.T) { + t.Parallel()(Repeat for the other test functions.)
Also applies to: 85-86, 127-128, 158-159, 197-199, 265-267
60-64: Remove or gate verbose logging in tests.The per-link t.Logf noise isn’t needed for assertions and clutters CI output. Remove or guard with testing.Verbose().
- t.Logf("Found %d links in category", len(category.Links)) - for i, link := range category.Links { - t.Logf(" Link %d: %s - %s", i+1, link.Title, link.URL) - } + if testing.Verbose() { + t.Logf("Found %d links in category", len(category.Links)) + for i, link := range category.Links { + t.Logf(" Link %d: %s - %s", i+1, link.Title, link.URL) + } + }
115-124: Strengthen assertions (URLs and titles).
- Also assert URLs in TestCategoryWithoutSubcategories.
- Assert Category B’s title to match expectations.
- expectedLinks := []string{"Tool A", "Tool B", "Tool C"} + expectedLinks := []string{"Tool A", "Tool B", "Tool C"} + expectedURLs := []string{ + "https://example.com/a", + "https://example.com/b", + "https://example.com/c", + } for i, link := range category.Links { if link.Title != expectedLinks[i] { t.Errorf("Expected link %d to be '%s', got '%s'", i, expectedLinks[i], link.Title) } + if link.URL != expectedURLs[i] { + t.Errorf("Expected URL for link %q to be %s, got %s", link.Title, expectedURLs[i], link.URL) + } }catB, ok := categories["#category-b"] if !ok { t.Error("Category B not found") } else { + if catB.Title != "Category B" { + t.Errorf("Expected title 'Category B', got '%s'", catB.Title) + } if len(catB.Links) != 1 { t.Errorf("Expected 1 link in Category B, got %d", len(catB.Links)) } }Also applies to: 255-262
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
category_extraction_test.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code
Files:
category_extraction_test.go
{**/*.go,.github/scripts/**/*.js}
📄 CodeRabbit inference engine (AGENTS.md)
Use American English spelling in code comments
Files:
category_extraction_test.go
🧬 Code graph analysis (1)
category_extraction_test.go (1)
pkg/markdown/convert.go (1)
ToHTML(16-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Codacy Static Code Analysis
Summary
Categories with subcategories (h3) were incorrectly including all nested links. For example, "Command Line" would include links from both "Advanced Console UIs" and "Standard CLI" subcategories.
Fixes the documented bug at main.go:245-247.
Summary by CodeRabbit
Bug Fixes
Tests