fix: exclude h3 subcategory links from h2 category extraction #5892

carlosgrillet · 2025-10-22T12:24:11Z

Summary

Categories with subcategories (h3) were incorrectly including all nested links. For example, "Command Line" would include links from both "Advanced Console UIs" and "Standard CLI" subcategories.

Fixes the documented bug at main.go:245-247.

Summary by CodeRabbit

Bug Fixes
- Improved category extraction to include sub-lists under headers and to derive categories from table-of-contents structures.
- Empty categories are now detected and skipped so processing continues instead of stopping.
Tests
- Added comprehensive tests covering subcategory exclusion, empty-category handling, TOC-driven extraction, and link-count/filtering validation.

github-actions · 2025-10-22T12:24:32Z

Automated Quality Checks (from CONTRIBUTING minimum standards)

Repo link: missing
pkg.go.dev: missing
goreportcard: missing
coverage: missing

These checks are a best-effort automation and do not replace human review.

coderabbitai · 2025-10-22T12:24:39Z

Walkthrough

Compute the TOC from body #contents by selecting the next ul under its first following element (fallback to NextFiltered("ul")); start processing categories from that TOC; treat empty categories as sentinel errCategoryNoLinks and continue on it; broaden link lookup to stop at h2 or h3.

Changes

Cohort / File(s)	Summary
Category extraction logic `main.go`	Introduced sentinel `errCategoryNoLinks`; compute TOC from the next `ul` under the first following element of `body #contents` (fallback to `NextFiltered("ul")`); `extractCategories` begins from that TOC and continues on `errCategoryNoLinks`; `extractCategory` uses `NextFilteredUntil("ul", "h2, h3")` to locate lists and returns `errCategoryNoLinks` for empty categories.
Category extraction tests `category_extraction_test.go`	Added tests validating category extraction: excluding subcategory-only links, extracting direct links, signaling `errCategoryNoLinks` for empty/subcategory-only categories, handling multiple categories via details/summary TOC, and skipping empty categories. Tests use `markdown.ToHTML`, `goquery`, `extractCategory`, `extractCategories`, and `errCategoryNoLinks`.

Sequence Diagram(s)

sequenceDiagram
  participant Main as main
  participant DOM as HTML DOM
  participant ExtractCats as extractCategories
  participant ExtractOne as extractCategory

  Note over Main,DOM #DDEBF7: Start category extraction
  Main->>DOM: Find `body #contents`
  DOM-->>Main: contents node
  Main->>ExtractCats: compute TOC: next `ul` under first following element (fallback NextFiltered("ul"))
  ExtractCats->>ExtractOne: for each TOC item -> extractCategory(item)
  alt links found
    ExtractOne->>DOM: NextFilteredUntil("ul","h2, h3") -> gather links
    DOM-->>ExtractOne: links list
    ExtractOne-->>ExtractCats: return category with links
  else no links (empty)
    ExtractOne-->>ExtractCats: return errCategoryNoLinks
    Note right of ExtractCats #FDE9D9: continue to next TOC item
  else other error
    ExtractOne-->>ExtractCats: return wrapped error
    ExtractCats-->>Main: abort with error
  end
  ExtractCats-->>Main: aggregated categories map

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

"I hopped through lists and headers bright,
Found TOCs hidden out of sight,
Skipped the empty with a twitch of nose,
Let parsing carry on its toes,
A rabbit's cheer for tidy bytes 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "fix: exclude h3 subcategory links from h2 category extraction" directly aligns with the stated objectives. The PR addresses a documented bug where h2 categories incorrectly included links from their h3 subcategories (e.g., "Command Line" was including links from "Advanced Console UIs" and "Standard CLI"). The changes in main.go modify the link extraction logic by using `NextFilteredUntil("ul", "h2, h3")` to stop at h3 headers, thereby excluding nested subcategory links, and the test suite validates this behavior. The title is concise, specific, and clearly communicates the primary fix without unnecessary detail.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

main.go (1)

223-225: Replace string error comparison with sentinel error or error type.

Comparing error messages using string equality is fragile and not idiomatic Go. If the error message in extractCategory changes, this check will silently break.

Apply this diff to use a sentinel error:

+var errCategoryNoLinks = errors.New("category does not contain links")
+
 func extractCategories(doc *goquery.Document) (map[string]Category, error) {
 	categories := make(map[string]Category)
 	var rootErr error

Then update the error handling:

 					category, err := extractCategory(doc, selector)
 					if err != nil {
-						if err.Error() == "build a category: category does not contain links" {
+						if errors.Is(err, errCategoryNoLinks) {
 							return true
 						}
 						rootErr = fmt.Errorf("extract category: %w", err)
 						return false
 					}

And modify extractCategory to return the sentinel error:

 		// FIXME: In this case we would have an empty category in main index.html with link to 404 page.
 		if len(links) == 0 {
-			err = errors.New("category does not contain links")
+			err = errCategoryNoLinks
 			return false
 		}

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90f89e3 and 00e6c7d.

📒 Files selected for processing (1)

main.go (3 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code

Files:

main.go

{**/*.go,.github/scripts/**/*.js}

📄 CodeRabbit inference engine (AGENTS.md)

Use American English spelling in code comments

Files:

main.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (2)

main.go (2)

251-251: Core fix looks correct.

The change from NextFilteredUntil("ul", "h2") to NextFilteredUntil("ul", "h2, h3") directly addresses the stated bug. By stopping at both h2 and h3 headers, the selector now correctly prevents h2 categories from including links that belong to h3 subcategories.

201-206: The TOC extraction change is intentional and part of the coordinated fix.

The diff reveals all three changes are intentional parts of a single commit. The TOC extraction logic (lines 201-206) adds defensive handling: it first tries Next().Find("ul").First() (for cases where ul is nested within the next sibling), then falls back to the original NextFiltered("ul") (for cases where ul is directly the next sibling). This suggests the author discovered the HTML structure can vary.

The main bug fix is clearly correct: changing NextFilteredUntil("ul", "h2") to NextFilteredUntil("ul", "h2, h3") (line 251) stops traversal at both h2 and h3 headers, preventing h3 subcategory links from being included in h2 categories—exactly as described in the removed FIXME comment.

One minor note: the error handling (lines 223-225) uses fragile string comparison (err.Error() == "..."), which is generally considered an anti-pattern, though it is functional in this context.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

main.go (1)

248-291: Add unit tests for the extractCategory and extractCategories functions.

The code changes modify critical category extraction logic, but verification shows no explicit unit tests exist for extractCategory or extractCategories. The PR's "needs-coverage" label and the coding guideline requirement of ≥80% coverage for non-data packages both indicate this gap must be addressed.

Add tests covering:

Categories with h3 subcategories correctly exclude subcategory links

H2 categories without h3 subcategories continue working

Empty categories return errCategoryNoLinks without error

The specific "Command Line" category bug case mentioned in the PR description

🧹 Nitpick comments (1)

main.go (1)
270-273: LGTM! Correct use of sentinel error for empty categories.

Returning errCategoryNoLinks when no links are found is appropriate. Combined with the error handling at lines 226-228, empty categories are now skipped entirely, which prevents the 404 page issue mentioned in the FIXME comment.

Consider updating the FIXME comment to reflect that empty categories are now skipped:
-// FIXME: In this case we would have an empty category in main index.html with link to 404 page.
+// Empty categories are skipped via errCategoryNoLinks to avoid creating 404 pages in index.html.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00e6c7d and bebe3ae.

📒 Files selected for processing (1)

main.go (5 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code

Files:

main.go

{**/*.go,.github/scripts/**/*.js}

📄 CodeRabbit inference engine (AGENTS.md)

Use American English spelling in code comments

Files:

main.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (4)

main.go (4)

37-38: LGTM! Idiomatic sentinel error.

The sentinel error declaration follows Go best practices and enables clear error handling for categories without links.

226-228: LGTM! Correct error handling for empty categories.

Using errors.Is() to detect the sentinel error and continuing iteration is the right approach. This allows categories without links to be skipped gracefully rather than aborting the entire extraction process.

254-254: LGTM! Core fix correctly excludes h3 subcategory links.

The change from NextFilteredUntil("ul", "h2") to NextFilteredUntil("ul", "h2, h3") correctly stops link extraction at both h2 and h3 headers, preventing subcategory links from being incorrectly included in parent categories. This directly addresses the documented bug mentioned in the PR description.

204-207: Verify whether the fallback selector is actually needed, or document its purpose with a clarifying comment.

Based on the current README.md structure and goldmark HTML conversion, the first selector (.Next().Find("ul").First()) should handle the actual structure where <h2 id="contents"> is followed by a <details> wrapper containing the list. The second selector (.NextFiltered("ul")) appears to be defensive code for a different structure where <ul> directly follows the heading.

Without test coverage showing the fallback path being used, it's unclear whether:

This is intentional defensive programming for potential README format variations, or

The fallback selector is dead code that can be removed

Consider either: (a) adding a comment explaining both selector paths and when each is used, as suggested in the original review, or (b) validating through tests whether the fallback is actually reachable and removing it if unnecessary.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

category_extraction_test.go (1)
115-124: Consider guarding against index out of bounds in the validation loop.

If the length check fails but the test continues (using t.Errorf rather than t.Fatalf), the loop at lines 120-123 could panic when accessing expectedLinks[i] if len(category.Links) > len(expectedLinks).

Consider one of these approaches:

Option 1: Use t.Fatalf to stop early
 	if len(category.Links) != 3 {
-		t.Errorf("Expected 3 links but got %d", len(category.Links))
+		t.Fatalf("Expected 3 links but got %d", len(category.Links))
 	}
Option 2: Guard the loop
 	expectedLinks := []string{"Tool A", "Tool B", "Tool C"}
-	for i, link := range category.Links {
+	for i := 0; i < min(len(category.Links), len(expectedLinks)); i++ {
+		link := category.Links[i]
 		if link.Title != expectedLinks[i] {
 			t.Errorf("Expected link %d to be '%s', got '%s'", i, expectedLinks[i], link.Title)
 		}
 	}

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bebe3ae and 93eb19b.

📒 Files selected for processing (1)

category_extraction_test.go (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code

Files:

category_extraction_test.go

{**/*.go,.github/scripts/**/*.js}

📄 CodeRabbit inference engine (AGENTS.md)

Use American English spelling in code comments

Files:

category_extraction_test.go

🧬 Code graph analysis (1)

category_extraction_test.go (1)

pkg/markdown/convert.go (1)

ToHTML (16-38)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

🔇 Additional comments (1)

category_extraction_test.go (1)

12-263: Excellent test coverage for the subcategory exclusion fix!

The test suite comprehensively validates the bug fix and covers multiple scenarios:

Core fix validation (excluding h3 subcategory links from h2 categories)

Simple categories without subcategories

Empty category error handling with sentinel error errCategoryNoLinks

Categories containing only subcategories

TOC-driven category extraction

Skipping empty categories in batch extraction

The test structure is clear, follows Go conventions, and provides good debugging output.

category_extraction_test.go

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

category_extraction_test.go (5)

51-54: Derive heading anchors from the generated HTML instead of hardcoding them.

Hardcoded IDs (e.g., "#test-category") couple tests to the current ID generator and can break with minor changes. Resolve by computing the anchor from the actual h2 element text.

Apply diffs like below (sample hunks shown; apply similarly to all anchor usages):

-	category, err := extractCategory(doc, "#test-category")
+	category, err := extractCategory(doc, anchorForH2(t, doc, "Test Category"))

-	category, err := extractCategory(doc, "#simple-category")
+	category, err := extractCategory(doc, anchorForH2(t, doc, "Simple Category"))

-	_, err = extractCategory(doc, "#empty-category")
+	_, err = extractCategory(doc, anchorForH2(t, doc, "Empty Category"))

-	_, err = extractCategory(doc, "#parent-only-category")
+	_, err = extractCategory(doc, anchorForH2(t, doc, "Parent Only Category"))

-	catA, ok := categories["#category-a"]
+	aAnchor := anchorForH2(t, doc, "Category A")
+	catA, ok := categories[aAnchor]

-	catB, ok := categories["#category-b"]
+	bAnchor := anchorForH2(t, doc, "Category B")
+	catB, ok := categories[bAnchor]

-	if _, ok := categories["#good-category"]; !ok {
+	if _, ok := categories[anchorForH2(t, doc, "Good Category")]; !ok {

-	if _, ok := categories["#another-good-category"]; !ok {
+	if _, ok := categories[anchorForH2(t, doc, "Another Good Category")]; !ok {

-	if _, ok := categories["#empty-category"]; ok {
+	if _, ok := categories[anchorForH2(t, doc, "Empty Category")]; ok {

Add this helper in the test file (outside the shown ranges):

func anchorForH2(t *testing.T, doc *goquery.Document, heading string) string {
	t.Helper()
	var out string
	doc.Find("h2").EachWithBreak(func(_ int, s *goquery.Selection) bool {
		if strings.TrimSpace(s.Text()) == heading {
			if id, ok := s.Attr("id"); ok {
				out = "#" + id
			}
			return false
		}
		return true
	})
	if out == "" {
		t.Fatalf("anchor for heading %q not found", heading)
	}
	return out
}

Remember to import "strings".

Also applies to: 106-109, 148-155, 187-195, 243-253, 255-262, 316-326

41-49: DRY the test setup with a shared docFromMarkdown helper.

The HTML conversion and goquery document creation are duplicated. Centralize for clarity and maintainability.

Apply this pattern (sample hunk for one test; replicate in others):

-	html, err := markdown.ToHTML([]byte(testMarkdown))
-	if err != nil {
-		t.Fatalf("Failed to convert markdown to HTML: %v", err)
-	}
-
-	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(html))
-	if err != nil {
-		t.Fatalf("Failed to create goquery document: %v", err)
-	}
+	doc := docFromMarkdown(t, testMarkdown)

Add the helper once in this file:

func docFromMarkdown(t *testing.T, md string) *goquery.Document {
	t.Helper()
	html, err := markdown.ToHTML([]byte(md))
	if err != nil {
		t.Fatalf("Failed to convert markdown to HTML: %v", err)
	}
	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(html))
	if err != nil {
		t.Fatalf("Failed to create goquery document: %v", err)
	}
	return doc
}

Also applies to: 96-104, 138-146, 177-185, 224-232, 296-304

12-13: Enable test parallelization.

These tests are independent; running them in parallel speeds up CI and surfaces hidden data races.

Insert at the top of each test:

 func TestCategoryExtractionDoesNotIncludeSubcategories(t *testing.T) {
+	t.Parallel()

(Repeat for the other test functions.)

Also applies to: 85-86, 127-128, 158-159, 197-199, 265-267

60-64: Remove or gate verbose logging in tests.

The per-link t.Logf noise isn’t needed for assertions and clutters CI output. Remove or guard with testing.Verbose().

-	t.Logf("Found %d links in category", len(category.Links))
-	for i, link := range category.Links {
-		t.Logf("  Link %d: %s - %s", i+1, link.Title, link.URL)
-	}
+	if testing.Verbose() {
+		t.Logf("Found %d links in category", len(category.Links))
+		for i, link := range category.Links {
+			t.Logf("  Link %d: %s - %s", i+1, link.Title, link.URL)
+		}
+	}

115-124: Strengthen assertions (URLs and titles).

Also assert URLs in TestCategoryWithoutSubcategories.
Assert Category B’s title to match expectations.

-	expectedLinks := []string{"Tool A", "Tool B", "Tool C"}
+	expectedLinks := []string{"Tool A", "Tool B", "Tool C"}
+	expectedURLs := []string{
+		"https://example.com/a",
+		"https://example.com/b",
+		"https://example.com/c",
+	}
 	for i, link := range category.Links {
 		if link.Title != expectedLinks[i] {
 			t.Errorf("Expected link %d to be '%s', got '%s'", i, expectedLinks[i], link.Title)
 		}
+		if link.URL != expectedURLs[i] {
+			t.Errorf("Expected URL for link %q to be %s, got %s", link.Title, expectedURLs[i], link.URL)
+		}
 	}

 	catB, ok := categories["#category-b"]
 	if !ok {
 		t.Error("Category B not found")
 	} else {
+		if catB.Title != "Category B" {
+			t.Errorf("Expected title 'Category B', got '%s'", catB.Title)
+		}
 		if len(catB.Links) != 1 {
 			t.Errorf("Expected 1 link in Category B, got %d", len(catB.Links))
 		}
 	}

Also applies to: 255-262

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93eb19b and 752012d.

📒 Files selected for processing (1)

category_extraction_test.go (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use standard formatting (gofmt) and idiomatic Go style
Favor small, testable functions in Go
Keep exported Go APIs documented with Go-style comments
Maintain ≥80% coverage for non-data packages and ≥90% for data packages when adding new testable Go code

Files:

category_extraction_test.go

{**/*.go,.github/scripts/**/*.js}

📄 CodeRabbit inference engine (AGENTS.md)

Use American English spelling in code comments

Files:

category_extraction_test.go

🧬 Code graph analysis (1)

category_extraction_test.go (1)

pkg/markdown/convert.go (1)

ToHTML (16-38)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Codacy Static Code Analysis

fix: exclude h3 subcategory links from h2 category extraction

00e6c7d

carlosgrillet had a problem deploying to action October 22, 2025 12:24 — with GitHub Actions Failure

github-actions bot added the needs-info\nneeds-coverage\nquality:fail label Oct 22, 2025

coderabbitai bot had a problem deploying to action October 22, 2025 12:26 Failure

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

refactor: use sentinel error for empty category handling

bebe3ae

carlosgrillet had a problem deploying to action October 22, 2025 12:54 — with GitHub Actions Failure

coderabbitai bot had a problem deploying to action October 22, 2025 12:56 Failure

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

test: test cases added for extractCategory and extractCategories

93eb19b

carlosgrillet had a problem deploying to action October 22, 2025 19:22 — with GitHub Actions Failure

coderabbitai bot had a problem deploying to action October 22, 2025 19:24 Failure

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

category_extraction_test.go Outdated Show resolved Hide resolved

fix(test): removed unused function

752012d

carlosgrillet had a problem deploying to action October 22, 2025 19:30 — with GitHub Actions Failure

coderabbitai bot had a problem deploying to action October 22, 2025 19:32 Failure

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

fix: exclude h3 subcategory links from h2 category extraction #5892

fix: exclude h3 subcategory links from h2 category extraction #5892

Uh oh!

carlosgrillet commented Oct 22, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

coderabbitai bot commented Oct 22, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

fix: exclude h3 subcategory links from h2 category extraction #5892

Are you sure you want to change the base?

fix: exclude h3 subcategory links from h2 category extraction #5892

Uh oh!

Conversation

carlosgrillet commented Oct 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

coderabbitai bot commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

carlosgrillet commented Oct 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 22, 2025 •

edited

Loading