Horseman Article Parser

Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, images, metadata, sentiment, keywords/keyphrases, named entities, optional summaries, optional spelling suggestions, readability metrics and basic counts (characters, words, sentences, paragraphs), site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking. The parser now detects the article language and exposes ISO codes, with best-effort support for non-English content (features may fall back to English dictionaries when specific resources are missing).


Prerequisites

Node.js >= 18, NPM >= 9. For Linux environments, ensure Chromium dependencies for Puppeteer are installed.

Install

npm install horseman-article-parser --save

Usage

parseArticle(options, socket) ⇒ Object

Param    Type    Description
options  Object  the options object
socket   Object  the optional socket

Returns: Object - article parser results object

Async/Await Example

import { parseArticle } from "horseman-article-parser";

const options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "images",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
};

(async () => {
  try {
    const article = await parseArticle(options);

    const response = {
      title: article.title.text,
      excerpt: article.excerpt,
      metadescription: article.meta.description.text,
      url: article.url,
      sentiment: {
        score: article.sentiment.score,
        comparative: article.sentiment.comparative,
      },
      keyphrases: article.processed.keyphrases,
      keywords: article.processed.keywords,
      people: article.people,
      orgs: article.orgs,
      places: article.places,
      language: article.language,
      readability: {
        readingTime: article.readability.readingTime,
        characters: article.readability.characters,
        words: article.readability.words,
        sentences: article.readability.sentences,
        paragraphs: article.readability.paragraphs,
      },
      text: {
        raw: article.processed.text.raw,
        formatted: article.processed.text.formatted,
        html: article.processed.text.html,
        summary: article.processed.text.summary,
        sentences: article.processed.text.sentences,
      },
      spelling: article.spelling,
      meta: article.meta,
      links: article.links,
      images: article.images,
      structuredData: article.structuredData,
      lighthouse: article.lighthouse,
    };

    console.log(response);
  } catch (error) {
    console.log(error.message);
    console.log(error.stack);
  }
})();

Structured JSON-LD article nodes (including the original schema objects) are exposed via article.structuredData. In-article structural elements such as tables, definition lists, and figures are normalised into article.structuredData.body. Body images (with captions, alt text, titles, and data-src fallbacks when present) are returned under article.images when the feature is enabled.
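
For example (a minimal sketch; the exact shape of individual structuredData.body and images entries depends on the page, and the URL is illustrative):

import { parseArticle } from "horseman-article-parser";

const article = await parseArticle({
  url: "https://example.com/article", // illustrative URL
  enabled: ["images"],
});

console.log(article.structuredData);      // JSON-LD article nodes, including the original schema objects
console.log(article.structuredData.body); // normalised tables, definition lists and figures
console.log(article.images);              // body images with captions, alt text, titles and data-src fallbacks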

parseArticle(options, socket) accepts an optional socket for piping the response object, status messages and errors to a front-end UI.

See horseman-article-parser-ui as an example.
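
A minimal sketch assuming a socket.io server (the "parse" and "result" event names here are illustrative, not part of the parser's API):

import { Server } from "socket.io";
import { parseArticle } from "horseman-article-parser";

const io = new Server(3000);

io.on("connection", (socket) => {
  socket.on("parse", async (url) => {
    // status messages and errors are piped to the connected client via the socket
    const article = await parseArticle({ url }, socket);
    socket.emit("result", { title: article.title.text, url: article.url });
  });
});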

Options

The options below are set by default

var options = {
  // Imposes a hard limit on how long the parser will run. When the limit is reached, the browser instance is closed and a timeout error is thrown.
  // This prevents the parser from hanging indefinitely and ensures long‑running parses are cut off after the specified duration.
  timeoutMs: 40000,
  // puppeteer options (https://github.com/GoogleChrome/puppeteer)
  puppeteer: {
    // puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
    launch: {
      headless: true,
      defaultViewport: null,
    },
    // Optional user agent and headers (some sites require a realistic UA)
    // userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
    // extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
    // puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
    goto: {
      waitUntil: "domcontentloaded",
    },
    // Ignore content security policy
    setBypassCSP: true,
  },
  // clean-html options (https://ghub.io/clean-html)
  cleanhtml: {
    "add-remove-tags": ["blockquote", "span"],
    "remove-empty-tags": ["span"],
    "replace-nbsp": true,
  },
  // html-to-text options (https://ghub.io/html-to-text)
  htmltotext: {
    wordwrap: 100,
    noLinkBrackets: true,
    ignoreHref: true,
    tables: true,
    uppercaseHeadings: true,
  },
  // retext-keywords options (https://ghub.io/retext-keywords)
  retextkeywords: { maximum: 10 },
  // content detection defaults (detector is always enabled)
  contentDetection: {
    // minimum characters required for a candidate
    minLength: 400,
    // maximum link density allowed for a candidate
    maxLinkDensity: 0.5,
    // optional: promote selection to a parent container when
    // article paragraphs are split across sibling blocks
    fragment: {
      // require at least this many sibling parts containing paragraphs
      minParts: 2,
      // minimum text length per part
      minChildChars: 150,
      // minimum combined text across parts (set higher to be stricter)
      minCombinedChars: 400,
      // override parent link-density threshold (default uses max(maxLinkDensity, 0.65))
      // maxLinkDensity: 0.65
    },
    // reranker is disabled by default; enable after training weights
    // Note: scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker
    reranker: { enabled: false },
    // optional: dump top-N candidates per page for labeling
    // debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
  },
  // retext-spell defaults and output tweaks
  retextspell: {
    tweaks: {
      // filter URL/domain-like tokens and long slugs by default
      ignoreUrlLike: true,
      // positions: only start by default
      includeEndPosition: false,
      // offsets excluded by default
      includeOffsets: false,
    },
  },
};

At a minimum you should pass a URL:

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
};

If you want to enable the advanced features, pass the following:

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
};

Add "summary" to options.enabled to generate a short summary of the article text. The result includes text.summary and a text.sentences array containing the first five sentences.

Add "readability" to options.enabled to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as article.readability with readingTime (seconds), characters, words, sentences, and paragraphs.

You may pass rules for returning an article's title & contents. This is useful when the parser is unable to return the desired title or content, e.g.

rules: [
  {
    host: "www.bbc.co.uk",
    content: () => {
      var j = window.$;
      j("article section, article figure, article header").remove();
      return j("article").html();
    },
  },
  {
    host: "www.youtube.com",
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results
        .results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text;
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results
        .results.contents[1].videoSecondaryInfoRenderer.description.runs[0]
        .text;
    },
  },
];

If you want to pass cookies to Puppeteer, use the following:

var options = {
  puppeteer: {
    cookies: [
      { name: "cookie1", value: "val1", domain: ".domain1" },
      { name: "cookie2", value: "val2", domain: ".domain2" },
    ],
  },
};

To strip tags before processing, use the following:

var options = {
  striptags: [".something", "#somethingelse"],
};

If you need to dismiss any popups, e.g. a privacy popup, use the following:

var options = {
  clickelements: ["#button1", "#button2"],
};

There are some additional "complex" options available:

var options = {

  // array of html elements to strip before analysis
  striptags: [],

  // array of resource types to block e.g. ['image']
  blockedResourceTypes: [],

  // array of resource source names (all resources from
  // these sources are skipped) e.g. [ 'google', 'facebook' ]
  skippedResources: [],


  // retext spell options (https://ghub.io/retext-spell)
  retextspell: {
    // dictionary defaults to en-GB; you can override
    // dictionary,
    tweaks: {
      // Filter URL/domain-like tokens and long slugs (default: true)
      ignoreUrlLike: true,
      // Include end position (endLine/endColumn) in each item (default: false)
      includeEndPosition: false,
      // Include offsets (offsetStart/offsetEnd) in each item (default: false)
      includeOffsets: false
    }
  },

  // compromise nlp options
  nlp: { plugins: [ myPlugin, anotherPlugin ] }

}
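
For example, to cut page weight during parsing you might block images and skip known third-party sources (a sketch reusing the illustrative values from the comments above; the URL is illustrative):

var options = {
  url: "https://example.com/article", // illustrative URL
  // don't download any image resources
  blockedResourceTypes: ["image"],
  // skip all resources served from these sources
  skippedResources: ["google", "facebook"],
};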

Using Compromise plugins to improve results

Compromise is the natural language processor that allows horseman-article-parser to return topics, e.g. people, places & organisations. You can pass custom plugins to Compromise to modify or add to its word lists like so:

/** add some names */
const testPlugin = (Doc, world) => {
  world.addWords({
    rishi: "FirstName",
    sunak: "LastName",
  });
};

const options = {
  url: "https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
  // Optional: tweak spelling output/filters
  retextspell: {
    tweaks: {
      ignoreUrlLike: true,
      includeEndPosition: true,
      includeOffsets: true,
    },
  },
  nlp: {
    plugins: [testPlugin],
  },
};

By tagging new words as FirstName and LastName, the parser records fallback hints and can still detect the full name even if Compromise doesn't tag it directly. This allows us to match names which are not in the base Compromise word lists.

Check out the compromise plugin docs for more info.

Extended name hints and secondary NER sources

loadNlpPlugins also accepts additional hint buckets for middle names and suffixes. You can provide them directly via options.nlp.hints or through a Compromise plugin using MiddleName/Suffix tags. These extra hints help prevent false splits in names that include common middle initials or honorifics.

const options = {
  nlp: {
    hints: {
      first: ['José', 'Ana'],
      middle: ['Luis', 'María'],
      last: ['Rodríguez', 'López'],
      suffix: ['Jr']
    },
    secondary: {
      endpoint: 'https://ner.yourservice.example/people',
      method: 'POST',
      timeoutMs: 1500,
      minConfidence: 0.65
    }
  }
}

When secondary is configured the parser will send the article text to that endpoint (default payload { text: "…" }) and merge any PERSON entities it returns with the Compromise results. Responses that include a simple people array or spaCy-style ents collections are supported. If the service is unreachable or errors, the parser automatically falls back to Compromise-only detection.
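
For reference, responses shaped like either of the following would be merged with the Compromise results (a sketch inferred from the description above; the values are illustrative and your service may return additional fields):

// Simple people array
{ "people": ["Rishi Sunak", "Theresa May"] }

// spaCy-style entity collection (only PERSON entities are used)
{ "ents": [{ "text": "Rishi Sunak", "label": "PERSON" }] }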

Content Detection

The detector is always enabled and uses a structured-data-first strategy, falling back to heuristic scoring:

  • Structured data: Extracts JSON-LD Article/NewsArticle (headline, articleBody).
  • Heuristics: Gathers DOM candidates (e.g., article, main, [role=main], content-like containers) and scores them by text length, punctuation, link density, paragraph count, semantic tags, and boilerplate penalties.
  • Fragment promotion: When content is split across sibling blocks, a fragmentation heuristic merges them into a single higher-level candidate.
  • ML reranker (optional): If weights are supplied, a lightweight reranker can refine the heuristic ranking.
  • Title detection: Chooses from structured headline, og:title/twitter:title, first <h1>, or document.title, with normalization.
  • Debug dump (optional): Write top-N candidates to CSV for dataset labeling.

You can tune thresholds and fragment promotion under options.contentDetection:

contentDetection: {
  minLength: 400,
  maxLinkDensity: 0.5,
  fragment: {
    // require at least this many sibling parts
    minParts: 2,
    // minimum text length per part
    minChildChars: 150,
    // minimum combined text across parts
    minCombinedChars: 400
  },
  // enable after training weights
  reranker: { enabled: false }
}
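
To capture candidates for labeling, the optional debug dump (commented out in the defaults above) can be switched on alongside these thresholds, e.g.:

contentDetection: {
  minLength: 400,
  maxLinkDensity: 0.5,
  // write the top 5 candidates per page (including the page URL) to CSV for labeling
  debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
}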

Language Detection

Horseman automatically detects the article language and exposes ISO codes via article.language in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.

Development

Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.

Install the dependencies with:

npm install

Lint the project files with:

npm run lint

Quick single-run (sanity check):

npm run sample:single -- --url "https://example.com/article"

Quick Start (CLI)

Run quick tests and batches from this repo without writing code.

Commands

  • merge:csv: Merge CSVs (utility for dataset building).
    • npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
  • sample:prepare: Fetch curated URLs from feeds/sitemaps into scripts/data/urls.txt.
    • npm run sample:prepare -- --count 200 --progress-only
  • sample:single: Run a single URL parse and write JSON to scripts/results/single-sample-run-result.json.
    • npm run sample:single -- --url "https://example.com/article"
  • sample:batch: Run the multi-URL sample with progress bar and summaries.
    • npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --progress-only
  • batch:crawl: Crawl URLs and dump content-candidate features to CSV.
    • npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
  • train:ranker: Train reranker weights from a candidates CSV.
    • npm run train:ranker -- <candidatesCsv>

Common arguments

  • --bar-width: progress bar width for scripts with progress bars.
  • --feed-concurrency / --feed-timeout: tuning for curated feed collection.

Single URL test

Writes a detailed JSON to scripts/results/single-sample-run-result.json.

npm run sample:single -- --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
# or run directly
node scripts/single-sample-run.js --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000

Parameters

  • --timeout: maximum time (ms) for the parse. If omitted, the test uses its default (40000 ms).
  • --url: the article page to parse.

Batch sampler (curated URLs, progress bar)

  1. Fetch a fresh set of URLs:
npm run sample:prepare -- --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
# or run directly
node scripts/fetch-curated-urls.js --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only

Parameters

  • --count: target number of URLs to collect into scripts/data/urls.txt.
  • --feed-concurrency: number of feeds to fetch in parallel (optional).
  • --feed-timeout: per-feed timeout in ms (optional).
  • --bar-width: progress bar width (optional).
  • --progress-only: print only progress updates (optional).
  2. Run a batch against unique hosts with a simple progress-only view. Progress and a final summary print to the console; JSON/CSV reports are saved under scripts/results/.
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
# or run directly
node scripts/batch-sample-run.js --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only

Parameters

  • --count: number of URLs to process.
  • --concurrency: number of concurrent parses.
  • --urls-file: file containing URLs to parse.
  • --timeout: maximum time (ms) allowed for each parse.
  • --unique-hosts: ensure each sampled URL has a unique host (optional).
  • --progress-only: print only progress updates (optional).
  • --bar-width: progress bar width (optional).

Training the Reranker (optional)

You can train a simple logistic-regression reranker to improve candidate selection.

  1. Generate candidate features
  • Single URL (appends candidates):

    • npm run sample:single -- --url <articleUrl>
  • Batch (recommended):

    • npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only

    • Adjust --start and --limit to process in slices (e.g., --start 200 --limit 200, --start 400 --limit 200, ...).

    Parameters

    • --urls-file: input list of URLs to crawl

    • --out-file: output CSV file for candidate features

    • --start: start offset (row index) in the URLs file

    • --limit: number of URLs to process in this run

    • --concurrency: number of parallel crawlers

    • --unique-hosts: ensure each URL has a unique host (optional)

    • --progress-only: show only progress updates (optional)

  • The project dumps candidate features with URL by default (see scripts/single-sample-run.js):

    • Header: url,xpath,css_selector,text_length,punctuation_count,link_density,paragraph_count,has_semantic_container,boilerplate_penalty,direct_paragraph_count,direct_block_count,paragraph_to_block_ratio,average_paragraph_length,dom_depth,heading_children_count,aria_role_main,aria_role_negative,aria_hidden,image_alt_ratio,image_count,training_label,default_selected
    • Up to topN unique-XPath rows per page (default 5)
  2. Label the dataset
  • Open scripts/data/candidates_with_url.csv in a spreadsheet/editor.
  • For each URL group, set label = 1 for the correct article body candidate (leave others as 0).
  • Column meanings (subset):
    • url: source page
    • xpath: Chrome console snippet to select the container (e.g., $x('...')[0])
    • css_selector: Chrome console snippet to select via CSS (e.g., document.querySelector('...'))
    • text_length: raw character length
    • punctuation_count: count of punctuation (.,!?,;:)
    • link_density: ratio of link text length to total text (0..1)
    • paragraph_count: count of <p> and <br> nodes under the container
    • has_semantic_container: 1 if within article/main/role=main/itemtype*=Article, else 0
    • boilerplate_penalty: number of boilerplate containers detected (nav/aside/comments/social/newsletter/consent), capped
    • direct_paragraph_count, direct_block_count, paragraph_to_block_ratio, average_paragraph_length, dom_depth, heading_children_count: direct-children structure features used by heuristics
    • aria_role_main, aria_role_negative, aria_hidden: accessibility signals
    • image_alt_ratio, image_count: image accessibility metrics
    • training_label: 1 for the true article candidate; 0 otherwise
    • default_selected: 1 if this candidate would be chosen by the default heuristic (no custom weights)
  3. Train weights and export JSON
  • Via npm (use --silent and arg separator):

    • npm run --silent train:ranker -- scripts/data/candidates_with_url.csv > weights.json
  • Or run directly (avoids npm banner output):

    • node scripts/train-reranker.js scripts/data/candidates_with_url.csv weights.json

    Parameters

    • scripts/data/candidates_with_url.csv: labeled candidates CSV (input)

    • weights.json: output weights file (JSON)

    Tips

    • -- passes subsequent args to the underlying script

    • > weights.json redirects stdout to a file

  4. Use the weights
  • scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker:
    • options.contentDetection.reranker = { enabled: true, weights }
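
To wire trained weights into your own code, a minimal sketch mirroring what scripts/single-sample-run.js does automatically (the URL is illustrative):

import fs from "node:fs";
import { parseArticle } from "horseman-article-parser";

// load the trained weights exported by train:ranker
const weights = JSON.parse(fs.readFileSync("weights.json", "utf8"));

const article = await parseArticle({
  url: "https://example.com/article", // illustrative URL
  contentDetection: { reranker: { enabled: true, weights } },
});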

Notes

  • If no reranker is configured, the detector uses heuristic scoring only.
  • You can merge CSVs from multiple runs: npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv.
  • Tip: placing a weights.json in the project root will make scripts/single-sample-run.js auto-enable the reranker on the next run.

Update API docs with:

npm run docs

Dependencies

Dev Dependencies

License

This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details
