Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, images, metadata, sentiment, keywords/keyphrases, named entities, optional summaries, optional spelling suggestions, readability metrics and basic counts (characters, words, sentences, paragraphs), site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking. The parser now detects the article language and exposes ISO codes, with best-effort support for non-English content (features may fall back to English dictionaries when specific resources are missing).
- Prerequisites
- Install
- Usage
- Async/Await Example
- Options
- Development
- Dependencies
- Dev Dependencies
- License
Node.js >= 18, NPM >= 9. For Linux environments, ensure Chromium dependencies for Puppeteer are installed.
npm install horseman-article-parser --save

parseArticle(options, socket)

| Param | Type | Description |
| --- | --- | --- |
| options | Object | the options object |
| socket | Object | the optional socket |

Returns: Object - article parser results object
import { parseArticle } from "horseman-article-parser";
const options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"images",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};
(async () => {
try {
const article = await parseArticle(options);
const response = {
title: article.title.text,
excerpt: article.excerpt,
metadescription: article.meta.description.text,
url: article.url,
sentiment: {
score: article.sentiment.score,
comparative: article.sentiment.comparative,
},
keyphrases: article.processed.keyphrases,
keywords: article.processed.keywords,
people: article.people,
orgs: article.orgs,
places: article.places,
language: article.language,
readability: {
readingTime: article.readability.readingTime,
characters: article.readability.characters,
words: article.readability.words,
sentences: article.readability.sentences,
paragraphs: article.readability.paragraphs,
},
text: {
raw: article.processed.text.raw,
formatted: article.processed.text.formatted,
html: article.processed.text.html,
summary: article.processed.text.summary,
sentences: article.processed.text.sentences,
},
spelling: article.spelling,
meta: article.meta,
links: article.links,
images: article.images,
structuredData: article.structuredData,
lighthouse: article.lighthouse,
};
console.log(response);
} catch (error) {
console.log(error.message);
console.log(error.stack);
}
})();

Structured JSON-LD article nodes (including the original schema objects) are exposed via article.structuredData.
In-article structural elements such as tables, definition lists, and figures are normalised into article.structuredData.body.
Body images (with captions, alt text, titles, and data-src fallbacks when present) are returned under article.images when the feature is enabled.
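A minimal sketch of reading these fields (the URL here is a placeholder):

```js
import { parseArticle } from "horseman-article-parser";

const article = await parseArticle({
  url: "https://example.com/article",
  enabled: ["images"],
});

// JSON-LD article nodes, including the original schema objects
console.log(article.structuredData);

// Normalised tables, definition lists, and figures from the article body
console.log(article.structuredData.body);

// Body images with captions, alt text, titles, and data-src fallbacks where present
console.log(article.images);
```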
parseArticle(options, <socket>) accepts an optional socket for piping the response object, status messages, and errors to a front-end UI.
See horseman-article-parser-ui as an example.
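A hedged sketch of wiring a socket through, assuming a socket.io server (the event names here are illustrative, not necessarily the ones the UI uses):

```js
import { Server } from "socket.io";
import { parseArticle } from "horseman-article-parser";

const io = new Server(3000);

io.on("connection", (socket) => {
  socket.on("parse", async (url) => {
    try {
      // Status messages and errors are piped to the connected client via the socket
      const article = await parseArticle({ url }, socket);
      socket.emit("parsed", { title: article.title.text });
    } catch (error) {
      socket.emit("parse-error", error.message);
    }
  });
});
```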
The options below are set by default
var options = {
// Imposes a hard limit on how long the parser will run. When the limit is reached, the browser instance is closed and a timeout error is thrown.
// This prevents the parser from hanging indefinitely and ensures long‑running parses are cut off after the specified duration.
timeoutMs: 40000,
// puppeteer options (https://github.com/GoogleChrome/puppeteer)
puppeteer: {
// puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
launch: {
headless: true,
defaultViewport: null,
},
// Optional user agent and headers (some sites require a realistic UA)
// userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
// extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
// puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
goto: {
waitUntil: "domcontentloaded",
},
// Ignore content security policy
setBypassCSP: true,
},
// clean-html options (https://ghub.io/clean-html)
cleanhtml: {
"add-remove-tags": ["blockquote", "span"],
"remove-empty-tags": ["span"],
"replace-nbsp": true,
},
// html-to-text options (https://ghub.io/html-to-text)
htmltotext: {
wordwrap: 100,
noLinkBrackets: true,
ignoreHref: true,
tables: true,
uppercaseHeadings: true,
},
// retext-keywords options (https://ghub.io/retext-keywords)
retextkeywords: { maximum: 10 },
// content detection defaults (detector is always enabled)
contentDetection: {
// minimum characters required for a candidate
minLength: 400,
// maximum link density allowed for a candidate
maxLinkDensity: 0.5,
// optional: promote selection to a parent container when
// article paragraphs are split across sibling blocks
fragment: {
// require at least this many sibling parts containing paragraphs
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts (set higher to be stricter)
minCombinedChars: 400,
// override parent link-density threshold (default uses max(maxLinkDensity, 0.65))
// maxLinkDensity: 0.65
},
// reranker is disabled by default; enable after training weights
// Note: scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker
reranker: { enabled: false },
// optional: dump top-N candidates per page for labeling
// debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
},
// retext-spell defaults and output tweaks
retextspell: {
tweaks: {
// filter URL/domain-like tokens and long slugs by default
ignoreUrlLike: true,
// positions: only start by default
includeEndPosition: false,
// offsets excluded by default
includeOffsets: false,
},
},
};

At a minimum you should pass a url:
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
};

If you want to enable the advanced features, pass the following:
var options = {
url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
};Add "summary" to options.enabled to generate a short summary of the article text. The result
includes text.summary and a text.sentences array containing the first five sentences.
Add "readability" to options.enabled to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as article.readability with readingTime (seconds), characters, words, sentences, and paragraphs.
You may pass rules for returning an article's title and content. This is useful when the parser is unable to return the desired title or content, e.g.
rules: [
{
host: "www.bbc.co.uk",
content: () => {
var j = window.$;
j("article section, article figure, article header").remove();
return j("article").html();
},
},
{
host: "www.youtube.com",
title: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text;
},
content: () => {
return window.ytInitialData.contents.twoColumnWatchNextResults.results
.results.contents[1].videoSecondaryInfoRenderer.description.runs[0]
.text;
},
},
];

If you want to pass cookies to Puppeteer, use the following:
var options = {
puppeteer: {
cookies: [
{ name: "cookie1", value: "val1", domain: ".domain1" },
{ name: "cookie2", value: "val2", domain: ".domain2" },
],
},
};

To strip tags before processing, use the following:
var options = {
striptags: [".something", "#somethingelse"],
};

If you need to dismiss any popups, e.g. a privacy popup, use the following:
var options = {
clickelements: ["#button1", "#button2"],
};

There are some additional "complex" options available:
var options = {
// array of html elements to strip before analysis
striptags: [],
// array of resource types to block e.g. ['image' ]
blockedResourceTypes: [],
// array of resource source names (all resources from
// these sources are skipped) e.g. [ 'google', 'facebook' ]
skippedResources: [],
// retext spell options (https://ghub.io/retext-spell)
retextspell: {
// dictionary defaults to en-GB; you can override
// dictionary,
tweaks: {
// Filter URL/domain-like tokens and long slugs (default: true)
ignoreUrlLike: true,
// Include end position (endLine/endColumn) in each item (default: false)
includeEndPosition: false,
// Include offsets (offsetStart/offsetEnd) in each item (default: false)
includeOffsets: false
}
},
// compromise nlp options
nlp: { plugins: [ myPlugin, anotherPlugin ] }
}

Compromise is the natural language processor that allows horseman-article-parser to return topics, e.g. people, places & organisations. You can now pass custom plugins to Compromise to modify or add to the word lists like so:
/** add some names */
const testPlugin = (Doc, world) => {
world.addWords({
rishi: "FirstName",
sunak: "LastName",
});
};
const options = {
url: "https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies",
enabled: [
"lighthouse",
"screenshot",
"links",
"sentiment",
"entities",
"spelling",
"keywords",
"summary",
"readability",
],
// Optional: tweak spelling output/filters
retextspell: {
tweaks: {
ignoreUrlLike: true,
includeEndPosition: true,
includeOffsets: true,
},
},
nlp: {
plugins: [testPlugin],
},
};

By tagging new words as FirstName and LastName, the parser records fallback hints and can still detect the full name even if Compromise doesn't tag it directly. This allows us to match names which are not in the base Compromise word lists.
Check out the compromise plugin docs for more info.
loadNlpPlugins also accepts additional hint buckets for middle names and suffixes. You can provide them directly via options.nlp.hints or through a Compromise plugin using MiddleName/Suffix tags. These extra hints help prevent false splits in names that include common middle initials or honorifics.
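A hedged sketch of the plugin route (the tagged words are illustrative):

```js
/** tag extra name parts so loadNlpPlugins records them as MiddleName/Suffix hints */
const nameHintsPlugin = (Doc, world) => {
  world.addWords({
    luis: "MiddleName",
    jr: "Suffix",
  });
};

const options = {
  url: "https://example.com/article",
  nlp: { plugins: [nameHintsPlugin] },
};
```

The direct options.nlp.hints form, together with an optional secondary NER service, looks like this: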
const options = {
nlp: {
hints: {
first: ['José', 'Ana'],
middle: ['Luis', 'María'],
last: ['Rodríguez', 'López'],
suffix: ['Jr']
},
secondary: {
endpoint: 'https://ner.yourservice.example/people',
method: 'POST',
timeoutMs: 1500,
minConfidence: 0.65
}
}
}

When secondary is configured, the parser sends the article text to that endpoint (default payload { text: "…" }) and merges any PERSON entities it returns with the Compromise results. Responses that include a simple people array or spaCy-style ents collections are supported. If the service is unreachable or errors, the parser automatically falls back to Compromise-only detection.
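Two illustrative response bodies the parser can consume (shapes inferred from the description above; the offsets and any extra fields are assumptions):

```js
// Simple people array
const simpleResponse = {
  people: ["Rishi Sunak", "Theresa May"],
};

// spaCy-style ents collection; only PERSON entities are merged
const spacyStyleResponse = {
  ents: [{ text: "Rishi Sunak", label: "PERSON", start: 112, end: 123 }],
};
```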
The detector is always enabled and uses a structured-data-first strategy, falling back to heuristic scoring:
- Structured data: Extracts JSON-LD Article/NewsArticle (headline, articleBody).
- Heuristics: Gathers DOM candidates (e.g. article, main, [role=main], content-like containers) and scores them by text length, punctuation, link density, paragraph count, semantic tags, and boilerplate penalties.
- Fragment promotion: When content is split across sibling blocks, a fragmentation heuristic merges them into a single higher-level candidate.
- ML reranker (optional): If weights are supplied, a lightweight reranker can refine the heuristic ranking.
- Title detection: Chooses from the structured headline, og:title/twitter:title, the first <h1>, or document.title, with normalization.
- Debug dump (optional): Writes top-N candidates to CSV for dataset labeling.
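The debug dump in the last bullet corresponds to the commented-out debugDump default shown earlier; a minimal sketch of enabling it (placeholder URL):

```js
const options = {
  url: "https://example.com/article",
  contentDetection: {
    // write the top 5 candidates per page, including the source URL column
    debugDump: { path: "candidates_with_url.csv", topN: 5, addUrl: true },
  },
};
```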
You can tune thresholds and fragmentation frequency under options.contentDetection:
contentDetection: {
minLength: 400,
maxLinkDensity: 0.5,
fragment: {
// require at least this many sibling parts
minParts: 2,
// minimum text length per part
minChildChars: 150,
// minimum combined text across parts
minCombinedChars: 400
},
// enable after training weights
reranker: { enabled: false }
}Horseman automatically detects the article language and exposes ISO codes via article.language in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.
Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.
Build the dependencies with:
npm install
Lint the project files with:
npm run lint
Quick single-run (sanity check):
npm run sample:single -- --url "https://example.com/article"
Run quick tests and batches from this repo without writing code.
- merge:csv: Merge CSVs (utility for dataset building).
npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- sample:prepare: Fetch curated URLs from feeds/sitemaps into scripts/data/urls.txt.
npm run sample:prepare -- --count 200 --progress-only
- sample:single: Run a single URL parse and write JSON to scripts/results/single-sample-run-result.json.
npm run sample:single -- --url "https://example.com/article"
- sample:batch: Run the multi-URL sample with progress bar and summaries.
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --progress-only
- batch:crawl: Crawl URLs and dump content-candidate features to CSV.
npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
- train:ranker: Train reranker weights from a candidates CSV.
npm run train:ranker -- <candidatesCsv>
Shared flags for these scripts:
- --bar-width: progress bar width for scripts with progress bars.
- --feed-concurrency / --feed-timeout: tuning for curated feed collection.
sample:single writes a detailed JSON to scripts/results/single-sample-run-result.json.
npm run sample:single -- --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
# or run directly
node scripts/single-sample-run.js --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000

Parameters
- --url: the article page to parse.
- --timeout: maximum time (ms) for the parse. If omitted, the script uses its default (40000 ms).
- Fetch a fresh set of URLs:
npm run sample:prepare -- --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
# or run directly
node scripts/fetch-curated-urls.js --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only

Parameters
- --count: target number of URLs to collect into scripts/data/urls.txt.
- --feed-concurrency: number of feeds to fetch in parallel (optional).
- --feed-timeout: per-feed timeout in ms (optional).
- --bar-width: progress bar width (optional).
- --progress-only: print only progress updates (optional).
- Run a batch against unique hosts with a simple progress-only view. Progress and a final summary print to the console; JSON/CSV reports are saved under scripts/results/.
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
# or run directly
node scripts/batch-sample-run.js --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only

Parameters
- --count: number of URLs to process.
- --concurrency: number of concurrent parses.
- --urls-file: file containing URLs to parse.
- --timeout: maximum time (ms) allowed for each parse.
- --unique-hosts: ensure each sampled URL has a unique host (optional).
- --progress-only: print only progress updates (optional).
- --bar-width: progress bar width (optional).
You can train a simple logistic-regression reranker to improve candidate selection.
- Generate candidate features
  - Single URL (appends candidates):
    npm run sample:single -- --url <articleUrl>
  - Batch (recommended):
    npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
    Adjust --start and --limit to process in slices (e.g. --start 200 --limit 200, --start 400 --limit 200, ...).
    Parameters
    - --urls-file: input list of URLs to crawl
    - --out-file: output CSV file for candidate features
    - --start: start offset (row index) in the URLs file
    - --limit: number of URLs to process in this run
    - --concurrency: number of parallel crawlers
    - --unique-hosts: ensure each URL has a unique host (optional)
    - --progress-only: show only progress updates (optional)
  - The project dumps candidate features with URL by default (see scripts/single-sample-run.js):
    - Header: url,xpath,css_selector,text_length,punctuation_count,link_density,paragraph_count,has_semantic_container,boilerplate_penalty,direct_paragraph_count,direct_block_count,paragraph_to_block_ratio,average_paragraph_length,dom_depth,heading_children_count,aria_role_main,aria_role_negative,aria_hidden,image_alt_ratio,image_count,training_label,default_selected
    - Up to topN unique-XPath rows per page (default 5)
- Label the dataset
  - Open scripts/data/candidates_with_url.csv in a spreadsheet/editor.
  - For each URL group, set training_label = 1 for the correct article body candidate (leave others as 0).
  - Column meanings (subset):
    - url: source page
    - xpath: Chrome console snippet to select the container (e.g. $x('...')[0])
    - css_selector: Chrome console snippet to select via CSS (e.g. document.querySelector('...'))
    - text_length: raw character length
    - punctuation_count: count of punctuation (.,!?,;:)
    - link_density: ratio of link text length to total text (0..1)
    - paragraph_count: count of <p> and <br> nodes under the container
    - has_semantic_container: 1 if within article/main/role=main/itemtype*=Article, else 0
    - boilerplate_penalty: number of boilerplate containers detected (nav/aside/comments/social/newsletter/consent), capped
    - direct_paragraph_count, direct_block_count, paragraph_to_block_ratio, average_paragraph_length, dom_depth, heading_children_count: direct-children structure features used by heuristics
    - aria_role_main, aria_role_negative, aria_hidden: accessibility signals
    - image_alt_ratio, image_count: image accessibility metrics
    - training_label: 1 for the true article candidate; 0 otherwise
    - default_selected: 1 if this candidate would be chosen by the default heuristic (no custom weights)
- Train weights and export JSON
  - Via npm (use --silent and the arg separator):
    npm run --silent train:ranker -- scripts/data/candidates_with_url.csv > weights.json
  - Or run directly (avoids npm banner output):
    node scripts/train-reranker.js scripts/data/candidates_with_url.csv weights.json
    Parameters
    - scripts/data/candidates_with_url.csv: labeled candidates CSV (input)
    - weights.json: output weights file (JSON)
    Tips
    - -- passes subsequent args to the underlying script
    - > weights.json redirects stdout to a file
- Use the weights
  - scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker: options.contentDetection.reranker = { enabled: true, weights } (see the sketch after the notes below)
Notes
- If no reranker is configured, the detector uses heuristic scoring only.
- You can merge CSVs from multiple runs: npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
- Tip: placing a weights.json in the project root will make scripts/single-sample-run.js auto-enable the reranker on the next run.
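Outside of the sample scripts, a hedged sketch of loading trained weights yourself and passing them to the parser (placeholder URL):

```js
import fs from "node:fs";
import { parseArticle } from "horseman-article-parser";

// weights.json is the file produced by train:ranker above
const weights = JSON.parse(fs.readFileSync("weights.json", "utf8"));

const article = await parseArticle({
  url: "https://example.com/article",
  contentDetection: {
    reranker: { enabled: true, weights },
  },
});

console.log(article.title.text);
```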
Update API docs with:
npm run docs
- Puppeteer: High-level API to control Chrome or Chromium over the DevTools Protocol
- puppeteer-extra: Framework for puppeteer plugins
- puppeteer-extra-plugin-stealth: Plugin to evade detection
- puppeteer-extra-plugin-user-data-dir: Persist and reuse Chromium user data
- lighthouse: Automated auditing, performance metrics, and best practices
- compromise: Natural language processing in the browser
- retext: Natural language processor powered by plugins
- retext-pos: Plugin to add part-of-speech (POS) tags
- retext-keywords: Keyword extraction with Retext
- retext-spell: Spelling checker for retext
- retext-language: Language detection for retext
- franc: Fast language detection from text
- sentiment: AFINN-based sentiment analysis for Node.js
- jquery: JavaScript library for DOM operations
- jsdom: A JavaScript implementation of many web standards
- lodash: Lodash modular utilities
- absolutify: Relative to Absolute URL Replacer
- clean-html: HTML cleaner and beautifier
- dictionary-en-gb: English (United Kingdom) spelling dictionary in UTF-8
- html-to-text: Advanced HTML to plain text converter
- nlcst-to-string: Stringify NLCST
- eslint: An AST-based pattern checker for JavaScript
- eslint-plugin-import: Import with sanity
- eslint-plugin-json: Lint JSON files
- eslint-plugin-n: Additional ESLint rules for Node.js
- eslint-plugin-promise: Enforce best practices for JavaScript promises
This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details