Horseman Article Parser

Horseman is a focused article scraping module for the open web. It loads pages (dynamic or AMP), detects the main story body, and returns clean, structured content ready for downstream use. Alongside text and title, it includes in-article links, images, metadata, sentiment, keywords/keyphrases, named entities, optional summaries, optional spelling suggestions, readability metrics and basic counts (characters, words, sentences, paragraphs), site icon, and Lighthouse signals. It also copes with live blogs, applies simple per-domain tweaks (headers/cookies/goto), and uses Puppeteer + stealth to reduce blocking. The parser now detects the article language and exposes ISO codes, with best-effort support for non-English content (features may fall back to English dictionaries when specific resources are missing).


Prerequisites

Node.js >= 18, NPM >= 9. For Linux environments, ensure Chromium dependencies for Puppeteer are installed.

Install

npm install horseman-article-parser --save

Usage

parseArticle(options, socket) ⇒ Object

Param    Type    Description
options  Object  the options object
socket   Object  the optional socket

Returns: Object - article parser results object

Async/Await Example

import { parseArticle } from "horseman-article-parser";

const options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "images",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
};

(async () => {
  try {
    const article = await parseArticle(options);

    const response = {
      title: article.title.text,
      excerpt: article.excerpt,
      metadescription: article.meta.description.text,
      url: article.url,
      sentiment: {
        score: article.sentiment.score,
        comparative: article.sentiment.comparative,
      },
      keyphrases: article.processed.keyphrases,
      keywords: article.processed.keywords,
      people: article.people,
      orgs: article.orgs,
      places: article.places,
      language: article.language,
      readability: {
        readingTime: article.readability.readingTime,
        characters: article.readability.characters,
        words: article.readability.words,
        sentences: article.readability.sentences,
        paragraphs: article.readability.paragraphs,
      },
      text: {
        raw: article.processed.text.raw,
        formatted: article.processed.text.formatted,
        html: article.processed.text.html,
        summary: article.processed.text.summary,
        sentences: article.processed.text.sentences,
      },
      spelling: article.spelling,
      meta: article.meta,
      links: article.links,
      images: article.images,
      structuredData: article.structuredData,
      lighthouse: article.lighthouse,
    };

    console.log(response);
  } catch (error) {
    console.log(error.message);
    console.log(error.stack);
  }
})();

Structured JSON-LD article nodes (including the original schema objects) are exposed via article.structuredData. In-article structural elements such as tables, definition lists, and figures are normalised into article.structuredData.body. Body images (with captions, alt text, titles, and data-src fallbacks when present) are returned under article.images when the feature is enabled.
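
For example (a minimal sketch; the exact shape of individual structuredData.body and images entries depends on the page, and the URL is illustrative):

import { parseArticle } from "horseman-article-parser";

const article = await parseArticle({
  url: "https://example.com/article", // illustrative URL
  enabled: ["images"],
});

console.log(article.structuredData);      // JSON-LD article nodes, including the original schema objects
console.log(article.structuredData.body); // normalised tables, definition lists and figures
console.log(article.images);              // body images with captions, alt text, titles and data-src fallbacks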

parseArticle(options, socket) accepts an optional socket for piping the response object, status messages and errors to a front-end UI.

See horseman-article-parser-ui as an example.
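
A minimal sketch assuming a socket.io server (the "parse" and "result" event names here are illustrative, not part of the parser's API):

import { Server } from "socket.io";
import { parseArticle } from "horseman-article-parser";

const io = new Server(3000);

io.on("connection", (socket) => {
  socket.on("parse", async (url) => {
    // status messages and errors are piped to the connected client via the socket
    const article = await parseArticle({ url }, socket);
    socket.emit("result", { title: article.title.text, url: article.url });
  });
});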

Options

The options below are set by default

var options = {
  // Imposes a hard limit on how long the parser will run. When the limit is reached, the browser instance is closed and a timeout error is thrown.
  // This prevents the parser from hanging indefinitely and ensures long‑running parses are cut off after the specified duration.
  timeoutMs: 40000,
  // puppeteer options (https://github.com/GoogleChrome/puppeteer)
  puppeteer: {
    // puppeteer launch options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions)
    launch: {
      headless: true,
      defaultViewport: null,
    },
    // Optional user agent and headers (some sites require a realistic UA)
    // userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
    // extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' },
    // puppeteer goto options (https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagegotourl-options)
    goto: {
      waitUntil: "domcontentloaded",
    },
    // Ignore content security policy
    setBypassCSP: true,
  },
  // clean-html options (https://ghub.io/clean-html)
  cleanhtml: {
    "add-remove-tags": ["blockquote", "span"],
    "remove-empty-tags": ["span"],
    "replace-nbsp": true,
  },
  // html-to-text options (https://ghub.io/html-to-text)
  htmltotext: {
    wordwrap: 100,
    noLinkBrackets: true,
    ignoreHref: true,
    tables: true,
    uppercaseHeadings: true,
  },
  // retext-keywords options (https://ghub.io/retext-keywords)
  retextkeywords: { maximum: 10 },
  // content detection defaults (detector is always enabled)
  contentDetection: {
    // minimum characters required for a candidate
    minLength: 400,
    // maximum link density allowed for a candidate
    maxLinkDensity: 0.5,
    // optional: promote selection to a parent container when
    // article paragraphs are split across sibling blocks
    fragment: {
      // require at least this many sibling parts containing paragraphs
      minParts: 2,
      // minimum text length per part
      minChildChars: 150,
      // minimum combined text across parts (set higher to be stricter)
      minCombinedChars: 400,
      // override parent link-density threshold (default uses max(maxLinkDensity, 0.65))
      // maxLinkDensity: 0.65
    },
    // reranker is disabled by default; enable after training weights
    // Note: scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker
    reranker: { enabled: false },
    // optional: dump top-N candidates per page for labeling
    // debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
  },
  // retext-spell defaults and output tweaks
  retextspell: {
    tweaks: {
      // filter URL/domain-like tokens and long slugs by default
      ignoreUrlLike: true,
      // positions: only start by default
      includeEndPosition: false,
      // offsets excluded by default
      includeOffsets: false,
    },
  },
};

At a minimum you should pass a URL:

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
};

If you want to enable the advanced features, pass the following:

var options = {
  url: "https://www.theguardian.com/politics/2018/sep/24/theresa-may-calls-for-immigration-based-on-skills-and-wealth",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
};

Add "summary" to options.enabled to generate a short summary of the article text. The result includes text.summary and a text.sentences array containing the first five sentences.

Add "readability" to options.enabled to evaluate readability, estimate reading time, and gather basic text statistics. The result is available as article.readability with readingTime (seconds), characters, words, sentences, and paragraphs.

You may pass rules for returning an article's title & contents. This is useful when the parser is unable to return the desired title or content, e.g.

rules: [
  {
    host: "www.bbc.co.uk",
    content: () => {
      var j = window.$;
      j("article section, article figure, article header").remove();
      return j("article").html();
    },
  },
  {
    host: "www.youtube.com",
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results
        .results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text;
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results
        .results.contents[1].videoSecondaryInfoRenderer.description.runs[0]
        .text;
    },
  },
];

If you want to pass cookies to Puppeteer, use the following:

var options = {
  puppeteer: {
    cookies: [
      { name: "cookie1", value: "val1", domain: ".domain1" },
      { name: "cookie2", value: "val2", domain: ".domain2" },
    ],
  },
};

To strip tags before processing, use the following:

var options = {
  striptags: [".something", "#somethingelse"],
};

If you need to dismiss any popups, e.g. a privacy popup, use the following:

var options = {
  clickelements: ["#button1", "#button2"],
};

There are some additional "complex" options available:

var options = {

  // array of html elements to strip before analysis
  striptags: [],

  // array of resource types to block e.g. ['image']
  blockedResourceTypes: [],

  // array of resource source names (all resources from
  // these sources are skipped) e.g. [ 'google', 'facebook' ]
  skippedResources: [],


  // retext spell options (https://ghub.io/retext-spell)
  retextspell: {
    // dictionary defaults to en-GB; you can override
    // dictionary,
    tweaks: {
      // Filter URL/domain-like tokens and long slugs (default: true)
      ignoreUrlLike: true,
      // Include end position (endLine/endColumn) in each item (default: false)
      includeEndPosition: false,
      // Include offsets (offsetStart/offsetEnd) in each item (default: false)
      includeOffsets: false
    }
  },

  // compromise nlp options
  nlp: { plugins: [ myPlugin, anotherPlugin ] }

}
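
For example, to cut page weight during parsing you might block images and skip known third-party sources (a sketch reusing the illustrative values from the comments above; the URL is illustrative):

var options = {
  url: "https://example.com/article", // illustrative URL
  // don't download any image resources
  blockedResourceTypes: ["image"],
  // skip all resources served from these sources
  skippedResources: ["google", "facebook"],
};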

Using Compromise plugins to improve results

Compromise is the natural language processor that allows horseman-article-parser to return topics, e.g. people, places & organisations. You can pass custom plugins to Compromise to modify or add to its word lists like so:

/** add some names */
const testPlugin = (Doc, world) => {
  world.addWords({
    rishi: "FirstName",
    sunak: "LastName",
  });
};

const options = {
  url: "https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies",
  enabled: [
    "lighthouse",
    "screenshot",
    "links",
    "sentiment",
    "entities",
    "spelling",
    "keywords",
    "summary",
    "readability",
  ],
  // Optional: tweak spelling output/filters
  retextspell: {
    tweaks: {
      ignoreUrlLike: true,
      includeEndPosition: true,
      includeOffsets: true,
    },
  },
  nlp: {
    plugins: [testPlugin],
  },
};

By tagging new words as FirstName and LastName, the parser records fallback hints and can still detect the full name even if Compromise doesn't tag it directly. This allows us to match names which are not in the base Compromise word lists.

Check out the compromise plugin docs for more info.

Extended name hints and secondary NER sources

loadNlpPlugins also accepts additional hint buckets for middle names and suffixes. You can provide them directly via options.nlp.hints or through a Compromise plugin using MiddleName/Suffix tags. These extra hints help prevent false splits in names that include common middle initials or honorifics.

const options = {
  nlp: {
    hints: {
      first: ['José', 'Ana'],
      middle: ['Luis', 'María'],
      last: ['Rodríguez', 'López'],
      suffix: ['Jr']
    },
    secondary: {
      endpoint: 'https://ner.yourservice.example/people',
      method: 'POST',
      timeoutMs: 1500,
      minConfidence: 0.65
    }
  }
}

When secondary is configured the parser will send the article text to that endpoint (default payload { text: "…" }) and merge any PERSON entities it returns with the Compromise results. Responses that include a simple people array or spaCy-style ents collections are supported. If the service is unreachable or errors, the parser automatically falls back to Compromise-only detection.
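
For reference, responses shaped like either of the following would be merged with the Compromise results (a sketch inferred from the description above; the values are illustrative and your service may return additional fields):

// Simple people array
{ "people": ["Rishi Sunak", "Theresa May"] }

// spaCy-style entity collection (only PERSON entities are used)
{ "ents": [{ "text": "Rishi Sunak", "label": "PERSON" }] }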

Content Detection

The detector is always enabled and uses a structured-data-first strategy, falling back to heuristic scoring:

  • Structured data: Extracts JSON-LD Article/NewsArticle (headline, articleBody).
  • Heuristics: Gathers DOM candidates (e.g., article, main, [role=main], content-like containers) and scores them by text length, punctuation, link density, paragraph count, semantic tags, and boilerplate penalties.
  • Fragment promotion: When content is split across sibling blocks, a fragmentation heuristic merges them into a single higher-level candidate.
  • ML reranker (optional): If weights are supplied, a lightweight reranker can refine the heuristic ranking.
  • Title detection: Chooses from structured headline, og:title/twitter:title, first <h1>, or document.title, with normalization.
  • Debug dump (optional): Write top-N candidates to CSV for dataset labeling.

You can tune thresholds and fragment promotion under options.contentDetection:

contentDetection: {
  minLength: 400,
  maxLinkDensity: 0.5,
  fragment: {
    // require at least this many sibling parts
    minParts: 2,
    // minimum text length per part
    minChildChars: 150,
    // minimum combined text across parts
    minCombinedChars: 400
  },
  // enable after training weights
  reranker: { enabled: false }
}
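
To capture candidates for labeling, the optional debug dump (commented out in the defaults above) can be switched on alongside these thresholds, e.g.:

contentDetection: {
  minLength: 400,
  maxLinkDensity: 0.5,
  // write the top 5 candidates per page (including the page URL) to CSV for labeling
  debugDump: { path: 'candidates_with_url.csv', topN: 5, addUrl: true }
}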

Language Detection

Horseman automatically detects the article language and exposes ISO codes via article.language in the result. Downstream steps such as keyword extraction or spelling use these codes to select language-specific resources when available. Dictionaries for English, French, and Spanish are bundled; other languages fall back to English if a matching dictionary or NLP plugin is not found.

Development

Please feel free to fork the repo or open pull requests to the development branch. I've used eslint for linting.

Install the dependencies with:

npm install

Lint the project files with:

npm run lint

Quick single-run (sanity check):

npm run sample:single -- --url "https://example.com/article"

Quick Start (CLI)

Run quick tests and batches from this repo without writing code.

Commands

  • merge:csv: Merge CSVs (utility for dataset building).
    • npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv
  • sample:prepare: Fetch curated URLs from feeds/sitemaps into scripts/data/urls.txt.
    • npm run sample:prepare -- --count 200 --progress-only
  • sample:single: Run a single URL parse and write JSON to scripts/results/single-sample-run-result.json.
    • npm run sample:single -- --url "https://example.com/article"
  • sample:batch: Run the multi-URL sample with progress bar and summaries.
    • npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --progress-only
  • batch:crawl: Crawl URLs and dump content-candidate features to CSV.
    • npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only
  • train:ranker: Train reranker weights from a candidates CSV.
    • npm run train:ranker -- <candidatesCsv>

Common arguments

  • --bar-width: progress bar width for scripts with progress bars.
  • --feed-concurrency / --feed-timeout: tuning for curated feed collection.

Single URL test

Writes a detailed JSON to scripts/results/single-sample-run-result.json.

npm run sample:single -- --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000
# or run directly
node scripts/single-sample-run.js --url "https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html" --timeout 40000

Parameters

  • --timeout: maximum time (ms) for the parse. If omitted, the test uses its default (40000 ms).
  • --url: the article page to parse.

Batch sampler (curated URLs, progress bar)

  1. Fetch a fresh set of URLs:
npm run sample:prepare -- --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only
# or run directly
node scripts/fetch-curated-urls.js --count 200 --feed-concurrency 8 --feed-timeout 15000 --bar-width 20 --progress-only

Parameters

  • --count: target number of URLs to collect into scripts/data/urls.txt.
  • --feed-concurrency: number of feeds to fetch in parallel (optional).
  • --feed-timeout: per-feed timeout in ms (optional).
  • --bar-width: progress bar width (optional).
  • --progress-only: print only progress updates (optional).
  2. Run a batch against unique hosts with a simple progress-only view. Progress and a final summary print to the console; JSON/CSV reports are saved under scripts/results/.
npm run sample:batch -- --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only
# or run directly
node scripts/batch-sample-run.js --count 100 --concurrency 5 --urls-file scripts/data/urls.txt --timeout 20000 --unique-hosts --bar-width 20 --progress-only

Parameters

  • --count: number of URLs to process.
  • --concurrency: number of concurrent parses.
  • --urls-file: file containing URLs to parse.
  • --timeout: maximum time (ms) allowed for each parse.
  • --unique-hosts: ensure each sampled URL has a unique host (optional).
  • --progress-only: print only progress updates (optional).
  • --bar-width: progress bar width (optional).

Training the Reranker (optional)

You can train a simple logistic-regression reranker to improve candidate selection.

  1. Generate candidate features
  • Single URL (appends candidates):

    • npm run sample:single -- --url <articleUrl>
  • Batch (recommended):

    • npm run batch:crawl -- --urls-file scripts/data/urls.txt --out-file scripts/data/candidates_with_url.csv --start 0 --limit 200 --concurrency 1 --unique-hosts --progress-only

    • Adjust --start and --limit to process in slices (e.g., --start 200 --limit 200, --start 400 --limit 200, ...).

    Parameters

    • --urls-file: input list of URLs to crawl

    • --out-file: output CSV file for candidate features

    • --start: start offset (row index) in the URLs file

    • --limit: number of URLs to process in this run

    • --concurrency: number of parallel crawlers

    • --unique-hosts: ensure each URL has a unique host (optional)

    • --progress-only: show only progress updates (optional)

  • The project dumps candidate features with URL by default (see scripts/single-sample-run.js):

    • Header: url,xpath,css_selector,text_length,punctuation_count,link_density,paragraph_count,has_semantic_container,boilerplate_penalty,direct_paragraph_count,direct_block_count,paragraph_to_block_ratio,average_paragraph_length,dom_depth,heading_children_count,aria_role_main,aria_role_negative,aria_hidden,image_alt_ratio,image_count,training_label,default_selected
    • Up to topN unique-XPath rows per page (default 5)
  2. Label the dataset
  • Open scripts/data/candidates_with_url.csv in a spreadsheet/editor.
  • For each URL group, set label = 1 for the correct article body candidate (leave others as 0).
  • Column meanings (subset):
    • url: source page
    • xpath: Chrome console snippet to select the container (e.g., $x('...')[0])
    • css_selector: Chrome console snippet to select via CSS (e.g., document.querySelector('...'))
    • text_length: raw character length
    • punctuation_count: count of punctuation (.,!?,;:)
    • link_density: ratio of link text length to total text (0..1)
    • paragraph_count: count of <p> and <br> nodes under the container
    • has_semantic_container: 1 if within article/main/role=main/itemtype*=Article, else 0
    • boilerplate_penalty: number of boilerplate containers detected (nav/aside/comments/social/newsletter/consent), capped
    • direct_paragraph_count, direct_block_count, paragraph_to_block_ratio, average_paragraph_length, dom_depth, heading_children_count: direct-children structure features used by heuristics
    • aria_role_main, aria_role_negative, aria_hidden: accessibility signals
    • image_alt_ratio, image_count: image accessibility metrics
    • training_label: 1 for the true article candidate; 0 otherwise
    • default_selected: 1 if this candidate would be chosen by the default heuristic (no custom weights)
  3. Train weights and export JSON
  • Via npm (use --silent and arg separator):

    • npm run --silent train:ranker -- scripts/data/candidates_with_url.csv > weights.json
  • Or run directly (avoids npm banner output):

    • node scripts/train-reranker.js scripts/data/candidates_with_url.csv weights.json

    Parameters

    • scripts/data/candidates_with_url.csv: labeled candidates CSV (input)

    • weights.json: output weights file (JSON)

    Tips

    • -- passes subsequent args to the underlying script

    • > weights.json redirects stdout to a file

  4. Use the weights
  • scripts/single-sample-run.js auto-loads weights.json (if present) and enables the reranker:
    • options.contentDetection.reranker = { enabled: true, weights }
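
To wire trained weights into your own code, a minimal sketch mirroring what scripts/single-sample-run.js does automatically (the URL is illustrative):

import fs from "node:fs";
import { parseArticle } from "horseman-article-parser";

// load the trained weights exported by train:ranker
const weights = JSON.parse(fs.readFileSync("weights.json", "utf8"));

const article = await parseArticle({
  url: "https://example.com/article", // illustrative URL
  contentDetection: { reranker: { enabled: true, weights } },
});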

Notes

  • If no reranker is configured, the detector uses heuristic scoring only.
  • You can merge CSVs from multiple runs: npm run merge:csv -- scripts/data/merged.csv scripts/data/candidates_with_url.csv.
  • Tip: placing a weights.json in the project root will make scripts/single-sample-run.js auto-enable the reranker on the next run.

Update API docs with:

npm run docs

Dependencies

Dev Dependencies

License

This project is licensed under the GNU GENERAL PUBLIC LICENSE Version 3 - see the LICENSE file for details
