A modular extractor framework using Node.js, TypeScript, and Playwright. Supports multiple extractors that can be run via CLI.
- Clone the repository:
git clone https://github.com/Itterum/simple-parser
cd simple-parser
- Install dependencies:
npm install
npx playwright install
- Run in development mode:
npm run dev -- --extractor github-extractor --urls https://github.com/trending
- Build and run the compiled CLI:
npm run build
node dist/cli.js --extractor github-extractor --urls https://github.com/trending
To add a new extractor:
- Create a new folder in src/extractors/ with the name of your extractor (e.g., github).
- Inside this folder, create an index.ts file.
- Optionally, create a types.ts file to define any TypeScript interfaces or types related to your extractor.
types.ts (inside src/extractors/github/):
import { BaseEntity, IBaseEntity } from "../base/types";

interface IRepositoryFields {
  title: string;
  url: string;
  description: string;
  language: string;
  countAllStars: number;
  countStarsToday: number;
  countForks: number;
}

export interface IRepositoryEntity extends IBaseEntity<IRepositoryFields> {
  fields: IRepositoryFields;
}

export class RepositoryEntity extends BaseEntity<IRepositoryFields> implements IRepositoryEntity {
  fields: IRepositoryFields;

  constructor(fields: IRepositoryFields) {
    super(fields);
    this.fields = fields;
  }
}
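The count fields above are typed as number, while text scraped from a page usually arrives as a string such as "1,234" or "56 stars today". A minimal sketch of a conversion helper follows; parseCount is a hypothetical name and is not part of the repository.

```typescript
// Hypothetical helper (not part of the repository): converts scraped count
// text such as "1,234" or "56 stars today" into a number.
function parseCount(text: string | null | undefined): number {
  if (!text) return 0;
  // Take the first run of digits and thousands separators, e.g. "1,234".
  const match = text.match(/\d[\d,]*/);
  if (!match) return 0;
  // Drop the separators before converting.
  return Number(match[0].replace(/,/g, ""));
}
```

Returning 0 for missing or non-numeric text keeps the entity fields numeric; an extractor that needs to distinguish "no value" from zero might prefer to return null instead.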
index.ts (inside src/extractors/github/):
import { BaseExtractor } from "../base";
import { ElementHandle } from "playwright";
import { RepositoryEntity } from "./types";

export class GithubExtractor extends BaseExtractor<RepositoryEntity> {
  // Domain this extractor handles
  domain = "github.com";
  // Selector to wait for before parsing; each match is one entity
  waitSelector = ".Box-row";

  async parseEntity(element: ElementHandle): Promise<RepositoryEntity> {
    // Logic to extract data from the element goes here
    throw new Error("parseEntity is not implemented yet");
  }
}
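As a rough illustration of what the parseEntity body might do, the sketch below reads each field from a row element. It is written against a small structural interface (ElementLike) that mirrors the $, textContent, and getAttribute methods of Playwright's ElementHandle, so it can run without a browser. The selectors, the extractRepositoryFields name, and the base URL handling are all assumptions for illustration; the real page markup and the repository's own implementation may differ.

```typescript
// Structural stand-in for the subset of Playwright's ElementHandle used here.
interface ElementLike {
  $(selector: string): Promise<ElementLike | null>;
  textContent(): Promise<string | null>;
  getAttribute(name: string): Promise<string | null>;
}

// Same shape as IRepositoryFields in types.ts, repeated to keep the sketch self-contained.
interface RepositoryFields {
  title: string;
  url: string;
  description: string;
  language: string;
  countAllStars: number;
  countStarsToday: number;
  countForks: number;
}

// ASSUMPTION: these selectors are guesses at the trending page markup.
const SELECTORS = {
  title: "h2 a",
  description: "p",
  language: "[itemprop='programmingLanguage']",
  stars: "a[href$='/stargazers']",
  forks: "a[href$='/forks']",
  starsToday: "span.float-sm-right",
};

// Returns the whitespace-normalized text of the first match, or "" if absent.
async function textOf(root: ElementLike, selector: string): Promise<string> {
  const node = await root.$(selector);
  const text = node ? await node.textContent() : null;
  return (text || "").replace(/\s+/g, " ").trim();
}

// Keeps only the digits, so "1,234 stars today" becomes 1234 and "" becomes 0.
const toNumber = (text: string): number => Number(text.replace(/\D/g, ""));

async function extractRepositoryFields(row: ElementLike): Promise<RepositoryFields> {
  const link = await row.$(SELECTORS.title);
  const href = link ? await link.getAttribute("href") : null;
  return {
    title: await textOf(row, SELECTORS.title),
    // Relative links are resolved against github.com; a missing link yields "".
    url: href ? (href.startsWith("http") ? href : `https://github.com${href}`) : "",
    description: await textOf(row, SELECTORS.description),
    language: await textOf(row, SELECTORS.language),
    countAllStars: toNumber(await textOf(row, SELECTORS.stars)),
    countStarsToday: toNumber(await textOf(row, SELECTORS.starsToday)),
    countForks: toNumber(await textOf(row, SELECTORS.forks)),
  };
}
```

Inside parseEntity, the result could then be wrapped as new RepositoryEntity(await extractRepositoryFields(element)), assuming ElementHandle satisfies the structural interface above. Missing elements fall back to empty strings and zero counts rather than throwing, which suits rows where a description or language is absent.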
To run an extractor against one or more URLs:
node dist/cli.js --extractor <extractor-name> --urls <url1> <url2>
Example:
node dist/cli.js --extractor github-extractor --urls https://github.com/trending
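For readers curious how the two flags could be interpreted, here is a minimal sketch of an argument parser for this command shape. It is not the repository's cli.ts; parseCliArgs and CliOptions are hypothetical names, and the error handling is an assumption.

```typescript
// Hypothetical result shape for the parsed command line.
interface CliOptions {
  extractor: string;
  urls: string[];
}

// Parses arguments of the form:
//   --extractor <extractor-name> --urls <url1> <url2> ...
function parseCliArgs(argv: string[]): CliOptions {
  let extractor = "";
  const urls: string[] = [];
  // Tracks which flag the following positional values belong to.
  let current: "extractor" | "urls" | null = null;

  for (const arg of argv) {
    if (arg === "--extractor") { current = "extractor"; continue; }
    if (arg === "--urls") { current = "urls"; continue; }
    if (arg.startsWith("--")) throw new Error(`Unknown flag: ${arg}`);

    if (current === "extractor") extractor = arg;
    else if (current === "urls") urls.push(arg);
    else throw new Error(`Unexpected argument: ${arg}`);
  }

  if (!extractor) throw new Error("Missing --extractor <extractor-name>");
  if (urls.length === 0) throw new Error("Missing --urls <url1> <url2>");
  return { extractor, urls };
}
```

In a Node.js entry point this would typically be called as parseCliArgs(process.argv.slice(2)), since the first two entries of process.argv are the node binary and the script path.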