Itterum/simple-parser

Simple Parser

A modular extractor framework built with Node.js, TypeScript, and Playwright. It supports multiple extractors, each of which can be run from the CLI.


Installation

  1. Clone the repository:
git clone https://github.com/Itterum/simple-parser
cd simple-parser
  2. Install dependencies:
npm install
npx playwright install

Running the Project

Development (TypeScript directly)

npm run dev -- --extractor github-extractor --urls https://github.com/trending

Production (compiled JavaScript)

npm run build
node dist/cli.js --extractor github-extractor --urls https://github.com/trending

Creating a New Extractor

To add a new extractor:

  1. Create a new folder in src/extractors/ with the name of your extractor (e.g., github).
  2. Inside this folder, create an index.ts file.
  3. Optionally, create a types.ts file to define any TypeScript interfaces or types related to your extractor.

Example Extractor

types.ts (inside src/extractors/github/):

import { BaseEntity, IBaseEntity } from "../base/types";

interface IRepositoryFields {
  title: string;
  url: string;
  description: string;
  language: string;
  countAllStars: number;
  countStarsToday: number;
  countForks: number;
}

export interface IRepositoryEntity extends IBaseEntity<IRepositoryFields> {
  fields: IRepositoryFields;
}

export class RepositoryEntity extends BaseEntity<IRepositoryFields> implements IRepositoryEntity {
  fields: IRepositoryFields;

  constructor(fields: IRepositoryFields) {
    super(fields);
    this.fields = fields;
  }
}

index.ts (inside src/extractors/github/):

import { BaseExtractor } from "../base";
import { ElementHandle } from "playwright";
import { RepositoryEntity } from "./types";

export class GithubExtractor extends BaseExtractor<RepositoryEntity> {
  domain = "github.com";
  waitSelector = ".Box-row";

  async parseEntity(element: ElementHandle): Promise<RepositoryEntity> {
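    // Extract the fields from `element` and return a RepositoryEntity.
    // Placeholder so the example compiles until the logic is written.
    throw new Error("parseEntity is not implemented yet");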
  }
}
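
The body of parseEntity is left open in the example above. One step it will probably need is turning scraped text such as "1,234" or "56 stars today" into the numeric fields of IRepositoryFields. The following is a minimal sketch of that step only, assuming counts appear as comma-grouped digits inside free text. parseCount and toRepositoryCounts are hypothetical helpers and are not part of the repository.

```typescript
// Hypothetical helper (not from the repository): pull the first integer
// out of scraped text. Assumes comma grouping ("1,234"); abbreviated
// forms such as "1.2k" are NOT handled.
export function parseCount(text: string | null | undefined): number {
  if (!text) return 0;
  const match = text.match(/\d[\d,]*/);
  if (!match) return 0;
  return Number.parseInt(match[0].replace(/,/g, ""), 10);
}

// Hypothetical helper: convert raw scraped strings into the three
// numeric fields declared in IRepositoryFields.
export function toRepositoryCounts(raw: {
  stars: string | null;
  starsToday: string | null;
  forks: string | null;
}): { countAllStars: number; countStarsToday: number; countForks: number } {
  return {
    countAllStars: parseCount(raw.stars),
    countStarsToday: parseCount(raw.starsToday),
    countForks: parseCount(raw.forks),
  };
}
```

Inside parseEntity, the raw strings could come from Playwright's element.$eval(selector, (el) => el.textContent), with selectors chosen for the target page. Missing or non-numeric text falls back to 0 here, which may or may not be the behavior you want.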

CLI Usage

node dist/cli.js --extractor <extractor-name> --urls <url1> <url2>

Example:

node dist/cli.js --extractor github-extractor --urls https://github.com/trending
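
To illustrate how the two flags fit together, here is a rough sketch of parsing --extractor (one value) and --urls (one or more values). The repository's actual cli.ts is not shown in this README, so parseCliArgs and CliOptions are hypothetical names and the real implementation may differ.

```typescript
// Hypothetical shape of the parsed options; not taken from the repository.
interface CliOptions {
  extractor: string;
  urls: string[];
}

// Parse ["--extractor", name, "--urls", url1, url2, ...] into CliOptions.
// From a Node.js entry point, pass process.argv.slice(2).
export function parseCliArgs(argv: string[]): CliOptions {
  let extractor = "";
  const urls: string[] = [];
  let current: "extractor" | "urls" | null = null;

  for (const arg of argv) {
    if (arg === "--extractor") {
      current = "extractor";
      continue;
    }
    if (arg === "--urls") {
      current = "urls";
      continue;
    }
    if (arg.startsWith("--")) {
      throw new Error(`Unknown option: ${arg}`);
    }
    if (current === "extractor") {
      extractor = arg; // if several values follow, the last one wins
    } else if (current === "urls") {
      urls.push(arg); // every value after --urls is collected
    } else {
      throw new Error(`Unexpected argument: ${arg}`);
    }
  }

  if (!extractor) throw new Error("Missing --extractor <extractor-name>");
  if (urls.length === 0) throw new Error("Missing --urls <url1> [url2 ...]");
  return { extractor, urls };
}
```

With this sketch, the arguments from the example above would yield extractor "github-extractor" and a single-element urls array.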
