
AI-Powered Article Content Curator

The AI-Powered Article Content Curator is a comprehensive system designed to aggregate, summarize, and present curated government-related articles. The project is organized into three main components:

  • Backend: Provides a robust RESTful API to store and serve curated articles.
  • Crawler: Automatically crawls and extracts article URLs and metadata from government homepages and public API sources.
  • Frontend: Offers an intuitive Next.js-based user interface for government staff (and potentially the public) to browse and view article details.

Each component is maintained in its own directory.

Tech stack: Node.js, Express.js, Next.js, MongoDB, Mongoose, Axios, Cheerio, Puppeteer, Google Generative AI, Vercel, Cron, JWT, GitHub Actions, React, TypeScript, CSS, Tailwind CSS, Docker.

Note: This is a work in progress. Please review the information, test out the applications, and provide feedback or contributions. More features are also coming soon!


Table of Contents

  • Overview
  • Architecture
  • User Interface
  • Backend
  • Crawler
  • Frontend
  • Logging, Error Handling & Future Enhancements
  • Contributing
  • License
  • Contact
  • Conclusion


Overview

The AI-Powered Article Content Curator system is designed to provide government staff with up-to-date, summarized content from trusted government sources and reputable news outlets. By leveraging AI (Google Generative AI / Gemini) for summarization and using modern web technologies, this solution ensures that users receive concise, accurate, and timely information.

  • Data Ingestion:
    The system aggregates article URLs from multiple sources (government homepages and public APIs like NewsAPI) using a decoupled crawler service.
  • Content Processing:
    The backend processes the fetched articles by generating concise summaries via Google Generative AI. This step includes handling rate limits and transient errors with robust retry mechanisms.
  • Data Storage & API Serving:
Articles (URLs, titles, full content, AI-generated summaries, source details, and fetch timestamps) are stored in MongoDB, managed through Mongoose; a schema sketch follows this list. An Express.js API, integrated within a Next.js project, exposes REST endpoints for fetching article lists and individual article details.
  • Frontend Experience:
    A responsive Next.js/React interface allows users to easily browse paginated article lists, filter by source, and view detailed article pages, with dark/light mode support. The frontend also includes user authentication, enabling users to mark articles as favorites for quick access.
  • Scheduled Updates:
    Both the backend and crawler employ scheduled serverless functions (via Vercel cron) to periodically update the content, ensuring that stored articles remain fresh.
  • Architecture: Monorepo structure with separate directories for the backend, crawler, and frontend. Each component is designed to be scalable, maintainable, and deployable on Vercel.
  • User Authentication:
    Users can create an account, log in, and receive a JWT token for secure access to the system.
  • Favorite Articles:
    Authenticated users can mark articles as favorites for quick access and reference.
  • Dark Mode:
    The frontend offers a dark mode option for improved readability and user experience.
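
To make the storage model concrete, here is a minimal sketch of an article schema along the lines described above. It is an illustrative assumption; field names such as fetchedAt are not taken from the repository's actual code:

import { Schema, model } from "mongoose";

// Hypothetical schema covering the fields described above: URL, title,
// full content, AI-generated summary, source details, and fetch timestamp.
const ArticleSchema = new Schema({
  url: { type: String, required: true, unique: true },
  title: { type: String, required: true },
  content: { type: String, required: true },
  summary: { type: String },              // filled in by the summarization step
  source: { type: String, index: true },  // enables filtering by source
  fetchedAt: { type: Date, default: Date.now },
});

export const Article = model("Article", ArticleSchema);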

Architecture

Below is a high-level diagram outlining the system architecture:

      +----------------+       +--------------------------+
      |                |       |                          |
      |   Data Sources |       |   Public API Sources     |
      |                |       |   (e.g., NewsAPI)        |
      +--------+-------+       +-------------+------------+
               |                              |
               |                              |
               v                              v
      +----------------+       +--------------------------+
      |                |       |                          |
      | Custom Crawlers|       | API Fetcher Service      |
      | (Homepage      |       |                          |
      |  Crawling)     |       +-------------+------------+
      |                |                     |
      +--------+-------+                     |
               |                             |
               +------------+----------------+
                            |
                            v
                  +--------------------+
                  |                    |
                  |   Data Processing  |
                  | (Summarization via |
                  |  Gemini AI /       |
                  | GoogleGenerativeAI)|
                  |                    |
                  +---------+----------+
                            |
                            v
                  +--------------------+
                  |                    |
                  |   MongoDB Storage  |
                  | (via Mongoose)     |
                  |                    |
                  +---------+----------+
                            |
                            v
                  +--------------------+
                  |                    |
                  |   Express.js API   |
                  | (REST Endpoints)   |
                  |                    |
                  +---------+----------+
                            |
                            v
                  +--------------------+
                  |                    |
                  |   Next.js Frontend |
                  |  (Consumer of API) |
                  |                    |
                  +--------------------+

User Interface

Screenshots of the main views (images available in the repository):

  1. Home Page
  2. Home Page (Dark Mode)
  3. Home Page (Guest User)
  4. Article Details Page
  5. Article Details Page (Guest User)
  6. Favorite Articles Page
  7. Favorite Articles Page (Unauthenticated User)
  8. User Authentication
  9. User Registration
  10. Reset Password
  11. 404 Not Found Page
  12. Backend Swagger API Documentation


Backend

The Backend is responsible for storing articles and serving them via RESTful endpoints. It integrates AI summarization and MongoDB storage, and runs within a Next.js environment using Express.js for API routes.

Features

  • Data Ingestion:
    Receives article URLs and data from the crawler and external API sources.
  • Content Summarization:
    Uses Google Generative AI (Gemini) to generate concise summaries.
  • Storage:
    Persists articles in MongoDB using Mongoose with fields for URL, title, full content, summary, source information, and fetch timestamp.
  • API Endpoints:
    • GET /api/articles – Retrieves a paginated list of articles (supports filtering via query parameters such as page, limit, and source); a route sketch follows this list.
    • GET /api/articles/:id – Retrieves detailed information for a specific article.
  • Scheduled Updates:
    A serverless function (triggered twice daily at 6:00 AM and 6:00 PM UTC) fetches and processes new articles, keeping the system up-to-date.
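
A minimal sketch of how the paginated listing endpoint could be implemented with Express and the article model; the route structure, defaults, and response shape are assumptions for illustration, not the repository's actual code:

import { Router } from "express";
import { Article } from "../models/article"; // hypothetical model path

const router = Router();

// GET /api/articles?page=1&limit=10&source=bbc.com
router.get("/api/articles", async (req, res) => {
  const page = Math.max(parseInt(String(req.query.page), 10) || 1, 1);
  const limit = Math.min(parseInt(String(req.query.limit), 10) || 10, 100);
  const filter = req.query.source ? { source: String(req.query.source) } : {};

  const articles = await Article.find(filter)
    .sort({ fetchedAt: -1 }) // newest first
    .skip((page - 1) * limit)
    .limit(limit)
    .lean();

  res.json({ page, limit, articles });
});

export default router;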

Prerequisites & Installation (Backend)

Note: Instead of installing the node modules separately, you can run npm install in the root directory to install dependencies for all components.

  1. Prerequisites:

    • Node.js (v18 or later)
    • MongoDB (local or cloud)
    • Vercel CLI (for deployment)
  2. Clone the Repository:

    git clone https://github.com/hoangsonww/AI-Gov-Content-Curator.git
    cd AI-Gov-Content-Curator/backend
  3. Install Dependencies:

    npm install

Configuration (Backend)

Create a .env file in the backend directory with the following:

MONGODB_URI=your_production_mongodb_connection_string
GOOGLE_AI_API_KEY=your_google_ai_api_key
AI_INSTRUCTIONS=Your system instructions for Gemini AI
NEWS_API_KEY=your_newsapi_key
PORT=3000
CRAWL_URLS=https://www.whitehouse.gov/briefing-room/,https://www.congress.gov/,https://www.state.gov/press-releases/,https://www.bbc.com/news,https://www.nytimes.com/

Running Locally (Backend)

Start the development server:

npm run dev

Access endpoints:

  • GET http://localhost:3000/api/articles
  • GET http://localhost:3000/api/articles/:id
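
For example, with the server running locally, the endpoints can be exercised with curl (the query parameter names follow the feature list above; <article-id> is a placeholder):

curl "http://localhost:3000/api/articles?page=1&limit=5&source=bbc.com"
curl "http://localhost:3000/api/articles/<article-id>"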

Deployment on Vercel (Backend)

  1. Configure Environment Variables in your Vercel project settings.

  2. Create or update the vercel.json in the root of the backend directory:

    {
      "version": 2,
      "builds": [
        {
          "src": "package.json",
          "use": "@vercel/next"
        }
      ],
      "crons": [
        {
          "path": "/api/scheduled/fetchAndSummarize",
          "schedule": "0 6,18 * * *"
        }
      ]
    }
  3. Deploy with:

    vercel --prod

Crawler

The Crawler automatically retrieves article links and metadata from government homepages and public API sources. It uses Axios and Cheerio for static HTML parsing and falls back to Puppeteer when necessary.

Features

  • Article Extraction:
    Crawls specified URLs to extract article links and metadata.

  • Error Handling & Resilience:
    Implements a retry mechanism and falls back to Puppeteer for dynamic content fetching when encountering issues (e.g., HTTP 403, ECONNRESET); see the sketch after this list.

  • Scheduling:
    Deployed as a serverless function on Vercel, scheduled via cron (runs daily at 6:00 AM UTC).

  • Next.js UI:
    Provides a basic landing page with information about the crawler and links to the backend and frontend.
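
A rough sketch of the static-first, Puppeteer-fallback strategy described above; the function names and error checks are illustrative assumptions, not the crawler's actual code:

import axios from "axios";
import * as cheerio from "cheerio";
import puppeteer from "puppeteer";

// Try a cheap static fetch first; fall back to a headless browser when the
// site blocks plain HTTP clients (403) or the connection is reset.
async function fetchHtml(url: string): Promise<string> {
  try {
    const res = await axios.get(url, { timeout: 15000 });
    return res.data;
  } catch (err: any) {
    if (err?.response?.status === 403 || err?.code === "ECONNRESET") {
      const browser = await puppeteer.launch({ headless: true });
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "networkidle2" });
        return await page.content();
      } finally {
        await browser.close();
      }
    }
    throw err;
  }
}

// Extract candidate article links from a homepage with Cheerio.
async function extractLinks(url: string): Promise<string[]> {
  const $ = cheerio.load(await fetchHtml(url));
  return $("a[href]").map((_, el) => $(el).attr("href")!).get();
}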

Prerequisites & Installation (Crawler)

  1. Prerequisites:

    • Node.js (v18 or later)
    • NPM (or Yarn)
    • Vercel CLI (for deployment)
  2. Clone the Repository:

    git clone https://github.com/hoangsonww/AI-Gov-Content-Curator.git
    cd AI-Gov-Content-Curator/crawler
  3. Install Dependencies:

    npm install

Configuration (Crawler)

Create a .env file in the crawler directory with the following variables:

MONGODB_URI=your_mongodb_connection_string
GOOGLE_AI_API_KEY=your_google_ai_api_key
AI_INSTRUCTIONS=Your system instructions for Gemini AI
NEWS_API_KEY=your_newsapi_key
PORT=3000
CRAWL_URLS=https://www.whitehouse.gov/briefing-room/,https://www.congress.gov/,https://www.state.gov/press-releases/,https://www.bbc.com/news,https://www.nytimes.com/

Running Locally (Crawler)

Start the Next.js development server to test both the UI and crawler function:

npm run dev

Alternatively, run the crawler directly:

npx ts-node schedule/fetchAndSummarize.ts

Deployment on Vercel (Crawler)

  1. Set Environment Variables in the Vercel dashboard.

  2. Create or update the vercel.json in the crawler directory:

    {
      "version": 2,
      "builds": [
        {
          "src": "package.json",
          "use": "@vercel/next"
        }
      ],
      "crons": [
        {
          "path": "/api/scheduled/fetchAndSummarize",
          "schedule": "0 6 * * *"
        }
      ]
    }
  3. Deploy with:

    vercel --prod

Frontend

The Frontend is built with Next.js and React, providing a modern, mobile-responsive UI for browsing and viewing curated articles.

Features

  • Article Listing:
    Fetches and displays a paginated list of articles from the backend API, with support for filtering by source; see the fetch sketch after this list.

  • Article Detail View:
    Dedicated pages display full article content, AI-generated summaries, source information, and fetched timestamps.

  • Responsive Design:
    The UI is optimized for both desktop and mobile devices.

  • Additional UI Components:
    Includes components like HeroSlider, LatestArticles, ThemeToggle, and more for an enhanced user experience.
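
A minimal sketch of a client-side helper for fetching a page of articles; the Article type and the response shape mirror the hypothetical backend route above and are assumptions, not the documented API:

// Hypothetical shape of an article returned by the backend.
export interface Article {
  _id: string;
  url: string;
  title: string;
  summary?: string;
  source?: string;
  fetchedAt: string;
}

const API_URL = process.env.NEXT_PUBLIC_API_URL ?? "http://localhost:3000";

export async function fetchArticles(
  page = 1,
  limit = 10,
  source?: string
): Promise<Article[]> {
  const params = new URLSearchParams({ page: String(page), limit: String(limit) });
  if (source) params.set("source", source);

  const res = await fetch(`${API_URL}/api/articles?${params}`);
  if (!res.ok) throw new Error(`Failed to fetch articles: ${res.status}`);
  const data = await res.json();
  return data.articles as Article[];
}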

Prerequisites & Installation (Frontend)

  1. Prerequisites:

    • Node.js (v18 or later)
    • NPM or Yarn
  2. Clone the Repository:

    git clone https://github.com/hoangsonww/AI-Gov-Content-Curator.git
    cd AI-Gov-Content-Curator/frontend
  3. Install Dependencies:

    npm install

    or

    yarn

Configuration (Frontend)

(Optional) Create a .env.local file in the frontend directory to configure the API URL:

NEXT_PUBLIC_API_URL=https://your-backend.example.com

Running Locally (Frontend)

Start the development server:

npm run dev

Access the application at http://localhost:3000.

Deployment on Vercel (Frontend)

  1. Configure Environment Variables in the Vercel dashboard (e.g., NEXT_PUBLIC_API_URL).

  2. Vercel automatically detects the Next.js project; if needed, customize with a vercel.json.

  3. Deploy with:

    vercel --prod

Alternatively, you can deploy directly from the Vercel dashboard.


Logging, Error Handling & Future Enhancements

  • Logging:

    • Development: Uses basic console logging.
    • Production: Consider integrating advanced logging libraries (e.g., Winston, Sentry) for improved error monitoring.
  • Error Handling:

    • The backend implements retry mechanisms for AI summarization (a backoff sketch follows this list).
    • The crawler gracefully handles network errors and switches between Axios/Cheerio and Puppeteer as needed.
  • Future Enhancements:

    • Expand the Next.js UI into a richer dashboard featuring analytics, logs, and user authentication.
    • Refine scheduling options for more granular updates.
    • Integrate additional public API sources and extend filtering capabilities.
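
As an illustration of the retry idea, a summarization call with exponential backoff might look like the following; the SDK package, model name, prompt, and backoff constants are all assumptions for the sketch:

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY!);

// Retry transient failures (rate limits, network blips) with exponential backoff.
async function summarizeWithRetry(content: string, maxRetries = 3): Promise<string> {
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
  for (let attempt = 0; ; attempt++) {
    try {
      const result = await model.generateContent(
        `Summarize the following article concisely:\n\n${content}`
      );
      return result.response.text();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}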

Contributing

  1. Fork the repository and clone it locally.

  2. Create a Feature Branch:

    git checkout -b feature/your-feature-name
  3. Commit Your Changes:

    git commit -m "Description of your feature"
  4. Push the Branch and Open a Pull Request.

Contributions are welcome! Please ensure that your code adheres to the project’s linting and formatting guidelines.


License

This project is licensed under the MIT License. See the LICENSE file for more details.


Contact

If you have any questions or suggestions, feel free to reach out.

A new development team might be formed to continue the project, so please check back for updates!


Conclusion

The AI-Powered Article Content Curator project brings together a powerful backend, an intelligent crawler, and a modern frontend to deliver up-to-date, summarized government-related articles. Leveraging advanced technologies like Google Generative AI, Next.js, Express.js, and MongoDB, the system is both scalable and robust. Whether you’re a government staff member or a curious public user, this solution provides a streamlined, user-friendly experience to quickly access relevant, summarized content.


Cheers to a more informed world! 🚀

🔝 Back to Top
