A Python script to automatically scan local project folders and READMEs, intelligently generating and applying relevant topics to your GitHub repositories. It saves hours of manual work. I used this very tool to consistently tag all of my projects in a single run, greatly improving my portfolio's organization.

GitHub Repository Auto-Tagger

Problem Statement and Goal of Project

For any active developer or data scientist, maintaining a large portfolio of projects on GitHub can become disorganized. Manually adding and updating descriptive topics for each repository is a tedious and inconsistent process. This lack of proper tagging reduces the discoverability of projects, making it difficult for recruiters, collaborators, or even my future self to quickly grasp the scope and technologies of a given project.

The goal of this project was to create a robust automation script that streamlines the organization of my GitHub profile. The tool is designed to scan a local directory of Git repositories, intelligently generate a set of relevant topics based on the project's folder structure and README content, and apply these tags directly to the corresponding GitHub repositories via the API.

Solution Approach

I developed a Python script, encapsulated in a GitHubTopicUpdater class, that automates the entire tagging process end to end. The workflow is as follows:

  1. Secure User Authentication: The script begins by interactively prompting for the user's GitHub username and a Personal Access Token (PAT), using Python's getpass module so that credentials are not echoed on the command line. It also requests the root path of the local projects directory.

  2. Local Repository Scanning: The tool recursively walks through the specified directory, identifying valid Git projects by the presence of a .git subfolder.

  3. Intelligent Topic Generation: For each repository found, the script compiles a list of potential topics from multiple sources (a condensed sketch of this step and the scanning step follows this list):

    • Folder Structure: It parses the names of parent directories and the repository folder itself, cleaning them to be valid topic formats (e.g., _ and spaces are converted to -). This leverages a hierarchical folder structure (/machine-learning/time-series-forecasting/) to derive contextual tags.
    • README Content Analysis: The script reads the README.md file within each project and searches for keywords from a comprehensive, predefined list of over 100 relevant technologies, libraries, and concepts (TECH_KEYWORDS) in the AI and Machine Learning space.
    • Filtering and Cleaning: All generated topics are standardized to lowercase. A BLACKLIST_TOPICS set is used to filter out generic, non-descriptive words (e.g., "project", "data", "model"), ensuring the final tags are meaningful and specific.
  4. User Review and Confirmation: Before any changes are made, the script presents a clear, formatted summary of all repositories and the topics it has generated for each one. It requires an explicit 'yes' confirmation from the user to proceed, preventing accidental updates.

  5. GitHub API Integration: Upon confirmation, the script iterates through each repository and uses the requests library to send a PUT request to the GitHub API, updating the repository's topics. A one-second delay is implemented between API calls to respect rate limits.
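
To make steps 2 and 3 concrete, here is a minimal, condensed sketch of how local scanning and topic generation can work. It is illustrative only: the keyword and blacklist sets are heavily truncated, and the function names (find_git_repos, generate_topics) are hypothetical stand-ins for the script's internal methods.

import os
import re

# Heavily truncated stand-ins for the script's TECH_KEYWORDS and BLACKLIST_TOPICS collections
TECH_KEYWORDS = {"python", "tensorflow", "scikit-learn", "pandas", "fastapi", "xgboost"}
BLACKLIST_TOPICS = {"project", "data", "model"}

def find_git_repos(root_path):
    """Yield paths under root_path that contain a .git subfolder."""
    for dirpath, dirnames, _ in os.walk(root_path):
        if ".git" in dirnames:
            yield dirpath
            dirnames[:] = []  # do not descend further into the repository itself

def clean_name(name):
    """Turn a folder name into a valid GitHub topic: lowercase, hyphen-separated."""
    return re.sub(r"[\s_]+", "-", name.strip().lower())

def generate_topics(repo_path, root_path):
    """Derive topics from the folder hierarchy and README keywords, then filter them."""
    topics = set()
    # Folder structure: each directory between the root and the repo becomes a candidate topic
    for part in os.path.relpath(repo_path, root_path).split(os.sep):
        topics.add(clean_name(part))
    # README content: look for known technology keywords
    readme_path = os.path.join(repo_path, "README.md")
    if os.path.isfile(readme_path):
        with open(readme_path, encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()
        for keyword in TECH_KEYWORDS:
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                topics.add(keyword)
    # Drop generic, non-descriptive words before returning
    return sorted(topics - BLACKLIST_TOPICS)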

Technologies & Libraries

  • os: Utilized for file system navigation, such as walking through project directories and constructing file paths.
  • requests: Essential for making HTTP API calls to the GitHub REST API to update repository topics (a minimal example of this call follows the list).
  • re: Leveraged for regular expression operations to clean folder and repository names and to reliably search for keywords within README files.
  • getpass: Implemented for securely prompting the user for their GitHub Personal Access Token without displaying it in the terminal.
  • time: Used to add a time.sleep(1) delay between API requests to prevent rate-limiting issues.
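
As a rough sketch of step 5, the call below replaces a repository's topics through the GitHub REST API ("Replace all repository topics" endpoint). The username, token, and repository name are placeholders; the real script collects them interactively.

import time
import requests

def update_repo_topics(username, token, repo_name, topics):
    """Replace all topics on a GitHub repository; returns True on success."""
    url = f"https://api.github.com/repos/{username}/{repo_name}/topics"
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    response = requests.put(url, headers=headers, json={"names": topics}, timeout=30)
    return response.status_code == 200

# Example loop over previously generated topics (placeholder values):
# for repo_name, topics in generated_topics.items():
#     update_repo_topics("your-username", "ghp_xxxx", repo_name, topics)
#     time.sleep(1)  # pause between calls to respect rate limits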

Installation & Execution Guide

  1. Prerequisites:

    • Python 3 installed on your system.
    • A GitHub Personal Access Token (PAT) with full repo scope. You can create one in your GitHub account settings under Developer settings → Personal access tokens (a quick way to verify the token is sketched after this guide).
    • A local directory where you have cloned the GitHub repositories you wish to tag.
  2. Installation:

    • Save the script as run_all.py.
    • Install the necessary Python library:
      pip install requests
  3. Execution:

    • Open your terminal or command prompt.
    • Navigate to the directory where you saved run_all.py.
    • Run the script using the following command:
      python run_all.py
    • Follow the on-screen prompts to enter your GitHub username, PAT, and the path to your projects directory.
    • Carefully review the proposed topics for each repository.
    • Type yes and press Enter to confirm and begin the update process on GitHub.
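
Optionally, before running the script you can verify that your PAT works at all with a one-off call to the GitHub API (placeholder token shown; the script itself collects the token via getpass):

import requests

token = "ghp_your_personal_access_token"  # placeholder
response = requests.get("https://api.github.com/user",
                        headers={"Authorization": f"token {token}"})
print(response.status_code, response.json().get("login"))
# 200 plus your username means the token is valid and usable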

Key Results / Performance

The script performs its function reliably, automating a manual and error-prone task. The primary result is a significant improvement in the organization and discoverability of projects on my GitHub profile. It effectively bridges the gap between local project organization and the public-facing presentation on GitHub, ensuring that repository topics are consistent, comprehensive, and up-to-date. The tool provides clear, real-time feedback on the success or failure of each API update, allowing for easy monitoring.

Sample Output

Here is an example of the script's output during an execution cycle, showing the user prompts, the review list, and the final update status messages.

--- GitHub Auto-Tagger Configuration ---
Enter your GitHub username: imehranasgari
Enter your GitHub Personal Access Token (PAT):
Enter the full path to your projects directory: /Users/mehran/Documents/GitHub_Projects
----------------------------------------
STEP 1: Scanning local folders and reading READMEs...

SUCCESS: Found 2 repositories. Here is the generated list:
----------------------------------------------------------------------
    "API-for-Deep-Learning-Model": ['api', 'deep-learning', 'deployment', 'fastapi', 'python', 'tensorflow'],
    "Customer-Churn-Prediction": ['classification', 'deep-learning', 'machine-learning', 'pandas', 'python', 'scikit-learn', 'xgboost'],
----------------------------------------------------------------------

>>> Review the list. Type 'yes' to update GitHub, or anything else to cancel: yes

STEP 2: Starting to update topics on GitHub...
Attempting to update 'API-for-Deep-Learning-Model'...
    ✅ SUCCESS: Topics updated for 'API-for-Deep-Learning-Model'.
Attempting to update 'Customer-Churn-Prediction'...
    ✅ SUCCESS: Topics updated for 'Customer-Churn-Prediction'.

All done!

Additional Learnings / Reflections

Building this tool provided valuable hands-on experience in several key areas of software development:

  • API Integration: It was a practical exercise in consuming a major third-party REST API (GitHub), including handling authentication (with PATs), structuring requests correctly, and interpreting responses.
  • Secure Coding Practices: The deliberate choice to use getpass instead of a standard input() for the PAT reflects an understanding of the importance of handling sensitive data securely.
  • User-Centric Design: The script was designed with the user in mind, incorporating a final review and confirmation step. This is a critical feature for any automation tool that performs write operations, as it provides a safeguard against unintended changes.
  • Code Modularity and Reusability: By creating a well-defined class and internal methods (_scan_local_repos, _update_github_repo), the logic is clean, organized, and easy to maintain or extend in the future. The use of customizable lists for keywords and blacklisted topics also makes the tool highly adaptable, as the short example below shows.
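
For instance, adapting the tool to a different domain could be as simple as extending those two collections (hypothetical values, matching the list and set types described above):

# Hypothetical additions for a web-development portfolio
TECH_KEYWORDS += ["react", "django", "postgresql", "docker"]   # keyword list
BLACKLIST_TOPICS |= {"app", "demo", "test"}                    # blacklist set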

👤 Author

Mehran Asgari


📄 License

This project is licensed under the Apache 2.0 License – see the LICENSE file for details.
