Support Newer Hugging Face Model for Text Vectorization #4

@anglerfishlyy

Description

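Problem: iQual’s text vectorization uses sentence-transformers (e.g., older models like all-MiniLM-L6-v1). Newer models like all-MiniLM-L12-v2 may offer better accuracy; because the L12 model has more transformer layers than the L6 models, the efficiency trade-off should be confirmed by benchmarking.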
Proposed Solution: Update src/iqual/text_features.py to support all-MiniLM-L12-v2 as an option in add_text_features. This would:

  • Add a parameter to select the model (defaulting to the current model).
  • Update notebook examples (Basic Modelling) to demo the new model.
  • Include performance benchmarks (e.g., accuracy on politeness dataset).
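
A minimal sketch of what the model-selection parameter could look like. The names `SUPPORTED_MODELS`, `DEFAULT_MODEL`, `resolve_model_name`, and `embed_texts` are hypothetical and not part of iQual; the default shown is only assumed to match the current behavior, and the real signature of add_text_features may differ. Only `SentenceTransformer(...)` and `.encode(...)` are actual sentence-transformers calls.

```python
from typing import Optional, Sequence

# Hypothetical registry of allowed models (illustrative, not iQual's API).
SUPPORTED_MODELS = {
    "all-MiniLM-L6-v1": "sentence-transformers/all-MiniLM-L6-v1",
    "all-MiniLM-L12-v2": "sentence-transformers/all-MiniLM-L12-v2",
}

# Assumption: the current default is one of the older L6 models.
DEFAULT_MODEL = "all-MiniLM-L6-v1"


def resolve_model_name(model: Optional[str] = None) -> str:
    """Map a short model name to its Hugging Face identifier."""
    name = model or DEFAULT_MODEL
    if name not in SUPPORTED_MODELS:
        raise ValueError(
            f"Unknown model {name!r}; choose one of {sorted(SUPPORTED_MODELS)}"
        )
    return SUPPORTED_MODELS[name]


def embed_texts(texts: Sequence[str], model: Optional[str] = None):
    """Encode texts with the selected model (downloads weights on first use)."""
    # Imported lazily so the module loads without the optional dependency.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer(resolve_model_name(model))
    return encoder.encode(list(texts))
```

Calling `embed_texts(texts)` would keep the existing behavior, while `embed_texts(texts, model="all-MiniLM-L12-v2")` would opt in to the newer model, so existing notebooks would not need to change.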

Steps:

  1. Add model option in text_features.py.
  2. Test on sample data (politeness dataset).
  3. Update notebooks/Basic_Modelling.ipynb with example.
  4. Add tests for vectorization output.
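
For step 4, a test of the vectorization output might check shape and numeric sanity along these lines. This is only a sketch: `check_embeddings` and `EXPECTED_DIM` are hypothetical names, and the 384-dimension value is an assumption taken from the published model cards for the MiniLM sentence-transformers models, so it should be verified against the model actually loaded.

```python
import math

# Assumption: both MiniLM models discussed here emit 384-dimensional vectors.
EXPECTED_DIM = 384


def check_embeddings(vectors, n_texts: int, dim: int = EXPECTED_DIM) -> bool:
    """Raise AssertionError if the embeddings have the wrong shape or bad values."""
    if len(vectors) != n_texts:
        raise AssertionError(f"expected {n_texts} rows, got {len(vectors)}")
    for i, row in enumerate(vectors):
        if len(row) != dim:
            raise AssertionError(f"row {i}: expected {dim} values, got {len(row)}")
        if not all(math.isfinite(float(x)) for x in row):
            raise AssertionError(f"row {i}: contains NaN or infinite values")
    return True
```

The helper accepts any sequence of rows, so it should work both on plain lists and on the array that the encoder returns.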

Impact: Could improve iQual’s NLP accuracy, in line with the World Bank’s AI-for-data goals.
Willing to Implement: I can submit a PR with the code and an updated notebook.

@addypy @g4brielvs, seeking your thoughts on adding all-MiniLM-L12-v2 to iQual’s text vectorization to boost NLP accuracy for SDG analysis. Happy to refine benchmarks or model choices per your guidance!
