Skill-Extraction v0.2.2 (ESCO + KSA)

@phanindra-max phanindra-max released this 12 Jun 17:57
· 1 commit to main since this release
106bf7d

Release v0.2.2 · Skill-Extraction Refactor (“ESCO + KSA”)

⚡PR⚡#101

✨ Highlights

  1. Taxonomy-aware skill extraction
    • Integrates a FAISS index of the ESCO skills taxonomy.
    • Skill_Extractor.get_top_esco_skills() now returns {Skill, index, score}, enabling deterministic Skill Tag values (ESCO.<index>).

  2. KSA enrichment with vLLM
    • New helper get_ksa_details() generates Knowledge Required and Task Abilities lists for each skill.
    • Automatically invoked when a GPU/vLLM backend is available.

  3. Unified output schema
    • The extractor returns a tidy DataFrame with seven columns:
      Research ID, Description, Raw Skill, Knowledge Required, Task Abilities, Skill Tag, Correlation Coefficient.
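
To make the Skill Tag convention concrete, here is a minimal sketch that turns hits shaped like the {Skill, index, score} records from item 1 into rows carrying the seven columns from item 3. The helper name to_rows, the sample hits, the empty KSA lists, and the choice to store the similarity score under Correlation Coefficient are all assumptions for illustration; this is not the library's actual code.

```python
# Hypothetical sketch, not the library's implementation.
COLUMNS = [
    "Research ID", "Description", "Raw Skill", "Knowledge Required",
    "Task Abilities", "Skill Tag", "Correlation Coefficient",
]


def to_rows(research_id, description, hits):
    """Build one row per ESCO hit; each hit is {"Skill", "index", "score"}."""
    rows = []
    for hit in hits:
        rows.append({
            "Research ID": research_id,
            "Description": description,
            "Raw Skill": hit["Skill"],
            # Left empty here; KSA enrichment would fill these when available.
            "Knowledge Required": [],
            "Task Abilities": [],
            # Deterministic: the tag depends only on the ESCO index.
            "Skill Tag": f"ESCO.{hit['index']}",
            # Assumption: the similarity score is what lands in this column.
            "Correlation Coefficient": hit["score"],
        })
    return rows


# Made-up sample hits, for illustration only.
sample_hits = [
    {"Skill": "data analysis", "index": 42, "score": 0.91},
    {"Skill": "SQL", "index": 7, "score": 0.85},
]
rows = to_rows("job-001", "Analyze sales data with SQL.", sample_hits)
```

Because the tag is derived only from the index, the same ESCO entry should map to the same Skill Tag across runs.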


🔧 Detailed Changes

Area | Description
utils.py | get_top_esco_skills() enhanced to include ESCO index and similarity score.
llm_methods.py | Added get_ksa_details() plus supporting imports.
skill_extractor.py | • Ensured self.index is always defined. • build_faiss_index_esco() / load_faiss_index_esco() are now instance methods that store the index under laiser/input. • New taxonomy-first pipeline inserted at the top of extractor(); legacy alignment kept as a fallback.

⚠️ Deprecated / To Be Removed

  • align_skills() and align_KSAs() will be dropped in v0.3 once consumers migrate to the new output format.

🚧 Known Issues / Roadmap

  1. JSON parsing in get_ksa_details() needs additional resilience checks.
  2. LLM calls are still executed per skill; batching will come in v0.3.
  3. Duplicate import json lines remain in llm_methods.py.
  4. Consider CPU-only fallback for KSA generation.
  5. Persistence of the ESCO vector index should move to a cloud vector DB.
  6. vLLM is not currently supported on MPS/macOS.
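
As one possible direction for issue 1, the sketch below shows a defensive way to parse an LLM reply that is expected to contain a JSON object with two list fields. The function name parse_ksa_json and the assumption that the reply uses the keys "Knowledge Required" and "Task Abilities" are hypothetical; the actual prompt and response format of get_ksa_details() may differ.

```python
import json
import re


def parse_ksa_json(raw):
    """Best-effort parse of an LLM reply into two lists; never raises."""
    empty = {"Knowledge Required": [], "Task Abilities": []}
    # Take the span from the first "{" to the last "}", ignoring surrounding prose.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    result = {}
    for key in empty:
        value = data.get(key, [])
        # Accept only lists; anything else degrades to an empty list.
        result[key] = [str(v) for v in value] if isinstance(value, list) else []
    return result
```

Returning empty lists instead of raising keeps one malformed reply from aborting the whole extraction run; callers that need to distinguish failures could log the raw text before falling back.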

⬆️ Upgrade Notes

pip install -U laiser==0.2.2 

No changes to input parameters are required, but downstream code should read the new seven-column schema.
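
Downstream code that wants to fail fast on the schema change could check column names before reading the output. The following is a minimal sketch; the helper name missing_columns is hypothetical and not part of laiser.

```python
# Column names as listed in this release's unified output schema.
EXPECTED_COLUMNS = [
    "Research ID", "Description", "Raw Skill", "Knowledge Required",
    "Task Abilities", "Skill Tag", "Correlation Coefficient",
]


def missing_columns(columns):
    """Return the expected column names absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With a pandas DataFrame df returned by the extractor, missing_columns(df.columns) should return an empty list when the output follows the new schema.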


Next up

0.2 → 0.3

  • v0.3 will add batching and drop the deprecated APIs; bug fixes in the meantime will ship as patch releases.