skills_scores

  • Based on Onet skills (or Competencies) and jobs (or Occupations) from the latest (25.1) database, parse the data as a bipartite graph where $V = \{\text{occupations}\} \oplus \{\text{skills}\}$
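    A minimal sketch of this graph with networkx is shown below; the build_bipartite_graph helper and the example triples are illustrative, not the repository's actual code.

    ```python
    # Illustrative sketch: occupation-skill bipartite graph built with networkx.
    import networkx as nx

    def build_bipartite_graph(triples):
        """triples: iterable of (occupation, skill, onet_score)."""
        G = nx.Graph()
        for occupation, skill, score in triples:
            G.add_node(occupation, bipartite="occupation")
            G.add_node(skill, bipartite="skill")
            # Edge weight is the Onet score for this occupation-skill pair.
            G.add_edge(occupation, skill, weight=score)
        return G

    G = build_bipartite_graph([
        ("Administrative Services Managers", "Coordination", 4.0),
        ("Administrative Services Managers", "Writing", 3.5),
    ])
    ```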

  • Pull job postings from USAJobs, which is more thorough and more structured than LinkedIn (we still keep the LinkedIn scraper in the scraper dir and provide a transformer to the schema.org schema), and from the Virginia dataset, which enables better skill extraction; USAJobs is also kept up to date, so we decided not to use the Kaggle dataset of 19,000 job postings posted through the Armenian human resource portal CareerCenter.

  • Transform each job posting to the schema.org JobPosting schema to enable skill extraction
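    A transformed posting has roughly the following shape (the field values are illustrative, not output of the actual transformer):

    ```python
    # Rough shape of a posting after transformation to the schema.org JobPosting
    # schema; the values below are made up for illustration.
    job_posting = {
        "@context": "https://schema.org/",
        "@type": "JobPosting",
        "title": "Administrative Services Manager",
        "description": "Plan, direct, or coordinate supportive services...",
        "hiringOrganization": {"@type": "Organization", "name": "Example Agency"},
        "datePosted": "2021-01-15",
        "employmentType": "FULL_TIME",
        "jobLocation": {
            "@type": "Place",
            "address": {"@type": "PostalAddress", "addressRegion": "VA"},
        },
    }
    ```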

  • Apply data cleansing in utils.py

  • Three most efficient skill extractors (a usage sketch of the fuzzy extractor follows this list):

    • Grammar-based NP chunking

      Note that this extractor does not align the extracted skills to Onet skills

    • Exact match

      using a Trie to optimize the regular expression

    • Fuzzy search

      A fork of SymSpell, up to 1 million times faster thanks to the Symmetric Delete spelling correction algorithm
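    A minimal sketch of fuzzy skill matching with the symspellpy package (the repository uses its own SymSpell fork; the dictionary entries and edit-distance thresholds below are assumptions):

    ```python
    from symspellpy import SymSpell, Verbosity

    # Illustrative skill dictionary; multi-word skills are joined with underscores.
    sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
    for skill in ["critical thinking", "active listening", "programming"]:
        sym_spell.create_dictionary_entry(skill.replace(" ", "_"), 1)

    def fuzzy_match(token):
        """Return the closest dictionary skill within edit distance 2, or None."""
        suggestions = sym_spell.lookup(token, Verbosity.CLOSEST, max_edit_distance=2)
        return suggestions[0].term.replace("_", " ") if suggestions else None

    print(fuzzy_match("programing"))  # -> "programming"
    ```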

  • Generate new scores from the extracted skills and the ontology skills:

    Assume that for job $j$ we have $S$, the set of skills required by $j$ according to the ontology (Onet in this case).

    From the extracted skills we only have the counts of skills. Similar to the traditional co-occurrence-matrix approach in NLP, we use MLE to estimate $P(\text{skill } s \in S \mid \text{job } j)$, i.e. let $q(s \mid j) = \dfrac{count(s, corpus_j)}{\sum_{s' \in S} count(s', corpus_j)}$, where $count(s, corpus_j)$ is the number of occurrences of skill $s$ in $corpus_j$, which in this case is the set of USAJobs job postings for job $j$.

    However, the distribution of counts of the extracted skills is highly biased and skewed: most counts are 0 (those skills never show up in the corpus; note that many Onet skills, for example writing, are unlikely to appear verbatim in job postings), while skills with a positive count usually have a large one, e.g. greater than or equal to 10.

    Thus, to cope with zero counts and assign every skill a positive probability, we apply smoothing (a sketch of these variants follows the list):

    1. add-$\alpha$ [Chen and Goodman, 1999, Goodman, 2001, Lidstone, 1920]

      smoothed column in the result csv

    2. maximum division

      $q(s \mid j)_{max\_divide} = \dfrac{count(s, corpus_j)}{\max_{s' \in S} count(s', corpus_j)}$

      max_divide column in the result csv

    3. smoothed maximum division

      add-$\alpha$ on $q(s \mid j)_{max\_divide}$

      max_divide_smoothed column in the result csv

    4. smoothed minimum division

      min_divide_smoothed column in the result csv

    5. log smoothed

      apply $\log_{10}(\cdot)$ on the add-$\alpha$ estimate.

      log_smoothed column in the result csv
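    A rough sketch of how these estimates could be computed; the function, the value of alpha, and the exact way add-$\alpha$ is applied to the divided counts are assumptions rather than the repository's implementation:

    ```python
    import numpy as np

    def skill_probabilities(counts, alpha=0.1):
        """counts: dict mapping skill -> occurrences in corpus_j (illustrative sketch).
        Assumes at least one skill has a positive count."""
        skills = list(counts)
        c = np.array([counts[s] for s in skills], dtype=float)
        smoothed = (c + alpha) / (c.sum() + alpha * len(c))        # add-alpha
        max_divide = c / c.max()                                   # maximum division
        max_divide_smoothed = (max_divide + alpha) / (max_divide.sum() + alpha * len(c))
        log_smoothed = np.log10(smoothed)                          # log on add-alpha
        return {
            s: dict(smoothed=sm, max_divide=md,
                    max_divide_smoothed=mds, log_smoothed=ls)
            for s, sm, md, mds, ls in zip(skills, smoothed, max_divide,
                                          max_divide_smoothed, log_smoothed)
        }
    ```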

    Other approaches we tried that did not work well:

    1. back-off or Jelinek-Mercer interpolated back-off [Chen and Goodman, 1999, Jelinek and Mercer, 1980]: designed for n-gram language models, not applicable here

    2. Laplace smoothing: works well for trigram models but not here

    3. PMI [Dagan et al., 1994, Turney, 2001, Turney and Pantel, 2010]: defined for bigrams, not suitable for this task.

    4. downweighting skill $s$ for job $j$

      $w_{sj} = count(s, corpus_j) \cdot \ln (\dfrac{|S|}{|corpus_j|})$

      not so effective

    Once we have $q(s \mid j)$ for each $s \in S$, we can update the score $u(corpus_j; ontology)$ for $s$ as a convex combination of $u_0$ and $w_s \cdot u_0$, where $w_s$ is the weight learned from the job postings:

    $$u_{new} = \lambda \cdot u_0 + (1-\lambda) \cdot w_s\cdot u_0$$

    Here we pick $\lambda = 0.9$, so we assign more weight to the Onet scores and use the job postings to capture the contemporary shift in required skills.

    Onet only has scores for these 5 categories, plus Onet's hot/not hot technology skills.

    We use $q(s \mid j)$ to estimate $w_s$, which requires mapping a probability in 0-1 to the 1-5 scale; a sketch of the update is shown below.
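    The sketch assumes a linear probability-to-scale mapping; the function name and the mapping are assumptions, not necessarily what updateScores.py does:

    ```python
    def update_score(u0, q, lam=0.9):
        """u0: original Onet score (1-5); q: q(s|j) in [0, 1]; lam: lambda.
        The linear mapping from probability to the 1-5 scale is an assumption."""
        w_s = 1 + 4 * q                          # map [0, 1] -> [1, 5]
        return lam * u0 + (1 - lam) * w_s * u0

    print(update_score(u0=3.5, q=0.25))          # 0.9*3.5 + 0.1*2.0*3.5 = 3.85
    ```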

NER

We create the NER model as follows (a minimal training sketch follows the list):

  1. Embedding: FastText [Mikolov et al., 2018] trained on web content scraped by crawlers, plus forward and backward Contextual String embeddings trained on a 1-billion-word news corpus [Akbik, Blythe, and Vollgraf, 2018]

    We believe that, with these embeddings combined, each word is represented accurately in both academic and non-academic contexts

  2. 2 BiLSTM layers

  3. CRF

  4. dropout, epoch annealing and learning rate schedulers…
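This embedding stack and BiLSTM-CRF tagger match what the Flair library provides; a minimal sketch is below (the corpus path, column format, embedding identifiers, and hyper-parameters are assumptions, not the repository's actual training script):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style annotated corpus of job-posting sentences.
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("crawl"),           # FastText trained on web crawl data
    FlairEmbeddings("news-forward"),   # forward contextual string embeddings
    FlairEmbeddings("news-backward"),  # backward contextual string embeddings
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    rnn_layers=2,   # 2 BiLSTM layers
    use_crf=True,   # CRF output layer
    dropout=0.1,
)

ModelTrainer(tagger, corpus).train("ner_model/", learning_rate=0.1, max_epochs=50)
```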

Although NER is able to extract useful and more concrete skills (e.g. Onet does not list Russian as a skill, only foreign language, but NER can extract Russian as a skill), if we use the skills extracted by NER we need another ontology to update the scores.

Setup

First, git clone this repository. It is suggested that you create a virtual env for this project so that it does not conflict with your other packages. Note that we use Python 3.6.

Run pip install -r requirements.txt and you should then be done downloading packages.

Then take a look at config.py to configure the project settings.

skills scores

To run the application:

python updateScores.py 11-3012.00 "Administrative Services Managers" "Business Manager" --alpha 0.7

where the Onet id is 11-3012.00 and the corresponding related jobs are Administrative Services Managers and Business Manager

To see all available options, run python updateScores.py --help

Note that the first time you run this command, the Onet data will be downloaded to your local machine.

This script generates two kinds of files whenever it finishes the updating process: 1. pickle files that contain the job postings; 2. csv files that contain the scores, updated scores, probabilities, etc.

Or, if you prefer using a list of job names, use

python main.py "Business Administrator" "Administrative Officer" "Administrative Services Managers" "Mathematicians" "Statisticians" --alpha 0.7 

main.py uses the Onet API to find the corresponding id and outputs the confidence levels. Please note that it only chooses the most confident id.
