generate embeddings from multiple rdf/ttl #91

@ilseva

Hi,
thanks for sharing your work.

We would like to use jRDF2Vec to generate embeddings that serve as a knowledge base for a semantic service engine.
Our starting point is a custom ontology in which some object properties refer to public vocabularies (in RDF format), such as the Frequency Vocabulary, and others refer to our own custom vocabularies.

The approach we follow is:

  • store the ontology in a TTL file
  • download all public vocabularies in RDF format and store them in files
  • create our custom vocabularies and store them in TTL files (TTL is simpler to write)
  • create "individuals" based on our ontology in TTL and store them in files
  • generate walks for each of the files above with
    java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph <ttl_file|rdf_file> -onlyWalks -walkDirectory <custom_folder>
  • move each generated walk_file_0.txt.gz to a specific folder to avoid overwriting it
  • merge the walk files with
    java -jar jrdf2vec-1.2-SNAPSHOT.jar -mergeWalks -walkDirectory <specific_folder> -o <merged_walks>
  • generate embeddings with
    java -jar jrdf2vec-1.2-SNAPSHOT.jar -onlyTraining -light entities.txt -minCount 5 -walkDirectory <specific_folder>
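
The steps above could be scripted roughly as follows. This is only a sketch of our workflow, not a confirmed jRDF2Vec recipe: it uses exactly the flags from the commands listed above, the helper names (`walk_command`, `merge_command`, `train_command`, `collect_walk_file`) are hypothetical, and renaming each walk file after its source graph is our own assumption for avoiding overwrites (whether `-mergeWalks` picks up renamed files still needs to be verified).

```python
from pathlib import Path
import shutil

# Jar name as used in the commands above.
JAR = "jrdf2vec-1.2-SNAPSHOT.jar"


def walk_command(graph_file, walk_dir):
    """Build the walk-generation command for one TTL/RDF file."""
    return ["java", "-jar", JAR, "-graph", str(graph_file),
            "-onlyWalks", "-walkDirectory", str(walk_dir)]


def merge_command(walk_dir, merged_file):
    """Build the command that merges all walk files in walk_dir."""
    return ["java", "-jar", JAR, "-mergeWalks",
            "-walkDirectory", str(walk_dir), "-o", str(merged_file)]


def train_command(walk_dir, entities_file="entities.txt", min_count=5):
    """Build the training command with the options listed above."""
    return ["java", "-jar", JAR, "-onlyTraining", "-light", str(entities_file),
            "-minCount", str(min_count), "-walkDirectory", str(walk_dir)]


def collect_walk_file(walk_dir, target_dir, graph_file):
    """Move walk_file_0.txt.gz into target_dir under a per-graph name.

    Assumption: prefixing the file with the graph's stem is our own
    convention so that walks from different graphs do not collide.
    """
    source = Path(walk_dir) / "walk_file_0.txt.gz"
    target = Path(target_dir) / f"{Path(graph_file).stem}_walk_file_0.txt.gz"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; after every walk run, `collect_walk_file` is called before the next graph is processed.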

Is this process correct?
If not, could you point out how to change it?

Furthermore, following the Jupyter notebook examples in your codebase, we tried to find the most similar "concepts" in our model, but we ran into behaviour we do not understand: if we build the query from keys that belong both to public vocabularies and to our individuals, the results contain only concepts from the public vocabularies (the concepts we expected to be similar to our individuals do not seem to be considered).

from typing import List, Optional

from gensim.models import KeyedVectors

kv_file = "../merged_walks/model.kv"
vectors = KeyedVectors.load(kv_file, mmap='r')

def closest(word_vectors: KeyedVectors, concepts: List[str], negatives: Optional[List[str]] = None) -> None:
    """Print the 50 keys most similar to the given concepts."""
    print(f"Closest concept to: {concepts}")
    for other_concept, confidence in word_vectors.most_similar(positive=concepts, negative=negatives, topn=50):
        print(f"{other_concept} ({confidence})")

# Query mixing one of our individuals with two public-vocabulary concepts.
closest(word_vectors=vectors, concepts=[
    "https://our-custom-namespace/subjects/CustomSubject",
    "http://publications.europa.eu/resource/authority/country/AUT",
    "http://inspire.ec.europa.eu/theme/lc",
])

Output:
Closest concept to: ['http://inspire.ec.europa.eu/theme/lc', 'http://publications.europa.eu/resource/authority/country/AUT', 'https://our-custom-namespace/subjects/CustomSubject']
http://publications.europa.eu/resource/authority/country/ESP (0.9480347037315369)
http://publications.europa.eu/resource/authority/country/CYP (0.9446530342102051)
http://publications.europa.eu/resource/authority/country/REU (0.9442076086997986)
http://publications.europa.eu/resource/authority/country/SWE (0.9437082409858704)
http://publications.europa.eu/resource/authority/country/ROU (0.9435302019119263)
http://publications.europa.eu/resource/authority/country/FIN (0.9430405497550964)
http://publications.europa.eu/resource/authority/country/BLR (0.9423658847808838)
http://publications.europa.eu/resource/authority/country/PNG (0.9423239231109619)
http://publications.europa.eu/resource/authority/country/BEL (0.9419782757759094)
http://publications.europa.eu/resource/authority/country/EST (0.9419098496437073)
http://publications.europa.eu/resource/authority/country/HRV (0.9417317509651184)
http://publications.europa.eu/resource/authority/country/MLT (0.9416428804397583)
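
To narrow this down, two checks seem useful: whether the individuals we expect as neighbours are in the model's vocabulary at all, and whether they show up once the results are restricted to our namespace. The sketch below uses two hypothetical helpers (`missing_keys`, `filter_by_prefix`) written in plain Python. It assumes gensim 4, where `KeyedVectors.key_to_index` holds the vocabulary. It is also only an assumption on our side that `-minCount 5` behaves like word2vec's `min_count` and could drop rarely occurring individuals.

```python
from typing import Iterable, List, Tuple


def missing_keys(vocabulary: Iterable[str], keys: Iterable[str]) -> List[str]:
    """Return the keys that are not part of the model vocabulary."""
    vocab = set(vocabulary)
    return [key for key in keys if key not in vocab]


def filter_by_prefix(results: Iterable[Tuple[str, float]],
                     prefix: str) -> List[Tuple[str, float]]:
    """Keep only (key, score) pairs whose key starts with the given prefix."""
    return [(key, score) for key, score in results if key.startswith(prefix)]
```

With the model loaded as `vectors`, this could be used as `missing_keys(vectors.key_to_index, expected_individuals)` and `filter_by_prefix(vectors.most_similar(positive=concepts, topn=1000), "https://our-custom-namespace/")`, where `expected_individuals` is a list of our own IRIs and the large `topn` is an arbitrary choice.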

Thanks for your support.
Sevastian
