Hi,
thanks for sharing your work.
We would like to use jRDF2Vec to generate embeddings that serve as a knowledge base for a semantic service engine.
Our starting point is a custom ontology in which some object properties refer to public vocabularies (in RDF format), such as the Frequency Vocabulary, and others refer to our custom vocabularies.
The approach we follow is:
- store the ontology in a TTL file
- download all public vocabularies in RDF format and store them in files
- create our custom vocabularies and store them in TTL files (TTL is simpler to write)
- create "individuals" based on our ontology in TTL and store them in files
- generate walks for each of the files above with
java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph <ttl_file|rdf_file> -onlyWalks -walkDirectory <custom_folder>
- move the resulting walk_file_0.txt.gz to a specific folder to avoid overwriting it
- merge the walk files with
java -jar jrdf2vec-1.2-SNAPSHOT.jar -mergeWalks -walkDirectory <specific_folder> -o <merged_walks>
- generate embeddings with
java -jar jrdf2vec-1.2-SNAPSHOT.jar -onlyTraining -light entities.txt -minCount 5 -walkDirectory <specific_folder>
Is this process the correct one?
If not, could you point out how to change it?
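For reference, the steps above could be scripted roughly as follows. This is only a minimal sketch of our current process, not a recommended pipeline: it uses only the flags and the jar name from the commands listed above, while the function names, the folder arguments, and the renaming scheme in `collected_walk_name` are hypothetical. Whether the merge step picks up renamed walk files is an assumption we have not verified.

```python
from pathlib import Path

JAR = "jrdf2vec-1.2-SNAPSHOT.jar"  # jar name taken from the commands above


def walk_command(graph_file: str, walk_dir: str) -> list[str]:
    """Command that only generates walks for one TTL/RDF file."""
    return ["java", "-jar", JAR, "-graph", graph_file,
            "-onlyWalks", "-walkDirectory", walk_dir]


def merge_command(walk_dir: str, merged_file: str) -> list[str]:
    """Command that merges all walk files found in walk_dir."""
    return ["java", "-jar", JAR, "-mergeWalks",
            "-walkDirectory", walk_dir, "-o", merged_file]


def training_command(walk_dir: str, entities_file: str = "entities.txt",
                     min_count: int = 5) -> list[str]:
    """Command that only trains embeddings on existing walks."""
    return ["java", "-jar", JAR, "-onlyTraining", "-light", entities_file,
            "-minCount", str(min_count), "-walkDirectory", walk_dir]


def collected_walk_name(graph_file: str) -> str:
    """Hypothetical naming scheme: one walk file per input graph."""
    return f"{Path(graph_file).stem}_walk_file_0.txt.gz"


def plan(graph_files: list[str], tmp_dir: str, collect_dir: str,
         merged_file: str) -> list[tuple[str, list[str]]]:
    """Return the ordered steps as ("run", command) or ("move", [src, dst])."""
    steps: list[tuple[str, list[str]]] = []
    for graph_file in graph_files:
        steps.append(("run", walk_command(graph_file, tmp_dir)))
        # Move the freshly written walk file so the next run cannot overwrite it.
        src = str(Path(tmp_dir) / "walk_file_0.txt.gz")
        dst = str(Path(collect_dir) / collected_walk_name(graph_file))
        steps.append(("move", [src, dst]))
    steps.append(("run", merge_command(collect_dir, merged_file)))
    steps.append(("run", training_command(collect_dir)))
    return steps
```

Each "run" step could then be executed with `subprocess.run(command, check=True)` and each "move" step with `shutil.move(src, dst)`; keeping the plan as plain data makes it easy to print and review before anything is run.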
Furthermore, following the Jupyter Notebooks example you provide, we tried to find the most similar "concepts" in our model, but we ran into an issue we cannot explain: if we build the query from keys that belong both to public vocabularies and to our individuals, the results only contain similar concepts from the public vocabularies (the expected concepts similar to our individuals do not seem to be considered).
from typing import Optional
from gensim.models import KeyedVectors

kv_file = "../merged_walks/model.kv"
vectors = KeyedVectors.load(kv_file, mmap='r')

def closest(word_vectors: KeyedVectors, concepts: list[str], negatives: Optional[list[str]] = None) -> None:
    # Print the 50 keys most similar to the combined query concepts.
    print(f"Closest concept to: {concepts}")
    for other_concept, confidence in word_vectors.most_similar(positive=concepts, negative=negatives, topn=50):
        print(f"{other_concept} ({confidence})")

# Query mixing one of our individuals with two public vocabulary concepts.
closest(word_vectors=vectors, concepts=[
    "https://our-custom-namespace/subjects/CustomSubject",
    "http://publications.europa.eu/resource/authority/country/AUT",
    "http://inspire.ec.europa.eu/theme/lc",
])
Closest concept to: ['http://inspire.ec.europa.eu/theme/lc', 'http://publications.europa.eu/resource/authority/country/AUT', 'https://our-custom-namespace/subjects/CustomSubject']
http://publications.europa.eu/resource/authority/country/ESP (0.9480347037315369)
http://publications.europa.eu/resource/authority/country/CYP (0.9446530342102051)
http://publications.europa.eu/resource/authority/country/REU (0.9442076086997986)
http://publications.europa.eu/resource/authority/country/SWE (0.9437082409858704)
http://publications.europa.eu/resource/authority/country/ROU (0.9435302019119263)
http://publications.europa.eu/resource/authority/country/FIN (0.9430405497550964)
http://publications.europa.eu/resource/authority/country/BLR (0.9423658847808838)
http://publications.europa.eu/resource/authority/country/PNG (0.9423239231109619)
http://publications.europa.eu/resource/authority/country/BEL (0.9419782757759094)
http://publications.europa.eu/resource/authority/country/EST (0.9419098496437073)
http://publications.europa.eu/resource/authority/country/HRV (0.9417317509651184)
http://publications.europa.eu/resource/authority/country/MLT (0.9416428804397583)
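To narrow the problem down, a small helper like the one below could show how many keys of each namespace ended up in the model and restrict a result list to one namespace. This is only a sketch: the helper names are hypothetical, the functions work on plain lists of keys and (key, score) pairs, and the sample data mixes two entries from the output above with a made-up key and score for our namespace.

```python
from collections import Counter


def namespace_of(key: str, namespaces: list[str]) -> str:
    """Return the first namespace prefix the key starts with, else "other"."""
    for namespace in namespaces:
        if key.startswith(namespace):
            return namespace
    return "other"


def count_by_namespace(keys: list[str], namespaces: list[str]) -> Counter:
    """Count how many keys fall under each namespace prefix."""
    return Counter(namespace_of(key, namespaces) for key in keys)


def filter_by_namespace(results: list[tuple[str, float]],
                        namespace: str) -> list[tuple[str, float]]:
    """Keep only the (key, score) pairs whose key starts with the namespace."""
    return [(key, score) for key, score in results if key.startswith(namespace)]


CUSTOM_NS = "https://our-custom-namespace/"
PUBLIC_NS = "http://publications.europa.eu/resource/authority/"

# Sample data: the first two entries come from the output above; the last
# entry is a made-up key and score, used for illustration only.
sample_results = [
    ("http://publications.europa.eu/resource/authority/country/ESP", 0.9480347037315369),
    ("http://publications.europa.eu/resource/authority/country/CYP", 0.9446530342102051),
    ("https://our-custom-namespace/subjects/OtherSubject", 0.5),
]

counts = count_by_namespace([key for key, _ in sample_results], [CUSTOM_NS, PUBLIC_NS])
custom_only = filter_by_namespace(sample_results, CUSTOM_NS)
print(counts)
print(custom_only)
```

With the model loaded as above, and assuming gensim 4.x where `KeyedVectors` exposes `index_to_key`, the same helpers could be called as `count_by_namespace(vectors.index_to_key, [CUSTOM_NS, PUBLIC_NS])` and `filter_by_namespace(vectors.most_similar(positive=[...], topn=len(vectors.index_to_key)), CUSTOM_NS)`. A very low count for our namespace would suggest that our individuals are underrepresented in the walks or were dropped by the minCount threshold.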
Thanks for your support.
Sevastian