generate embeddings from multiple rdf/ttl

Hi,
thanks for sharing your work.

We would like to use jRDF2Vec to generate embeddings to have a base of knowledge for a semantic service engine.
Our starting point is a custom ontology in which some of the object properties refer to public vocabularies (in RDF format), such as the [Frequency Vocabulary](https://www.dublincore.org/specifications/dublin-core/collection-description/frequency), and others refer to our custom vocabularies.

The approach we follow is:
- store the ontology in a TTL file
- download all public vocabularies in RDF format and store them in files
- create our custom vocabularies and store them in TTL files (TTL is simpler to write)
- create "individuals" based on our ontology in TTL and store them in files
- generate walks for each of the files above with
`java -jar jrdf2vec-1.2-SNAPSHOT.jar -graph <ttl_file|rdf_file> -onlyWalks -walkDirectory <custom_folder>`
- move the resulting `walk_file_0.txt.gz` to a specific folder to avoid overwriting it
- merge the walk files with
`java -jar jrdf2vec-1.2-SNAPSHOT.jar -mergeWalks -walkDirectory <specific_folder> -o <merged_walks>`
- generate embeddings with
`java -jar jrdf2vec-1.2-SNAPSHOT.jar -onlyTraining  -light entities.txt -minCount 5 -walkDirectory <specific_folder>`

Is this process the correct one?
If not, could you point out how to change it?
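
For reference, the walk generation and collection steps above can be sketched in Python roughly as follows. This is only a minimal sketch: `walk_command` and `collect_walk_file` are hypothetical helper names, the renaming scheme is our own choice, and whether jRDF2Vec picks up renamed walk files when merging is an assumption we have not verified.

```python
import shutil
from pathlib import Path

JAR = "jrdf2vec-1.2-SNAPSHOT.jar"  # jar name as used in the commands above


def walk_command(graph_file: str, walk_dir: str) -> list[str]:
    """Build the command that generates walks for a single TTL/RDF file."""
    return ["java", "-jar", JAR, "-graph", graph_file,
            "-onlyWalks", "-walkDirectory", walk_dir]


def collect_walk_file(walk_dir: str, merge_dir: str, graph_file: str) -> Path:
    """Move walk_file_0.txt.gz into merge_dir under a name derived from the graph file.

    Assumption: a unique name per input file is enough to avoid overwriting.
    """
    source = Path(walk_dir) / "walk_file_0.txt.gz"
    target_dir = Path(merge_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{Path(graph_file).name}.walks.txt.gz"
    shutil.move(str(source), str(target))
    return target
```

Each command could then be executed with `subprocess.run(walk_command(graph, walk_dir), check=True)`, followed by `collect_walk_file(walk_dir, merge_dir, graph)` for the same input file.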

Furthermore, following the example Jupyter notebooks in your repository, we tried to find the most similar "concepts" in our model, but we ran into something we do not understand: if we build the query from keys that belong both to public vocabularies and to our individuals, the results refer only to concepts from the public vocabularies (the concepts we expected, similar to our individuals, do not seem to be considered).
```python
from gensim.models import KeyedVectors

kv_file = "../merged_walks/model.kv"
vectors = KeyedVectors.load(kv_file, mmap='r')

def closest(word_vectors: KeyedVectors, concepts: [str], negatives: [str]=None) -> None:
    print(f"Closest concept to: {concepts}")
    for other_concept, confidence in word_vectors.most_similar(positive=concepts, negative=negatives, topn=50):
        print(f"{other_concept} ({confidence})")

closest(word_vectors=vectors, concepts=[
      "https://our-custom-namespace/subjects/CustomSubject"
    , "http://publications.europa.eu/resource/authority/country/AUT"
    , "http://inspire.ec.europa.eu/theme/lc"])
```
```
Closest concept to: ['http://inspire.ec.europa.eu/theme/lc', 'http://publications.europa.eu/resource/authority/country/AUT', 'https://our-custom-namespace/subjects/CustomSubject']
http://publications.europa.eu/resource/authority/country/ESP (0.9480347037315369)
http://publications.europa.eu/resource/authority/country/CYP (0.9446530342102051)
http://publications.europa.eu/resource/authority/country/REU (0.9442076086997986)
http://publications.europa.eu/resource/authority/country/SWE (0.9437082409858704)
http://publications.europa.eu/resource/authority/country/ROU (0.9435302019119263)
http://publications.europa.eu/resource/authority/country/FIN (0.9430405497550964)
http://publications.europa.eu/resource/authority/country/BLR (0.9423658847808838)
http://publications.europa.eu/resource/authority/country/PNG (0.9423239231109619)
http://publications.europa.eu/resource/authority/country/BEL (0.9419782757759094)
http://publications.europa.eu/resource/authority/country/EST (0.9419098496437073)
http://publications.europa.eu/resource/authority/country/HRV (0.9417317509651184)
http://publications.europa.eu/resource/authority/country/MLT (0.9416428804397583)
```
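
One check that might help narrow this down is counting how often each query key occurs in the walk files, since a key that appears in only a handful of walks would have received little training signal. The following is a minimal sketch, not part of jRDF2Vec: `count_tokens` is a hypothetical helper, and it assumes the walk files are gzipped text with one walk per line and whitespace-separated tokens.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_tokens(walk_dir: str, keys: list[str]) -> Counter:
    """Count how often each key occurs as a token in the gzipped walk files."""
    wanted = set(keys)
    counts = Counter({key: 0 for key in keys})  # keep keys that never occur
    for walk_file in sorted(Path(walk_dir).glob("*.gz")):
        with gzip.open(walk_file, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                # Assumption: one walk per line, tokens separated by whitespace.
                for token in line.split():
                    if token in wanted:
                        counts[token] += 1
    return counts
```

Calling `count_tokens("<specific_folder>", [...])` with the three keys from the query above would show whether `https://our-custom-namespace/subjects/CustomSubject` is much rarer in the walks than the two public-vocabulary keys.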

Thanks for your support.
Sevastian
