Notes for GSOC students

Justin Clark-Casey edited this page May 17, 2018 · 1 revision

Please read this wiki and the various links to get a feel for this project.
Please set up the crawler and the frontend. The crawler only fetches a very small number of pages. Note that beta.synbiomine.org is temporarily broken, so you may have to comment that crawl target out until I get it fixed.
This is a project that will eventually crawl Bioschemas markup embedded in InterMine (only with dev instances in the GSOC timeframe), but it is a separate set of projects from InterMine itself. It's also intended that Buzzbang will crawl many other life sciences websites embedding Bioschemas information.
Very few life sciences websites currently embed Bioschemas data, though this is subject to change. For the GSOC timeframe, the likely crawl target will be the EBI's biosamples website. This is much larger than any crawl target to date, so it will need careful thought about scalability. A GSOC implementation does not need to complete a crawl, but it should be capable of doing so in a reasonable timeframe given reasonable server and network resources. That said, the results of an ongoing crawl must be searchable.
We can assume that pages on the EBI site can be found via its sitemap.xml. Therefore, in the GSOC timeframe there is no strict need for a more sophisticated link-following crawl, or for one that renders the page in a headless web browser before extracting the JSON-LD. However, it would be good for a proposed design to anticipate that possibility later on.
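To make the sitemap-driven approach concrete, here is a minimal sketch using only the Python standard library. It is not Buzzbang's actual code: the names `sitemap_urls`, `JsonLdParser` and `extract_jsonld` are hypothetical, and it assumes the JSON-LD is present in the static HTML inside `<script type="application/ld+json">` elements (so no headless browser is needed). Fetching the pages is left out.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Standard namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the URLs found in the <loc> elements of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of every JSON-LD <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Assumption: skip malformed markup instead of aborting the crawl.


def extract_jsonld(html):
    """Return a list of JSON-LD documents embedded in an HTML page."""
    parser = JsonLdParser()
    parser.feed(html)
    return parser.documents
```

A crawler built this way would fetch each URL returned by `sitemap_urls` and pass the response body to `extract_jsonld`. Keeping URL discovery and extraction as separate functions is one way to leave room for a link-following or browser-rendering stage later. A sitemap index file, which lists further sitemaps, would need one more level of fetching.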
Bioschemas and this project are actively evolving, so expect change! Priorities may shift, with some features becoming more important and others much less so. Other InterMine GSOC projects may be more stable if you prefer a more predictable environment. However, Buzzbang offers the chance to contribute to a developing area where there is relatively little existing code and currently few competing projects.
I like design and code that are as simple as possible but not too simple. So, focus on the present requirement (crawling, indexing and searching the large amount of JSON-LD in the EBI biosamples database). Don't propose a complex design that tries to cover every possible eventuality: some of these may never occur, and others may make the code more difficult to understand. At the same time, try to be careful not to create a design that makes future extension difficult.
Please feel free to ask me (justincc AT intermine.org) any questions.