— For conducting rapid scoping literature reviews
From: Team #semanticClimate – https://semanticclimate.github.io/p/en/
First release: 2025-07-22
Repo and info | Discussion and support
Course outline: | Software citation information: CITATION.cff
Course units are:
- Corpus creation: Open Access repository retrieval and analysis
- Vibe coding: Coding with an AI assistant — Corpus creation
- PDF summarisation
- Image classification
- Image extraction
- Named entity recognition (NER): Entities and concepts
- LLM and RAG: For PDF and HTML
You are welcome to use the experimental framework on your own AI assisted literature review. Please get in contact on the discussion board if you have comments or questions.
Additions: A template GitHub repository for running your own AI LLM RAG for a literature review project is provided. It contains software and instructions. The template has been set up for the IPCC AR6 Corpus from the Climate Knowledge Graph, but it can be configured to hold any corpus or literature review mini-corpus.
URL: https://github.com/semanticClimate/llmrag
The course was run as part of Force11 Scholarly Communications Institute (FSCI) in collaboration with UCLA Library, see course: https://force11.org/fsci/post/fsci-2025-courses-abstracts/#e01 — E01 AI-Assisted Literature Review on Open Access Repositories: Including Image and Object Detection
Thank you to the FSCI organisers and technical support team for the smooth running of the sessions.
Course dates: Tuesday 2025-07-22, Wednesday 2025-07-23, and Thursday 2025-07-24. Each class will have a Zoom session at the same time on each of the three days, as specified in its course abstract. Course times are listed in Pacific time (UTC-7).
The course covers how to use a self-hosted open-source AI LLM RAG (Retrieval-Augmented Generation) assisted literature review system with supporting user learning material. The system uses open access literature repositories; is based on open science (open scholarship) principles; is globally equitable, inclusive, and multilingual; and is independent of commercial providers.
Participants will be able to self-host their own open-source AI LLM RAG system with no dependency on commercial providers, and to adapt the system to different retrieval and knowledge query use cases. See: semanticClimate llmrag repo.
The Assisted Literature Review (ALR) course covers instruction for a semi-automated literature search with a focus on AI LLM RAG use on a dedicated corpus. In the course the example corpus will be the IPCC’s Sixth Assessment Report. The framework can be used on any topic or corpus, for example the open access literature from Europe PMC, a corpus of 7 million open access articles.
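As a rough illustration of building a mini-corpus from Europe PMC, the sketch below queries the public Europe PMC REST search endpoint for open access articles on a topic. It is not the #semanticClimate tooling used in the course. The function names `build_search_url` and `fetch_titles` are hypothetical, and the `OPEN_ACCESS:y` filter and the JSON field names are assumptions based on the Europe PMC web service documentation that should be checked before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(topic: str, page_size: int = 25) -> str:
    """Build a search URL for open access articles matching a topic phrase."""
    params = {
        # Assumption: OPEN_ACCESS:y restricts results to open access articles.
        "query": f'"{topic}" AND OPEN_ACCESS:y',
        "format": "json",
        "pageSize": page_size,
        "resultType": "lite",
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def fetch_titles(topic: str, page_size: int = 25) -> list[str]:
    """Fetch one page of results and return the article titles (needs network)."""
    with urlopen(build_search_url(topic, page_size), timeout=30) as response:
        payload = json.load(response)
    # Assumption: results are nested under resultList -> result.
    results = payload.get("resultList", {}).get("result", [])
    return [item.get("title", "") for item in results]
```

Calling `fetch_titles("climate justice", page_size=10)` in a Colab cell would return up to ten titles. For a full corpus, the identifiers in each result would then be used to download the full text.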
The AI and machine learning open-source software used is the #semanticClimate text and data mining tooling. The course is an introduction to AI algorithms for data mining, including LLM RAG frameworks. A template ‘good practice’ framework will be provided for participants' later use. This course introduces literature search, text mining, image classification, and object detection. The algorithms and the data used are all open source, and trustworthiness for open science is a priority.
All instruction is carried out using Colab Jupyter Notebooks, so no complicated installations are required.
The learning points covered build familiarity with AI tooling for literature search, delivered as a package that students and researchers can reuse. The learning package already exists as a fully documented workflow with existing Colab Notebooks, all deposited in Zenodo with DOIs. The intention is to give participants experience and methodologies to evaluate and integrate AI LLM RAG into their workflows. AI is evolving so fast that focusing on one fixed set of components in a tech stack is not possible; instead, the focus is on concepts and evaluation.
The focus for the class is a scoping literature review. The result of the AI Assisted Literature Review workflow taught in the class is a literature review report that includes a textual summary, summaries of papers as a data table, the complete full-text articles downloaded, and a reproducible and replicable Colab Notebook with all the software and code used in the review. The resulting content package can be used in papers, reporting, dashboards, CI pipelines, and for further data analysis.
Course chair: Simon Worthington, Climate Knowledge Graph project lead (TIB — Leibniz Information Centre for Science and Technology and University Library) and #semanticClimate member.
Gitanjali Yadav, National Institute of Plant Genome Research (NIPGR) (Co-course chair); Peter Murray-Rust, Cambridge University (Co-course chair); Renu Kumari, National Institute of Plant Genome Research (NIPGR) (Co-instructor).
Additional Contributors: Shabnam Barbhuiya, Jamia Millia Islamia University (Co-instructor); Ambreen Hamadani, Sher-e-Kashmir University of Agricultural Sciences and Technology of Kashmir (SKUAST-K) (Co-instructor); Moobashara Jawed, Jamia Millia Islamia University (Co-instructor); Anna Rahr, Hannover University of Applied Sciences and Arts, and TIB (Co-instructor); Avika Joshi, Delhi Technological University; Deepika Mandakala, Vignan's Institute of Information Technology (A), Visakhapatnam; Anudev Suresh, Jamia Millia Islamia University; Haarthi Vallabhaneni, DVR & Dr. HS MIC College of Technology; Saurav Mishra, National Institute of Technology; Anushka Kushwaha, NIIT University; Malavika Balachandran, University of Toronto; Harshita Mahawar, Amity University, Noida; Shaik Zainab, Anurag University, Hyderabad.
Audience: Researchers, librarians, publishers
Level: (Beginner, but suitable for all levels)
Requirements: Run Google Colab in a browser. See: https://colab.research.google.com/ A Google account is needed to run Colab Jupyter Notebooks (if this is not possible, users can run Notebooks in their own environments, but please check with course organisers for support). Have a GitHub account (other Git platforms can be used, such as GitLab or Codeberg; contact course organisers in advance if this is required). Some free-to-use API keys are also required.
At the end of the course, participants will be able to:
- Conduct a scoping literature review using AI LLM RAG tooling
- Obtain familiarity with using LLM RAG with PDFs and with HTML
- Be able to carry out text and data mining and build a corpus from open access sources such as EPMC
- Use Colab Jupyter Notebooks, execute Python commands, and use GitHub.
- Use the provided ‘good practice’ framework for managing LLM projects
- Corpus creation: OA repository retrieval and analysis
- Vibe coding: Coding with an AI assistant — Corpus creation
- PDF summarisation
- Image classification
- Image extraction
- Named entity recognition (NER): Entities and concepts
- LLM and RAG: For PDF and HTML
Renu Kumari
- Notebook –
- Output
- Presentation
Peter Murray-Rust
Shabnam Barbhuiya
- Notebook –
- Presentation
Ambreen Hamadani
Avika Joshi
- FigSense
– Code:
Moobashara Jawed
- Notebook –
- Presentation
Shabnam Barbhuiya and Anna Rahr
- Notebook for LLM RAG from PDF/XML –
- Repo including Notebook for LLM RAG from HTML – not shown in the course
- Day 1: Corpus Creation; Vibe Coding
- Day 2: PDF Summarization; Image classification; FigSense demo — Image extraction
- Day 3: Named Entity Recognition; RAG LLM with PDF-XML
#semanticClimate runs an ongoing internship coordinated by NIPGR. Students from the programme carry out literature reviews on chapters of the IPCC Sixth Assessment Report (AR6). Below are video presentations from the students reporting on their respective literature reviews.
- AR6/WG2/Chapter04 – Water
- By Ms. Haarthi Vallabhaneni
- AR6/WG2/Chapter08 – Poverty, Livelihood, Sustainable development
- By Malavika Balachandran
- AR6/WG1/Chapter08 – Water Cycle Changes
- By Ms. Anushka Kushwaha
- AR6/WG1/Chapter04 – Future Global Climate
- By Ms. Deepika Mandakala
- AR6/WG2/Chapter-06 – Cities, settlements and key infrastructure
- By Harshita Mahawar
An example of #semanticClimate tooling being used on the question of Climate Justice, 2024.
A supporting step-by-step guide is also available.
Worthington, Simon, Renu Kumari, Peter Murray-Rust, Gitanjali Yadav, Shweata N Hegde, and Bhadra Parijat. “Creating the Climate Justice Dictionary — A Step-by-step Guide”. semanticClimate, August 12, 2025. https://doi.org/10.5281/zenodo.16813353.
Bookmark these Git repositories:
- https://github.com/semanticClimate/assited-literature-review and
- https://github.com/semanticClimate/llmrag
#semanticClimate tools and resources: https://semanticclimate.github.io/p/en/posts/resources/
A software and learning framework for a self-hosted open-source AI assisted literature review, based on open access, global equity, and open science
2nd July 2025
LLM RAG GitHub repository (version for working with IPCC reports): https://github.com/semanticClimate/llmrag
The mission is to build a self-hosted open-source AI LLM RAG (Retrieval-Augmented Generation) assisted literature review system with supporting user learning material. The system uses open access literature repositories; is based on open science (open scholarship) principles; is globally equitable, inclusive, and multilingual; and is independent of commercial providers.
- To create an AI LLM RAG self-hosted platform and infrastructure framework for literature retrieval and knowledge query.
- Provide supporting learning resources for AI LLM RAG conceptual models and methods using ‘learning-and-understanding-by-doing’: the system's parts and how it impacts the information landscape.
- The system and its supporting learning material are designed so that newcomers can build competences and understand the principles involved.
- An approach that implements Global South knowledge participation parity (KPP)*, is global in scope and multilingual.
- To use only open access literature and FAIR data sources for AI LLM use.
- How to manage and use your own open-source AI LLM platform and infrastructure that is accessible to all, is designed for digital sovereignty, and is fully open science based from the start, in that it uses open source software and creates FAIR data outputs.
- To create data sets, such as vector databases, of your own that are fully open source, open licenced, open science in all aspects — including FAIR data.
- Create a Git based template and methodology for working with AI LLMs. The template allows different technologies to be interchanged or for use in different application use cases — chatbots, for research, literature reviews, text search, derivative publishing, etc.
- The technical architecture can run on the command line, in Jupyter Notebooks and Google Colab, and as a Streamlit software UI. It acts as a complete pipeline system that goes from raw-text to knowledge and query.
- To enable an understanding of the components needed for an AI LLM system.
- The system is designed for working with and creating text corpora.
- Aware of climate change impact issues.
*Analogous to the economics framework of purchasing power parity (PPP) which is used for setting regional pricing of products. For PPP indices see: Eurostat data. PPPs convert different currencies into a common unit which equalizes their purchasing power and eliminates differences in price levels between economies. (Eurostat) Please get in contact directly for prices.
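To make the "raw text to knowledge and query" pipeline listed in the goals above more concrete, here is a deliberately small sketch of the retrieval half of RAG using only the Python standard library. It is not the llmrag implementation: bag-of-words cosine similarity stands in for the embedding model and vector database a real system would use, and the function names (`chunk`, `retrieve`, `build_prompt`) are hypothetical. The final prompt would be passed to whichever self-hosted LLM is configured.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text: str, size: int = 40) -> list[str]:
    """Split raw text into consecutive chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks,
                    key=lambda c: cosine(q, Counter(tokenize(c))),
                    reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The design point is that the LLM only sees passages retrieved from the user's own corpus, which is what allows the retrieval and generation components to be swapped independently.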
Open
- Open source software licencing
- Open science compliant in all areas and values
- FAIR data principles for data use and data production, e.g., in data production, vector databases
- AI LLM is open source (where possible)
- AI LLM restrictions on republishing content or using content in training? (where possible and if needed)
- Full documentation for reproducibility (where possible)
- Research is open science and open notebook science based from the start
- AI regulation compliant
- Digital sovereignty by design
- Use open access research literature repositories:
- Redalyc
- OpenAlex
- Europe PubMed Central (Europe PMC)
- bioRxiv
- Ukrainian OA repository
- University of Pretoria
- etc.
- AI and LLM without geo-blocking
- Multilingual supporting AI LLM
- Open science knowledge equity values followed — UN Open Science Recommendations
- Support trust in users
- Avoid vendor lock-in
- Aim to have higher quality, best of class, literature and data retrieval results
- Enable transparency and reproducibility as much as possible
- Institutional knowledge retention
- Support the knowledge commons and biblio-diversity
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: LICENSE