Scrape bible from multiple resources
- Scrape bible from: biblegateway.com, bible.com, and ktcgkpv.org.
- Currently supports:
- Verses (with poetry).
- Footnotes.
- Headings.
- References.
- Psalm metadata (like author, title, etc.).
- Progress logging.
- Save to Postgres & SQLite database.
To run this project, you will need to add the following environment variables to your .env file:
- App configs:
  - DB_URL: Database connection URL (Postgres or SQLite). Examples:
    - Postgres: postgres://postgres:postgres@localhost:5432/bible
    - Sqlite: file:../../dumps/ktcgkpv_org.sqlite3?connection_limit=1&socket_timeout=10
  - LOG_LEVEL: Log level.
- E.g:
# .env
DB_URL="postgres://postgres:postgres@localhost:65439/bible"
LOG_LEVEL=info
You can also check out the file .env.example to see all required environment variables.
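As an illustration of how the two variables above might be consumed, here is a minimal sketch. The helper names `detectProvider` and `readConfig`, the `AppConfig` shape, and the fallback to `info` are assumptions made for this example and are not taken from the project's source.

```typescript
// Hypothetical helpers, not part of the repository.
type DbProvider = "postgresql" | "sqlite";

interface AppConfig {
  dbUrl: string;
  dbProvider: DbProvider;
  logLevel: string;
}

// Infer the database provider from the URL scheme used in the examples above.
function detectProvider(dbUrl: string): DbProvider {
  if (dbUrl.startsWith("postgres://") || dbUrl.startsWith("postgresql://")) {
    return "postgresql";
  }
  if (dbUrl.startsWith("file:")) {
    return "sqlite";
  }
  throw new Error(`Unsupported DB_URL scheme: ${dbUrl}`);
}

// Read and validate the settings from an environment-like object.
function readConfig(env: Record<string, string | undefined>): AppConfig {
  const dbUrl = env["DB_URL"];
  if (!dbUrl) {
    throw new Error("DB_URL is required");
  }
  return {
    dbUrl,
    dbProvider: detectProvider(dbUrl),
    // Assumption: fall back to "info" when LOG_LEVEL is unset.
    logLevel: env["LOG_LEVEL"] ?? "info",
  };
}

const config = readConfig({
  DB_URL: "postgres://postgres:postgres@localhost:5432/bible",
  LOG_LEVEL: "info",
});
console.log(config.dbProvider, config.logLevel);
```

In a real script the argument would typically be `process.env` after the .env file has been loaded.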
This project uses pnpm as package manager:
npm install --global pnpm
Playwright:
Run the following command to download new browser binaries:
npx playwright install
Clone the project:
git clone https://github.com/v-bible/bible-scraper.git
Go to the project directory:
cd bible-scraper
Install dependencies:
pnpm install
Set up the Postgres database using Docker Compose:
docker-compose up -d
Migrate the database:
- Sqlite:
  pnpm prisma:migrate:sqlite
- Postgres:
  pnpm prisma:migrate:pg
Generate Prisma client:
- Sqlite:
  pnpm prisma:generate --schema ./prisma/sqlite/schema.prisma
- Postgres:
  pnpm prisma:generate --schema ./prisma/pg/schema.prisma
Note
To prevent the error net::ERR_NETWORK_CHANGED, you can temporarily disable IPv6 on your network adapter:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
- Scrape bible (from biblegateway.com):
npx tsx ./src/biblegateway.com/main.ts
- Scrape bible (from bible.com):
npx tsx ./src/bible.com/main.ts
Note
The bible.com script does not use the local version code, which may vary between languages. For example, in Vietnamese, version "VCB" has the local code "KTHD".
- Scrape bible (from ktcgkpv.org):
npx tsx ./src/ktcgkpv.org/main.ts
Inject FTS content for SQLite database:
npx tsx ./src/scripts/inject-fts.ts
- Source DB: Defined by the DB_URL environment variable for Prisma.
- Target DB: Defined in the script.
- Scrape liturgical resources for Ordinary Time (Weekdays & Sundays) from catholic-resources.org:
The Lectionary for Mass - Second USA Edition (Sunday Volume, 1998; Weekday Volumes, 2002)
npx tsx ./src/catholic-resources/main.ts
Note
The script get-ordinary-time.ts logs mismatched Gospel readings for Weekday OT between Year I and Year II. You can find them in dumps/catholic-resources/note-ot.txt.
Note
You can update SOURCE_DB and TARGET_DB in the script to change the source and destination databases.
Scraped data is stored in a Hugging Face dataset.
Comparing the scraped data from different sources:
Features | biblegateway.com | bible.com | ktcgkpv.org |
---|---|---|---|
Verse | ✔️ | ✔️ | ✔️ |
Poetry | ✔️ | ✔️ | ✔️ |
Footnote | ✔️ | ✔️ | ✔️ |
Cross Reference | ✔️ | ✔️ | ✔️ |
Psalm Metadata | ✔️ | ✔️ | ✔️ |
Words of Jesus (red letter) | ✔️ | ✔️ | ❌ |
Proper Names (name translation) | ❌ | ❌ | ✔️ |
Version Code | Source | Name | Denomination |
---|---|---|---|
KT2011 | ktcgkpv.org | KPA : ấn bản KT 2011 | Catholic |
BD2011 | bible.com | Kinh Thánh Tiếng Việt, Bản Dịch 2011 | Protestant |
BD2011 | biblegateway.com | Bản Dịch 2011 (BD2011) | Protestant |
Hebrew Bible | Greek Bible (Septuagint) | Catholic Old Testament | Protestant Old Testament |
---|---|---|---|
I. Law (Torah): 1. Sáng Thế 2. Xuất Hành 3. Lêvi 4. Dân Số 5. Đệ Nhị Luật | I. Pentateuch: 1. Sáng Thế 2. Xuất Hành 3. Lêvi 4. Dân Số 5. Đệ Nhị Luật | I. Pentateuch: 1. Sáng Thế 2. Xuất Hành 3. Lêvi 4. Dân Số 5. Đệ Nhị Luật | I. Pentateuch: 1. Sáng Thế 2. Xuất Hành 3. Lêvi 4. Dân Số 5. Đệ Nhị Luật |
II. Prophets - Former Prophets: 6. Giôsuê 7. Thẩm phán 8. 1 & 2 Samuel 9. 1 & 2 Vua - Latter Prophets: 10. Isaia 11. Giêrêmia 12. Êzêkiel 13. The Twelve Prophets | II. Historical Books: 6. Giôsuê 7. Thẩm phán 8. Ruth 9. 1 & 2 Samuel 10. 1 & 2 Vua 11. 1 & 2 Sử biên niên 12. Ezra – Nêhêmia 13. Ester 14. Giuđitha 15. Tôbit 16. 1 & 2 Maccabê | II. Historical Books: 6. Giôsuê 7. Thẩm phán 8. Ruth 9. Samuel 1 10. Samuel 2 11. Vua 1 12. Vua 2 13. Sử biên niên 1 14. Sử biên niên 2 15. Ezra 16. Nêhêmia 17. Tobia* 18. Giuđitha* 19. Ester 20. Maccabê 1* 21. Maccabê 2* | II. Historical Books: 6. Giôsuê 7. Thẩm phán 8. Ruth 9. Samuel 1 10. Samuel 2 11. Vua 1 12. Vua 2 13. Sử biên niên 1 14. Sử biên niên 2 15. Ezra 16. Nêhêmia 17. Ester |
III. Writings: 14. Thánh vịnh 15. Giob 16. Châm ngôn 17. Ruth 18. Diễm ca 19. Giảng viên 20. Ai ca 21. Ester 22. Đaniel 23. Ezra – Nêhêmia 24. 1 & 2 Sử biên niên | III. Wisdom Books: 17. Thánh vịnh 18. Châm ngôn 19. Giảng viên 20. Diễm ca 21. Giob 22. Khôn ngoan 23. Huấn ca | III. Wisdom Books: 22. Giob 23. Thánh vịnh 24. Châm ngôn 25. Giảng viên 26. Diễm ca 27. Khôn ngoan* 28. Huấn ca* | III. Wisdom Books: 18. Giob 19. Thánh vịnh 20. Châm ngôn 21. Giảng viên 22. Diễm ca |
- | IV. Prophets: 24. Ôsê 25. Amos 26. Mica 27. Giôel 28. Abđia 29. Giôna 30. Nahum 31. Habacuc 32. Sôphônia 33. Aggai 34. Zacaria 35. Malaki 36. Isaia 37. Giêrêmia 38. Baruc 39. Ai ca 40. Thư của Giêrêmia 41. Êzêkiel 42. Đaniel | IV. Prophets: 29. Isaia 30. Giêrêmia 31. Ai ca 32. Baruc* 33. Êzêkiel 34. Đaniel 35. Ôsê 36. Giôel 37. Amos 38. Abđia 39. Giôna 40. Mica 41. Nahum 42. Habacuc 43. Sôphônia 44. Aggai 45. Zacaria 46. Malaki | IV. Prophets: 23. Isaia 24. Giêrêmia 25. Ai ca 26. Êzêkiel 27. Đaniel 28. Ôsê 29. Giôel 30. Amos 31. Abđia 32. Giôna 33. Mica 34. Nahum 35. Habacuc 36. Sôphônia 37. Aggai 38. Zacaria 39. Malaki |
Note
Source: Stephen L. Harris, Understanding the Bible, 1997.
Note
Books marked with * are not included in the Protestant Old Testament.
- Version: KT2011 - (ktcgkpv.org)
Book | Book Code | Missing Verses | Notes |
---|---|---|---|
Tô-bi-a | tb | chapter 9: 4 | Corrected: 3-4 |
Tô-bi-a | tb | chapter 14: 9 | Corrected: 8-9 |
Châm ngôn | cn | chapter 14: 32 | Intended |
Huấn ca | hc | chapter 1: 5, 7, 21 | Intended |
Huấn ca | hc | chapter 3: 19, 25 | Intended |
Huấn ca | hc | chapter 10: 21 | Intended |
Huấn ca | hc | chapter 11: 15, 16 | Intended |
Huấn ca | hc | chapter 13: 14 | Intended |
Huấn ca | hc | chapter 16: 15, 16 | Intended |
Huấn ca | hc | chapter 17: 5, 9, 16, 18, 21 | Intended |
Huấn ca | hc | chapter 18: 3 | Intended |
Huấn ca | hc | chapter 19: 18, 19, 21 | Intended |
Huấn ca | hc | chapter 22: 7, 8 | Intended |
Huấn ca | hc | chapter 24: 18, 24 | Intended |
Huấn ca | hc | chapter 25: 12 | Intended |
Huấn ca | hc | chapter 26: 19, 20, 21, 22, 23, 24, 25, 26, 27 | Intended |
Gio-an | ga | chapter 7: 38 | Corrected: 37-38 |
Note
For merged verses such as tb 9: 3-4, the verse is stored with number 3 and label 3-4. Likewise, ga 7: 37-38 is stored with number 37 and label 37-38.
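The number/label convention described in the note can be sketched as follows. The `StoredVerse` interface and the `toStoredVerse` function are hypothetical names for this example. The project's actual schema and parsing code may differ.

```typescript
// Hypothetical shape: only the two fields mentioned in the note.
interface StoredVerse {
  number: number; // first verse number of the (possibly merged) verse
  label: string; // label as shown by the source, e.g. "3-4"
}

// Convert a source label such as "3-4" or "37" into the stored form.
function toStoredVerse(rawLabel: string): StoredVerse {
  const label = rawLabel.trim();
  const match = /^(\d+)(?:-(\d+))?$/.exec(label);
  if (!match) {
    throw new Error(`Invalid verse label: ${rawLabel}`);
  }
  // The stored number is always the first number in the label.
  return { number: Number(match[1]), label };
}

console.log(toStoredVerse("3-4"));
console.log(toStoredVerse("37-38"));
```

Keeping the label separate from the number lets the database sort and query by a plain integer while still displaying the merged range exactly as the source prints it.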
- Version: BD2011 - (biblegateway.com)
Book | Book Code | Missing Verses | Notes |
---|---|---|---|
Mác | mark | chapter 9: 45, 47 | Corrected: 45-46, 47-48 |
- Version: BD2011 - (bible.com)
Book | Book Code | Missing Verses | Notes |
---|---|---|---|
Mác | mrk | chapter 9: 45, 47 | Corrected: 45-46, 47-48 |
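A check along the following lines could be used to spot gaps like those listed in the tables above. This is only a sketch: `VerseRecord`, `findMissingVerses`, and the `lastVerse` parameter (the expected final verse number of the chapter, supplied by the caller) are assumptions for illustration, not the project's own code.

```typescript
// Hypothetical record shape: a verse number plus its display label.
interface VerseRecord {
  number: number;
  label: string;
}

// Return the verse numbers in 1..lastVerse that no record covers.
// A label such as "45-46" counts as covering both 45 and 46.
function findMissingVerses(verses: VerseRecord[], lastVerse: number): number[] {
  const covered = new Set<number>();
  for (const verse of verses) {
    const match = /^(\d+)(?:-(\d+))?$/.exec(verse.label);
    const start = match ? Number(match[1]) : verse.number;
    const end = match && match[2] !== undefined ? Number(match[2]) : start;
    for (let n = start; n <= end; n += 1) {
      covered.add(n);
    }
  }
  const missing: number[] = [];
  for (let n = 1; n <= lastVerse; n += 1) {
    if (!covered.has(n)) {
      missing.push(n);
    }
  }
  return missing;
}

// Made-up sample data: verse 3 carries the merged label "3-4", verse 6 is absent.
const sample: VerseRecord[] = [
  { number: 1, label: "1" },
  { number: 2, label: "2" },
  { number: 3, label: "3-4" },
  { number: 5, label: "5" },
];
console.log(findMissingVerses(sample, 6));
```

Verses reported by such a check still need manual review, since some omissions are intentional in the source text.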
Contributions are always welcome!
Please read the contribution guidelines.
Please read the Code of Conduct.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
See the LICENSE.md file for full details.
Duong Vinh - @duckymomo20012 - tienvinh.duong4@gmail.com
Project Link: https://github.com/v-bible/bible-scraper.
Here are useful resources and libraries that we have used in our projects:
- bible.com: bible.com website.
- biblegateway.com: biblegateway.com website.
- ktcgkpv.org: Nhóm Phiên Dịch Các Giờ Kinh Phụng Vụ website.
- The Lectionary for Mass (1998/2002 USA Edition): compiled by Felix Just, S.J., Ph.D.