
Bible Scraper

Scrape the Bible from multiple sources.


🌟 About the Project

🎯 Features

  • Scrape the Bible from biblegateway.com, bible.com, and ktcgkpv.org.
  • Currently supports:
    • Verses (with poetry).
    • Footnotes.
    • Headings.
    • References.
    • Psalm metadata (like author, title, etc.).
  • Progress logging.
  • Save to Postgres & SQLite database.

🔑 Environment Variables

To run this project, you will need to add the following environment variables to your .env file:

  • App configs:

    DB_URL: Database connection URL (Postgres or SQLite). Examples:

    • Postgres: postgres://postgres:postgres@localhost:5432/bible

    • SQLite: file:../../dumps/ktcgkpv_org.sqlite3?connection_limit=1&socket_timeout=10

    LOG_LEVEL: Log level.

For example:

# .env
DB_URL="postgres://postgres:postgres@localhost:65439/bible"
LOG_LEVEL=info

You can also check out the file .env.example to see all required environment variables.
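
As a minimal sketch of how these two variables might be read and validated at startup: the names `loadConfig`, `detectDbKind`, and `AppConfig` are hypothetical and are not taken from this project. Inferring the database kind from the URL prefix is an assumption based on the two example URLs above, and so is the "info" default for LOG_LEVEL.

```typescript
// Hypothetical config loader; an illustration, not the project's own code.
type DbKind = "postgres" | "sqlite";

interface AppConfig {
  dbUrl: string;
  dbKind: DbKind;
  logLevel: string;
}

// Infer the database kind from the URL prefix used in the examples above.
function detectDbKind(dbUrl: string): DbKind {
  if (dbUrl.startsWith("postgres://") || dbUrl.startsWith("postgresql://")) {
    return "postgres";
  }
  if (dbUrl.startsWith("file:")) {
    return "sqlite";
  }
  // The URL itself is not echoed, since it may contain credentials.
  throw new Error("Unsupported DB_URL scheme");
}

// Build the config from an environment-like object (e.g. process.env).
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const dbUrl = env["DB_URL"];
  if (!dbUrl) {
    throw new Error("DB_URL is not set");
  }
  return {
    dbUrl,
    dbKind: detectDbKind(dbUrl),
    // Falling back to "info" is an assumption of this sketch.
    logLevel: env["LOG_LEVEL"] ?? "info",
  };
}
```

In a Node.js script this would typically be called as `loadConfig(process.env)` after the .env file has been loaded.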

🧰 Getting Started

‼️ Prerequisites

This project uses pnpm as its package manager:

npm install --global pnpm

Playwright:

Run the following command to download new browser binaries:

npx playwright install

🏃 Run Locally

Clone the project:

git clone https://github.com/v-bible/bible-scraper.git

Go to the project directory:

cd bible-scraper

Install dependencies:

pnpm install

Set up the Postgres database using Docker Compose:

docker-compose up -d

Migrate the database:

  • SQLite:

    pnpm prisma:migrate:sqlite
  • Postgres:

    pnpm prisma:migrate:pg

Generate Prisma client:

  • SQLite:

    pnpm prisma:generate --schema ./prisma/sqlite/schema.prisma
  • Postgres:

    pnpm prisma:generate --schema ./prisma/pg/schema.prisma

👀 Usage

Scripts

Scrape Bible

Note

To prevent the error net::ERR_NETWORK_CHANGED, you can temporarily disable IPv6 on your network adapter:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

Scrape from biblegateway.com:

npx tsx ./src/biblegateway.com/main.ts

Scrape from bible.com:

npx tsx ./src/bible.com/main.ts

Note

The bible.com script does not use the local version code, which may vary between languages. For example, in Vietnamese, version "VCB" has the local code "KTHD".

Scrape from ktcgkpv.org:

npx tsx ./src/ktcgkpv.org/main.ts

Inject FTS Content

Inject full-text search (FTS) content into the SQLite database:

npx tsx ./src/scripts/inject-fts.ts
  • Source DB: Defined by the DB_URL environment variable used by Prisma.
  • Target DB: Defined in the script.

Others

The Lectionary for Mass - Second USA Edition (Sunday Volume, 1998; Weekday Volumes, 2002)

npx tsx ./src/catholic-resources/main.ts

Note

The script get-ordinary-time.ts logs mismatched Gospel readings for Weekday OT (Ordinary Time) between Year I and Year II. You can find them in dumps/catholic-resources/note-ot.txt.

Note

You can update SOURCE_DB and TARGET_DB in the script to change the source & destination database.

Storage

Scraped data is stored in a Hugging Face dataset.

Implemented Features

Comparing the scraped data from different sources:

Features biblegateway.com bible.com ktcgkpv.org
Verse ✔️ ✔️ ✔️
Poetry ✔️ ✔️ ✔️
Footnote ✔️ ✔️ ✔️
Cross Reference ✔️ ✔️ ✔️
Psalm Metadata ✔️ ✔️ ✔️
Words of Jesus (red letter) ✔️ ✔️
Proper Names (name translation) ✔️

Notes

Bible Version Denominations

| Version Code | Source | Name | Denomination |
| --- | --- | --- | --- |
| KT2011 | ktcgkpv.org | KPA : ấn bản KT 2011 | Catholic |
| BD2011 | bible.com | Kinh Thánh Tiếng Việt, Bản Dịch 2011 | Protestant |
| BD2011 | biblegateway.com | Bản Dịch 2011 (BD2011) | Protestant |

Bible Old Testament Books Comparison

Hebrew Bible

  • I. Law (Torah): 1. Genesis, 2. Exodus, 3. Leviticus, 4. Numbers, 5. Deuteronomy.
  • II. Prophets:
    • Former Prophets: 6. Joshua, 7. Judges, 8. 1 & 2 Samuel, 9. 1 & 2 Kings.
    • Latter Prophets: 10. Isaiah, 11. Jeremiah, 12. Ezekiel, 13. The Twelve Prophets.
  • III. Other books (Writings): 14. Psalms, 15. Job, 16. Proverbs, 17. Ruth, 18. Song of Songs, 19. Ecclesiastes, 20. Lamentations, 21. Esther, 22. Daniel, 23. Ezra – Nehemiah, 24. 1 & 2 Chronicles.

Greek Bible (Septuagint)

  • I. Pentateuch: 1. Genesis, 2. Exodus, 3. Leviticus, 4. Numbers, 5. Deuteronomy.
  • II. History: 6. Joshua, 7. Judges, 8. Ruth, 9. 1 & 2 Samuel, 10. 1 & 2 Kings, 11. 1 & 2 Chronicles, 12. Ezra – Nehemiah, 13. Esther, 14. Judith, 15. Tobit, 16. 1 & 2 Maccabees.
  • III. Teaching – Wisdom: 17. Psalms, 18. Proverbs, 19. Ecclesiastes, 20. Song of Songs, 21. Job, 22. Wisdom, 23. Sirach.
  • IV. Prophets: 24. Hosea, 25. Amos, 26. Micah, 27. Joel, 28. Obadiah, 29. Jonah, 30. Nahum, 31. Habakkuk, 32. Zephaniah, 33. Haggai, 34. Zechariah, 35. Malachi, 36. Isaiah, 37. Jeremiah, 38. Baruch, 39. Lamentations, 40. Letter of Jeremiah, 41. Ezekiel, 42. Daniel.

Catholic Old Testament

  • I. Pentateuch: 1. Genesis, 2. Exodus, 3. Leviticus, 4. Numbers, 5. Deuteronomy.
  • II. History: 6. Joshua, 7. Judges, 8. Ruth, 9. 1 Samuel, 10. 2 Samuel, 11. 1 Kings, 12. 2 Kings, 13. 1 Chronicles, 14. 2 Chronicles, 15. Ezra, 16. Nehemiah, 17. Tobit*, 18. Judith*, 19. Esther, 20. 1 Maccabees*, 21. 2 Maccabees*.
  • III. Teaching – Wisdom: 22. Job, 23. Psalms, 24. Proverbs, 25. Ecclesiastes, 26. Song of Songs, 27. Wisdom*, 28. Sirach*.
  • IV. Prophets: 29. Isaiah, 30. Jeremiah, 31. Lamentations, 32. Baruch*, 33. Ezekiel, 34. Daniel, 35. Hosea, 36. Joel, 37. Amos, 38. Obadiah, 39. Jonah, 40. Micah, 41. Nahum, 42. Habakkuk, 43. Zephaniah, 44. Haggai, 45. Zechariah, 46. Malachi.

Protestant Old Testament

  • I. Pentateuch: 1. Genesis, 2. Exodus, 3. Leviticus, 4. Numbers, 5. Deuteronomy.
  • II. History: 6. Joshua, 7. Judges, 8. Ruth, 9. 1 Samuel, 10. 2 Samuel, 11. 1 Kings, 12. 2 Kings, 13. 1 Chronicles, 14. 2 Chronicles, 15. Ezra, 16. Nehemiah, 17. Esther.
  • III. Teaching – Wisdom: 18. Job, 19. Psalms, 20. Proverbs, 21. Ecclesiastes, 22. Song of Songs.
  • IV. Prophets: 23. Isaiah, 24. Jeremiah, 25. Lamentations, 26. Ezekiel, 27. Daniel, 28. Hosea, 29. Joel, 30. Amos, 31. Obadiah, 32. Jonah, 33. Micah, 34. Nahum, 35. Habakkuk, 36. Zephaniah, 37. Haggai, 38. Zechariah, 39. Malachi.

Note

Source: Stephen L. Harris, Understanding the Bible, 1997.

Note

Books marked with * are not included in the Protestant Old Testament.

Missing Verses

  • Version: KT2011 - (ktcgkpv.org)

| Book | Book Code | Missing Verses | Notes |
| --- | --- | --- | --- |
| Tô-bi-a | tb | Chapter 9: 4 | Corrected: 3-4 |
| Tô-bi-a | tb | Chapter 14: 9 | Corrected: 8-9 |
| Châm ngôn | cn | Chapter 14: 32 | Intended |
| Huấn ca | hc | Chapter 1: 5, 7, 21 | Intended |
| Huấn ca | hc | Chapter 3: 19, 25 | Intended |
| Huấn ca | hc | Chapter 10: 21 | Intended |
| Huấn ca | hc | Chapter 11: 15, 16 | Intended |
| Huấn ca | hc | Chapter 13: 14 | Intended |
| Huấn ca | hc | Chapter 16: 15, 16 | Intended |
| Huấn ca | hc | Chapter 17: 5, 9, 16, 18, 21 | Intended |
| Huấn ca | hc | Chapter 18: 3 | Intended |
| Huấn ca | hc | Chapter 19: 18, 19, 21 | Intended |
| Huấn ca | hc | Chapter 22: 7, 8 | Intended |
| Huấn ca | hc | Chapter 24: 18, 24 | Intended |
| Huấn ca | hc | Chapter 25: 12 | Intended |
| Huấn ca | hc | Chapter 26: 19, 20, 21, 22, 23, 24, 25, 26, 27 | Intended |
| Gio-an | ga | Chapter 7: 38 | Corrected: 37-38 |

Note

Corrected verses are stored under the first verse number, with the full range as the label. For example, tb 9: 3-4 is stored with number 3 and label 3-4, and ga 7: 37-38 is stored with number 37 and label 37-38.
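
The storage convention in this note can be illustrated with a short sketch. The `StoredVerse` shape and the `toStoredVerse` helper are hypothetical names chosen for this example. They only mirror the "number" and "label" fields mentioned in the note, not the project's actual schema.

```typescript
// Hypothetical shape mirroring the "number" and "label" fields from the note.
interface StoredVerse {
  number: number;
  label: string;
}

// Turn a verse label such as "3-4" or "12" into a number/label pair.
// For a range, the first verse number is used as the verse number.
function toStoredVerse(label: string): StoredVerse {
  const trimmed = label.trim();
  const match = /^(\d+)(?:-(\d+))?$/.exec(trimmed);
  if (!match) {
    throw new Error(`Invalid verse label: ${label}`);
  }
  return { number: Number(match[1]), label: trimmed };
}
```

With this sketch, `toStoredVerse("3-4")` yields number 3 with label "3-4", and `toStoredVerse("37-38")` yields number 37 with label "37-38".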

  • Version: BD2011 - (biblegateway.com)

| Book | Book Code | Missing Verses | Notes |
| --- | --- | --- | --- |
| Mác | mark | Chapter 9: 45, 47 | Corrected: 45-46, 47-48 |

  • Version: BD2011 - (bible.com)

| Book | Book Code | Missing Verses | Notes |
| --- | --- | --- | --- |
| Mác | mrk | Chapter 9: 45, 47 | Corrected: 45-46, 47-48 |
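
One way gaps like those listed above could be detected is sketched below. This is only an illustration under stated assumptions, not the project's actual check. The `StoredVerse` shape, `coveredNumbers`, and `findMissingVerses` are hypothetical names, and the expected last verse number of a chapter is assumed to be known by the caller. A verse counts as present when any stored label covers it, so a merged verse with label "3-4" covers both 3 and 4.

```typescript
// Hypothetical shape: a verse number plus a label such as "45" or "45-46".
interface StoredVerse {
  number: number;
  label: string;
}

// Expand a label such as "45-46" into the verse numbers it covers.
// Labels that do not parse fall back to the verse's own number.
function coveredNumbers(verse: StoredVerse): number[] {
  const match = /^(\d+)(?:-(\d+))?$/.exec(verse.label.trim());
  if (!match) {
    return [verse.number];
  }
  const start = Number(match[1]);
  const end = match[2] !== undefined ? Number(match[2]) : start;
  const result: number[] = [];
  for (let n = start; n <= end; n++) {
    result.push(n);
  }
  return result;
}

// Return the verse numbers in 1..lastVerse that no stored verse covers.
function findMissingVerses(verses: StoredVerse[], lastVerse: number): number[] {
  const covered = new Set<number>();
  for (const verse of verses) {
    for (const n of coveredNumbers(verse)) {
      covered.add(n);
    }
  }
  const missing: number[] = [];
  for (let n = 1; n <= lastVerse; n++) {
    if (!covered.has(n)) {
      missing.push(n);
    }
  }
  return missing;
}
```

Under this sketch, a chapter stored as verses 1, 2, 3 (label "3-4"), and 5 reports no missing verses, while the same chapter with label "3" on verse 3 reports verse 4 as missing.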

👋 Contributing

Contributions are always welcome!

Please read the contribution guidelines.

📜 Code of Conduct

Please read the Code of Conduct.

⚠️ License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.

See the LICENSE.md file for full details.

🤝 Contact

Duong Vinh - @duckymomo20012 - tienvinh.duong4@gmail.com

Project Link: https://github.com/v-bible/bible-scraper.

💎 Acknowledgements

Here are useful resources and libraries that we have used in our projects: