COMP 4321 (24S) Group 1 Group Project, by Gupta Harsh Vardhan, Kong Tsz Yui and Zhang Zhe.

What does this project do

This project is a search engine implementation for the course "Search Engine and Enterprise Data" (COMP4321) offered by the Hong Kong University of Science and Technology. The project consists of three main parts: the crawler, the retrieval system, and the search engine website.

crawler: Parses the textual data from the website and saves it.
retrieval: Applies TFxIDF, Google's PageRank, and weighted search to the user query and the parsed data to retrieve the top 50 results.
website: A user-friendly, easy-to-use way to access the search engine.
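
As an illustration of the retrieval step, here is a minimal sketch of TFxIDF scoring and PageRank using only the standard library. It is not the project's actual code: the function names (`tfidf_scores`, `pagerank`, `top_results`), the tf normalization by the most frequent term, the damping factor of 0.85, and the handling of pages without outgoing links are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Score each document against the query with a basic tf x idf weight.

    docs maps a page id to its list of (already stemmed) terms.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for page_id, terms in docs.items():
        tf = Counter(terms)
        max_tf = max(tf.values(), default=1)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log2(n_docs / df[term])
                # Assumed normalization: tf divided by the largest tf in the page.
                score += (tf[term] / max_tf) * idf
        scores[page_id] = score
    return scores


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outgoing pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # Pages with no known outgoing links spread their rank evenly.
            targets = [t for t in outgoing if t in rank] or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def top_results(scores, k=50):
    """Return up to k (page_id, score) pairs with a positive score, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(page_id, score) for page_id, score in ranked if score > 0][:k]
```

For example, `top_results(tfidf_scores(["search"], docs))` returns the matching pages in descending score order. How the project combines the TFxIDF score with PageRank and the weighted search is not shown here.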

Project User manual

Demo Video: YouTube

The project is based on Python, version 3.12.0. The project uses the following packages:

urllib.parse for handling URLs
zlib for its crc32 function, which is used to generate page_id
bs4 for parsing HTML files that are being crawled
requests to get the content of URLs
sqlite3 as the dbms (database management system)
email.utils and datetime to convert RFC 2822 times into timestamps, and timestamps into readable dates
re for regex, to remove symbols and numbers from the documents
asyncio, for speeding up the program through asynchronous execution
pathlib to get the absolute path of the files so that the database file can be accessed correctly
nltk.stem as stemmer
itertools, for converting a list of 1-tuples into a flat list
collections, for counting the occurrences of terms in the text
numpy, for calculating PageRank
math, for calculating idf
time, for measuring the overall running time of the program
flask, for the frontend
timeit, for measuring the running time of a query
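
To make the roles of several of these standard-library modules concrete, here is a small sketch of the kind of helpers they enable. The helper names (`make_page_id`, `rfc2822_to_timestamp`, `timestamp_to_date`, `tokenize`, `flatten`) are hypothetical and chosen for this example; the project's own functions, date format, and regex may differ.

```python
import re
import zlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from itertools import chain


def make_page_id(url: str) -> int:
    """Derive a numeric page id from a URL with CRC32 (an unsigned 32-bit value)."""
    return zlib.crc32(url.encode("utf-8"))


def rfc2822_to_timestamp(value: str) -> int:
    """Convert an RFC 2822 date, such as a Last-Modified header, to a Unix timestamp."""
    return int(parsedate_to_datetime(value).timestamp())


def timestamp_to_date(timestamp: int) -> str:
    """Format a Unix timestamp as a readable UTC date (the format is an assumption)."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def tokenize(text: str) -> list:
    """Replace everything except ASCII letters with spaces, then split into lowercase words."""
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()


def flatten(rows):
    """Turn a list of 1-tuples, as returned by sqlite3's fetchall(), into a flat list."""
    return list(chain.from_iterable(rows))
```

Passing the output of `tokenize` to `collections.Counter` then gives the per-page term counts needed for the tf part of TFxIDF.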

We recommend running the program on macOS or Linux, with bash or zsh as the shell, since development and testing were done on a Unix-based platform. The program can also be run on Windows.

To run the program, users should do the following.

Make sure the correct Python version is installed, then navigate to the directory where the code was downloaded.
Create a virtual environment using the following command.
```
python -m venv .venv
```
Activate the virtual environment using the following command.

(Unix-based OS)
```
source .venv/bin/activate
```
(Windows)
```
.\.venv\Scripts\activate
```
Install the required packages.
```
pip install -r requirements.txt
```
Run the program. It handles everything, including initializing the database.
```
python main.py
```
You should then see a line similar to the following.
```
* Running on http://127.0.0.1:5000
```
Open http://127.0.0.1:5000 in a browser to use the search engine.
