This tool searches for specific text phrases across one or more websites. Typical uses include:
- Content Verification: Ensuring that specific phrases or keywords are present on your website.
- SEO Audits: Checking if important keywords are used across your site.
- Compliance Checks: Verifying that required legal or compliance text is present on all pages.
- Competitive Analysis: Searching for specific content on competitor websites.
The tool uses web crawling to gather data from the specified websites and then searches the crawled data for the given text phrases. The results are saved to a CSV file for easy analysis.
- Python 3.x
- Required Python packages (install via pip): requests, beautifulsoup4, tqdm, advertools, requests-cache, argparse, pandas
- Clone the repository or download the script.
- Install the required Python packages:
pip install requests beautifulsoup4 tqdm advertools requests-cache argparse pandas
To crawl the specified websites, use the --crawl argument:
python search.py --crawl
This will crawl the websites listed in the SITES variable and save the crawled data to JSON lines files.
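The crawl step could be wired up roughly as follows. This is a minimal sketch, not the actual contents of search.py: the helper names `output_file_for`, `crawl_sites`, and `parse_args` and the file naming scheme are hypothetical, and it assumes the crawl is done with `advertools.crawl`, which writes its output to a `.jl` (JSON lines) file.

```python
import argparse
from urllib.parse import urlparse

SITES = ["https://www.example.com", "https://www.example2.com"]


def output_file_for(site):
    """Derive a JSON lines file name from a site URL (hypothetical naming scheme)."""
    host = urlparse(site).netloc.replace(".", "_")
    return f"{host}.jl"


def crawl_sites(sites):
    """Crawl each site into its own JSON lines file."""
    # Imported here so the argument parsing works even without advertools installed.
    import advertools as adv

    for site in sites:
        # follow_links=True asks the crawler to follow links it discovers,
        # rather than fetching only the start URL.
        adv.crawl(site, output_file_for(site), follow_links=True)


def parse_args(argv=None):
    """Parse the command line; --crawl is the only flag described in this README."""
    parser = argparse.ArgumentParser(description="Crawl sites or search crawled data.")
    parser.add_argument("--crawl", action="store_true", help="crawl the sites in SITES")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    if args.crawl:
        crawl_sites(SITES)
    else:
        print("No --crawl flag given: the search step would run here.")
```

Writing one output file per site keeps a re-crawl of a single site from touching the data already collected for the others.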
To search for the specified text phrases in the crawled data, simply run the script without any arguments:
python search.py
This will search for the phrases listed in the TEXT_TO_SEARCH variable in the crawled data and save the results to a CSV file named search_results.csv.
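The search step can be pictured as a scan over the crawled JSON lines files. The sketch below is illustrative only and uses just the standard library. It assumes each crawled record has `url` and `body_text` fields (the names advertools uses in its crawl output), that matching is case-insensitive, and that the CSV has two columns, `url` and `phrase`. The README confirms none of these details, and the function names are hypothetical.

```python
import csv
import json

TEXT_TO_SEARCH = ["example phrase", "another example phrase"]


def search_crawl_file(jl_path, phrases, text_field="body_text"):
    """Yield (url, phrase) for every phrase found in a crawled page's text."""
    with open(jl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            page = json.loads(line)
            # A page may lack body text; treat that as an empty string.
            text = str(page.get(text_field) or "").lower()
            for phrase in phrases:
                if phrase.lower() in text:
                    yield page.get("url", ""), phrase


def write_results(rows, csv_path="search_results.csv"):
    """Write the matches to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "phrase"])  # assumed column layout
        writer.writerows(rows)
```

With a crawl file in place, `write_results(search_crawl_file("www_example_com.jl", TEXT_TO_SEARCH))` would produce the CSV. The file name here is a placeholder for whatever the crawl step wrote.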
- Crawl the websites:
  python search.py --crawl
- Search for the text phrases:
  python search.py
- Check the search_results.csv file for the results.
You can configure the list of websites to crawl and the text phrases to search for by modifying the SITES and TEXT_TO_SEARCH variables in the search.py script:
SITES = ["https://www.example.com", "https://www.example2.com"]
TEXT_TO_SEARCH = ["example phrase", "another example phrase"]
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
For any questions or inquiries, please contact duncan@innermaps.org.