Aug 25, 2021 - Fixed an issue with the crontab script: it was generating incorrect crawl data because it ran under the wrong user account.
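For context on user-related cron pitfalls: system-wide crontabs (files under /etc/cron.d or /etc/crontab) include a user field between the schedule and the command, which determines the account a job runs as. An illustrative hourly entry, with all paths and names assumed:

```
# Illustrative /etc/cron.d entry; user, paths, and script name are assumptions.
# Runs the crawl at the top of every hour as the web server's account.
0 * * * * www-data /usr/bin/php /var/www/app/cli.php crawler run >> /var/log/crawler.log 2>&1
```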
Aug 23, 2021 - Added DOMDocument functionality to extract internal links from the submitted URL, store them in an array, and iterate through them. I also removed excess/duplicate code. The app now generates a basic HTML sitemap, stores it in the database, and displays the sitemap HTML (available for download) on the front end.
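A minimal sketch of that extraction step, assuming the page HTML has already been fetched into `$html` and the site's host name is in `$siteHost` (both variable names are illustrative):

```php
<?php
// Sketch only: extract internal links from fetched HTML with DOMDocument.
// $html and $siteHost are assumed to come from the crawl step.
$doc = new DOMDocument();

// Real-world markup is rarely perfectly valid; silence parser warnings.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$internalLinks = [];

foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);

    // Relative URLs (no host) and same-host URLs count as internal.
    if ($host === null || $host === $siteHost) {
        $internalLinks[] = $href;
    }
}

// Drop duplicate links before storing or rendering them.
$internalLinks = array_unique($internalLinks);
```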
Aug 21, 2021 - Added basic crawler functionality that leverages curl. Also wired up ProductsController to the products/index view (a Volt template) so that crawl results render properly in the view.
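The fetch side of a curl-based crawler can be sketched roughly as follows; the function name, options, and error handling are illustrative rather than the project's actual code:

```php
<?php
// Sketch only: fetch a page body over HTTP with the curl extension.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final page
        CURLOPT_TIMEOUT        => 10,   // don't let one slow page hang the task
    ]);

    $body = curl_exec($ch);

    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Crawl failed for {$url}: {$error}");
    }

    curl_close($ch);

    return $body;
}
```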
Aug 19, 2021 - Added form elements to submit a URL to crawl on the admin side. I also created the base structure for the crawler and tested initial writes to the crawler/products table in the database.
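On the server side, the handler for that form might look roughly like this in a Phalcon controller; the controller name, action, and messages are assumptions:

```php
<?php
// Sketch only: accept a URL from the admin form and validate it.
use Phalcon\Mvc\Controller;

class AdminController extends Controller
{
    public function crawlAction()
    {
        // Phalcon exposes submitted form data via the request service.
        $url = $this->request->getPost('url');

        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            $this->flash->error('Please submit a valid URL.');
            return;
        }

        // ... hand the URL off to the crawler and schedule the hourly run ...
        $this->flash->success('Crawl started.');
    }
}
```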
Aug 17, 2021 - Built the base project structure and authentication system using the Phalcon framework. Issued the first commit to a private repository.
Aug 13-16, 2021 - Read through the user story, understood the problem, and defined the appropriate technology and code structure.
We want you to build a PHP app or WordPress plugin that provides the desired outcome.
As an administrator, I want to see how my website's pages are linked to my home page so that I can manually look for ways to improve my SEO rankings.
- Add a back-end admin page (for WP: settings page) where the admin can log in and then manually trigger a crawl and view the results.
- When the admin triggers a crawl:
  - Set a task to run immediately.
  - Then set the task to run the crawl every hour ⏰🤖.
- When the admin requests to view the results, pull the results from storage and display them on the admin page.
- If an error happens, display an error notice explaining what happened and what to do next.
- On the front-end, allow a visitor to view the sitemap.html page.
- The crawl task (a sketch follows this list):
  - Deletes the results from the last crawl (i.e. those in temporary storage), if they exist.
  - Deletes the sitemap.html file if it exists.
  - Starts at the website's root URL (i.e. the home page).
  - Extracts all of the internal hyperlinks, i.e. the results.
  - Stores the results temporarily in the database.
  - Displays the results on the admin page.
  - Saves the homepage content as a static .html file, in the directory of your choice.
  - Creates a sitemap.html file that shows the results as a sitemap list structure.
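Putting those steps in order, the whole task could be sketched as below, assuming the earlier curl and DOMDocument snippets are wrapped up as fetchPage() and extractInternalLinks(); the PDO connection, table name, and file paths are likewise assumptions:

```php
<?php
// Sketch only: the crawl task's order of operations.
function runCrawlTask(PDO $pdo, string $homeUrl, string $webRoot)
{
    // 1. Clear out the previous crawl, if any.
    $pdo->exec('DELETE FROM crawl_results');
    if (file_exists($webRoot . '/sitemap.html')) {
        unlink($webRoot . '/sitemap.html');
    }

    // 2. Fetch the home page and extract its internal links.
    $html  = fetchPage($homeUrl);
    $links = extractInternalLinks($html, parse_url($homeUrl, PHP_URL_HOST));

    // 3. Store the results temporarily and keep a static copy of the page.
    $stmt = $pdo->prepare('INSERT INTO crawl_results (url, created_at) VALUES (?, NOW())');
    foreach ($links as $link) {
        $stmt->execute([$link]);
    }
    file_put_contents($webRoot . '/home.static.html', $html);

    // 4. Rebuild sitemap.html as a simple list of the results.
    $items = '';
    foreach ($links as $link) {
        $safe   = htmlspecialchars($link, ENT_QUOTES);
        $items .= "<li><a href=\"{$safe}\">{$safe}</a></li>\n";
    }
    file_put_contents($webRoot . '/sitemap.html', "<ul>\n{$items}</ul>\n");
}
```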
Let’s keep this simple:
- Only crawl the home page, instead of recursively crawling through all of the internal hyperlinks.
- Only delete the temporarily stored results based on time (a sketch follows this list). Normally, we would also delete them when the content changes, but let’s keep it really simple and delete based on time alone.
- For storage, you can use a database (MariaDB or MySQL) or the filesystem.
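The time-based cleanup from the second point above can be a single query run on a schedule or before each crawl; connection details and table/column names are assumptions:

```php
<?php
// Sketch only: purge stored results older than an hour.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'dbuser', 'secret');

$pdo->exec('DELETE FROM crawl_results WHERE created_at < (NOW() - INTERVAL 1 HOUR)');
```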
Note that content on the home page is dynamic, so each crawl can return different results.
The deliverable:
- It lives in a GitHub repo and retains its history.
- It’s built using our assessment template.
- It’s built with modern OOP following PSR standards (autoloading, dependency injection, etc.); a small sketch follows this list.
- It uses procedural code where it makes sense.
- It’s complete and works.
- It delivers the right expected outcome per what the admin requested (per the user story).
- It does not generate errors, warnings, or notices.
- It runs on PHP 7.0 and up.
- If built with WordPress, it runs on version 5.2 and up.
- It does not create new global variables.
- It uses a MariaDB or MySQL database.
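As a small illustration of the OOP/DI expectation above (all class names are hypothetical), constructor injection keeps the crawler decoupled from any particular HTTP client and stays compatible with PHP 7.0:

```php
<?php
// Sketch only: constructor-injected dependency behind an interface.

interface HttpClientInterface
{
    public function get(string $url): string;
}

class Crawler
{
    /** @var HttpClientInterface */
    private $http;

    // The dependency is injected rather than constructed here, so it
    // can be swapped out (e.g. mocked in tests) without touching Crawler.
    public function __construct(HttpClientInterface $http)
    {
        $this->http = $http;
    }

    public function crawl(string $url): string
    {
        return $this->http->get($url);
    }
}
```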