Version: 1.0.0
Author: tim021008-la
This Node.js script is a powerful and resilient web scraper designed to recursively collect location data from Instagram's "Explore Locations" pages. It starts from a high-level country page, automatically discovers all the cities listed by navigating through all available pages, and then proceeds to visit each city page to gather a complete list of all specific locations (e.g., parks, restaurants, landmarks).
The entire process is automated using Puppeteer, which controls a headless Chrome browser. The script is built to be robust, incorporating delays, a multi-attempt retry mechanism, and exponential backoff to handle network errors and rate-limiting gracefully. The final output is a well-structured JSON file, perfect for analysis or use in other applications.
- Recursive Scraping: Traverses from a country-level page down to individual location pages.
- Systematic Pagination: Reliably navigates through all pages using the ?page=X parameter instead of relying on fragile infinite scrolling (see the sketch after this list).
- Robust Error Handling:
  - Retry Mechanism: Automatically retries failed page loads (up to 3 attempts per page).
  - Exponential Backoff: Waits progressively longer between failed retries to handle rate limiting.
- Polite Operation: Incorporates randomized delays between requests to mimic human behavior and reduce server load.
- Structured JSON Output: Saves data in a clean, human-readable JSON format, organized by city.
- Configurable: Easily change the target URL, output file, and scraping limits for testing.
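A minimal sketch of that pagination loop, assuming a hypothetical scrapeAllPages(page, baseUrl, extractItems) signature; the actual function in scraper.js may be structured differently, and the retry/backoff handling (shown in a later sketch) is omitted here:

```js
// Sketch of ?page=X pagination. Names are illustrative, not the exact code.
async function scrapeAllPages(page, baseUrl, extractItems) {
  const results = [];
  for (let pageNum = 1; ; pageNum++) {
    await page.goto(`${baseUrl}?page=${pageNum}`, { waitUntil: 'networkidle2' });
    // extractItems runs inside the browser context and returns an array of items.
    const items = await page.evaluate(extractItems);
    if (items.length === 0) break; // an empty page means we ran past the last one
    results.push(...items);
  }
  return results;
}
```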
Before running the script, you must have the following installed on your system:
- Node.js (version 14.x or newer recommended)
- npm (the Node.js package manager, typically installed with Node.js)
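Assuming the script is saved as scraper.js, a typical install and run looks like this:

```bash
npm init -y           # create a package.json if the project doesn't have one
npm install puppeteer # also downloads a compatible Chromium build
node scraper.js
```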
You can customize the scraper's behavior by editing the configuration variables at the top of the scraper.js file.
- START_URL: The initial URL for the scraper. This should be the "Explore Locations" page for a specific country.
- OUTPUT_FILE: The filename for the resulting JSON data.
- CITY_LIMIT: Caps how many cities are scraped, which is useful for quick test runs. To scrape all available cities, set this value to null. An example configuration follows this list.
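For example, the top of scraper.js might look like this (the values shown are placeholders, not the script's actual defaults):

```js
// Placeholder values; point START_URL at your target country page.
const START_URL = 'https://www.instagram.com/explore/locations/US/united-states/';
const OUTPUT_FILE = 'instagram_locations.json';
const CITY_LIMIT = 5; // set to null to scrape every city
```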
The script is orchestrated by the main() function and relies on a generic, reusable scrapeAllPages() function for its core logic:
- Launch Browser: Puppeteer launches a headless instance of Chromium.
- Scrape City URLs: The scrapeCityUrls() function is called. It uses scrapeAllPages() to systematically navigate through ?page=1, ?page=2, etc., of the main country URL until no new cities are found.
- Retry & Backoff Logic: If any page navigation fails (e.g., due to a timeout), scrapeAllPages() retries up to two more times, and the delay between retries grows exponentially (e.g., ~2s, then ~4s) to give the server time to recover (see the sketch after this list).
- Iterate and Scrape Locations: The main() function loops through the complete list of city URLs gathered in the first phase.
- Scrape Location Data: For each city, the scrapeLocationsForCity() function is called. It reuses the same robust scrapeAllPages() function to navigate through all of the city's pages and collect the names and URLs of all specific locations.
- Polite Delays: The script waits a few seconds between most page requests and before starting a new city to avoid overwhelming the server.
- Save to File: Once all cities (up to CITY_LIMIT) have been processed, the final data object is serialized to a formatted JSON string and saved to OUTPUT_FILE.
- Close Browser: The finally block ensures the browser instance is always closed to free up system resources, even if an error occurred.
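A sketch of the retry-with-backoff pattern around each page navigation; the function and variable names here are illustrative, not the exact code in scraper.js:

```js
// Illustrative retry wrapper: up to 3 attempts, doubling the wait each time.
async function gotoWithRetry(page, url, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      return; // success
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up after the last attempt
      const backoff = 2000 * 2 ** (attempt - 1); // ~2s, then ~4s
      // A little random jitter keeps the request pattern from looking mechanical.
      await new Promise((resolve) => setTimeout(resolve, backoff + Math.random() * 1000));
    }
  }
}
```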
The final output is a single JSON file (instagram_locations.json by default). The data is structured as an object where each top-level key is a city's name, and the value is an array of location objects found within that city.
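For example, assuming each location object carries a name and a url field as described above (the cities and locations below are invented for illustration):

```json
{
  "Springfield": [
    {
      "name": "Riverside Park",
      "url": "https://www.instagram.com/explore/locations/123456789/riverside-park/"
    },
    {
      "name": "Old Town Diner",
      "url": "https://www.instagram.com/explore/locations/987654321/old-town-diner/"
    }
  ]
}
```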
- Web scraping can be against the terms of service of some websites. Always scrape responsibly and ethically.
- Websites like Instagram frequently update their structure. If this script stops working, it is likely due to a change in the HTML layout or class names on their site, which would require updating the selectors in the page.evaluate() functions (see the sketch below).
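For orientation, such an extraction step looks roughly like this; the selector is a hypothetical illustration, not the one the script actually uses:

```js
// Hypothetical extraction step: collect location links from the current page.
const locations = await page.evaluate(() => {
  // This selector is illustrative; the real one must match Instagram's current markup.
  return Array.from(document.querySelectorAll('a[href*="/explore/locations/"]'))
    .map((a) => ({ name: a.textContent.trim(), url: a.href }));
});
```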