diff --git a/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md new file mode 100644 index 000000000..e0e699c5e --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md @@ -0,0 +1,177 @@ +--- +title: Inspecting web pages with browser DevTools +sidebar_label: "DevTools: Inspecting" +description: Lesson about using the browser tools for developers to inspect and manipulate the structure of a website. +slug: /scraping-basics-javascript2/devtools-inspecting +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.** + +--- + +A browser is the most complete tool for navigating websites. Scrapers are like automated browsers—and sometimes, they actually are automated browsers. The key difference? There's no user to decide where to go or eyes to see what's displayed. Everything has to be pre-programmed. + +All modern browsers provide developer tools, or _DevTools_, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper: + +1. Inspect the target website in DevTools to understand its structure and determine how to extract the required data. +1. Translate those findings into code. +1. If the scraper fails due to overlooked edge cases or, over time, due to website changes, go back to step 1. + +Now let's spend some time figuring out what the detective work in step 1 is about. + +## Opening DevTools + +Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers, as Safari has its [Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector) and Firefox also has [DevTools](https://firefox-source-docs.mozilla.org/devtools-user/). + +Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**. + +![Wikipedia with Chrome DevTools open](./images/devtools-wikipedia.png) + +Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page: + +![Elements tab in Chrome DevTools](./images/devtools-elements-tab.png) + +:::warning Screen adaptations + +DevTools may appear differently depending on your screen size. For instance, on smaller screens, the CSS panel might move below the HTML elements panel instead of appearing in the right pane. + +::: + +Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`. + +```html +
+<article id="article-123">
+  <h1 class="heading">First Level Heading</h1>
+  <p>Paragraph with <em>emphasized text</em>.</p>
+</article>
+``` + +HTML, a markup language, describes how everything on a page is organized, how elements relate to each other, and what they mean. It doesn't define how elements should look—that's where [CSS](https://developer.mozilla.org/en-US/docs/Learn/CSS) comes in. CSS is like the velvet covering the frame. Using styles, we can select elements and assign rules that tell the browser how they should appear. For instance, we can style all elements with `heading` in their `class` attribute to make the text blue and uppercase. + +```css +.heading { + color: blue; + text-transform: uppercase; +} +``` + +While HTML and CSS describe what the browser should display, [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript) is a general-purpose programming language that adds interaction to the page. + +In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press **ESC** to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly. + +![Console in Chrome DevTools](./images/devtools-console.png) + +## Selecting an element + +In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square. + +![Chrome DevTools element selection tool](./images/devtools-element-selection.png) + +We'll click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle. + +![Chrome DevTools element hover](./images/devtools-hover.png) + +The highlighted section should look something like this: + +```html + + The Free Encyclopedia + +``` + +If we were experienced creators of scrapers, our eyes would immediately spot what's needed to make a program that fetches Wikipedia's subtitle. The program would need to download the page's source code, find a `strong` element with `localized-slogan` in its `class` attribute, and extract its text. + +:::tip HTML and whitespace + +In HTML, whitespace isn't significant, i.e., it only makes the code readable. The following code snippets are equivalent: + +```html + + The Free Encyclopedia + +``` + +```html + The Free +Encyclopedia + +``` + +::: + +## Interacting with an element + +We won't be creating Python scrapers just yet. Let's first get familiar with what we can do in the JavaScript console and how we can further interact with HTML elements on the page. + +In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready. + +![Global variable in Chrome DevTools Console](./images/devtools-console-variable.png) + +The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements. + +For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. We'll type the following and hit **Enter**: + +```js +temp1.textContent; +``` + +The result should be `'The Free Encyclopedia'`. Now let's try this: + +```js +temp1.outerHTML; +``` + +This should return the element's HTML tag as a string. 
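+
+For example, on Wikipedia's homepage the returned string should look roughly like the following—take it only as an illustration, since the element's exact attributes may differ:
+
+```js
+'<strong class="localized-slogan">The Free Encyclopedia</strong>'
+```
+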
Finally, we'll run the next line to change the text of the element: + +```js +temp1.textContent = 'Hello World!'; +``` + +When we change elements in the Console, those changes reflect immediately on the page! + +![Changing textContent in Chrome DevTools Console](./images/devtools-console-textcontent.png) + +But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence. + +We're not here for playing around with elements, though—we want to create a scraper for an e-commerce website to watch prices. In the next lesson, we'll examine the website and use CSS selectors to locate HTML elements containing the data we need. + +--- + + + +### Find FIFA logo + +Open the [FIFA website](https://www.fifa.com/) and use the DevTools to figure out the URL of FIFA's logo image file. Hint: You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img) element with a `src` attribute. + +
+ Solution + + 1. Go to [fifa.com](https://www.fifa.com/). + 1. Activate the element selection tool. + 1. Click on the logo. + 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. + 1. In the console, type `temp1.src` and hit **Enter**. + + ![DevTools exercise result](./images/devtools-exercise-fifa.png) + +
+ +### Make your own news + +Open a news website, such as [CNN](https://cnn.com). Use the Console to change the headings of some articles. + +
+ Solution + + 1. Go to [cnn.com](https://cnn.com). + 1. Activate the element selection tool. + 1. Click on a heading. + 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. + 1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**. + + ![DevTools exercise result](./images/devtools-exercise-cnn.png) + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md new file mode 100644 index 000000000..1b65814a3 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md @@ -0,0 +1,209 @@ +--- +title: Locating HTML elements on a web page with browser DevTools +sidebar_label: "DevTools: Locating HTML elements" +description: Lesson about using the browser tools for developers to manually find products on an e-commerce website. +slug: /scraping-basics-javascript2/devtools-locating-elements +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.** + +--- + +Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site. As part of the groundwork, let's check out the site we'll be working with. + +## Meeting the Warehouse store + +Instead of artificial scraping playgrounds or sandboxes, we'll scrape a real e-commerce site. Shopify, a major e-commerce platform, has a demo store at [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/). It strikes a good balance between being realistic and stable enough for a tutorial. Our scraper will track prices for all products listed on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). + +:::info Balancing authenticity and stability + +Live sites like Amazon are complex, loaded with promotions, frequently changing, and equipped with anti-scraping measures. While those challenges are manageable, they're advanced topics. For this beginner course, we're sticking to a lightweight, stable environment. + +That said, we designed all the additional exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience. + +::: + +## Finding a product card + +As mentioned in the previous lesson, before building a scraper, we need to understand structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). + +![Warehouse store with DevTools open](./images/devtools-warehouse.png) + +The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it. + +![Selecting an element with DevTools](./images/devtools-product-title.png) + +Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more. + +In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**. + +![Selecting an element with hover](./images/devtools-hover-product.png) + +At this stage, we could use the **Store as global variable** option to send the element to the **Console**. 
While helpful for manual inspection, this isn't something a program can do. + +Scrapers typically rely on [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors) to locate elements on a page, and these selectors often target elements based on their `class` attributes. The product card we highlighted has markup like this: + +```html +
+<div class="product-item product-item--vertical 1/3--tablet-and-up 1/4--desk">
+  ...
+</div>
+``` + +The `class` attribute can hold multiple values separated by whitespace. This particular element has four classes. Let's move to the **Console** and experiment with CSS selectors to locate this element. + +## Programmatically locating a product card + +Let's jump into the **Console** and write some JavaScript. Don't worry—we don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python. + +In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. We'll try typing this into the **Console**: + +```js +document.querySelector('.product-item'); +``` + +It will return the HTML element for the first product card in the listing: + +![Using querySelector() in DevTools Console](./images/devtools-queryselector.webp) + +CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine. + +The [type selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors) matches elements by tag name. For example, `h1` would match the highlighted element: + +```html +
+<article>
+  <!-- highlight-next-line -->
+  <h1>Title</h1>
+  <p>Paragraph.</p>
+</article>
+``` + +The [class selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) matches elements based on their class attribute. For instance, `.heading` (note the dot) would match the following: + +```html +
+<article>
+  <!-- highlight-next-line -->
+  <h1 class="heading">Title</h1>
+  <!-- highlight-next-line -->
+  <h2 class="heading">Subtitle</h2>
+  <p>Paragraph</p>
+  <div>
+    <!-- highlight-next-line -->
+    <p class="heading">Heading</p>
+  </div>
+</article>
+``` + +You can combine selectors to narrow results. For example, `p.lead` matches `p` elements with the `lead` class, but not `p` elements without the class or elements with the class but a different tag name: + +```html +
+<article>
+  <p class="lead">Lead paragraph.</p>
+  <p>Paragraph</p>
+  <div class="lead">Paragraph</div>
+</article>
+``` + +How did we know `.product-item` selects a product card? By inspecting the markup of the product card element. After checking its classes, we chose the one that best fit our purpose. Testing in the **Console** confirmed it—selecting by the most descriptive class worked. + +## Choosing good selectors + +Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning. + +The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules. + +This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after. + +![Overview of all the product cards in DevTools](./images/devtools-product-list.png) + +## Locating all product cards + +In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list. + +![Highlighting a querySelector() result](./images/devtools-hover-queryselector.png) + +But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**: + +```js +document.querySelectorAll('.product-item'); +``` + +The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList), a collection of nodes. Browsers understand an HTML document as a tree of nodes. Most nodes are HTML elements, but there are also text nodes for plain text, and others. + +We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer! + +![Highlighting a querySelectorAll() result](./images/devtools-hover-queryselectorall.png) + +To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like in Python lists (or JavaScript arrays): + +```js +products = document.querySelectorAll('.product-item'); +subwoofer = products[2]; +``` + +Even though we're just playing with JavaScript in the browser's **Console**, we're inching closer to figuring out what our Python program will need to do. In the next lesson, we'll dive into accessing child elements and extracting product details. 
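+
+As a quick sanity check, we can count the matches and confirm that the element at index 2 really is the subwoofer's card—the number below reflects the listing at the time of writing and may change as the store updates:
+
+```js
+products.length; // 24 product cards at the time of writing
+subwoofer; // hovering the result should highlight the subwoofer's card on the page
+```
+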
+ +--- + + + +### Locate headings on Wikipedia's Main Page + +On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones). + +![Wikipedia's Main Page headings](./images/devtools-exercise-wikipedia.png) + +
+ Solution + + 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page). + 1. Activate the element selection tool in your DevTools. + 1. Click on several headings to examine the markup. + 1. Notice that all headings are `h2` elements with the `mp-h2` class. + 1. In the **Console**, execute `document.querySelectorAll('h2')`. + 1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is. + +
+ +### Locate products on Shein + +Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products. + +![Products in Shein's Jewelry & Accessories category](./images/devtools-exercise-shein.png) + +
+ Solution + + 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions. + 1. Activate the element selection tool in your DevTools. + 1. Click on the first product to inspect its markup. Repeat with a few others. + 1. Observe that all products are `section` elements with multiple classes, including `product-card`. + 1. Since `section` is a generic wrapper, focus on the `product-card` class. + 1. In the **Console**, execute `document.querySelectorAll('.product-card')`. + 1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary. + +
+ +### Locate articles on Guardian + +Go to Guardian's [page about F1](https://www.theguardian.com/sport/formulaone). Use the **Console** to find all HTML elements representing the articles. + +Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator). + +![Articles on Guardian's page about F1](./images/devtools-exercise-guardian1.png) + +
+ Solution + + 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone). + 1. Activate the element selection tool in your DevTools. + 1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards. + 1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable. + 1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links. + 1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`. + 1. In the **Console**, execute `document.querySelectorAll('main li')`. + 1. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works! + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md new file mode 100644 index 000000000..730089bb2 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md @@ -0,0 +1,136 @@ +--- +title: Extracting data from a web page with browser DevTools +sidebar_label: "DevTools: Extracting data" +description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website. +slug: /scraping-basics-javascript2/devtools-extracting-data +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.** + +--- + +In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data. Now how do we extract the data? + +## Finding product details + +Previously, we've figured out how to save the subwoofer product card to a variable in the **Console**: + +```js +products = document.querySelectorAll('.product-item'); +subwoofer = products[2]; +``` + +The product details are within the element as text, so maybe if we extract the text, we could work out the individual values? + +```js +subwoofer.textContent; +``` + +That indeed outputs all the text, but in a form which would be hard to break down to relevant pieces. + +![Printing text content of the parent element](./images/devtools-extracting-text.png) + +We'll need to first locate relevant child elements and extract the data from each of them individually. + +## Extracting title + +We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those the `product-item__title` seems like a great choice to locate the element. + +![Finding child elements](./images/devtools-product-details.png) + +JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element: + +```js +title = subwoofer.querySelector('.product-item__title'); +title.textContent; +``` + +Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title: + +![Extracting product title](./images/devtools-extracting-title.png) + +## Extracting price + +To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class. + +![Finding child elements](./images/devtools-product-details.png) + +We could either rely on the fact that the sale price is likely to be always the one which is highlighted, or that it's always the first price. 
For now, we'll rely on the latter and let `querySelector()` simply return the first result:
+
+```js
+price = subwoofer.querySelector('.price');
+price.textContent;
+```
+
+It works, but the price isn't alone in the result. Before we could use such data, we'd need to do some **data cleaning**:
+
+![Extracting product price](./images/devtools-extracting-price.png)
+
+But that's okay for now. We're just testing the waters to get an idea of what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get the values as numbers.
+
+In the next lesson, we'll start with our Python project. First, we'll figure out how to download the Sales page without a browser and make it accessible in a Python program.
+
+---
+
+<Exercises />
+
+### Extract the price of IKEA's most expensive artificial plant
+
+At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML element manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.
+
+ Solution + + 1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/). + 1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing. + 1. Activate the element selection tool in your DevTools. + 1. Click on the price of the first and most expensive plant. + 1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value. + 1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price. + 1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`. + 1. Convert the price text into a number by executing `parseInt(price.textContent)`. + 1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek). + +
+ +### Extract the name of the top wiki on Fandom Movies + +On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use JavaScript's [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name. + +![Fandom's Movies page](./images/devtools-exercise-fandom.png) + +
+ Solution + + 1. Open the [Movies page](https://www.fandom.com/topics/movies). + 1. Activate the element selection tool in your DevTools. + 1. Click on the list item for the top Fandom wiki in the category. + 1. Notice that it has a class `topic_explore-wikis__link`. + 1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done. + 1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`. + 1. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`. + +
+ +### Extract details about the first post on Guardian's F1 news + +On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. + +![F1 news page](./images/devtools-exercise-guardian2.png) + +
+ Solution + + 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone). + 1. Activate the element selection tool in your DevTools. + 1. Click on the first post. + 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead. + 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post. + 1. Extract the post's title by executing `post.querySelector('h3').textContent`. + 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. + 1. Extract the photo URL by executing `post.querySelector('img').src`. + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md new file mode 100644 index 000000000..ec361214f --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md @@ -0,0 +1,219 @@ +--- +title: Downloading HTML with Python +sidebar_label: Downloading HTML +description: Lesson about building a Python application for watching prices. Using the HTTPX library to download HTML code of a product listing page. +slug: /scraping-basics-javascript2/downloading-html +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download HTML code of a product listing page.** + +--- + +Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads HTML code of the product listing. + +## Starting a Python project + +Before we start coding, we need to set up a Python project. Let's create new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library: + +```text +$ pip install httpx +... +Successfully installed ... httpx-0.0.0 +``` + +:::tip Installing packages + +Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide. + +::: + +Now let's test that all works. Inside the project directory we'll create a new file called `main.py` with the following code: + +```py +import httpx + +print("OK") +``` + +Running it as a Python program will verify that our setup is okay and we've installed HTTPX: + +```text +$ python main.py +OK +``` + +:::info Troubleshooting + +If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is out of scope of this course. + +::: + +## Downloading product listing + +Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples how to use it. Inspired by those, our code will look like this: + +```py +import httpx + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +print(response.text) +``` + +If we run the program now, it should print the downloaded HTML: + +```text +$ python main.py + + + + + + + + Sales + ... + + +``` + +Running `httpx.get(url)`, we made a HTTP request and received a response. It's not particularly useful yet, but it's a good start of our scraper. + +:::tip Client and server, request and response + +HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology: + +- HTTP is an exchange between two participants. +- The _client_ sends a _request_ to the _server_, which replies with a _response_. 
+- In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server. + +::: + +## Handling errors + +Websites can return various errors, such as when the server is temporarily down, applying anti-scraping protections, or simply being buggy. In HTTP, each response has a three-digit _status code_ that indicates whether it is an error or a success. + +:::tip All status codes + +If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource. + +::: + +A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error. + +First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available: + +```text +https://warehouse-theme-metal.myshopify.com/does/not/exist +``` + +We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful: + +```py +import httpx + +url = "https://warehouse-theme-metal.myshopify.com/does/not/exist" +response = httpx.get(url) +response.raise_for_status() +print(response.text) +``` + +If you run the code above, the program should crash: + +```text +$ python main.py +Traceback (most recent call last): + File "/Users/.../main.py", line 5, in + response.raise_for_status() + File "/Users/.../.venv/lib/python3/site-packages/httpx/_models.py", line 761, in raise_for_status + raise HTTPStatusError(message, request=request, response=self) +httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist' +For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404 +``` + +Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML. + +--- + + + +### Scrape AliExpress + +Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results: + +```text +https://www.aliexpress.com/w/wholesale-darth-vader.html +``` + +
+ Solution + + ```py + import httpx + + url = "https://www.aliexpress.com/w/wholesale-darth-vader.html" + response = httpx.get(url) + response.raise_for_status() + print(response.text) + ``` + +
+ +### Save downloaded HTML as a file + +Download HTML, then save it on your disk as a `products.html` file. You can use the URL we've been already playing with: + +```text +https://warehouse-theme-metal.myshopify.com/collections/sales +``` + +
+ Solution + + Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs: + + ```text + python main.py > products.html + ``` + + If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html): + + ```py + import httpx + from pathlib import Path + + url = "https://warehouse-theme-metal.myshopify.com/collections/sales" + response = httpx.get(url) + response.raise_for_status() + Path("products.html").write_text(response.text) + ``` + +
+ +### Download an image as a file + +Download a product image, then save it on your disk as a file. While HTML is _textual_ content, images are _binary_. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV: + +```text +https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg +``` + +
+ Solution + + Python offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html): + + ```py + from pathlib import Path + import httpx + + url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg" + response = httpx.get(url) + response.raise_for_status() + Path("tv.jpg").write_bytes(response.content) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md new file mode 100644 index 000000000..81aaf6778 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md @@ -0,0 +1,167 @@ +--- +title: Parsing HTML with Python +sidebar_label: Parsing HTML +description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to parse HTML code of a product listing page. +slug: /scraping-basics-javascript2/parsing-html +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.** + +--- + +From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`. + +![Products have the ‘product-item’ class](./images/product-item.png) + +As a first step, let's try counting how many products are on the listing page. + +## Processing HTML + +After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products? + +While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an _HTML parser_. It takes a text with HTML markup and turns it into a tree of Python objects. + +:::info Why regex can't parse HTML + +While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't go much into explaining. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty. + +::: + +We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. + +```text +$ pip install beautifulsoup4 +... +Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0 +``` + +Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `
<h1>
` element, which represents the main heading of the page. + +![Element of the main heading](./images/h1.png) + +We'll update our code to the following: + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") +print(soup.select("h1")) +``` + +Then let's run the program: + +```text +$ python main.py +[
<h1 class="collection__title heading h1">Sales</h1>
] +``` + +Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following: + +```py +headings = soup.select("h1") +first_heading = headings[0] +print(first_heading.text) +``` + +If we run our scraper again, it prints the text of the first `h1` element: + +```text +$ python main.py +Sales +``` + +:::note Dynamic websites + +The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](../api_scraping/index.md) and [Puppeteer & Playwright](../puppeteer_playwright/index.md) courses. + +::: + +## Using CSS selectors + +Beautiful Soup's `.select()` method runs a _CSS selector_ against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools. + +Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards: + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") +products = soup.select(".product-item") +print(len(products)) +``` + +In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list. + +```text +$ python main.py +24 +``` + +That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products. + +--- + + + +### Scrape F1 teams + +Print a total count of F1 teams listed on this page: + +```text +https://www.formula1.com/en/teams +``` + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + + url = "https://www.formula1.com/en/teams" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + print(len(soup.select(".group"))) + ``` + +
+ +### Scrape F1 drivers + +Use the same URL as in the previous exercise, but this time print a total count of F1 drivers. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + + url = "https://www.formula1.com/en/teams" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + print(len(soup.select(".f1-team-driver-name"))) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md new file mode 100644 index 000000000..ef85a2612 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md @@ -0,0 +1,324 @@ +--- +title: Locating HTML elements with Python +sidebar_label: Locating HTML elements +description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page. +slug: /scraping-basics-javascript2/locating-elements +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll locate product data in the downloaded HTML. We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price.** + +--- + +In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two. What happens if we print `.text` for each product card? + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + print(product.text) +``` + +Well, it definitely prints _something_… + +```text +$ python main.py +Save $25.00 + + +JBL +JBL Flip 4 Waterproof Portable Bluetooth Speaker + + + +Black + ++7 + + +Blue + ++6 + + +Grey +... +``` + +To get details about each product in a structured way, we'll need a different approach. + +## Locating child elements + +As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card. + +![Product card's child elements](./images/child-elements.png) + +We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors: + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + titles = product.select(".product-item__title") + first_title = titles[0].text + + prices = product.select(".price") + first_price = prices[0].text + + print(first_title, first_price) +``` + +Let's run the program now: + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker +Sale price$74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV +Sale priceFrom $1,398.00 +... +``` + +There's still some room for improvement, but it's already much better! + +## Locating a single element + +Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code! 
+ +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text + price = product.select_one(".price").text + print(title, price) +``` + +This program does the same as the one we already had, but its code is more concise. + +:::note Fragile code + +We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above will trigger warnings about this. + +Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it. + +::: + +## Precisely locating price + +In the output we can see that the price isn't located precisely: + +```text +JBL Flip 4 Waterproof Portable Bluetooth Speaker +Sale price$74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV +Sale priceFrom $1,398.00 +... +``` + +For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this: + +```html + + Sale price + $74.95 + +``` + +When translated to a tree of Python objects, the element with class `price` will contain several _nodes_: + +- Textual node with white space, +- a `span` HTML element, +- a textual node representing the actual amount and possibly also white space. + +We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this: + +```py +["\n", Sale price, "$74.95"] +``` + +It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program: + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text + price = product.select_one(".price").contents[-1] + print(title, price) +``` + +If we run the scraper now, it should print prices as only amounts: + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00 +... +``` + +## Formatting output + +The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function: + +```py +print(title, price, sep=" | ") +``` + +The output is much nicer this way: + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00 +... +``` + +Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second—what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that. 
But that's a job for the next lesson, which is about extracting data. + +--- + + + +### Scrape Wikipedia + +Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print short English names of all the states and territories mentioned in all tables. This is the URL: + +```text +https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa +``` + +Your program should print the following: + +```text +Algeria +Angola +Benin +Botswana +... +``` + +
+  Solution
+
+  ```py
+  import httpx
+  from bs4 import BeautifulSoup
+
+  url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+  response = httpx.get(url)
+  response.raise_for_status()
+
+  html_code = response.text
+  soup = BeautifulSoup(html_code, "html.parser")
+
+  for table in soup.select(".wikitable"):
+      for row in table.select("tr"):
+          cells = row.select("td")
+          if cells:
+              third_column = cells[2]
+              title_link = third_column.select_one("a")
+              print(title_link.text)
+  ```
+
+  Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
+
+ +### Use CSS selectors to their max + +Simplify the code from previous exercise. Use a single for loop and a single CSS selector. You may want to check out the following pages: + +- [Descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) +- [`:nth-child()` pseudo-class](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child) + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + + url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for name_cell in soup.select(".wikitable tr td:nth-child(3)"): + print(name_cell.select_one("a").text) + ``` + +
+ +### Scrape F1 news + +Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print titles of all the listed articles. This is the URL: + +```text +https://www.theguardian.com/sport/formulaone +``` + +Your program should print something like the following: + +```text +Wolff confident Mercedes are heading to front of grid after Canada improvement +Frustrated Lando Norris blames McLaren team for missed chance +Max Verstappen wins Canadian Grand Prix: F1 – as it happened +... +``` + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + + url = "https://www.theguardian.com/sport/formulaone" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for title in soup.select("#maincontent ul li h3"): + print(title.text) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md new file mode 100644 index 000000000..81a375dc5 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md @@ -0,0 +1,346 @@ +--- +title: Extracting data from HTML with Python +sidebar_label: Extracting data from HTML +description: Lesson about building a Python application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page. +slug: /scraping-basics-javascript2/extracting-data +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.** + +--- + +Locating the right HTML elements is the first step of a successful data extraction, so it's no surprise that we're already close to having the data in the correct form. The last bit that still requires our attention is the price: + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00 +... +``` + +Let's summarize what stands in our way if we want to have it in our Python program as a number: + +- A dollar sign precedes the number, +- the number contains decimal commas for better human readability, and +- some prices start with `From`, which reveals there is a certain complexity in how the shop deals with prices. + +## Representing price + +The last bullet point is the most important to figure out before we start coding. We thought we'll be scraping numbers, but in the middle of our effort, we discovered that the price is actually a range. + +It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing. + +Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix? + +```py +price_text = product.select_one(".price").contents[-1] +price = price_text.removeprefix("From ") +``` + +In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty: + +```py +price_text = product.select_one(".price").contents[-1] +if price_text.startswith("From "): + min_price = price_text.removeprefix("From ") + price = None +else: + min_price = price_text + price = min_price +``` + +:::tip Built-in string methods + +If you're not proficient in Python's string methods, [.startswith()](https://docs.python.org/3/library/stdtypes.html#str.startswith) checks the beginning of a given string, and [.removeprefix()](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) removes something from the beginning of a given string. 
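+
+For example, here's a quick check in Python's interactive REPL, using one of the price strings we've seen in the listing (note that `.removeprefix()` requires Python 3.9 or newer):
+
+```py
+>>> price_text = "From $1,398.00"
+>>> price_text.startswith("From ")
+True
+>>> price_text.removeprefix("From ")
+'$1,398.00'
+```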
+ +::: + +The whole program would look like this: + +```py +import httpx +from bs4 import BeautifulSoup + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text + + price_text = product.select_one(".price").contents[-1] + if price_text.startswith("From "): + min_price = price_text.removeprefix("From ") + price = None + else: + min_price = price_text + price = min_price + + print(title, min_price, price, sep=" | ") +``` + +## Removing white space + +Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags. + +We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [.strip()](https://docs.python.org/3/library/stdtypes.html#str.strip): + +```py +title = product.select_one(".product-item__title").text.strip() + +price_text = product.select_one(".price").contents[-1].strip() +``` + +:::info Handling strings in Beautiful Soup + +Beautiful Soup offers several attributes when it comes to working with strings: + +- `.string`, which often is like `.text`, +- `.strings`, which [returns a list of all nested textual nodes](https://beautiful-soup-4.readthedocs.io/en/latest/#strings-and-stripped-strings), +- `.stripped_strings`, which does the same but with whitespace removed. + +These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant. + +::: + +## Removing dollar sign and commas + +We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our Python program: + +```py +>>> price = "$1,998.00" +>>> float(price) +Traceback (most recent call last): + File "", line 1, in +ValueError: could not convert string to float: '$1,998.00' +``` + +:::tip Interactive Python + +The demonstration above is inside the Python's [interactive REPL](https://realpython.com/interacting-with-python/). It's a useful playground where you can try how code behaves before you use it in your program. + +::: + +We need to remove the dollar sign and the decimal commas. For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but in this case [`.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) is also sufficient: + +```py +price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") +) +``` + +## Representing money in programs + +Now we should be able to add `float()`, so that we have the prices not as a text, but as numbers: + +```py +if price_text.startswith("From "): + min_price = float(price_text.removeprefix("From ")) + price = None +else: + min_price = float(price_text) + price = min_price +``` + +Great! Only if we didn't overlook an important pitfall called [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). 
In short, computers save `float()` numbers in a way which isn't always reliable: + +```py +>>> 0.1 + 0.2 +0.30000000000000004 +``` + +These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid `float()` when working with money. Let's instead use Python's built-in [`Decimal()`](https://docs.python.org/3/library/decimal.html) type: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text.strip() + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + print(title, min_price, price, sep=" | ") +``` + +If we run the code above, we have nice, clean data about all the products! + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None +... +``` + +Well, not to spoil the excitement, but in its current form, the data isn't very useful. In the next lesson we'll save the product details to a file which data analysts can use or other programs can read. + +--- + + + +### Scrape units on stock + +Change our scraper so that it extracts how many units of each product are on stock. Your program should print the following. Note the unit amounts at the end of each line: + +```text +JBL Flip 4 Waterproof Portable Bluetooth Speaker 672 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV 77 +Sony SACS9 10" Active Subwoofer 7 +Sony PS-HX500 Hi-Res USB Turntable 15 +Klipsch R-120SW Powerful Detailed Home Speaker - Unit 0 +Denon AH-C720 In-Ear Headphones 236 +... +``` + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + + url = "https://warehouse-theme-metal.myshopify.com/collections/sales" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text.strip() + + units_text = ( + product + .select_one(".product-item__inventory") + .text + .removeprefix("In stock,") + .removeprefix("Only") + .removesuffix(" left") + .removesuffix("units") + .strip() + ) + if "Sold out" in units_text: + units = 0 + else: + units = int(units_text) + + print(title, units) + ``` + +
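+
+The chain of `.removeprefix()` and `.removesuffix()` calls in the solution above may look cryptic. Judging from the strings it removes, the page shows stock either as something like `In stock, 672 units` or `Only 7 left` (the exact wording is an assumption here), and the chain peels off everything around the number so that `.strip()` and `int()` can finish the job:
+
+```py
+>>> "In stock, 672 units".removeprefix("In stock,").removesuffix("units").strip()
+'672'
+>>> "Only 7 left".removeprefix("Only").removesuffix(" left").strip()
+'7'
+```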
+
+### Use regular expressions
+
+Simplify the code from the previous exercise. Use [regular expressions](https://docs.python.org/3/library/re.html) to parse the number of units. You can match digits with a range like `[0-9]` or with the special sequence `\d`, and you can match one or more characters of the same kind by appending `+`.
+
+<details>
+ Solution + + ```py + import re + import httpx + from bs4 import BeautifulSoup + + url = "https://warehouse-theme-metal.myshopify.com/collections/sales" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text.strip() + + units_text = product.select_one(".product-item__inventory").text + if re_match := re.search(r"\d+", units_text): + units = int(re_match.group()) + else: + units = 0 + + print(title, units) + ``` + +
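+
+If regular expressions are new to you, it helps to see what `re.search()` actually returns: a match object when the pattern occurs anywhere in the string, and `None` when it doesn't, which is exactly what the `if` statement in the solution relies on. A quick check in the REPL, using assumed examples of the inventory text:
+
+```py
+>>> import re
+>>> re.search(r"\d+", "In stock, 672 units")
+<re.Match object; span=(10, 13), match='672'>
+>>> re.search(r"\d+", "Sold out") is None
+True
+```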
+ +### Scrape publish dates of F1 news + +Download Guardian's page with the latest F1 news and use Beautiful Soup to parse it. Print titles and publish dates of all the listed articles. This is the URL: + +```text +https://www.theguardian.com/sport/formulaone +``` + +Your program should print something like the following. Note the dates at the end of each line: + +```text +Wolff confident Mercedes are heading to front of grid after Canada improvement 2024-06-10 +Frustrated Lando Norris blames McLaren team for missed chance 2024-06-09 +Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09 +... +``` + +Hints: + +- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601. +- Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes). +- In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat). +- To get just the date part, you can call `.date()` on any `datetime` object. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from datetime import datetime + + url = "https://www.theguardian.com/sport/formulaone" + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for article in soup.select("#maincontent ul li"): + title = article.select_one("h3").text.strip() + + time_iso = article.select_one("time")["datetime"].strip() + published_at = datetime.fromisoformat(time_iso) + published_on = published_at.date() + + print(title, published_on) + ``` + +
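+
+One caveat about `fromisoformat()`: full ISO 8601 support, including the trailing `Z` that some sites use to mark UTC, only arrived in Python 3.11. If the method raises a `ValueError` for you, compare the exact format of the `datetime` attribute with what your Python version accepts. Timestamps with a numeric offset parse fine even on older versions:
+
+```py
+>>> from datetime import datetime
+>>> datetime.fromisoformat("2024-06-10T16:30:00+01:00").date()
+datetime.date(2024, 6, 10)
+```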
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md new file mode 100644 index 000000000..b2c027a8c --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md @@ -0,0 +1,246 @@ +--- +title: Saving data with Python +sidebar_label: Saving data +description: Lesson about building a Python application for watching prices. Using standard library to save data scraped from product listing pages in popular formats such as CSV or JSON. +slug: /scraping-basics-javascript2/saving-data +unlisted: true +--- + +**In this lesson, we'll save the data we scraped in the popular formats, such as CSV or JSON. We'll use Python's standard library to export the files.** + +--- + +We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically. + +```text +$ python main.py +JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95 +Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None +... +``` + +However, the format of this text is rather _ad hoc_ and does not adhere to any specific standard that others could follow. It's unclear what to do if a product title already contains the `|` character or how to represent multi-line product descriptions. No ready-made library can handle all the parsing. + +We should use widely popular formats that have well-defined solutions for all the corner cases and that other programs can read without much effort. Two such formats are CSV (_Comma-separated values_) and JSON (_JavaScript Object Notation_). + +## Collecting data + +Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable. This'll take three changes to our program: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +# highlight-next-line +data = [] +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text.strip() + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + # highlight-next-line + data.append({"title": title, "min_price": min_price, "price": price}) + +# highlight-next-line +print(data) +``` + +Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once. + +```text +$ python main.py +[{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...] 
+``` + +:::tip Pretty print + +If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead. + +::: + +## Saving data as CSV + +The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets. + +In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage: + +```py +>>> import csv +>>> with open("data.csv", "w") as file: +... writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"]) +... writer.writeheader() +... writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"}) +... writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"}) +... +``` + +We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents: + +```csv title=data.csv +name,age,hobbies +Alice,24,"kickbox, Python" +Bob,42,"reading, TypeScript" +``` + +In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this. + +When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have. + +![CSV example preview](images/csv-example.png) + +Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +# highlight-next-line +import csv +``` + +Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following: + +```py +with open("products.csv", "w") as file: + writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"]) + writer.writeheader() + for row in data: + writer.writerow(row) +``` + +If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products. + +![CSV preview](images/csv.png) + +## Saving data as JSON + +The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries. + +In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. 
We'll need to begin with imports: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +# highlight-next-line +import json +``` + +Next, let’s append one more export to end of the source code of our scraper: + +```py +with open("products.json", "w") as file: + json.dump(data, file) +``` + +That’s it! If we run the program now, it should also create a `products.json` file in the current working directory: + +```text +$ python main.py +Traceback (most recent call last): + ... + raise TypeError(f'Object of type {o.__class__.__name__} ' +TypeError: Object of type Decimal is not JSON serializable +``` + +Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly: + +```py +def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + +with open("products.json", "w") as file: + json.dump(data, file, default=serialize) +``` + +Now the program should work as expected, producing a JSON file with the following content: + + +```json title=products.json +[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...] +``` + +If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash: + +```json +{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"} +``` + +:::tip Pretty JSON + +While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output. + +Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\\u00fan b\\u00f2 Nam B\\u00f4`. + +::: + +We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages. + +--- + +## Exercises + +In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them. + +### Process your CSV + +Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500. + +
+ Solution + + Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account: + + 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data. + 2. Select the header row. Go to **Data > Create filter**. + 3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data. + + ![CSV in Google Sheets](images/csv-sheets.png) + +
+ +### Process your JSON + +Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp). + +
+ Solution + + ```py + import json + from pprint import pp + from decimal import Decimal + + with open("products.json", "r") as file: + products = json.load(file) + + for product in products: + if Decimal(product["min_price"]) > 500: + pp(product) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md new file mode 100644 index 000000000..9d2a41333 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md @@ -0,0 +1,430 @@ +--- +title: Getting links from HTML with Python +sidebar_label: Getting links from HTML +description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate links to individual product pages. +slug: /scraping-basics-javascript2/getting-links +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson, we'll locate and extract links to individual product pages. We'll use BeautifulSoup to find the relevant bits of HTML.** + +--- + +The previous lesson concludes our effort to create a scraper. Our program now downloads HTML, locates and extracts data from the markup, and saves the data in a structured and reusable way. + +For some use cases, this is already enough! In other cases, though, scraping just one page is hardly useful. The data is spread across the website, over several pages. + +## Crawling websites + +We'll use a technique called crawling, i.e. following links to scrape multiple pages. The algorithm goes like this: + +1. Visit the start URL. +1. Extract new URLs (and data), and save them. +1. Visit one of the newly found URLs and save data and/or more URLs from it. +1. Repeat steps 2 and 3 until you have everything you need. + +This will help us figure out the actual prices of products, as right now, for some, we're only getting the min price. Implementing the algorithm will require quite a few changes to our code, though. + +## Restructuring code + +Over the course of the previous lessons, the code of our program grew to almost 50 lines containing downloading, parsing, and exporting: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json + +url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +response = httpx.get(url) +response.raise_for_status() + +html_code = response.text +soup = BeautifulSoup(html_code, "html.parser") + +data = [] +for product in soup.select(".product-item"): + title = product.select_one(".product-item__title").text.strip() + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + data.append({"title": title, "min_price": min_price, "price": price}) + +with open("products.csv", "w") as file: + writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"]) + writer.writeheader() + for row in data: + writer.writerow(row) + +def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + +with open("products.json", "w") as file: + json.dump(data, file, default=serialize) +``` + +Let's introduce several functions to make the whole thing easier to digest. 
First, we can turn the beginning of our program into this `download()` function, which takes a URL and returns a `BeautifulSoup` instance: + +```py +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") +``` + +Next, we can put parsing into a `parse_product()` function, which takes the product item element and returns the dictionary with data: + +```py +def parse_product(product): + title = product.select_one(".product-item__title").text.strip() + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price} +``` + +Now the CSV export. We'll make a small change here. Having to specify the field names is not ideal. What if we add more field names in the parsing function? We'd always have to remember to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. One way would be to infer the field names from the dictionary keys of the first row: + +```py +def export_csv(file, data): + # highlight-next-line + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) +``` + +:::note Fragile code + +The code above assumes the `data` variable contains at least one item, and that all the items have the same keys. This isn't robust and could break, but in our program, this isn't a problem, and omitting these corner cases allows us to keep the code examples more succinct. + +::: + +The last function we'll add will take care of the JSON export. 
For better readability of the JSON export, let's make a small change here too and set the indentation level to two spaces: + +```py +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + # highlight-next-line + json.dump(data, file, default=serialize, indent=2) +``` + +Now let's put it all together: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json + +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") + +def parse_product(product): + title = product.select_one(".product-item__title").text.strip() + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price} + +def export_csv(file, data): + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) + +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + json.dump(data, file, default=serialize, indent=2) + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product) + data.append(item) + +with open("products.csv", "w") as file: + export_csv(file, data) + +with open("products.json", "w") as file: + export_json(file, data) +``` + +The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code. + +:::tip Refactoring + +We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior. + +![Refactoring](images/refactoring.gif) + +::: + +## Extracting links + +With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item: + +![Product card's child elements](./images/child-elements.png) + +Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this: + +```html +Text of the link +``` + +In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys: + +```py +def parse_product(product): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + url = title_element["href"] + + ... 
+ + return {"title": title, "min_price": min_price, "price": price, "url": url} +``` + +In the previous code example, we've also added the URL to the dictionary returned by the function. If we run the scraper now, it should produce exports where each product contains a link to its product page: + + +```json title=products.json +[ + { + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker" + }, + { + "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", + "min_price": "1398.00", + "price": null, + "url": "/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv" + }, + ... +] +``` + +Hmm, but that isn't what we wanted! Where is the beginning of each URL? It turns out the HTML contains so-called _relative links_. + +## Turning relative links into absolute + +Browsers reading the HTML know the base address and automatically resolve such links, but we'll have to do this manually. The function [`urljoin`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin) from Python's standard library will help us. Let's add it to our imports first: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json +# highlight-next-line +from urllib.parse import urljoin +``` + +Next, we'll change the `parse_product()` function so that it also takes the base URL as an argument and then joins it with the relative URL to the product page: + +```py +# highlight-next-line +def parse_product(product, base_url): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + # highlight-next-line + url = urljoin(base_url, title_element["href"]) + + ... + + return {"title": title, "min_price": min_price, "price": price, "url": url} +``` + +Now we'll pass the base URL to the function in the main body of our program: + +```py +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + # highlight-next-line + item = parse_product(product, listing_url) + data.append(item) +``` + +When we run the scraper now, we should see full URLs in our exports: + + +```json title=products.json +[ + { + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker" + }, + { + "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", + "min_price": "1398.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv" + }, + ... +] +``` + +Ta-da! We've managed to get links leading to the product pages. In the next lesson, we'll crawl these URLs so that we can gather more details about the products in our dataset. + +--- + + + +### Scrape links to countries in Africa + +Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print links to Wikipedia pages of all the states and territories mentioned in all tables. 
Start with this URL: + +```text +https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa +``` + +Your program should print the following: + +```text +https://en.wikipedia.org/wiki/Algeria +https://en.wikipedia.org/wiki/Angola +https://en.wikipedia.org/wiki/Benin +https://en.wikipedia.org/wiki/Botswana +... +``` + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" + response = httpx.get(listing_url) + response.raise_for_status() + + html_code = response.text + soup = BeautifulSoup(html_code, "html.parser") + + for name_cell in soup.select(".wikitable tr td:nth-child(3)"): + link = name_cell.select_one("a") + url = urljoin(listing_url, link["href"]) + print(url) + ``` + +
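+
+The selector `.wikitable tr td:nth-child(3)` relies on the linked country name being in the third column of every table row, and the solution further assumes each of those cells contains a link. If a cell without a link slipped through, `link["href"]` would raise a `TypeError`. A slightly more defensive version of the loop, shown here as a sketch, simply skips such cells:
+
+```py
+for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
+    link = name_cell.select_one("a")
+    # skip cells that don't contain a link
+    if link is not None:
+        print(urljoin(listing_url, link["href"]))
+```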
+ +### Scrape links to F1 news + +Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print links to all the listed articles. Start with this URL: + +```text +https://www.theguardian.com/sport/formulaone +``` + +Your program should print something like the following: + +```text +https://www.theguardian.com/world/2024/sep/13/africa-f1-formula-one-fans-lewis-hamilton-grand-prix +https://www.theguardian.com/sport/2024/sep/12/mclaren-lando-norris-oscar-piastri-team-orders-f1-title-race-max-verstappen +https://www.theguardian.com/sport/article/2024/sep/10/f1-designer-adrian-newey-signs-aston-martin-deal-after-quitting-red-bull +https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-undriveable-monster-how-bad-really-is-it-and-why +... +``` + +
+  <summary>Solution</summary>
+
+  ```py
+  import httpx
+  from bs4 import BeautifulSoup
+  from urllib.parse import urljoin
+
+  listing_url = "https://www.theguardian.com/sport/formulaone"
+  response = httpx.get(listing_url)
+  response.raise_for_status()
+
+  html_code = response.text
+  soup = BeautifulSoup(html_code, "html.parser")
+
+  for item in soup.select("#maincontent ul li"):
+      link = item.select_one("a")
+      article_url = urljoin(listing_url, link["href"])
+      print(article_url)
+  ```
+
+  Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list with `#maincontent ul li a`, we would get incorrect output like this:
+
+  ```text
+  https://www.theguardian.com/sport/article/2024/sep/02/example
+  https://www.theguardian.com/sport/article/2024/sep/02/example#comments
+  ```
+
+</details>
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md new file mode 100644 index 000000000..f46b0ec63 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md @@ -0,0 +1,305 @@ +--- +title: Crawling websites with Python +sidebar_label: Crawling websites +description: Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages. +slug: /scraping-basics-javascript2/crawling +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.** + +--- + +In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products. + +Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json +from urllib.parse import urljoin + +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") + +def parse_product(product, base_url): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + url = urljoin(base_url, title_element["href"]) + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price, "url": url} + +def export_csv(file, data): + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) + +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + json.dump(data, file, default=serialize, indent=2) + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + data.append(item) + +with open("products.csv", "w") as file: + export_csv(file, data) + +with open("products.json", "w") as file: + export_json(file, data) +``` + +## Extracting vendor name + +Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. + +![Product detail page](./images/pdp.png) + +Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. 
As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure: + +```html +
+

+ Sony XBR-950G BRAVIA 4K HDR Ultra HD TV +

+
+ ... +
+
+ + + + Sony + + + + SKU: + SON-985594-XBR-65 + +
+ +
+ + 3 reviews +
+
+ ... +
+``` + +It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string: + +```py +vendor = product_soup.select_one(".product-meta__vendor").text.strip() +``` + +But where do we put this line in our program? + +## Crawling product detail pages + +In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary: + +```py +... + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + # highlight-next-line + product_soup = download(item["url"]) + # highlight-next-line + item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip() + data.append(item) + +... +``` + +If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name: + + +```json title=products.json +[ + { + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", + "vendor": "JBL" + }, + { + "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", + "min_price": "1398.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", + "vendor": "Sony" + }, + ... +] +``` + +## Extracting price + +Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices! + +Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… + +![Morpheus revealing the existence of product variants](images/variants.png) + +In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. + +--- + + + +### Scrape calling codes of African countries + +This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the _calling code_ from the info table. Print the URL and the calling code for each country. Start with this URL: + +```text +https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa +``` + +Your program should print the following: + +```text +https://en.wikipedia.org/wiki/Algeria +213 +https://en.wikipedia.org/wiki/Angola +244 +https://en.wikipedia.org/wiki/Benin +229 +https://en.wikipedia.org/wiki/Botswana +267 +https://en.wikipedia.org/wiki/Burkina_Faso +226 +https://en.wikipedia.org/wiki/Burundi None +https://en.wikipedia.org/wiki/Cameroon +237 +... 
+``` + +Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + def parse_calling_code(soup): + for label in soup.select("th.infobox-label"): + if label.text.strip() == "Calling code": + data = label.parent.select_one("td.infobox-data") + return data.text.strip() + return None + + listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" + listing_soup = download(listing_url) + for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"): + link = name_cell.select_one("a") + country_url = urljoin(listing_url, link["href"]) + country_soup = download(country_url) + calling_code = parse_calling_code(country_soup) + print(country_url, calling_code) + ``` + +
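+
+The solution climbs from the `th` element with the label up to its parent row and then searches back down for the `td` with the value. Beautiful Soup can also move sideways: `find_next_sibling()` searches the elements that follow within the same parent. Assuming the usual infobox layout, where the label and the value share a table row, an equivalent helper might look like this:
+
+```py
+def parse_calling_code(soup):
+    for label in soup.select("th.infobox-label"):
+        if label.text.strip() == "Calling code":
+            # the value sits in the td that follows the label within the same row
+            data = label.find_next_sibling("td")
+            return data.text.strip() if data else None
+    return None
+```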
+ +### Scrape authors of F1 news articles + +This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL: + +```text +https://www.theguardian.com/sport/formulaone +``` + +Your program should print something like this: + +```text +Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics +Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire? +Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1 +Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull +PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying +... +``` + +Hints: + +- You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values. +- Sometimes a person authors the article, but other times it's contributed by a news agency. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + def parse_author(article_soup): + link = article_soup.select_one('aside a[rel="author"]') + if link: + return link.text.strip() + address = article_soup.select_one('aside address') + if address: + return address.text.strip() + return None + + listing_url = "https://www.theguardian.com/sport/formulaone" + listing_soup = download(listing_url) + for item in listing_soup.select("#maincontent ul li"): + link = item.select_one("a") + article_url = urljoin(listing_url, link["href"]) + article_soup = download(article_url) + title = article_soup.select_one("h1").text.strip() + author = parse_author(article_soup) + print(f"{author}: {title}") + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md new file mode 100644 index 000000000..0c68ea5b7 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md @@ -0,0 +1,420 @@ +--- +title: Scraping product variants with Python +sidebar_label: Scraping product variants +description: Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items. +slug: /scraping-basics-javascript2/scraping-variants +unlisted: true +--- + +import Exercises from './_exercises.mdx'; + +**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** + +--- + +We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL. + +## Locating variants + +First, let's extract information about the variants. If we go to [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) and open the DevTools, we can see that the buttons for switching between variants look like this: + +```html +
+
+ + +
+
+ + +
+
+``` + +Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information. + +![Switching variants](images/variants-js.gif) + +If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible. + +After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data! + +```html +
+ +
+ +
+
+``` + +These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight. + +## Extracting variants + +Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need: + +```py +... + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + product_soup = download(item["url"]) + vendor = product_soup.select_one(".product-meta__vendor").text.strip() + + if variants := product_soup.select(".product-form__option.no-js option"): + for variant in variants: + data.append(item | {"variant_name": variant.text.strip()}) + else: + item["variant_name"] = None + data.append(item) + +... +``` + +The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper. + +Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty. + +:::tip Modern Python syntax + +Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an _assignment expression_ or _walrus operator_. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/). + +Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details. + +::: + +If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page. + + +```json title=products.json +[ + ... + { + "variant_name": null, + "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit", + "min_price": "324.00", + "price": "324.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1", + "vendor": "Klipsch" + }, + ... +] +``` + +Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items. + + +```json title=products.json +[ + ... 
+ { + "variant_name": "Red - $178.00", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + { + "variant_name": "Black - $178.00", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": null, + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + ... +] +``` + +Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like _$74.95_, instead of _from $74.95_. + + +```json title=products.json +[ + ... + { + "variant_name": "Red - $74.95", + "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", + "min_price": "74.95", + "price": "74.95", + "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", + "vendor": "JBL" + }, + ... +] +``` + +## Parsing price + +The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that: + +```py +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} +``` + +First, we split the text into two parts, then we parse the price as a decimal number. This part is similar to what we already do for parsing product listing prices. The function returns a dictionary we can merge with `item`. + +## Saving price + +Now, if we use our new function, we should finally get a program that can scrape exact prices for all products, even if they have variants. 
The whole code should look like this now: + +```py +import httpx +from bs4 import BeautifulSoup +from decimal import Decimal +import csv +import json +from urllib.parse import urljoin + +def download(url): + response = httpx.get(url) + response.raise_for_status() + + html_code = response.text + return BeautifulSoup(html_code, "html.parser") + +def parse_product(product, base_url): + title_element = product.select_one(".product-item__title") + title = title_element.text.strip() + url = urljoin(base_url, title_element["href"]) + + price_text = ( + product + .select_one(".price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + if price_text.startswith("From "): + min_price = Decimal(price_text.removeprefix("From ")) + price = None + else: + min_price = Decimal(price_text) + price = min_price + + return {"title": title, "min_price": min_price, "price": price, "url": url} + +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} + +def export_csv(file, data): + fieldnames = list(data[0].keys()) + writer = csv.DictWriter(file, fieldnames=fieldnames) + writer.writeheader() + for row in data: + writer.writerow(row) + +def export_json(file, data): + def serialize(obj): + if isinstance(obj, Decimal): + return str(obj) + raise TypeError("Object not JSON serializable") + + json.dump(data, file, default=serialize, indent=2) + +listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" +listing_soup = download(listing_url) + +data = [] +for product in listing_soup.select(".product-item"): + item = parse_product(product, listing_url) + product_soup = download(item["url"]) + vendor = product_soup.select_one(".product-meta__vendor").text.strip() + + if variants := product_soup.select(".product-form__option.no-js option"): + for variant in variants: + # highlight-next-line + data.append(item | parse_variant(variant)) + else: + item["variant_name"] = None + data.append(item) + +with open("products.csv", "w") as file: + export_csv(file, data) + +with open("products.json", "w") as file: + export_json(file, data) +``` + +Let's run the scraper and see if all the items in the data contain prices: + + +```json title=products.json +[ + ... + { + "variant_name": "Red", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": "178.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + { + "variant_name": "Black", + "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", + "min_price": "128.00", + "price": "178.00", + "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", + "vendor": "Sony" + }, + ... +] +``` + +Success! We managed to build a Python application for watching prices! + +Is this the end? Maybe! In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's happening while we wait for the program to finish. + +--- + + + +### Build a scraper for watching Python jobs + +You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). 
Scrape the job postings that match the following criteria: + +- Tagged as "Database" +- Posted within the last 60 days + +For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data: + +- Job title +- Company +- URL to the job posting +- Date of posting + +Your output should look something like this: + +```py +{'title': 'Senior Full Stack Developer', + 'company': 'Baserow', + 'url': 'https://www.python.org/jobs/7705/', + 'posted_on': datetime.date(2024, 9, 16)} +{'title': 'Senior Python Engineer', + 'company': 'Active Prime', + 'url': 'https://www.python.org/jobs/7699/', + 'posted_on': datetime.date(2024, 9, 5)} +... +``` + +You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`. + +
+ Solution + + After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually. + + ```py + from pprint import pp + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + from datetime import datetime, date, timedelta + + today = date.today() + jobs_url = "https://www.python.org/jobs/type/database/" + response = httpx.get(jobs_url) + response.raise_for_status() + soup = BeautifulSoup(response.text, "html.parser") + + for job in soup.select(".list-recent-jobs li"): + link = job.select_one(".listing-company-name a") + + time = job.select_one(".listing-posted time") + posted_at = datetime.fromisoformat(time["datetime"]) + posted_on = posted_at.date() + posted_ago = today - posted_on + + if posted_ago <= timedelta(days=60): + title = link.text.strip() + company = list(job.select_one(".listing-company-name").stripped_strings)[-1] + url = urljoin(jobs_url, link["href"]) + pp({"title": title, "company": company, "url": url, "posted_on": posted_on}) + ``` + +
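+
+The way the solution reads the company name deserves a note. The element with the `listing-company-name` class appears to hold both the link with the job title and, after it, the company name as plain text. That structure is an assumption, but it's what taking the last item of `stripped_strings` builds on:
+
+```py
+>>> company_element = job.select_one(".listing-company-name")
+>>> list(company_element.stripped_strings)
+['Senior Full Stack Developer', 'Baserow']
+```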
+ +### Find the shortest CNN article which made it to the Sports homepage + +Scrape the [CNN Sports](https://edition.cnn.com/sport) homepage. For each linked article, calculate its length in characters: + +- Locate the element that holds the main content of the article. +- Use [`get_text()`](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#get-text) to extract all the content as plain text. +- Use `len()` to calculate the character count. + +Skip pages without text (like those that only have a video). Sort the results and print the URL of the shortest article that made it to the homepage. + +At the time of writing, the shortest article on the CNN Sports homepage is [about a donation to the Augusta National Golf Club](https://edition.cnn.com/2024/10/03/sport/masters-donation-hurricane-helene-relief-spt-intl/), which is just 1,642 characters long. + +
+ Solution + + ```py + import httpx + from bs4 import BeautifulSoup + from urllib.parse import urljoin + + def download(url): + response = httpx.get(url) + response.raise_for_status() + return BeautifulSoup(response.text, "html.parser") + + listing_url = "https://edition.cnn.com/sport" + listing_soup = download(listing_url) + + data = [] + for card in listing_soup.select(".layout__main .card"): + link = card.select_one(".container__link") + article_url = urljoin(listing_url, link["href"]) + article_soup = download(article_url) + if content := article_soup.select_one(".article__content"): + length = len(content.get_text()) + data.append((length, article_url)) + + data.sort() + shortest_item = data[0] + item_url = shortest_item[1] + print(item_url) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
new file mode 100644
index 000000000..3cf1f02c7
--- /dev/null
+++ b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
@@ -0,0 +1,601 @@
---
title: Using a scraping framework with Python
sidebar_label: Using a framework
description: Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper.
slug: /scraping-basics-javascript2/framework
unlisted: true
---

import Exercises from './_exercises.mdx';

**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**

---

Before rewriting our code, let's point out several caveats in our current solution:

- _Hard to maintain:_ All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages.
- _Slow:_ The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient.
- _No logging:_ The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs.
- _Boilerplate code:_ We implement downloading and parsing HTML, or exporting data to CSV, although we're not the first people to face and solve these problems.
- _Prone to anti-scraping:_ If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working.
- _Browser means rewrite:_ We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program.
- _No error handling:_ The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors.

In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework.

:::info Why Crawlee and not Scrapy

Of the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development.

We genuinely believe beginners to scraping will like it more, since it lets them create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.

:::

## Installing Crawlee

When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on Beautiful Soup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies, so expect the installation to take a while.

```text
$ pip install crawlee[beautifulsoup]
...
Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...
+``` + +## Running Crawlee + +Now let's use the framework to create a new version of our scraper. First, let's rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, we'll create a new, empty `main.py`. The initial content will look like this: + +```py +import asyncio +from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + +async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context: BeautifulSoupCrawlingContext): + if title := context.soup.title: + print(title.text.strip()) + + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + +if __name__ == '__main__': + asyncio.run(main()) +``` + +In the code, we do the following: + +1. We import the necessary modules and define an asynchronous `main()` function. +2. Inside `main()`, we first create a crawler object, which manages the scraping process. In this case, it's a crawler based on Beautiful Soup. +3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what we can do with the object. +4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the _default handler_ for processing HTTP responses. +5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace. +6. At the end of the function, we run the crawler on a product listing URL and await its completion. +7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop. + +Don't worry if some of this is new. We don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints work. Let's stick to the practical side and observe what the program does when executed: + +```text +$ python main.py +[BeautifulSoupCrawler] INFO Current request statistics: +┌───────────────────────────────┬──────────┐ +│ requests_finished │ 0 │ +│ requests_failed │ 0 │ +│ retry_histogram │ [0] │ +│ request_avg_failed_duration │ None │ +│ request_avg_finished_duration │ None │ +│ requests_finished_per_minute │ 0 │ +│ requests_failed_per_minute │ 0 │ +│ request_total_duration │ 0.0 │ +│ requests_total │ 0 │ +│ crawler_runtime │ 0.010014 │ +└───────────────────────────────┴──────────┘ +[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 +Sales +[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish +[BeautifulSoupCrawler] INFO Final request statistics: +┌───────────────────────────────┬──────────┐ +│ requests_finished │ 1 │ +│ requests_failed │ 0 │ +│ retry_histogram │ [1] │ +│ request_avg_failed_duration │ None │ +│ request_avg_finished_duration │ 0.308998 │ +│ requests_finished_per_minute │ 185 │ +│ requests_failed_per_minute │ 0 │ +│ request_total_duration │ 0.308998 │ +│ requests_total │ 1 │ +│ crawler_runtime │ 0.323721 │ +└───────────────────────────────┴──────────┘ +``` + +If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. 
That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with Beautiful Soup, extracts the title, and prints it. + +:::tip Advanced Python features + +You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/). + +::: + +## Crawling product detail pages + +The code now features advanced Python concepts, so it's less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive. As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. + +For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages: + +```py +import asyncio +from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + +async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context: BeautifulSoupCrawlingContext): + # highlight-next-line + await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title") + + # highlight-next-line + @crawler.router.handler("DETAIL") + # highlight-next-line + async def handle_detail(context: BeautifulSoupCrawlingContext): + # highlight-next-line + print(context.request.url) + + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + +if __name__ == '__main__': + asyncio.run(main()) +``` + +First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`. + +Below that, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL. + +If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way: + +```text +$ python main.py +[BeautifulSoupCrawler] INFO Current request statistics: +┌───────────────────────────────┬──────────┐ +... 
+└───────────────────────────────┴──────────┘ +[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 +https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv +https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker +https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer +https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable +... +[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish +[BeautifulSoupCrawler] INFO Final request statistics: +┌───────────────────────────────┬──────────┐ +│ requests_finished │ 25 │ +│ requests_failed │ 0 │ +│ retry_histogram │ [25] │ +│ request_avg_failed_duration │ None │ +│ request_avg_finished_duration │ 0.349434 │ +│ requests_finished_per_minute │ 318 │ +│ requests_failed_per_minute │ 0 │ +│ request_total_duration │ 8.735843 │ +│ requests_total │ 25 │ +│ crawler_runtime │ 4.713262 │ +└───────────────────────────────┴──────────┘ +``` + +In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially. + +## Extracting data + +The Beautiful Soup crawler provides handlers with the `context.soup` attribute, which contains the parsed HTML of the handled page. This is the same `soup` object we used in our previous program. Let's locate and extract the same data as before: + +```py +async def main(): + ... + + @crawler.router.handler("DETAIL") + async def handle_detail(context: BeautifulSoupCrawlingContext): + item = { + "url": context.request.url, + "title": context.soup.select_one(".product-meta__title").text.strip(), + "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), + } + print(item) +``` + +:::note Fragile code + +The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing. + +::: + +Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper. + +The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page: + +```py +async def main(): + ... + + @crawler.router.handler("DETAIL") + async def handle_detail(context: BeautifulSoupCrawlingContext): + price_text = ( + context.soup + # highlight-next-line + .select_one(".product-form__info-content .price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + item = { + "url": context.request.url, + "title": context.soup.select_one(".product-meta__title").text.strip(), + "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), + "price": Decimal(price_text), + } + print(item) +``` + +Finally, the variants. 
We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. The full program will look like this: + +```py +import asyncio +from decimal import Decimal +from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + +async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context: BeautifulSoupCrawlingContext): + await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") + + @crawler.router.handler("DETAIL") + async def handle_detail(context: BeautifulSoupCrawlingContext): + price_text = ( + context.soup + .select_one(".product-form__info-content .price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + item = { + "url": context.request.url, + "title": context.soup.select_one(".product-meta__title").text.strip(), + "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), + "price": Decimal(price_text), + "variant_name": None, + } + if variants := context.soup.select(".product-form__option.no-js option"): + for variant in variants: + print(item | parse_variant(variant)) + else: + print(item) + + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} + +if __name__ == '__main__': + asyncio.run(main()) +``` + +If we run this scraper, we should get the same data for the 24 products as before. Crawlee has saved us a lot of effort by managing downloading, parsing, and parallelization. The code is also cleaner, with two separate and labeled handlers. + +Crawlee doesn't do much to help with locating and extracting the data—that part of the code remains almost the same, framework or not. This is because the detective work of finding and extracting the right data is the core value of custom scrapers. With Crawlee, we can focus on just that while letting the framework take care of everything else. + +## Saving data + +When we're at _letting the framework take care of everything else_, let's take a look at what it can do about saving data. As of now the product detail page handler prints each item as soon as the item is ready. Instead, we can push the item to Crawlee's default dataset: + +```py +async def main(): + ... + + @crawler.router.handler("DETAIL") + async def handle_detail(context: BeautifulSoupCrawlingContext): + price_text = ( + ... + ) + item = { + ... + } + if variants := context.soup.select(".product-form__option.no-js option"): + for variant in variants: + # highlight-next-line + await context.push_data(item | parse_variant(variant)) + else: + # highlight-next-line + await context.push_data(item) +``` + +That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item. + +![Single dataset item](images/dataset-item.png) + +We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping: + +```py +async def main(): + ... 
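    # Everything above (the crawler object and both handlers) stays the same as
    # in the previous snippet. The export calls below are the only additions, and
    # they execute only after crawler.run() has completed the whole crawl.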
+ + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + # highlight-next-line + await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) + # highlight-next-line + await crawler.export_data_csv(path='dataset.csv') +``` + +After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data. If we peek into the JSON file, it should have indentation. + +## Logging + +Crawlee gives us stats about HTTP requests and concurrency, but we don't get much visibility into the pages we're crawling or the items we're saving. Let's add some custom logging: + +```py +import asyncio +from decimal import Decimal +from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + +async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context: BeautifulSoupCrawlingContext): + # highlight-next-line + context.log.info("Looking for product detail pages") + await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") + + @crawler.router.handler("DETAIL") + async def handle_detail(context: BeautifulSoupCrawlingContext): + # highlight-next-line + context.log.info(f"Product detail page: {context.request.url}") + price_text = ( + context.soup + .select_one(".product-form__info-content .price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + item = { + "url": context.request.url, + "title": context.soup.select_one(".product-meta__title").text.strip(), + "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), + "price": Decimal(price_text), + "variant_name": None, + } + if variants := context.soup.select(".product-form__option.no-js option"): + for variant in variants: + # highlight-next-line + context.log.info("Saving a product variant") + await context.push_data(item | parse_variant(variant)) + else: + # highlight-next-line + context.log.info("Saving a product") + await context.push_data(item) + + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + + # highlight-next-line + crawler.log.info("Exporting data") + await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) + await crawler.export_data_csv(path='dataset.csv') + +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} + +if __name__ == '__main__': + asyncio.run(main()) +``` + +Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are [standard Python loggers](https://docs.python.org/3/library/logging.html). + +If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but at each phase, the code remained clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about. + +In the next lesson, we'll use a scraping platform to set up our application to run automatically every day. 
+ +--- + + + +### Build a Crawlee scraper of F1 Academy drivers + +Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy) drivers listed on the official [Drivers](https://www.f1academy.com/Racing-Series/Drivers) page. Each item you push to Crawlee's default dataset should include the following data: + +- URL of the driver's f1academy.com page +- Name +- Team +- Nationality +- Date of birth (as a `date()` object) +- Instagram URL + +If you export the dataset as JSON, it should look something like this: + + +```json +[ + { + "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus", + "name": "Emely De Heus", + "team": "MP Motorsport", + "nationality": "Dutch", + "dob": "2003-02-10", + "instagram_url": "https://www.instagram.com/emely.de.heus/", + }, + { + "url": "https://www.f1academy.com/Racing-Series/Drivers/28/Hamda-Al-Qubaisi", + "name": "Hamda Al Qubaisi", + "team": "MP Motorsport", + "nationality": "Emirati", + "dob": "2002-08-08", + "instagram_url": "https://www.instagram.com/hamdaalqubaisi_official/", + }, + ... +] +``` + +Hints: + +- Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details. +- To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors). + +
+ Solution + + ```py + import asyncio + from datetime import datetime + + from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + + async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context: BeautifulSoupCrawlingContext): + await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER") + + @crawler.router.handler("DRIVER") + async def handle_driver(context: BeautifulSoupCrawlingContext): + info = {} + for row in context.soup.select(".common-driver-info li"): + name = row.select_one("span").text.strip() + value = row.select_one("h4").text.strip() + info[name] = value + + detail = {} + for row in context.soup.select(".driver-detail--cta-group a"): + name = row.select_one("p").text.strip() + value = row.select_one("h2").text.strip() + detail[name] = value + + await context.push_data({ + "url": context.request.url, + "name": context.soup.select_one("h1").text.strip(), + "team": detail["Team"], + "nationality": info["Nationality"], + "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(), + "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"), + }) + + await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"]) + await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) + + if __name__ == '__main__': + asyncio.run(main()) + ``` + +
+ +### Use Crawlee to find the ratings of the most popular Netflix films + +The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listing the most popular Netflix films worldwide. Scrape the movie names from this page, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and retrieve the film's rating. Each item you push to Crawlee's default dataset should include the following data: + +- URL of the film's IMDb page +- Title +- Rating + +If you export the dataset as JSON, it should look something like this: + + +```json +[ + { + "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1", + "title": "The Merry Gentlemen", + "rating": "5.0/10" + }, + { + "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1", + "title": "Hot Frosty", + "rating": "5.4/10" + }, + ... +] +``` + +To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this: + +```py +... +from urllib.parse import quote_plus + +async def main(): + ... + + @crawler.router.default_handler + async def handle_netflix_table(context: BeautifulSoupCrawlingContext): + requests = [] + for name_cell in context.soup.select(...): + name = name_cell.text.strip() + imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" + requests.append(Request.from_url(imdb_search_url, label="...")) + await context.add_requests(requests) + + ... +... +``` + +When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue. + +
+ Solution + + ```py + import asyncio + from urllib.parse import quote_plus + + from crawlee import Request + from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext + + async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_netflix_table(context: BeautifulSoupCrawlingContext): + requests = [] + for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"): + name = name_cell.text.strip() + imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" + requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH")) + await context.add_requests(requests) + + @crawler.router.handler("IMDB_SEARCH") + async def handle_imdb_search(context: BeautifulSoupCrawlingContext): + await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1) + + @crawler.router.handler("IMDB") + async def handle_imdb(context: BeautifulSoupCrawlingContext): + rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']" + rating_text = context.soup.select_one(rating_selector).text.strip() + await context.push_data({ + "url": context.request.url, + "title": context.soup.select_one("h1").text.strip(), + "rating": rating_text, + }) + + await crawler.run(["https://www.netflix.com/tudum/top10"]) + await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) + + if __name__ == '__main__': + asyncio.run(main()) + ``` + +
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
new file mode 100644
index 000000000..e1bb36f3f
--- /dev/null
+++ b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
@@ -0,0 +1,435 @@
---
title: Using a scraping platform with Python
sidebar_label: Using a platform
description: Lesson about building a Python application for watching prices. Using the Apify platform to deploy a scraper.
slug: /scraping-basics-javascript2/platform
unlisted: true
---

**In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily. We'll also use the platform's API to retrieve and work with the results.**

---

Before starting with a scraping platform, let's highlight a few caveats in our current setup:

- _User-operated:_ We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. And if we want alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day.
- _No monitoring:_ If we have a spare server or a Raspberry Pi lying around, we could use [cron](https://en.wikipedia.org/wiki/Cron) to schedule it. But even then, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used.
- _Manual data management:_ Tracking prices over time means figuring out how to organize the exported data ourselves. Processing the data could also be tricky since different analysis tools often require different formats.
- _Anti-scraping risks:_ If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually, they'd block that too—risking seriously annoying our barista.

In this lesson, we'll use a platform to address all of these issues. Generic cloud platforms like [GitHub Actions](https://github.com/features/actions) can work for simple scenarios. But platforms dedicated to scraping, like [Apify](https://apify.com/), offer extra features such as monitoring scrapers, managing retrieved data, and overcoming anti-scraping measures.

:::info Why Apify

Scraping platforms come in many varieties, offering a wide range of tools and approaches. As the course authors, we're obviously biased toward Apify—we think it's both powerful and complete.

That said, the main goal of this lesson is to show how deploying to _any platform_ can make life easier. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing).

:::

## Registering

First, let's [create a new Apify account](https://console.apify.com/sign-up). We'll go through a few checks to confirm we're human and our email is valid—annoying but necessary to prevent abuse of the platform.

Apify serves both as infrastructure where we can privately deploy and run our own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But let's hold off on exploring the Apify Store for now.

## Getting access from the command line

To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh); on other systems, we'll first need [Node.js](https://nodejs.org/en/download).
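
In practice, the installation itself is usually a single command. With Homebrew it should look something like this:

```text
$ brew install apify-cli
```

And with Node.js available, something like this:

```text
$ npm install -g apify-cli
```

Either way, the installation guide linked in the next paragraph is the source of truth.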
+ +After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version: + +```text +$ apify --version +apify-cli/0.0.0 system-arch00 node-v0.0.0 +``` + +Now let's connect the CLI with the cloud platform using our account from previous step: + +```text +$ apify login +... +Success: You are logged in to Apify as user1234! +``` + +## Starting a real-world project + +Until now, we've kept our scrapers simple, each with just a single Python module like `main.py`, and we've added dependencies only by installing them with `pip` inside a virtual environment. + +If we sent our code to a friend, they wouldn't know what to install to avoid import errors. The same goes for deploying to a cloud platform. + +To share our project, we need to package it. The best way is following the official [Python Packaging User Guide](https://packaging.python.org/), but for this course, we'll take a shortcut with the Apify CLI. + +In our terminal, let's change to a directory where we usually start new projects. Then, we'll run the following command: + +```text +apify create warehouse-watchdog --template=python-crawlee-beautifulsoup +``` + +It will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files: + +```text +Info: Python version 0.0.0 detected. +Info: Creating a virtual environment in ... +... +Success: Actor 'warehouse-watchdog' was created. To run it, run "cd warehouse-watchdog" and "apify run". +Info: To run your code in the cloud, run "apify push" and deploy your code to Apify Console. +Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory. +``` + +## Adjusting the template + +Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, including `main.py`. This is a sample Beautiful Soup scraper provided by the template. + +The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework. + +Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://apify.com/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code. + +![The expected file structure](./images/actor-file-structure.webp) + +We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. 
Then, we'll fill this file with final, unchanged code from the previous lesson: + +```py title=warehouse-watchdog/src/crawler.py +import asyncio +from decimal import Decimal +from crawlee.crawlers import BeautifulSoupCrawler + +async def main(): + crawler = BeautifulSoupCrawler() + + @crawler.router.default_handler + async def handle_listing(context): + context.log.info("Looking for product detail pages") + await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") + + @crawler.router.handler("DETAIL") + async def handle_detail(context): + context.log.info(f"Product detail page: {context.request.url}") + price_text = ( + context.soup + .select_one(".product-form__info-content .price") + .contents[-1] + .strip() + .replace("$", "") + .replace(",", "") + ) + item = { + "url": context.request.url, + "title": context.soup.select_one(".product-meta__title").text.strip(), + "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), + "price": Decimal(price_text), + "variant_name": None, + } + if variants := context.soup.select(".product-form__option.no-js option"): + for variant in variants: + context.log.info("Saving a product variant") + await context.push_data(item | parse_variant(variant)) + else: + context.log.info("Saving a product") + await context.push_data(item) + + await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) + + crawler.log.info("Exporting data") + await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) + await crawler.export_data_csv(path='dataset.csv') + +def parse_variant(variant): + text = variant.text.strip() + name, price_text = text.split(" - ") + price = Decimal( + price_text + .replace("$", "") + .replace(",", "") + ) + return {"variant_name": name, "price": price} + +if __name__ == '__main__': + asyncio.run(main()) +``` + +Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this: + +```py title=warehouse-watchdog/src/main.py +from apify import Actor +from .crawler import main as crawl + +async def main(): + async with Actor: + await crawl() +``` + +We import our scraper as a function and await the result inside the Actor block. Unlike the sample scraper, the one we made in the previous lesson doesn't expect any input data, so we can omit the code that handles that part. + +Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud: + +```text +$ apify run +Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src +[apify] INFO Initializing Actor... 
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
[BeautifulSoupCrawler] INFO Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.016736 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[BeautifulSoupCrawler] INFO Looking for product detail pages
[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
[BeautifulSoupCrawler] INFO Saving a product variant
[BeautifulSoupCrawler] INFO Saving a product variant
...
```

## Updating the Actor configuration

The Actor configuration from the template tells the platform to expect input, so we need to update that before running our scraper in the cloud.

Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with",
            "prefill": [
                { "url": "https://apify.com" }
            ],
            "editor": "requestListSources"
        }
    },
    "required": ["start_urls"]
}
```

:::tip Hidden dot files

On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it.

:::

We'll remove the expected properties and the list of required ones. After our changes, the file should look like this:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {}
}
```

:::danger Trailing commas in JSON

Make sure there's no trailing comma after `{}`, or the file won't be valid JSON.

:::

## Deploying the scraper

Now we can proceed to deployment:

```text
$ apify push
Info: Created Actor with name warehouse-watchdog on Apify.
Info: Deploying Actor 'warehouse-watchdog' to Apify.
Run: Updated version 0.0 for Actor warehouse-watchdog.
Run: Building Actor warehouse-watchdog
...
Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.0.1
? Do you want to open the Actor detail in your browser? (Y/n)
```

After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud.

![Actor's detail page, screen Source, tab Input](./images/actor-input.webp)

When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON.
We can even export the data to formats like CSV, XML, Excel, RSS, and more.

![Actor's detail page, screen Source, tab Output](./images/actor-output.webp)

:::info Accessing data

We don't need to click buttons to download the data. We can also retrieve it using Apify's API, the `apify datasets` CLI command, or the Python SDK. Learn more in the [Dataset docs](https://docs.apify.com/platform/storage/dataset).

:::

## Running the scraper periodically

Now that our scraper is deployed, let's automate its execution. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Let's click **Create new**, review the periodicity (default: daily), and specify the Actor to run. Then we'll click **Enable**—that's it!

From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts.

![Schedule detail page](./images/actor-schedule.webp)

## Adding support for proxies

If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can [configure proxies](https://docs.apify.com/platform/proxy) so our requests come from different locations, reducing the chances of detection and blocking.

Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
from .crawler import main as crawl

async def main():
    async with Actor:
        input_data = await Actor.get_input()

        if actor_proxy_input := input_data.get("proxyConfig"):
            proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input)
        else:
            proxy_config = None

        await crawl(proxy_config)
```

Next, we'll add `proxy_config` as an optional parameter in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, we only need to pass it to `BeautifulSoupCrawler()`, and the class will handle the rest:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
from decimal import Decimal
from crawlee.crawlers import BeautifulSoupCrawler

# highlight-next-line
async def main(proxy_config = None):
    # highlight-next-line
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
    # highlight-next-line
    crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}")

    @crawler.router.default_handler
    async def handle_listing(context):
        context.log.info("Looking for product detail pages")
        await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

    ...
```

Finally, we'll modify the Actor configuration in `warehouse-watchdog/.actor/input_schema.json` to include the `proxyConfig` input parameter:

```json title=warehouse-watchdog/.actor/input_schema.json
{
    "title": "Python Crawlee BeautifulSoup Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "proxyConfig": {
            "title": "Proxy config",
            "description": "Proxy configuration",
            "type": "object",
            "editor": "proxy",
            "prefill": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            },
            "default": {
                "useApifyProxy": true,
                "apifyProxyGroups": []
            }
        }
    }
}
```

To verify everything works, we'll run the scraper locally.
We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run: + +```text +$ apify run --purge +Info: All default local stores were purged. +Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src +[apify] INFO Initializing Actor... +[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) +[BeautifulSoupCrawler] INFO Using proxy: no +[BeautifulSoupCrawler] INFO Current request statistics: +┌───────────────────────────────┬──────────┐ +│ requests_finished │ 0 │ +│ requests_failed │ 0 │ +│ retry_histogram │ [0] │ +│ request_avg_failed_duration │ None │ +│ request_avg_finished_duration │ None │ +│ requests_finished_per_minute │ 0 │ +│ requests_failed_per_minute │ 0 │ +│ request_total_duration │ 0.0 │ +│ requests_total │ 0 │ +│ crawler_runtime │ 0.014976 │ +└───────────────────────────────┴──────────┘ +[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 +[BeautifulSoupCrawler] INFO Looking for product detail pages +[BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker +[BeautifulSoupCrawler] INFO Saving a product variant +[BeautifulSoupCrawler] INFO Saving a product variant +... +``` + +In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. All requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`: + +```text +$ apify push +Info: Deploying Actor 'warehouse-watchdog' to Apify. +Run: Updated version 0.0 for Actor warehouse-watchdog. +Run: Building Actor warehouse-watchdog +(timestamp) ACTOR: Found input schema referenced from .actor/actor.json +... +? Do you want to open the Actor detail in your browser? (Y/n) +``` + +Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. + +![Actor's detail page, screen Source, tab Input with proxies](./images/actor-input-proxies.webp) + +We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform: + +```text +(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository. +(timestamp) ACTOR: Creating Docker container. +(timestamp) ACTOR: Starting Docker container. +(timestamp) [apify] INFO Initializing Actor... 
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"})
(timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes
(timestamp) [BeautifulSoupCrawler] INFO Current request statistics:
(timestamp) ┌───────────────────────────────┬──────────┐
(timestamp) │ requests_finished             │ 0        │
(timestamp) │ requests_failed               │ 0        │
(timestamp) │ retry_histogram               │ [0]      │
(timestamp) │ request_avg_failed_duration   │ None     │
(timestamp) │ request_avg_finished_duration │ None     │
(timestamp) │ requests_finished_per_minute  │ 0        │
(timestamp) │ requests_failed_per_minute    │ 0        │
(timestamp) │ request_total_duration        │ 0.0      │
(timestamp) │ requests_total                │ 0        │
(timestamp) │ crawler_runtime               │ 0.036449 │
(timestamp) └───────────────────────────────┴──────────┘
(timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
(timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client
(timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages
(timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker
(timestamp) [BeautifulSoupCrawler] INFO Saving a product variant
...
```

## Congratulations!

We've reached the end of the course—congratulations! Together, we've built a program that:

- Crawls a shop and extracts product and pricing data.
- Exports the results in several formats.
- Uses concise code, thanks to a scraping framework.
- Runs on a cloud platform with monitoring and alerts.
- Executes periodically without manual intervention, collecting data over time.
- Uses proxies to avoid being blocked.

We hope this serves as a solid foundation for your next scraping project. Perhaps you'll even [start publishing scrapers](https://docs.apify.com/platform/actors/publishing) for others to use—for a fee?

diff --git a/sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx b/sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx
new file mode 100644
index 000000000..ba254f402
--- /dev/null
+++ b/sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx
@@ -0,0 +1,10 @@

## Exercises

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

:::caution Real world

You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).
+ +::: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/images b/sources/academy/webscraping/scraping_basics_javascript2/images new file mode 120000 index 000000000..535a050e4 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/images @@ -0,0 +1 @@ +../scraping_basics_python/images \ No newline at end of file diff --git a/sources/academy/webscraping/scraping_basics_javascript2/index.md b/sources/academy/webscraping/scraping_basics_javascript2/index.md new file mode 100644 index 000000000..03c7dde99 --- /dev/null +++ b/sources/academy/webscraping/scraping_basics_javascript2/index.md @@ -0,0 +1,69 @@ +--- +title: Web scraping basics for JavaScript devs +description: Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics. +sidebar_position: 1.5 +category: web scraping & automation +slug: /scraping-basics-javascript2 +unlisted: true +--- + +import DocCardList from '@theme/DocCardList'; + +**Learn how to use Python to extract information from websites in this practical course, starting from the absolute basics.** + +--- + +In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such program would be useful for seeing trends in price changes, detecting discounts, etc. + +![E-commerce listing on the left, JSON with data on the right](./images/scraping.webp) + +## What we'll do + +- Inspect pages using browser DevTools. +- Download web pages using the HTTPX library. +- Extract data from web pages using the Beautiful Soup library. +- Save extracted data in various formats, e.g. CSV which MS Excel or Google Sheets can open. +- Follow links programmatically (crawling). +- Save time and effort with frameworks, such as Crawlee, and scraping platforms, such as Apify. + +## Who this course is for + +Anyone with basic knowledge of developing programs in Python who wants to start with web scraping can take this course. The course does not expect you to have any prior knowledge of web technologies or scraping. + +## Requirements + +- A macOS, Linux, or Windows machine with a web browser and Python installed. +- Familiarity with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions. +- Comfort with importing from the Python standard library, using virtual environments, and installing dependencies with `pip`. +- Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows). + +## You may want to know + +Let's explore the key reasons to take this course. What is web scraping good for, and what career opportunities does it enable for you? + +### Why learn scraping + +The internet is full of useful data, but most of it isn't offered in a structured way that's easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them. + +Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. 
Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more. + +### Why build your own scrapers + +Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision. + +### Why become a scraper dev + +As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping: + +- Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog. +- Companies can hire you to build custom scrapers mining data important for their business. +- Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place. +- You can publish your scrapers on platforms like the [Apify Store](https://apify.com/store) and earn money by renting them out to others. + +### Why learn with Apify + +We are [Apify](https://apify.com), a web scraping and automation platform. We do our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how a scraping platform can simplify your life, but that lesson is optional and designed to fit within our [free tier](https://apify.com/pricing). + +## Course content + +