fix: re-order JSON and CSV in Python lessons #1658

Merged · 4 commits · Jun 30, 2025
@@ -35,8 +35,8 @@ Over the course of the previous lessons, the code of our program grew to almost
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
import csv

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
@@ -153,8 +153,8 @@ Now let's put it all together:
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
import csv

def download(url):
    response = httpx.get(url)
@@ -279,8 +279,8 @@ Browsers reading the HTML know the base address and automatically resolve such l
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
import csv
# highlight-next-line
from urllib.parse import urljoin
```
@@ -20,8 +20,8 @@ Thanks to the refactoring, we have functions ready for each of the tasks, so we
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
import csv
from urllib.parse import urljoin

def download(url):
@@ -193,8 +193,8 @@ Now, if we use our new function, we should finally get a program that can scrape
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
import json
import csv
from urllib.parse import urljoin

def download(url):
154 changes: 81 additions & 73 deletions sources/academy/webscraping/scraping_basics_python/08_saving_data.md
@@ -78,83 +78,28 @@ If you find the complex data structures printed by `print()` difficult to read,

:::

## Saving data as CSV

The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.

In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:

```py
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```

We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:

```csv title=data.csv
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```

In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this.

When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have.

![CSV example preview](images/csv-example.png)

Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
# highlight-next-line
import csv
```

Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following:

```py
with open("products.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products.

![CSV preview](images/csv.png)

## Saving data as JSON

The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries.

In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports:
In Python, we can read and write JSON using the [`json`](https://docs.python.org/3/library/json.html) standard library module. We'll begin with imports:

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
# highlight-next-line
import json
```

Next, let’s append one more export to end of the source code of our scraper:
Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `print(data)` with the following:

```py
with open("products.json", "w") as file:
    json.dump(data, file)
```

Thats it! If we run the program now, it should also create a `products.json` file in the current working directory:
That's it! If we run the program now, it should also create a `products.json` file in the current working directory:

```text
$ python main.py
@@ -176,7 +121,7 @@ with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)
```
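To see what the `default=` parameter does in isolation, here's a minimal REPL-style sketch using the same `serialize()` function on a single made-up record:

```python
import json
from decimal import Decimal

def serialize(obj):
    # json.dump() calls this for any value it can't encode natively
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

# Decimal isn't JSON-serializable by itself, so the encoder falls back
# to serialize(), which turns it into a string
print(json.dumps({"min_price": Decimal("74.95")}, default=serialize))
# → {"min_price": "74.95"}
```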

Now the program should work as expected, producing a JSON file with the following content:
If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:

<!-- eslint-skip -->
```json title=products.json
@@ -197,30 +142,76 @@ Also, if your data contains non-English characters, set `ensure_ascii=False`. By

:::
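To illustrate what the escaping looks like, here's a quick sketch (the product name is made up):

```python
import json

# By default, the json module escapes all non-ASCII characters as \uXXXX sequences
print(json.dumps({"title": "Čokoláda"}))
# → {"title": "\u010cokol\u00e1da"}

# With ensure_ascii=False, the characters are written as-is
print(json.dumps({"title": "Čokoláda"}, ensure_ascii=False))
# → {"title": "Čokoláda"}
```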

We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
## Saving data as CSV

---
The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.

## Exercises
In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:

In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.
```py
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```

### Process your CSV
We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:

Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500.
```csv title=data.csv
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```

<details>
<summary>Solution</summary>
In the CSV format, if a value contains commas, we should enclose it in quotes. When we open the file in a text editor of our choice, we can see that the writer automatically handled this.

Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.

1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
2. Select the header row. Go to **Data > Create filter**.
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
![CSV example preview](images/csv-example.png)

![CSV in Google Sheets](images/csv-sheets.png)
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:

</details>
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
# highlight-next-line
import csv
```

Next, let's add one more data export to the end of the source code of our scraper:

```py
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)

with open("products.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

The program should now also produce a CSV file with the following content:

![CSV preview](images/csv.png)
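If we later want to check the file from Python rather than a spreadsheet, the same `csv` module can read the data back with `csv.DictReader`. This is an illustrative sketch only: it parses an in-memory string shaped like our export (the sample row is made up), so it runs without any files, and it shows that the quoted title containing a comma round-trips correctly:

```python
import csv
import io

# A string shaped like our export; io.StringIO makes it behave like an open file
csv_text = 'title,min_price,price\n"Sony SACS9 Active Subwoofer, 10-inch",158.00,158.00\n'

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# DictReader pairs each row with the header, but all values come back as strings
print(rows[0]["title"])      # → Sony SACS9 Active Subwoofer, 10-inch
print(rows[0]["min_price"])  # → 158.00
```

Note that CSV has no notion of types: if we need numbers again, we have to convert the strings back ourselves, e.g. with `Decimal`.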

We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.

---

## Exercises

In this lesson, we created export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.

### Process your JSON

@@ -243,3 +234,20 @@ Write a new Python program that reads `products.json`, finds all products with a
```

</details>

### Process your CSV

Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.

<details>
<summary>Solution</summary>

Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:

1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
2. Select the header row. Go to **Data > Create filter**.
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.

![CSV in Google Sheets](images/csv-sheets.png)

</details>