
Commit 1842da8

fix: Replace deprecated dataset in tutorials and docs (#2462)
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
1 parent 67d40c1 commit 1842da8

File tree

4 files changed: +215 -775 lines changed

docs/source/scale.rst

Lines changed: 7 additions & 9 deletions
@@ -37,18 +37,16 @@ In distributed mode, the same ``awswrangler`` APIs can now handle much larger da

 .. code-block:: python

-    # Read Parquet data (1.2 Gb Parquet compressed)
-    df = wr.s3.read_parquet(
-        path=f"s3://amazon-reviews-pds/parquet/product_category=Toys/",
-    )
+    # Read 1.6 Gb Parquet data
+    df = wr.s3.read_parquet(path="s3://ursa-labs-taxi-data/2017/")

-    # Drop the customer_id column
-    df.drop("customer_id", axis=1, inplace=True)
+    # Drop vendor_id column
+    df.drop("vendor_id", axis=1, inplace=True)

-    # Filter reviews with 5-star rating
-    df5 = df[df["star_rating"] == 5]
+    # Filter trips over 1 mile
+    df1 = df[df["trip_distance"] > 1]

-In the example above, Amazon product data is read from Amazon S3 into a distributed `Modin data frame <https://modin.readthedocs.io/en/stable/getting_started/why_modin/pandas.html>`_.
+In the example above, New York City Taxi data is read from Amazon S3 into a distributed `Modin data frame <https://modin.readthedocs.io/en/stable/getting_started/why_modin/pandas.html>`_.

 Modin is a drop-in replacement for Pandas. It exposes the same APIs but enables you to use all of the cores on your machine, or all of the workers in an entire cluster, leading to improved performance and scale.

 To use it, make sure to replace your pandas import statement with modin:
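The updated example's three steps (read, drop, filter) can be tried with plain pandas on a toy frame. This is only a sketch: the real snippet reads s3://ursa-labs-taxi-data/2017/ through awswrangler into a Modin frame, while the rows below are invented stand-ins.

```python
import pandas as pd

# Toy stand-in for the NYC Taxi data (invented values; real code would
# use wr.s3.read_parquet on the S3 prefix above).
df = pd.DataFrame(
    {
        "vendor_id": [1, 2, 2],
        "trip_distance": [0.5, 1.8, 3.2],
        "fare_amount": [4.0, 9.5, 14.0],
    }
)

# Drop vendor_id column (same call as in the updated docs)
df.drop("vendor_id", axis=1, inplace=True)

# Filter trips over 1 mile (same predicate as in the updated docs)
df1 = df[df["trip_distance"] > 1]
```

Because Modin mirrors the pandas API, swapping `import pandas as pd` for `import modin.pandas as pd` leaves these calls unchanged.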

tests/glue_scripts/wrangler_blog_simple.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 # Drop vendor_id column
 df.drop("vendor_id", axis=1, inplace=True)

-# Filter trips with 1 passenger
+# Filter trips over 1 mile
 df1 = df[df["trip_distance"] > 1]

 # Write partitioned trips to S3 in Parquet format
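The script's final step writes partitioned trips to S3 in Parquet format; with awswrangler's `partition_cols` option, objects land under Hive-style `key=value/` prefixes. A minimal stdlib sketch of that layout follows (no S3 or Parquet involved; the `payment_type` partition column and the sample rows are assumptions, not taken from the script).

```python
from collections import defaultdict

# Hypothetical trip records; in the real script these come from the
# filtered dataframe df1.
trips = [
    {"payment_type": 1, "trip_distance": 3.2},
    {"payment_type": 2, "trip_distance": 1.8},
    {"payment_type": 1, "trip_distance": 5.0},
]

# Group rows by partition value, mimicking partition_cols=["payment_type"]:
# each key is the prefix a Parquet file would be written under.
partitions = defaultdict(list)
for trip in trips:
    prefix = f"payment_type={trip['payment_type']}/"
    # Partition columns are encoded in the path, not stored in the file.
    partitions[prefix].append(
        {k: v for k, v in trip.items() if k != "payment_type"}
    )
```

Partition pruning then lets later reads touch only the prefixes a query needs.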

tutorials/029 - S3 Select.ipynb

Lines changed: 52 additions & 167 deletions
@@ -2,21 +2,33 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "# 29 - S3 Select"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   },
    "source": [
     "AWS SDK for pandas supports [Amazon S3 Select](https://aws.amazon.com/blogs/aws/s3-glacier-select/), enabling applications to use SQL statements in order to query and filter the contents of a single S3 object. It works on objects stored in CSV, JSON or Apache Parquet, including compressed and large files of several TBs.\n",
     "\n",
@@ -32,172 +44,28 @@
  },
  {
   "cell_type": "markdown",
-  "metadata": {},
+  "metadata": {
+   "pycharm": {
+    "name": "#%% md\n"
+   }
+  },
   "source": [
    "## Read multiple Parquet files from an S3 prefix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
-  "metadata": {},
+  "metadata": {
+   "pycharm": {
+    "name": "#%%\n"
+   }
+  },
   "outputs": [
    {
     "data": {
-     [removed output: "text/html" and "text/plain" previews of the old Amazon product reviews dataframe (columns: marketplace, customer_id, review_id, product_id, product_parent, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_headline, review_body, review_date, year; five sample rows of 5-star reviews from 2005 and 2015)]
+     [added output: "text/plain" and "text/html" previews of the new NYC Taxi dataframe (columns: vendor_id, pickup_at, dropoff_at, passenger_count, trip_distance, rate_code_id, store_and_fwd_flag, pickup_location_id, dropoff_location_id, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge; five sample trips from 2019-01-01, each over 30 miles)]
     },
    "execution_count": 1,
    "metadata": {},
@@ -208,25 +76,34 @@
   "import awswrangler as wr\n",
   "\n",
   "df = wr.s3.select_query(\n",
-  "    sql=\"SELECT * FROM s3object s where s.\\\"star_rating\\\" >= 5\",\n",
-  "    path=\"s3://amazon-reviews-pds/parquet/product_category=Gift_Card/\",\n",
+  "    sql=\"SELECT * FROM s3object s where s.\\\"trip_distance\\\" > 30\",\n",
+  "    path=\"s3://ursa-labs-taxi-data/2019/01/\",\n",
   "    input_serialization=\"Parquet\",\n",
   "    input_serialization_params={},\n",
   ")\n",
-  "df.loc[:, df.columns != \"product_title\"].head()"
+  "\n",
+  "df.head()"
  ]
 },
 {
  "cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+  "pycharm": {
+   "name": "#%% md\n"
+  }
+ },
  "source": [
   "## Read full CSV file"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": 5,
- "metadata": {},
+ "metadata": {
+  "pycharm": {
+   "name": "#%%\n"
+  }
+ },
 "outputs": [
  {
   "data": {
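S3 Select pushes the WHERE clause down to S3, so only matching rows leave the object. The predicate itself is ordinary SQL; as a local stand-in (sqlite3 from the standard library, not S3 Select, with invented sample rows), the tutorial's filter behaves like this:

```python
import sqlite3

# In-memory table standing in for one Parquet object on S3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s3object (vendor_id INTEGER, trip_distance REAL)")
conn.executemany(
    "INSERT INTO s3object VALUES (?, ?)",
    [(2, 31.57), (1, 0.9), (2, 33.19)],
)

# Same shape as the tutorial's query: SELECT * ... WHERE trip_distance > 30
long_trips = conn.execute(
    "SELECT * FROM s3object WHERE trip_distance > 30"
).fetchall()
```

The difference in practice is where the filter runs: here every row is already local, whereas S3 Select evaluates the predicate server-side and transfers only the matches.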
@@ -340,15 +217,23 @@
 },
 {
  "cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+  "pycharm": {
+   "name": "#%% md\n"
+  }
+ },
 "source": [
  "## Filter JSON file"
 ]
 },
 {
  "cell_type": "code",
  "execution_count": 3,
- "metadata": {},
+ "metadata": {
+  "pycharm": {
+   "name": "#%%\n"
+  }
+ },
 "outputs": [
 {
  "data": {
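For the JSON case, `select_query` applies the same predicate idea to JSON-serialized objects. A stdlib sketch of filtering JSON Lines records (field names and values are invented for illustration, not taken from the tutorial):

```python
import json

# Hypothetical JSON Lines payload, one record per line, standing in for
# an S3 object that S3 Select would scan.
payload = "\n".join(
    [
        json.dumps({"trip_id": 1, "trip_distance": 0.4}),
        json.dumps({"trip_id": 2, "trip_distance": 12.3}),
    ]
)

# Keep only records matching the predicate, mirroring a WHERE clause.
matches = [
    record
    for line in payload.splitlines()
    for record in [json.loads(line)]
    if record["trip_distance"] > 1
]
```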
@@ -468,4 +353,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
+}
