|
153 | 153 | "id": "3f6afdfd",
|
154 | 154 | "metadata": {},
|
155 | 155 | "source": [
|
156 |
| - "### Reduce pandas.DataFrame’s Memory" |
| 156 | + "### Smart Data Type Selection for Memory-Efficient Pandas" |
157 | 157 | ]
|
158 | 158 | },
|
159 | 159 | {
|
160 | 160 | "cell_type": "markdown",
|
161 | 161 | "id": "a71e2b90",
|
162 | 162 | "metadata": {},
|
163 | 163 | "source": [
|
164 |
| - "If you want to reduce the memory of your pandas DataFrame, start with changing the data type of a column. If your categorical variable has low cardinality, change the data type to category like below." |
| 164 | + "To reduce the memory usage of a Pandas DataFrame, you can start by changing the data type of a column. " |
165 | 165 | ]
|
166 | 166 | },
|
167 | 167 | {
|
168 | 168 | "cell_type": "code",
|
169 |
| - "execution_count": 7, |
| 169 | + "execution_count": 42, |
170 | 170 | "id": "bb5a60a6",
|
171 | 171 | "metadata": {
|
172 | 172 | "ExecuteTime": {
|
173 | 173 | "end_time": "2021-11-18T14:28:51.041729Z",
|
174 | 174 | "start_time": "2021-11-18T14:28:51.013221Z"
|
175 | 175 | }
|
176 | 176 | },
|
| 177 | + "outputs": [ |
| 178 | + { |
| 179 | + "name": "stdout", |
| 180 | + "output_type": "stream", |
| 181 | + "text": [ |
| 182 | + "<class 'pandas.core.frame.DataFrame'>\n", |
| 183 | + "RangeIndex: 150 entries, 0 to 149\n", |
| 184 | + "Data columns (total 5 columns):\n", |
| 185 | + " # Column Non-Null Count Dtype \n", |
| 186 | + "--- ------ -------------- ----- \n", |
| 187 | + " 0 sepal length (cm) 150 non-null float64\n", |
| 188 | + " 1 sepal width (cm) 150 non-null float64\n", |
| 189 | + " 2 petal length (cm) 150 non-null float64\n", |
| 190 | + " 3 petal width (cm) 150 non-null float64\n", |
| 191 | + " 4 target 150 non-null int64 \n", |
| 192 | + "dtypes: float64(4), int64(1)\n", |
| 193 | + "memory usage: 6.0 KB\n" |
| 194 | + ] |
| 195 | + } |
| 196 | + ], |
| 197 | + "source": [ |
| 198 | + "from sklearn.datasets import load_iris\n", |
| 199 | + "import pandas as pd \n", |
| 200 | + "\n", |
| 201 | + "X, y = load_iris(as_frame=True, return_X_y=True)\n", |
| 202 | + "df = pd.concat([X, pd.DataFrame(y, columns=[\"target\"])], axis=1)\n", |
| 203 | + "df.info()" |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "markdown", |
| 208 | + "id": "e7a44662-d675-46d1-b860-ce65fec1aeab", |
| 209 | + "metadata": {}, |
| 210 | + "source": [ |
| 211 | + "By default, pandas stores floating-point numbers as `float64`, which is often more range and precision than a column needs. Smaller float types trade precision for memory:\n", |
| 212 | + "\n", |
| 213 | + "- **float16**: 2 bytes per value, about 3 significant decimal digits, maximum magnitude 65504.\n", |
| 214 | + "- **float32**: 4 bytes per value, about 7 significant decimal digits, maximum magnitude about 3.4e38.\n", |
| 215 | + "- **float64**: The default; 8 bytes per value, about 15 to 16 significant decimal digits.\n", |
| 216 | + "\n", |
| 217 | + "For example, the values in the \"sepal length (cm)\" column are small and recorded to one decimal place, so `float16` can hold them with only minor rounding while reducing memory usage." |
| 218 | + ] |
| 219 | + }, |
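The limits listed above can be checked directly with NumPy. The following is a minimal sketch, not part of the original notebook; the variable names `limits` and `rounded` are illustrative assumptions.

```python
import numpy as np

# Collect the largest representable value and the approximate number of
# reliable decimal digits for each float type.
limits = {}
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    limits[float_type.__name__] = (float(info.max), info.precision)
    print(f"{float_type.__name__}: max={info.max}, decimal digits={info.precision}")

# float16 cannot store 5.1 exactly, so the stored value is rounded slightly.
rounded = float(np.float16(5.1))
print(f"5.1 stored as float16: {rounded}")
```

The rounding is small for measurements like these, but it accumulates in sums and means, so it is worth confirming that downstream calculations tolerate it before converting.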
| 220 | + { |
| 221 | + "cell_type": "code", |
| 222 | + "execution_count": 44, |
| 223 | + "id": "dbfae785-f316-4e8d-b428-81e05a8da1dc", |
| 224 | + "metadata": {}, |
177 | 225 | "outputs": [
|
178 | 226 | {
|
179 | 227 | "data": {
|
180 | 228 | "text/plain": [
|
181 |
| - "Index 128\n", |
182 |
| - "sepal length (cm) 1200\n", |
183 |
| - "sepal width (cm) 1200\n", |
184 |
| - "petal length (cm) 1200\n", |
185 |
| - "petal width (cm) 1200\n", |
186 |
| - "target 1200\n", |
187 |
| - "dtype: int64" |
| 229 | + "7.9" |
188 | 230 | ]
|
189 | 231 | },
|
190 |
| - "execution_count": 7, |
| 232 | + "execution_count": 44, |
191 | 233 | "metadata": {},
|
192 | 234 | "output_type": "execute_result"
|
193 | 235 | }
|
194 | 236 | ],
|
195 | 237 | "source": [
|
196 |
| - "from sklearn.datasets import load_iris\n", |
197 |
| - "import pandas as pd \n", |
| 238 | + "df['sepal length (cm)'].max()" |
| 239 | + ] |
| 240 | + }, |
| 241 | + { |
| 242 | + "cell_type": "code", |
| 243 | + "execution_count": 45, |
| 244 | + "id": "a12334bd-2c33-45e8-9979-91f16c45df06", |
| 245 | + "metadata": {}, |
| 246 | + "outputs": [ |
| 247 | + { |
| 248 | + "data": { |
| 249 | + "text/plain": [ |
| 250 | + "1332" |
| 251 | + ] |
| 252 | + }, |
| 253 | + "execution_count": 45, |
| 254 | + "metadata": {}, |
| 255 | + "output_type": "execute_result" |
| 256 | + } |
| 257 | + ], |
| 258 | + "source": [ |
| 259 | + "df['sepal length (cm)'].memory_usage()" |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "cell_type": "code", |
| 264 | + "execution_count": 46, |
| 265 | + "id": "1221e9cc-fed2-4f75-8698-9b09b89d4c0e", |
| 266 | + "metadata": {}, |
| 267 | + "outputs": [ |
| 268 | + { |
| 269 | + "data": { |
| 270 | + "text/plain": [ |
| 271 | + "432" |
| 272 | + ] |
| 273 | + }, |
| 274 | + "execution_count": 46, |
| 275 | + "metadata": {}, |
| 276 | + "output_type": "execute_result" |
| 277 | + } |
| 278 | + ], |
| 279 | + "source": [ |
| 280 | + "df[\"sepal length (cm)\"] = df[\"sepal length (cm)\"].astype(\"float16\")\n", |
| 281 | + "df['sepal length (cm)'].memory_usage()" |
| 282 | + ] |
| 283 | + }, |
| 284 | + { |
| 285 | + "cell_type": "markdown", |
| 286 | + "id": "fcdcaaed-a7b4-484e-8dd0-bfe766203967", |
| 287 | + "metadata": {}, |
| 288 | + "source": [ |
| 289 | + "Here, the memory usage of the \"sepal length (cm)\" column decreased from 1332 bytes to 432 bytes, a reduction of approximately 67.6%." |
| 290 | + ] |
| 291 | + }, |
| 292 | + { |
| 293 | + "cell_type": "markdown", |
| 294 | + "id": "c1d1f261-0b1a-4dd9-a3d9-fc6d742f5847", |
| 295 | + "metadata": {}, |
| 296 | + "source": [ |
| 297 | + "If you have a categorical variable with low cardinality, you can change its data type to `category` to reduce memory usage.\n", |
198 | 298 | "\n",
|
199 |
| - "X, y = load_iris(as_frame=True, return_X_y=True)\n", |
200 |
| - "df = pd.concat([X, pd.DataFrame(y, columns=[\"target\"])], axis=1)\n", |
201 |
| - "df.memory_usage()" |
| 299 | + "The \"target\" column has only 3 unique values, making it a good candidate for the category data type to save memory." |
| 300 | + ] |
| 301 | + }, |
| 302 | + { |
| 303 | + "cell_type": "code", |
| 304 | + "execution_count": 48, |
| 305 | + "id": "6b1769e9-61f4-4a1d-a0d2-ffc30567c722", |
| 306 | + "metadata": {}, |
| 307 | + "outputs": [ |
| 308 | + { |
| 309 | + "data": { |
| 310 | + "text/plain": [ |
| 311 | + "3" |
| 312 | + ] |
| 313 | + }, |
| 314 | + "execution_count": 48, |
| 315 | + "metadata": {}, |
| 316 | + "output_type": "execute_result" |
| 317 | + } |
| 318 | + ], |
| 319 | + "source": [ |
| 320 | + "# Count the unique values in the target column\n", |
| 321 | + "df['target'].nunique()" |
| 322 | + ] |
| 323 | + }, |
| 324 | + { |
| 325 | + "cell_type": "code", |
| 326 | + "execution_count": 30, |
| 327 | + "id": "d236a672-3485-4503-a7d6-849c2fc6dfed", |
| 328 | + "metadata": {}, |
| 329 | + "outputs": [ |
| 330 | + { |
| 331 | + "data": { |
| 332 | + "text/plain": [ |
| 333 | + "1332" |
| 334 | + ] |
| 335 | + }, |
| 336 | + "execution_count": 30, |
| 337 | + "metadata": {}, |
| 338 | + "output_type": "execute_result" |
| 339 | + } |
| 340 | + ], |
| 341 | + "source": [ |
| 342 | + "df['target'].memory_usage()" |
202 | 343 | ]
|
203 | 344 | },
|
204 | 345 | {
|
205 | 346 | "cell_type": "code",
|
206 |
| - "execution_count": 5, |
| 347 | + "execution_count": 38, |
207 | 348 | "id": "a770da2a",
|
208 | 349 | "metadata": {
|
209 | 350 | "ExecuteTime": {
|
|
215 | 356 | {
|
216 | 357 | "data": {
|
217 | 358 | "text/plain": [
|
218 |
| - "Index 128\n", |
219 |
| - "sepal length (cm) 1200\n", |
220 |
| - "sepal width (cm) 1200\n", |
221 |
| - "petal length (cm) 1200\n", |
222 |
| - "petal width (cm) 1200\n", |
223 |
| - "target 282\n", |
224 |
| - "dtype: int64" |
| 359 | + "414" |
225 | 360 | ]
|
226 | 361 | },
|
227 |
| - "execution_count": 5, |
| 362 | + "execution_count": 38, |
228 | 363 | "metadata": {},
|
229 | 364 | "output_type": "execute_result"
|
230 | 365 | }
|
231 | 366 | ],
|
232 | 367 | "source": [
|
233 | 368 | "df[\"target\"] = df[\"target\"].astype(\"category\")\n",
|
234 |
| - "df.memory_usage()" |
| 369 | + "df['target'].memory_usage()" |
235 | 370 | ]
|
236 | 371 | },
|
237 | 372 | {
|
238 | 373 | "cell_type": "markdown",
|
239 | 374 | "id": "2f78876c",
|
240 | 375 | "metadata": {},
|
241 | 376 | "source": [
|
242 |
| - "The memory is now is reduced to almost a fifth of what it was!" |
| 377 | + "Here, the memory usage of the \"target\" column decreased from 1332 bytes to 414 bytes, a reduction of approximately 68.9%." |
| 378 | + ] |
| 379 | + }, |
| 380 | + { |
| 381 | + "cell_type": "markdown", |
| 382 | + "id": "d416217a-75f0-4ba3-be38-65a1386fc288", |
| 383 | + "metadata": {}, |
| 384 | + "source": [ |
| 385 | + "If we apply the `float16` conversion to the remaining float columns, the memory usage of the DataFrame decreases from 6.0 KB to 1.6 KB, a reduction of approximately 73.3%." |
| 386 | + ] |
| 387 | + }, |
| 388 | + { |
| 389 | + "cell_type": "code", |
| 390 | + "execution_count": 32, |
| 391 | + "id": "95737307-6680-4dfe-a0aa-bf1629e981d8", |
| 392 | + "metadata": {}, |
| 393 | + "outputs": [ |
| 394 | + { |
| 395 | + "name": "stdout", |
| 396 | + "output_type": "stream", |
| 397 | + "text": [ |
| 398 | + "<class 'pandas.core.frame.DataFrame'>\n", |
| 399 | + "RangeIndex: 150 entries, 0 to 149\n", |
| 400 | + "Data columns (total 5 columns):\n", |
| 401 | + " # Column Non-Null Count Dtype \n", |
| 402 | + "--- ------ -------------- ----- \n", |
| 403 | + " 0 sepal length (cm) 150 non-null float16 \n", |
| 404 | + " 1 sepal width (cm) 150 non-null float16 \n", |
| 405 | + " 2 petal length (cm) 150 non-null float16 \n", |
| 406 | + " 3 petal width (cm) 150 non-null float16 \n", |
| 407 | + " 4 target 150 non-null category\n", |
| 408 | + "dtypes: category(1), float16(4)\n", |
| 409 | + "memory usage: 1.6 KB\n" |
| 410 | + ] |
| 411 | + } |
| 412 | + ], |
| 413 | + "source": [ |
| 414 | + "# Convert every remaining float64 column to float16\n", |
| 415 | + "float_cols = df.select_dtypes(include=['float64']).columns\n", |
| 416 | + "df[float_cols] = df[float_cols].astype('float16')\n", |
| 416 | + "df.info()" |
243 | 417 | ]
|
244 | 418 | },
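When the rounding of `float16` is a concern, a more conservative option is to let pandas choose the type. The sketch below is not from the original notebook: it uses `pd.to_numeric` with `downcast="float"`, which goes no smaller than `float32`, on a small stand-in DataFrame (`df_demo` and its column names are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself works on the iris DataFrame.
df_demo = pd.DataFrame({"length": [5.1, 4.9, 7.9], "width": [3.5, 3.0, 3.8]})

# Memory of the column data only, before downcasting (float64 columns).
before = df_demo.memory_usage(index=False).sum()

# Downcast each column to the smallest float type pandas considers safe.
# With downcast="float", the smallest type used is float32.
downcast = df_demo.apply(pd.to_numeric, downcast="float")

after = downcast.memory_usage(index=False).sum()
print(downcast.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

This saves less memory than `float16` but keeps about 7 significant digits, which is usually a safer default for numeric analysis.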
|
245 | 419 | {
|
|