This repository contains exercises from Codeup's bootcamp.
Numpy is a library for representing and working with large and multi-dimensional arrays. Most other libraries in the data-science ecosystem depend on numpy, making it one of the fundamental data science libraries.
Numpy provides a number of useful tools for scientific programming. Convention is to import it like so:
import numpy as np
Numpy provides capabilities for:
- Indexing
- An Array type that goes beyond built-in lists.
- Create a numpy array by passing a list to the np.array function.
- Make it multi-dimensional by passing a list of lists to np.array
- Vectorized Operations
- Vectorizing operations means that operations are automatically applied to every element in a vector
- Not only are the arithmetic operators vectorized; the same applies to comparison operators.
- Array Creation (several methods)
- np.random.randn; np.zeros; np.ones; np.full; np.arange; np.linspace
- Array Methods
- .min; .max; .mean; .sum; .std (standard deviation)
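The capabilities listed above can be sketched in a few lines. This is a minimal illustration, not a complete tour; the variable names and sample values are made up for the example.

```python
import numpy as np

# Array creation from built-in lists
a = np.array([1, 2, 3, 4])             # one-dimensional
m = np.array([[1, 2, 3], [4, 5, 6]])   # two-dimensional (list of lists)
print(m.shape)                         # (2, 3)

# Vectorized arithmetic: applied to every element, no loop needed
doubled = a * 2                        # array([2, 4, 6, 8])

# Vectorized comparison: produces a boolean array
mask = a > 2                           # array([False, False,  True,  True])

# Boolean indexing: keep only the elements where the mask is True
large = a[mask]                        # array([3, 4])

# Other array creation functions
zeros = np.zeros(3)                    # three 0.0 values
ones = np.ones((2, 2))                 # 2x2 array of 1.0 values
sevens = np.full(4, 7)                 # four copies of 7
evens = np.arange(0, 10, 2)            # array([0, 2, 4, 6, 8])
points = np.linspace(0, 1, 5)          # 5 evenly spaced values from 0 to 1
noise = np.random.randn(5)             # 5 draws from a standard normal

# Array methods
print(a.min(), a.max(), a.mean(), a.sum())  # 1 4 2.5 10
print(a.std())                              # population standard deviation
```

Note that `.std` on a numpy array divides by n by default (`ddof=0`), which differs from the pandas default described later.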
The Pandas Series object is similar to a numpy array, with added functionality and features.
A pandas Series object is a one-dimensional, labeled array made up of an autogenerated index that starts at 0 and data of a single data type.
A couple of important things to note about a Series:
- When attempting to create a pandas Series using multiple data types (e.g., int + string), all of the values are stored under the object data type; the int values will lose their int functionality.
- A pandas Series can be created in several ways; we will look at a few of these ways below. However, it will most often be created by selecting a single column from a pandas DataFrame, in which case the Series retains the same index as the DataFrame. We will dive into this in the next two lessons: DataFrames and Advanced DataFrames.
Convention is to import pandas like this:
import pandas as pd
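As a rough sketch of the points above about Series creation, the following builds a Series from a list, from mixed data types, and from a DataFrame column. The column names, index labels, and values are invented for the example.

```python
import pandas as pd

# A Series from a list gets an autogenerated index starting at 0
numbers = pd.Series([10, 20, 30])
print(list(numbers.index))   # [0, 1, 2]
print(numbers.dtype)         # an integer dtype

# Mixing ints and strings falls back to the object dtype
mixed = pd.Series([1, "two", 3])
print(mixed.dtype)           # object

# Selecting one DataFrame column returns a Series that keeps
# the DataFrame's index (hypothetical data for illustration)
df = pd.DataFrame(
    {"fruit": ["apple", "kiwi", "mango"], "count": [3, 5, 2]},
    index=["a", "b", "c"],
)
counts = df["count"]
print(type(counts).__name__)  # Series
print(list(counts.index))     # ['a', 'b', 'c']
```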
Pandas series are vectorized by default.
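A small sketch of what vectorization looks like on a Series, using made-up values: arithmetic and comparison operators apply to every element, and the index is carried along.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Arithmetic is applied element by element
plus_ten = s + 10        # values 11, 12, 13, 14

# Comparisons return a boolean Series with the same index
mask = s > 2             # values False, False, True, True

# A boolean Series can be used to filter the original
large = s[mask]          # values 3, 4 at index labels 2, 3
```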
- Series Attributes: Attributes return useful information about a Series' properties; they don't perform operations or calculations with the Series. Attributes are accessed using dot notation, as in the examples below. A Series is made up of several components, each of which can be accessed individually through an attribute.
- Examples:
- .index: The index allows us to reference items in the series.
- .values: The values are the data itself
- .dtype: The dtype is the data type of the elements in the Series.
- int, float, bool, object, category
- .name: The name is an optional human-friendly name for the Series.
- .size: The .size attribute returns an int representing the number of rows in the Series.
- NULL values are included.
- .shape: The .shape attribute returns a tuple of dimensions. On a two-dimensional structure like a DataFrame, the tuple holds the number of rows and columns; on a Series, it is a one-element tuple holding the number of rows.
- NULL values are included.
- Series Methods
- .head: The .head(n) method returns the first n rows in the Series
- .tail: The .tail(n) method returns the last n rows in the Series
- .sample: The .sample(n) method returns a random sample of rows in the Series
- .astype: used to convert the data types of the values in the series
- .value_counts: returns a new Series consisting of a labeled index representing the unique values from the original Series and values representing the frequency of each unique value that appears in the original Series.
- It's like performing a SQL GROUP BY with a COUNT.
- .nlargest: The .nlargest(n) method returns the n largest values in the Series
- .nsmallest: The .nsmallest(n) method returns the n smallest values in the Series
- I can set the keep parameter to 'first', 'last', or 'all' to deal with duplicate largest or smallest values; this is quite handy.
- .sort_values: sorts the Series by its values in ascending or descending order
- .sort_index: sorts the Series by its index labels in ascending or descending order
- .describe: returns a Series of descriptive statistics on a pandas Series.
- The information it returns depends on the data type of the elements in the Series.
- Other descriptive statistics methods:
- count: number of non-na observations
- sum: sum of values
- mean: mean of values
- median: arithmetic median of values
- min: minimum value
- max: maximum value
- mode: most frequently occurring value(s)
- abs: absolute value
- std: Bessel-corrected sample standard deviation
- quantile: sample quantile (value at %)
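The attributes and methods listed above can be tried out on two small Series. This is only a sketch; the names `colors` and `scores` and their values are invented for the example.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "blue", "red", None, "green", "red"], name="color")
scores = pd.Series([88, 92, 92, 75, 60])

# Attributes: no parentheses, and null values are included
print(colors.name)     # color
print(colors.size)     # 6
print(colors.shape)    # (6,)

# count() skips nulls, unlike .size
print(colors.count())  # 5

# value_counts: like a SQL GROUP BY with a COUNT
color_counts = colors.value_counts()   # red 3, then blue and green with 1 each

# head / tail / sample
first_two = scores.head(2)                   # values 88, 92
last_two = scores.tail(2)                    # values 75, 60
random_two = scores.sample(2, random_state=0)

# astype converts the data type of the values
as_float = scores.astype(float)

# nlargest with keep='all' returns every value tied for the top spot
top_first = scores.nlargest(1)               # one row: 92
top_all = scores.nlargest(1, keep="all")     # two rows: 92, 92

# Sorting by values and by index
descending = scores.sort_values(ascending=False)
by_index = descending.sort_index()

# Descriptive statistics
summary = scores.describe()    # count, mean, std, min, 25%, 50%, 75%, max
print(scores.mean())           # 81.4
print(scores.median())         # 88.0
print(scores.quantile(0.5))    # 88.0
most_common = scores.mode()    # a Series, since there can be ties

# pandas std is Bessel-corrected (ddof=1); numpy's default is ddof=0
sample_std = scores.std()
population_std = np.std(scores.to_numpy())
```

Here `sample_std` comes out larger than `population_std` because the pandas default divides by n - 1 rather than n.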
Numpy Array Methods
Multiple Pandas Tutorials Recommended from the Official Pandas Docs
Pandas Cheatsheet