nicholas-dougherty/numpy-pandas-visualization-exercises

Visualizing Data with Seaborn

Experimenting with Numpy and Pandas


This repository contains exercises from Codeup's bootcamp.


Elaborating on the Subject Matter

Numpy

Numpy is a library for representing and working with large and multi-dimensional arrays. Most other libraries in the data-science ecosystem depend on numpy, making it one of the fundamental data science libraries.
Numpy provides a number of useful tools for scientific programming. Convention is to import it like so:

import numpy as np

Numpy provides capabilities pertaining to:

  • Indexing
    • An Array type that goes beyond built-in lists.
      • Create a numpy array by passing a list to the np.array function.
      • Make it multi-dimensional by passing a list of lists to np.array
  • Vectorized Operations
    • Vectorizing operations means that operations are automatically applied to every element in a vector.
      • Not only are the arithmetic operators vectorized; the same applies to comparison operators.
  • Array Creation (several methods)
    • np.random.randn; np.zeros; np.ones; np.full; np.arange; np.linspace
  • Array Methods
    • .min; .max; .mean; .sum; .std (standard deviation)
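
The capabilities listed above can be sketched in a few lines. This is a minimal illustration rather than anything taken from the exercises; the variable names and values are made up for the example.

```python
import numpy as np

# Create a one-dimensional array by passing a list to np.array
a = np.array([1, 2, 3, 4])

# Pass a list of lists to get a two-dimensional array
m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.shape)   # (2, 3)
print(m[1, 2])   # row 1, column 2 -> 6

# Vectorized arithmetic: the operation is applied to every element
doubled = a * 2

# Vectorized comparison produces a boolean array, usable as a mask
mask = a > 2
print(a[mask])   # [3 4]

# Array creation helpers
zeros = np.zeros(3)             # three 0.0 values
ones = np.ones((2, 2))          # 2x2 array of 1.0 values
sevens = np.full(3, 7)          # three copies of 7
evens = np.arange(0, 10, 2)     # start, stop (exclusive), step
grid = np.linspace(0, 1, 5)     # 5 evenly spaced values, endpoints included
noise = np.random.randn(5)      # 5 draws from the standard normal distribution

# Array methods
print(a.min(), a.max(), a.mean(), a.sum())
print(a.std())   # population standard deviation (ddof=0) by default
```

Note that np.arange excludes its stop value while np.linspace includes it, which is a common source of off-by-one surprises.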

Pandas

Series:

The Pandas Series object is similar to a numpy array, with added functionality and features.

A pandas Series object is a one-dimensional, labeled array made up of an index (autogenerated and starting at 0 unless one is supplied) and data of a single data type.
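
As a small sketch of that definition (the values and labels here are invented for illustration), a Series built from a list gets a default integer index, and custom labels can be supplied instead:

```python
import pandas as pd

# Default case: the index is autogenerated and starts at 0
s = pd.Series([10, 20, 30])
print(s.index)
print(s.dtype)

# Supplying labels replaces the autogenerated index
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])   # 20
```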

A couple of important things to note about a Series:

  • When a pandas Series is created from multiple data types (e.g., int + string), the data is converted to a single common data type, object; the int values will lose their int functionality.

  • A pandas Series can be created in several ways; we will look at a few of these ways below. However, it will most often be created by selecting a single column from a pandas DataFrame, in which case the Series retains the same index as the DataFrame. We will dive into this in the next two lessons: DataFrames and Advanced DataFrames.
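
Both notes can be checked with a short sketch. The DataFrame below and its column names are hypothetical, made up only to show the behavior.

```python
import pandas as pd

# Mixing ints and strings falls back to the generic object dtype
mixed = pd.Series([1, "two", 3])
print(mixed.dtype)   # object

# Hypothetical DataFrame with a custom index
df = pd.DataFrame(
    {"name": ["ann", "bob", "cat"], "score": [90, 85, 77]},
    index=["x", "y", "z"],
)

# Selecting a single column returns a Series that keeps the DataFrame's index
scores = df["score"]
print(type(scores))
print(scores.index.equals(df.index))   # True
```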

Convention is to import pandas like this:

import pandas as pd

Pandas Series are vectorized by default.
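
A minimal sketch of what vectorization means for a Series, using made-up values: arithmetic and comparison operators apply to every element, and the resulting boolean Series can be used to filter.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Arithmetic is applied element by element
plus_ten = s + 10

# Comparisons are vectorized too and return a boolean Series
is_even = s % 2 == 0

# A boolean Series can be used as a mask; the original index labels are kept
evens = s[is_even]
print(evens)
```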

  • Series Attributes: Attributes return useful information about a Series' properties; they don't perform operations or calculations with the Series. A Series is made up of several components, and each one can be accessed individually with dot notation.
  • Examples:
    • .index: The index allows us to reference items in the series.
    • .values: The values are the data itself
    • .dtype: The dtype is the data type of the elements in the Series.
      • int, float, bool, object, category
    • .name: The name is an optional human-friendly name for the Series.
    • .size: The .size attribute returns an int representing the number of rows in the Series.
      • NULL values are included.
    • .shape: The .shape attribute returns a tuple representing the rows and columns when used on a two-dimensional structure like a DataFrame, but it can also be used on a Series to return its number of rows.
      • NULL values are included.
  • Series Methods
    • .head: The .head(n) method returns the first n rows in the Series
    • .tail: The .tail(n) method returns the last n rows in the Series
    • .sample: The .sample(n) method returns a random sample of n rows in the Series
    • .astype: used to convert the data types of the values in the series
    • .value_counts: returns a new Series consisting of a labeled index representing the unique values from the original Series and values representing the frequency of each unique value that appears in the original Series.
      • It's like performing a SQL GROUP BY with a COUNT.
    • .nlargest: The .nlargest(n) method returns the n largest values
    • .nsmallest: The .nsmallest(n) method returns the n smallest values
      • The keep parameter can be set to first, last, or all to deal with duplicate largest or smallest values; this is quite handy.
    • .sort_values: sorts the Series by its values, in ascending or descending order
    • .sort_index: sorts the Series by its index labels, in ascending or descending order
    • .describe: returns a Series of descriptive statistics on a pandas Series.
      • The information it returns depends on the data type of the elements in the Series.
  • Other descriptive statistics methods:
    • count: number of non-NA observations
    • sum: sum of values
    • mean: mean of values
    • median: arithmetic median of values
    • min: minimum value
    • max: maximum value
    • mode: most frequently occurring value(s)
    • abs: absolute value of each element
    • std: Bessel-corrected sample standard deviation
    • quantile: sample quantile (value at %)
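
The attributes and methods listed above can be tried out together in one short sketch. The two Series here are hypothetical sample data, and random_state is passed to .sample only so that the draw is repeatable.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
colors = pd.Series(["red", "blue", "red", None, "green", "red"], name="color")
nums = pd.Series([5, 3, 9, 9, 1], name="n")

# Attributes: accessed without parentheses, no calculation involved
print(colors.name)    # color
print(colors.size)    # 6, the missing value is included
print(colors.shape)   # (6,)

# count() skips missing values, unlike .size
print(colors.count())  # 5

# value_counts(): unique values become the index, frequencies the values
counts = colors.value_counts()

# Peeking at rows
first_two = nums.head(2)
last_two = nums.tail(2)
random_two = nums.sample(2, random_state=0)

# Converting the data type
as_float = nums.astype(float)

# nlargest with a tie for the largest value (9 appears twice)
top_first = nums.nlargest(1)              # keeps only the first 9
top_all = nums.nlargest(1, keep="all")    # keeps both 9s

# Sorting by values and by index labels
by_value = nums.sort_values(ascending=False)
by_index = nums.sort_index(ascending=False)

# describe() output depends on the data type
print(nums.describe())     # count, mean, std, min, quartiles, max
print(colors.describe())   # count, unique, top, freq

# A few descriptive statistics
print(nums.median())          # 5.0
print(nums.quantile(0.5))     # 5.0
print(nums.mode().tolist())   # [9]
sample_std = nums.std()              # Bessel-corrected (ddof=1)
population_std = nums.std(ddof=0)    # same default as numpy's .std
```

One design point worth noticing: pandas defaults to the sample standard deviation (ddof=1), while numpy's array .std defaults to the population version (ddof=0), so the two give different numbers for the same data unless ddof is set explicitly.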

Further Reading:

Numpy Array Methods
Multiple Pandas Tutorials Recommended from the Official Pandas Docs
Pandas Cheatsheet
