Home

A summary of my repositories

Index

  • Visualization
    • LCSC Part Search
  • Data Science
    • 3D Printer Filament Customer Review Topic Modeling
    • Pistachio Image Classification
    • Applying Painting Style using GAN
    • Extracting Insights from Nike Product Reviews
    • Twitter Message Content Classification
    • Cancerous Cell Detection
    • Australian Weather Clustering
    • News Article Topic Classification
    • Tweet Network Analysis
    • Patient Stroke Prediction
    • New York Shooting Incidents
    • COVID Trends
  • Other
    • PyDrawNet
    • KiCAD autoBOM
    • RectanglePack
    • PDF Combine
    • PyAudioPlayer
    • TimeMarkers
    • AutoConfig
    • DictionaryPrint

Visualization

LCSC Part Search - Link | Demo

A database search tool for finding LCSC parts, written in Python with Streamlit.

Features

  • Fast, responsive interface for browsing part database (tested to ~7,000,000 parts)
  • Filtering by category, part number, description, etc.
  • Provides cost estimates by quantity, including price breaks (where known)
  • Download filtered table results as xlsx files
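As a rough illustration of how a quantity-based estimate with price breaks can work, here is a minimal sketch. It is not the tool's actual code: the function name `estimate_cost` and the `(min_quantity, unit_price)` layout are assumptions made for this example.

```python
def estimate_cost(quantity, price_breaks):
    """Return (unit_price, total) for an order of `quantity` parts.

    `price_breaks` is a hypothetical list of (min_quantity, unit_price)
    pairs. The applicable break is the one with the largest min_quantity
    that does not exceed the order quantity.
    """
    applicable = [(q, p) for q, p in price_breaks if q <= quantity]
    if not applicable:
        raise ValueError("quantity is below the smallest price break")
    _, unit_price = max(applicable)  # tuple comparison picks the largest min_quantity
    return unit_price, round(unit_price * quantity, 4)


# Illustrative prices only, not taken from any real part
breaks = [(1, 0.10), (10, 0.08), (100, 0.05)]
print(estimate_cost(25, breaks))
```

Parts with no known price breaks could be represented by a single `(1, unit_price)` entry.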

Data Science

3D Printer Filament Customer Review Topic Modeling - Link

image

Tools Used:

  • python
    • numpy, pandas, matplotlib, sentence_transformers, scikit-learn, nltk, bertopic, scipy, hdbscan, torch, ipywidgets
  • NLP, TF-IDF, Sentence Transformers, Supervised Learning, Clustering Algorithms

Abstract: In this project, topic modeling is used to extract actionable insights from product reviews for 3D printer filament. These insights help estimate which factors matter to customers when purchasing filament, and provide more specific feedback on a case-by-case basis, such as per supplier or filament type. Reviews were retrieved from the Amazon Reviews 2023 dataset after careful filtering to identify relevant products, which required the use of supervised classification algorithms. Topic modeling was performed with the BERTopic model to extract common discussion topics. Topics were then compared using a variety of metrics, including the frequency with which topics were paired within reviews and topic tone. These comparisons revealed several useful insights into customer preferences and common complaints, which could be expanded upon in future analysis.
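One of the comparisons mentioned above, how often two topics appear in the same review, can be sketched in a few lines. This is a simplified illustration that assumes each review has already been assigned a list of topic labels; the topic names below are hypothetical and do not come from the project.

```python
from collections import Counter
from itertools import combinations


def topic_pair_counts(review_topics):
    """Count how often each pair of topics appears in the same review."""
    counts = Counter()
    for topics in review_topics:
        # De-duplicate and sort so each pair has one canonical ordering
        for pair in combinations(sorted(set(topics)), 2):
            counts[pair] += 1
    return counts


# Hypothetical topic assignments, one list per review
reviews = [
    ["stringing", "bed adhesion"],
    ["stringing", "bed adhesion", "spool tangles"],
    ["color accuracy"],
]
pairs = topic_pair_counts(reviews)
print(pairs.most_common(1))
```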

Pistachio Image Classification - Link

image image

Tools Used:

Abstract: For this project I chose to tackle an image classification problem presented by a Kaggle dataset. The images in the dataset show two varieties of pistachio, "Siirt" and "Kirmizi", and the goal was to create a neural-network-based model capable of reliably differentiating between them.

The general steps of the project included exploring and pre-processing the data, preparing the model(s), training the models, and evaluating their performance.

The final model achieved a validation F1-Score of 0.98, indicating that it had effectively learned to classify the pistachios.

Applying Painting Style using GAN - Link

image image

Tools Used:

  • python
    • numpy, torch, torchvision, matplotlib, pillow
  • CNN, Deep Learning, Image Augmentation, Image Classification

Abstract: The dataset for this project is provided via the Kaggle "GAN Getting Started"/"I'm Something of a Painter Myself" competition. The goal of the project is to use a Generative Adversarial Network (GAN) to adapt real photos to the style of Claude Monet, a famous French painter, using examples of his artwork.

To reach this goal the dataset (images) will be explored and pre-processed, then a GAN-based model will be trained and used to generate adapted images. The images will then be submitted for a final score in the competition.

GANs are notoriously difficult to train, and while the model performed well in the numeric Kaggle evaluation, its subjective performance was poor. Despite many manual iterations of hyperparameter tuning, model architecture exploration, and other tweaks, no model capable of "believable" style transfer emerged.

Extracting Insights from Nike Product Reviews - Link

image

Tools Used:

  • python
    • numpy, matplotlib, nltk, scipy, lda
  • NLP, LDA, Clustering

Abstract: The goal of this project was to extract useful product and marketing insights from product reviews using topic modeling. In this case, the provided dataset is composed of real Amazon reviews, which have been filtered to focus on clothing, and which will be further filtered to focus on Nike branded products specifically.

From the collected topic descriptions some concepts can be inferred:

  • Nike products are often purchased as gifts for family members, especially sons and for Christmas presents
    • Marketing could lean into these concepts when designing advertising
  • Customers appreciate being able to find the products they want online, which are often not available locally
    • Further analysis might identify products that are often desired at local retailers, but not available.
    • Adjusting stocking practices may drive more sales.
  • Some lines of footwear commonly run small compared to expectations, such as when compared to other brands
    • Adjusting the sizing guides to match expectations may improve customer experience
  • Customers are sensitive to the returns process and dislike added shipping costs
    • Working with local retailers to provide a free dropoff option for returns may improve the customer experience (if not already an option)
  • Some lines of footwear were found to start squeaking excessively, or quickly started falling apart, which annoyed customers
    • This could indicate an error in the manufacturing process, or a necessary design change for these products

While the current topics are informative, it's clear that we could benefit from more specific information:

  • Analyzing reviews over time could clarify whether these are ongoing problems, since they may have already been fixed.
  • Category 8 "watch_band_wrist_battery" has a significantly below-average rating, suggesting that some aspects of the watches need improvement. Breaking this category out into multiple watch sub-categories (by filtering reviews to only watches and repeating topic generation, for example) could help determine exactly what the problems are.

Twitter Message Content Classification - Link

image

Tools Used:

  • python
    • numpy, torch, scikit-learn, matplotlib, nltk, spacy, gensim
  • LSTM, NLP, Text Classification

Abstract: The dataset to be analyzed was provided via Kaggle and consists of 10,000 Twitter messages hand-classified by whether or not they are about disasters.

The goal of the project was to clean, explore, and encode the data, then train Recurrent Neural Network (RNN) models to perform Natural Language Processing (NLP) and predict the disaster/not-disaster labels.
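The encoding step can be illustrated with a minimal sketch: map each word to an integer id, then pad or truncate to a fixed length so the sequences can be batched for an RNN. This is an assumption about one common approach, not the project's actual pipeline; the helper names and the whitespace tokenizer are hypothetical.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words


def build_vocab(texts, min_count=1):
    """Assign an integer id to every word seen at least `min_count` times."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for word, n in sorted(counts.items()):
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids (truncate, then pad)."""
    ids = [vocab.get(w, UNK) for w in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Made-up example messages
tweets = ["Forest fire near the town", "I love this town"]
vocab = build_vocab(tweets)
print(encode("fire near town hall", vocab, max_len=6))  # [2, 6, 9, 1, 0, 0]
```

The unseen word "hall" maps to the `<unk>` id, and the two trailing zeros are padding.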

Cancerous Cell Detection - Link

image

Tools Used:

  • python
    • numpy, torch, scikit-learn, torchvision, matplotlib
  • CNN, Deep Learning, Image Augmentation, Image Classification

Abstract: For this project the target dataset is a set of images of histopathologic scans (tissue magnified via microscope) of lymph node sections, which may or may not contain metastatic tissue (cancer). The dataset is hosted on Kaggle, with the goal of identifying which images include metastatic tissue.

More specifically, per the data description on Kaggle: "A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image."
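The labeling rule quoted above can be made concrete with a small sketch. It assumes a per-pixel tumor mask is available as a list of rows, and uses a 96x96 patch size purely as an assumption for the example (the quoted description gives only the 32x32 center region).

```python
def center_region(patch, size=32):
    """Return the central size x size window of a 2-D patch (list of rows)."""
    height, width = len(patch), len(patch[0])
    if size > height or size > width:
        raise ValueError("region is larger than the patch")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in patch[top:top + size]]


def label_from_mask(tumor_mask, size=32):
    """Positive (1) if any pixel in the central region is tumor tissue."""
    return int(any(any(row) for row in center_region(tumor_mask, size)))


# Hypothetical 96x96 mask: 1 marks a tumor pixel
mask = [[0] * 96 for _ in range(96)]
mask[5][5] = 1                # tumor pixel in the outer region only
print(label_from_mask(mask))  # 0
mask[48][48] = 1              # tumor pixel inside the center region
print(label_from_mask(mask))  # 1
```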

Convolutional Neural Network (CNN) based models will be trained towards this goal since they have been proven effective for image-based analysis tasks.

Four models were trained, each scoring between 76% and 81% accuracy during testing; the results likely suffered from overfitting. Basic ensembling was attempted, but did not improve upon the best individual score.
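The exact ensembling method is not described here, so the following is only a sketch of one basic option, soft voting, where the predicted probabilities of several models are averaged and then thresholded. The numbers are made up for illustration, and show how an ensemble can score below its best member.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model probabilities for each sample, then threshold."""
    n_models = len(probabilities)
    return [int(sum(p) / n_models >= threshold) for p in zip(*probabilities)]


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical outputs: one list of positive-class probabilities per model
truth = [1, 0, 1, 0]
model_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.7, 0.6, 0.4, 0.1],
    [0.8, 0.3, 0.3, 0.7],
]
ensemble = soft_vote(model_probs)
best_single = max(accuracy([int(p >= 0.5) for p in m], truth) for m in model_probs)
print(accuracy(ensemble, truth), best_single)  # 0.75 1.0
```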

Australian Weather Clustering - Link

image

Tools Used:

  • python
    • numpy, pandas, matplotlib, scikit-learn
  • PCA, K-Means, t-SNE

Abstract: For this project I chose a dataset describing weather in Australia, retrieved from Kaggle. The dataset covers about 10 years of daily weather observations from 49 weather stations across Australia. Each observation includes 23 features, such as date, location, temperature, humidity, etc.

Unsupervised clustering analysis will be performed to gain a better understanding of trends in the data.

The KMeans algorithm was used to cluster the weather stations by their weather patterns, resulting in a North-South divide with strong correlations to the maximum daily temperature.
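For readers unfamiliar with the algorithm, the clustering step can be sketched as plain Lloyd's iterations in pure Python. The project itself lists scikit-learn among its tools, so this is a simplified stand-in rather than the code that was run, and the station values below are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    centroids = list(points[:k])  # deterministic start: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty)
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return [nearest(p, centroids) for p in points], centroids


# Hypothetical per-station averages: (max temperature in C, humidity in %)
stations = [(33.0, 60.0), (21.0, 70.0), (32.0, 65.0),
            (20.0, 72.0), (34.0, 58.0), (19.0, 68.0)]
labels, centroids = kmeans(stations, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

In practice the features would usually be standardized first so that no single unit dominates the distance calculation.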

Between the dimensionality reduction algorithms Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), it was found that the more complex, non-linear embedding performed by t-SNE better captured the structure of the data when projecting to lower dimensions. PCA lost most of the available information during the transformation, but still produced plots that could be used to compare relationships between features. Since t-SNE is a much slower algorithm, this results in a tradeoff between processing time and embedded feature fidelity.

News Article Topic Classification - Link

image image

Tools Used:

  • python
    • numpy, pandas, matplotlib, scikit-learn
  • Regression, Cross-Validation, NMF, PCA

Abstract: The goal of this project was to train unsupervised and supervised models to predict news article topics.

The BBC News Classification dataset used in the analysis was retrieved from Kaggle, and contains 2225 articles in the categories of business, entertainment, politics, sports, and technology.

Logistic Regression models were fit with a testing accuracy of ~99%, while NMF models reached ~96%. Observed tradeoffs of the two approaches:

  • Logistic Regression requires labeled data, but is fast to train
  • NMF does not require labeled data, but is slower
  • Both can achieve similar levels of accuracy for this task
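Because NMF is unsupervised, its topic indices carry no category names, so measuring accuracy requires matching each topic to a label first. A minimal sketch of one way to do that is below; it assumes each article has already been assigned its strongest topic, and it is not necessarily how the project performed the matching.

```python
from itertools import permutations


def best_label_mapping(topic_ids, true_labels):
    """Find the topic-to-label assignment that maximizes accuracy.

    Tries every one-to-one assignment, which is cheap for five categories
    (120 permutations). Assumes there are no more topics than labels.
    """
    topics = sorted(set(topic_ids))
    labels = sorted(set(true_labels))
    best_acc, best_map = -1.0, {}
    for perm in permutations(labels, len(topics)):
        mapping = dict(zip(topics, perm))
        hits = sum(mapping[t] == y for t, y in zip(topic_ids, true_labels))
        acc = hits / len(true_labels)
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_map, best_acc


# Hypothetical topic assignments for six articles
topic_ids = [0, 0, 1, 1, 2, 2]
true_labels = ["sports", "sports", "technology", "technology",
               "business", "technology"]
mapping, acc = best_label_mapping(topic_ids, true_labels)
print(mapping, round(acc, 3))
```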

Tweet Network Analysis - Link

image

Tools Used:

  • python
    • numpy, pandas, matplotlib, nltk
  • NLP, Network Analysis, Semantic Graphs

Abstract: The goal of this project was to use network analysis on tweets to compare how consumers discuss the brands Nike, Adidas, and Lululemon.

To accomplish this, two network graphs were created: one for "mentions" between tweets, and one for the semantics of word use in tweets.

The graphs allowed some useful conclusions to be drawn about the data, seeming to reflect popular topics at the time the data was gathered. However, it was noted that individual users were capable of skewing the results through a high volume of tweets, which may require more careful filtering.
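A minimal sketch of how a "mentions" edge list can be built, with an optional per-user cap as one possible way to limit the skew from high-volume accounts. The cap is a suggestion for illustration rather than something the project implemented, and the tweets are made up.

```python
import re
from collections import Counter, defaultdict

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets, max_per_user=None):
    """Build weighted (author, mentioned) edges from (author, text) pairs.

    If `max_per_user` is set, only each author's first N tweets are counted.
    """
    seen = defaultdict(int)
    edges = Counter()
    for author, text in tweets:
        seen[author] += 1
        if max_per_user is not None and seen[author] > max_per_user:
            continue
        for target in MENTION.findall(text):
            edges[(author.lower(), target.lower())] += 1
    return edges


# Made-up tweets as (author, text) pairs
tweets = [
    ("runner1", "Loving my new @Nike shoes"),
    ("spam_bot", "Buy now @adidas"),
    ("spam_bot", "Buy now @adidas"),
    ("spam_bot", "Buy now @adidas"),
]
print(mention_edges(tweets)[("spam_bot", "adidas")])                  # 3
print(mention_edges(tweets, max_per_user=1)[("spam_bot", "adidas")])  # 1
```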

Patient Stroke Prediction - Link

image image

Tools Used:

  • python
    • numpy, pandas, matplotlib, scikit-learn
  • Regression, Cross-Validation, KNN, Random Forest, PCA, Decision Tree, SVM

Abstract: For this project I chose to analyze a stroke dataset provided by Kaggle. The dataset contains 5110 observations (patients) with 12 attributes, including a binary label for whether or not they had a stroke. The original source of the data is described as "confidential", and any attributes that might be used to personally identify a patient have been omitted (e.g. name and location). The goal of this project was to develop models capable of accurately predicting which patients are at risk of strokes using the available data. This required cleaning and preprocessing the data, followed by model selection, evaluation, and optimization.

Three methods of data preparation were considered: using PCA to perform dimensionality reduction, using standardized data, and using one-hot encoded data. Models trained on the one-hot encoded data had the highest F1-Scores when predicting on testing data, though other models had better recall. Since false negatives could be harmful in a medical setting, models with higher recall were considered preferable, even at the cost of a higher false-positive rate. The highest test recall achieved by a model was 0.92, though its precision was 0.18.
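To make the precision/recall tradeoff concrete, the F-beta score combines the two into one number, with beta > 1 weighting recall more heavily. The sketch below applies the standard formula to the precision and recall reported above. The F2 value is shown only as an illustration of a recall-weighted metric; it is not a figure from the project.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_beta(0.18, 0.92), 3))          # F1 = 0.301
print(round(f_beta(0.18, 0.92, beta=2), 3))  # F2 = 0.505
```

A model with very high recall but low precision therefore scores poorly on F1 while still looking reasonable on a recall-weighted metric.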

Overall, three models were selected for further consideration, depending on the requirements of the specific application (tradeoffs of precision and recall):

  • RandomForest trained on one-hot encoded data
  • SVC trained on one-hot encoded data
  • An ensemble of the top three models (RandomForest, AdaBoost, and LogisticRegression, each trained on one-hot encoded data)

New York Shooting Incidents - Link

image

Tools Used:

  • R

Abstract: The goal of this project was to identify trends in shooting incident data (retrieved from the city of New York website), which required data to be imported, cleaned, and analyzed.

COVID Trends - Link

image

Tools Used:

  • R

Abstract: The goal of this project was to identify trends in COVID-19 data (retrieved from Johns Hopkins University), which required data to be imported, cleaned, and analyzed.

Other

PyDrawNet - Link

A Python utility for plotting neural network (and other) diagrams

KiCAD autoBOM - Link

Python scripts for automating BOM operations in KiCAD

RectanglePack - Link

This Python project expands on the capabilities of the rectangle-packer package primarily by adding efficient rotation checking (missing from the base package), the ability to maximize area usage of stock, and multi-sheet packing.
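The idea behind rotation checking can be shown with a very small sketch: try a rectangle in its original orientation, then rotated by 90 degrees. The function names are hypothetical and this is not the API of RectanglePack or rectangle-packer; a real packer also has to choose positions and track the remaining free space.

```python
def fits(rect, space):
    """True if rect (width, height) fits inside space (width, height)."""
    return rect[0] <= space[0] and rect[1] <= space[1]


def place_with_rotation(rect, space):
    """Return the orientation of rect that fits in space, or None.

    The unrotated orientation is preferred when both would fit.
    """
    if fits(rect, space):
        return rect
    rotated = (rect[1], rect[0])
    if fits(rotated, space):
        return rotated
    return None


print(place_with_rotation((5, 2), (3, 6)))  # (2, 5): only fits when rotated
print(place_with_rotation((5, 5), (4, 6)))  # None: fits in neither orientation
```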

image

PDF Combine - Link

A Python utility for automatically combining and trimming PDF files

PyAudioPlayer - Link

An audio player GUI with yt_dlp integration, made in Python using PySide6.

TimeMarkers - Link

Some simple Python time tracking utilities

AutoConfig - Link

A Python utility for reading/writing custom YAML configuration files

DictionaryPrint - Link

A Python utility for printing out dictionary contents in an easily readable format.
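As an illustration of the general idea only (the function below is hypothetical and is not DictionaryPrint's actual interface), nested dictionaries can be made readable by indenting each level:

```python
def format_dict(data, indent=0, step=2):
    """Return a list of lines showing nested dict contents with indentation."""
    lines = []
    prefix = " " * indent
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{prefix}{key}:")
            lines.extend(format_dict(value, indent + step, step))
        else:
            lines.append(f"{prefix}{key}: {value!r}")
    return lines


config = {"name": "demo", "size": {"width": 3, "height": 4}}
print("\n".join(format_dict(config)))
```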
