A Natural Language Processing (NLP) sentiment analysis system that automatically processes and analyzes news articles to gauge market sentiment, and summarizes the news at a weekly level to improve the accuracy of stock price predictions and optimize investment strategies.


ArtZaragozaGitHub/NLP--P6_Sentiment_Analysis_and_Summarization_of_Stock_News

NLP--Sentiment_Analysis_and_Summarization_of_Stock_News

This notebook uses an NLP-driven sentiment analysis solution that processes and analyzes news articles to gauge market sentiment for predicting stock price and volume. It also summarizes lengthy news at a weekly level to improve the accuracy of stock price predictions and optimize investment strategies. GloVe, Word2Vec, and Transformer models are compared for accuracy and fine-tuned.

I experimented with specific configurations to optimize each model's initial tuning parameters after training the models with different classifiers. I also ran them on the available TPUs and GPUs to weigh processing speed against cost. I used the following libraries:

  • 1-To manipulate and analyze data: pandas, numpy.
  • 2-To visualize data: matplotlib.pyplot, seaborn.
  • 3-To parse JSON data: json.
  • 4-To build, tune, and evaluate ML models: sklearn.ensemble (GradientBoostingClassifier, RandomForestClassifier), sklearn.tree (DecisionTreeClassifier), sklearn.model_selection (GridSearchCV), sklearn.metrics (confusion_matrix, accuracy_score, f1_score, precision_score, recall_score).
  • 5-To load/create word embeddings: gensim.models (Word2Vec, KeyedVectors), gensim.scripts.glove2word2vec (glove2word2vec).
  • 6-To work with transformer models: torch, sentence_transformers.
  • 7-To summarize with NLP models: Llama and Mistral-7B, configured through max_tokens, temperature, top_p, top_k.
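
A common way to feed Word2Vec or GloVe embeddings to a scikit-learn classifier is to average the word vectors of each news item into one fixed-length feature vector. The sketch below illustrates that idea only; the notebook's exact feature construction is not shown here, and `toy_vectors` and `document_vector` are hypothetical names standing in for a trained embedding model.

```python
import numpy as np

# Hypothetical toy embeddings standing in for a trained Word2Vec or GloVe model.
# A gensim KeyedVectors object supports the same `word in kv` and `kv[word]` lookups.
toy_vectors = {
    "stock": np.array([0.2, 0.1, 0.0, 0.5]),
    "rises": np.array([0.6, 0.3, 0.1, 0.0]),
    "falls": np.array([-0.6, -0.3, 0.1, 0.0]),
}

def document_vector(text, vectors, dim):
    """Average the vectors of known tokens; return zeros if no token is known."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

headlines = ["Stock rises", "Stock falls sharply", "Unrelated text"]

# One row per headline, one column per embedding dimension.
X = np.vstack([document_vector(h, toy_vectors, dim=4) for h in headlines])
print(X.shape)
```

The resulting matrix `X` can be passed directly to any of the classifiers listed above.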

Model Trade-offs

Best Model Selection: Tuning the Word2Vec model with a Decision Tree Classifier gave performance metrics comparable to those of a non-tuned Sentence Transformer model:

  • Model: Tuned Word2Vec, Accuracy: 0.48, F1-Score: 0.48
  • Model: Non-Tuned Sentence Transformer, Accuracy: 0.52, F1-Score: 0.48

However, the Sentence Transformer requires much more TPU/GPU processing, so when cost is a consideration, the tuned Word2Vec model may be more economical.
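
Accuracy and F1 can diverge, as they do for the Sentence Transformer above, when a model favors the most frequent class. The following sketch computes both from scratch on made-up labels to show the effect. It assumes a support-weighted F1 average (the source does not state which averaging was used), and the label values are purely illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        total += f1 * n
    return total / len(y_true)

# Illustrative labels only: 1 = positive, 0 = neutral, -1 = negative.
y_true = [1, 1, 1, 1, 0, 0, -1, -1]
y_pred = [1] * 8  # a model that always predicts the majority class

print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Here the always-positive model gets half the labels right, yet its weighted F1 is lower because two of the three classes are never predicted.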

final_model_selection

For the second part of this project, I used Llama and Mistral-7B. Note: Llama models come in various sizes, including larger ones like Llama 2 70B. Mistral-7B often outperforms larger Llama models in certain tasks despite its smaller size. For the news summarization, I monitored performance and GPU utilization with the following configuration:

Llama Model Setup

This is an example of the input to be processed by the model:

input

These are examples of the resulting output, with the extracted summary, keywords, topics, stock value, and price, after summarizing the news input above for a specific date that you enter (interactive mode):

output-1

And this is an example when you just need a summary for the top 3 positive/negative events per week:

output-2
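
The weekly summary implies grouping the news by week before prompting the model. A minimal sketch of that grouping step is shown below, using only the standard library. The record layout, the sample headlines, and the prompt wording are all hypothetical; the notebook's actual prompt is not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical news records; the real dataset columns may differ.
news = [
    {"date": date(2019, 1, 2), "text": "Chipmaker cuts revenue forecast."},
    {"date": date(2019, 1, 4), "text": "Jobs report beats expectations."},
    {"date": date(2019, 1, 8), "text": "Retailer announces store closures."},
]

def group_by_week(records):
    """Map (ISO year, ISO week) to the list of news texts in that week."""
    weeks = defaultdict(list)
    for record in records:
        iso = record["date"].isocalendar()
        weeks[(iso[0], iso[1])].append(record["text"])
    return dict(weeks)

def build_prompt(texts):
    """Assemble an illustrative summarization prompt for one week of news."""
    joined = "\n".join(f"- {t}" for t in texts)
    return (
        "Summarize the top 3 positive and top 3 negative events "
        "from the news below.\n" + joined
    )

weekly = group_by_week(news)
for week, texts in sorted(weekly.items()):
    print(week, len(texts))
    prompt = build_prompt(texts)  # this string would be sent to the LLM
```

Grouping by ISO week keeps weeks that span a year boundary together, which a plain calendar-year split would not.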

Summary of my learnings:

Model Development and Hardware Optimization:

  • Explored various classifier configurations and tuning parameters
  • Evaluated performance trade-offs across different TPUs and GPUs, considering both cost and processing speed

Data Processing and Analysis:

  • Pandas and NumPy for data manipulation and numerical operations
  • JSON parsing for structured data handling
  • Matplotlib and Seaborn for data visualization
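
As one illustration of the JSON parsing step, the sketch below assumes the language model is prompted to answer with a JSON object and may wrap it in extra text. That response format is an assumption, not something the source shows, and `extract_json` and the sample response are hypothetical.

```python
import json

def extract_json(response_text):
    """Return the first-to-last brace span parsed as JSON, or None on failure."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Hypothetical model response with surrounding chatter.
response = 'Here is the result: {"summary": "Stocks fell.", "keywords": ["stocks"]} Done.'
parsed = extract_json(response)
print(parsed["summary"] if parsed else "no valid JSON found")
```

Returning `None` instead of raising lets the caller decide whether to retry the prompt or skip the item.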

Machine Learning Framework (scikit-learn):

  • Classifiers:
    • Gradient Boosting Classifier
    • Random Forest Classifier
    • Decision Tree Classifier
  • Model Optimization:
    • GridSearchCV for hyperparameter tuning
  • Evaluation Metrics:
    • Confusion Matrix
    • Accuracy Score
    • F1 Score
    • Precision Score
    • Recall Score
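
The tuning and evaluation workflow listed above can be sketched as follows. This is a minimal example on synthetic features, not the notebook's actual setup: the random data stands in for document embeddings, the three label values are assumed sentiment classes, and the parameter grid is illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))                 # placeholder for embedding features
y = np.digitize(X[:, 0], [-0.5, 0.5]) - 1      # synthetic labels: -1, 0, 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Illustrative grid; the values tuned in the notebook are not shown in the source.
param_grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
cm = confusion_matrix(y_test, y_pred, labels=[-1, 0, 1])

print(search.best_params_, round(acc, 2), round(f1, 2))
print(cm)
```

The same pattern applies to `GradientBoostingClassifier` and `RandomForestClassifier` by swapping the estimator and its grid.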

Natural Language Processing Tools:

  • Word Embeddings:
    • Gensim's Word2Vec
    • GloVe (with glove2word2vec conversion)
    • KeyedVectors for embedding management
  • Transformer Models:
    • PyTorch
    • Sentence Transformers
  • Large Language Models:
    • Llama
    • Mistral-7B with configurable parameters:
      • Maximum tokens
      • Temperature
      • Top-p sampling
      • Top-k sampling
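
To show what the sampling parameters do, the sketch below applies temperature, top-k, and top-p filtering to a small vector of made-up next-token logits. It is a conceptual illustration in NumPy, not the code path of any specific Llama or Mistral runtime, and `filter_distribution` is a hypothetical helper. Maximum tokens is not shown because it only caps the length of the generated text.

```python
import numpy as np

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k, and top-p filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k > 0:
        keep[order[top_k:]] = False            # drop everything after the k-th token

    if top_p < 1.0:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()           # renormalize the surviving tokens

logits = [2.0, 1.0, 0.5, -1.0]                 # made-up scores for four tokens
print(filter_distribution(logits, temperature=0.5))
print(filter_distribution(logits, top_k=2))
print(filter_distribution(logits, top_p=0.9))
```

Lower temperature concentrates probability on the top token, while top-k and top-p remove unlikely tokens before sampling, which tends to make summaries more deterministic.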
