This notebook builds an NLP-driven sentiment analysis solution that processes news articles to gauge market sentiment and predict stock price and volume. It also summarizes lengthy news at a weekly level to improve the accuracy of the stock price predictions and help optimize investment strategies. GloVe, Word2Vec, and Transformer models are compared for accuracy and fine-tuned.
After training the models with different classifiers, I experimented with specific configurations to optimize each model's tuning parameters. I also ran the workloads on different available TPUs and GPUs to evaluate the trade-offs between cost and processing speed. I used the following libraries (a consolidated import sketch follows the list):
1. To manipulate and analyze data: pandas, numpy.
2. To visualize data: matplotlib.pyplot, seaborn.
3. To parse JSON data: json.
4. To build, tune, and evaluate ML models: sklearn.ensemble (GradientBoostingClassifier, RandomForestClassifier), sklearn.tree (DecisionTreeClassifier), sklearn.model_selection (GridSearchCV), sklearn.metrics (confusion_matrix, accuracy_score, f1_score, precision_score, recall_score).
5. To load/create word embeddings: gensim.models (Word2Vec, KeyedVectors), gensim.scripts.glove2word2vec (glove2word2vec).
6. To work with transformer models: torch, sentence_transformers.
7. To summarize with NLP models: Llama and Mistral-7B, with generation parameters max_tokens, temperature, top_p, and top_k.
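For reference, the imports below consolidate the list above into one block (a sketch; note that DecisionTreeClassifier is imported from sklearn.tree rather than sklearn.ensemble):

```python
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Classifiers, tuning, and evaluation metrics
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Word embeddings and GloVe-to-word2vec conversion
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Transformer models
import torch
from sentence_transformers import SentenceTransformer
```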
Best Model Selection: Tuning Word2Vec with a Decision Tree Classifier gave performance comparable to a non-tuned Sentence Transformer model:
| Model | Accuracy | F1-Score |
|---|---|---|
| Tuned Word2Vec | 0.48 | 0.48 |
| Non-Tuned Sentence Transformer | 0.52 | 0.48 |
However, the Sentence Transformer's TPU/GPU processing cost is much higher, so for cost-sensitive deployments the tuned Word2Vec model may be more economical.
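For concreteness, here is a minimal sketch of the comparison, assuming tokenized article texts and price-direction labels; the toy data, the parameter grid, and the Sentence Transformer checkpoint name are illustrative assumptions, not the exact configuration used in this notebook:

```python
import numpy as np
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins: tokenized news articles and price-direction labels (illustrative)
texts = [["stocks", "rally", "on", "earnings"],
         ["shares", "fall", "after", "guidance"]] * 50
labels = [1, 0] * 50

# --- Tuned Word2Vec + Decision Tree ---
w2v = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=1)

def doc_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens into one document vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_w2v = np.array([doc_vector(t, w2v) for t in texts])
X_tr, X_te, y_tr, y_te = train_test_split(X_w2v, labels, test_size=0.2, random_state=42)

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    {"max_depth": [3, 5, 10], "min_samples_split": [2, 5]},
                    scoring="f1_weighted", cv=3)
grid.fit(X_tr, y_tr)
pred = grid.predict(X_te)
print("Tuned Word2Vec       acc:", accuracy_score(y_te, pred),
      "f1:", f1_score(y_te, pred, average="weighted"))

# --- Non-tuned Sentence Transformer + default Decision Tree ---
st = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint name is an assumption
X_st = st.encode([" ".join(t) for t in texts])
X_tr, X_te, y_tr, y_te = train_test_split(X_st, labels, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("Sentence Transformer acc:", accuracy_score(y_te, pred),
      "f1:", f1_score(y_te, pred, average="weighted"))
```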
For the second part of this project, I used Llama and Mistral-7B. Note: Llama models come in various sizes, including larger ones like Llama 2 70B, yet Mistral-7B often outperforms larger Llama models on certain tasks despite its smaller size. For the news summarization, I monitored performance and GPU utilization with the following configurations:
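A representative configuration, sketched with the Hugging Face transformers API (the checkpoint name and the exact parameter values are assumptions; max_new_tokens is the transformers counterpart of the max_tokens knob listed earlier):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # checkpoint name is an assumption
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # fp16 to fit one GPU
)

prompt = "Summarize the following financial news article:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The sampling controls monitored in this project: max tokens, temperature, top-p, top-k
outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # cap the summary length
    temperature=0.7,      # softens the next-token distribution
    top_p=0.9,            # nucleus sampling
    top_k=50,             # restrict sampling to the 50 most likely tokens
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Simple GPU-utilization check alongside the run
if torch.cuda.is_available():
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```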
This is an example of the input to be processed by the model:
These are examples of the resulting outputs, with a summary extraction, keywords, topics, and stock value and price, produced after summarizing the news input above when you enter a specific date (interactive mode):
And this is an example of the output when you just need a summary of the top 3 positive/negative events per week:
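For orientation, a hypothetical shape for one interactive-mode result; all field names and values below are illustrative placeholders, not actual model output:

```python
# Hypothetical structure of one per-date result (illustrative only)
example_output = {
    "date": "YYYY-MM-DD",
    "summary": "Extractive summary of the day's news...",
    "keywords": ["earnings", "guidance", "buyback"],
    "topics": ["quarterly results", "capital allocation"],
    "stock_value": None,  # stock value for the chosen date
    "stock_price": None,  # price after the summarized news
}
```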
Summary of my learnings:
Model Development and Hardware Optimization:
- Explored various classifier configurations and tuning parameters
- Evaluated performance trade-offs across different TPUs and GPUs, considering both cost and processing speed
Data Processing and Analysis:
- Pandas and NumPy for data manipulation and numerical operations
- JSON parsing for structured data handling
- Matplotlib and Seaborn for data visualization
Machine Learning Framework (scikit-learn):
- Classifiers (ensemble and tree-based):
  - Gradient Boosting Classifier
  - Random Forest Classifier
  - Decision Tree Classifier
- Model Optimization:
  - GridSearchCV for hyperparameter tuning
- Evaluation Metrics:
  - Confusion Matrix
  - Accuracy Score
  - F1 Score
  - Precision Score
  - Recall Score
Natural Language Processing Tools:
- Word Embeddings:
  - Gensim's Word2Vec
  - GloVe (with glove2word2vec conversion; see the sketch after this list)
  - KeyedVectors for embedding management
- Transformer Models:
  - PyTorch
  - Sentence Transformers
- Large Language Models:
  - Llama
  - Mistral-7B with configurable parameters:
    - Maximum tokens
    - Temperature
    - Top-p sampling
    - Top-k sampling
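As a footnote to the embeddings recap above, a minimal sketch of the GloVe-to-word2vec conversion (file paths are placeholders; in gensim 4.x, KeyedVectors.load_word2vec_format with no_header=True is the newer alternative to this deprecated script):

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert a raw GloVe file into word2vec text format, then load it
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.word2vec.txt")  # placeholder paths
glove_vectors = KeyedVectors.load_word2vec_format("glove.6B.100d.word2vec.txt")

print(glove_vectors.most_similar("stock", topn=3))  # sanity check on the loaded vectors
```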