This is a student project for the Machine Learning course, aimed at developing an application for detecting fake news.
Create a machine learning model capable of analyzing Ukrainian news texts and determining their authenticity.
- Searching for available Ukrainian news datasets on the internet
- Downloading the datasets (if we find any)
- Analyzing the data structure and identifying key fields (URL, title, text, source, date), including which sources the news comes from.
- Extracting relevant information for further work. For example:
  - Title of the news
  - Text of the news
  - Time when the news was published
  - Additional information that will be needed later
- Removing duplicates and unnecessary data
- Analyzing data distribution and identifying news sources. We divide them into the following groups:
  - Sources we trust:
  - Sources we do not trust:
- Balancing classes (if necessary)
- Selecting a machine learning algorithm by trial and error.
- Training the selected algorithm.
- Evaluating its performance.
- Optimizing hyperparameters to increase accuracy.
- (Optional) If the approach fails, consider:
  - Using an LLM.
  - Falling back to GPT-based classification.
- Using Streamlit for this purpose.
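
The data preparation steps in this plan (labeling news by source, removing duplicates, balancing classes) could look roughly like the sketch below. It is only an illustration under assumptions: the field names `url`, `title`, and `text` are guesses at the dataset columns, the domains in `TRUSTED` and `UNTRUSTED` are hypothetical placeholders rather than the project's actual source lists, and undersampling is just one possible way to balance classes.

```python
import random
from urllib.parse import urlparse

# Hypothetical placeholder domains: the real trusted and untrusted
# source lists are not defined in this README.
TRUSTED = {"trusted.example"}
UNTRUSTED = {"suspicious.example"}


def label_by_source(url):
    """Return 'real', 'fake', or None based on the article's domain."""
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    if domain in TRUSTED:
        return "real"
    if domain in UNTRUSTED:
        return "fake"
    return None  # unknown source: leave unlabeled


def prepare(rows, seed=0):
    """Label rows by source, drop duplicates, and undersample to balance."""
    seen = set()
    by_label = {"real": [], "fake": []}
    for row in rows:
        label = label_by_source(row["url"])
        # Two rows count as duplicates if title and text match exactly.
        key = (row["title"].strip(), row["text"].strip())
        if label is None or key in seen:
            continue
        seen.add(key)
        by_label[label].append({**row, "label": label})
    # Undersample the larger class down to the size of the smaller one.
    n = min(len(by_label["real"]), len(by_label["fake"]))
    rng = random.Random(seed)
    balanced = rng.sample(by_label["real"], n) + rng.sample(by_label["fake"], n)
    rng.shuffle(balanced)
    return balanced
```

Note that labels produced this way describe the source, not the individual article, so a model trained on them may partly learn to recognize outlets rather than fake content.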
- Source: Kaggle
- Description: Contains fake and true news about the Russo-Ukrainian war
- Source: Hugging Face
- Description: A dataset of news articles downloaded from various Ukrainian websites and Telegram channels
- Divided dataset from Hugging Face
- PM: voinskyi
- Data engineer: Yul4onok
- Data scientist: highbrow-228
📰 2025-02-16 – The first attempt to scrape news from TSN.ua
- Scraping the news was very slow, so we decided to find other ways of collecting data.
- Ran into the problem of labeling data: how can we label a huge amount of data as fake or real by ourselves?
- The second attempt to find a dataset was successful: we found ukrainian-news on Hugging Face and Ukrainian News on Kaggle.
- We are considering labeling the data this way: choose sources we trust and mark their news as "real", then choose some suspicious sources and mark their news as "fake".
- Tried to download the ukrainian-news dataset from Hugging Face but ran short of resources: the dataset is very large (22 million rows) and takes a lot of RAM to process. So we decided to divide the dataset into 23 subsets for easier processing.
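
One way to divide a dataset that does not fit in RAM is to read it row by row and write fixed-size shards to disk. The sketch below shows the idea with the standard library only. It is not necessarily how the split was actually done: the function name `write_shards`, the JSON Lines output format, and the file naming are all assumptions made for illustration.

```python
import json
from itertools import islice
from pathlib import Path


def write_shards(rows, out_dir, rows_per_shard):
    """Write an iterable of dict rows to numbered JSON Lines files.

    Only one shard's worth of rows is held in memory at a time.
    Returns the list of shard paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    it = iter(rows)
    paths = []
    while True:
        # Pull at most rows_per_shard rows from the (possibly lazy) source.
        chunk = list(islice(it, rows_per_shard))
        if not chunk:
            break
        path = out_dir / f"shard_{len(paths):03d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for row in chunk:
                # ensure_ascii=False keeps Ukrainian text readable on disk.
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths
```

The `rows` argument can be any lazy iterable, for example a dataset opened with the Hugging Face `datasets` library using `streaming=True`, which yields rows without loading the full dataset into memory. The shard size is a free parameter to tune against available RAM.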
📌 Next Steps... Stay Tuned!