Skip to content

highbrow-228/Project-ML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

📖 Documentation for project: Detection of fake Ukrainian news

alt text

📌 Description:

This is a student project for the Machine Learning course, aimed at developing an application for detecting fake news.

🎯 The main goal:

Create a machine learning model capable of analyzing Ukrainian news texts and determining their authenticity.

✅Approach

1️⃣ Data Collection

  • Searching for available Ukrainian news datasets in internet
  • Downloading datasets (if we find it)
  • Analyze the data structure and identify key fields (URL, title, text, source, date). What sources the news is from, etc.

2️⃣ Data Preprocessing

3️⃣ Model Development

  • Selection of a machine learning algorithm by the selection method.
  • Training of the selected algorithm
  • Determining its performance.
  • Optimization of hyperparameters to increase the accuracy.
  • (Optional) If the approach fails, consider:
    • Using an LLM model.
    • Falling back to GPT-based classification.

4️⃣ Creating a WebUI for easier interaction

📊 Datasets Used

1️⃣ Ukrainian Fake and True News

  • Source: Kaggle
  • Description: Contains Fake and True news about Russo-Ukrainian war

2️⃣ Ukrainian News

  • Source: Hugging Face
  • Description: A dataset of news articles downloaded from various Ukrainian websites and Telegram channels
  • Divided the Dataset from Hugging Face
    • Source:

      • Filemail. A modified version of the Hugging Face dataset, divided into 22 parts, each containing 1 million rows.
      • Filemail. A modified version of the Hugging Face dataset, divided into 46 parts, each containing 500 thousand rows.
    • Using this code for that purpose.

🚀 Development Journey

⭐ 2025-02-03 – Team Formation.

👥 Team Members:

  1. PM: voinskyi
  2. Data engineer: Yul4onok
  3. Data scientist: highbrow-228

💡2025-02-06 – Development of the Idea to Tackle Disinformation.

🔍2025-02-14 – Searched for datasets with Ukrainian news, but failed.

📰2025-02-16 – The first try to scrap news from TSN.ua

  • News is scrapping very slowly so we decided to find other ways of collecting data.

🤔2025-02-17

  • Came across the problem of labeling data (how to label a huge amount of data (fake or real) by ourselves?).
  • The second try to find the dataset and already successful. (We found ukrainian-news on HuggingFace and Ukrainian News on Kaggle).
  • We consider the idea of labeling data this way: Choose sources we trust and mark that news as "real", then choose some suspicious sources and mark them as "fake".

🗂️2025-02-18

  • Trying to download ukrainian-news dataset on HuggingFace but came across of lacking resources (the dataset is really large (22 milion rows) and takes a lot of RAM to process). So we decide to divide the dataset into 23 subsets for better processing.

📌 Next Steps... Stay Tuned!

About

A project to identify fake news in the Ukrainian press

Topics

Resources

License

Stars

Watchers

Forks

Contributors 3

  •  
  •  
  •