Skip to content

πŸ₯ˆ This repository contains the 2nd place solution for the Hackathon Schneider Pollution competition (Data Science)

Notifications You must be signed in to change notification settings

albertferre/hackathon-schneider-pollution

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Air Quality Insights πŸ†

πŸ₯ˆ Second Place in the hackaton-schneider-pollution

This repository contains the solution that achieved second place in the air quality challenge organized by Nuwe and Schneider Electric. Below are the results obtained:

πŸ“Š My Score

My Score

πŸ“Œ Final Ranking

Final Ranking

πŸ† Challenge Details


Air Quality Insights

Category ➑️ Data Science

Subcategory ➑️ Recommender systems

Difficulty ➑️ Medium


🌐 Background

The challenge of energy pollution and climate action arises from the global dependency on fossil fuels, which are the primary contributors to greenhouse gas emissions. The combustion of coal, oil, and natural gas for energy production releases carbon dioxide and other harmful pollutants into the atmosphere, accelerating climate change and damaging ecosystems. As energy demand surges due to population growth and industrial development, the environmental impact intensifies. This challenge necessitates innovative solutions to transition towards cleaner energy sources, enhance energy efficiency, and implement sustainable practices. Tackling energy pollution is vital not only for mitigating climate change but also for fostering healthier communities and ensuring long-term environmental sustainability.

πŸ—‚οΈ Dataset

Three distinct datasets will be provided:

  • Measurement data:

    • Variables:
      • Measurement date
      • Station code
      • Latitude
      • Longitude
      • SO2
      • NO2
      • O3
      • CO
      • PM10
      • PM2.5
  • Data available at: Download Measurement data

  • Instrument data

    • Variables:
      • Measurement date
      • Station code
      • Item code
      • Average value
      • Instrument status : Status of the measuring instrument when the sample was taken.
      {
          0: "Normal",
          1: "Need for calibration",
          2: "Abnormal",
          4: "Power cut off",
          8: "Under repair",
          9: "Abnormal data",
      }
    • Data available at: Download Instrument data
  • Pollutant data

  • Variables:

    • Item code
    • Item name
    • Unit of measurement
    • Good
    • Normal
    • Bad
    • Very bad

πŸ“Š Data Processing

Perform comprehensive data processing, including filtering, normalization, and handling missing values.

Afterwards, develop two machine learning models:

  • Forecast Model: Predict hourly air quality for specified periods, assuming no measurement errors.

  • Instrument Status Model: Detect and classify failures in measurement instruments.

πŸ“‚ Repository Structure

The repository structure is provided and must be adhered to strictly:


β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/
β”‚   β”‚   └── pollutant_data.csv
β”‚   └── processed/
β”‚
│── predictions/
β”‚   β”œβ”€β”€ questions.json
β”‚   β”œβ”€β”€ predictions_task_2.json
β”‚   └── predictions_task_3.json
β”‚
│── models/
β”‚   β”œβ”€β”€ model_task_2
β”‚   └── model_task_3
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ questions.py
β”‚   β”‚   └── ...
β”‚   β”‚
β”‚   └── models/
β”‚       └── (prepare your data and create the model)
β”‚
└── README.md

The /predictions folder should contain the tasks outputs for the questions in Task 1 and the predictions for both Task 2 and Task 3.

🎯 Tasks

This challenge will include three tasks: an initial exploratory data analysis task with questions, followed by two model creation tasks.

Task 1: Answer the following questions about the given datasets:

IMPORTANT Answer the following questions considering only measurements with the value tagged as "Normal" (code 0):

  • Q1: Average daily SO2 concentration across all districts over the entire period. Give the station average. Provide the answer with 5 decimals.
  • Q2: Analyse how pollution levels vary by season. Return the average levels of CO per season at the station 209. (Take the whole month of December as part of winter, March as spring, and so on.) Provide the answer with 5 decimals.
  • Q3: Which hour presents the highest variability (Standard Deviation) for the pollutant O3? Treat all stations as equal.
  • Q4: Which is the station code with more measurements labeled as "Abnormal data"?
  • Q5: Which station code has more "not normal" measurements (!= 0)?
  • Q6: Return the count of Good, Normal, Bad and Very bad records for all the station codes of PM2.5 pollutant.

Example question output format:

{"target":
  {
    "Q1": 0.11111,
    "Q2": {
        "1": 0.11111,
        "2": 0.11111,
        "3": 0.11111,
        "4": 0.11111
    },
    "Q3": 111,
    "Q4": 111,
    "Q5": 111,
    "Q6": {
        "Normal": 111,
        "Good": 111,
        "Bad": 111,
        "Very bad": 111
    }
  }
}

Task 2: Develop the forecasting model :

  • Predict hourly pollutant concentrations for the following stations and periods, assuming error-free measurements:
Station code: 206 | pollutant: SO2   | Period: 2023-07-01 00:00:00 - 2023-07-31 23:00:00
Station code: 211 | pollutant: NO2   | Period: 2023-08-01 00:00:00 - 2023-08-31 23:00:00
Station code: 217 | pollutant: O3    | Period: 2023-09-01 00:00:00 - 2023-09-30 23:00:00
Station code: 219 | pollutant: CO    | Period: 2023-10-01 00:00:00 - 2023-10-31 23:00:00
Station code: 225 | pollutant: PM10  | Period: 2023-11-01 00:00:00 - 2023-11-30 23:00:00
Station code: 228 | pollutant: PM2.5 | Period: 2023-12-01 00:00:00 - 2023-12-31 23:00:00

Expected output format:

{
  "target":
  {
    "206":
      {
        "2023-07-01 00:00:00": 0.32,
        "2023-07-01 01:00:00": 0.5,
        "2023-07-01 02:00:00": 0.8,
        "2023-07-01 03:00:00": 0.11,
        "2023-07-01 04:00:00": 0.7,
        ...
      },
    "211":
      {
        ...
      }
  }
}

Task 3: Detect anomalies in data measurements

Detect instrument anomalies for the following stations and periods:

Station code: 205 | pollutant: SO2   | Period: 2023-11-01 00:00:00 - 2023-11-30 23:00:00
Station code: 209 | pollutant: NO2   | Period: 2023-09-01 00:00:00 - 2023-09-30 23:00:00
Station code: 223 | pollutant: O3    | Period: 2023-07-01 00:00:00 - 2023-07-31 23:00:00
Station code: 224 | pollutant: CO    | Period: 2023-10-01 00:00:00 - 2023-10-31 23:00:00
Station code: 226 | pollutant: PM10  | Period: 2023-08-01 00:00:00 - 2023-08-31 23:00:00
Station code: 227 | pollutant: PM2.5 | Period: 2023-12-01 00:00:00 - 2023-12-31 23:00:00

Example output:

{
  "target":
  {
    "205":
    {
      "2023-11-01 00:00:00": 5,
      "2023-11-01 01:00:00": 3,
      "2023-11-01 02:00:00": 6,
      "2023-11-01 03:00:00": 1,
      "2023-11-01 05:00:00": 3,
    ...
    },
    "209":
    {
      ...
    }
  }
}

πŸ’« Guides

  • Study and explore the datasets thoroughly.
  • Handle missing or erroneous values.
  • Normalize and scale the data.
  • Implement feature engineering to improve model accuracy.

πŸ“€ Submission

Submit a questions.json file for the queries in task 1 and a predictions_task_2.json and predictions_task_3.json files containing the predictions made by your models. Ensure the files are formatted correctly and placed within the /predictions folder.

πŸ“Š Evaluation

  • Task 1: The questions of this tasks will be evaluated via JSON file, comparing your answer in questions.json against the expected value.
  • Task 2: The model will be evaluated using R2. The score will be the mean for all the station predictions.
  • Task 3: The model will be evaluated using F1 score, using the macro average.

The grading system will be the following:

  • Task 1: 300 / 1400 points
  • Task 2: 550 / 1400 points
  • Task 3: 550 / 1400 points

⚠️ Please note: All submissions might undergo a manual code review process to ensure that the work has been conducted honestly and adheres to the highest standards of academic integrity. Any form of dishonesty or misconduct will be addressed seriously, and may lead to disqualification from the challenge.

❓ FAQs

Q1: How do I submit my solution?

A1: Submit your solution via Git. Once your code and predictions are ready, commit your changes to the main branch and push your repository. Your submission will be graded automatically within a few minutes. Make sure to write meaningful commit messages.

About

πŸ₯ˆ This repository contains the 2nd place solution for the Hackathon Schneider Pollution competition (Data Science)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published