Before you begin, ensure you have the following installed:
- Python (latest stable version recommended)
- pip (the Python package manager)

1. **Clone the Repository** — if you haven't already, clone the BeyondTheMarks repository:

   ```shell
   git clone https://github.com/ShailKPatel/BeyondTheMarks
   cd BeyondTheMarks
   ```

2. **Install Dependencies** — run the following command to install all required dependencies from `requirements.txt`:

   ```shell
   pip install -r requirements.txt
   ```

3. **Run the Application** — start the application with Streamlit:

   ```shell
   streamlit run main.py
   ```
The project is organized as follows:

```
BeyondTheMarks
│   LICENSE
│   main.py
│   README.md
│   requirements.txt
│
├───.streamlit
│       config.toml
│
├───analysis
│       subject_analysis.py
│       teacher_analysis.py
│
├───bias_analysis
│       bias_detection.py
│
├───core_functionality
│       data_validator.py
│
├───images
│       logo.png
│
├───reviews
│       recent_reviews.txt
│       word_count.txt
│
├───samplefiles
│       test1.csv
│       test2.csv
│       test3.csv
│       test4.csv
│
└───views
        Data_Dissector.py
        Home.py
        Reviews.py
        Tech_Wizardry.py
        The_Brains_Behind.py
```
The `main.py` script sets up a multi-page Streamlit application with a structured navigation system. It provides an interface for users to navigate between different sections of the app, each serving a distinct purpose.
- Multi-Page Navigation: Allows seamless switching between different pages.
- Categorized Sections:
- Home: Introduction and overview of the project.
- Data Dissector: Provides core analysis functionalities.
- The Brains Behind: Displays credits for contributors.
- Tech Wizardry: Showcases the technologies used in the project.
- Reviews: Displays user feedback and reviews.
The script defines five pages, each corresponding to a separate Python file located in the `views` directory:

```python
home = st.Page("views/Home.py", icon='🏠')
data_dissector = st.Page("views/Data_Dissector.py", icon='🔬')
the_brains_behind = st.Page("views/The_Brains_Behind.py", icon='🧠')
tech_wizardry = st.Page("views/Tech_Wizardry.py", icon='🛠️')
reviews = st.Page("views/Reviews.py", icon='📨')
```
Each page file should contain its own Streamlit logic and UI components.
The script uses `st.navigation()` to create a structured menu:

```python
pg = st.navigation([
    home,
    data_dissector,
    the_brains_behind,
    tech_wizardry,
    reviews,
])
```
This ensures a well-organized navigation bar, allowing users to switch between sections effortlessly.
- Each page file (`Home.py`, `Data_Dissector.py`, etc.) must be properly configured with Streamlit components.
- Ensure that all necessary dependencies are installed to avoid runtime errors.
This `main.py` script serves as the entry point for the BeyondTheMarks Streamlit application, providing a structured and user-friendly interface for data analysis, reviews, and project insights.
The `views` directory contains individual Python scripts responsible for rendering different sections of the BeyondTheMarks Streamlit application. Each script defines the layout and functionality of a specific page within the app.
The `Home.py` script serves as the landing page of the application, introducing users to BeyondTheMarks and providing navigation to key sections.
- Displays the project logo.
- Provides a brief description of the project and its functionalities.
- Lists key features such as file validation, teacher analysis, and bias detection.
- Offers quick navigation links to other views (Data Dissector, The Brains Behind, Tech Wizardry, and Reviews).
- Shows footer information, including licensing details.
- Project Overview: Explains the purpose and scope of the project.
- Key Features: Highlights core functionalities.
- Navigation Links: Directs users to different sections of the application.
- Footer: Displays license and developer information.
The `Tech_Wizardry.py` script provides insight into the technologies used to build the application.
- Displays the project logo.
- Lists the core technology stack, including Python, Streamlit, and key data science libraries.
- Explains mathematical techniques like ANOVA and One-Hot Encoding used in bias detection.
- Provides a lighthearted overview of why the technology stack was chosen.
- Core Tech Stack: Describes programming languages, frameworks, and libraries used.
- Mathematical Wizardry: Explains the statistical techniques powering analysis.
- Why This Works: A fun and engaging explanation of the tool’s effectiveness.
- Fun Fact: Adds a humorous touch to the documentation.
The `The_Brains_Behind.py` script gives credit to the contributors behind the project.
- Displays the project logo.
- Highlights Shail K Patel as the lead developer.
- Provides links to LinkedIn and GitHub profiles.
- Includes a fun fact about the project's development journey.
- Contributor Information: Acknowledges key developers.
- Social Links: Provides ways to connect with the contributors.
- Fun Fact: Adds personality to the documentation.
The `Reviews.py` script is a Streamlit-based feedback module for the BeyondTheMarks project. It allows users to submit and view reviews, analyzes word frequencies, and maintains a record of recent feedback.
- Review Submission & Display: Users can submit feedback, which is displayed dynamically.
- Queue System: Stores up to six recent reviews using a queue.
- Word Frequency Analysis: Tracks the most frequently used words in reviews.
- Persistent Storage:
  - Reviews are stored in `reviews/recent_reviews.txt`.
  - Word counts are stored in `reviews/word_count.txt`.
- Graphical Representation: Displays the top 10 most common words using a Plotly bar chart.
The script defines a `QueueWithReverse` class for managing a fixed-size queue (max 6 entries):

```python
class QueueWithReverse:
    THRESHOLD = 6
    ...
```
- enqueue(entry): Adds an entry; removes the oldest if full.
- dequeue(): Removes and returns the oldest entry.
- retrieve(): Returns all stored entries.
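The behavior above might be implemented as in the following sketch. Only the method names, the threshold of 6, and the described behavior come from the source; the `deque`-based internal storage is an assumption.

```python
from collections import deque


class QueueWithReverse:
    """Fixed-size FIFO queue for recent reviews (max 6 entries) — a sketch."""

    THRESHOLD = 6

    def __init__(self):
        self._entries = deque()

    def enqueue(self, entry):
        # Drop the oldest entry once the queue is full, then append the new one.
        if len(self._entries) >= self.THRESHOLD:
            self._entries.popleft()
        self._entries.append(entry)

    def dequeue(self):
        # Remove and return the oldest entry.
        return self._entries.popleft()

    def retrieve(self):
        # Return all stored entries, oldest first.
        return list(self._entries)
```

With this sketch, enqueuing an eighth entry silently evicts the two oldest, so `retrieve()` never returns more than six reviews.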
Reviews are saved and loaded via:

```python
def extract_reviews(): ...
def preserve_reviews(repository): ...
```

- `extract_reviews()`: Reads stored reviews from `recent_reviews.txt`.
- `preserve_reviews(repository)`: Saves queue data back to the file.
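A possible implementation of these helpers is sketched below. The one-review-per-line file format and the optional `path` parameter are assumptions; only the file location (`reviews/recent_reviews.txt`) and the function names come from the source.

```python
REVIEWS_PATH = "reviews/recent_reviews.txt"  # path from the project tree


def extract_reviews(path=REVIEWS_PATH):
    """Read stored reviews, one per line; return an empty list if the file is missing."""
    try:
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    except FileNotFoundError:
        return []


def preserve_reviews(repository, path=REVIEWS_PATH):
    """Write the queue's entries back to the file, one review per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(repository))
```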
A custom hash map (`CustomHashMap`) tracks word frequencies:

```python
class CustomHashMap:
    def add(word): ...
    def get_items(): ...

def load_word_count(): ...
def save_word_count(): ...
```
- add(word): Increments the count of a word.
- get_items(): Returns the top 10 most frequent words.
- load_word_count(): Reads data from `word_count.txt`.
- save_word_count(): Writes word frequencies to `word_count.txt`.
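The core of such a frequency tracker might look like this sketch; a plain dictionary stands in for the custom hash map, and the lowercasing reflects the case-insensitive tracking noted below. The internals are an assumption.

```python
from collections import Counter


class CustomHashMap:
    """Word-frequency tracker — a sketch; a dict stands in for the custom hash map."""

    def __init__(self):
        self._counts = {}

    def add(self, word):
        # Frequency tracking is case-insensitive, so normalize before counting.
        word = word.lower()
        self._counts[word] = self._counts.get(word, 0) + 1

    def get_items(self):
        # Return the top 10 most frequent words as (word, count) pairs.
        return Counter(self._counts).most_common(10)
```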
- Shows up to 6 recent reviews in a grid layout:

  ```python
  reviews = review_queue.retrieve()
  columns = st.columns(min(3, total_entries - i))
  ```

- Displays a Plotly bar chart of the most used words:

  ```python
  fig = go.Figure(data=[go.Bar(x=words, y=frequencies, marker_color='indianred')])
  st.plotly_chart(fig)
  ```
- Users enter text in a text area and submit:

  ```python
  user_review = st.text_area("Got Complaint.. Er... Suggestion? Drop them here", "")
  ```
- The review is added to the queue and stored persistently.
- Balloon animation appears on submission.
- The review queue is limited to six entries.
- Words are case-insensitive in frequency tracking.
- Data persistence ensures reviews and word counts remain after a restart.
The `data_validator.py` module provides functionality for validating and processing CSV and Excel files containing student data. It ensures that the data is correctly structured, contains necessary columns, and adheres to specific rules such as numeric constraints and unique identifiers.
- Supports `.csv` and `.xlsx` file formats.
- Validates file integrity and structure.
- Dynamically detects subjects based on column naming conventions.
- Ensures `Roll No` uniqueness and numeric constraints for marks and attendance.
- Raises custom exceptions for various validation failures.
The module defines custom exceptions, raised when:
- the file format is unsupported;
- the file cannot be read, possibly due to corruption;
- the file lacks necessary columns or has an incorrect structure;
- an unexpected column is found in the dataset.
Description: Validates the structure and content of a given file and converts it into a Pandas DataFrame.

Parameters:
- `file`: The uploaded file object.

Returns:
- `tuple`: `(Pandas DataFrame, NumPy array of detected subjects)`
Validation Steps:
- Check the file extension.
- Load the data into a Pandas DataFrame.
- Verify required columns (`Roll No`, subject-wise `Marks` and `Attendance`).
- Ensure subject-based column relationships (e.g., `Math Marks` must have `Math Attendance`).
- Detect unknown columns and raise an error if found.
- Ensure at least one subject exists.
- Convert detected subjects into a NumPy array.
- Call `validate_data(df)` for further validation.
Description: Validates the contents of the Pandas DataFrame by ensuring uniqueness, numeric constraints, and value ranges.

Parameters:
- `df`: A Pandas DataFrame containing the dataset.

Returns:
- Pandas DataFrame: The validated and cleaned dataset.
Validation Steps:
- Ensure `Roll No` values are unique.
- Identify all `Marks` and `Attendance` columns dynamically.
- Ensure these columns are numeric.
- Check that values are within the valid range (0–100).
- Round values to two decimal places.
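The steps above can be sketched as follows. The suffix-based column detection mirrors the naming convention described earlier, but the exact detection logic and error messages are assumptions.

```python
import pandas as pd


def validate_data(df):
    """Validate uniqueness, numeric constraints, and value ranges — a sketch."""
    # Roll No values must be unique.
    if df["Roll No"].duplicated().any():
        raise ValueError("Roll No values must be unique")

    # Dynamically identify all Marks and Attendance columns by suffix.
    value_cols = [c for c in df.columns if c.endswith(("Marks", "Attendance"))]
    for col in value_cols:
        # Ensure the column is numeric (raises on non-numeric values).
        df[col] = pd.to_numeric(df[col], errors="raise")
        # Check the 0–100 range.
        if ((df[col] < 0) | (df[col] > 100)).any():
            raise ValueError(f"{col} contains values outside 0-100")
        # Round to two decimal places.
        df[col] = df[col].round(2)
    return df
```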
This Python script analyzes the effectiveness of teachers by examining the variance in student performance metrics such as attendance and marks. It applies statistical methods like one-way ANOVA and calculates weighted scores based on mean and interquartile range (IQR) values.
- ANOVA Test: Determines if there is a significant difference in marks or attendance across different teachers.
- Mean and IQR Calculation: Computes the average and interquartile range for each teacher.
- Weighted Score Computation: Uses a 60-40 weighted formula to rank teachers based on their effectiveness.
- Box Plot Visualization: Generates distribution plots for attendance and marks across teachers.
Purpose:
- Performs a one-way ANOVA test to check if there is a significant difference in marks or attendance across different teachers.
Parameters:
- `df` (DataFrame): A DataFrame with two columns:
  - One column for teachers (e.g., "Math Teacher").
  - One numeric column ("Marks" or "Attendance").

Returns:
- `True` if the ANOVA test finds a significant difference (p-value < 0.1), otherwise `False`.
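This check can be sketched with SciPy's one-way ANOVA; the function and argument names below are illustrative, not the repo's actual API.

```python
from scipy import stats


def has_significant_difference(df, teacher_col, value_col, alpha=0.1):
    """One-way ANOVA across teacher groups; True if p-value < alpha — a sketch."""
    # Split the numeric column into one array per teacher.
    groups = [group[value_col].to_numpy() for _, group in df.groupby(teacher_col)]
    _, p_value = stats.f_oneway(*groups)
    return bool(p_value < alpha)
```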
Purpose:
- Computes the mean of the numeric column (Marks or Attendance) for each teacher.
Returns:
- A dictionary `{teacher_name: mean_value}`.
Purpose:
- Computes the interquartile range (IQR) of the numeric column (Marks or Attendance) for each teacher.
Returns:
- A dictionary `{teacher_name: iqr_value}`.
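Both per-teacher statistics reduce to short pandas groupby expressions; the function names below are illustrative.

```python
def teacher_means(df, teacher_col, value_col):
    """Mean of the numeric column per teacher, as {teacher_name: mean_value}."""
    return df.groupby(teacher_col)[value_col].mean().to_dict()


def teacher_iqrs(df, teacher_col, value_col):
    """Interquartile range (Q3 - Q1) per teacher, as {teacher_name: iqr_value}."""
    grouped = df.groupby(teacher_col)[value_col]
    return (grouped.quantile(0.75) - grouped.quantile(0.25)).to_dict()
```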
Purpose:
- Determines teacher effectiveness based on ANOVA significance tests for marks and attendance.
- Calculates mean, IQR, and weighted scores for teachers if significant differences exist.
Returns:
- A dictionary: `{"Marks": {teacher_name: weighted_score}, "Attendance": {teacher_name: weighted_score}}`
Purpose:
- Computes a weighted score for each teacher using a 60-40 formula:
- 60% weight for Mean
- 40% weight for IQR
Returns:
- A dictionary `{teacher_name: weighted_score}`.
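One way to read the 60-40 formula is sketched below. The exact combination is an assumption: here the mean and IQR are simply weighted 0.6/0.4 and summed, while the repo may instead penalize a large IQR (more spread) rather than reward it.

```python
def weighted_scores(means, iqrs):
    """Combine per-teacher mean and IQR with the 60-40 weighting — a sketch.

    The sign convention on the IQR term is an assumption; only the 60/40
    split comes from the source.
    """
    return {t: round(0.6 * means[t] + 0.4 * iqrs[t], 2) for t in means}
```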
Purpose:
- Generates a box plot to visualize the distribution of attendance or marks for each teacher.
Returns:
- A Plotly figure object.
Purpose:
- Generates box plots for attendance and marks distributions across teachers.
Returns:
- Two Plotly figure objects: one for attendance and one for marks.
```python
import pandas as pd

data = {
    "Teacher": ["A", "A", "B", "B", "C", "C", "A", "B", "C"],
    "Marks": [80, 85, 78, 82, 88, 90, 83, 79, 87],
    "Attendance": [90, 95, 85, 80, 88, 92, 93, 81, 89],
}
df = pd.DataFrame(data)

results = analyze_teacher_effectiveness(df)
print(results)
```
- Teachers with fewer than 3 data points are excluded from the analysis.
- ANOVA test significance is set at `p-value < 0.1` to detect meaningful variations.
This script helps in evaluating the impact of teachers based on student attendance and marks using statistical analysis and visualization tools.
This function analyzes student performance across multiple subjects based on marks and attendance. It generates visualizations to explore correlations, distribution of marks, and the relationship between attendance and marks.
- `df` (`pd.DataFrame`): A dataset containing student performance details.
- `subject_names` (`list`): A list of subjects to analyze (e.g., `["Math", "Science", "English"]`).
A tuple containing:
- Correlation Matrix Heatmap (Plotly figure) - Displays correlations between subject marks if multiple subjects are present.
- Box Plot of Subject Marks (Plotly figure) - Shows the distribution of marks for each subject if multiple subjects are present.
- Scatter Plots (List of Plotly figures) - Each plot illustrates the relationship between attendance and marks for a subject, including a regression line.
- Correlation Matrix
  - Extracts subject marks columns and computes a correlation matrix.
  - Generates a heatmap to visualize correlations (if multiple subjects are present).
- Box Plot of Marks Distribution
  - Converts marks data to a long format for visualization.
  - Creates a box plot showing the spread of marks across subjects.
- Scatter Plots for Attendance vs. Marks
  - Iterates through each subject.
  - Fits a simple linear regression model using attendance as the independent variable and marks as the dependent variable.
  - Displays the regression equation on each scatter plot.
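The per-subject regression line can be obtained with a plain least-squares fit, for example via `numpy.polyfit` (the function name below is illustrative; the repo's fitting method may differ).

```python
import numpy as np


def fit_attendance_regression(attendance, marks):
    """Fit the line marks ~ slope * attendance + intercept by least squares.

    Returns (slope, intercept), which is enough to draw the trend line and
    display the regression equation on a scatter plot.
    """
    slope, intercept = np.polyfit(np.asarray(attendance, dtype=float),
                                  np.asarray(marks, dtype=float), deg=1)
    return slope, intercept
```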
```python
import pandas as pd

data = {
    "Roll No": [1, 2, 3, 4, 5],
    "Name": ["Amit", "Neha", "Rohan", "Sara", "Vikram"],
    "Math Marks": [85, 78, 92, 65, 80],
    "Math Attendance": [90, 85, 95, 60, 88],
    "Science Marks": [75, 88, 79, 72, 85],
    "Science Attendance": [80, 92, 78, 85, 90],
    "English Marks": [82, 79, 88, 77, 83],
    "English Attendance": [85, 80, 90, 70, 82],
}
df = pd.DataFrame(data)
subjects = ["Math", "Science", "English"]

fig1, fig2, scatter_list = analyze_subject_performance(df, subjects)

if fig1: fig1.show()
if fig2: fig2.show()
for fig in scatter_list:
    fig.show()
```
- If only one subject is provided, the correlation matrix and box plot are not generated.
- Handles different subject names dynamically.
- Accounts for missing values by converting data types before regression.
- Regression analysis is used to determine the effect of attendance on marks.
- Scatter plots include trend lines for better interpretation of relationships.
The `detect_bias` function analyzes potential bias in student marks based on gender or religion using multiple linear regression and SHAP analysis. It helps identify whether a categorical factor influences student grades disproportionately.
The regression equation used in this analysis follows:
$$
\text{Marks} = \beta_0 + \beta_1 \times \text{Attendance} + \beta_2 \times \text{Teacher\_Avg} + \beta_3 \times \text{Categorical\_Factor} + \varepsilon
$$
Where:
- $\beta_0$: Intercept (baseline marks)
- $\beta_1$: Impact of attendance on marks
- $\beta_2$: Impact of the teacher's average student performance
- $\beta_3$: Influence of the categorical factor (Gender or Religion)
- $\varepsilon$: Random error term
- Detect Required Columns → Identify `Attendance`, `Marks`, `Teacher`, and a categorical factor (Gender or Religion).
- Data Filtering → If the teacher column exists, exclude teachers with ≤5 students. If absent, consider all students.
- Replace Teacher Name with Their Average Student Marks (or Overall Avg if no teacher column).
- One-Hot Encode the Categorical Factor without dropping the first category to retain visibility.
- Perform Regression Analysis using OLS (Ordinary Least Squares) to determine bias impact.
- Use SHAP (SHapley Additive exPlanations) to analyze feature importance.
- Generate a Plotly Bar Chart with Bias Interpretation.
- 0 - 0.05 → Negligible Bias 🟢
- 0.05 - 0.15 → Mild Bias 🟡 (Possible but weak influence)
- 0.15 - 0.30 → Moderate Bias 🟠 (Further investigation needed)
- > 0.30 → Severe Bias 🔴 (Strong evidence of discrimination)
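The severity bands above map directly to a small classification helper (the function name is illustrative; the thresholds are taken from the list above):

```python
def classify_bias(shap_magnitude):
    """Map a SHAP-value magnitude to the severity bands listed above."""
    m = abs(shap_magnitude)  # direction of the bias does not affect severity
    if m <= 0.05:
        return "Negligible Bias"
    if m <= 0.15:
        return "Mild Bias"
    if m <= 0.30:
        return "Moderate Bias"
    return "Severe Bias"
```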
- `df` (`pd.DataFrame`): DataFrame with the following structure:
  - `"[Subject] Attendance"`: Numeric (e.g., `"Math Attendance"`)
  - `"[Subject] Marks"`: Numeric (e.g., `"Math Marks"`)
  - `"[Subject] Teacher"` (Optional): Categorical (e.g., `"Math Teacher"`)
  - `"Gender"` or `"Religion"`: Categorical (≤4 unique values)
- A Plotly bar chart showing SHAP values for bias detection (not displayed directly in function).
- Detect attendance, marks, and teacher columns dynamically.
- Ensure at least one categorical column (Gender or Religion) exists with ≤4 unique values.
- If a `Teacher` column exists, replace it with each teacher's average student marks.
- If there is no `Teacher` column, use the overall class average.
- Apply One-Hot Encoding without dropping any category to retain full visibility.
- Define independent variables (X): Attendance, Teacher Avg, Encoded Categorical Values.
- Define dependent variable (y): Marks.
- Perform Ordinary Least Squares (OLS) Regression.
- Compute SHAP values to determine each feature’s contribution.
- Categorize SHAP values as positive (blue) or negative (red) for visual interpretation.
- Display a stacked bar chart showing bias influence based on SHAP values.
- Highlight severity thresholds for interpretation.
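The encoding and regression steps can be sketched with a plain least-squares fit. The function name is hypothetical, and `numpy.linalg.lstsq` stands in for the statsmodels OLS and SHAP machinery used by the actual script; this only illustrates the design-matrix construction.

```python
import numpy as np
import pandas as pd


def ols_bias_coefficients(df, subject="Math", factor="Gender"):
    """Fit Marks ~ Attendance + one-hot(factor) by ordinary least squares.

    All categories are kept (no category is dropped), as described above;
    lstsq then returns the minimum-norm solution for the resulting
    rank-deficient design matrix.
    """
    # One-hot encode the categorical factor without dropping any category.
    X = pd.get_dummies(df[[f"{subject} Attendance", factor]],
                       columns=[factor], dtype=float)
    X.insert(0, "Intercept", 1.0)
    y = df[f"{subject} Marks"].astype(float)
    coefs, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y.to_numpy(), rcond=None)
    return dict(zip(X.columns, coefs))
```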
```python
import pandas as pd

data = {
    "Math Teacher": ["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
    "Gender": ["Male", "Female", "Trans", "Male", "Female", "Trans", "Male", "Female", "Trans", "Male", "Female", "Trans"],
    "Math Attendance": [90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90],
    "Math Marks": [85, 88, 6, 78, 85, 5, 80, 90, 5, 88, 86, 5],
}
df = pd.DataFrame(data)

bias_graph = detect_bias(df)
bias_graph.show()
```
- The function assumes that gender or religion is the primary categorical factor influencing bias detection.
- Teachers with ≤5 students are excluded to ensure statistical validity.
- The function does not infer causation but highlights statistical correlations.
BeyondTheMarks is an analytical tool designed to process and evaluate student performance data from academic marksheets. The application identifies trends, biases, and effectiveness in teaching methodologies, providing deep insights into the educational environment. With features like professor performance analysis, bias detection, and subject performance comparison, it transforms raw data into actionable insights.
- Supports `.csv` and `.xlsx` formats.
- Performs rigorous validation to ensure data integrity.
- Enforces structural laws:
- Unique Roll Numbers.
- Proper subject-wise attendance and marks format.
- Optional but structured teacher column.
- Numerical values within permissible limits.
- Identifies subject-wise teacher effectiveness.
- Uses attendance and marks to compute a teacher score.
- Visualizes teacher performance through interactive graphs.
- Generates a structured performance matrix for better comparison.
- Detects gender-based discrepancies in marks and attendance.
- Uses visualization techniques to highlight potential bias.
- Requires a `Gender` column for analysis.
- Identifies patterns of religious bias in academic performance.
- Relies on a `Religion` column for meaningful insights.
- Visual representations make bias detection intuitive.
- Compares subjects based on overall performance.
- Displays statistical insights using:
- Correlation matrix
- Box plots
- Scatter plots
- Helps in understanding subject difficulty levels.
- Click the "Upload" button and select a valid `.csv` or `.xlsx` file.
- The system automatically validates the data and provides feedback.
- If successful, the data appears for further analysis.
- Professor Performance Analysis: Evaluates teacher effectiveness.
- Gender Bias Detection: Detects biases based on gender.
- Religious Bias Detection: Identifies biases related to religion.
- Subject Performance Comparison: Analyzes subject difficulty and trends.
- Graphs and matrices provide deep insights.
- Any detected biases or anomalies are highlighted.
- The final data visualization helps in decision-making.
- If invalid data is uploaded, clear error messages guide correction.
- Users must ensure compliance with the "Grand Data Upload Rulebook."