SDG 3 · Good Health & Well‑being

Predicting Cervical Cancer Risk with Machine Learning

An end‑to‑end ML pipeline that learns from patient risk factors to support early detection—covering data cleaning, modeling, evaluation, ethics, and impact.

Python · Pandas · scikit‑learn · Matplotlib · Classification

Project Overview

Early detection of cervical cancer dramatically improves health outcomes. This project builds a supervised machine learning pipeline to predict the probability of a positive biopsy result from easily collected risk factors. The work aligns with SDG 3: Good Health & Well‑being by exploring scalable, data‑driven screening support.

  • Task: Binary classification (predict biopsy: 0/1) from risk‑factor features.
  • Models: Logistic Regression, Random Forest, Support Vector Machine.
  • Key steps: Missing‑value imputation, leakage‑free feature selection, stratified train/test split, modeling pipelines, and evaluation with ROC‑AUC and confusion matrices.

Why this dataset?

I selected the cervical cancer risk factors dataset because (1) it addresses a high‑impact public health problem aligned with SDG 3, and (2) it contains a rich mix of demographic, lifestyle, and clinical history variables that are feasible to collect in low‑resource settings. Predicting a positive biopsy can help prioritize patients for screening where resources are limited.

To maintain scientific validity, I exclude the other screening outcomes—Hinselmann, Schiller, and Citology—from the features, keeping only Biopsy as the target. Because those columns record the results of alternative tests for the same condition, including them would leak outcome information into the model.

End‑to‑End Workflow

1) Data Preparation

  • Replace "?" placeholders with NaN; coerce columns to numeric.
  • Drop rows with missing target; impute features with median.
  • Exclude leakage columns: Hinselmann, Schiller, Citology.
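A minimal sketch of this step, assuming the UCI file name risk_factors_cervical_cancer.csv (adjust the path to your copy):

```python
import numpy as np
import pandas as pd

# File name assumed from the UCI repository; adjust the path to your copy.
df = pd.read_csv("risk_factors_cervical_cancer.csv")

# The raw file encodes missing values as "?"; replace and coerce to numeric.
df = df.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")

# Drop rows with a missing target, then separate features from the target,
# excluding the leakage columns (results of other screening tests).
df = df.dropna(subset=["Biopsy"])
X = df.drop(columns=["Biopsy", "Hinselmann", "Schiller", "Citology"])
y = df["Biopsy"].astype(int)
```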

2) Modeling

  • Stratified 80/20 train‑test split.
  • Pipelines: imputer + (scaler) + classifier.
  • Class imbalance: class_weight='balanced'.
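A sketch of one such pipeline (Logistic Regression shown; the Random Forest and SVM pipelines follow the same pattern, and random_state=42 is an arbitrary choice for reproducibility):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stratified split keeps the rare positive class proportionally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imputation and scaling live inside the pipeline so their statistics are
# fit on training data only; class_weight='balanced' counters the imbalance.
logreg = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
logreg.fit(X_train, y_train)
```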

3) Evaluation

  • Confusion matrix; precision/recall/F1 per class.
  • ROC curve + AUC using predicted probabilities.
  • 5‑fold stratified CV AUC for robustness.
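Continuing the sketch above, the evaluation step might look like this:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]  # predicted probability of a positive biopsy

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
print("Test ROC-AUC:", roc_auc_score(y_test, y_prob))

# 5-fold stratified CV for a more robust AUC estimate than a single split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(logreg, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```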

4) Interpretation

  • Random Forest feature importances (top predictors).
  • Discuss false‑negative risk and threshold tuning.
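A sketch of the importance ranking, reusing the split from the modeling step (hyperparameters left at scikit‑learn defaults):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
rf.fit(X_train, y_train)

# Impurity-based importances; the top entries suggest the strongest predictors.
importances = pd.Series(
    rf.named_steps["clf"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head(10))
```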

Results (Summary)

Across models, Random Forest provided a strong balance of recall and precision and highlighted influential predictors via feature importance. Logistic Regression offered interpretability with competitive performance, and SVM performed well once features were scaled. The final model choice ultimately depends on the chosen operating point (e.g., maximizing recall to minimize missed positives).

  • Primary metric: ROC‑AUC
  • Focus: Recall on the positive class
  • Cross‑validation: 5‑fold stratified
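
As a sketch of threshold tuning toward the recall focus above, one can pick the highest probability cutoff that still meets a target recall (the 0.90 target is purely illustrative, and in practice the threshold should be chosen on a validation split rather than the test set):

```python
from sklearn.metrics import precision_recall_curve

# y_test / y_prob come from the evaluation sketch above.
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Highest threshold whose recall still meets the target; recall[:-1] aligns
# with `thresholds` (the curve has one more point than there are thresholds).
target_recall = 0.90  # illustrative operating point, not a clinical recommendation
meets_target = recall[:-1] >= target_recall
threshold = thresholds[meets_target].max() if meets_target.any() else 0.5
y_pred_tuned = (y_prob >= threshold).astype(int)
print(f"Chosen threshold: {threshold:.3f}")
```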

Ethical Reflection

  • Bias & Representativeness: Results may not generalize beyond the dataset’s population. Seek diverse cohorts and external validation.
  • Fairness: Optimize for low false‑negative rates; monitor subgroup performance.
  • Privacy: Use de‑identified data; follow data governance policies.
  • Sustainability & Impact: Triage tools can prioritize limited screening resources in low‑resource settings.

Explore the Code & Notebook

Open the repository to view the Jupyter notebook, dataset, and instructions.
