SDG 3 · Good Health & Well‑being

Predicting Cervical Cancer Risk with Machine Learning

An end‑to‑end ML pipeline that learns from patient risk factors to support early detection—covering data cleaning, modeling, evaluation, ethics, and impact.

Python · Pandas · scikit‑learn · Matplotlib · Classification

Project Overview

Early detection of cervical cancer dramatically improves health outcomes. This project builds a supervised machine learning pipeline to predict the probability of a positive biopsy result from easily collected risk factors. The work aligns with SDG 3: Good Health & Well‑being by exploring scalable, data‑driven screening support.

  • Task: Binary classification (predict biopsy: 0/1) from risk‑factor features.
  • Models: Logistic Regression, Random Forest, Support Vector Machine.
  • Key steps: Missing‑value imputation, leakage‑free feature selection, stratified train/test split, modeling pipelines, and evaluation with ROC‑AUC and confusion matrices.

Why this dataset?

I selected the cervical cancer risk factors dataset because (1) it addresses a high‑impact public health problem aligned with SDG 3, and (2) it contains a rich mix of demographic, lifestyle, and clinical history variables that are feasible to collect in low‑resource settings. Predicting a positive biopsy can help prioritize patients for screening where resources are limited.

To maintain scientific validity, I exclude the other screening outcomes—Hinselmann, Schiller, and Citology—from the features, keeping only Biopsy as the target. Because those columns record the results of alternative tests for the same condition, including them would leak outcome information into the model.

End‑to‑End Workflow

1) Data Preparation

  • Replace "?" placeholders with NaN; coerce columns to numeric.
  • Drop rows with missing target; impute features with median.
  • Exclude leakage columns: Hinselmann, Schiller, Citology.
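A minimal sketch of this step, assuming the UCI file name risk_factors_cervical_cancer.csv (adjust the path to your copy):

```python
import numpy as np
import pandas as pd

# File name assumed from the UCI repository; adjust the path to your copy.
df = pd.read_csv("risk_factors_cervical_cancer.csv")

# The raw file encodes missing values as "?"; replace and coerce to numeric.
df = df.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")

# Drop rows with a missing target, then separate features from the target,
# excluding the leakage columns (results of other screening tests).
df = df.dropna(subset=["Biopsy"])
X = df.drop(columns=["Biopsy", "Hinselmann", "Schiller", "Citology"])
y = df["Biopsy"].astype(int)
```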

2) Modeling

  • Stratified 80/20 train‑test split.
  • Pipelines: imputer + (scaler) + classifier.
  • Class imbalance: class_weight='balanced'.
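A sketch of one such pipeline (Logistic Regression shown; the Random Forest and SVM pipelines follow the same pattern, and random_state=42 is an arbitrary choice for reproducibility):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stratified split keeps the rare positive class proportionally represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Imputation and scaling live inside the pipeline so their statistics are
# fit on training data only; class_weight='balanced' counters the imbalance.
logreg = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
logreg.fit(X_train, y_train)
```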

3) Evaluation

  • Confusion matrix; precision/recall/F1 per class.
  • ROC curve + AUC using predicted probabilities.
  • 5‑fold stratified CV AUC for robustness.
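Continuing the sketch above, the evaluation step might look like this:

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]  # predicted probability of a positive biopsy

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
print("Test ROC-AUC:", roc_auc_score(y_test, y_prob))

# 5-fold stratified CV for a more robust AUC estimate than a single split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(logreg, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```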

4) Interpretation

  • Random Forest feature importances (top predictors).
  • Discuss false‑negative risk and threshold tuning.
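A sketch of the importance ranking, reusing the split from the modeling step (hyperparameters left at scikit‑learn defaults):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
rf.fit(X_train, y_train)

# Impurity-based importances; the top entries suggest the strongest predictors.
importances = pd.Series(
    rf.named_steps["clf"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head(10))
```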

Results (Summary)

Across models, Random Forest provided a strong balance of recall and precision and highlighted influential predictors via feature importance. Logistic Regression offered interpretability with competitive performance, and SVM performed well once features were scaled. The final model choice ultimately depends on the chosen operating point (e.g., maximizing recall to minimize missed positives).

  • Primary metric: ROC‑AUC
  • Focus: Recall on the positive class
  • Cross‑validation: 5‑fold stratified
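
As a sketch of threshold tuning toward the recall focus above, one can pick the highest probability cutoff that still meets a target recall (the 0.90 target is purely illustrative, and in practice the threshold should be chosen on a validation split rather than the test set):

```python
from sklearn.metrics import precision_recall_curve

# y_test / y_prob come from the evaluation sketch above.
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Highest threshold whose recall still meets the target; recall[:-1] aligns
# with `thresholds` (the curve has one more point than there are thresholds).
target_recall = 0.90  # illustrative operating point, not a clinical recommendation
meets_target = recall[:-1] >= target_recall
threshold = thresholds[meets_target].max() if meets_target.any() else 0.5
y_pred_tuned = (y_prob >= threshold).astype(int)
print(f"Chosen threshold: {threshold:.3f}")
```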

Ethical Reflection

  • Bias & Representativeness: Results may not generalize beyond the dataset’s population. Seek diverse cohorts and external validation.
  • Fairness: Optimize for low false‑negative rates; monitor subgroup performance.
  • Privacy: Use de‑identified data; follow data governance policies.
  • Sustainability & Impact: Triage tools can prioritize limited screening resources in low‑resource settings.

Explore the Code & Notebook

Open the repository to view the Jupyter notebook, dataset, and instructions.
