An end‑to‑end ML pipeline that learns from patient risk factors to support early detection—covering data cleaning, modeling, evaluation, ethics, and impact.
Early detection of cervical cancer dramatically improves health outcomes. This project builds a supervised machine learning pipeline to predict the probability of a positive biopsy result from easily collected risk factors. The work aligns with SDG 3: Good Health & Well‑being by exploring scalable, data‑driven screening support.
I selected the cervical cancer risk factors dataset because (1) it addresses a high‑impact public health problem aligned to SDG 3, and (2) it contains a rich mix of demographic, lifestyle, and clinical history variables that are feasible to collect in low‑resource settings. Predicting a positive biopsy can help prioritize patients for screening where resources are limited.
To maintain scientific validity, I exclude direct screening outcomes such as Hinselmann, Schiller, and Citology from the features, keeping only Biopsy as the target; this prevents information leakage.
Replace "?" placeholders with NaN; coerce columns to numeric.
Handle class imbalance with class_weight='balanced'.
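The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names `prepare` and `balanced_class_weights`, the toy rows, and the `Age` and `Smokes` columns are assumptions made for the example. The weight formula is the one scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
import pandas as pd

TARGET = "Biopsy"
# Other screening outcomes named in the text; dropped to avoid leakage.
LEAKY_COLUMNS = ["Hinselmann", "Schiller", "Citology"]


def prepare(df: pd.DataFrame):
    """Clean the raw frame and return (X, y) without leaky columns."""
    # Replace "?" placeholders with NaN, then coerce every column to numeric.
    cleaned = df.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")
    y = cleaned[TARGET]
    X = cleaned.drop(columns=LEAKY_COLUMNS + [TARGET])
    return X, y


def balanced_class_weights(y) -> dict:
    """Per-class weights: n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Age": ["23", "41", "?", "35"],
    "Smokes": ["0", "?", "1", "0"],
    "Hinselmann": ["0", "0", "0", "1"],
    "Schiller": ["0", "0", "0", "1"],
    "Citology": ["0", "0", "0", "0"],
    "Biopsy": ["0", "0", "0", "1"],
})

X, y = prepare(toy)
weights = balanced_class_weights(y)
print(list(X.columns))
print(weights)
```

With three negatives and one positive, the rare positive class receives the larger weight, which is how balanced weighting pushes a model to pay more attention to the minority class. The remaining NaN values would still need imputation before fitting most models; that step is not shown here.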
Across models, Random Forest provided a strong balance of recall and precision and highlighted influential predictors via feature importance. Logistic Regression offered interpretability with competitive performance, and SVM performed well after scaling. Final model choice can depend on the chosen operating point (e.g., maximizing recall to minimize missed positives).
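To make the idea of an operating point concrete, here is one possible way to pick a decision threshold from predicted probabilities: take the highest threshold that still reaches a target recall. The function name `threshold_for_recall`, the `min_recall` parameter, and the toy labels and scores are all hypothetical and do not come from the project's results.

```python
import numpy as np


def threshold_for_recall(y_true, y_prob, min_recall=0.9):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall is at least min_recall, or None if none qualifies."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_pos = int(np.sum(y_true == 1))
    if n_pos == 0:
        raise ValueError("y_true contains no positive samples")

    best = None
    # np.unique returns candidates in ascending order; recall can only
    # fall as the threshold rises, so the last match is the highest one.
    for t in np.unique(y_prob):
        pred = y_prob >= t
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        recall = tp / n_pos
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if recall >= min_recall:
            best = (float(t), recall, precision)
    return best


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

strict = threshold_for_recall(y_true, y_prob, min_recall=1.0)
relaxed = threshold_for_recall(y_true, y_prob, min_recall=0.6)
print(strict)
print(relaxed)
```

In this toy case, demanding that no positive is missed forces a lower threshold and admits more false positives than the relaxed target does. In a screening context that trade-off is a clinical and resource decision rather than a purely statistical one.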
Open the repository to view the Jupyter notebook, dataset, and instructions.
Tip: host this page on GitHub Pages to submit as a shareable link.