Predicting Cancer using GradientBoost Project

This is a project I made to develop my portafolio. Any thoughts or feed back is welcome.

!pip install scikit-optimize

# Load and Preview Dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from skopt import BayesSearchCV

# Load dataset
file_path = '/kaggle/input/cancer-prediction-dataset/The_Cancer_data_1500_V2.csv'
df = pd.read_csv(file_path)

# Display dataset overview

## Feature Categorization

We categorize the features into binary, continuous, and categorical types. This categorization helps in applying appropriate visualization techniques and preprocessing steps specific to each type of feature.

# Feature categorization
binary_features = ['Gender', 'Smoking', 'CancerHistory', 'Diagnosis']
continuous_features = ['Age', 'BMI', 'PhysicalActivity', 'AlcoholIntake']
categorical_features = ['GeneticRisk']

# Visualizations and EDA

This section includes various visualizations to understand the distribution of continuous features and the count of binary and categorical features. These visualizations help in identifying patterns, outliers, and potential relationships between features.

### Histograms and Box Plots for Continuous Features

Histograms provide insights into the distribution of continuous features, while box plots help in identifying outliers and comparing distributions. We use both plots to get a comprehensive understanding of the data.

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for i, feature in enumerate(continuous_features):
    # Histogram
    ax_hist = axes[i // 2, (i % 2) * 2]
    ax_hist.hist(df[feature].dropna(), bins=30, color='skyblue', edgecolor='black')
    ax_hist.set_title(f'Histogram of {feature}')
    # Boxplot
    ax_box = axes[i // 2, (i % 2) * 2 + 1]
    ax_box.boxplot(df[feature].dropna(), vert=True, patch_artist=True, boxprops=dict(facecolor='skyblue'))
    ax_box.set_title(f'Boxplot of {feature}')


### Bar Charts for Binary Features

Bar charts are used to display the count of each category in binary features. This helps in understanding the distribution of these features across the dataset.

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for i, feature in enumerate(binary_features):
    sns.countplot(data=df, x=feature, ax=axes[i // 2, i % 2], palette='Set2')
    axes[i // 2, i % 2].set_title(f'Bar Chart of {feature}')

### Bar Chart for Categorical Feature

This bar chart shows the distribution of the 'GeneticRisk' feature, providing insights into the prevalence of different risk categories in the dataset.

sns.countplot(data=df, x='GeneticRisk', palette='Set2')

### Correlation Heatmap with Diagnosis

A heatmap is used to visualize the correlation between features and the target variable 'Diagnosis'. This helps in identifying which features have a strong relationship with the target variable.

plt.figure(figsize=(8, 10))
sns.heatmap(df.corr()[['Diagnosis']].sort_values(by='Diagnosis', ascending=False), annot=True, cmap='Reds', vmin=-1, vmax=1)
plt.title('Correlation with Diagnosis', color='blue', fontsize=12)

### Box Plots for Continuous Features by Diagnosis

Box plots are used to compare the distribution of continuous features between the diagnosed and non-diagnosed groups. This helps in identifying any significant differences in these features between the two groups.

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
for i, feature in enumerate(continuous_features):
    sns.boxplot(data=df, x='Diagnosis', y=feature, ax=axes[i // 2, i % 2], palette='Set2')

# Summary Statistics:

#### **Age**: Ranges from 20 to 80 years, with a mean of 50.32 years.
#### **Gender**: Almost equally distributed (0: Male, 1: Female).
#### **BMI**: Ranges from 15.00 to 39.96, with a mean of 27.51.
#### **Smoking**: 26.93% are smokers.
#### **GeneticRisk**: Categorical values ranging from 0 to 2.
#### **PhysicalActivity**: Ranges from 0.002 to 9.99 hours per week, with a mean of 4.90 hours.
#### **AlcoholIntake**: Ranges from 0.001 to 4.99 units per week, with a mean of 2.42 units.
#### **CancerHistory**: 14.4% have a family history of cancer.
#### **Diagnosis**: 37.13% have been diagnosed with cancer.

### Binary Features (Bar Charts)
- **Gender**:
  - The distribution of gender is nearly equal, with slightly more males than females.
  - Males: 755 (50.33%)
  - Females: 745 (49.67%)
- **Smoking**:
  - A majority of the individuals in the dataset are non-smokers.
  - Non-Smokers: 1075 (71.67%)
  - Smokers: 425 (28.33%)
- **CancerHistory**:
  - Most individuals do not have a family history of cancer.
  - No Family History of Cancer: 1275 (85.00%)
  - Family History of Cancer: 225 (15.00%)
- **Diagnosis**:
  - More individuals have not been diagnosed with cancer compared to those who have.
  - Not Diagnosed with Cancer: 945 (63.00%)
  - Diagnosed with Cancer: 555 (37.00%)

### Continuous Features (Histograms with KDE)
- **Age**:
  - The distribution of age is relatively uniform across different age groups, with a slight increase in frequency around ages 65 and 80. There is a noticeable dip between the ages of 60 and 65.
  - Mean: ~50 years
- **BMI**:
  - The BMI distribution shows a uniform spread with no distinct peaks, suggesting varied body mass indices across the population.
  - Mean: ~27.5
- **Physical Activity**:
  - The distribution of physical activity hours per week is fairly even, indicating a wide range of physical activity levels among individuals.
  - Mean: ~5 hours per week
- **Alcohol Intake**:
  - The alcohol intake distribution is also relatively uniform, with a slight increase in individuals consuming around 2 units per week.
  - Mean: ~2.5 units per week

### Categorical Feature (Pie Chart)
- **Genetic Risk**:
  - The majority of the population falls under 'Low' genetic risk, with 'Medium' and 'High' risk categories being less frequent.
  - Low Risk: 895 (59.7%)
  - Medium Risk: 447 (29.8%)
  - High Risk: 158 (10.5%)

### Conclusions:
- The dataset seems to have a balanced gender distribution.
- There is a higher prevalence of non-smokers (71.67%) and individuals without a family history of cancer (85.00%).
- Cancer diagnoses are less common (37.00%) compared to non-diagnoses (63.00%).
- The continuous features show a fairly uniform distribution, indicating diverse characteristics within the population.
- Most individuals have a low genetic risk for cancer (59.7%), with fewer individuals in the medium (29.8%) and high (10.5%) risk categories.

The data is very evenly balances throughout the set, with exceptions in some of the binary and categorical features. The continous features don't contain any outliers, meaning the data is pretty cleand and doesn't need much preprocessing.

The model will be built trying to minimize false negatives as much as possible, making recall a priority metric.

# Data Preprocessing

In this step, we define the target and features, and standardize the continuous features. Standardization is important for models that rely on distance calculations, as it ensures that each feature contributes equally to the distance.

# Define the target and features
X = df.drop('Diagnosis', axis=1)
y = df['Diagnosis']

# Standardize the continuous features
scaler = StandardScaler()
X[continuous_features] = scaler.fit_transform(X[continuous_features])

# Picking Model

We define and evaluate several models using Stratified k-Fold Cross-Validation. This technique helps in assessing the model's performance more reliably by reducing the impact of data variability.

# Define models for evaluation
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'k-NN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()

# Evaluate each model using Stratified k-Fold Cross-Validation
results = {}
k = 5
skf = StratifiedKFold(n_splits=k)

for model_name, model in models.items():
    scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
    results[model_name] = {
        'Accuracy': scores.mean()

# Convert results to a DataFrame
results_df = pd.DataFrame(results).T

# Hyperparameter Tuning for Gradient Boosting Classifier

We use Bayesian optimization to tune the hyperparameters of the Gradient Boosting Classifier. This method is more efficient than grid search and can lead to better model performance by exploring the parameter space more effectively.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

param_space = {
    'n_estimators': (170, 180),
    'learning_rate': (0.085, 0.095),
    'max_depth': [2],
    'min_samples_split': [5],
    'min_samples_leaf': [5]

bayes_search = BayesSearchCV(
), y_train)
best_params = bayes_search.best_params_
best_model = GradientBoostingClassifier(**best_params), y_train)

# Histogram of False Negatives for Different Thresholds

We plot a histogram of false negatives for different thresholds to identify the optimal threshold that minimizes false negatives. This is important for our model's evaluation as we aim to reduce the number of missed positive diagnoses.

# Evaluate on test set with a predefined threshold
y_prob = best_model.predict_proba(X_test)[:, 1]
thresholds = np.arange(0.3, 0.61, 0.01)
metrics = {'Threshold': [], 'False Negatives': []}

for threshold in thresholds:
    y_pred = (y_prob >= threshold).astype(int)
    fn = confusion_matrix(y_test, y_pred).ravel()[2]  # Get FN count
    metrics['False Negatives'].append(fn)

metrics_df = pd.DataFrame(metrics)

plt.figure(figsize=(10, 6))
sns.histplot(metrics_df, x='Threshold', weights='False Negatives', bins=len(thresholds), kde=True)
plt.ylabel('Count of False Negatives')
plt.title('Histogram of False Negatives for Different Thresholds (0.3 to 0.6)')
plt.axvline(0.34, color='r', linestyle='--', label='Optimal Threshold: 0.34')

# Final Model Evaluation with Predefined Threshold

I manually input the threshold that I considered had the best results.

optimal_threshold = 0.34
y_pred_final = (y_prob >= optimal_threshold).astype(int)

# Final evaluation
final_test_accuracy = accuracy_score(y_test, y_pred_final)
final_test_precision = precision_score(y_test, y_pred_final)
final_test_recall = recall_score(y_test, y_pred_final)
final_test_f1 = f1_score(y_test, y_pred_final)
conf_matrix = confusion_matrix(y_test, y_pred_final)

print(f"Final Test Accuracy: {final_test_accuracy}")
print(f"Final Test Precision: {final_test_precision}")
print(f"Final Test Recall: {final_test_recall}")
print(f"Final Test F1 Score: {final_test_f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

# Interpretation of Metrics:

1. **Accuracy**:
 - **0.94** indicates that 94% of the predictions made by the model are correct. This means that out of 100 predictions, 94 are expected to be accurate.

2. **Precision**:
 - **0.897** indicates that when the model predicts a positive diagnosis (cancer diagnosis), it is correct 89.7% of the time. Precision is crucial when the cost of false positives is high.

3. **Recall**:
 - **0.946** indicates that the model correctly identifies 94.6% of actual positive cases (cancer diagnoses). High recall is important in this context because it means fewer false negatives, which is critical in medical diagnoses to ensure that most of the true cases are identified.

4. **F1 Score**:
 - **0.921** is the harmonic mean of precision and recall. It provides a balance between precision and recall and is useful when you need to take both metrics into account.

5. **Confusion Matrix**:
 - **True Positives (TP)**: 105
   - These are the cases where the model correctly predicted cancer diagnosis (i.e., the model predicted positive, and it was actually positive).
 - **True Negatives (TN)**: 177
   - These are the cases where the model correctly predicted no cancer diagnosis (i.e., the model predicted negative, and it was actually negative).
 - **False Positives (FP)**: 12
   - These are the cases where the model incorrectly predicted cancer diagnosis (i.e., the model predicted positive, but it was actually negative). False positives could lead to unnecessary stress and additional tests for patients.
 - **False Negatives (FN)**: 6
   - These are the cases where the model incorrectly predicted no cancer diagnosis (i.e., the model predicted negative, but it was actually positive). False negatives are critical as they mean missed diagnoses, which could delay treatment for the patient.

### Practical Implications:

- **High Accuracy and Recall**:
- The model’s high accuracy and recall suggest it is very effective in identifying individuals with cancer. This means that in a practical scenario, the model will correctly identify most patients who actually have cancer, which is vital for early detection and treatment.

- **Moderate Precision**:
- While precision is also quite high, there are still some false positives. This indicates that some patients might be incorrectly diagnosed with cancer, leading to unnecessary follow-up tests and anxiety. However, in medical screening, it is often more acceptable to have a higher number of false positives than false negatives.

- **Balanced F1 Score**:
- The F1 score indicates a good balance between precision and recall. This balance ensures that the model is not only good at identifying true positives but also does not make too many false positive predictions.

### Conclusion:

The model is highly effective for cancer diagnosis with a high accuracy and recall rate, meaning it successfully identifies the majority of true cancer cases. The slightly lower precision indicates that there will be some false positives, but this is often acceptable in medical contexts where the priority is to minimize missed diagnoses (false negatives). The high F1 score confirms that the model maintains a good balance between precision and recall, making it reliable for practical use in medical diagnostics.