Predicting Credit Card Fraud - Logistic Regression


I just finalised my code for the “Predicting Credit Card Fraud” project on the Data Science: Machine Learning Specialist skill path, but my model score and predictions don’t add up (…they seem too good). I have tried to identify where my error is, but I can’t figure it out. I’ve enclosed my code and would love some help (…and feedback!).

Link to exercise:

import seaborn
import pandas as pd
import numpy as np
import codecademylib3
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
transactions = pd.read_csv('transactions.csv')

# Summary statistics on amount column
print(transactions['amount'].describe())

# Create isPayment field
# (the original `return 0` was indented under the elif's return,
# making it unreachable; it must sit at function level)
def cond_isPayment(x):
  if x == 'PAYMENT':
    return 1
  elif x == 'DEBIT':
    return 1
  return 0

func = np.vectorize(cond_isPayment)
isPayment = func(transactions['type'])
transactions['isPayment'] = isPayment


# Create isMovement field
# (same fix as above: `return 0` must not be unreachable)
def cond_isMovement(x):
  if x == 'CASH_OUT':
    return 1
  elif x == 'TRANSFER':
    return 1
  return 0

func = np.vectorize(cond_isMovement)
isMovement = func(transactions['type'])
transactions['isMovement'] = isMovement


# Create accountDiff field
transactions['accountDiff'] = transactions['oldbalanceOrg'] - transactions['oldbalanceDest']


# Create features and label variables
X = transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']]

y = transactions['isFraud']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Normalize the features variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model to the training data
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

# Score the model on the training data
print(lr.score(X_train_scaled, y_train))

# Score the model on the test data
print(lr.score(X_test_scaled, y_test))

# Print the model coefficients
print(lr.coef_)

# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])

# Combine new transactions into a single array
sample_transactions = np.array([transaction1, transaction2, transaction3])

# Normalize the new transactions
# (use transform, not fit_transform, so the new data is scaled with the
# training-set statistics instead of being re-fitted on three rows)
sample_transactions_scaled = scaler.transform(sample_transactions)

# Predict fraud on the new transactions
predicted_fraud = lr.predict(sample_transactions_scaled)
print(predicted_fraud)

# Show probabilities on the new transactions
predicted_prob_fraud = lr.predict_proba(sample_transactions_scaled)
print(predicted_prob_fraud)

Did you try printing the confusion matrix? I have the same problem, and I think the score is high because there are very few fraudulent transactions overall. If you have 9,990 non-fraudulent transactions and only 10 fraudulent ones, then even if you misclassify 100% of the fraudulent transactions as non-fraudulent, you can still end up with a pretty high accuracy score.

The confusion matrix helps because you can see the actual number of false negatives and false positives.
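Here is a minimal sketch of that effect on made-up data (the 9,990/10 split and all variable names are illustrative, not from the exercise dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy imbalanced dataset: 9,990 "non-fraud" rows, 10 "fraud" rows,
# with random features that carry no real signal
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 4))
y = np.zeros(10000, dtype=int)
y[:10] = 1  # only 10 fraudulent transactions

clf = LogisticRegression().fit(X, y)

# Accuracy looks great even if the model never predicts fraud (~0.999)
print(clf.score(X, y))

# The confusion matrix exposes the false negatives hiding in that score:
# rows are true classes, columns are predicted classes
print(confusion_matrix(y, clf.predict(X)))
```

If the bottom-left cell (false negatives) holds nearly all 10 fraud cases, the high score is just the class imbalance talking, which is why `lr.score` alone can be misleading here.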