Predicting Credit Card Fraud - Logistic Regression

Hi!

I just finalised my code for the “Predicting Credit Card Fraud” on the Data Science: Machine Learning Specialist skillpath, but my model score and predictions don’t add up (…seems to good). I have tried to identify where my error is, but I can’t figure it out. I’ve enclosed my code and would love some help (…and feedback!).

Link to exercise: https://www.codecademy.com/paths/data-science/tracks/dsml-supervised-learning-i/modules/mle-logistic-regression-66e97567-68e4-431c-895f-c3c64abc5547/projects/logreg-credit-card-fraud

import seaborn
import pandas as pd
import numpy as np
import codecademylib3
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
transactions = pd.read_csv('transactions.csv')
print(transactions.head(10))
print(transactions.info)

# Summary statistics on amount column
print(transactions['amount'].describe())

# Create isPayment field
def cond_isPayment(x):
  if x == 'PAYMENT':
    return 1
  elif x == 'DEBIT':
    return 1
  else:
    return 0

func = np.vectorize(cond_isPayment)
isPayment = func(transactions['type'])
transactions['isPayment'] = isPayment

print(transactions['isPayment'].head(10))

# Create isMovement field
def cond_isMovement(x):
  if x == 'CASH_OUT':
    return 1
  elif x == 'TRANSFER':
    return 1
  else:
    return 0

func = np.vectorize(cond_isMovement)
isMovement = func(transactions['type'])
transactions['isMovement'] = isMovement

print(transactions['isMovement'].head(5))

# Create accountDiff field
transactions['accountDiff'] = transactions['oldbalanceOrg'] - transactions['oldbalanceDest']

print(transactions['accountDiff'].head(5))

# Create features and label variables
X = transactions[['amount', 'isPayment', 'isMovement', 'accountDiff']]
print(X.head(5))

y = transactions['isFraud']
print(y.head(5))

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Normalize the features variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model to the training data
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

# Score the model on the training data
print(lr.score(X_train_scaled, y_train))

# Score the model on the test data
print(lr.score(X_test_scaled, y_test))

# Print the model coefficients
print(lr.coef_)

# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])

# Combine new transactions into a single array
sample_transactions = np.array([transaction1, transaction2, transaction3])
print(sample_transactions)

# Normalize the new transactions
sample_transactions_scaled = scaler.fit_transform(sample_transactions)

# Predict fraud on the new transactions
predicted_fraud = lr.predict(sample_transactions_scaled)
print(predicted_fraud)

# Show probabilities on the new transactions
predicted_prob_fraud = lr.predict_proba(sample_transactions_scaled)
print(predicted_prob_fraud)

Did you try printing the confusion matrix? I have the same problem and I think the score is high because there are just very few fraudulent transactions overall. If you have 9990 non-fraudulent transactions and 10 fraudulent transactions, even if you classify 100% of fraudulent transactions as non-fraudulent, you can still end up with a pretty high score.

Confusion matrix helps because you can see the actual number of false negatives and false positives.

Can anyone tell me why the fit_transform then predictions dont work with each transaction singularly?

transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])

transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])

transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])

sample_transactions = np.array([transaction1, transaction2, transaction3])

fit_sample_transactions = scale.fit_transform(sample_transactions)

fit_sample_trans1 = scale.fit_transform(np.array([transaction1]))

fit_sample_trans2 = scale.fit_transform(np.array([transaction2]))

fit_sample_trans3 = scale.fit_transform(np.array([transaction3]))

print(model.predict_proba(fit_sample_transactions))

print(model.predict_proba(fit_sample_trans1))

print(model.predict_proba(fit_sample_trans2))

print(model.predict_proba(fit_sample_trans3))

For example, I see that it doesn’t work, you get the same outcome for all 3, but i am not sure why? And think it is probably something worth knowing. Also is there a way to check all 3 transactions separately properly?

What was your score? For the mod dataset i got like 95% and I am positive I got it right so if its that you’re good. The full dataset is where I am surprised. Also remember for the score a True positive and True false both count as True, so if your only considering True positive and so doing like total/num_true_positive only that may be why you are confused. Realize this is super late but yeah xD

Can anyone help me to understand why we use sometimes .transform() and sometimes .fit_transform() ?

I am a bit lost with the difference between the two and when they are applicable ?

Thanks a lot!