
Credit Card Fraud Detection Dataset

284,807 European credit card transactions from September 2013 with PCA-anonymized features. Contains 492 frauds (0.172% fraud rate). A widely used benchmark for imbalanced classification and anomaly detection research.

Tags: finance, fraud-detection, machine-learning, classification, imbalanced-data, anomaly-detection, PCA, credit-card, benchmark-dataset, binary-classification, SMOTE, european-transactions
1 table, 10,000 rows
Last updated: December 27, 2025
Time: September 2013 (48-hour period)
Location: Europe
Created by Dataset Agent

Overview

The Credit Card Fraud Detection Dataset is one of the most widely used benchmarks for machine learning practitioners working on anomaly detection and imbalanced classification problems. Originally released by the Machine Learning Group at Université Libre de Bruxelles (ULB), it contains real credit card transactions made by European cardholders over a two-day period in September 2013.
This sample contains 10,000 transactions with only 16 fraudulent cases—a 0.16% fraud rate that mirrors real-world imbalanced scenarios.
SQL
SELECT Class, COUNT(*) AS count FROM creditcard.csv GROUP BY Class
Data
Class | Count
0 | 9,984
1 | 16
The extreme class imbalance makes this dataset particularly challenging—a naive classifier predicting all transactions as legitimate would achieve 99.84% accuracy while catching zero frauds. This characteristic makes it ideal for developing and testing techniques like SMOTE, cost-sensitive learning, and anomaly detection algorithms.
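The arithmetic behind that baseline can be checked directly. The sketch below uses the class counts from the query above (9,984 legitimate, 16 fraudulent); the function name `majority_baseline` is illustrative and not part of any library.

```python
def majority_baseline(n_legit: int, n_fraud: int) -> dict:
    """Score a classifier that labels every transaction as legitimate (class 0)."""
    total = n_legit + n_fraud
    true_negatives = n_legit  # every legitimate transaction is labelled correctly
    true_positives = 0        # no fraud is ever flagged
    accuracy = (true_negatives + true_positives) / total
    fraud_recall = true_positives / n_fraud if n_fraud else 0.0
    return {"accuracy": accuracy, "fraud_recall": fraud_recall}


# Class counts taken from the 10,000-row sample described above
baseline = majority_baseline(n_legit=9_984, n_fraud=16)
print(f"accuracy={baseline['accuracy']:.2%}, fraud recall={baseline['fraud_recall']:.0%}")
```

High accuracy with zero fraud recall is why accuracy alone says little about a model on this data.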

Feature Structure and Privacy Protection

To protect cardholder confidentiality, the original transaction features underwent Principal Component Analysis (PCA) transformation. The resulting dataset contains 31 features:
  • V1 through V28: PCA-transformed components capturing transaction patterns without revealing sensitive details
  • Time: Seconds elapsed between each transaction and the first transaction in the dataset
  • Amount: Transaction amount in Euros (original scale preserved)
  • Class: Target variable (0 = legitimate, 1 = fraudulent)
The PCA transformation preserves statistical relationships needed for machine learning while ensuring complete privacy protection. However, this means features cannot be interpreted in business terms—V1-V28 have no semantic meaning like 'merchant category' or 'transaction location'.

Key Statistics and Fraud Patterns

Transaction amounts range from €5.00 to €1,066.69, with an average of €105.55 per transaction.
SQL
SELECT MIN(Amount) AS min_amount,
       MAX(Amount) AS max_amount,
       ROUND(AVG(Amount), 2) AS avg_amount
FROM creditcard.csv
Data
Min Amount | Max Amount | Avg Amount
5.00 | 1066.69 | 105.55
A striking pattern emerges when comparing transaction amounts between classes:
Fraudulent transactions average €623.13 compared with €104.72 for legitimate ones, nearly six times as much.
SQL
SELECT Class, ROUND(AVG(Amount), 2) AS avg_amount FROM creditcard.csv GROUP BY Class
Data
Class | Avg Amount
0 | 104.72
1 | 623.13

Fraud Distribution by Transaction Amount

All 9 transactions of €500 or more in this sample are fraudulent. They account for 56% of all fraud cases (9 of 16) while making up less than 0.1% of total transactions.
SQL
SELECT COUNT(*) AS high_value_count,
       SUM(CASE WHEN Class = 1 THEN 1 ELSE 0 END) AS fraud_count
FROM creditcard.csv
WHERE Amount >= 500
Data
High Value Count | Fraud Count
9 | 9
SQL
SELECT CASE
         WHEN Amount < 50 THEN '0-50'
         WHEN Amount < 100 THEN '50-100'
         WHEN Amount < 200 THEN '100-200'
         WHEN Amount < 500 THEN '200-500'
         ELSE '500+'
       END AS amount_range,
       COUNT(*) AS count,
       SUM(CASE WHEN Class = 1 THEN 1 ELSE 0 END) AS fraud_count
FROM creditcard.csv
GROUP BY amount_range
Data
Amount Range | Total Transactions | Fraudulent
0-50 | 2,262 | 0
50-100 | 2,508 | 0
100-200 | 4,971 | 2
200-500 | 250 | 5
500+ | 9 | 9
This pattern suggests that high-value transactions warrant additional scrutiny in fraud detection systems, though real-world implementations should avoid creating rules that are too easily gamed by fraudsters.
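To make the per-range comparison concrete, the following sketch mirrors the bucketing used in the query and turns the counts from the table above into fraud rates. The function `amount_bucket` and the dictionary names are illustrative assumptions; the counts are copied from the table, not recomputed from the data.

```python
def amount_bucket(amount: float) -> str:
    """Mirror the CASE expression: lower bound inclusive, upper bound exclusive."""
    if amount < 50:
        return "0-50"
    if amount < 100:
        return "50-100"
    if amount < 200:
        return "100-200"
    if amount < 500:
        return "200-500"
    return "500+"


# (total transactions, fraudulent transactions) per range, copied from the table above
bucket_counts = {
    "0-50": (2_262, 0),
    "50-100": (2_508, 0),
    "100-200": (4_971, 2),
    "200-500": (250, 5),
    "500+": (9, 9),
}

# Fraud rate within each range = fraudulent / total
fraud_rates = {name: fraud / total for name, (total, fraud) in bucket_counts.items()}
for name, rate in fraud_rates.items():
    print(f"{name:>8}: {rate:.2%}")
```

The rate climbs from zero in the two lowest ranges to 2% in the 200-500 range, but with only 16 fraud cases in the sample these figures are indicative rather than stable estimates.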

PCA Feature Analysis

Despite anonymization, some PCA components differ noticeably between fraudulent and legitimate transactions. The query below computes the class means for V17; the table lists the same comparison for five components:
SQL
SELECT ROUND(AVG(CASE WHEN Class = 0 THEN V17 END), 3) AS legit_v17,
       ROUND(AVG(CASE WHEN Class = 1 THEN V17 END), 3) AS fraud_v17
FROM creditcard.csv
Data
Feature | Legitimate | Fraudulent
V1 | 0.002 | -0.24
V2 | -0.006 | -0.25
V3 | -0.006 | 0.1
V14 | -0.008 | 0.06
V17 | 0.003 | -0.76
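V17 shows the largest gap: fraudulent transactions average -0.759 compared with 0.003 for legitimate ones. Because the legitimate mean is close to zero, the ratio of the two means (roughly 250x) overstates the effect; the absolute gap of about 0.76 is the more meaningful figure. V17 is a potentially strong predictor, but with only 16 fraud cases in this sample the estimate is noisy, so models should not rely on any single feature.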
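The same per-class comparison can be run for several components at once. The sketch below is a dependency-free illustration; `class_means` is a hypothetical helper, and the toy rows use made-up values that only show the call shape, not figures from the dataset.

```python
from collections import defaultdict


def class_means(rows, features):
    """Return {feature: {class_label: mean}} for the requested feature columns."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        label = row["Class"]
        counts[label] += 1
        for feature in features:
            sums[feature][label] += row[feature]
    return {
        feature: {label: sums[feature][label] / counts[label] for label in counts}
        for feature in features
    }


# Toy rows with made-up values (NOT taken from the dataset)
toy_rows = [
    {"V17": 0.2, "Class": 0},
    {"V17": -0.2, "Class": 0},
    {"V17": -1.0, "Class": 1},
    {"V17": -0.5, "Class": 1},
]
means = class_means(toy_rows, ["V17"])
print(means)
```

With the real table loaded as a list of dictionaries, passing `["V1", "V2", "V3", "V14", "V17"]` would produce the five-row comparison in one pass.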

Handling Class Imbalance

The extreme class imbalance (0.16% fraud rate) requires specialized handling. Standard classifiers will be biased toward predicting the majority class, achieving high accuracy while missing most frauds. Recommended techniques include:
  • SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic fraud examples by interpolating between existing fraud cases
  • Random Undersampling: Reduce legitimate transactions to balance classes (loses information)
  • Cost-Sensitive Learning: Assign higher misclassification penalties for the fraud class
  • Anomaly Detection: Treat fraud as anomalies using Isolation Forest or One-Class SVM
  • Ensemble Methods: Combine multiple resampling strategies with bagging
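Two of the techniques listed above can be sketched with the standard library alone. The first function applies the common "balanced" weighting heuristic, n_samples / (n_classes * class_count), to the class counts of this sample. The second illustrates the interpolation idea behind SMOTE on made-up two-dimensional points. Both function names are illustrative, the toy points are not from the dataset, and a maintained library implementation is the better choice for real work.

```python
import math
import random


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


def smote_like_samples(minority_points, n_new, k=2, seed=0):
    """Create synthetic minority points by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority_points))
        base = minority_points[base_index]
        # k nearest other minority points by Euclidean distance
        others = [p for i, p in enumerate(minority_points) if i != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Class counts from this sample: 9,984 legitimate, 16 fraudulent
weights = balanced_class_weights({0: 9_984, 1: 16})
print(weights)

# Made-up minority points, used only to demonstrate the interpolation
toy_fraud_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
synthetic_points = smote_like_samples(toy_fraud_points, n_new=5)
print(synthetic_points)
```

Under this heuristic a misclassified fraud costs several hundred times as much as a misclassified legitimate transaction. Every synthetic point lies on a segment between two existing fraud points, which is why oversampling should be applied to the training split only.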
Before training models, apply these preprocessing steps for optimal results:
  • Scale the Amount column: Use StandardScaler or RobustScaler. V1-V28 come out of PCA centered near zero, while Amount remains on its original scale
  • Engineer the Time feature: Convert to cyclical features (hour of day using sine/cosine) or create velocity features
  • Use stratified splitting: Preserve the fraud ratio in train/test splits with stratify=y parameter
  • Consider dropping Time: Some practitioners find Time adds noise rather than signal for this dataset
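Two of these steps can be sketched without external dependencies. The `stratify=y` parameter mentioned above belongs to scikit-learn's `train_test_split`; the `stratified_split` function here is a simplified stand-in written for illustration, and `time_of_day_features` is likewise a hypothetical helper. Note that Time counts seconds from the first transaction rather than from midnight, so the encoded "time of day" is offset by an unknown amount.

```python
import math
import random

SECONDS_PER_DAY = 24 * 60 * 60


def time_of_day_features(seconds_since_start):
    """Encode position within a 24-hour cycle as (sin, cos) on the unit circle."""
    angle = 2 * math.pi * (seconds_since_start % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Labels with the same class counts as this sample (9,984 legitimate, 16 fraudulent)
labels = [0] * 9_984 + [1] * 16
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The cyclical encoding keeps the end of one day adjacent to the start of the next, which a raw seconds counter does not. With so few fraud cases, an unstratified random split could easily leave the test set with no frauds at all.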

Benchmark Performance Expectations

When training models on this dataset, expect the following approximate performance ranges (results vary based on preprocessing and hyperparameters):
Accuracy is misleading for this dataset. A model predicting all transactions as legitimate achieves 99.84% accuracy but catches zero frauds. Focus on precision, recall, and F1-score for the fraud class, and on the area under the precision-recall curve (AUPRC).
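As a minimal sketch of how these metrics are computed for the fraud class, the function below counts true positives, false positives, and false negatives by hand. The predictions are hypothetical (12 of 16 frauds caught, 8 false alarms) and chosen only to illustrate the calculation; they are not a benchmark result. AUPRC requires ranked scores rather than hard labels and is not covered here.

```python
def fraud_metrics(y_true, y_pred):
    """Accuracy plus precision, recall, and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# True labels with the class counts of this sample
y_true = [0] * 9_984 + [1] * 16
# HYPOTHETICAL predictions: 8 false alarms, 12 frauds caught, 4 frauds missed
y_pred = [0] * 9_976 + [1] * 8 + [1] * 12 + [0] * 4

metrics = fraud_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case accuracy is again above 99.8%, while precision, recall, and F1 show that a quarter of the frauds are missed and two in five alerts are false alarms.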

Sample Fraudulent Transactions

SQL
SELECT Time,
       ROUND(V1, 3) AS V1,
       ROUND(V2, 3) AS V2,
       ROUND(V3, 3) AS V3,
       ROUND(V4, 3) AS V4,
       Amount,
       Class
FROM creditcard.csv
WHERE Class = 1
LIMIT 5
Data
Time | V1 | V2 | V3 | V4 | Amount | Class
73,455 | 0.66 | -0.92 | 0.91 | -1.31 | 357.08 | 1
131,327 | -0.29 | 0.7 | -0.58 | -0.77 | 341.27 | 1
263,577 | 1.81 | -0.75 | 0.39 | 1.89 | 445.86 | 1
300,167 | -1.49 | -0.16 | -1.93 | -0.04 | 1,024.73 | 1
449,294 | -1.48 | -1.17 | 0.79 | -1.95 | 555.33 | 1

Alternative Fraud Detection Datasets

Depending on your requirements, consider these alternative datasets for fraud detection research:

Historical Context and Provenance

This dataset originates from research conducted by the Machine Learning Group at Université Libre de Bruxelles (ULB) in collaboration with Worldline. The transactions were made by European cardholders during a 48-hour period in September 2013. The full dataset contains 284,807 transactions with 492 frauds; this version is a curated 10,000-transaction sample that preserves the essential characteristics and class imbalance ratio.
The dataset was first published in the paper 'Calibrating Probability with Undersampling for Unbalanced Classification' at the 2015 IEEE Symposium Series on Computational Intelligence. Since then, it has become a standard reference for benchmarking fraud detection algorithms, widely cited in academic papers and used in many Kaggle notebooks and tutorials.

Known Limitations

This is a sampled subset of the original dataset. While this sample preserves the class imbalance ratio, some patterns present in the complete 284,807-transaction dataset may not be fully represented here.
  • No current fraud patterns: Data is from September 2013; fraud techniques have evolved significantly since then
  • PCA prevents interpretability: Cannot explain predictions in business terms (e.g., 'flagged due to unusual merchant category')
  • European transactions only: Patterns may not generalize to North American, Asian, or other regional transaction behaviors
  • No merchant or cardholder context: Cannot create domain-specific features like merchant category codes or customer history
  • Two-day window only: Does not capture seasonal patterns, monthly cycles, or long-term behavioral changes
  • Sample size constraints: This 10,000-transaction sample may miss edge cases present in the full dataset

Production Deployment Considerations

This dataset is excellent for learning, prototyping, and benchmarking algorithms, but has important limitations for production fraud systems:
  • Suitable: Algorithm development and validation
  • Suitable: Learning imbalanced classification techniques
  • Suitable: Benchmarking model performance
  • Limitation: PCA features don't transfer to new data without the original transformation matrix
  • Limitation: Real systems need continuous retraining on fresh data to catch evolving fraud patterns
  • Limitation: Production systems require real-time feature engineering not possible with this static dataset
Use this dataset to develop and validate your approach, then retrain on your organization's actual transaction data for production deployment.

Table Overview

creditcard

Contains 10,000 rows and 31 columns. Column types: 31 numeric.


Data Preview

Row | Time | V1 | V2 | V3 | V4
1 | 21 | 0.43 | 0.15 | -0.86 | 0.86
2 | 126 | -0.02 | 0.67 | -0.57 | -0.87
3 | 286 | -0.19 | -0.4 | 0.95 | -0.14
(+26 more columns)

Data Profile

Rows: 10,000
Columns: 31
Completeness: 100%
Estimated size: 14.8 MB

Column Types

31 Numeric

High-Cardinality Columns

Columns with many unique values. In this table they are continuous numeric features rather than identifiers or categorical fields.

  • Time (10,000 unique values)
  • V18 (9,983 unique values)
  • V12 (9,982 unique values)
  • V19 (9,979 unique values)
  • V27 (9,979 unique values)
  • V7 (9,978 unique values)
  • V13 (9,978 unique values)
  • V28 (9,978 unique values)
  • V2 (9,977 unique values)
  • V14 (9,977 unique values)
  • V23 (9,977 unique values)
  • V4 (9,976 unique values)
  • V8 (9,976 unique values)
  • V26 (9,976 unique values)
  • V3 (9,975 unique values)
  • V11 (9,975 unique values)
  • V21 (9,975 unique values)
  • V5 (9,974 unique values)
  • V10 (9,974 unique values)
  • V17 (9,974 unique values)
  • V22 (9,971 unique values)
  • V16 (9,970 unique values)
  • V6 (9,969 unique values)
  • V15 (9,969 unique values)
  • V20 (9,969 unique values)
  • V24 (9,968 unique values)
  • V25 (9,968 unique values)
  • V9 (9,961 unique values)
  • V1 (9,959 unique values)
  • Amount (7,858 unique values)

Data Dictionary

creditcard

Column | Type | Example | Missing Values
Time | numeric | 21, 126 | 0
V1 | numeric | 0.425918, -0.016846 | 0
V2 | numeric | 0.14986, 0.671503 | 0
V3 | numeric | -0.859045, -0.567677 | 0
V4 | numeric | 0.861877, -0.870574 | 0
V5 | numeric | 0.068215, -0.472092 | 0
V6 | numeric | 0.121616, 0.596326 | 0
V7 | numeric | -0.767256, -0.844182 | 0
V8 | numeric | 0.810531, 0.628362 | 0
V9 | numeric | -0.113886, 0.762508 | 0
V10 | numeric | -0.118432, -0.100272 | 0
V11 | numeric | -0.68963, 0.817203 | 0
V12 | numeric | -0.660487, -0.848729 | 0
V13 | numeric | 0.09042, -0.607883 | 0
V14 | numeric | -0.501866, 0.443295 | 0
V15 | numeric | -0.148025, -0.931149 | 0
V16 | numeric | -0.636372, 0.730667 | 0
V17 | numeric | 0.798721, 0.391612 | 0
V18 | numeric | 0.610895, -0.912661 | 0
V19 | numeric | 0.988962, -0.787397 | 0
V20 | numeric | -0.948803, -0.563796 | 0
V21 | numeric | 0.543633, -0.942843 | 0
V22 | numeric | -0.525624, 0.737447 | 0
V23 | numeric | 0.077771, 0.01516 | 0
V24 | numeric | 0.164373, 0.483361 | 0
V25 | numeric | 0.483061, -0.249131 | 0
V26 | numeric | 0.666458, -0.527051 | 0
V27 | numeric | -0.307762, 0.252673 | 0
V28 | numeric | 0.024361, 0.947974 | 0
Amount | numeric | 33.9, 101.37 | 0
Class | numeric | 0, 0 | 0
Last updated: December 27, 2025
Created: December 26, 2025