What is the Credit Card Fraud Detection Dataset?

284,807 European credit card transactions from September 2013 with PCA-anonymized features. Contains 492 frauds (0.172% fraud rate). It is widely treated as the gold standard benchmark for imbalanced classification and anomaly detection research.

How can I download the Credit Card Fraud Detection Dataset?

You can download the Credit Card Fraud Detection Dataset in CSV or Parquet format directly from this page. Each table has its own download buttons.

What format is the Credit Card Fraud Detection Dataset available in?

This dataset is available in CSV and Parquet formats. CSV works well with spreadsheet applications, while Parquet is optimized for data analysis tools like pandas and DuckDB.

What is the license for the Credit Card Fraud Detection Dataset?

This dataset is available under the Open Database License (ODbL). See the full license at https://opendatacommons.org/licenses/odbl/1-0/

How many tables are in the Credit Card Fraud Detection Dataset?

The Credit Card Fraud Detection Dataset contains 1 table: creditcard.

Credit Card Fraud Detection Dataset

Last updated: December 27, 2025

Source: Machine Learning Group - ULB

License: Open Database License (ODbL)

Version: 1.0

Time: September 2013 (48-hour period)

Location: Europe

Created by Dataset Agent

Overview

The Credit Card Fraud Detection Dataset is one of the most widely used benchmarks for machine learning practitioners working on anomaly detection and imbalanced classification problems. Originally released by the Machine Learning Group at Université Libre de Bruxelles (ULB), it contains real credit card transactions made by European cardholders over a two-day period in September 2013.

This sample contains 10,000 transactions with only 16 fraudulent cases—a 0.16% fraud rate that mirrors real-world imbalanced scenarios.

SQL

SELECT Class, COUNT(*) AS count FROM creditcard.csv GROUP BY Class

Data

Class	Count
0	9984
1	16
2 rows

The extreme class imbalance makes this dataset particularly challenging—a naive classifier predicting all transactions as legitimate would achieve 99.84% accuracy while catching zero frauds. This characteristic makes it ideal for developing and testing techniques like SMOTE, cost-sensitive learning, and anomaly detection algorithms.
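As a quick check of the accuracy figure above, the following minimal sketch rebuilds the label vector from the class counts in the table and scores a baseline that always predicts "legitimate". The variable names are illustrative and not part of the dataset.

```python
# Class counts taken from the query result above.
LEGIT_COUNT = 9984
FRAUD_COUNT = 16

# Ground-truth labels: 0 = legitimate, 1 = fraudulent.
y_true = [0] * LEGIT_COUNT + [1] * FRAUD_COUNT

# Naive baseline: predict "legitimate" for every transaction.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / len(y_true)
fraud_recall = frauds_caught / FRAUD_COUNT

print(f"accuracy={accuracy:.2%}, fraud recall={fraud_recall:.0%}")
# prints: accuracy=99.84%, fraud recall=0%
```

The high accuracy comes entirely from the majority class, which is why recall on the fraud class is the more informative number here.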

Feature Structure and Privacy Protection

To protect cardholder confidentiality, the original transaction features underwent Principal Component Analysis (PCA) transformation. The resulting dataset contains 31 columns, 30 input features plus the target:

V1 through V28: PCA-transformed components capturing transaction patterns without revealing sensitive details
Time: Seconds elapsed between each transaction and the first transaction in the dataset
Amount: Transaction amount in Euros (original scale preserved)
Class: Target variable (0 = legitimate, 1 = fraudulent)

The PCA transformation preserves the statistical relationships needed for machine learning while hiding the original feature values. However, this means features cannot be interpreted in business terms: V1-V28 have no semantic meaning like 'merchant category' or 'transaction location'.

Key Statistics and Fraud Patterns

Transaction amounts range from €5.00 to €1,066.69, with an average of €105.55 per transaction.

SQL

SELECT MIN(Amount) AS min_amount, MAX(Amount) AS max_amount, ROUND(AVG(Amount), 2) AS avg_amount FROM creditcard.csv

Data

Min Amount	Max Amount	Avg Amount
5.00	1066.69	105.55
1 row

A striking pattern emerges when comparing transaction amounts between classes:

Fraudulent transactions average €623.13 compared to just €104.72 for legitimate ones, nearly 6x higher.

SQL

SELECT Class, ROUND(AVG(Amount), 2) AS avg_amount FROM creditcard.csv GROUP BY Class

Data

Class	Avg Amount
0	104.72
1	623.13
2 rows

Fraud Distribution by Transaction Amount

All 9 transactions of €500 or more in this sample are fraudulent. They account for 56% of all fraud cases while making up less than 0.1% of total transactions.

SQL

SELECT COUNT(*) AS high_value_count, SUM( CASE WHEN Class = 1 THEN 1 ELSE 0 END ) AS fraud_count FROM creditcard.csv WHERE Amount >= 500

Data

High Value Count	Fraud Count
9	9
1 row

SQL

SELECT CASE WHEN Amount < 50 THEN '0-50' WHEN Amount < 100 THEN '50-100' WHEN Amount < 200 THEN '100-200' WHEN Amount < 500 THEN '200-500' ELSE '500+' END AS amount_range, COUNT(*) AS count, SUM( CASE WHEN Class = 1 THEN 1 ELSE 0 END ) AS fraud_count FROM creditcard.csv GROUP BY amount_range

Data

Amount Range	Total Transactions	Fraudulent
€0-50	2,262	0
€50-100	2,508	0
€100-200	4,971	2
€200-500	250	5
€500+	9	9
5 rows

This pattern suggests that high-value transactions warrant additional scrutiny in fraud detection systems, though real-world implementations should avoid creating rules that are too easily gamed by fraudsters.
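The per-bucket fraud rates behind this pattern can be recomputed from the table above. The sketch below hard-codes the bucket counts from that table; the tuple layout and variable names are illustrative.

```python
# (amount range, total transactions, fraudulent) copied from the table above.
buckets = [
    ("0-50", 2262, 0),
    ("50-100", 2508, 0),
    ("100-200", 4971, 2),
    ("200-500", 250, 5),
    ("500+", 9, 9),
]

total_transactions = sum(total for _, total, _ in buckets)
total_fraud = sum(fraud for _, _, fraud in buckets)

# Fraud rate inside each bucket, and each bucket's share of all fraud cases.
fraud_rate = {name: fraud / total for name, total, fraud in buckets}
fraud_share = {name: fraud / total_fraud for name, _, fraud in buckets}

for name, total, fraud in buckets:
    print(f"{name:>8}: rate={fraud_rate[name]:.2%}  "
          f"share of fraud={fraud_share[name]:.2%}")
```

In this sample the fraud rate climbs from 0% below 100 to 2% in the 200-500 bucket and 100% in the top bucket, which is a property of this 10,000-row sample rather than a general rule.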

PCA Feature Analysis

Despite anonymization, the PCA components show distinct patterns between fraudulent and legitimate transactions. Several features demonstrate significant separation between classes:

SQL

SELECT ROUND( AVG( CASE WHEN Class = 0 THEN V17 END ), 3 ) AS legit_v17, ROUND( AVG( CASE WHEN Class = 1 THEN V17 END ), 3 ) AS fraud_v17 FROM creditcard.csv

Data

Feature	Legitimate	Fraudulent
V1	0.0020	-0.24
V2	-0.0060	-0.25
V3	-0.0060	0.1
V14	-0.0080	0.06
V17	0.0030	-0.76
5 rows

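V17 shows the most dramatic difference: fraudulent transactions average -0.759 compared to 0.003 for legitimate ones. The roughly 250x ratio is inflated by the near-zero legitimate mean, but the absolute gap of about 0.76 still makes V17 a potentially strong predictor. A single feature rarely separates the classes on its own, so models should combine it with the remaining features.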
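One way to compare the features in the table is by the absolute gap between class means rather than by their ratio, since a ratio becomes unstable when one mean is close to zero. The sketch below uses only the rounded means shown in the table; it ignores within-class spread, so it is a rough ranking and not a feature-importance measure.

```python
# (legitimate mean, fraudulent mean) copied from the table above.
class_means = {
    "V1": (0.002, -0.24),
    "V2": (-0.006, -0.25),
    "V3": (-0.006, 0.10),
    "V14": (-0.008, 0.06),
    "V17": (0.003, -0.76),
}

# Absolute difference between the two class means for each feature.
gaps = {name: abs(fraud - legit) for name, (legit, fraud) in class_means.items()}

# Features ordered from largest to smallest gap.
ranked = sorted(gaps, key=gaps.get, reverse=True)
for name in ranked:
    print(f"{name}: gap={gaps[name]:.3f}")

# The ratio quoted in the text, using the unrounded means -0.759 and 0.003.
ratio_v17 = -0.759 / 0.003
print(f"V17 ratio of means: {ratio_v17:.0f}")
```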

Handling Class Imbalance

The extreme class imbalance (0.16% fraud rate) requires specialized handling. Standard classifiers will be biased toward predicting the majority class, achieving high accuracy while missing most frauds. Recommended techniques include:

SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic fraud examples by interpolating between existing fraud cases
Random Undersampling: Reduce legitimate transactions to balance classes (loses information)
Cost-Sensitive Learning: Assign higher misclassification penalties for the fraud class
Anomaly Detection: Treat fraud as anomalies using Isolation Forest or One-Class SVM
Ensemble Methods: Combine multiple resampling strategies with bagging
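Two of the techniques listed above, cost-sensitive weighting and random undersampling, can be sketched with the standard library alone. The weighting formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic; the function names and the `ratio` and `seed` parameters are illustrative assumptions, and in practice a library such as imbalanced-learn would usually handle resampling.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def random_undersample(labels, ratio=1.0, seed=0):
    """Return row indices keeping every minority row and a random
    subset of majority rows sized ratio * minority_count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    return sorted(minority_idx + rng.sample(majority_idx, n_keep))


# Labels rebuilt from the class counts of this sample.
labels = [0] * 9984 + [1] * 16

weights = balanced_class_weights(labels)
print(weights)  # the fraud class gets a weight of 312.5

kept = random_undersample(labels)
resampled_counts = Counter(labels[i] for i in kept)
print(resampled_counts)  # 16 rows of each class
```

Undersampling to 32 rows discards almost all legitimate transactions, which illustrates the information loss noted in the list. Resampling should be applied to the training split only, so the test set keeps the original fraud rate.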

Recommended Preprocessing Steps

Before training models, apply these preprocessing steps for optimal results:

Scale the Amount column: Use StandardScaler or RobustScaler. V1-V28 are PCA outputs centered near zero, but Amount keeps its original scale
Engineer the Time feature: Convert to cyclical features (hour of day using sine/cosine) or create velocity features
Use stratified splitting: Preserve the fraud ratio in train/test splits by passing stratify=y to scikit-learn's train_test_split
Consider dropping Time: Some practitioners find Time adds noise rather than signal for this dataset
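The steps above can be sketched without external libraries to show what each one does. This is a minimal illustration, not a replacement for scikit-learn's StandardScaler or train_test_split. The function names are hypothetical, and the cyclical encoding assumes a 24-hour cycle measured from the first transaction, since the dataset does not state the clock time at which Time starts.

```python
import math
import random
import statistics

SECONDS_PER_DAY = 24 * 60 * 60


def cyclical_time(seconds):
    """Encode elapsed seconds as (sin, cos) of a 24-hour cycle."""
    angle = 2 * math.pi * (seconds % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


def fit_standardizer(train_values):
    """Return a function that z-scores values using training statistics."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return lambda value: (value - mean) / std


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps about the same test share."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels rebuilt from the class counts of this sample.
labels = [0] * 9984 + [1] * 16
train_idx, test_idx = stratified_split(labels)
test_frauds = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_frauds)

# Fit the scaler on training amounts only (here: the five preview rows).
scale_amount = fit_standardizer([33.9, 101.37, 187.26, 58.52, 85.55])
print(round(scale_amount(101.37), 3))
print(cyclical_time(21))
```

Fitting the scaler on the training split and reusing it on the test split avoids leaking test-set statistics into training. With only 16 fraud cases, a 20% test split holds just 3 of them, so repeated or cross-validated splits give more stable estimates.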

Benchmark Performance Expectations

Results on this dataset vary with preprocessing and hyperparameters, and accuracy is a misleading way to compare them. A model predicting all transactions as legitimate achieves 99.84% accuracy but catches zero frauds. Focus on Precision, Recall, and F1-Score for the fraud class, and on Area Under the Precision-Recall Curve (AUPRC).
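The metrics named above can be computed as in the following sketch. The labels and scores are made-up toy values chosen for illustration, not results from this dataset, and `average_precision` is a simple step-wise estimate of the area under the precision-recall curve that assumes no tied scores.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    hits = 0
    running = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            running += hits / rank
    return running / total_pos if total_pos else 0.0


# Hypothetical toy example: 8 transactions, 2 of them fraudulent.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.90, 0.20, 0.35, 0.05, 0.30, 0.15]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is arbitrary

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ap = average_precision(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} AP={ap:.2f}")
```

In this toy case accuracy would be 7/8 even though half of the frauds are missed, while recall (0.50) makes the miss visible. Average precision does not depend on the threshold, which makes it useful for comparing models before choosing an operating point.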

Sample Fraudulent Transactions

Time	V1	V2	V3	V4	Amount	Class
73,455	0.66	-0.92	0.91	-1.31	357.08	1
131,327	-0.29	0.7	-0.58	-0.77	341.27	1
263,577	1.81	-0.75	0.39	1.89	445.86	1
300,167	-1.49	-0.16	-1.93	-0.04	1,024.73	1
449,294	-1.48	-1.17	0.79	-1.95	555.33	1
5 rows

SQL

SELECT Time, ROUND(V1, 3) AS V1, ROUND(V2, 3) AS V2, ROUND(V3, 3) AS V3, ROUND(V4, 3) AS V4, Amount, Class FROM creditcard.csv WHERE Class = 1 LIMIT 5

Historical Context and Provenance

This dataset originates from research conducted by the Machine Learning Group at Université Libre de Bruxelles (ULB) in collaboration with Worldline. The transactions were made by European cardholders during a 48-hour period in September 2013. The full dataset contains 284,807 transactions with 492 frauds; this version is a curated 10,000-transaction sample that preserves the essential characteristics and class imbalance ratio.

The dataset was first published in the paper 'Calibrating Probability with Undersampling for Unbalanced Classification' at the 2015 IEEE Symposium Series on Computational Intelligence. Since then, it has become the de facto standard for benchmarking fraud detection algorithms, cited in thousands of academic papers and used in countless Kaggle competitions and tutorials.

Known Limitations

This is a sampled subset of the original dataset. While this sample preserves the class imbalance ratio, some patterns present in the complete 284,807-transaction dataset may not be fully represented here.

No current fraud patterns: Data is from September 2013; fraud techniques have evolved significantly over the past decade
PCA prevents interpretability: Cannot explain predictions in business terms (e.g., 'flagged due to unusual merchant category')
European transactions only: Patterns may not generalize to North American, Asian, or other regional transaction behaviors
No merchant or cardholder context: Cannot create domain-specific features like merchant category codes or customer history
Two-day window only: Does not capture seasonal patterns, monthly cycles, or long-term behavioral changes
Sample size constraints: This 10,000-transaction sample may miss edge cases present in the full dataset

Production Deployment Considerations

This dataset is excellent for learning, prototyping, and benchmarking algorithms, but has important limitations for production fraud systems:

- [x] Suitable for algorithm development and validation
- [x] Ideal for learning imbalanced classification techniques
- [x] Good for benchmarking model performance
- [ ] PCA features don't transfer to new data without the original transformation matrix
- [ ] Real systems need continuous retraining on fresh data to catch evolving fraud patterns
- [ ] Production systems require real-time feature engineering not possible with this static dataset

Use this dataset to develop and validate your approach, then retrain on your organization's actual transaction data for production deployment.

Table Overview

creditcard

Contains 10,000 rows and 31 columns. Column types: 31 numeric.

Data Preview

Time	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21	V22	V23	V24	V25	V26	V27	V28	Amount
21	0.43	0.15	-0.86	0.86	0.07	0.12	-0.77	0.81	-0.11	-0.12	-0.69	-0.66	0.09	-0.5	-0.15	-0.64	0.8	0.61	0.99	-0.95	0.54	-0.53	0.08	0.16	0.48	0.67	-0.31	0.02	33.9
126	-0.02	0.67	-0.57	-0.87	-0.47	0.6	-0.84	0.63	0.76	-0.1	0.82	-0.85	-0.61	0.44	-0.93	0.73	0.39	-0.91	-0.79	-0.56	-0.94	0.74	0.02	0.48	-0.25	-0.53	0.25	0.95	101.37
286	-0.19	-0.4	0.95	-0.14	0.42	0.19	-0.65	0.67	-0.75	0.89	0.52	0.19	0.97	-0.68	0.09	-0.73	0.62	0.69	0.75	0.65	-0.61	0.04	-0.53	0.39	-0.65	-0.24	-0.71	0.28	187.26
378	0.03	-0.41	0.28	0.99	0	-0.62	0.63	0.46	-0.61	0.09	-0.71	-0.27	0.93	0.21	0.55	0.37	-0.12	0.18	0.75	0.27	0.16	0.58	-0.87	-0.65	0.74	0.75	-0.62	-0.86	58.52
532	0.24	-0.95	0.11	-0.65	-0.25	-0.47	-0.63	-0.5	0.19	-0.24	-1	1	0.49	0.37	-0.66	0.14	0.56	0.38	0.87	0.59	0.23	-0.57	-0.94	-0.25	0.58	0.22	0.28	-0.2	85.55

Showing 5 of 10,000 rows

Data Profile

Rows: 10,000
Columns: 31
Completeness: 100%
Estimated size: 14.8 MB
Column types: 31 numeric

High-Cardinality Columns

Columns with many unique values. In this table they are all continuous numeric measurements rather than identifiers or categories.

Time (10,000 unique values)
V18 (9,983 unique values)
V12 (9,982 unique values)
V19 (9,979 unique values)
V27 (9,979 unique values)
V7 (9,978 unique values)
V13 (9,978 unique values)
V28 (9,978 unique values)
V2 (9,977 unique values)
V14 (9,977 unique values)
V23 (9,977 unique values)
V4 (9,976 unique values)
V8 (9,976 unique values)
V26 (9,976 unique values)
V3 (9,975 unique values)
V11 (9,975 unique values)
V21 (9,975 unique values)
V5 (9,974 unique values)
V10 (9,974 unique values)
V17 (9,974 unique values)
V22 (9,971 unique values)
V16 (9,970 unique values)
V6 (9,969 unique values)
V15 (9,969 unique values)
V20 (9,969 unique values)
V24 (9,968 unique values)
V25 (9,968 unique values)
V9 (9,961 unique values)
V1 (9,959 unique values)
Amount (7,858 unique values)

Data Dictionary

creditcard

Column	Type	Description (ranges cover the five preview rows only)	Example
`Time`	numeric	Seconds elapsed since the first transaction in the dataset (range: 21 - 532)	21, 126
`V1`	numeric	Numeric value (range: -0.19 - 0.43)	0.425918, -0.016846
`V2`	numeric	Numeric value (range: -0.95 - 0.67)	0.14986, 0.671503
`V3`	numeric	Numeric value (range: -0.86 - 0.95)	-0.859045, -0.567677
`V4`	numeric	Numeric value (range: -0.87 - 0.99)	0.861877, -0.870574
`V5`	numeric	Numeric value (range: -0.47 - 0.42)	0.068215, -0.472092
`V6`	numeric	Numeric value (range: -0.62 - 0.6)	0.121616, 0.596326
`V7`	numeric	Numeric value (range: -0.84 - 0.63)	-0.767256, -0.844182
`V8`	numeric	Numeric value (range: -0.5 - 0.81)	0.810531, 0.628362
`V9`	numeric	Numeric value (range: -0.75 - 0.76)	-0.113886, 0.762508
`V10`	numeric	Numeric value (range: -0.24 - 0.89)	-0.118432, -0.100272
`V11`	numeric	Numeric value (range: -1 - 0.82)	-0.68963, 0.817203
`V12`	numeric	Numeric value (range: -0.85 - 1)	-0.660487, -0.848729
`V13`	numeric	Numeric value (range: -0.61 - 0.97)	0.09042, -0.607883
`V14`	numeric	Numeric value (range: -0.68 - 0.44)	-0.501866, 0.443295
`V15`	numeric	Numeric value (range: -0.93 - 0.55)	-0.148025, -0.931149
`V16`	numeric	Numeric value (range: -0.73 - 0.73)	-0.636372, 0.730667
`V17`	numeric	Numeric value (range: -0.12 - 0.8)	0.798721, 0.391612
`V18`	numeric	Numeric value (range: -0.91 - 0.69)	0.610895, -0.912661
`V19`	numeric	Numeric value (range: -0.79 - 0.99)	0.988962, -0.787397
`V20`	numeric	Numeric value (range: -0.95 - 0.65)	-0.948803, -0.563796
`V21`	numeric	Numeric value (range: -0.94 - 0.54)	0.543633, -0.942843
`V22`	numeric	Numeric value (range: -0.57 - 0.74)	-0.525624, 0.737447
`V23`	numeric	Numeric value (range: -0.94 - 0.08)	0.077771, 0.01516
`V24`	numeric	Numeric value (range: -0.65 - 0.48)	0.164373, 0.483361
`V25`	numeric	Numeric value (range: -0.65 - 0.74)	0.483061, -0.249131
`V26`	numeric	Numeric value (range: -0.53 - 0.75)	0.666458, -0.527051
`V27`	numeric	Numeric value (range: -0.71 - 0.28)	-0.307762, 0.252673
`V28`	numeric	Numeric value (range: -0.86 - 0.95)	0.024361, 0.947974
`Amount`	numeric	Transaction amount in euros	33.9, 101.37
`Class`	numeric	Target variable (0 = legitimate, 1 = fraudulent)	0, 0