Last updated: January 2, 2026
Time: 1993-1995
Location: University of Wisconsin Hospitals, Madison, USA
Created by Dataset Agent
Overview
The Breast Cancer Wisconsin (Diagnostic) Dataset is one of the most widely used benchmarks for binary classification in medical machine learning. Created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin Hospitals (1993-1995), it contains 569 tumor samples, each described by 30 real-valued features computed from digitized fine needle aspirate (FNA) images of breast masses.
The dataset contains 569 samples across 32 columns: 1 ID, 1 diagnosis label, and 30 computed nuclear features.
Class distribution: 212 malignant (37.3%) and 357 benign (62.7%), a moderate imbalance that reflects the case mix seen at the source clinic.
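To make the imbalance concrete, the sketch below recomputes the class shares from the counts quoted above and derives the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and not part of the dataset.

```python
# Class counts as stated in the dataset summary.
counts = {"M": 212, "B": 357}

total = sum(counts.values())  # 569 samples
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class sets the floor
# against which any real model's accuracy should be compared.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any reported accuracy is best read relative to this baseline rather than relative to 50%.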
From FNA Image to 30 Features: The Clinical Workflow
Understanding how this data was created helps interpret the features correctly. A Fine Needle Aspirate (FNA) is a minimally invasive procedure where a thin needle extracts cells from a breast mass. These cells are stained and photographed under a microscope. The Xcyt program then traces the boundaries of 10-20 cell nuclei per image and computes geometric measurements.
Ten fundamental properties are measured for each nucleus, then aggregated three ways across all nuclei in the image:
- Mean: Average value across all nuclei — represents typical cell characteristics
- Standard Error (SE): Measures variation between nuclei — high SE suggests heterogeneous cell population
- Worst: Mean of the three largest values — captures the most abnormal cells, often most predictive of malignancy
The 'worst' features are frequently the most discriminative because cancer diagnosis often depends on identifying the most abnormal cells, not average cell behavior.
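The three aggregations described above can be sketched as follows. This is a minimal illustration, not the Xcyt implementation: the radius values are hypothetical, the function name `aggregate_nuclei` is invented for this example, and the standard error uses the textbook definition (sample standard deviation divided by the square root of the number of nuclei), which is an assumption about how the original software computed it.

```python
import math
import statistics


def aggregate_nuclei(values):
    """Summarize one base measurement across the nuclei of a single image."""
    if len(values) < 3:
        raise ValueError("need at least three nuclei to compute 'worst'")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    # Assumption: textbook definition, not confirmed as Xcyt's exact formula.
    se = statistics.stdev(values) / math.sqrt(len(values))
    # 'Worst' is the mean of the three largest values.
    worst = statistics.fmean(sorted(values, reverse=True)[:3])
    return {"mean": mean, "se": se, "worst": worst}


# Hypothetical radius measurements for six nuclei (not taken from the dataset).
radii = [10.0, 12.0, 11.0, 15.0, 9.0, 14.0]
summary = aggregate_nuclei(radii)
print({name: round(value, 3) for name, value in summary.items()})
```

Applying the same function to each of the 10 base measurements yields the 30 features stored per sample.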
The 10 Base Features: Clinical Meaning
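Each of the 10 base measurements captures a specific aspect of cell nucleus morphology that pathologists use to assess malignancy. As defined in the dataset's UCI documentation:
- Radius: Mean of distances from the center to points on the perimeter
- Texture: Standard deviation of gray-scale values
- Perimeter: Length of the nuclear boundary
- Area: Area enclosed by the nuclear boundary
- Smoothness: Local variation in radius lengths
- Compactness: perimeter² / area − 1.0
- Concavity: Severity of concave portions of the contour
- Concave points: Number of concave portions of the contour
- Symmetry: Symmetry of the nuclear outline
- Fractal dimension: "Coastline approximation" − 1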
Key Statistical Insights
Malignant tumors have a mean radius about 44% larger than benign tumors (17.46 vs 12.15 units).
Concavity separates the classes strongly: the malignant mean is 3.49 times the benign mean (0.161 vs 0.046).
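The arithmetic behind the radius and concavity comparisons can be checked with a short sketch. The helper names are illustrative. Because the class means quoted in the text are already rounded, the concavity ratio computed here comes out as 3.50 rather than 3.49; the quoted figure was presumably derived from unrounded means.

```python
def relative_increase(malignant_mean, benign_mean):
    """Fractional increase of the malignant mean over the benign mean."""
    return (malignant_mean - benign_mean) / benign_mean


def ratio(malignant_mean, benign_mean):
    """How many times as large the malignant mean is."""
    return malignant_mean / benign_mean


# Class means as quoted in the text (already rounded).
radius_gap = relative_increase(17.46, 12.15)
concavity_ratio = ratio(0.161, 0.046)

print(f"radius: {radius_gap:.1%} larger")
print(f"concavity: {concavity_ratio:.2f}x")
```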
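Every tumor with area > 1000 sq units is malignant (92 out of 92 cases), so a large area is a strong indicator of malignancy. The rule covers only 92 of the 212 malignant tumors, however, so area alone cannot rule malignancy out.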
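Treating the area threshold as a one-line classifier shows what it does and does not deliver. The sketch below uses only the counts quoted in this document; the variable names are illustrative.

```python
# Counts quoted in the text.
flagged = 92            # tumors with area above 1000
flagged_malignant = 92  # of those, how many are malignant
total_malignant = 212   # all malignant tumors in the dataset

# Share of flagged tumors that are malignant.
precision = flagged_malignant / flagged
# Share of all malignant tumors that the rule catches.
recall = flagged_malignant / total_malignant
missed = total_malignant - flagged_malignant

print(f"precision={precision:.1%} recall={recall:.1%} missed={missed}")
```

The rule is perfectly precise on this data but leaves 120 malignant tumors below the threshold, which is why it works as a feature rather than as a diagnosis.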
Benchmark Model Performance
This dataset is well-studied with established performance baselines. Expect these accuracy ranges with proper cross-validation:
In medical diagnosis, sensitivity (recall for malignant class) matters more than accuracy. A model with 95% accuracy but 85% sensitivity misses 15% of cancers — potentially fatal false negatives. Always evaluate precision-recall tradeoffs.
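The gap between accuracy and sensitivity is easy to reproduce from a confusion matrix. The sketch below uses a hypothetical cohort of 200 malignant and 800 benign cases (these counts are invented for illustration and are not results from this dataset), chosen so that accuracy is 95% while sensitivity is 85%.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity, and precision from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall for the malignant class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical cohort: 200 malignant and 800 benign cases.
metrics = diagnostic_metrics(tp=170, fn=30, tn=780, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

In this hypothetical cohort, 30 of the 200 cancers go undetected even though the headline accuracy looks strong.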
Sample Data Preview
First 5 Records with Key Features
| # | ID | Diagnosis | Radius Mean | Area Mean | Concavity Mean | Concave Points Mean |
|---|---|---|---|---|---|---|
| 1 | 842302 | M | 17.99 | 1,001 | 0.3 | 0.15 |
| 2 | 842517 | M | 20.57 | 1,326 | 0.09 | 0.07 |
| 3 | 84300903 | M | 19.69 | 1,203 | 0.2 | 0.13 |
| 4 | 84348301 | M | 11.42 | 386.1 | 0.24 | 0.11 |
| 5 | 84358402 | M | 20.29 | 1,297 | 0.2 | 0.1 |
Data Quality Notes
This dataset is exceptionally clean, making it ideal for teaching and rapid prototyping:
- No missing values — all 569 samples have complete data for all 32 columns
- No duplicate IDs — each sample is unique
- All 30 features are continuous, non-negative real numbers (some concavity and concave points values are exactly 0); only the diagnosis label (M/B) needs encoding
- No obvious outliers requiring removal — data has been quality-controlled
- Widely varying scales: area values run into the thousands while smoothness values stay near 0.1, so standardization (e.g., StandardScaler) is recommended for distance-based algorithms
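The quality checks and the scaling recommendation in the list above can be sketched with the standard library alone. The rows are the five preview records from this page (id, radius_mean, area_mean); the function name `standardize` is illustrative. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import statistics

# First five preview rows: (id, radius_mean, area_mean).
rows = [
    (842302, 17.99, 1001.0),
    (842517, 20.57, 1326.0),
    (84300903, 19.69, 1203.0),
    (84348301, 11.42, 386.1),
    (84358402, 20.29, 1297.0),
]

# Quality checks: no missing values and no repeated IDs.
has_missing = any(value is None for row in rows for value in row)
ids = [row[0] for row in rows]
has_duplicate_ids = len(ids) != len(set(ids))


def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


radius = [row[1] for row in rows]
area = [row[2] for row in rows]

print("missing:", has_missing, "duplicate ids:", has_duplicate_ids)
print("raw ranges:", round(max(radius) - min(radius), 2), round(max(area) - min(area), 2))
print("scaled radius:", [round(z, 2) for z in standardize(radius)])
print("scaled area:", [round(z, 2) for z in standardize(area)])
```

Before scaling, the area column spans a range roughly a hundred times wider than the radius column, so it would dominate any Euclidean distance; after scaling, both columns contribute on comparable terms.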
Historical Context and Research Impact
This dataset originated from research published in "Nuclear feature extraction for breast tumor diagnosis" (Street, Wolberg, Mangasarian, 1993). The Xcyt computer program developed for this research represented an early success in computer-aided diagnosis (CAD), demonstrating that computational analysis of cell images could match expert pathologist accuracy.
Since its release through the UCI Machine Learning Repository, this dataset has been cited in thousands of academic papers. Its combination of real-world medical relevance, clean structure, and manageable size makes it the de facto standard for introducing classification algorithms, demonstrating feature selection techniques, and benchmarking new methods.
Important Limitations
This dataset is for research and educational purposes only. Models trained on this data must not be used for actual clinical diagnosis without proper validation, regulatory approval (FDA/CE), and integration with professional medical judgment.
- Single institution: All samples from University of Wisconsin Hospitals — may not generalize to other populations
- Historical data: Collected 1993-1995 — imaging technology and diagnostic criteria have evolved
- Limited demographics: No patient age, ethnicity, or other clinical variables included
- Small sample size: 569 samples is modest by modern deep learning standards
- Binary outcome only: Does not include cancer staging, grade, or prognosis information
Table Overview
breast_cancer_wisconsin
Data Preview
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave_points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 122.8 | 1,001 | 0.12 | 0.28 | 0.3 | 0.15 | 0.24 | 0.08 | 1.1 | 0.91 | 8.59 | 153.4 | 0.01 | 0.05 | 0.05 | 0.02 | 0.03 | 0.01 | 25.38 | 17.33 | 184.6 | 2,019 | 0.16 | 0.67 | 0.71 | 0.27 | 0.46 | 0.12 |
| 842517 | M | 20.57 | 17.77 | 132.9 | 1,326 | 0.08 | 0.08 | 0.09 | 0.07 | 0.18 | 0.06 | 0.54 | 0.73 | 3.4 | 74.08 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0 | 24.99 | 23.41 | 158.8 | 1,956 | 0.12 | 0.19 | 0.24 | 0.19 | 0.28 | 0.09 |
| 84300903 | M | 19.69 | 21.25 | 130 | 1,203 | 0.11 | 0.16 | 0.2 | 0.13 | 0.21 | 0.06 | 0.75 | 0.79 | 4.59 | 94.03 | 0.01 | 0.04 | 0.04 | 0.02 | 0.02 | 0 | 23.57 | 25.53 | 152.5 | 1,709 | 0.14 | 0.42 | 0.45 | 0.24 | 0.36 | 0.09 |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14 | 0.28 | 0.24 | 0.11 | 0.26 | 0.1 | 0.5 | 1.16 | 3.45 | 27.23 | 0.01 | 0.07 | 0.06 | 0.02 | 0.06 | 0.01 | 14.91 | 26.5 | 98.87 | 567.7 | 0.21 | 0.87 | 0.69 | 0.26 | 0.66 | 0.17 |
| 84358402 | M | 20.29 | 14.34 | 135.1 | 1,297 | 0.1 | 0.13 | 0.2 | 0.1 | 0.18 | 0.06 | 0.76 | 0.78 | 5.44 | 94.44 | 0.01 | 0.02 | 0.06 | 0.02 | 0.02 | 0.01 | 22.54 | 16.67 | 152.2 | 1,575 | 0.14 | 0.21 | 0.4 | 0.16 | 0.24 | 0.08 |
Showing 5 of 569 rows
Data Profile
- Rows: 569
- Columns: 32
- Completeness: 100%
- Estimated size: 889.1 KB
Column Types
31 numeric, 1 text
High-Cardinality Columns
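Columns with many unique values. Only id is an identifier; the remaining columns are continuous measurements, so their high unique counts are expected and do not make them categorical features.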
- id (569 unique values)
- smoothness_se (547 unique values)
- fractal_dimension_se (545 unique values)
- area_worst (544 unique values)
- concave_points_mean (542 unique values)
- compactness_se (541 unique values)
- radius_se (540 unique values)
- area_mean (539 unique values)
- concavity_worst (539 unique values)
- compactness_mean (537 unique values)
- concavity_mean (537 unique values)
- fractal_dimension_worst (535 unique values)
- perimeter_se (533 unique values)
- concavity_se (533 unique values)
- compactness_worst (529 unique values)
- area_se (528 unique values)
- perimeter_mean (522 unique values)
- texture_se (519 unique values)
- perimeter_worst (514 unique values)
- texture_worst (511 unique values)
- concave_points_se (507 unique values)
- symmetry_worst (500 unique values)
- fractal_dimension_mean (499 unique values)
- symmetry_se (498 unique values)
- concave_points_worst (492 unique values)
- texture_mean (479 unique values)
- smoothness_mean (474 unique values)
- radius_worst (457 unique values)
- radius_mean (456 unique values)
- symmetry_mean (432 unique values)
- smoothness_worst (411 unique values)
Data Dictionary
breast_cancer_wisconsin
| Column | Type | Example | Missing Values |
|---|---|---|---|
| id | numeric | 842302, 842517 | 0 |
| diagnosis | string | "M", "M" | 0 |
| radius_mean | numeric | 17.99, 20.57 | 0 |
| texture_mean | numeric | 10.38, 17.77 | 0 |
| perimeter_mean | numeric | 122.8, 132.9 | 0 |
| area_mean | numeric | 1001, 1326 | 0 |
| smoothness_mean | numeric | 0.1184, 0.08474 | 0 |
| compactness_mean | numeric | 0.2776, 0.07864 | 0 |
| concavity_mean | numeric | 0.3001, 0.0869 | 0 |
| concave_points_mean | numeric | 0.1471, 0.07017 | 0 |
| symmetry_mean | numeric | 0.2419, 0.1812 | 0 |
| fractal_dimension_mean | numeric | 0.07871, 0.05667 | 0 |
| radius_se | numeric | 1.095, 0.5435 | 0 |
| texture_se | numeric | 0.9053, 0.7339 | 0 |
| perimeter_se | numeric | 8.589, 3.398 | 0 |
| area_se | numeric | 153.4, 74.08 | 0 |
| smoothness_se | numeric | 0.006399, 0.005225 | 0 |
| compactness_se | numeric | 0.04904, 0.01308 | 0 |
| concavity_se | numeric | 0.05373, 0.0186 | 0 |
| concave_points_se | numeric | 0.01587, 0.0134 | 0 |
| symmetry_se | numeric | 0.03003, 0.01389 | 0 |
| fractal_dimension_se | numeric | 0.006193, 0.003532 | 0 |
| radius_worst | numeric | 25.38, 24.99 | 0 |
| texture_worst | numeric | 17.33, 23.41 | 0 |
| perimeter_worst | numeric | 184.6, 158.8 | 0 |
| area_worst | numeric | 2019, 1956 | 0 |
| smoothness_worst | numeric | 0.1622, 0.1238 | 0 |
| compactness_worst | numeric | 0.6656, 0.1866 | 0 |
| concavity_worst | numeric | 0.7119, 0.2416 | 0 |
| concave_points_worst | numeric | 0.2654, 0.186 | 0 |
| symmetry_worst | numeric | 0.4601, 0.275 | 0 |
| fractal_dimension_worst | numeric | 0.1189, 0.08902 | 0 |
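The column naming in the data dictionary follows a regular pattern: each of the 10 base measurements appears once per aggregate (mean, se, worst). The sketch below rebuilds the 32 column names from that pattern, which can be handy for selecting feature groups programmatically. The constant and variable names are illustrative.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
AGGREGATES = ["mean", "se", "worst"]

# Column order follows the data dictionary: all means, then all SEs, then all worsts.
feature_columns = [f"{base}_{agg}" for agg in AGGREGATES for base in BASE_FEATURES]
all_columns = ["id", "diagnosis"] + feature_columns

# Example: pick out only the 'worst' features.
worst_columns = [name for name in feature_columns if name.endswith("_worst")]

print(len(feature_columns), len(all_columns))
print(all_columns[:4], all_columns[-1])
print(worst_columns[:3])
```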