
Breast Cancer Wisconsin (Diagnostic) Dataset

569 breast tumor samples, each described by 30 cell-nucleus features computed from fine needle aspirate (FNA) images. A widely used benchmark for cancer classification models, with 37% malignant and 63% benign samples.

Tags: healthcare, classification, machine-learning, cancer-diagnosis, medical-imaging, binary-classification, benchmark-dataset, breast-cancer, fine-needle-aspirate, cell-morphology, uci-repository, computer-aided-diagnosis
1 table, 569 rows
Last updated: January 2, 2026
Time: 1993-1995
Location: University of Wisconsin Hospitals, Madison, USA
Created by Dataset Agent

Overview

The Breast Cancer Wisconsin (Diagnostic) Dataset is one of the most widely used benchmarks for binary classification in medical machine learning. Created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin Hospitals (1993-1995), it contains 569 tumor samples, each with 30 real-valued features computed from digitized fine needle aspirate (FNA) images of breast masses.
The table has 32 columns: 1 ID, 1 diagnosis label, and 30 computed nuclear features.
SQL
SELECT COUNT(*) AS total_samples FROM breast_cancer_wisconsin.csv
Data
Total Samples
569
Class distribution: 212 malignant (37.3%) and 357 benign (62.7%), a moderate class imbalance.
SQL
SELECT diagnosis, COUNT(*) AS count, ROUND(COUNT(*) * 100.0 / 569, 1) AS percentage FROM breast_cancer_wisconsin.csv GROUP BY diagnosis
Data
Diagnosis | Count | Percentage
M | 212 | 37.3
B | 357 | 62.7
SQL
SELECT diagnosis, COUNT(*) AS count FROM breast_cancer_wisconsin.csv GROUP BY diagnosis
Data
Diagnosis | Count
Malignant (M) | 212
Benign (B) | 357
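One practical consequence of this class distribution is that a classifier which always predicts the majority class already scores well above 50% accuracy. The short sketch below works that baseline out from the counts reported above; the variable names are illustrative only.

```python
# Class counts taken from the class-distribution table above.
counts = {"M": 212, "B": 357}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting {majority_label!r}: accuracy = {baseline_accuracy:.1%}")
```

Any model trained on this dataset should be judged against this baseline rather than against 50%.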

From FNA Image to 30 Features: The Clinical Workflow

Understanding how this data was created helps interpret the features correctly. A Fine Needle Aspirate (FNA) is a minimally invasive procedure where a thin needle extracts cells from a breast mass. These cells are stained and photographed under a microscope. The Xcyt program then traces the boundaries of 10-20 cell nuclei per image and computes geometric measurements.
Ten fundamental properties are measured for each nucleus, then aggregated three ways across all nuclei in the image:
  • Mean: Average value across all nuclei — represents typical cell characteristics
  • Standard Error (SE): Measures variation between nuclei — high SE suggests heterogeneous cell population
  • Worst: Mean of the three largest values — captures the most abnormal cells, often most predictive of malignancy
The 'worst' features are frequently the most discriminative because cancer diagnosis often depends on identifying the most abnormal cells, not average cell behavior.
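The three aggregations can be sketched in a few lines of Python. This is a minimal illustration and not the Xcyt implementation: the function name and the sample radii are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei, which the source does not spell out.

```python
import math
import statistics

def summarize_nuclei(values):
    """Collapse per-nucleus measurements of one base feature into
    the three per-image statistics described above."""
    if len(values) < 3:
        raise ValueError("need at least three nuclei to compute 'worst'")
    mean = statistics.mean(values)
    # Assumption: standard error = sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst" = mean of the three largest values.
    worst = statistics.mean(sorted(values, reverse=True)[:3])
    return {"mean": mean, "se": se, "worst": worst}

# Hypothetical radii for the nuclei traced in one image (made-up numbers).
radii = [11.2, 12.8, 12.1, 17.9, 13.4, 12.6, 19.3, 12.0, 18.5, 13.1]
summary = summarize_nuclei(radii)
print({name: round(value, 2) for name, value in summary.items()})
```

In this made-up image, three enlarged nuclei pull the "worst" value well above the mean, which is the effect the "worst" features are meant to capture.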

The 10 Base Features: Clinical Meaning

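Each of the 10 base measurements captures a specific aspect of cell nucleus morphology that pathologists use to assess malignancy. As defined in the dataset's UCI documentation:
  • Radius: mean of distances from the center to points on the perimeter
  • Texture: standard deviation of gray-scale values
  • Perimeter: length of the nucleus boundary
  • Area: area enclosed by the nucleus boundary
  • Smoothness: local variation in radius lengths
  • Compactness: perimeter² / area − 1.0
  • Concavity: severity of concave portions of the contour
  • Concave points: number of concave portions of the contour
  • Symmetry: symmetry of the nucleus shape
  • Fractal dimension: "coastline approximation" − 1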

Key Statistical Insights

Malignant tumors have a 44% larger mean radius than benign tumors (17.46 vs 12.15 units).
SQL
SELECT diagnosis, ROUND(AVG(radius_mean), 2) AS avg_radius FROM breast_cancer_wisconsin.csv GROUP BY diagnosis
Data
Diagnosis | Avg Radius
M | 17.46
B | 12.15
Mean concavity shows the largest relative gap among the features summarized here: the malignant average is 3.49 times the benign average (0.161 vs 0.046).
SQL
SELECT diagnosis, ROUND(AVG(concavity_mean), 4) AS avg_concavity FROM breast_cancer_wisconsin.csv GROUP BY diagnosis
Data
Diagnosis | Avg Concavity
M | 0.16
B | 0.05
SQL
SELECT diagnosis, ROUND(AVG(radius_mean), 2) AS avg_radius, ROUND(AVG(texture_mean), 2) AS avg_texture, ROUND(AVG(perimeter_mean), 2) AS avg_perimeter, ROUND(AVG(area_mean), 2) AS avg_area FROM breast_cancer_wisconsin.csv GROUP BY diagnosis
Data
Feature | Malignant | Benign
Radius | 17.46 | 12.15
Texture | 21.60 | 17.91
Perimeter | 115.37 | 78.08
Area | 978.38 | 462.79
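The percentage comparisons in this section all come from the same arithmetic: the difference between the two group means, divided by the benign mean. The sketch below reproduces it from the averages reported above; the dictionary and the function name are illustrative, and the concavity values are the rounded figures quoted in the text.

```python
# Group means (malignant, benign) as reported in this section.
group_means = {
    "radius_mean": (17.46, 12.15),
    "texture_mean": (21.60, 17.91),
    "perimeter_mean": (115.37, 78.08),
    "area_mean": (978.38, 462.79),
    "concavity_mean": (0.161, 0.046),
}

def relative_increase(malignant, benign):
    """How much larger the malignant mean is, as a fraction of the benign mean."""
    return (malignant - benign) / benign

increases = {
    name: relative_increase(m, b) for name, (m, b) in group_means.items()
}

# Print features from largest to smallest relative gap.
for name, inc in sorted(increases.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {inc:+.0%}")
```

A large gap between group means does not by itself establish discriminative power, because it ignores how widely each group spreads around its mean.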
All 92 tumors with area_mean above 1000 square units are malignant, so a large area alone is a strong signal of malignancy. The other 120 malignant tumors fall below that threshold, so a small area cannot rule malignancy out.
SQL
SELECT diagnosis, COUNT(*) AS count FROM breast_cancer_wisconsin.csv WHERE area_mean > 1000 GROUP BY diagnosis
Data
Diagnosis | Count
M | 92
SQL
SELECT CASE WHEN area_mean < 400 THEN '< 400' WHEN area_mean < 600 THEN '400-600' WHEN area_mean < 800 THEN '600-800' WHEN area_mean < 1000 THEN '800-1000' ELSE '> 1000' END AS area_range, SUM( CASE WHEN diagnosis = 'M' THEN 1 ELSE 0 END ) AS malignant, SUM( CASE WHEN diagnosis = 'B' THEN 1 ELSE 0 END ) AS benign FROM breast_cancer_wisconsin.csv GROUP BY area_range ORDER BY MIN(area_mean)
Data
Area Range | Malignant | Benign
< 400 | 3 | 115
400-600 | 28 | 189
600-800 | 48 | 48
800-1000 | 41 | 5
> 1000 | 92 | 0
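The binned counts above are enough to evaluate a one-feature rule of the form "call a tumor malignant when area_mean is at least some threshold". The following sketch computes precision and recall for thresholds that coincide with the bin edges; the `bins` list and `rule_metrics` function are illustrative names, and the bins are treated as half-open intervals, matching the CASE expression in the query.

```python
# (lower bound of bin, malignant count, benign count), from the table above.
# Each bin covers [lower bound, next lower bound), as in the SQL CASE expression.
bins = [
    (0, 3, 115),
    (400, 28, 189),
    (600, 48, 48),
    (800, 41, 5),
    (1000, 92, 0),
]

total_malignant = sum(m for _, m, _ in bins)

def rule_metrics(threshold):
    """Precision and recall of the rule: area_mean >= threshold -> malignant.
    The threshold must be one of the bin lower bounds."""
    tp = sum(m for lo, m, _ in bins if lo >= threshold)
    fp = sum(b for lo, _, b in bins if lo >= threshold)
    precision = tp / (tp + fp)
    recall = tp / total_malignant
    return precision, recall

for threshold in (600, 800, 1000):
    precision, recall = rule_metrics(threshold)
    print(f"area_mean >= {threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Lowering the threshold catches more malignant tumors at the cost of more benign ones flagged, which is why a single-feature cutoff is a starting point rather than a classifier.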

Benchmark Model Performance

This dataset is well-studied with established performance baselines. Expect these accuracy ranges with proper cross-validation:
In medical diagnosis, sensitivity (recall for the malignant class) matters more than overall accuracy. A model with 95% accuracy but 85% sensitivity misses 15% of cancers, and each miss is a potentially fatal false negative. Always evaluate the precision-recall tradeoff.
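To make the distinction between accuracy and sensitivity concrete, the sketch below derives both from confusion-matrix counts. The counts are hypothetical, chosen only to reproduce the 95% accuracy and 85% sensitivity scenario described above; they are not results from any model on this dataset.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (recall on malignant), and precision
    from confusion-matrix counts, with malignant as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 100 malignant cases (85 caught, 15 missed)
# and 300 benign cases (295 cleared, 5 false alarms).
metrics = classification_metrics(tp=85, fn=15, fp=5, tn=295)
missed = 1 - metrics["sensitivity"]

print({name: round(value, 3) for name, value in metrics.items()})
print(f"missed malignant cases: {missed:.0%}")
```

Accuracy looks strong here because the benign cases are both numerous and easy, which is why sensitivity should be reported separately.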

Sample Data Preview

First 5 Records with Key Features
# | ID | Diagnosis | Radius Mean | Area Mean | Concavity Mean | Concave Points Mean
1 | 842302 | M | 17.99 | 1,001 | 0.3 | 0.15
2 | 842517 | M | 20.57 | 1,326 | 0.09 | 0.07
3 | 84300903 | M | 19.69 | 1,203 | 0.2 | 0.13
4 | 84348301 | M | 11.42 | 386.1 | 0.24 | 0.11
5 | 84358402 | M | 20.29 | 1,297 | 0.2 | 0.1
SQL
SELECT id, diagnosis, ROUND(radius_mean, 2), ROUND(area_mean, 1), ROUND(concavity_mean, 4), ROUND(concave_points_mean, 4) FROM breast_cancer_wisconsin.csv LIMIT 5

Data Quality Notes

This dataset is exceptionally clean, making it ideal for teaching and rapid prototyping:
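  • No missing values: all 569 samples have complete data for all 32 columns
  • No duplicate IDs: each sample is unique
  • All 30 features are continuous, non-negative real numbers (concavity and concave points can be exactly 0); only the diagnosis label (M/B) needs encoding
  • No obvious outliers requiring removal; the data has been quality-controlled
  • Feature scales differ by orders of magnitude (area_mean runs from the hundreds into the thousands, while smoothness_mean is around 0.1), so standardization such as scikit-learn's StandardScaler is recommended for distance-based algorithms
Zero missing values across all 569 samples and 32 columns. The data dictionary reports 0 missing values for every column; the query below spot-checks three of them.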
SQL
SELECT COUNT(*) AS complete_rows FROM breast_cancer_wisconsin.csv WHERE id IS NOT NULL AND diagnosis IS NOT NULL AND radius_mean IS NOT NULL
Data
Complete Rows
569
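The scaling recommendation in the notes above amounts to z-scoring each feature column. Below is a minimal standard-library sketch of that transformation; it is not scikit-learn's implementation, although it follows the same convention of dividing by the population standard deviation. The `standardize` helper is a hypothetical name, and the inputs are the five rounded values shown in the sample preview.

```python
import statistics

def standardize(column):
    """Z-score a list of numbers: subtract the mean, then divide by the
    population standard deviation (the convention StandardScaler uses)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]

# First five area_mean and concavity_mean values from the sample preview.
area_mean = [1001.0, 1326.0, 1203.0, 386.1, 1297.0]
concavity_mean = [0.30, 0.09, 0.20, 0.24, 0.20]

scaled_area = standardize(area_mean)
scaled_concavity = standardize(concavity_mean)

print([round(v, 2) for v in scaled_area])
print([round(v, 2) for v in scaled_concavity])
```

After scaling, both columns have mean 0 and unit variance, so neither dominates a Euclidean distance simply because of its units. In a real pipeline the mean and standard deviation should be estimated on the training split only and then reused on the test split.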

Historical Context and Research Impact

This dataset originated from research published in "Nuclear feature extraction for breast tumor diagnosis" (Street, Wolberg, Mangasarian, 1993). The Xcyt computer program developed for this research represented an early success in computer-aided diagnosis (CAD), demonstrating that computational analysis of cell images could match expert pathologist accuracy.
Since its release through the UCI Machine Learning Repository, this dataset has been cited in thousands of academic papers. Its combination of real-world medical relevance, clean structure, and manageable size makes it the de facto standard for introducing classification algorithms, demonstrating feature selection techniques, and benchmarking new methods.

Important Limitations

This dataset is for research and educational purposes only. Models trained on this data must not be used for actual clinical diagnosis without proper validation, regulatory approval (FDA/CE), and integration with professional medical judgment.
  • Single institution: All samples from University of Wisconsin Hospitals — may not generalize to other populations
  • Historical data: Collected 1993-1995 — imaging technology and diagnostic criteria have evolved
  • Limited demographics: No patient age, ethnicity, or other clinical variables included
  • Small sample size: 569 samples is modest by modern deep learning standards
  • Binary outcome only: Does not include cancer staging, grade, or prognosis information

Table Overview

breast_cancer_wisconsin

Contains 569 rows and 32 columns. Column types: 31 numeric, 1 text.


Data Preview

First 3 rows (5 of 32 columns shown):
id | diagnosis | radius_mean | texture_mean | perimeter_mean
842302 | M | 17.99 | 10.38 | 122.8
842517 | M | 20.57 | 17.77 | 132.9
84300903 | M | 19.69 | 21.25 | 130

Data Profile

569 rows, 32 columns, 100% complete, 889.1 KB estimated size

Column Types

31 numeric, 1 text

High-Cardinality Columns

Columns with many unique values. Only id is an identifier; the others are continuous measurements, for which high cardinality is expected.

  • id (569 unique values)
  • smoothness_se (547 unique values)
  • fractal_dimension_se (545 unique values)
  • area_worst (544 unique values)
  • concave_points_mean (542 unique values)
  • compactness_se (541 unique values)
  • radius_se (540 unique values)
  • area_mean (539 unique values)
  • concavity_worst (539 unique values)
  • compactness_mean (537 unique values)
  • concavity_mean (537 unique values)
  • fractal_dimension_worst (535 unique values)
  • perimeter_se (533 unique values)
  • concavity_se (533 unique values)
  • compactness_worst (529 unique values)
  • area_se (528 unique values)
  • perimeter_mean (522 unique values)
  • texture_se (519 unique values)
  • perimeter_worst (514 unique values)
  • texture_worst (511 unique values)
  • concave_points_se (507 unique values)
  • symmetry_worst (500 unique values)
  • fractal_dimension_mean (499 unique values)
  • symmetry_se (498 unique values)
  • concave_points_worst (492 unique values)
  • texture_mean (479 unique values)
  • smoothness_mean (474 unique values)
  • radius_worst (457 unique values)
  • radius_mean (456 unique values)
  • symmetry_mean (432 unique values)
  • smoothness_worst (411 unique values)

Data Dictionary

breast_cancer_wisconsin

Column | Type | Example | Missing Values
id | numeric | 842302, 842517 | 0
diagnosis | string | "M", "M" | 0
radius_mean | numeric | 17.99, 20.57 | 0
texture_mean | numeric | 10.38, 17.77 | 0
perimeter_mean | numeric | 122.8, 132.9 | 0
area_mean | numeric | 1001, 1326 | 0
smoothness_mean | numeric | 0.1184, 0.08474 | 0
compactness_mean | numeric | 0.2776, 0.07864 | 0
concavity_mean | numeric | 0.3001, 0.0869 | 0
concave_points_mean | numeric | 0.1471, 0.07017 | 0
symmetry_mean | numeric | 0.2419, 0.1812 | 0
fractal_dimension_mean | numeric | 0.07871, 0.05667 | 0
radius_se | numeric | 1.095, 0.5435 | 0
texture_se | numeric | 0.9053, 0.7339 | 0
perimeter_se | numeric | 8.589, 3.398 | 0
area_se | numeric | 153.4, 74.08 | 0
smoothness_se | numeric | 0.006399, 0.005225 | 0
compactness_se | numeric | 0.04904, 0.01308 | 0
concavity_se | numeric | 0.05373, 0.0186 | 0
concave_points_se | numeric | 0.01587, 0.0134 | 0
symmetry_se | numeric | 0.03003, 0.01389 | 0
fractal_dimension_se | numeric | 0.006193, 0.003532 | 0
radius_worst | numeric | 25.38, 24.99 | 0
texture_worst | numeric | 17.33, 23.41 | 0
perimeter_worst | numeric | 184.6, 158.8 | 0
area_worst | numeric | 2019, 1956 | 0
smoothness_worst | numeric | 0.1622, 0.1238 | 0
compactness_worst | numeric | 0.6656, 0.1866 | 0
concavity_worst | numeric | 0.7119, 0.2416 | 0
concave_points_worst | numeric | 0.2654, 0.186 | 0
symmetry_worst | numeric | 0.4601, 0.275 | 0
fractal_dimension_worst | numeric | 0.1189, 0.08902 | 0
Last updated: January 2, 2026
Created: January 2, 2026