Last updated: December 27, 2025
Time: 1936
Location: Gaspé Peninsula, Quebec, Canada
Created by Dataset Agent
Overview
The Iris Dataset is one of the best-known datasets in machine learning, often called the "Hello World" of data science. Introduced by British statistician Ronald A. Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems," this multivariate dataset has become a standard example for teaching classification algorithms and benchmarking new methods.
The dataset contains 150 samples of iris flowers, 50 from each of 3 species: Iris setosa, Iris versicolor, and Iris virginica. This perfect class balance is rare in real-world data.
Understanding the Iris Flower
Before diving into the data, it's essential to understand what we're measuring. Many beginners encounter terms like "sepal" and "petal" without knowing what they mean botanically:
- Sepals are the outer protective parts of the flower that enclose the bud before it blooms. In most flowers, sepals are green and leaf-like, but in iris flowers, they're colorful and often mistaken for petals. The three outer, drooping segments of an iris are actually sepals (called "falls" by gardeners).
- Petals are the inner, typically colorful parts that attract pollinators. In iris flowers, these are the three upright segments (called "standards"). They're usually smaller than the sepals in iris species.
- Length is measured from base to tip along the longest axis
- Width is measured at the widest point perpendicular to the length
Unlike typical flowers where petals are larger than sepals, iris flowers have prominent sepals that are often more visually striking than the petals—making accurate measurement crucial for species identification.
Dataset Statistics by Species
The three species show dramatically different morphological characteristics, which is what makes this dataset excellent for classification tasks:
Summary Statistics: Sepal Length (cm) by Species
| Species | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Iris-setosa | 5.01 | 0.35 | 4.3 | 5.8 |
| Iris-versicolor | 5.94 | 0.52 | 4.9 | 7.0 |
| Iris-virginica | 6.59 | 0.64 | 4.9 | 7.9 |
Summary Statistics: Petal Length (cm) by Species
| Species | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Iris-setosa | 1.46 | 0.17 | 1.0 | 1.9 |
| Iris-versicolor | 4.26 | 0.47 | 3.0 | 5.1 |
| Iris-virginica | 5.55 | 0.55 | 4.5 | 6.9 |
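The most striking difference: Iris setosa petal length averages just 1.46 cm, while Iris virginica averages 5.55 cm, about 3.8 times as long. The ranges do not even overlap: the longest setosa petal (1.9 cm) is shorter than the shortest versicolor petal (3.0 cm). This gap makes setosa linearly separable from the other species.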
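As a quick check of the petal length comparison, the sketch below recomputes the ratio of the means and tests whether the setosa range overlaps the other two species. The numbers are copied from the petal length table above; the dictionary layout and names such as `petal_length_stats` are illustrative assumptions, not part of the dataset.

```python
# Petal length summary statistics (cm), copied from the table above.
petal_length_stats = {
    "Iris-setosa": {"mean": 1.46, "min": 1.0, "max": 1.9},
    "Iris-versicolor": {"mean": 4.26, "min": 3.0, "max": 5.1},
    "Iris-virginica": {"mean": 5.55, "min": 4.5, "max": 6.9},
}

setosa = petal_length_stats["Iris-setosa"]
others = [s for name, s in petal_length_stats.items() if name != "Iris-setosa"]

# Ratio of the virginica mean to the setosa mean.
mean_ratio = petal_length_stats["Iris-virginica"]["mean"] / setosa["mean"]

# Setosa is separable on this feature if its longest petal is shorter
# than the shortest petal of every other species.
gap = min(s["min"] for s in others) - setosa["max"]
setosa_separable = gap > 0

print(f"virginica/setosa mean ratio: {mean_ratio:.2f}")
print(f"gap between setosa and the rest: {gap:.1f} cm")
print(f"setosa separable on petal length: {setosa_separable}")
```

The same check applied to versicolor and virginica would fail, since their ranges (3.0 to 5.1 cm and 4.5 to 6.9 cm) overlap.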
Feature Correlations and Predictive Power
Not all features are equally useful for classification. Understanding which measurements matter most helps you build better models:
Feature Correlation Matrix
| Feature Pair | Correlation | Predictive Value |
|---|---|---|
| Petal Length ↔ Petal Width | 0.963 | ★★★★★ Excellent |
| Sepal Length ↔ Petal Length | 0.872 | ★★★★☆ Very Good |
| Sepal Length ↔ Petal Width | 0.818 | ★★★★☆ Very Good |
| Sepal Length ↔ Sepal Width | -0.109 | ★☆☆☆☆ Poor |
| Sepal Width ↔ Petal Width | -0.357 | ★★☆☆☆ Weak |
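Petal measurements are far more predictive than sepal measurements for species classification. Note that the table reports correlations between features, so a high value indicates redundancy rather than predictive power on its own: the 0.963 correlation between petal length and petal width means the two features carry largely overlapping information, and you could use just one and lose little accuracy.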
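For readers who want to see how such a coefficient is computed, here is a minimal pure-Python sketch of the Pearson correlation. The helper name `pearson` is an assumption for illustration, and the inputs are only the six rows shown in the sample preview later in this document, so the result will not match the full-dataset value of 0.963.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Six rows (two per species) from the sample preview in this document.
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

r = pearson(petal_length, petal_width)
print(f"petal length vs petal width: r = {r:.3f}")
```

Running the same function over all 150 rows for each feature pair would reproduce the kind of table shown above.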
Pro tip: When building models, try using only petal_length and petal_width first. You'll often achieve 95%+ accuracy with just 2 features instead of 4, making your model simpler and more interpretable.
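To illustrate how far a reduced feature set can go, the sketch below takes the idea one step further and uses petal length alone, because its per-species means are listed in the summary table above. It is a nearest-mean classifier, a deliberately simple stand-in rather than a recommended model; the names `MEAN_PETAL_LENGTH` and `predict_species` are hypothetical.

```python
# Per-species mean petal length (cm), copied from the summary table above.
MEAN_PETAL_LENGTH = {
    "Iris-setosa": 1.46,
    "Iris-versicolor": 4.26,
    "Iris-virginica": 5.55,
}

def predict_species(petal_length):
    """Return the species whose mean petal length is closest to the input."""
    return min(
        MEAN_PETAL_LENGTH,
        key=lambda species: abs(MEAN_PETAL_LENGTH[species] - petal_length),
    )

# (petal_length, true species) pairs from the sample preview in this document.
samples = [
    (1.4, "Iris-setosa"), (1.4, "Iris-setosa"),
    (4.7, "Iris-versicolor"), (4.5, "Iris-versicolor"),
    (6.0, "Iris-virginica"), (5.1, "Iris-virginica"),
]

correct = sum(predict_species(pl) == label for pl, label in samples)
accuracy = correct / len(samples)
print(f"accuracy on {len(samples)} preview rows: {accuracy:.2f}")
```

Six rows are far too few to estimate accuracy, so treat the printed score only as a smoke test. A proper comparison would train on petal_length and petal_width together and evaluate on a held-out split of all 150 rows.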
Linear Separability: The Key Insight
One of the most important characteristics of this dataset—and why it's used to teach classification—is its partial linear separability:
- Iris setosa is linearly separable from versicolor and virginica. A simple rule like "petal_length < 2.5 cm" correctly identifies 100% of setosa samples.
- Versicolor and virginica overlap in feature space. No straight line (or hyperplane, when all four features are used) can perfectly separate them, so you either need a more flexible model or must accept some misclassification.
- This makes the dataset ideal for demonstrating why simple linear classifiers work sometimes but not always.
Using petal_length < 2.5 cm as a threshold, all 50 setosa samples are correctly isolated with zero false positives from the other 100 samples.
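The threshold rule can be written as a one-line predicate. The sketch below applies it to the six rows from the sample preview in this document and counts errors; the function name `is_setosa` and the row layout are assumptions made for illustration.

```python
SETOSA_THRESHOLD_CM = 2.5  # threshold quoted in the text above

def is_setosa(petal_length_cm):
    """Single-feature rule: a petal shorter than 2.5 cm means setosa."""
    return petal_length_cm < SETOSA_THRESHOLD_CM

# (petal_length, species) pairs from the sample preview in this document.
rows = [
    (1.4, "Iris-setosa"), (1.4, "Iris-setosa"),
    (4.7, "Iris-versicolor"), (4.5, "Iris-versicolor"),
    (6.0, "Iris-virginica"), (5.1, "Iris-virginica"),
]

predicted = [is_setosa(pl) for pl, _ in rows]
actual = [species == "Iris-setosa" for _, species in rows]

# A false positive is a non-setosa row flagged as setosa; a false
# negative is a setosa row the rule misses.
false_positives = sum(p and not a for p, a in zip(predicted, actual))
false_negatives = sum(a and not p for p, a in zip(predicted, actual))
print(f"false positives: {false_positives}, false negatives: {false_negatives}")
```

Any threshold between the setosa maximum (1.9 cm) and the smallest value in the other two species (3.0 cm) would give the same result.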
Historical Context
The Iris dataset was introduced by Ronald A. Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems" published in the Annals of Eugenics. Fisher used this data to demonstrate linear discriminant analysis (LDA), a method he developed for classifying observations into predefined categories.
The actual measurements were collected by Edgar Anderson, an American botanist who gathered the data from iris flowers on the Gaspé Peninsula in Quebec, Canada. Anderson's meticulous measurements of 50 specimens from each of the three species provided Fisher with a perfectly balanced dataset for his statistical analysis. The collaboration between Anderson's fieldwork and Fisher's statistical methods exemplifies early interdisciplinary data science.
Known Data Quality Issues
The UCI Machine Learning Repository version contains two documented transcription errors that have propagated through many copies of this dataset.
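The known errors are in the 35th and 38th samples (1-indexed; indices 34 and 37 when counting from 0), both Iris setosa. According to the documentation distributed with the UCI copy:
- Sample 35 should read 4.9, 3.1, 1.5, 0.2; the error is in the fourth feature (petal width).
- Sample 38 should read 4.9, 3.6, 1.4, 0.1; the errors are in the second and third features (sepal width and petal length).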
For most educational purposes, these minor errors don't significantly impact results. However, if you're publishing research or need exact reproducibility, consider using the corrected version from Fisher's original paper or noting which version you used.
Why This Dataset is Perfect for Beginners
The Iris dataset has specific properties that make it ideal for learning machine learning fundamentals:
Limitations for Real-World ML
While excellent for learning, the Iris dataset has significant limitations that don't reflect real-world machine learning challenges:
- Too small: 150 samples is trivial—modern datasets have millions of records
- Too clean: No missing values, no outliers, no noise—unrealistic for production data
- Too balanced: Real classification problems often have 99:1 or worse class imbalance
- Too easy: Most algorithms achieve 95%+ accuracy, making it hard to compare methods
- Too simple for deep learning: Neural networks need thousands of samples to show their advantages over simpler methods
If your model achieves less than 90% accuracy on Iris, there's likely a bug in your code. If it achieves 100%, you may be overfitting or evaluating on training data. Target 95-98% accuracy as a sanity check.
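The rule of thumb above can be encoded as a small helper for test scripts. This is only a sketch of the heuristic as stated; the function name `sanity_check` and the returned messages are hypothetical, and on a very small test split a perfect score can occur by chance.

```python
def sanity_check(accuracy):
    """Map a test accuracy in [0, 1] to the rule of thumb from the text."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if accuracy < 0.90:
        return "suspiciously low: look for a bug"
    if accuracy == 1.0:
        return "suspiciously high: check for overfitting or train/test leakage"
    if 0.95 <= accuracy <= 0.98:
        return "within the expected 95-98% range"
    # Anything else (0.90-0.95 or 0.98-1.0) is neither flagged nor on target.
    return "plausible, but outside the 95-98% target"

for score in (0.62, 0.93, 0.97, 1.0):
    print(f"{score:.2f}: {sanity_check(score)}")
```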
Sample Data Preview
First 2 Records from Each Species
| # | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| 5 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
| 6 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
Expected Model Performance
Use these benchmarks to validate your implementations:
Table Overview
iris
Data Preview
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Showing 5 of 150 rows
Data Profile
- Rows: 150
- Columns: 5
- Completeness: 100%
- Estimated size: 36.6 KB
- Column types: 4 numeric, 1 text
Data Dictionary
iris
| Column | Type | Example | Missing Values |
|---|---|---|---|
| sepal_length | numeric | 5.1, 4.9 | 0 |
| sepal_width | numeric | 3.5, 3 | 0 |
| petal_length | numeric | 1.4, 1.4 | 0 |
| petal_width | numeric | 0.2, 0.2 | 0 |
| species | string | "Iris-setosa", "Iris-setosa" | 0 |
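The data dictionary translates directly into a typed loader. The sketch below assumes the table is exported as comma-separated text with a header row using the column names above, which the page does not confirm; the inline `CSV_TEXT` stands in for a real file and contains only the five preview rows.

```python
import csv
import io

NUMERIC_COLUMNS = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# Inline stand-in for the real file: the five preview rows shown above.
CSV_TEXT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa
"""

def load_rows(text):
    """Parse CSV text into dicts with float measurements and a string label."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # float() raises ValueError on an empty cell, so a missing
        # measurement fails loudly instead of slipping through.
        row = {name: float(raw[name]) for name in NUMERIC_COLUMNS}
        row["species"] = raw["species"]
        rows.append(row)
    return rows

rows = load_rows(CSV_TEXT)
print(f"loaded {len(rows)} rows; first row: {rows[0]}")
```

To read an actual file, pass its contents to `load_rows`, for example via `open(path, newline="").read()`, where `path` is wherever you saved the export.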