Last updated: January 2, 2026
Time: 2011
Location: Portugal (Lisbon, Oporto, and other regions)
Created by Dataset Agent
Overview
The Wholesale Customers Dataset captures real purchasing behavior from a wholesale distributor operating in Portugal, originally compiled by Nuno Abreu for his 2011 master's thesis and later published through the UCI Machine Learning Repository. It is widely used as a benchmark for unsupervised machine learning, particularly customer segmentation and the evaluation of clustering algorithms.
The dataset contains 440 customer records with 8 variables and zero missing values, making it immediately usable without data cleaning.
What distinguishes this dataset is its dual nature: it supports both classification tasks (predicting customer channel based on spending) and clustering tasks (discovering natural customer segments). The clear separation between Hotel/Restaurant/Café (Horeca) and Retail customers creates interpretable ground truth for validating unsupervised learning results.
Understanding the Variables
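The dataset comprises 2 categorical identifiers (Channel and Region) and 6 continuous spending features (Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen). Each spending feature records annual spending in monetary units, interpretable as Euros.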
Why Channel Matters: Horeca customers (hotels, restaurants, cafés) purchase for food service operations—they need bulk fresh ingredients and frozen items. Retail customers stock shelves—they need grocery items and household supplies like detergents. This fundamental business difference creates naturally separable clusters.
Data Preview
Sample Customer Records (First 5 Rows)
| Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|---|
| 2 | 3 | 12,669 | 9,656 | 7,561 | 214 | 2,674 | 1,338 |
| 2 | 3 | 7,057 | 9,810 | 9,568 | 1,762 | 3,293 | 1,776 |
| 2 | 3 | 6,353 | 8,808 | 7,684 | 2,405 | 3,516 | 7,844 |
| 1 | 3 | 13,265 | 1,196 | 4,221 | 6,404 | 507 | 1,788 |
| 2 | 3 | 22,615 | 5,410 | 7,198 | 3,915 | 1,777 | 5,185 |
Notice how the fourth row (Channel 1 = Horeca) shows higher Fresh (13,265) and Frozen (6,404) spending but much lower Milk (1,196) and Detergents_Paper (507) than the three Retail customers above it. The fifth row (Retail, Fresh 22,615) shows that individual customers can deviate, but the pattern holds on average across the dataset.
Descriptive Statistics
Summary Statistics for Spending Variables
| Variable | Min | Max | Mean | Std Dev | Median |
|---|---|---|---|---|---|
| Fresh | 3 | 112,151 | 12,000 | 12,647 | 8,504 |
| Milk | 55 | 73,498 | 5,796 | 7,380 | 3,627 |
| Grocery | 3 | 92,780 | 7,951 | 9,503 | 4,756 |
| Frozen | 25 | 60,869 | 3,072 | 4,855 | 1,526 |
| Detergents_Paper | 3 | 40,827 | 2,881 | 4,768 | 816 |
| Delicassen | 3 | 47,943 | 1,525 | 2,820 | 965 |
Right-Skewed Distributions: All spending variables show means significantly higher than medians, indicating right-skewed distributions with high-spending outliers. Consider log-transformation before clustering to reduce the influence of extreme values.
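The skew can be checked directly from the summary table. The sketch below is illustrative only: it copies the rounded means and medians from the table above, computes the mean-to-median ratio for each variable, and shows how log(x+1) compresses the Fresh range. The variable names (`summary`, `ratios`, `fresh_points`) are assumptions made for this example.

```python
import math

# (mean, median) pairs copied from the summary statistics table above.
summary = {
    "Fresh": (12000, 8504),
    "Milk": (5796, 3627),
    "Grocery": (7951, 4756),
    "Frozen": (3072, 1526),
    "Detergents_Paper": (2881, 816),
    "Delicassen": (1525, 965),
}

# A mean well above the median is a quick indicator of right skew.
ratios = {name: mean / median for name, (mean, median) in summary.items()}
for name, ratio in ratios.items():
    print(f"{name:<17} mean/median = {ratio:.2f}")

# log(x + 1) compresses the long right tail. The Fresh min, median and max
# from the table show how the scale changes after the transform.
fresh_points = [3, 8504, 112151]
fresh_logged = [math.log1p(x) for x in fresh_points]
print([round(v, 2) for v in fresh_logged])
```

A ratio close to 1 would suggest a roughly symmetric distribution; every ratio here is above 1, and the largest belongs to Detergents_Paper.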
Customer Distribution Analysis
The dataset includes 298 Horeca customers (67.7%) and 142 Retail customers (32.3%), reflecting the distributor's business focus on food service clients.
Geographically, 77 customers are from Lisbon (17.5%), 47 are from Oporto (10.7%), and 316 are from other Portuguese regions (71.8%).
Channel Comparison: Horeca vs Retail Spending Patterns
The two customer channels exhibit dramatically different purchasing behaviors, which is why this dataset excels for both classification (supervised) and segmentation (unsupervised) tasks.
Key Business Insight: On average, Horeca customers spend 51% more on Fresh and 127% more on Frozen items than Retail customers, because they are stocking kitchens. Retail customers spend 312% more on Grocery and 819% more on Detergents_Paper than Horeca customers, because they are stocking shelves for consumers. These differences make Channel highly predictable from spending data.
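Percentages of this kind come from comparing per-channel means. The sketch below shows the calculation on the five preview rows only, so its output will not match the full-dataset figures quoted above; the helper names `channel_mean` and `pct_more` are hypothetical and introduced just for this illustration.

```python
from statistics import mean

COLUMNS = ["Channel", "Region", "Fresh", "Milk", "Grocery", "Frozen",
           "Detergents_Paper", "Delicassen"]

# The five preview rows from this page (Channel 1 = Horeca, 2 = Retail).
PREVIEW = [
    (2, 3, 12669, 9656, 7561, 214, 2674, 1338),
    (2, 3, 7057, 9810, 9568, 1762, 3293, 1776),
    (2, 3, 6353, 8808, 7684, 2405, 3516, 7844),
    (1, 3, 13265, 1196, 4221, 6404, 507, 1788),
    (2, 3, 22615, 5410, 7198, 3915, 1777, 5185),
]

def channel_mean(rows, channel, column):
    """Mean of `column` over the rows belonging to `channel`."""
    idx = COLUMNS.index(column)
    return mean(r[idx] for r in rows if r[0] == channel)

def pct_more(a, b):
    """Percentage by which a exceeds b (negative if a is smaller)."""
    return (a - b) / b * 100

# Compare Retail against Horeca for each spending column.
for col in COLUMNS[2:]:
    horeca = channel_mean(PREVIEW, 1, col)
    retail = channel_mean(PREVIEW, 2, col)
    print(f"{col:<17} Retail vs Horeca: {pct_more(retail, horeca):+.0f}%")
```

Running the same two helpers over all 440 rows is how figures like those in the insight above would be reproduced.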
Correlation Analysis for Market Basket Insights
Strong correlations between product categories reveal natural purchasing patterns valuable for cross-selling strategies and understanding customer behavior.
The strongest correlation is between Grocery and Detergents_Paper (r=0.925)—customers who buy groceries also buy household supplies. This reflects retail supermarket purchasing patterns.
Product Category Correlation Matrix
| Category Pair | Correlation (r) | Interpretation |
|---|---|---|
| Grocery ↔ Detergents_Paper | 0.925 | Very Strong - Retail basket pattern |
| Milk ↔ Grocery | 0.728 | Strong - Supermarket staples |
| Milk ↔ Detergents_Paper | 0.662 | Moderate-Strong - Consumer goods |
| Fresh ↔ Frozen | 0.346 | Weak-Moderate - Food service pattern |
| Fresh ↔ Delicassen | 0.205 | Weak - Limited association |
The correlation structure reveals two distinct purchasing clusters: (1) a Retail cluster where Milk, Grocery, and Detergents_Paper are highly correlated, and (2) a Horeca cluster where Fresh and Frozen show moderate association. This structure helps explain why K-means clustering on this dataset tends to recover interpretable segments.
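The r values in the table are Pearson correlation coefficients. As a minimal sketch, the function below computes Pearson's r from first principles and applies it to the Grocery and Detergents_Paper columns of the five preview rows. Five rows are far too few to reproduce the table's values, so the result is illustrative only; `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Root sums of squared deviations (denominator of r).
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Grocery and Detergents_Paper values from the five preview rows.
grocery = [7561, 9568, 7684, 4221, 7198]
detergents_paper = [2674, 3293, 3516, 507, 1777]

r_preview = pearson_r(grocery, detergents_paper)
print(f"Grocery vs Detergents_Paper (preview rows only): r = {r_preview:.3f}")
```

Applying the same function to every pair of the six spending columns over the full data gives the complete correlation matrix.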
Total Spending Analysis
Total annual spending across all customers is €14.62 million, with Fresh products accounting for €5.28 million (36.1%) of all purchases.
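These totals can be cross-checked against the summary statistics, since a column total equals its mean times the number of customers. The sketch below uses the rounded means from the summary table, so the results are approximate.

```python
N_CUSTOMERS = 440

# Rounded per-customer means from the summary statistics table.
means = {
    "Fresh": 12000,
    "Milk": 5796,
    "Grocery": 7951,
    "Frozen": 3072,
    "Detergents_Paper": 2881,
    "Delicassen": 1525,
}

# Column total = mean * number of customers.
totals = {name: value * N_CUSTOMERS for name, value in means.items()}
grand_total = sum(totals.values())
fresh_share = totals["Fresh"] / grand_total

print(f"Total: €{grand_total / 1e6:.2f} million")
print(f"Fresh: €{totals['Fresh'] / 1e6:.2f} million ({fresh_share:.1%})")
```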
High-Value Customer Identification
The top 10% of customers (44 records) account for €5.89 million (40.3%) of total spending, demonstrating significant customer value concentration.
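A concentration figure like this can be computed by ranking customers by total spend and summing the top slice. The sketch below uses a small hypothetical list of totals, not the real data, and the helper name `top_share` is an assumption for illustration.

```python
def top_share(values, fraction=0.10):
    """Share of the overall sum held by the top `fraction` of values."""
    ranked = sorted(values, reverse=True)
    # Number of records in the top slice (at least one).
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical per-customer totals, for illustration only.
spend = [500, 120, 90, 80, 60, 50, 40, 30, 20, 10]
print(f"Top 10% share: {top_share(spend):.1%}")
```

With 440 records and `fraction=0.10`, the top slice contains 44 customers, matching the count quoted above.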
Top 5 Highest-Spending Customers
| Rank | Channel | Region | Total Spend (€) | Top Category |
|---|---|---|---|---|
| 1 | Horeca | Other | 176,040 | Fresh (€112,151) |
| 2 | Retail | Other | 163,968 | Grocery (€92,780) |
| 3 | Horeca | Other | 154,003 | Fresh (€76,237) |
| 4 | Retail | Other | 140,567 | Milk (€73,498) |
| 5 | Horeca | Lisbon | 128,940 | Fresh (€56,082) |
Data Preprocessing Considerations
Before applying machine learning algorithms, consider these preprocessing steps based on the data characteristics:
- Standardization is essential: Spending ranges vary dramatically (Fresh: 3-112,151 vs Delicassen: 3-47,943). Use StandardScaler before K-means or hierarchical clustering.
- Log-transformation recommended: All variables are right-skewed with outliers. A log(x+1) transformation makes the distributions more symmetric and reduces outlier influence.
- One-hot encode Channel/Region: Both are stored as integer codes but are nominal categories, so one-hot encode them for algorithms that would otherwise treat the codes as ordered quantities. However, for pure clustering, consider excluding them so that segments are discovered independently.
- Outlier strategy: The dataset has legitimate high-spenders (not errors). Consider Winsorization (capping at 95th percentile) rather than removal for robust clustering.
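The steps above can be sketched as one short pipeline. This is a minimal illustration under stated assumptions: the input is synthetic log-normal data standing in for the six spending columns (not the real dataset), and the order of operations (cap, then log, then scale) is one reasonable choice rather than a prescribed one. In practice, capping and log-transforming may be alternatives rather than steps to combine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the six spending columns (440 x 6), NOT the real data.
X = rng.lognormal(mean=8.0, sigma=1.2, size=(440, 6))

# 1. Winsorize: cap each column at its own 95th percentile.
caps = np.percentile(X, 95, axis=0)
X_capped = np.minimum(X, caps)

# 2. Log-transform with log(x + 1) to reduce right skew.
X_logged = np.log1p(X_capped)

# 3. Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_logged)

print(X_scaled.mean(axis=0).round(3))
print(X_scaled.std(axis=0).round(3))
```

After scaling, every column contributes on a comparable scale to the Euclidean distances that K-means and hierarchical clustering rely on.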
Pro Tip: When evaluating clustering results, use Channel as ground truth validation. If your algorithm naturally separates Horeca from Retail customers, it's capturing meaningful business segments.
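One way to put this tip into practice is to compare cluster labels against Channel with the adjusted Rand index. The sketch below is hedged: it uses synthetic, well-separated data with invented group centers in place of the real records, keeping only the 298/142 group sizes from this page, so the score it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic log-scale features for two groups; the centers are invented
# for illustration and are NOT estimated from the real dataset.
horeca = rng.normal(loc=[9.0, 7.5, 7.8, 7.8, 6.0, 6.5], scale=0.6, size=(298, 6))
retail = rng.normal(loc=[8.5, 9.0, 9.5, 6.8, 8.5, 7.0], scale=0.6, size=(142, 6))
X = np.vstack([horeca, retail])
channel = np.array([1] * 298 + [2] * 142)  # held out, used only for validation

# Cluster on spending features only; Channel is never shown to K-means.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# 1.0 means perfect agreement with Channel; about 0 means chance level.
ari = adjusted_rand_score(channel, labels)
print(f"Adjusted Rand index vs Channel: {ari:.2f}")
```

The adjusted Rand index ignores how cluster labels are numbered, so it does not matter whether K-means calls the Horeca-like group 0 or 1.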
Why This Dataset for Clustering?
The Wholesale Customers Dataset has become a benchmark for clustering algorithms because it offers:
- Manageable size: 440 records allows quick iteration during algorithm development and parameter tuning
- Real-world origin: Actual business data with interpretable features, unlike synthetic datasets
- Natural clusters: Channel variable provides ground truth for validating unsupervised results
- Appropriate complexity: 6 continuous features create non-trivial clustering challenges without overwhelming dimensionality
- Clean data: Zero missing values means immediate usability
- Business interpretability: Discovered segments translate directly to marketing strategies
Limitations and Considerations
Geographic Limitation: This dataset represents a single wholesale distributor in Portugal. Purchasing patterns may not generalize to other countries, markets, or time periods. The data was collected circa 2011.
- Sample size: 440 records may be insufficient for complex deep learning models or detecting rare customer segments
- Temporal snapshot: Annual totals don't capture seasonality, trends, or purchasing frequency
- Limited demographics: No customer size, age, or detailed business type information beyond Channel
- Currency/inflation: Monetary values from 2011 Portugal may need adjustment for contemporary analysis
- Class imbalance: 2:1 ratio of Horeca to Retail may bias some classification algorithms
Comparison to Alternative Datasets
Choose the Wholesale Customers Dataset when you need a clean, well-understood benchmark for comparing clustering algorithms, teaching unsupervised learning concepts, or rapidly prototyping customer segmentation approaches before scaling to larger datasets.
Table Overview
customers
Data Profile
- 440 rows
- 8 columns
- 100% complete
- 171.9 KB estimated size
Column Types
8 Numeric
High-Cardinality Columns
Columns with many unique values. Here all six are continuous spending amounts, so they should be treated as numeric features rather than as identifiers or categories.
- Fresh (433 unique values)
- Grocery (430 unique values)
- Frozen (426 unique values)
- Milk (421 unique values)
- Detergents_Paper (417 unique values)
- Delicassen (403 unique values)
Data Dictionary
customers
| Column | Type | Example | Missing Values |
|---|---|---|---|
| Channel | numeric | 2, 2 | 0 |
| Region | numeric | 3, 3 | 0 |
| Fresh | numeric | 12669, 7057 | 0 |
| Milk | numeric | 9656, 9810 | 0 |
| Grocery | numeric | 7561, 9568 | 0 |
| Frozen | numeric | 214, 1762 | 0 |
| Detergents_Paper | numeric | 2674, 3293 | 0 |
| Delicassen | numeric | 1338, 1776 | 0 |