Agents for Data

Wholesale Customers Dataset

440 wholesale customer records from a Portuguese distributor with annual spending across 6 product categories (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen). A benchmark dataset for customer segmentation, K-means clustering, and market basket analysis.

customer-segmentation, clustering, k-means, unsupervised-learning, machine-learning, retail-analytics, market-basket-analysis, UCI, benchmark-dataset, Portugal, wholesale, B2B
1 table, 440 rows
Last updated: January 2, 2026
Time: 2011
Location: Portugal (Lisbon, Oporto, and other regions)
Created by Dataset Agent

Overview

The Wholesale Customers Dataset captures real purchasing behavior from a wholesale distributor operating in Portugal, originally compiled by Nuno Abreu for his 2011 master's thesis and later published through the UCI Machine Learning Repository. It has become a widely used benchmark for unsupervised machine learning, particularly customer segmentation and the evaluation of clustering algorithms.
The dataset contains 440 customer records with 8 variables and zero missing values, so no imputation is required before analysis.
View Source
SQL
SELECT COUNT(*) AS total_records, COUNT(Channel) AS channel_non_null, COUNT(Fresh) AS fresh_non_null FROM customers.csv
Data
Total Records | Channel Non Null | Fresh Non Null
440 | 440 | 440
1 row
What distinguishes this dataset is its dual nature: it supports both classification tasks (predicting customer channel based on spending) and clustering tasks (discovering natural customer segments). The clear separation between Hotel/Restaurant/Café (Horeca) and Retail customers creates interpretable ground truth for validating unsupervised learning results.

Understanding the Variables

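The dataset comprises 2 categorical identifiers and 6 continuous spending features. Channel is coded 1 = Horeca and 2 = Retail; Region is coded 1 = Lisbon, 2 = Oporto, and 3 = Other. The six spending features (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen) record annual spending in monetary units (interpretable as Euros).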
Why Channel Matters: Horeca customers (hotels, restaurants, cafés) purchase for food service operations—they need bulk fresh ingredients and frozen items. Retail customers stock shelves—they need grocery items and household supplies like detergents. This fundamental business difference creates naturally separable clusters.

Data Preview

Sample Customer Records (First 5 Rows)
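Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen
2 | 3 | 12,669 | 9,656 | 7,561 | 214 | 2,674 | 1,338
2 | 3 | 7,057 | 9,810 | 9,568 | 1,762 | 3,293 | 1,776
2 | 3 | 6,353 | 8,808 | 7,684 | 2,405 | 3,516 | 7,844
1 | 3 | 13,265 | 1,196 | 4,221 | 6,404 | 507 | 1,788
2 | 3 | 22,615 | 5,410 | 7,198 | 3,915 | 1,777 | 5,185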
5 rows
View Source
SQL
SELECT * FROM customers.csv LIMIT 5
Notice how the fourth row (Channel 1 = Horeca) shows higher Fresh (13,265) and Frozen (6,404) spending but much lower Milk (1,196) and Detergents_Paper (507) than the three Retail customers above it. The channel averages later on this page point in the same direction.

Descriptive Statistics

Summary Statistics for Spending Variables
Variable | Min | Max | Mean | Std Dev | Median
Fresh | 3 | 112,151 | 12,000 | 12,647 | 8,504
Milk | 55 | 73,498 | 5,796 | 7,380 | 3,627
Grocery | 3 | 92,780 | 7,951 | 9,503 | 4,756
Frozen | 25 | 60,869 | 3,072 | 4,855 | 1,526
Detergents_Paper | 3 | 40,827 | 2,881 | 4,768 | 816
Delicassen | 3 | 47,943 | 1,525 | 2,820 | 965
6 rows
View Source
SQL
SELECT ROUND(MIN(Fresh)) AS min_fresh, ROUND(MAX(Fresh)) AS max_fresh, ROUND(AVG(Fresh)) AS mean_fresh, ROUND(STDDEV(Fresh), 0) AS std_fresh, ROUND(MEDIAN(Fresh)) AS median_fresh FROM customers.csv
Right-Skewed Distributions: All spending variables show means significantly higher than medians, indicating right-skewed distributions with high-spending outliers. Consider log-transformation before clustering to reduce the influence of extreme values.
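As a quick check of the skew claim, the sketch below recomputes the mean-to-median ratio for each variable using only the values printed in the summary table. The helper name `mean_to_median_ratio` is illustrative, and a ratio above 1 is only a rough indicator of right skew, not a formal test.

```python
# Mean and median per variable, copied from the summary statistics table.
summary = {
    "Fresh": (12000, 8504),
    "Milk": (5796, 3627),
    "Grocery": (7951, 4756),
    "Frozen": (3072, 1526),
    "Detergents_Paper": (2881, 816),
    "Delicassen": (1525, 965),
}


def mean_to_median_ratio(mean, median):
    """Return mean / median; values above 1 suggest a right-skewed variable."""
    return mean / median


ratios = {name: mean_to_median_ratio(m, med) for name, (m, med) in summary.items()}

# Print variables from most to least skewed by this rough measure.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean/median = {ratio:.2f}")
```

By this measure Detergents_Paper is the most skewed variable, so it is likely to benefit most from a log transformation.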

Customer Distribution Analysis

The dataset includes 298 Horeca customers (67.7%) and 142 Retail customers (32.3%), reflecting the distributor's business focus on food service clients.
View Source
SQL
SELECT Channel, COUNT(*) AS count, ROUND( COUNT(*) * 100.0 / ( SELECT COUNT(*) FROM customers.csv ), 1 ) AS percentage FROM customers.csv GROUP BY Channel
Data
Channel | Name | Count | Percentage
1 | Horeca (Hotels/Restaurants/Cafés) | 298 | 67.7
2 | Retail | 142 | 32.3
2 rows
Geographically, 77 customers are from Lisbon (17.5%), 47 are from Oporto (10.7%), and 316 are from other Portuguese regions (71.8%).
View Source
SQL
SELECT Region, COUNT(*) AS count FROM customers.csv GROUP BY Region ORDER BY Region
Data
Region | Name | Customer Count
1 | Lisbon | 77
2 | Oporto | 47
3 | Other Regions | 316
3 rows

Channel Comparison: Horeca vs Retail Spending Patterns

The two customer channels exhibit dramatically different purchasing behaviors, which is why this dataset excels for both classification (supervised) and segmentation (unsupervised) tasks.
View Source
SQL
SELECT Channel, ROUND(AVG(Fresh)) AS avg_fresh, ROUND(AVG(Milk)) AS avg_milk, ROUND(AVG(Grocery)) AS avg_grocery, ROUND(AVG(Frozen)) AS avg_frozen, ROUND(AVG(Detergents_Paper)) AS avg_detergents, ROUND(AVG(Delicassen)) AS avg_delicassen FROM customers.csv GROUP BY Channel
Data
Category | Horeca (Channel 1) | Retail (Channel 2)
Fresh | 13,476 | 8,904
Milk | 3,452 | 10,717
Grocery | 3,962 | 16,323
Frozen | 3,748 | 1,653
Detergents_Paper | 791 | 7,270
Delicassen | 1,416 | 1,753
6 rows
Key Business Insight: Horeca customers spend 51% more on Fresh and 127% more on Frozen items—they're stocking kitchens. Retail customers spend 312% more on Grocery and 819% more on Detergents_Paper—they're stocking shelves for consumers. This pattern makes Channel highly predictable from spending data.
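The percentages in this insight follow directly from the channel averages in the table. A minimal sketch of the arithmetic, using only those published averages (the function name `pct_more` is illustrative):

```python
# Average annual spend per channel, copied from the comparison table:
# category -> (Horeca average, Retail average).
averages = {
    "Fresh": (13476, 8904),
    "Frozen": (3748, 1653),
    "Grocery": (3962, 16323),
    "Detergents_Paper": (791, 7270),
}


def pct_more(larger, smaller):
    """Percentage by which `larger` exceeds `smaller`."""
    return (larger / smaller - 1) * 100


for category, (horeca, retail) in averages.items():
    if horeca > retail:
        print(f"Horeca spends {pct_more(horeca, retail):.0f}% more on {category}")
    else:
        print(f"Retail spends {pct_more(retail, horeca):.0f}% more on {category}")
```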

Correlation Analysis for Market Basket Insights

Strong correlations between product categories reveal natural purchasing patterns valuable for cross-selling strategies and understanding customer behavior.
The strongest correlation is between Grocery and Detergents_Paper (r=0.925)—customers who buy groceries also buy household supplies. This reflects retail supermarket purchasing patterns.
View Source
SQL
SELECT ROUND(CORR(Grocery, Detergents_Paper), 3) AS grocery_detergents_corr FROM customers.csv
Data
Grocery Detergents Corr
0.925
1 row
Product Category Correlation Matrix
Category Pair | Correlation (r) | Interpretation
Grocery ↔ Detergents_Paper | 0.925 | Very Strong - Retail basket pattern
Milk ↔ Grocery | 0.728 | Strong - Supermarket staples
Milk ↔ Detergents_Paper | 0.662 | Moderate-Strong - Consumer goods
Fresh ↔ Frozen | 0.346 | Weak-Moderate - Food service pattern
Fresh ↔ Delicassen | 0.205 | Weak - Limited association
5 rows
View Source
SQL
SELECT ROUND(CORR(Fresh, Frozen), 3), ROUND(CORR(Milk, Grocery), 3), ROUND(CORR(Milk, Detergents_Paper), 3), ROUND(CORR(Grocery, Detergents_Paper), 3), ROUND(CORR(Fresh, Delicassen), 3) FROM customers.csv
The correlation structure points to two purchasing groups: (1) a Retail-style group where Milk, Grocery, and Detergents_Paper are strongly correlated (r between 0.66 and 0.93), and (2) a Horeca-style group where Fresh and Frozen show a weaker association (r=0.346). This structure helps explain why K-means clustering tends to recover interpretable segments on this dataset.
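For readers who want to see what the SQL `CORR` function computes, here is a minimal pure-Python sketch of the Pearson coefficient. It is run on the five preview rows only, so the result is illustrative and will not match the full-dataset figure of 0.925. The function name `pearson_r` is an assumption for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Grocery and Detergents_Paper values from the five preview rows.
grocery = [7561, 9568, 7684, 4221, 7198]
detergents_paper = [2674, 3293, 3516, 507, 1777]

r = pearson_r(grocery, detergents_paper)
print(f"r on the five preview rows = {r:.3f}")
```

Even on this tiny sample the two categories move together, which is consistent with the direction of the full-dataset result.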

Total Spending Analysis

Total annual spending across all customers is €14.62 million, with Fresh products accounting for €5.28 million (36.1%) of all purchases.
View Source
SQL
SELECT SUM(Fresh) + SUM(Milk) + SUM(Grocery) + SUM(Frozen) + SUM(Detergents_Paper) + SUM(Delicassen) AS total_all, SUM(Fresh) AS total_fresh FROM customers.csv
Data
Total All | Total Fresh
14,619,500 | 5,280,131
1 row
View Source
SQL
SELECT SUM(Fresh), SUM(Milk), SUM(Grocery), SUM(Frozen), SUM(Detergents_Paper), SUM(Delicassen) FROM customers.csv
Data
Category | Total Spending (€) | Percentage
Fresh | 5,280,131 | 36.1%
Grocery | 3,498,562 | 23.9%
Milk | 2,550,357 | 17.4%
Frozen | 1,351,650 | 9.2%
Detergents_Paper | 1,267,857 | 8.7%
Delicassen | 670,943 | 4.6%
6 rows
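The percentage column is not produced by the SQL query shown; it can be derived from the category totals. A short sketch of that step, using only the totals from the table:

```python
# Category totals in euros, copied from the table.
totals = {
    "Fresh": 5_280_131,
    "Grocery": 3_498_562,
    "Milk": 2_550_357,
    "Frozen": 1_351_650,
    "Detergents_Paper": 1_267_857,
    "Delicassen": 670_943,
}

grand_total = sum(totals.values())

# Share of each category as a percentage of all spending, to one decimal.
shares = {name: round(100 * value / grand_total, 1) for name, value in totals.items()}

print(f"Grand total: {grand_total:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The six totals sum to 14,619,500, which agrees with the headline figure of €14.62 million.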

High-Value Customer Identification

The top 10% of customers (44 records) account for €5.89 million (40.3%) of total spending, demonstrating significant customer value concentration.
View Source
SQL
SELECT SUM( Fresh + Milk + Grocery + Frozen + Detergents_Paper + Delicassen ) AS top_10_spending FROM ( SELECT *, Fresh + Milk + Grocery + Frozen + Detergents_Paper + Delicassen AS total FROM customers.csv ORDER BY total DESC LIMIT 44 )
Data
Top 10 Spending
5,890,000
1 row
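The 40.3% figure is the top-decile total divided by the overall total. The sketch below shows that arithmetic with the two published sums, plus a small helper for computing the same share from a list of per-customer totals. The helper `top_share` and the ten sample values are hypothetical, chosen only to illustrate the calculation.

```python
def top_share(totals, fraction=0.10):
    """Share of overall spend contributed by the top `fraction` of customers."""
    ranked = sorted(totals, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of customers in the top group
    return sum(ranked[:k]) / sum(ranked)


# Published figures: top 44 customers versus all 440 customers.
published_share = 5_890_000 / 14_619_500
print(f"Top 10% share from published sums: {published_share:.1%}")

# Hypothetical per-customer totals, for illustration only.
sample_totals = [100, 20, 20, 10, 10, 10, 10, 10, 5, 5]
print(f"Top 10% share in the sample: {top_share(sample_totals):.1%}")
```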
Top 5 Highest-Spending Customers
Rank | Channel | Region | Total Spend (€) | Top Category
1 | Horeca | Other | 176,040 | Fresh (€112,151)
2 | Retail | Other | 163,968 | Grocery (€92,780)
3 | Horeca | Other | 154,003 | Fresh (€76,237)
4 | Retail | Other | 140,567 | Milk (€73,498)
5 | Horeca | Lisbon | 128,940 | Fresh (€56,082)
5 rows
View Source
SQL
SELECT Channel, Region, Fresh + Milk + Grocery + Frozen + Detergents_Paper + Delicassen AS total_spend, Fresh, Milk, Grocery FROM customers.csv ORDER BY total_spend DESC LIMIT 5

Data Preprocessing Considerations

Before applying machine learning algorithms, consider these preprocessing steps based on the data characteristics:
  • Standardization is essential: Spending ranges vary dramatically (Fresh: 3-112,151 vs Delicassen: 3-47,943). Use StandardScaler before K-means or hierarchical clustering.
  • Log-transformation recommended: All variables are right-skewed with outliers. A log(x+1) transformation makes the distributions more symmetric and reduces outlier influence.
  • One-hot encode Channel/Region: For algorithms requiring numeric input, convert categorical variables. However, for pure clustering, consider excluding these to discover segments independently.
  • Outlier strategy: The dataset has legitimate high-spenders (not errors). Consider Winsorization (capping at 95th percentile) rather than removal for robust clustering.
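The first two steps above can be sketched without any third-party library. This is a minimal illustration on the five preview rows, not a full pipeline: it applies log(x+1) and then rescales each column to zero mean and unit variance. It divides by the population standard deviation, which is also the convention scikit-learn's StandardScaler follows. The function name `log1p_standardize` is an assumption for this example.

```python
import math
import statistics


def log1p_standardize(values):
    """Apply log(x + 1), then rescale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)  # population standard deviation (ddof = 0)
    return [(v - mean) / std for v in logged]


# Two columns from the five preview rows.
columns = {
    "Fresh": [12669, 7057, 6353, 13265, 22615],
    "Detergents_Paper": [2674, 3293, 3516, 507, 1777],
}

scaled = {name: log1p_standardize(col) for name, col in columns.items()}

for name, col in scaled.items():
    print(name, [round(v, 2) for v in col])
```

After this step both columns are on a comparable scale, so neither dominates a Euclidean distance of the kind K-means uses.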
Pro Tip: When evaluating clustering results, use Channel as ground truth validation. If your algorithm naturally separates Horeca from Retail customers, it's capturing meaningful business segments.
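One simple way to compare cluster assignments against Channel is cluster purity: the fraction of records that belong to the majority channel of their cluster. The sketch below is one possible measure, not the only one; the function name `purity` and the cluster labels are hypothetical, while the channel values come from the five preview rows.

```python
from collections import Counter


def purity(cluster_labels, channels):
    """Fraction of records that match the majority channel of their cluster."""
    by_cluster = {}
    for cluster, channel in zip(cluster_labels, channels):
        by_cluster.setdefault(cluster, Counter())[channel] += 1
    # For each cluster, count the records in its most common channel.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(channels)


channels = [2, 2, 2, 1, 2]   # Channel column of the five preview rows
clusters = [0, 0, 0, 1, 1]   # hypothetical output of a clustering algorithm

print(f"Purity: {purity(clusters, channels):.2f}")
```

Purity rises trivially as the number of clusters grows, so it is most informative when comparing solutions with the same number of clusters.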

Why This Dataset for Clustering?

The Wholesale Customers Dataset has become a benchmark for clustering algorithms because it offers:
  • Manageable size: 440 records allow quick iteration during algorithm development and parameter tuning
  • Real-world origin: Actual business data with interpretable features, unlike synthetic datasets
  • Natural clusters: Channel variable provides ground truth for validating unsupervised results
  • Appropriate complexity: 6 continuous features create non-trivial clustering challenges without overwhelming dimensionality
  • Clean data: Zero missing values means immediate usability
  • Business interpretability: Discovered segments translate directly to marketing strategies

Limitations and Considerations

Geographic Limitation: This dataset represents a single wholesale distributor in Portugal. Purchasing patterns may not generalize to other countries, markets, or time periods. The data was collected circa 2011.
  • Sample size: 440 records may be insufficient for complex deep learning models or detecting rare customer segments
  • Temporal snapshot: Annual totals don't capture seasonality, trends, or purchasing frequency
  • Limited demographics: No customer size, age, or detailed business type information beyond Channel
  • Currency/inflation: Monetary values from 2011 Portugal may need adjustment for contemporary analysis
  • Class imbalance: The roughly 2:1 ratio of Horeca to Retail customers (298 to 142) may bias some classification algorithms

Comparison to Alternative Datasets

Choose the Wholesale Customers Dataset when you need a clean, well-understood benchmark for comparing clustering algorithms, teaching unsupervised learning concepts, or rapidly prototyping customer segmentation approaches before scaling to larger datasets.

Table Overview

customers

Contains 440 rows and 8 columns. Column types: 8 numeric.


Data Profile

440 rows, 8 columns, 100% complete, 171.9 KB estimated size

Column Types

8 Numeric

High-Cardinality Columns

Columns with many unique values. In this dataset they are the six continuous spending features, not identifiers or categorical features.

  • Fresh (433 unique values)
  • Grocery (430 unique values)
  • Frozen (426 unique values)
  • Milk (421 unique values)
  • Detergents_Paper (417 unique values)
  • Delicassen (403 unique values)

Data Dictionary

customers

Column | Type | Example | Missing Values
Channel | numeric | 2, 2 | 0
Region | numeric | 3, 3 | 0
Fresh | numeric | 12669, 7057 | 0
Milk | numeric | 9656, 9810 | 0
Grocery | numeric | 7561, 9568 | 0
Frozen | numeric | 214, 1762 | 0
Detergents_Paper | numeric | 2674, 3293 | 0
Delicassen | numeric | 1338, 1776 | 0
Last updated: January 2, 2026
Created: January 2, 2026