
Adult Income (Census) Dataset

Predict whether income exceeds $50K/yr based on 1994 US Census data. 32,561 records with 14 demographic, education, and employment features. A standard benchmark dataset for binary classification and algorithmic fairness research.

Tags: census-data, income-prediction, binary-classification, machine-learning-benchmark, algorithmic-fairness, bias-detection, demographics, socioeconomic, uci-repository, supervised-learning
1 table, 32,561 rows
Last updated: December 27, 2025
Time: 1994
Location: United States
Created by Dataset Agent

Overview

The Adult Income Dataset (also known as the Census Income Dataset) is one of the most widely used benchmarks for binary classification in machine learning. Extracted from the 1994 US Census Bureau database by Barry Becker and Ronny Kohavi, it poses the task of predicting whether an individual's annual income exceeds $50,000 based on demographic and employment attributes.
The dataset contains 32,561 instances with 14 attributes covering age, workclass, education, marital status, occupation, relationship, race, sex, capital gains/losses, hours worked, and native country.
SQL
SELECT COUNT(*) AS total_instances FROM adult_income.csv

| Total Instances |
|-----------------|
| 32,561 |
Originally donated to the UCI Machine Learning Repository in 1996, this dataset has become one of the most cited resources in machine learning literature—particularly for classification benchmarking, algorithmic fairness research, and studies examining income inequality and socioeconomic factors.

Dataset Characteristics

| Characteristic | Value |
|----------------|-------|
| Subject Area | Social Science |
| Task Type | Binary Classification |
| Feature Types | Categorical, Integer |
| Instances | 32,561 (training) |
| Features | 14 |
| Missing Values | Yes (encoded as '?') |
The data was extracted from the Census database using the conditions ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)), so every record describes a person older than 16 with an AGI above 100, a final weight above 1, and at least some weekly work hours.
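The extraction conditions can be read as a per-record predicate. The sketch below assumes each census record is a dict keyed by the variable names used in the filter; the record layout and the function name are illustrative and do not come from the source.

```python
def passes_extraction_filter(record: dict) -> bool:
    """Mirror ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))."""
    return (
        record["AAGE"] > 16        # older than 16
        and record["AGI"] > 100    # AGI above 100
        and record["AFNLWGT"] > 1  # final weight above 1
        and record["HRSWK"] > 0    # works at least some hours per week
    )


# Hypothetical records, for illustration only.
kept = passes_extraction_filter(
    {"AAGE": 39, "AGI": 30000, "AFNLWGT": 77516, "HRSWK": 40}
)
dropped = passes_extraction_filter(
    {"AAGE": 16, "AGI": 30000, "AFNLWGT": 77516, "HRSWK": 40}
)
```

Because the comparison is strict, a 16-year-old is excluded, which matches the minimum age of 17 in the data.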

Class Distribution and Imbalance

The target variable shows significant class imbalance: 24,720 individuals (75.9%) earn ≤$50K while only 7,841 (24.1%) earn >$50K.
SQL
SELECT income, COUNT(*) AS count, ROUND( 100.0 * COUNT(*) / ( SELECT COUNT(*) FROM adult_income.csv ), 1 ) AS percentage FROM adult_income.csv GROUP BY income

| Income | Count | Percentage |
|--------|-------|------------|
| <=50K | 24,720 | 75.9 |
| >50K | 7,841 | 24.1 |
This ~3:1 class imbalance matters for model training: a naive classifier that predicts ≤$50K for every instance achieves 75.9% accuracy, so that figure is the floor against which any model's accuracy should be judged.
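The majority-class baseline can be reproduced directly from the class counts. This is a small sketch using only the two counts reported above. The balanced-accuracy figure is added for illustration, to show why plain accuracy flatters this baseline.

```python
# Class counts as reported in the class distribution table.
counts = {"<=50K": 24720, ">50K": 7841}
total = sum(counts.values())

# A naive classifier always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Always predicting the majority class recalls every majority record
# and no minority record, so the mean per-class recall is 0.5.
recall_per_class = {
    label: 1.0 if label == majority_class else 0.0 for label in counts
}
balanced_accuracy = sum(recall_per_class.values()) / len(recall_per_class)

print(f"predict {majority_class}: accuracy={baseline_accuracy:.3f}, "
      f"balanced accuracy={balanced_accuracy:.2f}")
```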

Historical Context and Economic Interpretation

$50,000 in 1994 ≈ $103,000 in 2024 (adjusted for inflation). The income threshold represents upper-middle-class earnings for that era, roughly equivalent to the top 25% of individual earners.
The data captures the American workforce during a period of economic expansion following the early 1990s recession. Understanding this historical context is crucial when interpreting patterns—gender wage gaps, occupational distributions, and educational attainment rates have shifted significantly in the three decades since this data was collected.

Education and Income Correlation

Education level is strongly associated with income in this dataset. The share of people earning above the $50K threshold rises steeply with advanced degrees.
SQL
SELECT education, COUNT(*) AS total, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv GROUP BY education ORDER BY high_income_pct DESC

| Education | High Income % |
|-----------|---------------|
| Doctorate | 74.1 |
| Prof-school | 73.4 |
| Masters | 55.7 |
| Bachelors | 41.5 |
| Assoc-voc | 26.1 |
| Assoc-acdm | 24.8 |
| Some-college | 19 |
| HS-grad | 16 |
| 11th | 5.6 |
| Preschool | 0 |

10 rows shown
Individuals with a Doctorate have a 74.1% probability of earning >$50K, compared to just 16% for high school graduates—a 4.6x difference highlighting the economic returns to education in 1994.
SQL
SELECT education, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv WHERE education IN ('Doctorate', 'HS-grad') GROUP BY education

| Education | High Income % |
|-----------|---------------|
| Doctorate | 74.1 |
| HS-grad | 16 |

Understanding the fnlwgt (Final Weight) Field

The fnlwgt column is one of the most misunderstood features in this dataset. It represents the Census Bureau's estimate of how many people in the US population each record represents, calculated using demographic stratification.
The weighting methodology groups people by state, age, race, and sex, then assigns weights so that the sample totals match known population controls. For example, if the sample under-represents young Hispanic females in California, records matching that profile receive higher weights.
For most ML tasks, exclude fnlwgt from your features. It's a sampling artifact, not a predictive attribute. However, it's valuable for weighted statistical analysis when you need population-level estimates rather than sample-level results.
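The difference between sample-level and population-level estimates can be shown with a weighted share. The sketch below uses made-up weights and a hypothetical helper name. It illustrates the general idea of weighting each record by fnlwgt and is not the Census Bureau's procedure.

```python
def weighted_share(records, weight_key="fnlwgt"):
    """Population-weighted share of records with income '>50K'."""
    total_weight = sum(r[weight_key] for r in records)
    high_weight = sum(r[weight_key] for r in records if r["income"] == ">50K")
    return high_weight / total_weight


# Toy records with made-up weights, for illustration only.
sample = [
    {"fnlwgt": 100_000, "income": "<=50K"},
    {"fnlwgt": 300_000, "income": ">50K"},
    {"fnlwgt": 100_000, "income": "<=50K"},
    {"fnlwgt": 100_000, "income": "<=50K"},
]

# Sample-level estimate: every record counts once.
unweighted = sum(r["income"] == ">50K" for r in sample) / len(sample)

# Population-level estimate: each record counts fnlwgt times.
weighted = weighted_share(sample)
```

In this toy sample the single high-income record stands for as many people as the other three combined, so the weighted share (0.5) is twice the unweighted share (0.25).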

Occupation Analysis

SQL
SELECT occupation, COUNT(*) AS total, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv WHERE occupation != '?' GROUP BY occupation ORDER BY high_income_pct DESC LIMIT 10

| Occupation | Total Workers | High Income % |
|------------|---------------|---------------|
| Exec-managerial | 4,066 | 48.4 |
| Prof-specialty | 4,140 | 44.9 |
| Protective-serv | 649 | 32.5 |
| Tech-support | 928 | 30.5 |
| Sales | 3,650 | 26.9 |
| Craft-repair | 4,099 | 22.7 |
| Transport-moving | 1,597 | 20 |
| Adm-clerical | 3,770 | 13.4 |
| Machine-op-inspct | 2,002 | 13.2 |
| Farming-fishing | 994 | 10.5 |
Executive-managerial roles lead with a 48.4% high-income rate, followed by Professional-specialty at 44.9%. At the bottom of the ten occupations listed, administrative-clerical, machine-operator, and farming-fishing roles all fall below 15%.
SQL
SELECT occupation, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv WHERE occupation IN ('Exec-managerial', 'Prof-specialty') GROUP BY occupation ORDER BY high_income_pct DESC

| Occupation | High Income % |
|------------|---------------|
| Exec-managerial | 48.4 |
| Prof-specialty | 44.9 |

Age and Income Patterns

Ages range from 17 to 90 years, with a mean of 38.6 years and median of 37 years.
SQL
SELECT MIN(age) AS min_age, MAX(age) AS max_age, ROUND(AVG(age), 1) AS avg_age FROM adult_income.csv

| Min Age | Max Age | Avg Age |
|---------|---------|---------|
| 17 | 90 | 38.6 |
SQL
SELECT CASE WHEN age < 25 THEN '17-24' WHEN age < 35 THEN '25-34' WHEN age < 45 THEN '35-44' WHEN age < 55 THEN '45-54' WHEN age < 65 THEN '55-64' ELSE '65+' END AS age_group, COUNT(*) AS total, SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) AS high_income, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv GROUP BY age_group ORDER BY age_group

| Age Group | Total | High Income | High Income % |
|-----------|-------|-------------|---------------|
| 17-24 | 5,570 | 61 | 1.1 |
| 25-34 | 8,479 | 1,427 | 16.8 |
| 35-44 | 8,151 | 2,703 | 33.2 |
| 45-54 | 5,853 | 2,348 | 40.1 |
| 55-64 | 3,172 | 1,026 | 32.3 |
| 65+ | 1,336 | 276 | 20.7 |
Peak earning years occur between 45-54, where the high-income rate reaches 40.1%. Young workers (17-24) have only a 1.1% probability of earning >$50K, reflecting entry-level positions and limited experience. The decline after 55 likely reflects early retirement and reduced work hours.
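The same age buckets are easy to reproduce outside SQL, for example when engineering an age-group feature. This sketch mirrors the boundaries of the CASE expression in the query. The function name is illustrative, and the ages are those of the first eight records in the dataset's sample preview.

```python
from collections import Counter


def age_group(age: int) -> str:
    """Same bucket boundaries as the SQL CASE expression."""
    if age < 25:
        return "17-24"
    if age < 35:
        return "25-34"
    if age < 45:
        return "35-44"
    if age < 55:
        return "45-54"
    if age < 65:
        return "55-64"
    return "65+"


# Ages of the first eight sample records.
ages = [39, 50, 38, 53, 28, 37, 49, 52]
group_counts = Counter(age_group(a) for a in ages)
```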

Gender Disparity Analysis

The dataset reveals significant gender disparity: 30.6% of males earn >$50K compared to only 10.9% of females—a 2.8x difference.
SQL
SELECT sex, COUNT(*) AS total, SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) AS high_income, ROUND( 100.0 * SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) / COUNT(*), 1 ) AS high_income_pct FROM adult_income.csv GROUP BY sex

| Sex | Total | High Income | High Income % |
|-----|-------|-------------|---------------|
| Male | 21,790 | 6,662 | 30.6 |
| Female | 10,771 | 1,179 | 10.9 |
SQL
SELECT sex, SUM( CASE WHEN income = '<=50K' THEN 1 ELSE 0 END ) AS low_income, SUM( CASE WHEN income = '>50K' THEN 1 ELSE 0 END ) AS high_income FROM adult_income.csv GROUP BY sex

| Gender | ≤$50K | >$50K |
|--------|-------|-------|
| Female | 9,592 | 1,179 |
| Male | 15,128 | 6,662 |
The dataset contains 21,790 males (66.9%) and 10,771 females (33.1%). This roughly 2:1 split describes the extracted sample and should not be read as a direct measure of 1994 labor force participation. The gender imbalance and income disparity make the dataset particularly valuable for algorithmic fairness research and bias auditing.
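The headline figures in this section follow from four counts. The sketch below recomputes the per-group high-income rates, their ratio, and each group's share of records. The counts come from the tables above, and the variable names are illustrative.

```python
# Counts from the gender breakdown tables.
counts = {
    "Male": {"total": 21790, "high_income": 6662},
    "Female": {"total": 10771, "high_income": 1179},
}

# Share of each group earning >50K (the label base rate per group).
rates = {sex: c["high_income"] / c["total"] for sex, c in counts.items()}

# How many times more likely a male record is to carry the >50K label.
rate_ratio = rates["Male"] / rates["Female"]

# Share of all records belonging to each group.
n_records = sum(c["total"] for c in counts.values())
share_of_records = {sex: c["total"] / n_records for sex, c in counts.items()}

for sex in counts:
    print(f"{sex}: {100 * rates[sex]:.1f}% high income, "
          f"{100 * share_of_records[sex]:.1f}% of records")
```

Any model trained on these labels starts from unequal base rates, which is worth keeping in mind when interpreting fairness metrics.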

Working Hours and Capital Gains

High earners work an average of 45.5 hours per week compared to 38.8 hours for those earning ≤$50K—a 17% difference.
SQL
SELECT income, ROUND(AVG(hours_per_week), 1) AS avg_hours FROM adult_income.csv GROUP BY income

| Income | Avg Hours |
|--------|-----------|
| >50K | 45.5 |
| <=50K | 38.8 |
Capital gains show dramatic differences: high earners average $4,006 in capital gains versus just $149 for lower earners—a 27x multiplier indicating wealth accumulation patterns.
SQL
SELECT income, ROUND(AVG(capital_gain), 2) AS avg_capital_gain FROM adult_income.csv GROUP BY income

| Income | Avg Capital Gain |
|--------|------------------|
| >50K | 4,006.14 |
| <=50K | 148.75 |

Missing Values Analysis

Missing values are encoded as '?' (the raw UCI file stores them as ' ?', with a leading space) and appear in three columns:

| Column | Missing Count | Missing % |
|--------|---------------|-----------|
| workclass | 1,836 | 5.6% |
| occupation | 1,843 | 5.7% |
| native_country | 583 | 1.8% |

Missing Value Strategies:
- Remove rows: Drops ~7% of data (2,399 unique rows with any missing value)
- Treat '?' as category: Preserves all data, may capture meaningful patterns (e.g., self-employed without formal occupation)
- Mode imputation: Replace with most frequent value per column
- Predictive imputation: Use other features to predict missing values
- Use robust algorithms: XGBoost and LightGBM handle missing values natively once '?' has been converted to a true null (NaN)
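Two of these strategies, row removal and mode imputation, can be sketched in plain Python. The helper names and toy rows below are hypothetical. Values are stripped before comparison so that both '?' and ' ?' are recognized as missing.

```python
from collections import Counter

MISSING = "?"
COLUMNS_WITH_MISSING = ("workclass", "occupation", "native_country")


def is_missing(value: str) -> bool:
    # strip() tolerates the leading space found in the raw UCI file
    return value.strip() == MISSING


def drop_incomplete(rows):
    """Keep only rows with no missing marker in the affected columns."""
    return [
        r for r in rows
        if not any(is_missing(r[c]) for c in COLUMNS_WITH_MISSING)
    ]


def impute_mode(rows, column):
    """Replace missing markers in one column with its most frequent value."""
    observed = [r[column] for r in rows if not is_missing(r[column])]
    mode = Counter(observed).most_common(1)[0][0]
    return [
        {**r, column: mode if is_missing(r[column]) else r[column]}
        for r in rows
    ]


# Toy rows, for illustration only.
rows = [
    {"workclass": "Private", "occupation": "Sales", "native_country": "United-States"},
    {"workclass": "?", "occupation": "?", "native_country": "United-States"},
    {"workclass": "Private", "occupation": "Craft-repair", "native_country": " ?"},
    {"workclass": "State-gov", "occupation": "Adm-clerical", "native_country": "United-States"},
]

complete = drop_incomplete(rows)
imputed = impute_mode(rows, "workclass")
```

Mode imputation is applied one column at a time, so the second toy row still carries '?' in occupation after workclass has been filled.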

Baseline Model Performance

Typical accuracy scores on this dataset provide benchmarks for evaluating your models:
| Model | Accuracy | AUC-ROC | Notes |
|-------|----------|---------|-------|
| Naive Baseline (predict majority) | 75.9% | 0.50 | Always predicts ≤$50K |
| Logistic Regression | ~84-85% | ~0.88 | Good interpretability |
| Decision Tree | ~82-84% | ~0.85 | Prone to overfitting |
| Random Forest | ~85-86% | ~0.90 | Strong baseline |
| Gradient Boosting (XGBoost) | ~86-87% | ~0.92 | Current best practices |
| Neural Network | ~85-86% | ~0.91 | Requires more tuning |
Note: Exact scores vary based on preprocessing choices, train/test splits, and hyperparameter tuning. The ~87% accuracy ceiling reflects inherent noise and the limited predictive power of available features.
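The naive baseline's AUC-ROC of 0.50 follows from how AUC is defined: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A minimal pure-Python sketch on toy labels (hypothetical data, not the dataset itself) makes this concrete.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels with a 3:1 imbalance (1 means >50K).
labels = [0, 0, 0, 1]

# A majority-class predictor gives every record the same score.
constant_scores = [0.0, 0.0, 0.0, 0.0]
accuracy_constant = sum(y == 0 for y in labels) / len(labels)
auc_constant = roc_auc(labels, constant_scores)

# A scorer that ranks the positive record above all negatives.
auc_perfect = roc_auc(labels, [0.1, 0.2, 0.3, 0.9])
```

The constant predictor reaches 75% accuracy on these toy labels yet has an AUC of 0.5, which is why the table reports both metrics.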

Known Limitations and Ethical Considerations

When using this dataset, be aware of these important limitations:
- [ ] Temporal obsolescence: Data is 30 years old; labor markets, gender dynamics, and income distributions have changed significantly
- [ ] Geographic limitation: US-only data; patterns don't generalize to other countries
- [ ] Historical bias: Labels reflect 1994 societal biases in hiring, promotion, and compensation
- [ ] Binary income threshold: Oversimplifies continuous income distribution
- [ ] Self-reported data: Subject to reporting errors and social desirability bias
- [ ] Protected attributes: Contains race and sex features that may lead to discriminatory models if used carelessly
For Fairness Research: This dataset is widely used to study algorithmic bias precisely because it exhibits disparate outcomes across protected groups. When building models, consider measuring equalized odds, demographic parity, and calibration across subgroups.
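Demographic parity and equalized odds can both be computed from per-group prediction rates. The sketch below uses hypothetical predictions on eight toy records, and the function names are illustrative. Calibration is not covered, since it requires predicted probabilities rather than hard labels.

```python
def rate(values):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def group_metrics(y_true, y_pred, groups):
    """Selection rate, true positive rate, and false positive rate per group."""
    metrics = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        metrics[g] = {
            "selection_rate": rate([y_pred[i] for i in idx]),
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }
    return metrics


# Toy labels and predictions (1 means >50K), for illustration only.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["Male"] * 4 + ["Female"] * 4

m = group_metrics(y_true, y_pred, groups)

# Demographic parity: gap in the share predicted positive.
dp_gap = abs(m["Male"]["selection_rate"] - m["Female"]["selection_rate"])

# Equalized odds: the larger of the TPR gap and the FPR gap.
eo_gap = max(
    abs(m["Male"]["tpr"] - m["Female"]["tpr"]),
    abs(m["Male"]["fpr"] - m["Female"]["fpr"]),
)
```

A gap of zero on either metric means the two groups are treated identically by that criterion, and larger gaps indicate greater disparity.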

Sample Data Preview

Sample Records from the Dataset

SQL
SELECT age, workclass, education, occupation, sex, hours_per_week, income FROM adult_income.csv LIMIT 8

| # | Age | Workclass | Education | Occupation | Sex | Hours/Week | Income |
|---|-----|-----------|-----------|------------|-----|------------|--------|
| 1 | 39 | State-gov | Bachelors | Adm-clerical | Male | 40 | <=50K |
| 2 | 50 | Self-emp-not-inc | Bachelors | Exec-managerial | Male | 13 | <=50K |
| 3 | 38 | Private | HS-grad | Handlers-cleaners | Male | 40 | <=50K |
| 4 | 53 | Private | 11th | Handlers-cleaners | Male | 40 | <=50K |
| 5 | 28 | Private | Bachelors | Prof-specialty | Female | 40 | <=50K |
| 6 | 37 | Private | Masters | Exec-managerial | Female | 40 | <=50K |
| 7 | 49 | Private | 9th | Other-service | Female | 16 | <=50K |
| 8 | 52 | Self-emp-not-inc | HS-grad | Exec-managerial | Male | 45 | >50K |

Categorical Value Reference

Workclass Categories Explained

| Value | Description |
|-------|-------------|
| Private | For-profit private sector employment |
| Self-emp-not-inc | Self-employed, business not incorporated |
| Self-emp-inc | Self-employed, incorporated business |
| Federal-gov | US federal government employee |
| Local-gov | Local government employee |
| State-gov | State government employee |
| Without-pay | Unpaid family worker |
| Never-worked | Never held a job |
Education Levels and Numeric Mapping

| Education | education_num | Description |
|-----------|---------------|-------------|
| Preschool | 1 | No formal education |
| 1st-4th | 2 | Elementary (grades 1-4) |
| 5th-6th | 3 | Elementary (grades 5-6) |
| 7th-8th | 4 | Middle school |
| 9th | 5 | High school freshman |
| 10th | 6 | High school sophomore |
| 11th | 7 | High school junior |
| 12th | 8 | High school senior (no diploma) |
| HS-grad | 9 | High school graduate |
| Some-college | 10 | College, no degree |
| Assoc-voc | 11 | Associate's (vocational) |
| Assoc-acdm | 12 | Associate's (academic) |
| Bachelors | 13 | Bachelor's degree |
| Masters | 14 | Master's degree |
| Prof-school | 15 | Professional school (MD, JD) |
| Doctorate | 16 | Doctoral degree (PhD) |
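Because education and education_num carry the same information, a model needs only one of them. The mapping from label to number can be written as a lookup table. This is a small sketch, and the function name is illustrative.

```python
# Ordinal codes from the education mapping table.
EDUCATION_NUM = {
    "Preschool": 1, "1st-4th": 2, "5th-6th": 3, "7th-8th": 4,
    "9th": 5, "10th": 6, "11th": 7, "12th": 8,
    "HS-grad": 9, "Some-college": 10, "Assoc-voc": 11, "Assoc-acdm": 12,
    "Bachelors": 13, "Masters": 14, "Prof-school": 15, "Doctorate": 16,
}


def encode_education(level: str) -> int:
    """Map an education label to its ordinal code (whitespace-tolerant)."""
    return EDUCATION_NUM[level.strip()]


sample = ["Bachelors", "HS-grad", "11th", "Masters"]
encoded = [encode_education(level) for level in sample]

# Sorting by the code orders labels from least to most schooling.
ordered = sorted(sample, key=encode_education)
```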
Occupation Categories Explained

| Value | Description |
|-------|-------------|
| Exec-managerial | Executive and managerial positions |
| Prof-specialty | Professional specialty (doctors, lawyers, engineers) |
| Tech-support | Technical support roles |
| Sales | Sales occupations |
| Adm-clerical | Administrative and clerical |
| Protective-serv | Police, firefighters, security |
| Priv-house-serv | Private household service |
| Handlers-cleaners | Material handlers, cleaners |
| Machine-op-inspct | Machine operators, inspectors |
| Transport-moving | Transportation and material moving |
| Craft-repair | Skilled trades and repair |
| Farming-fishing | Agriculture and fishing |
| Armed-Forces | Military personnel |
| Other-service | Other service occupations |

Table Overview

adult_income

Contains 32,561 rows and 15 columns. Column types: 6 numeric, 9 text.


Data Preview

| Row | age | workclass | fnlwgt | education | education_num |
|-----|-----|-----------|--------|-----------|---------------|
| 1 | 39 | State-gov | 77,516 | Bachelors | 13 |
| 2 | 50 | Self-emp-not-inc | 83,311 | Bachelors | 13 |
| 3 | 38 | Private | 215,646 | HS-grad | 9 |

The remaining 10 columns are not shown in this preview.

Data Profile

Rows: 32,561
Columns: 15
Completeness: 100% (missing values are stored as the string '?', so they are not counted as nulls)
Estimated size: 23.3 MB

Column Types

6 numeric, 9 text

High-Cardinality Columns

Columns with many unique values (suitable for identifiers or categorical features)

  • fnlwgt(21,648 unique values)

Data Dictionary

adult_income

| Column | Type | Example | Missing Values |
|--------|------|---------|----------------|
| age | numeric | 39, 50 | 0 |
| workclass | string | "State-gov", "Self-emp-not-inc" | 0 |
| fnlwgt | numeric | 77516, 83311 | 0 |
| education | string | "Bachelors", "Bachelors" | 0 |
| education_num | numeric | 13, 13 | 0 |
| marital_status | string | "Never-married", "Married-civ-spouse" | 0 |
| occupation | string | "Adm-clerical", "Exec-managerial" | 0 |
| relationship | string | "Not-in-family", "Husband" | 0 |
| race | string | "White", "White" | 0 |
| sex | string | "Male", "Male" | 0 |
| capital_gain | numeric | 2174, 0 | 0 |
| capital_loss | numeric | 0, 0 | 0 |
| hours_per_week | numeric | 40, 13 | 0 |
| native_country | string | "United-States", "United-States" | 0 |
| income | string | "<=50K", "<=50K" | 0 |
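The data dictionary translates naturally into a typed loader. The sketch below assumes the CSV has a header row with the column names listed above, which the source does not confirm. The function name is hypothetical, and an in-memory sample stands in for the real file.

```python
import csv
import io

# The six numeric columns from the data dictionary.
NUMERIC_COLUMNS = (
    "age", "fnlwgt", "education_num",
    "capital_gain", "capital_loss", "hours_per_week",
)


def read_adult_rows(handle):
    """Parse rows into dicts, casting numeric columns to int."""
    rows = []
    for raw in csv.DictReader(handle, skipinitialspace=True):
        row = {key: value.strip() for key, value in raw.items()}
        for column in NUMERIC_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


# With the real file one might write:
#   with open("adult_income.csv", newline="") as f:
#       rows = read_adult_rows(f)
# The in-memory sample below reuses the example values from the dictionary.
sample_csv = io.StringIO(
    "age,workclass,fnlwgt,education,education_num,marital_status,occupation,"
    "relationship,race,sex,capital_gain,capital_loss,hours_per_week,"
    "native_country,income\n"
    "39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,"
    "Not-in-family,White,Male,2174,0,40,United-States,<=50K\n"
    "50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,"
    "Husband,White,Male,0,0,13,United-States,<=50K\n"
)
rows = read_adult_rows(sample_csv)
```

Text columns are left as strings, so a '?' marker in workclass, occupation, or native_country passes through unchanged for later handling.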
Created: December 26, 2025