Last updated: December 27, 2025
Time: 1994
Location: United States
Created by Dataset Agent
Overview
The Adult Income Dataset (also known as the Census Income Dataset) is a standard benchmark for binary classification in machine learning. It was extracted from the 1994 US Census Bureau database by Barry Becker and Ronny Kohavi, and the prediction task is to determine whether an individual's annual income exceeds $50,000 based on demographic and employment attributes.
The dataset contains 32,561 instances with 14 attributes covering age, workclass, education, marital status, occupation, relationship, race, sex, capital gains/losses, hours worked, and native country.
Originally donated to the UCI Machine Learning Repository in 1996, this dataset has become one of the most cited resources in machine learning literature—particularly for classification benchmarking, algorithmic fairness research, and studies examining income inequality and socioeconomic factors.
Dataset Characteristics
| Characteristic | Value |
|----------------|-------|
| Subject Area | Social Science |
| Task Type | Binary Classification |
| Feature Types | Categorical, Integer |
| Instances | 32,561 (training) |
| Features | 14 |
| Missing Values | Yes (encoded as '?') |
The data was extracted using specific conditions from the Census database: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)), ensuring only working-age individuals with positive income and work hours are included.
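The extraction condition can be read as a row-level predicate. The sketch below is a minimal illustration that assumes raw Census records are available as dictionaries keyed by the variable names in the condition. The raw extract is not part of this dataset, so the sample records and the function name `passes_extraction_filter` are invented for illustration.

```python
def passes_extraction_filter(record):
    """Return True if a raw record satisfies
    (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Hypothetical raw records, invented for illustration only.
raw_records = [
    {"AAGE": 39, "AGI": 30000, "AFNLWGT": 77516, "HRSWK": 40},   # kept
    {"AAGE": 15, "AGI": 500, "AFNLWGT": 90000, "HRSWK": 10},     # dropped: age not above 16
    {"AAGE": 45, "AGI": 20000, "AFNLWGT": 120000, "HRSWK": 0},   # dropped: no work hours
]

kept = [r for r in raw_records if passes_extraction_filter(r)]
print(f"kept {len(kept)} of {len(raw_records)} records")
```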
Class Distribution and Imbalance
The target variable shows significant class imbalance: 24,720 individuals (75.9%) earn ≤$50K while only 7,841 (24.1%) earn >$50K.
This ~3:1 class imbalance is important for model training—naive classifiers predicting all instances as ≤$50K would achieve 75.9% accuracy, making this a useful baseline for evaluating model performance.
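The majority-class baseline follows directly from the class counts quoted above. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Class counts as reported in this section.
class_counts = {"<=50K": 24_720, ">50K": 7_841}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = class_counts[majority_label] / total

# Ratio of lower earners to higher earners.
imbalance_ratio = class_counts["<=50K"] / class_counts[">50K"]

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}:1")
```

Any model worth keeping should beat this accuracy, and because accuracy alone hides minority-class errors, it is worth reporting a ranking or per-class metric alongside it.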
Historical Context and Economic Interpretation
$50,000 in 1994 ≈ $103,000 in 2024 (adjusted for inflation). The income threshold represents upper-middle-class earnings for that era, roughly equivalent to the top 25% of individual earners.
The data captures the American workforce during a period of economic expansion following the early 1990s recession. Understanding this historical context is crucial when interpreting patterns—gender wage gaps, occupational distributions, and educational attainment rates have shifted significantly in the three decades since this data was collected.
Education and Income Correlation
Education level is among the strongest correlates of income in this dataset. Advanced degrees sharply increase the probability of earning above the $50K threshold.
Individuals with a Doctorate have a 74.1% probability of earning >$50K, compared to just 16% for high school graduates—a 4.6x difference highlighting the economic returns to education in 1994.
Understanding the fnlwgt (Final Weight) Field
The fnlwgt column is one of the most misunderstood features in this dataset. It represents the Census Bureau's estimate of how many people in the US population each record represents, calculated using demographic stratification.
The weighting methodology groups people by state, age, race, and sex, then assigns weights so that the sample totals match known population controls. For example, if the sample under-represents young Hispanic females in California, records matching that profile receive higher weights.
For most ML tasks, exclude fnlwgt from your features. It's a sampling artifact, not a predictive attribute. However, it's valuable for weighted statistical analysis when you need population-level estimates rather than sample-level results.
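The difference between a sample-level and a population-level estimate can be sketched with a few rows. The example below uses a toy sample invented for illustration (the weights and labels are not real records), and the helper names are assumptions made here.

```python
# Toy sample, invented for illustration: (fnlwgt, income label).
sample = [
    (77_516, "<=50K"),
    (83_311, ">50K"),
    (215_646, "<=50K"),
    (234_721, "<=50K"),
    (338_409, ">50K"),
]


def unweighted_share(rows, label=">50K"):
    """Share of sample records carrying the label."""
    return sum(1 for _, y in rows if y == label) / len(rows)


def weighted_share(rows, label=">50K"):
    """Share of the represented population carrying the label,
    counting each record fnlwgt times."""
    total_weight = sum(w for w, _ in rows)
    return sum(w for w, y in rows if y == label) / total_weight


sample_level = unweighted_share(sample)
population_level = weighted_share(sample)
print(f"unweighted: {sample_level:.3f}, weighted: {population_level:.3f}")
```

The two numbers diverge whenever high-weight records are concentrated in one class, which is why the weight matters for population estimates but is usually dropped as a model feature.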
Occupation Analysis
Executive-managerial roles lead with a 48.4% high-income rate, followed by Professional-specialty at 44.9%. Service and manual labor occupations show rates below 15%.
Age and Income Patterns
Ages range from 17 to 90 years, with a mean of 38.6 years and median of 37 years.
Peak earning years occur between 45-54, where the high-income rate reaches 40.1%. Young workers (17-24) have only a 1.1% probability of earning >$50K, reflecting entry-level positions and limited experience. The decline after 55 likely reflects early retirement and reduced work hours.
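Rates like these come from bucketing ages into bands and computing the share of high earners in each band. A minimal sketch follows. The band edges are an assumption (the text names only 17-24 and 45-54 explicitly), and the rows are toy values invented for illustration.

```python
from collections import defaultdict

# Band edges are an assumption made for this sketch.
AGE_BANDS = [(17, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 90)]


def age_band(age):
    """Map an age to its band label, e.g. 47 -> '45-54'."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the 17-90 range")


def high_income_rate_by_band(rows):
    """rows is an iterable of (age, income label) pairs."""
    counts = defaultdict(lambda: [0, 0])  # band -> [high earners, total]
    for age, income in rows:
        band = age_band(age)
        counts[band][0] += int(income == ">50K")
        counts[band][1] += 1
    return {band: high / total for band, (high, total) in counts.items()}


# Toy rows, invented for illustration only.
rows = [
    (19, "<=50K"), (23, "<=50K"),
    (47, ">50K"), (52, "<=50K"), (50, ">50K"),
    (61, "<=50K"),
]
rates = high_income_rate_by_band(rows)
```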
Gender Disparity Analysis
The dataset reveals significant gender disparity: 30.6% of males earn >$50K compared to only 10.9% of females—a 2.8x difference.
The dataset contains 21,790 males (66.9%) and 10,771 females (33.1%), reflecting 1994 labor force participation rates. This gender imbalance and income disparity make the dataset particularly valuable for algorithmic fairness research and bias auditing.
Working Hours and Capital Gains
High earners work an average of 45.5 hours per week compared to 38.8 hours for those earning ≤$50K—a 17% difference.
Capital gains show dramatic differences: high earners average $4,006 in capital gains versus just $149 for lower earners—a 27x multiplier indicating wealth accumulation patterns.
Missing Values Analysis
Missing values are encoded as ' ?' (with a leading space) and appear in three columns:
| Column | Missing Count | Missing % |
|--------|---------------|----------|
| workclass | 1,836 | 5.6% |
| occupation | 1,843 | 5.7% |
| native_country | 583 | 1.8% |
Missing Value Strategies:
- Remove rows: Drops ~7% of data (2,399 unique rows with any missing value)
- Treat '?' as category: Preserves all data, may capture meaningful patterns (e.g., self-employed without formal occupation)
- Mode imputation: Replace with most frequent value per column
- Predictive imputation: Use other features to predict missing values
- Use robust algorithms: XGBoost and LightGBM handle missing values natively, provided the ' ?' placeholders are first converted to true missing values (NaN)
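The first three strategies above can be sketched with the standard library alone. The example assumes a headerless, comma-plus-space file layout and uses a reduced set of four columns; the raw lines are invented for illustration, and "Unknown" is a category name chosen here.

```python
import csv
import io
from collections import Counter

# Invented lines in the assumed comma-plus-space layout, with ' ?' placeholders.
RAW = """39, State-gov, Adm-clerical, United-States
54, ?, ?, United-States
32, Private, Sales, United-States
41, Private, Craft-repair, ?
"""
COLUMNS = ["age", "workclass", "occupation", "native_country"]

# skipinitialspace drops the space after each comma, so ' ?' is read as '?'.
reader = csv.DictReader(io.StringIO(RAW), fieldnames=COLUMNS, skipinitialspace=True)
rows = [{k: (None if v == "?" else v) for k, v in row.items()} for row in reader]

# Strategy: remove rows with any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Strategy: treat the placeholder as its own category.
as_category = [{k: ("Unknown" if v is None else v) for k, v in r.items()} for r in rows]


# Strategy: mode imputation for one column.
def impute_mode(rows, column):
    observed = Counter(r[column] for r in rows if r[column] is not None)
    mode = observed.most_common(1)[0][0]
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]


imputed = impute_mode(rows, "workclass")
```

In a real pipeline the mode should be computed on the training split only and then applied to the test split, so that no information leaks across the split.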
Baseline Model Performance
Typical accuracy scores on this dataset provide benchmarks for evaluating your models:
| Model | Accuracy | AUC-ROC | Notes |
|-------|----------|---------|-------|
| Naive Baseline (predict majority) | 75.9% | 0.50 | Always predicts ≤$50K |
| Logistic Regression | ~84-85% | ~0.88 | Good interpretability |
| Decision Tree | ~82-84% | ~0.85 | Prone to overfitting |
| Random Forest | ~85-86% | ~0.90 | Strong baseline |
| Gradient Boosting (XGBoost) | ~86-87% | ~0.92 | Typically the strongest of these baselines |
| Neural Network | ~85-86% | ~0.91 | Requires more tuning |
Note: Exact scores vary based on preprocessing choices, train/test splits, and hyperparameter tuning. The ~87% accuracy ceiling reflects inherent noise and the limited predictive power of available features.
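The first row of the table shows why accuracy and AUC-ROC should be read together: a constant predictor scores well on accuracy but only 0.50 on AUC. The sketch below computes both metrics from scratch on toy labels and scores invented for illustration. AUC is computed as the probability that a random positive example is scored above a random negative one, with ties counting half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Pairwise (Mann-Whitney) form of AUC-ROC; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels with a 3:1 imbalance (1 means >50K), invented for illustration.
y_true = [0, 0, 0, 1, 0, 0, 0, 1]

# Majority baseline: always predict <=50K, with the same score for everyone.
constant_pred = [0] * len(y_true)
constant_scores = [0.0] * len(y_true)

# Hypothetical model scores, invented for illustration.
model_scores = [0.1, 0.3, 0.2, 0.8, 0.4, 0.1, 0.75, 0.7]

baseline_acc = accuracy(y_true, constant_pred)
baseline_auc = roc_auc(y_true, constant_scores)
model_auc = roc_auc(y_true, model_scores)
print(f"baseline accuracy={baseline_acc:.2f}, baseline AUC={baseline_auc:.2f}, model AUC={model_auc:.3f}")
```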
Known Limitations and Ethical Considerations
When using this dataset, be aware of these important limitations:
- Temporal obsolescence: Data is about 30 years old; labor markets, gender dynamics, and income distributions have changed significantly
- Geographic limitation: US-only data; patterns may not generalize to other countries
- Historical bias: Labels reflect 1994 societal biases in hiring, promotion, and compensation
- Binary income threshold: Oversimplifies a continuous income distribution
- Self-reported data: Subject to reporting errors and social desirability bias
- Protected attributes: Contains race and sex features that may lead to discriminatory models if used carelessly
For Fairness Research: This dataset is widely used to study algorithmic bias precisely because it exhibits disparate outcomes across protected groups. When building models, consider measuring equalized odds, demographic parity, and calibration across subgroups.
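Two of the metrics named above can be computed per group with a few lines. The sketch below uses toy labels, predictions, and group memberships invented for illustration, and the helper names are assumptions made here. Demographic parity compares selection rates (the share predicted >50K); equalized odds compares true-positive and false-positive rates. Calibration is not covered in this sketch.

```python
def rate(values):
    """Mean of a list of 0/1 values, or NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")


def group_metrics(y_true, y_pred, groups, group):
    """Selection rate, TPR, and FPR restricted to one group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    return {
        "selection_rate": rate([y_pred[i] for i in idx]),
        "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
        "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
    }


# Toy data, invented for illustration (1 means >50K).
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
sex = ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"]

male = group_metrics(y_true, y_pred, sex, "Male")
female = group_metrics(y_true, y_pred, sex, "Female")

# Demographic parity difference: gap in selection rates.
demographic_parity_diff = male["selection_rate"] - female["selection_rate"]

# Equalized odds gap: the larger of the TPR gap and the FPR gap.
equalized_odds_gap = max(
    abs(male["tpr"] - female["tpr"]),
    abs(male["fpr"] - female["fpr"]),
)
```

A value of zero for either quantity means the model treats the two groups identically on that criterion; the two criteria generally cannot both be zero when base rates differ between groups.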
Sample Data Preview
Sample Records from the Dataset
| # | Age | Workclass | Education | Occupation | Sex | Hours/Week | Income |
|---|---|---|---|---|---|---|---|
| 1 | 39 | State-gov | Bachelors | Adm-clerical | Male | 40 | <=50K |
| 2 | 50 | Self-emp-not-inc | Bachelors | Exec-managerial | Male | 13 | <=50K |
| 3 | 38 | Private | HS-grad | Handlers-cleaners | Male | 40 | <=50K |
| 4 | 53 | Private | 11th | Handlers-cleaners | Male | 40 | <=50K |
| 5 | 28 | Private | Bachelors | Prof-specialty | Female | 40 | <=50K |
| 6 | 37 | Private | Masters | Exec-managerial | Female | 40 | <=50K |
| 7 | 49 | Private | 9th | Other-service | Female | 16 | <=50K |
| 8 | 52 | Self-emp-not-inc | HS-grad | Exec-managerial | Male | 45 | >50K |
Categorical Value Reference
Workclass Categories Explained
| Value | Description |
|-------|-------------|
| Private | For-profit private sector employment |
| Self-emp-not-inc | Self-employed, business not incorporated |
| Self-emp-inc | Self-employed, incorporated business |
| Federal-gov | US federal government employee |
| Local-gov | Local government employee |
| State-gov | State government employee |
| Without-pay | Unpaid family worker |
| Never-worked | Never held a job |
Education Levels and Numeric Mapping
| Education | education_num | Description |
|-----------|---------------|-------------|
| Preschool | 1 | No formal education |
| 1st-4th | 2 | Elementary (grades 1-4) |
| 5th-6th | 3 | Elementary (grades 5-6) |
| 7th-8th | 4 | Middle school |
| 9th | 5 | High school freshman |
| 10th | 6 | High school sophomore |
| 11th | 7 | High school junior |
| 12th | 8 | High school senior (no diploma) |
| HS-grad | 9 | High school graduate |
| Some-college | 10 | College, no degree |
| Assoc-voc | 11 | Associate's (vocational) |
| Assoc-acdm | 12 | Associate's (academic) |
| Bachelors | 13 | Bachelor's degree |
| Masters | 14 | Master's degree |
| Prof-school | 15 | Professional school (MD, JD) |
| Doctorate | 16 | Doctoral degree (PhD) |
Occupation Categories Explained
| Value | Description |
|-------|-------------|
| Exec-managerial | Executive and managerial positions |
| Prof-specialty | Professional specialty (doctors, lawyers, engineers) |
| Tech-support | Technical support roles |
| Sales | Sales occupations |
| Adm-clerical | Administrative and clerical |
| Protective-serv | Police, firefighters, security |
| Priv-house-serv | Private household service |
| Handlers-cleaners | Material handlers, cleaners |
| Machine-op-inspct | Machine operators, inspectors |
| Transport-moving | Transportation and material moving |
| Craft-repair | Skilled trades and repair |
| Farming-fishing | Agriculture and fishing |
| Armed-Forces | Military personnel |
| Other-service | Other service occupations |
Table Overview
adult_income
Data Preview
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77,516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2,174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83,311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215,646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234,721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338,409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Showing 5 of 32,561 rows
Data Profile
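- Rows: 32,561
- Columns: 15
- Completeness: 100% (the ' ?' placeholders are stored as text, so the profile does not count them as missing)
- Estimated size: 23.3 MB
Column Types
- 6 numeric, 9 text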
High-Cardinality Columns
Columns with many unique values (suitable for identifiers or categorical features)
- fnlwgt (21,648 unique values)
Data Dictionary
adult_income
| Column | Type | Example | Missing Values |
|---|---|---|---|
| age | numeric | 39, 50 | 0 |
| workclass | string | "State-gov", "Self-emp-not-inc" | 0 |
| fnlwgt | numeric | 77516, 83311 | 0 |
| education | string | "Bachelors", "Bachelors" | 0 |
| education_num | numeric | 13, 13 | 0 |
| marital_status | string | "Never-married", "Married-civ-spouse" | 0 |
| occupation | string | "Adm-clerical", "Exec-managerial" | 0 |
| relationship | string | "Not-in-family", "Husband" | 0 |
| race | string | "White", "White" | 0 |
| sex | string | "Male", "Male" | 0 |
| capital_gain | numeric | 2174, 0 | 0 |
| capital_loss | numeric | 0, 0 | 0 |
| hours_per_week | numeric | 40, 13 | 0 |
| native_country | string | "United-States", "United-States" | 0 |
| income | string | "<=50K", "<=50K" | 0 |
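The column names and types in the data dictionary translate directly into a small typed parser. The sketch below assumes a headerless, comma-plus-space file layout. The function name `parse_rows` is an assumption made here, and the two sample lines reproduce the first two records of the data preview.

```python
import csv
import io

# Column names and numeric columns as listed in the data dictionary.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]
NUMERIC = {"age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"}


def parse_rows(text):
    """Yield one dict per record, converting numeric columns to int."""
    # skipinitialspace strips the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS, skipinitialspace=True)
    for row in reader:
        yield {k: int(v) if k in NUMERIC else v for k, v in row.items()}


# First two records from the data preview, in the assumed raw layout.
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

records = list(parse_rows(SAMPLE))
print(records[0]["age"], records[0]["workclass"], records[0]["income"])
```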