Last updated December 27, 2025
Time: Movie reviews collected through 2011
Location: Global (English-language reviews from IMDB users worldwide)
Created by Dataset Agent
Overview
The IMDB Movie Reviews Dataset is the gold-standard benchmark for binary sentiment classification in natural language processing. Originally published by Andrew Maas and colleagues at Stanford University in 2011, this dataset has been cited in thousands of research papers and remains the go-to resource for evaluating sentiment analysis models—from classical Naive Bayes classifiers to modern transformer architectures like BERT and GPT.
The dataset contains 50,000 movie reviews split evenly into 25,000 training and 25,000 test samples, with perfect class balance (12,500 positive and 12,500 negative in each split).
What makes this dataset particularly valuable is its curation methodology: only "highly polar" reviews were included. Positive reviews have ratings of 7 or higher (out of 10), while negative reviews have ratings of 4 or lower. Reviews with neutral ratings (5-6) were deliberately excluded, ensuring clear sentiment polarity and reducing label noise that plagues many sentiment datasets.
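The labeling rule described above can be sketched as a simple threshold function (the function name is illustrative, not part of the dataset's tooling):

```python
def rating_to_label(rating):
    """Map an IMDB star rating (1-10) to the dataset's binary label.

    Ratings of 7+ are positive (1), ratings of 4 or lower are negative (0),
    and neutral ratings (5-6) were excluded from the dataset entirely,
    so we return None for them.
    """
    if rating >= 7:
        return 1  # positive
    if rating <= 4:
        return 0  # negative
    return None   # neutral: deliberately excluded by the curators
```

This exclusion of the 5-6 band is what gives the dataset its "highly polar" character and low label noise.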
Dataset Statistics
Understanding the statistical properties of this dataset is crucial for effective model development and preprocessing decisions.
Reviews average 231 words in length, with significant variation from very brief opinions to detailed multi-paragraph analyses reaching up to 2,470 words.
The vocabulary contains approximately 89,527 unique words, providing rich linguistic diversity for training robust word embeddings and language models.
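Statistics like mean review length and vocabulary size are straightforward to reproduce yourself. A minimal sketch, using whitespace tokenization (the published figures may use a different tokenizer, so exact numbers can vary):

```python
from collections import Counter

def corpus_stats(reviews):
    """Return (mean review length in tokens, vocabulary size)
    for a list of review strings, using naive whitespace tokenization."""
    lengths = []
    vocab = Counter()
    for text in reviews:
        tokens = text.lower().split()
        lengths.append(len(tokens))
        vocab.update(tokens)
    mean_len = sum(lengths) / len(lengths)
    return mean_len, len(vocab)
```

Run over the full 50,000 reviews, this should land near the reported ~231-word average and ~89,527-term vocabulary, depending on tokenization choices.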
Benchmark Performance
The IMDB dataset has well-documented baseline performances, making it easy to validate your implementation and compare against published results:
- Naive Bayes + Bag-of-Words: ~85% accuracy (classical baseline)
- Logistic Regression + TF-IDF: ~88-89% accuracy
- LSTM/BiLSTM: ~89-91% accuracy
- DistilBERT: ~93.2% accuracy
- BERT-base: ~95.8% accuracy
- RoBERTa-large: ~96.1% accuracy
- DeBERTa-v3: ~97%+ accuracy (current state-of-the-art)
State-of-the-art models now exceed 97% accuracy, approaching human-level performance. This makes IMDB excellent for validating implementations and learning, but less useful for pushing the boundaries of sentiment analysis research. For more challenging benchmarks, consider SST-5 (fine-grained sentiment) or domain-specific datasets.
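The ~88-89% TF-IDF + logistic regression baseline from the list above can be sketched with scikit-learn. The four toy reviews below are illustrative stand-ins for the real 25,000-review training split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus standing in for the IMDB training split.
train_texts = [
    "A brilliant, moving film with superb acting.",
    "Absolutely wonderful plot and direction.",
    "Dull, predictable, and a waste of two hours.",
    "Terrible script and wooden performances.",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF unigrams + bigrams feeding a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

pred = model.predict(["superb acting and a brilliant plot"])
```

Trained on the actual IMDB splits instead of this toy corpus, the same pipeline is what typically reaches the high-80s accuracy quoted above.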
Preprocessing Requirements
The raw reviews contain HTML artifacts that require cleaning before model training. Most notably, line breaks appear as literal "<br />" tags throughout the text.
Approximately 98% of reviews contain HTML break tags that should be removed or converted to spaces during preprocessing.
Recommended preprocessing steps vary by model type:
- For traditional ML models: Remove HTML tags, lowercase text, handle contractions, remove stopwords, apply stemming/lemmatization
- For transformer models (BERT, RoBERTa): Remove HTML tags only—preserve original casing and punctuation as these models benefit from complete text features
- For all models: Replace "<br />" tags with spaces or newlines using a regex such as re.sub(r'<br\s*/?>', ' ', text) (after import re)
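The per-model recommendations above can be combined into one small cleaning function. This is a minimal sketch (the function name and flag are illustrative, not an official preprocessing API):

```python
import re

def clean_review(text, for_transformer=False):
    """Clean an IMDB review per the recommendations above.

    Transformer models keep original casing and punctuation;
    classical ML pipelines additionally get lowercasing.
    """
    # Replace HTML line-break tags with spaces (covers <br>, <br/>, <br />).
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Collapse any runs of whitespace left behind.
    text = re.sub(r'\s+', ' ', text).strip()
    if not for_transformer:
        text = text.lower()
    return text
```

Stopword removal and stemming/lemmatization for classical models would layer on top of this, typically via NLTK or spaCy.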
Historical Context and Academic Significance
The dataset was introduced in the landmark paper "Learning Word Vectors for Sentiment Analysis" at ACL 2011. This work demonstrated that learning word vectors specifically for sentiment tasks could outperform generic word representations—a finding that presaged the transfer learning revolution in NLP.
The original release also included 50,000 additional unlabeled reviews for unsupervised and semi-supervised learning experiments. This "unsupervised" split is available through HuggingFace's datasets library and remains valuable for pre-training domain-specific language models.
Dataset Comparison
When choosing a sentiment analysis dataset, consider these alternatives and their trade-offs:
- SST-2 (Stanford Sentiment Treebank): 67k movie sentences, binary labels, shorter texts—better for sentence-level analysis
- SST-5: Same source as SST-2 but with 5-class fine-grained sentiment—more challenging than binary classification
- Amazon Reviews: 3.6M+ product reviews with 1-5 star ratings—larger scale, multi-domain, but noisier labels
- Yelp Reviews: 6.9M business reviews with 1-5 stars—strong for transfer learning to service industry domains
- Rotten Tomatoes: Movie reviews with critic/audience scores—useful for cross-validation with IMDB results
IMDB remains the preferred choice for: (1) establishing baseline model performance, (2) educational purposes due to extensive documentation, (3) comparing against published research, and (4) long-form document classification experiments.
Key Characteristics
- Perfectly Balanced Classes: 50/50 split eliminates need for class weighting or resampling
- Highly Polar Labels: Only extreme ratings included (≤4 or ≥7), ensuring clear sentiment signals
- Long-form Text: Reviews average 231 words, providing rich context for deep learning models
- Duplicate Prevention: Maximum 30 reviews per movie prevents single-film bias
- Real User Content: Authentic language with natural variations, slang, and diverse writing styles
- Standardized Splits: Fixed train/test split enables reproducible benchmarking
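The perfect class balance noted above is easy to verify before deciding whether class weighting is needed; a small helper (illustrative, not part of any dataset API):

```python
from collections import Counter

def class_proportions(labels):
    """Return per-class proportions for a label sequence.
    On the IMDB train or test split this should give {0: 0.5, 1: 0.5}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}
```

Because the splits are exactly balanced, loss functions can use default (uniform) class weights.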
Limitations and Considerations
Models trained on IMDB may not generalize well to other domains. The dataset reflects movie review language patterns from 2011 and earlier, which may differ from contemporary social media text or other review types.
- Domain-Specific Vocabulary: Movie terminology ("plot," "acting," "cinematography") may not transfer to product or restaurant reviews
- Binary Labels Only: No support for neutral sentiment or fine-grained rating prediction
- English Only: Not suitable for multilingual sentiment analysis
- Temporal Bias: Reviews collected through 2011 may contain outdated cultural references and language patterns
- Near-Saturation: With SOTA exceeding 97%, marginal improvements are difficult to measure
- Potential Bias: User demographics on IMDB may not represent general population sentiment patterns
Access Methods
The dataset is available through multiple platforms to suit different workflows:
- Original Source: Stanford AI Lab at ai.stanford.edu/~amaas/data/sentiment/ (tar.gz archive)
- HuggingFace Datasets: load_dataset('imdb') with train/test/unsupervised splits
- TensorFlow Datasets: tfds.load('imdb_reviews') with built-in preprocessing options
- Kaggle: CSV format with direct download and 1,900+ community notebooks
- This Page: Processed CSV files ready for immediate use
Table Overview
Tables
train
Data Preview
| text | label |
|---|---|
| I rented I AM CURIOUS-YELLOW from my ... | 0 |
| "I Am Curious: Yellow" is a risible a... | 0 |
| If only to avoid making this type of ... | 0 |
| This film was probably inspired by Go... | 0 |
| Oh, brother...after hearing about thi... | 0 |
Showing 5 of 25,000 rows
Data Profile
- 25,000 rows
- 2 columns
- 100% complete
- 2.4 MB estimated size
Column Types
- 1 numeric, 1 text
High-Cardinality Columns
Columns with many unique values (suitable for identifiers or categorical features)
- text (24,904 unique values)
Data Dictionary
train
| Column | Type | Example | Missing Values |
|---|---|---|---|
| text | string | "I rented I AM CURIOU...", ""I Am Curious: Yello..." | 0 |
| label | numeric | 0, 0 | 0 |