Agents for Data

IMDB Movie Reviews Dataset

50,000 IMDB movie reviews for binary sentiment classification - the gold-standard NLP benchmark dataset with balanced positive/negative labels, curated from highly polar reviews (ratings ≤4 or ≥7 only).

Tags: sentiment-analysis, natural-language-processing, text-classification, machine-learning, movie-reviews, binary-classification, deep-learning, nlp-benchmark, stanford-nlp, transfer-learning, bert-fine-tuning
2 tables · 50,000 rows
Last updated: December 27, 2025
Time: Movie reviews collected through 2011
Location: Global (English-language reviews from IMDB users worldwide)
Created by Dataset Agent

Overview

The IMDB Movie Reviews Dataset is the gold-standard benchmark for binary sentiment classification in natural language processing. Originally published by Andrew Maas and colleagues at Stanford University in 2011, this dataset has been cited in thousands of research papers and remains the go-to resource for evaluating sentiment analysis models—from classical Naive Bayes classifiers to modern transformer architectures like BERT and GPT.
The dataset contains 50,000 movie reviews split evenly into 25,000 training and 25,000 test samples, with perfect class balance (12,500 positive and 12,500 negative in each split).
SQL
SELECT
  CASE WHEN label = 1 THEN 'Positive' ELSE 'Negative' END AS sentiment,
  COUNT(*) AS review_count
FROM train.csv
GROUP BY label

Data
Sentiment | Review Count
Positive  | 12,500
Negative  | 12,500
(2 rows)
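For Python workflows, the same balance check is a one-liner with pandas; a minimal sketch on a stand-in frame (with the real data you would first load the CSV, e.g. with pd.read_csv('train.csv')):

```python
import pandas as pd

# Stand-in frame; with the real dataset, load train.csv instead.
df = pd.DataFrame({'label': [1, 0, 1, 0, 1]})

# Mirror the SQL: map labels to sentiment names and count each class.
counts = (
    df['label']
    .map({1: 'Positive', 0: 'Negative'})
    .value_counts()
    .rename('review_count')
)
```

On the actual train split, both counts come out to 12,500.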
What makes this dataset particularly valuable is its curation methodology: only "highly polar" reviews were included. Positive reviews have ratings of 7 or higher (out of 10), while negative reviews have ratings of 4 or lower. Reviews with neutral ratings (5-6) were deliberately excluded, ensuring clear sentiment polarity and reducing label noise that plagues many sentiment datasets.
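The curation rule translates directly into a label mapping; a hypothetical helper (not part of the official dataset tooling) makes the polarity thresholds explicit:

```python
def rating_to_label(rating: int):
    """Map an IMDB star rating (1-10) to the dataset's binary label.

    Ratings of 4 or lower are negative (0), 7 or higher are positive (1);
    neutral ratings (5-6) were excluded from the dataset entirely.
    """
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None  # neutral review: dropped during curation
```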

Dataset Statistics

Understanding the statistical properties of this dataset is crucial for effective model development and preprocessing decisions.
Reviews average 231 words in length, with significant variation from very brief opinions to detailed multi-paragraph analyses reaching up to 2,470 words.
SQL
SELECT
  ROUND(AVG(LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) + 1)) AS avg_words,
  MIN(LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) + 1) AS min_words,
  MAX(LENGTH(text) - LENGTH(REPLACE(text, ' ', '')) + 1) AS max_words
FROM train.csv

Data
Avg Words | Min Words | Max Words
231       | 10        | 2,470
(1 row)
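The SQL approximates word counts by counting spaces; the same heuristic in Python (a sketch using the space-counting approximation, not a real tokenizer):

```python
def word_count(text: str) -> int:
    # Same approximation as the SQL: number of spaces plus one.
    return text.count(' ') + 1

def length_stats(texts):
    """Summarize word-count statistics over a collection of reviews."""
    counts = [word_count(t) for t in texts]
    return {
        'avg_words': round(sum(counts) / len(counts)),
        'min_words': min(counts),
        'max_words': max(counts),
    }
```

Applied to the 25,000 training reviews, this heuristic yields the 231 / 10 / 2,470 figures reported in the table.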
The vocabulary contains approximately 89,527 unique words, providing rich linguistic diversity for training robust word embeddings and language models.
SQL
-- Vocabulary analysis requires tokenization.
-- Approximate unique token count taken from dataset metadata.

Data
Unique Words
89,527
(1 row)

Benchmark Performance

The IMDB dataset has well-documented baseline performances, making it easy to validate your implementation and compare against published results:
  • Naive Bayes + Bag-of-Words: ~85% accuracy (classical baseline)
  • Logistic Regression + TF-IDF: ~88-89% accuracy
  • LSTM/BiLSTM: ~89-91% accuracy
  • DistilBERT: ~93.2% accuracy
  • BERT-base: ~95.8% accuracy
  • RoBERTa-large: ~96.1% accuracy
  • DeBERTa-v3: ~97%+ accuracy (current state-of-the-art)
State-of-the-art models now exceed 97% accuracy, approaching human-level performance. This makes IMDB excellent for validating implementations and learning, but less useful for pushing the boundaries of sentiment analysis research. For more challenging benchmarks, consider SST-5 (fine-grained sentiment) or domain-specific datasets.
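The classical TF-IDF + logistic regression baseline takes only a few lines with scikit-learn; a minimal sketch, fit here on a toy stand-in corpus (with the real data you would fit on the 25,000-review train split and expect roughly 88-89% test accuracy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_baseline():
    # Unigrams + bigrams, TF-IDF weighted, linear classifier on top.
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )

# Toy stand-in corpus; replace with the real train split's texts/labels.
texts = [
    "great movie loved the acting",
    "terrible plot awful acting",
    "wonderful film great cast",
    "awful boring terrible film",
]
labels = [1, 0, 1, 0]
model = build_baseline().fit(texts, labels)
```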

Preprocessing Requirements

The raw reviews contain HTML artifacts that require cleaning before model training. Most notably, line breaks appear as literal "<br />" tags throughout the text.
Approximately 98% of reviews contain HTML break tags that should be removed or converted to spaces during preprocessing.
SQL
SELECT
  COUNT(CASE WHEN text LIKE '%<br%' THEN 1 END) AS reviews_with_html,
  COUNT(*) AS total_reviews,
  ROUND(COUNT(CASE WHEN text LIKE '%<br%' THEN 1 END) * 100.0 / COUNT(*), 1) AS percentage
FROM train.csv

Data
Reviews With HTML | Total Reviews | Percentage
24,500            | 25,000        | 98.0
(1 row)
Recommended preprocessing steps vary by model type:
  • For traditional ML models: Remove HTML tags, lowercase text, handle contractions, remove stopwords, apply stemming/lemmatization
  • For transformer models (BERT, RoBERTa): Remove HTML tags only—preserve original casing and punctuation as these models benefit from complete text features
  • For all models: Replace "<br />" tags with spaces or newlines using regex: re.sub(r'<br\s*/?>', ' ', text)
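The steps above can be combined into one cleaning function; a sketch that supports both model families (the for_transformer flag is an illustrative parameter, not a standard API):

```python
import re

# The <br /> variants observed in the raw reviews.
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def clean_review(text: str, for_transformer: bool = True) -> str:
    """Strip HTML break tags; optionally lowercase for classical models."""
    text = BR_TAG.sub(' ', text)               # <br>, <br/>, <br /> -> space
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    if not for_transformer:
        text = text.lower()                    # transformers keep original casing
    return text
```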

Historical Context and Academic Significance

The dataset was introduced in the landmark paper "Learning Word Vectors for Sentiment Analysis" at ACL 2011. This work demonstrated that learning word vectors specifically for sentiment tasks could outperform generic word representations—a finding that presaged the transfer learning revolution in NLP.
The original release also included 50,000 additional unlabeled reviews for unsupervised and semi-supervised learning experiments. This "unsupervised" split is available through HuggingFace's datasets library and remains valuable for pre-training domain-specific language models.

Dataset Comparison

When choosing a sentiment analysis dataset, consider these alternatives and their trade-offs:
  • SST-2 (Stanford Sentiment Treebank): 67k movie sentences, binary labels, shorter texts—better for sentence-level analysis
  • SST-5: Same source as SST-2 but with 5-class fine-grained sentiment—more challenging than binary classification
  • Amazon Reviews: 3.6M+ product reviews with 1-5 star ratings—larger scale, multi-domain, but noisier labels
  • Yelp Reviews: 6.9M business reviews with 1-5 stars—strong for transfer learning to service industry domains
  • Rotten Tomatoes: Movie reviews with critic/audience scores—useful for cross-validation with IMDB results
IMDB remains the preferred choice for: (1) establishing baseline model performance, (2) educational purposes due to extensive documentation, (3) comparing against published research, and (4) long-form document classification experiments.

Key Characteristics

  • Perfectly Balanced Classes: 50/50 split eliminates need for class weighting or resampling
  • Highly Polar Labels: Only extreme ratings included (≤4 or ≥7), ensuring clear sentiment signals
  • Long-form Text: Reviews average 231 words, providing rich context for deep learning models
  • Duplicate Prevention: Maximum 30 reviews per movie prevents single-film bias
  • Real User Content: Authentic language with natural variations, slang, and diverse writing styles
  • Standardized Splits: Fixed train/test split enables reproducible benchmarking

Limitations and Considerations

Models trained on IMDB may not generalize well to other domains. The dataset reflects movie review language patterns from 2011 and earlier, which may differ from contemporary social media text or other review types.
  • Domain-Specific Vocabulary: Movie terminology ("plot," "acting," "cinematography") may not transfer to product or restaurant reviews
  • Binary Labels Only: No support for neutral sentiment or fine-grained rating prediction
  • English Only: Not suitable for multilingual sentiment analysis
  • Temporal Bias: Reviews collected through 2011 may contain outdated cultural references and language patterns
  • Near-Saturation: With SOTA exceeding 97%, marginal improvements are difficult to measure
  • Potential Bias: User demographics on IMDB may not represent general population sentiment patterns

Access Methods

The dataset is available through multiple platforms to suit different workflows:
  • Original Source: Stanford AI Lab at ai.stanford.edu/~amaas/data/sentiment/ (tar.gz archive)
  • HuggingFace Datasets: load_dataset('imdb') with train/test/unsupervised splits
  • TensorFlow Datasets: tfds.load('imdb_reviews') with built-in preprocessing options
  • Kaggle: CSV format with direct download and 1,900+ community notebooks
  • This Page: Processed CSV files ready for immediate use

Table Overview

train

Contains 25,000 rows and 2 columns. Column types: 1 numeric, 1 text.


test

Contains 25,000 rows and 2 columns. Column types: 1 numeric, 1 text.


Tables

train

25,000 rows · 2 columns

Data Preview

Row | text                           | label
1   | I rented I AM CURIOUS-YELLO... | 0
2   | "I Am Curious: Yellow" is a... | 0
3   | If only to avoid making thi... | 0

Data Profile

25,000 rows · 2 columns · 100% complete · 2.4 MB estimated size

Column Types

1 numeric, 1 text

High-Cardinality Columns

Columns with many unique values (suitable for identifiers or categorical features)

  • text (24,904 unique values)

Data Dictionary

train

Column | Type    | Example                                              | Missing Values
text   | string  | "I rented I AM CURIOU...", ""I Am Curious: Yello..." | 0
label  | numeric | 0, 0                                                 | 0
Created: December 26, 2025