Agents for Data

Titanic Dataset

Titanic passenger dataset with 887 records featuring survival outcomes, demographics, and fare data. 38.6% survival rate reveals stark gender disparities (74% female vs 19% male) and class-based inequalities in the 1912 maritime disaster.

Tags: machine-learning, classification, binary-classification, survival-analysis, beginner-friendly, feature-engineering, historical, education, kaggle, titanic
1 table, 887 rows
Last updated: December 27, 2025
Time: April 1912
Location: North Atlantic Ocean (Southampton to New York route)
Created by Dataset Agent

Overview

The Titanic dataset is one of the most widely used introductory datasets for binary classification in machine learning. It contains passenger information from the RMS Titanic, which sank on April 15, 1912, after colliding with an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, approximately 1,502 perished, making it one of the deadliest peacetime maritime disasters in history.
This dataset answers the question: What sorts of people were more likely to survive? The data reveals that survival was not random—factors like gender, passenger class, and age significantly influenced who lived and who died.
The dataset contains 887 passenger records with an overall survival rate of 38.56% (342 survivors out of 887 passengers).
SQL
SELECT
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  COUNT(*) AS total,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2) AS survival_rate
FROM titanic.csv
Data
| Survivors | Total | Survival Rate (%) |
|-----------|-------|-------------------|
| 342       | 887   | 38.56             |
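To illustrate how the COUNT(CASE WHEN ...) pattern in the query above turns a 0/1 column into a rate, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory table name `titanic` and the helper `survival_summary` are assumptions for the demo, and the five rows are copied from the Sample Records section of this page, so the result differs from the full 887-row figure.

```python
import sqlite3

# Five (Survived, Pclass, Sex) rows copied from the sample records on this page.
SAMPLE_ROWS = [
    (0, 3, "male"),
    (1, 1, "female"),
    (1, 3, "female"),
    (1, 1, "female"),
    (0, 3, "male"),
]


def survival_summary(rows):
    """Return (survivors, total, survival_rate_percent) for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titanic (Survived INTEGER, Pclass INTEGER, Sex TEXT)")
    conn.executemany("INSERT INTO titanic VALUES (?, ?, ?)", rows)
    # CASE without ELSE yields NULL and COUNT skips NULLs,
    # so the inner COUNT only counts rows where Survived = 1.
    result = conn.execute(
        """
        SELECT COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
               COUNT(*) AS total,
               ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2)
                   AS survival_rate
        FROM titanic
        """
    ).fetchone()
    conn.close()
    return result


print(survival_summary(SAMPLE_ROWS))  # (3, 5, 60.0) for the five sample rows
```

Multiplying by 100.0 rather than 100 keeps the division in floating point, which matters in engines that use integer division for integer operands.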

Dataset at a Glance

Passenger Class Distribution

The Pclass column represents ticket class, which served as a proxy for socioeconomic status in 1912:
Passenger Class Breakdown
| Pclass Value | Class Name   | Count | Percentage | Description                                |
|--------------|--------------|-------|------------|--------------------------------------------|
| 1            | First Class  | 216   | 24.4%      | Upper class, luxury cabins on upper decks  |
| 2            | Second Class | 184   | 20.7%      | Middle class, comfortable accommodations   |
| 3            | Third Class  | 487   | 54.9%      | Lower class, basic quarters on lower decks |
SQL
SELECT
  Pclass,
  COUNT(*) AS count,
  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM titanic.csv), 1) AS percentage
FROM titanic.csv
GROUP BY Pclass
ORDER BY Pclass
Note: The dataset shows significant class imbalance—over half the passengers (54.9%) traveled in third class. This reflects the ship's design to maximize steerage passenger capacity for immigrant transport to America.

Key Survival Patterns

Gender: The Strongest Predictor

74.2% of women survived compared to only 18.9% of men—a 55 percentage point difference that reflects the "women and children first" evacuation protocol.
SQL
SELECT
  Sex,
  COUNT(*) AS total,
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 1) AS survival_rate
FROM titanic.csv
GROUP BY Sex
Data
| Gender | Total Passengers | Survivors | Survival Rate (%) |
|--------|------------------|-----------|-------------------|
| Male   | 573              | 109       | 18.9              |
| Female | 314              | 233       | 74.2              |

Class: Wealth Determined Access

First-class passengers had a 62.96% survival rate versus only 24.44% for third-class—a gap of nearly 40 percentage points revealing how cabin location affected lifeboat access.
SQL
SELECT
  Pclass,
  COUNT(*) AS total,
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2) AS survival_rate
FROM titanic.csv
GROUP BY Pclass
ORDER BY Pclass
Data
| Class     | Total Passengers | Survivors | Survival Rate (%) |
|-----------|------------------|-----------|-------------------|
| 1st Class | 216              | 136       | 62.96             |
| 2nd Class | 184              | 87        | 47.28             |
| 3rd Class | 487              | 119       | 24.44             |

The Compounding Effect: Gender × Class

First-class women had an extraordinary 96.81% survival rate, while third-class men survived at only 13.7%—a 7:1 ratio demonstrating how gender and class compounded to determine fate.
SQL
SELECT
  Sex,
  Pclass,
  COUNT(*) AS total,
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2) AS survival_rate
FROM titanic.csv
GROUP BY Sex, Pclass
ORDER BY survival_rate DESC
Data
| Category        | Sex    | Pclass | Total | Survivors | Survival Rate (%) |
|-----------------|--------|--------|-------|-----------|-------------------|
| 1st Class Women | female | 1      | 94    | 91        | 96.81             |
| 2nd Class Women | female | 2      | 76    | 70        | 92.11             |
| 3rd Class Women | female | 3      | 144   | 72        | 50.00             |
| 1st Class Men   | male   | 1      | 122   | 45        | 36.89             |
| 2nd Class Men   | male   | 2      | 108   | 17        | 15.74             |
| 3rd Class Men   | male   | 3      | 343   | 47        | 13.70             |
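The 7:1 figure quoted for this section can be checked directly from the counts in the gender and class breakdown. The short sketch below recomputes each group's rate and the ratio between the best and worst group; the dictionary layout and variable names are illustrative choices, not part of the dataset.

```python
# (total, survivors) per (sex, passenger class), copied from the table above.
COUNTS = {
    ("female", 1): (94, 91),
    ("female", 2): (76, 70),
    ("female", 3): (144, 72),
    ("male", 1): (122, 45),
    ("male", 2): (108, 17),
    ("male", 3): (343, 47),
}


def survival_rate(total, survivors):
    """Survival rate in percent, rounded to two decimals like the SQL query."""
    return round(survivors * 100.0 / total, 2)


rates = {group: survival_rate(*counts) for group, counts in COUNTS.items()}
best = max(rates, key=rates.get)    # group with the highest survival rate
worst = min(rates, key=rates.get)   # group with the lowest survival rate
ratio = rates[best] / rates[worst]

print(best, rates[best])
print(worst, rates[worst])
print(f"ratio: {ratio:.2f}")
```

The ratio comes out just above 7, which is where the rounded "7:1" in the text comes from.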

Age: Children Had Better Odds

Passengers ranged from 0.42 to 80 years old with a mean age of 29.5 years. Children under 18 had a 50% survival rate—the highest among all age groups.
SQL
SELECT
  ROUND(AVG(Age), 1) AS avg_age,
  MIN(Age) AS min_age,
  MAX(Age) AS max_age
FROM titanic.csv
WHERE Age IS NOT NULL
Data
| Avg Age | Min Age | Max Age |
|---------|---------|---------|
| 29.5    | 0.42    | 80      |
SQL
SELECT
  CASE
    WHEN Age < 18 THEN 'Child (0-17)'
    WHEN Age < 35 THEN 'Young Adult (18-34)'
    WHEN Age < 55 THEN 'Middle Age (35-54)'
    ELSE 'Senior (55+)'
  END AS age_group,
  COUNT(*) AS total,
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2) AS survival_rate
FROM titanic.csv
WHERE Age IS NOT NULL
GROUP BY age_group
Data
| Age Group           | Total | Survivors | Survival Rate (%) |
|---------------------|-------|-----------|-------------------|
| Child (0-17)        | 130   | 65        | 50.00             |
| Young Adult (18-34) | 474   | 168       | 35.44             |
| Middle Age (35-54)  | 234   | 96        | 41.03             |
| Senior (55+)        | 49    | 13        | 26.53             |
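The age buckets used in the query above can be reproduced outside SQL when building features. The following is a minimal Python sketch of the same boundaries; the function name `age_group` and the handling of missing ages are assumptions, and the sample ages are the five from the Sample Records section.

```python
from collections import Counter


def age_group(age):
    """Map an age in years to the same bucket labels as the SQL CASE expression."""
    if age is None:
        # The SQL drops these rows with WHERE Age IS NOT NULL. Without that
        # filter, a NULL age would fail every WHEN test and land in the ELSE
        # branch ('Senior (55+)'), so missing ages are kept separate here.
        return None
    if age < 18:
        return "Child (0-17)"
    if age < 35:
        return "Young Adult (18-34)"
    if age < 55:
        return "Middle Age (35-54)"
    return "Senior (55+)"


sample_ages = [22, 38, 26, 35, 35]  # ages of the five sample passengers
group_counts = Counter(age_group(a) for a in sample_ages)
print(group_counts)
```

Each boundary is a strict upper bound, so a passenger aged exactly 18 counts as a young adult and one aged exactly 35 as middle age, matching the SQL.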

Fare: Economic Status Mattered

Ticket fares ranged from £0 to £512.33 (average £32.31). Passengers paying premium fares (£100+) had a 73.58% survival rate versus just 6.67% for those with free tickets.
SQL
SELECT
  ROUND(AVG(Fare), 2) AS avg_fare,
  MIN(Fare) AS min_fare,
  ROUND(MAX(Fare), 2) AS max_fare
FROM titanic.csv
Data
| Avg Fare (£) | Min Fare (£) | Max Fare (£) |
|--------------|--------------|--------------|
| 32.31        | 0            | 512.33       |
SQL
SELECT
  CASE
    WHEN Fare = 0 THEN 'Free (£0)'
    WHEN Fare < 10 THEN 'Low (<£10)'
    WHEN Fare < 30 THEN 'Medium (£10-30)'
    WHEN Fare < 100 THEN 'High (£30-100)'
    ELSE 'Premium (£100+)'
  END AS fare_category,
  COUNT(*) AS total,
  COUNT(CASE WHEN Survived = 1 THEN 1 END) AS survivors,
  ROUND(COUNT(CASE WHEN Survived = 1 THEN 1 END) * 100.0 / COUNT(*), 2) AS survival_rate
FROM titanic.csv
GROUP BY fare_category
ORDER BY MIN(Fare)
Data
| Fare Category   | Total | Survivors | Survival Rate (%) |
|-----------------|-------|-----------|-------------------|
| Free (£0)       | 15    | 1         | 6.67              |
| Low (<£10)      | 318   | 66        | 20.75             |
| Medium (£10-30) | 314   | 134       | 42.68             |
| High (£30-100)  | 187   | 102       | 54.55             |
| Premium (£100+) | 53    | 39        | 73.58             |

Data Quality Notes

Understanding data quality is essential for accurate analysis. In this 887-row version every column is fully populated: the Data Dictionary below reports 0 missing values per column, and the age-group counts above sum to 887.
Age Missing Values: In other variants, such as the Kaggle train.csv, approximately 20% of Age values are missing. Common strategies for those files include: (1) median imputation by passenger class, (2) median imputation by title extracted from Name, or (3) creating an 'Age_Missing' binary feature to capture the missingness pattern itself.
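Strategies (1) and (3) can be combined in a few lines. The sketch below is one possible implementation for a version of the data in which some Age values are missing; the function name `impute_age_by_class`, the record layout, and the toy records are all hypothetical and only serve to show the mechanics.

```python
from statistics import median


def impute_age_by_class(records):
    """Return copies of the records with Age filled by the per-class median.

    Each output record also gets an 'Age_Missing' flag (1 if Age was absent).
    """
    # Collect the known ages for each passenger class.
    ages_by_class = {}
    for rec in records:
        if rec["Age"] is not None:
            ages_by_class.setdefault(rec["Pclass"], []).append(rec["Age"])

    class_medians = {pclass: median(ages) for pclass, ages in ages_by_class.items()}
    # Fallback for a class in which every age is missing.
    overall_median = median([a for ages in ages_by_class.values() for a in ages])

    filled = []
    for rec in records:
        out = dict(rec)  # copy so the input is left untouched
        out["Age_Missing"] = 1 if rec["Age"] is None else 0
        if rec["Age"] is None:
            out["Age"] = class_medians.get(rec["Pclass"], overall_median)
        filled.append(out)
    return filled


# Hypothetical toy records, not taken from the dataset.
toy = [
    {"Pclass": 1, "Age": 38.0},
    {"Pclass": 1, "Age": 54.0},
    {"Pclass": 1, "Age": None},
    {"Pclass": 3, "Age": 22.0},
    {"Pclass": 3, "Age": 26.0},
    {"Pclass": 3, "Age": 35.0},
    {"Pclass": 3, "Age": None},
]
imputed = impute_age_by_class(toy)
for row in imputed:
    print(row)
```

When a model is evaluated on held-out data, the medians should be computed on the training split only and then reused for the test split, so that no information leaks from the test rows.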

Feature Engineering Opportunities

The raw features can be transformed to improve predictive models. Here are proven feature engineering techniques used by top Kaggle competitors:
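Title Extraction Tip: The Name column contains honorific titles (Mr., Mrs., Miss., Master., Dr., Rev., etc.) that strongly correlate with survival. In this file the title comes first ("Mr. Owen Harris Braund"), so anchor the regex at the start of the string: df['Title'] = df['Name'].str.extract(r'^([A-Za-z]+)\.', expand=False). The pattern ' ([A-Za-z]+)\.' commonly used with the Kaggle files assumes the "Surname, Title. Given names" layout and does not match names in this layout. "Master" indicates male children, a group with a 57.5% survival rate.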
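Because the name layout differs between versions of the dataset, it is worth testing the title regex against a few real names before applying it to the whole column. The sketch below uses only the standard `re` module; the helper `extract_title` and its pattern are assumptions chosen to accept both the title-first layout shown in this page's sample records and the "Surname, Title. Given names" layout used by the Kaggle files.

```python
import re

# Title either at the very start of the name or right after "Surname, ".
TITLE_PATTERN = re.compile(r"(?:^|,\s*)([A-Za-z]+)\.")


def extract_title(name):
    """Return the honorific without its trailing dot, or None if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


# Names in the layout used by this page's sample records.
names = [
    "Mr. Owen Harris Braund",
    "Mrs. John Bradley Cumings",
    "Miss. Laina Heikkinen",
]
titles = [extract_title(n) for n in names]
print(titles)

# The same passenger in the comma-separated layout.
print(extract_title("Braund, Mr. Owen Harris"))

# A pattern that requires a leading space finds nothing in the title-first layout.
print(re.search(r" ([A-Za-z]+)\.", "Mr. Owen Harris Braund"))
```

Returning None for names without a recognizable title makes unmatched rows easy to count before deciding how to group rare titles.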

Benchmark Model Performance

When building survival prediction models, here are typical accuracy benchmarks to compare against:
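Two simple reference points can be derived from the counts already reported in the survival tables on this page: always predicting the majority outcome, and predicting that every woman survived and every man did not. The sketch below computes both; the function names are illustrative, and the results are in-sample accuracies on the full 887 rows rather than held-out scores.

```python
# (total, survivors) by sex, copied from the gender survival table on this page.
BY_SEX = {"male": (573, 109), "female": (314, 233)}


def majority_class_accuracy(by_sex):
    """Accuracy of always predicting the more common outcome."""
    total = sum(t for t, _ in by_sex.values())
    survivors = sum(s for _, s in by_sex.values())
    return max(survivors, total - survivors) / total


def gender_rule_accuracy(by_sex):
    """Accuracy of predicting 'survived' for every woman and 'died' for every man."""
    total = sum(t for t, _ in by_sex.values())
    female_total, female_survivors = by_sex["female"]
    male_total, male_survivors = by_sex["male"]
    correct = female_survivors + (male_total - male_survivors)
    return correct / total


print(f"majority class: {majority_class_accuracy(BY_SEX):.1%}")  # 61.4%
print(f"gender rule:    {gender_rule_accuracy(BY_SEX):.1%}")     # 78.6%
```

A trained model that does not clearly beat the gender rule on held-out data is adding little beyond the Sex column.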

Historical Context

The RMS Titanic was the largest ship afloat at the time of its maiden voyage in April 1912. Operated by the White Star Line, it was marketed as "practically unsinkable" due to its advanced safety features including 16 watertight compartments. The ship struck an iceberg at 11:40 PM on April 14, 1912, and sank in under three hours.
The disaster exposed critical failures: the ship carried only 20 lifeboats (capacity for 1,178 people) despite holding 2,224 passengers and crew. Third-class passengers faced locked gates and confusing routes to the boat deck. The tragedy led to the International Convention for the Safety of Life at Sea (SOLAS), still the primary maritime safety treaty today.

Sample Records

Example Passenger Records
| Survived | Pclass | Name                        | Sex    | Age | SibSp | Parch | Fare   |
|----------|--------|-----------------------------|--------|-----|-------|-------|--------|
| 0        | 3      | Mr. Owen Harris Braund      | male   | 22  | 1     | 0     | £7.25  |
| 1        | 1      | Mrs. John Bradley Cumings   | female | 38  | 1     | 0     | £71.28 |
| 1        | 3      | Miss. Laina Heikkinen       | female | 26  | 0     | 0     | £7.93  |
| 1        | 1      | Mrs. Jacques Heath Futrelle | female | 35  | 1     | 0     | £53.10 |
| 0        | 3      | Mr. William Henry Allen     | male   | 35  | 0     | 0     | £8.05  |
SQL
SELECT
  Survived,
  Pclass,
  Name,
  Sex,
  Age,
  "Siblings/Spouses Aboard" AS SibSp,
  "Parents/Children Aboard" AS Parch,
  Fare
FROM titanic.csv
LIMIT 5

Important Considerations

Dataset Scope: This dataset contains 887 of the ~1,309 passengers aboard (excluding crew). Records with incomplete information were excluded from this version. The 38.56% survival rate here is higher than the historical overall survival rate of about 32%, which covers passengers and crew (roughly 722 survivors out of 2,224 aboard), whereas this file is a sample of passengers only.
Column Encoding: The 'Survived' column uses binary encoding (1 = survived, 0 = did not survive). The 'Sex' column contains string values ('male', 'female') that must be encoded for most ML algorithms. 'Siblings/Spouses Aboard' (SibSp in other versions) counts siblings and spouses together; 'Parents/Children Aboard' (Parch in other versions) counts parents and children together.
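As an illustration of the string-to-number step for the Sex column, here is a minimal sketch. The mapping direction (male = 0, female = 1) and the helper name `encode_record` are arbitrary assumptions; any consistent mapping works as long as the same one is applied to training and prediction data.

```python
# Assumed mapping; the direction is arbitrary but must be applied consistently.
SEX_CODES = {"male": 0, "female": 1}


def encode_record(record):
    """Return a copy of the record with 'Sex' replaced by its numeric code.

    Raises KeyError on an unexpected value so that typos or new categories
    are noticed instead of being silently mapped.
    """
    encoded = dict(record)
    encoded["Sex"] = SEX_CODES[record["Sex"]]
    return encoded


# Two of the sample passengers from this page, reduced to three columns.
passengers = [
    {"Survived": 0, "Pclass": 3, "Sex": "male"},
    {"Survived": 1, "Pclass": 1, "Sex": "female"},
]
encoded_passengers = [encode_record(p) for p in passengers]
print(encoded_passengers)
```

With only two categories a single 0/1 column is sufficient; columns with more categories usually call for one-hot encoding instead.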

Dataset Variants Comparison

Multiple versions of the Titanic dataset exist across different platforms. This table helps identify which version you're working with:

Table Overview

titanic

Contains 887 rows and 8 columns. Column types: 6 numeric, 2 text.


Data Preview

| Row | Survived | Pclass | Name                           | Sex    | Age |
|-----|----------|--------|--------------------------------|--------|-----|
| 1   | 0        | 3      | Mr. Owen Harris Braund         | male   | 22  |
| 2   | 1        | 1      | Mrs. John Bradley (Florence... | female | 38  |
| 3   | 1        | 3      | Miss. Laina Heikkinen          | female | 26  |
(3 more columns not shown)

Data Profile

887 rows, 8 columns, 100% complete, 346.5 KB estimated size

Column Types

6 numeric, 2 text

High-Cardinality Columns

Columns with many unique values (suitable for identifiers or categorical features)

  • Name (887 unique values)

Data Dictionary

titanic

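| Column                  | Type    | Example                                              | Missing Values |
|-------------------------|---------|------------------------------------------------------|----------------|
| Survived                | numeric | 0, 1                                                 | 0              |
| Pclass                  | numeric | 3, 1                                                 | 0              |
| Name                    | string  | "Mr. Owen Harris Brau...", "Mrs. John Bradley (F..." | 0              |
| Sex                     | string  | "male", "female"                                     | 0              |
| Age                     | numeric | 22, 38                                               | 0              |
| Siblings/Spouses Aboard | numeric | 1, 1                                                 | 0              |
| Parents/Children Aboard | numeric | 0, 0                                                 | 0              |
| Fare                    | numeric | 7.25, 71.2833                                        | 0              |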
Last updated: December 27, 2025
Created: December 26, 2025