
Flights Dataset (1M Records)

One million U.S. domestic flight records from January 2006 with departure/arrival delays, air time, and distance metrics. Clean data with zero missing values—ready for delay prediction ML models, time-series analysis, and aviation research.

Tags: aviation, flights, transportation, delay-prediction, time-series, machine-learning, regression, us-domestic, bts-data, 1-million-rows
1 table, 1,000,000 rows
Last updated: January 2, 2026
Time: January 2006
Location: United States (Domestic Flights)
Created by Dataset Agent

Overview

This dataset contains exactly 1,000,000 U.S. domestic flight records from January 2006, sourced from the Bureau of Transportation Statistics. Each row represents a single flight operation with seven fields: flight date, departure delay, arrival delay, air time, distance, departure time, and arrival time.
With zero null values across all columns, the dataset needs no imputation before analysis, although extreme delay values may still warrant filtering (see Limitations and Considerations). The strong correlation (0.91) between departure and arrival delays makes it particularly useful for regression modeling and delay prediction applications.
Data Preview: First 5 Flight Records
FL_DATE     DEP_DELAY  ARR_DELAY  AIR_TIME  DISTANCE  DEP_TIME  ARR_TIME
2006-01-01          5         19       350      2475      9.08     12.48
2006-01-02        167        216       343      2475     11.78     15.77
2006-01-03         -7        -23       344      2475      8.88     12.13
2006-01-04         -5        -13       331      2475      8.92     11.95
2006-01-05         -3        -17       321      2475      8.95     11.88
5 rows
SQL
SELECT * FROM flights_1m.csv LIMIT 5

Dataset Schema

Understanding decimal time: DEP_TIME and ARR_TIME use decimal hours. To convert to clock time, keep the whole-number part as the hour and multiply the fractional part by 60 to get minutes. For example, 9.08 is 9 hours plus 0.08 × 60 = 4.8 minutes, which rounds to 9:05 AM.
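The conversion described above can be sketched in a few lines of Python. The function name `decimal_hours_to_hhmm` is a hypothetical helper, not part of the dataset or any library, and rounding to the nearest minute is an assumption about how the decimal values should be read.

```python
def decimal_hours_to_hhmm(value: float) -> str:
    """Convert decimal hours (e.g. 9.08) to a 24-hour HH:MM string."""
    # Work in whole minutes so that rounding can never produce ":60".
    total_minutes = round(value * 60)
    hours, minutes = divmod(total_minutes, 60)
    # Wrap at midnight in case a value rounds up to 24:00.
    return f"{hours % 24:02d}:{minutes:02d}"


# Values taken from the preview rows and the data dictionary on this page.
print(decimal_hours_to_hhmm(9.08))                # 09:05
print(decimal_hours_to_hhmm(21.100000381469727))  # 21:06
```

Rounding the total number of minutes, rather than the fractional part alone, avoids outputs such as 9:60 when a value sits just below the next hour.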

Data Coverage

  • Time Period: January 1-31, 2006 (31 days of flight operations)
  • Geographic Scope: U.S. domestic flights only
  • Record Count: Exactly 1,000,000 flight records
  • Data Completeness: 100% complete—zero null values in any column
  • Source: Bureau of Transportation Statistics (BTS)

Key Statistics and Insights

The average departure delay is 8.65 minutes, while the average arrival delay is 6.40 minutes. On average, flights make up about 2.25 minutes between departure and arrival.
SQL
SELECT ROUND(AVG(DEP_DELAY), 2) AS avg_dep_delay, ROUND(AVG(ARR_DELAY), 2) AS avg_arr_delay FROM flights_1m.csv
Data
Avg Dep Delay  Avg Arr Delay
8.65           6.4
1 row
Flight distances span from 30 miles (regional hops) to 4,962 miles (the longest routes in the data), with an average of 741 miles per flight.
SQL
SELECT MIN(DISTANCE) AS min_distance, MAX(DISTANCE) AS max_distance, ROUND(AVG(DISTANCE), 0) AS avg_distance FROM flights_1m.csv
Data
Min Distance  Max Distance  Avg Distance
30            4,962         741
1 row
The correlation between departure and arrival delays is 0.91—extremely strong, making this dataset ideal for regression-based delay prediction models.
SQL
SELECT ROUND(CORR(DEP_DELAY, ARR_DELAY), 4) AS correlation FROM flights_1m.csv
Data
Correlation
0.91
1 row
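To make the link between correlation and regression-based prediction concrete, here is a minimal sketch that computes a Pearson correlation and a least-squares line by hand. It uses only the five preview rows from this page, so it will not reproduce the 0.91 figure for the full dataset; the function name `pearson_and_fit` is hypothetical.

```python
from math import sqrt

# (DEP_DELAY, ARR_DELAY) pairs from the five preview rows on this page.
PREVIEW = [(5, 19), (167, 216), (-7, -23), (-5, -13), (-3, -17)]


def pearson_and_fit(pairs):
    """Return (r, slope, intercept) for the line y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


r, slope, intercept = pearson_and_fit(PREVIEW)
print(f"r={r:.3f} slope={slope:.3f} intercept={intercept:.3f}")
```

With a sample this small, a single large delay dominates both the correlation and the fitted slope, so the numbers are illustrative only.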

On-Time Performance Analysis

SQL
SELECT
  CASE
    WHEN ARR_DELAY <= 0 THEN 'On Time/Early'
    WHEN ARR_DELAY <= 15 THEN '1-15 min delay'
    WHEN ARR_DELAY <= 30 THEN '16-30 min delay'
    WHEN ARR_DELAY <= 60 THEN '31-60 min delay'
    ELSE '60+ min delay'
  END AS delay_category,
  COUNT(*) AS flight_count
FROM flights_1m.csv
GROUP BY delay_category
ORDER BY flight_count DESC
Data
Delay Category   Flight Count
On Time/Early    635,308
1-15 min delay   189,009
16-30 min delay  69,887
31-60 min delay  56,811
60+ min delay    48,985
5 rows
Key findings from the delay distribution:
  • 63.5% of flights arrive on time or early (635,308 flights)
  • 18.9% experience minor delays of 1-15 minutes
  • 4.9% face severe delays exceeding 60 minutes
  • The median departure delay is 0 minutes—half of all flights depart on schedule or early
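The bucketing logic and the percentages quoted in the findings can be checked with a short Python sketch. The function `delay_category` is a hypothetical mirror of the SQL CASE expression, and the counts are copied from the result table above.

```python
def delay_category(arr_delay: float) -> str:
    """Mirror the SQL CASE expression: the first matching bound wins."""
    if arr_delay <= 0:
        return "On Time/Early"
    if arr_delay <= 15:
        return "1-15 min delay"
    if arr_delay <= 30:
        return "16-30 min delay"
    if arr_delay <= 60:
        return "31-60 min delay"
    return "60+ min delay"


# Flight counts per category, as published in the result table.
COUNTS = {
    "On Time/Early": 635_308,
    "1-15 min delay": 189_009,
    "16-30 min delay": 69_887,
    "31-60 min delay": 56_811,
    "60+ min delay": 48_985,
}

total = sum(COUNTS.values())
shares = {name: round(100 * count / total, 1) for name, count in COUNTS.items()}
print(total)   # 1000000
print(shares)
```

The categories sum to exactly 1,000,000, which confirms that the five buckets cover every flight in the table.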

Delay Patterns by Time of Day

SQL
SELECT
  CASE
    WHEN DEP_TIME >= 6 AND DEP_TIME < 12 THEN 'Morning (6am-12pm)'
    WHEN DEP_TIME >= 12 AND DEP_TIME < 18 THEN 'Afternoon (12pm-6pm)'
    WHEN DEP_TIME >= 18 THEN 'Evening (6pm-12am)'
    ELSE 'Night (12am-6am)'
  END AS time_of_day,
  COUNT(*) AS flight_count,
  ROUND(AVG(DEP_DELAY), 2) AS avg_dep_delay
FROM flights_1m.csv
GROUP BY time_of_day
Data
Time Of Day           Flight Count  Avg Delay (Min)
Morning (6am-12pm)    379,877       2.61
Afternoon (12pm-6pm)  380,125       9.23
Evening (6pm-12am)    219,702       18.09
Night (12am-6am)      20,296        8.78
4 rows
Travel tip: Morning flights (6 AM - 12 PM) average just 2.61 minutes delay—nearly 7x better than evening flights at 18.09 minutes. Book early for the best on-time performance.
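For readers who want to reproduce the time-of-day grouping outside SQL, the following sketch mirrors the CASE expression on decimal-hour DEP_TIME values and recomputes the evening-to-morning ratio from the published averages. The function name `time_of_day` is a hypothetical helper.

```python
def time_of_day(dep_time: float) -> str:
    """Bucket a decimal-hour departure time the same way the query does."""
    if 6 <= dep_time < 12:
        return "Morning (6am-12pm)"
    if 12 <= dep_time < 18:
        return "Afternoon (12pm-6pm)"
    if dep_time >= 18:
        return "Evening (6pm-12am)"
    return "Night (12am-6am)"


# Average departure delay in minutes, copied from the result table.
AVG_DELAY = {
    "Morning (6am-12pm)": 2.61,
    "Afternoon (12pm-6pm)": 9.23,
    "Evening (6pm-12am)": 18.09,
    "Night (12am-6am)": 8.78,
}

ratio = AVG_DELAY["Evening (6pm-12am)"] / AVG_DELAY["Morning (6am-12pm)"]
print(time_of_day(9.08))  # Morning (6am-12pm)
print(round(ratio, 2))    # 6.93
```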

Flight Distance Distribution

SQL
SELECT
  CASE
    WHEN DISTANCE < 300 THEN 'Regional (<300 mi)'
    WHEN DISTANCE < 600 THEN 'Short-haul (300-600 mi)'
    WHEN DISTANCE < 1200 THEN 'Medium-haul (600-1200 mi)'
    WHEN DISTANCE < 2500 THEN 'Long-haul (1200-2500 mi)'
    ELSE 'Ultra long-haul (>2500 mi)'
  END AS route_type,
  COUNT(*) AS flights,
  ROUND(AVG(AIR_TIME), 0) AS avg_air_time
FROM flights_1m.csv
GROUP BY route_type
ORDER BY flights DESC
Data
Route Type                  Flights  Avg Air Time (Min)
Regional (<300 mi)          217,322  41
Short-haul (300-600 mi)     297,796  69
Medium-haul (600-1200 mi)   325,911  121
Long-haul (1200-2500 mi)    146,122  221
Ultra long-haul (>2500 mi)  12,849   346
5 rows
Medium-haul flights (600-1,200 miles) are the largest group with 325,911 flights (32.6%), followed by short-haul routes with 297,796 flights (29.8%). Ultra long-haul flights over 2,500 miles represent just 1.3% of operations and average 346 minutes of air time, or nearly 6 hours.

Working with 1 Million Rows

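Excel limitation: With 1,000,000 data rows plus a header row, this dataset sits just under Excel's limit of 1,048,576 rows per worksheet, leaving little headroom. Python (pandas), R, DuckDB, or database tools are better suited for analysis. The CSV file is approximately 45MB.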
Recommended approaches for working with this dataset:
  • Python pandas: Loads in ~2 seconds, uses ~150MB RAM
  • DuckDB: Query directly from CSV without loading into memory
  • Polars: Faster alternative to pandas for large datasets
  • R data.table: Memory-efficient for million-row datasets
  • SQL databases: Import into PostgreSQL, SQLite, or MySQL for complex queries
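If none of the tools above is installed, the file can also be processed row by row with Python's standard library, which keeps memory use flat. This is a minimal sketch: the function name `mean_delays` is hypothetical, the header names DEP_DELAY and ARR_DELAY are assumed to match the data dictionary on this page, and the demo runs on a small stand-in file rather than the real CSV.

```python
import csv
import os
import tempfile


def mean_delays(csv_path):
    """Stream a CSV row by row and return (rows, avg_dep, avg_arr)."""
    rows = 0
    dep_total = 0.0
    arr_total = 0.0
    with open(csv_path, newline="") as handle:
        for record in csv.DictReader(handle):
            dep_total += float(record["DEP_DELAY"])
            arr_total += float(record["ARR_DELAY"])
            rows += 1
    if rows == 0:
        raise ValueError("no data rows found")
    return rows, dep_total / rows, arr_total / rows


# Demo on a tiny stand-in file built from three preview rows. To analyze the
# real data, pass the path of a local copy of the CSV instead.
sample = "DEP_DELAY,ARR_DELAY\n5,19\n167,216\n-7,-23\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write(sample)

demo = mean_delays(tmp.name)
os.remove(tmp.name)
print(demo)
```

Only running totals are held in memory, so the same function works unchanged on files far larger than this one.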

Sample Analysis Questions

This dataset supports a wide range of analytical explorations:
  • What is the probability of a flight being delayed more than 30 minutes given it departs after 6 PM?
  • How does flight distance correlate with delay recovery (difference between departure and arrival delay)?
  • Which day of the week has the highest average delays?
  • Can we predict arrival delay within ±15 minutes using only departure delay and air time?
  • What percentage of severely delayed departures (60+ min) still arrive within 30 minutes of schedule?
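As one way to approach the first question, the sketch below estimates a conditional probability from a list of row dictionaries. It assumes "delayed" means ARR_DELAY above the threshold and "after 6 PM" means DEP_TIME of 18.0 or later; both readings are assumptions, the function name is hypothetical, and the sample rows are invented for illustration rather than drawn from the dataset.

```python
def prob_late_given_evening(rows, threshold=30, evening_start=18.0):
    """Estimate P(ARR_DELAY > threshold | DEP_TIME >= evening_start).

    Returns None when no row departs in the evening window.
    """
    evening = [row for row in rows if row["DEP_TIME"] >= evening_start]
    if not evening:
        return None
    late = sum(1 for row in evening if row["ARR_DELAY"] > threshold)
    return late / len(evening)


# Hypothetical rows for illustration only; not taken from the dataset.
rows = [
    {"DEP_TIME": 19.5, "ARR_DELAY": 45},
    {"DEP_TIME": 20.25, "ARR_DELAY": -3},
    {"DEP_TIME": 18.0, "ARR_DELAY": 31},
    {"DEP_TIME": 9.0, "ARR_DELAY": 120},
]

probability = prob_late_given_evening(rows)
print(probability)
```

The morning row is excluded by the condition even though it is heavily delayed, which is the difference between a conditional and an overall delay rate.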

Limitations and Considerations

Historical data: This dataset is from January 2006. Aviation patterns, airline operations, and air traffic management have evolved significantly. Use caution when applying insights to current scenarios.
  • No carrier information: Airline codes are not included—cannot compare performance by carrier
  • No route details: Origin/destination airports are absent—prevents route-level analysis
  • No cancellation data: Only completed flights; cancelled flights are not represented
  • Extreme outliers: Some delays reach -1,197 minutes (likely data errors); consider filtering outliers beyond ±500 minutes
  • Single month: January-only data may not capture seasonal patterns (summer thunderstorms, holiday travel)
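The suggested outlier filter can be sketched as a simple predicate applied before analysis. The function name `within_delay_bounds` is hypothetical, applying the ±500 minute bound to both delay columns is an assumption, and the sample rows are illustrative (the -1,197 value echoes the extreme cited above; its companion value is invented).

```python
def within_delay_bounds(row, limit=500):
    """Keep a row only if both delays fall inside [-limit, +limit] minutes."""
    return abs(row["DEP_DELAY"]) <= limit and abs(row["ARR_DELAY"]) <= limit


# Illustrative rows: two from the preview table, one hypothetical outlier.
rows = [
    {"DEP_DELAY": 5, "ARR_DELAY": 19},
    {"DEP_DELAY": 167, "ARR_DELAY": 216},
    {"DEP_DELAY": -1197, "ARR_DELAY": -1180},
]

filtered = [row for row in rows if within_delay_bounds(row)]
print(len(rows), len(filtered))  # 3 2
```

Reporting how many rows a filter removes is a useful sanity check before drawing conclusions from the trimmed data.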

Data Quality Summary

The median departure delay is 0 minutes, and 75% of flights depart no more than 8 minutes behind schedule, which points to generally reliable airline operations.
SQL
SELECT
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY DEP_DELAY) AS p25,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY DEP_DELAY) AS median,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY DEP_DELAY) AS p75
FROM flights_1m.csv
Data
P25  Median  P75
-4   0       8
1 row
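For readers unfamiliar with PERCENTILE_CONT, the sketch below shows the linear-interpolation rule it is generally defined by in standard SQL: the value at fractional position p × (n − 1) in the sorted data. The function name `percentile_cont` is a hypothetical Python stand-in, and the input is the DEP_DELAY column of the five preview rows, so the results differ from the full-dataset quartiles.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile using linear interpolation between neighbors."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Fractional index into the sorted list (0-based).
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# DEP_DELAY values from the five preview rows on this page.
delays = [5, 167, -7, -5, -3]
quartiles = [percentile_cont(delays, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)  # [-5.0, -3.0, 5.0]
```

Because the median ignores the magnitude of extreme values, the single 167-minute delay in this sample barely affects the quartiles, which is why percentiles complement the averages reported earlier.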

Table Overview

flights_1m

Contains 1,000,000 rows and 7 columns. Column types: 6 numeric, 1 text.


Data Preview

Row  FL_DATE                         DEP_DELAY  ARR_DELAY  AIR_TIME  DISTANCE
1    Fri Feb 03 2006 13:00:00 GM...         26         22        55       337
2    Fri Feb 03 2006 13:00:00 GM...         -5          2       122       762
3    Fri Feb 03 2006 13:00:00 GM...         -1        -20       120     1,009
+2 more columns not shown

Data Profile

Rows: 1,000,000
Columns: 7
Completeness: 100%
Estimated size: 333.8 MB

Column Types

6 Numeric, 1 Text

Data Dictionary

flights_1m

Column     Type     Example                                               Missing Values
FL_DATE    string   "Fri Feb 03 2006 13:0...", "Fri Feb 03 2006 13:0..."  0
DEP_DELAY  numeric  26, -5                                                0
ARR_DELAY  numeric  22, 2                                                 0
AIR_TIME   numeric  55, 122                                               0
DISTANCE   numeric  337, 762                                              0
DEP_TIME   numeric  21.100000381469727, 7.916666507720947                 0
ARR_TIME   numeric  22.33333396911621, 10.333333015441895                 0
Last updated: January 2, 2026
Created: January 2, 2026