
ATL Flight Price Predictor

Archived
Timeline

Oct 2023 - Dec 2023

Role & Context

Machine Learning Project

Core Tech
Python, Scikit-Learn, Pandas, Random Forest, KNN, Lasso Regression

Project Summary

Developed a machine learning pipeline to predict economy-class flight prices for routes to and from ATL. Random Forest achieved R² ≈ 0.96 (reported here as ~96% accuracy) after benchmarking against KNN and Linear Regression models.

Key Features

  • Comparative analysis of 3 models: Linear Regression, KNN, and Random Forest
  • Achieved ~96% accuracy (R² = 0.96) with the Random Forest model
  • Lasso Regression used for feature selection and dimensionality reduction
  • Filtered a Kaggle dataset down to economy flights to and from ATL

Impact & Takeaways

  • Random Forest outperformed Linear Regression by a wide margin (MSE 88.46 vs. 1076)
  • Linear Regression's poor fit (R² ≈ 0.51, reported as ~50% accuracy) pointed to non-linear pricing relationships
  • Total travel distance and arrival date were identified as the highest-impact features
  • Trade-off observed between model accuracy (Random Forest) and interpretability (Linear models)

Project Definition

Flight ticket prices are highly dynamic and significantly influence consumer travel choices. This project set out to build a machine learning model geared towards Georgia Tech students, predicting costs for economy-class flights to and from Hartsfield-Jackson Atlanta International Airport (ATL). By restricting the data to economy seats and pruning unrelated records, we aimed for a narrowly scoped, accurate predictor.

Data Pipeline & Feature Engineering

We used a Kaggle flight prices dataset and cleaned it in three steps: simple imputation for missing values, normalization of numerical data, and pruning of irrelevant features. To identify the most predictive variables, we applied Lasso Regression for feature selection.
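The cleaning and selection steps described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic, the column names are hypothetical, and mean imputation, `StandardScaler`, and `alpha=5.0` are stand-in choices that the write-up does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned flight table (column names are hypothetical).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "total_travel_distance": rng.uniform(200, 3000, n),
    "arrival_day_of_year": rng.integers(1, 366, n).astype(float),
    "noise_feature": rng.normal(size=n),
})
# Price depends only on distance and date, so noise_feature carries no signal.
y = 0.1 * X["total_travel_distance"] + 0.2 * X["arrival_day_of_year"] + rng.normal(0, 5, n)

# Blank out a few values so the imputation step has something to do.
X.loc[rng.choice(n, 20, replace=False), "total_travel_distance"] = np.nan

selector = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),    # simple imputation (mean is an assumption)
    ("scale", StandardScaler()),                   # normalization of numerical data
    ("lasso", SelectFromModel(Lasso(alpha=5.0))),  # keep features with non-zero Lasso weights
])
selector.fit(X, y)

mask = selector.named_steps["lasso"].get_support()
selected_features = list(X.columns[mask])
print(selected_features)
```

Scaling before the Lasso step matters because the L1 penalty acts on coefficient magnitudes, which depend on each feature's units.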

Key Features (Ranked by Importance)

  • Total Travel Distance: The primary driver of cost.
  • Arrival Date: Day of year (1-365).
  • Departure Time: Minute of the day (0-1439).
  • Arrival Time: Minute of the day (0-1439).
  • Starting Airport: IATA code.
  • Segments Duration: Total flight time in seconds.
  • Destination Airport: IATA code.
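The encodings listed above (day of year, minute of day) can be derived with pandas roughly as shown below. The raw column names and string formats are hypothetical, and one-hot encoding of the IATA codes is an assumption, since the write-up does not say how the airport codes were encoded.

```python
import pandas as pd

# Hypothetical raw columns; the real dataset's names and formats may differ.
raw = pd.DataFrame({
    "arrival_date": ["2022-01-01", "2022-12-31"],
    "departure_time": ["00:00", "23:59"],
    "starting_airport": ["ATL", "ATL"],
    "destination_airport": ["LAX", "JFK"],
})

departure = pd.to_datetime(raw["departure_time"], format="%H:%M")

features = pd.DataFrame({
    # Day of year: 1-365 in a non-leap year.
    "arrival_day": pd.to_datetime(raw["arrival_date"]).dt.dayofyear,
    # Minute of the day: 0-1439.
    "departure_minute": departure.dt.hour * 60 + departure.dt.minute,
})

# IATA codes are categorical; one-hot encoding is one option (assumed here).
airports = pd.get_dummies(raw[["starting_airport", "destination_airport"]])
features = features.join(airports)
print(features)
```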

Model Development & Results

We implemented and compared three distinct supervised learning algorithms to determine the best approach for price prediction:

Linear Regression

Failed to capture non-linear pricing patterns.

  • Accuracy: ≈50%
  • MSE: 1076
  • R-squared: 0.51

K-Nearest Neighbors

High accuracy, low interpretability.

  • Accuracy: ≈90%
  • MSE: ~210
  • R-squared: ~0.905

Random Forest

Best performance.

  • Accuracy: ≈96%
  • MSE: 88.46
  • R-squared: 0.9599
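A comparison of this kind can be reproduced in outline with scikit-learn. The sketch below uses synthetic, deliberately non-linear data rather than the project's dataset, and the hyperparameters (`n_neighbors=5`, `n_estimators=100`, the 80/20 split) are assumptions, so its numbers will not match the figures reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, deliberately non-linear target; NOT the project's data.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(2000, 3))
y = 100 * np.sin(6 * X[:, 0]) + 50 * (X[:, 1] > 0.5) + rng.normal(0, 5, 2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the same split and score it on the same held-out set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R2={results[name]['r2']:.3f}")
```

Scoring every model on one shared test split keeps the MSE and R² values directly comparable.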

Discussion & Conclusion

Our analysis revealed a clear hierarchy in model performance using the selected features. Linear Regression fit poorly, with an MSE of 1076 and an R-squared of 0.51, indicating that flight pricing follows non-linear patterns that a linear model cannot capture. KNN performed significantly better (~90% accuracy), but lacked transparency in how specific features influenced the output.

The Random Forest model emerged as the superior choice, achieving an R-squared of 0.96 and the lowest error (MSE 88.46). While it sacrifices some interpretability compared to linear models, its ability to handle non-linear interactions between features such as travel distance and arrival date (a proxy for seasonality) made it the most effective tool for this specific problem domain.
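The "accuracy" percentages in this write-up track R-squared, which is tied to MSE through R² = 1 - MSE / Var(y) when both are computed on the same test set. As a quick sanity check, the sketch below plugs in the reported figures and recovers the target variance each pair implies. The KNN values are the approximate ones quoted above, and the units are whatever the price target was measured in, which the write-up does not state.

```python
import math

# Reported (MSE, R²) pairs from the results above; KNN values are approximate.
reported = {
    "Linear Regression": (1076.0, 0.51),
    "KNN": (210.0, 0.905),
    "Random Forest": (88.46, 0.9599),
}

implied = {}
for name, (mse, r2) in reported.items():
    # R² = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R²)
    implied[name] = mse / (1.0 - r2)
    print(f"{name}: RMSE={math.sqrt(mse):.2f}, implied Var(y)={implied[name]:.0f}")
```

All three pairs imply a target variance of roughly 2,200, which is consistent with the models having been scored on the same held-out data.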
