Project Definition
Flight ticket prices are highly dynamic and significantly influence consumer travel choices. This project developed a machine learning model aimed at Georgia Tech students that predicts the cost of economy-class flights to and from Hartsfield-Jackson Atlanta International Airport (ATL). By restricting the scope to economy seats and pruning unrelated data, we sought to build a narrowly focused and accurate predictor.
Data Pipeline & Feature Engineering
We used a Kaggle Flight Prices Dataset and applied a cleaning process that involved simple imputation of missing values, normalization of numerical data, and pruning of irrelevant features. To identify the most predictive variables, we used Lasso Regression for feature selection: its L1 penalty shrinks the coefficients of uninformative features to zero, so the features that keep non-zero coefficients are the ones retained.
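The cleaning and selection steps described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn on synthetic data, not the project's actual code: the column names, the mean-imputation strategy, the use of StandardScaler for normalization, and the alpha value are all assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the flight table (NOT the Kaggle data):
# two informative columns and one pure-noise column. Names are hypothetical.
feature_names = ["distance", "duration", "noise"]
X = rng.normal(size=(2000, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)

# Blank out about 5% of the entries so the imputation step has work to do.
X[rng.random(X.shape) < 0.05] = np.nan

# Impute -> normalize -> Lasso. The L1 penalty (alpha, assumed here)
# drives the coefficients of weak predictors to exactly zero.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    Lasso(alpha=0.1),
)
pipeline.fit(X, y)

# Rank features by absolute coefficient and keep the non-zero ones.
coefs = pipeline.named_steps["lasso"].coef_
ranked = sorted(zip(feature_names, coefs), key=lambda p: abs(p[1]), reverse=True)
selected = [name for name, coef in ranked if coef != 0.0]
print(selected)
```

Because the features are standardized before the Lasso step, the absolute coefficient sizes are comparable across columns, which is what makes a ranking by importance meaningful.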
Key Features (Ranked by Importance)
- Total Travel Distance: The primary driver of cost.
- Arrival Date: Day of year (1-365).
- Departure Time: Minute of the day (0-1439).
- Arrival Time: Minute of the day (0-1439).
- Starting Airport: IATA code.
- Segments Duration: Total flight time in seconds.
- Destination Airport: IATA code.
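The date and time encodings listed above can be derived from a timestamp as in the sketch below. The helper names and the example flight are hypothetical. The source does not say how the IATA codes were turned into numbers, so the integer index shown here is an assumption (one-hot encoding would be an equally plausible choice).

```python
from datetime import datetime


def day_of_year(ts: datetime) -> int:
    """Day of year, 1 for January 1st."""
    return ts.timetuple().tm_yday


def minute_of_day(ts: datetime) -> int:
    """Minutes elapsed since midnight (0-1439)."""
    return ts.hour * 60 + ts.minute


def encode_airports(codes):
    """Map each distinct IATA code to an integer index (assumed encoding)."""
    return {code: i for i, code in enumerate(sorted(set(codes)))}


# Hypothetical flight, used only to illustrate the encodings.
departure = datetime(2022, 4, 17, 6, 45)
arrival = datetime(2022, 4, 17, 9, 30)
airport_index = encode_airports(["ATL", "LAX", "BOS", "ATL"])

features = {
    "arrival_day_of_year": day_of_year(arrival),      # 107
    "departure_minute": minute_of_day(departure),     # 405
    "arrival_minute": minute_of_day(arrival),         # 570
    "starting_airport": airport_index["ATL"],         # 0
    "destination_airport": airport_index["BOS"],      # 1
}
print(features)
```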
Model Development & Results
We implemented and compared three distinct supervised learning algorithms to determine the best approach for price prediction:
Linear Regression
Failed to capture the complexity of flight pricing.
- Accuracy: ≈50%
- MSE: 1076
- R-squared: 0.51
K-Nearest Neighbors
High accuracy, low interpretability.
- Accuracy: ≈90%
- MSE: ≈210
- R-squared: ≈0.905
Random Forest
Best performance.
- Accuracy: ≈96%
- MSE: 88.46
- R-squared: 0.9599
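A comparison of this kind can be set up as in the following sketch, which fits the three model families on the same train/test split and reports MSE and R-squared for each. It runs on synthetic non-linear data rather than the flight dataset, and the hyperparameters (n_neighbors, n_estimators, test_size) are assumptions, so the numbers it prints are not the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear target (NOT flight prices): a straight line
# cannot fit x0^2 + 3|x1| on a domain symmetric around zero.
X = rng.uniform(-3, 3, size=(1500, 2))
y = X[:, 0] ** 2 + 3 * np.abs(X[:, 1]) + rng.normal(scale=0.1, size=1500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are illustrative assumptions.
models = {
    "Linear Regression": LinearRegression(),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name:20s} MSE={results[name]['mse']:.3f}  R2={results[name]['r2']:.3f}")
```

Evaluating every model on the same held-out split is what makes the MSE and R-squared values directly comparable across models.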
Discussion & Conclusion
Our analysis revealed a clear performance hierarchy among the models trained on the selected features. Linear Regression was largely unreliable, with an MSE of 1076 and an R-squared of 0.51, which suggests that flight pricing follows non-linear patterns that a linear model cannot capture. K-Nearest Neighbors (KNN) performed significantly better (≈90% accuracy) but offered little transparency into how specific features influenced its predictions.
The Random Forest model was the strongest of the three, achieving an R-squared of 0.96 and the lowest error (MSE 88.46). While it sacrifices some interpretability compared to linear models, its ability to capture non-linear interactions between features such as travel distance and seasonality made it the most effective model for this problem.
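The reported metrics can be put on a more readable scale with a little arithmetic. The sketch below converts each MSE to an RMSE (in the same units as the ticket price) and uses the identity R-squared = 1 - MSE / Var(y) to back out the target variance each pair of numbers implies. It assumes all three models were scored on the same test set, which the source does not state explicitly, and the KNN figures are the approximate values reported above.

```python
import math

# Metrics as reported in the results section (KNN values are approximate).
reported = {
    "Linear Regression": {"mse": 1076.0, "r2": 0.51},
    "K-Nearest Neighbors": {"mse": 210.0, "r2": 0.905},
    "Random Forest": {"mse": 88.46, "r2": 0.9599},
}

summary = {}
for name, m in reported.items():
    rmse = math.sqrt(m["mse"])                  # typical error, in price units
    implied_var = m["mse"] / (1.0 - m["r2"])    # from R2 = 1 - MSE / Var(y)
    summary[name] = {"rmse": rmse, "implied_var": implied_var}
    print(f"{name:20s} RMSE={rmse:6.2f}  implied Var(y)={implied_var:7.1f}")
```

All three models imply a target variance of roughly 2,200, so the reported MSE and R-squared values are mutually consistent. On the RMSE scale, the typical error falls from about 33 price units for Linear Regression to about 9 for Random Forest. The reported accuracy percentages also sit close to R-squared times 100, though the source does not say how accuracy was computed.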