not going anywhere

Are modern cars as efficient as we like to think? A brief data science approach

Jesse Fredrickson
May 2, 2019

In a word, no. In a few words, yes, but not quite in the way I was expecting. Let me explain.

Like many of my peers in the American Northeast, I consider myself to be somewhat environmentally-woke: I do my part in recycling, I own a reusable grocery shopping bag, I frequently walk when I can, and I don’t drive a gas-guzzling car or truck. I also believe firmly in the power of technology, and in the goodness of people to use technology to overcome contemporary environmental threats. And I see it everywhere — electric cars, solar and wind farms, solar powered trains, the list goes on. That’s why when I came across a dataset containing data on three decades’ worth of car models, I expected to find a steadily decreasing environmental footprint.

The Data

I’d like to give a brief shoutout to u/nicolas-gervais for scraping together a dataset of over 30,000 cars (that’s individual make, model, year, and trim) and posting it to r/Datasets for all to enjoy. He parsed thecarconnection.com for spec data, and what he ended up with is a list of specs containing everything one could ever dream of measuring, from Horsepower, to second passenger hip room, to miles per gallon city and highway, dating back to 1990. 29 years is quite a long time in terms of technological development, and I was eager to see what kinds of trends I could find.

Side note: before 1996, the data source is missing most of the fields I am interested in. I could impute those values, but imputed data would pollute an analysis that focuses on yearly trends, so I have omitted those years.

So, has MPG been going up?

In terms of raw miles per gallon, hardly.

Shockingly flat. But there’s much more to it than that. Let’s break it up by car body style.

There is a lot that I find interesting about this plot.

  • First and foremost, the overall low in 2008. My immediate guess is that the 2008 recession impacted car manufacturers heavily, and many chose to sacrifice MPG performance as a cost-saving measure.
  • Since then, fuel efficiency has seen a gradual recovery, especially in the 4 door and fullsize van styles (note — the death of the station wagon is because only 7 models were classified as station wagons in 2014, and then none afterwards). However, in aggregate we are still close to the same levels we were in the early 2000s.
  • Another interesting observation is that around 2008, the 4 door style overtook the 2 door style in fuel efficiency, and the gap between them has grown every year. I hypothesize that manufacturers have come to understand that the coupe market is not buying for efficiency, but for power and glamour.
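The per-style breakdown discussed above boils down to a filter and a group-by. Here is a minimal sketch using pandas on a tiny made-up sample; the column names (`year`, `body_style`, `mpg_city`) and all values are assumptions for illustration, not the dataset's real headers or numbers.

```python
import pandas as pd

# Tiny made-up sample; column names and values are illustrative assumptions.
cars = pd.DataFrame({
    "year": [1995, 2007, 2007, 2008, 2008, 2018, 2018],
    "body_style": ["4dr Car", "4dr Car", "2dr Car", "4dr Car",
                   "2dr Car", "4dr Car", "2dr Car"],
    "mpg_city": [20.0, 22.0, 21.0, 20.0, 20.5, 27.0, 21.5],
})

# Drop the sparse pre-1996 model years, as described in the side note.
cars = cars[cars["year"] >= 1996]

# Mean city MPG per model year and body style.
# unstack() pivots body styles into columns, one line per style when plotted.
mpg_by_style = cars.groupby(["year", "body_style"])["mpg_city"].mean().unstack()
print(mpg_by_style)
```

With matplotlib installed, calling `mpg_by_style.plot()` on a table shaped like this would draw one line per body style over the years.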

Has technology stagnated?

No. It is likely that manufacturers’ priorities have just been elsewhere. Looking at MPG plotted against horsepower and colored by model year, we see something enlightening.

As expected, there is something of a trade-off between fuel efficiency and power. But the coloring by date reveals a drift to the right, as city fuel economy remains the same but horsepower steadily increases. It appears that even though engines are yielding the same net fuel efficiency, they are dramatically more powerful. Let’s see another visualization of this.

Voila, steadily increasing horsepower across the board (minus full size vans, interestingly… Explanation left as an exercise for the reader)

What kind of efficiency is modern tech capable of?

To attempt to answer this, I am going to employ a bit of machine learning. My goal here is to build a model that can take in features like horsepower, weight, price, etc., and make an educated guess at the MPG we would expect from such a car. That way, I can feed it my ideal ‘green’ car (a lightweight 4 door sedan with a modern engine and stripped down horsepower, torque, and displacement) and see if it predicts a significant boost in net fuel efficiency.

First, a sanity check. This is a correlation matrix, showing how each numeric variable trends with every other numeric variable. What we see here is encouraging: horsepower, torque, displacement, and weight all have an inverse relationship with fuel economy. That is what I intuitively expected to see. Interestingly, MSRP also trends negatively with MPG, meaning more expensive sedans are actually less fuel efficient, on average.
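A correlation matrix like the one described can be computed in one call. The sketch below uses synthetic data that was deliberately generated with negative relationships, so it illustrates the mechanics only and says nothing about the real dataset; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data, built so that MPG falls as horsepower and weight rise.
rng = np.random.default_rng(0)
n = 200
horsepower = rng.uniform(100, 400, n)
curb_weight = 2000 + 5 * horsepower + rng.normal(0, 200, n)
mpg_city = 45 - 0.05 * horsepower - 0.003 * curb_weight + rng.normal(0, 1.5, n)

cars = pd.DataFrame({
    "horsepower": horsepower,
    "curb_weight": curb_weight,
    "mpg_city": mpg_city,
})

# Pairwise Pearson correlations between all numeric columns.
corr = cars.corr()
print(corr.round(2))
```

Each cell ranges from -1 to 1; the negative entries in the `mpg_city` row are the "inverse relationship" the sanity check looks for.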

Next, I build a supervised learning model. I am seeking to predict a continuous variable (fuel economy), so I will be using a regression model. I tested a number of model types for this task, and I ended up settling on a Random Forest Regressor — a type of model which performs well on feature-rich data. The functional details warrant their own discussion, and you can learn more here. After a good deal of data transformations, we train the model and end up with this:

more behind the scenes

The bottom line is… the bottom line. The r² value, also known as the coefficient of determination, measures the share of variation in the dependent variable that a regression model explains. In this case, it means my model could explain 99% of the variation in MPG in the training set, and 98% of the variation in the testing set. A lot goes into how training and testing works, and how models can be tweaked to perform better on data they have not been trained on (testing data), but these results are positive enough to continue.
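To make the train/test workflow concrete, here is a minimal sketch using scikit-learn's `RandomForestRegressor` on synthetic data. The features, coefficients, split ratio, and hyperparameters are all assumptions for illustration; the actual preprocessing and settings behind the results above live in the linked repo and are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (all numbers are made up).
rng = np.random.default_rng(42)
n = 1000
displacement = rng.uniform(1.0, 6.0, n)
horsepower = 60 * displacement + rng.normal(0, 20, n)
weight = rng.uniform(2200, 5500, n)
X = np.column_stack([displacement, horsepower, weight])
y = 50 - 4 * displacement - 0.02 * horsepower - 0.003 * weight + rng.normal(0, 1, n)

# Hold out 25% of the rows so the model is scored on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred_test = model.predict(X_test)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, pred_test)

# r2 by hand: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_test - pred_test) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"train r2 = {r2_train:.3f}, test r2 = {r2_test:.3f}")
```

A large gap between the training and testing scores would suggest overfitting; scores that sit close together, as reported above, are a reassuring sign.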

What did the model decide is important?

One of the nice things about the random forest algorithm is we can easily reverse engineer it a bit to see what the model determined as an important feature for estimating the dependent variable. Here I’ve plotted the top 10 out of what ended up being some 70 feature variables, along with their weight in determining variation in the associated MPG. Displacement ended up being by far the strongest determining factor — it would be interesting to see how displacement has varied over time, but for now I’m going to continue to see what kind of predictions I can make with this model based on my own custom car.
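Pulling importances out of a fitted random forest takes only a line or two in scikit-learn. The sketch below uses synthetic data in which displacement was made the dominant driver on purpose, so the ranking it prints is baked in and does not confirm the result above; feature names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; msrp is deliberately unrelated to the target here.
rng = np.random.default_rng(1)
n = 800
X = pd.DataFrame({
    "displacement": rng.uniform(1.0, 6.0, n),
    "curb_weight": rng.uniform(2200, 5500, n),
    "msrp": rng.uniform(15000, 90000, n),
})
y = 50 - 5 * X["displacement"] - 0.002 * X["curb_weight"] + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(10))
```

These scores describe how much each feature helped the trees split the training data, not a causal effect, and correlated features (such as displacement and horsepower) can share or steal importance from one another.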

Prediction

Here are the median values for sedans in the past two years:

And here are my model’s predictions for a very lightweight car a little below mean MSRP, with low torque and HP, front wheel drive, Gasoline Direct Injection and a 1.25L engine.

It predicts 42.91 MPG City, almost double the average. Pretty great, but I have a sneaking suspicion it would be a huge challenge to build a car that light at that price point and make it safe. Not to mention, would it sell?
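The "median car versus custom car" comparison can be sketched as follows. Everything here is hypothetical: the synthetic training data, the three feature names, and the specs chosen for `green_car` are assumptions for illustration, and the printed numbers will not match the 42.91 MPG figure above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data (made-up relationship between specs and MPG).
rng = np.random.default_rng(2)
n = 800
X = pd.DataFrame({
    "displacement": rng.uniform(1.0, 6.0, n),
    "horsepower": rng.uniform(70, 500, n),
    "curb_weight": rng.uniform(2000, 5500, n),
})
y = (50 - 4 * X["displacement"] - 0.02 * X["horsepower"]
     - 0.003 * X["curb_weight"] + rng.normal(0, 1, n))
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Baseline: a one-row frame holding the median of every feature.
median_car = X.median().to_frame().T

# Hypothetical 'green' car: start from the median and override a few specs.
green_car = median_car.copy()
green_car["displacement"] = 1.25
green_car["horsepower"] = 90.0
green_car["curb_weight"] = 2200.0

median_mpg = model.predict(median_car)[0]
green_mpg = model.predict(green_car)[0]
print(f"median car: {median_mpg:.1f} MPG, green car: {green_mpg:.1f} MPG")
```

One caveat worth keeping in mind: a random forest averages target values it saw during training, so it cannot predict an MPG higher than the best cars in its training set. A custom car far outside the training data gets a conservative guess rather than a true extrapolation.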

Fin

So what did we learn? I was pleased to see many of my suspicions about fuel efficiency factors verified, but also surprised by some of what I found in this data. Even though our cars are getting significantly more advanced, we have stayed at roughly the same fuel efficiency for decades, and expensive 4-door sedans actually tend to perform worse than modestly priced ones. I found still more curiosities that warrant their own investigations, and I have no doubt there are countless insights to be had. I had a lot of fun writing up this analysis, and I encourage you, the reader, to stay curious about what’s on the road, now and in the future!

By the way, if you’re curious, here are the most fuel efficient models in the dataset. Way to go, Honda and Toyota!

github repo here: https://github.com/jfreds91/DSND_t2_p1_cardata
