It’s Difficult to Make Predictions, Especially About the Future.

-Unknown

Hello again, I’m writing about my most ambitious project to date.  I have had two blog posts where I used regression to see what factors affected fantasy football points for QBs.  Then the lightbulb went on.  Regression can also be used to make forecasts, so why not try to actually forecast fantasy football points for quarterbacks?  And thus my latest project…

The idea is very simple.  A regression forecast is made by multiplying each coefficient by the value of its associated independent variable, summing the results, and then adding the “y-intercept” at the end.  For categorical variables such as “home/away”, one category is coded as a 1 and the other as a 0.  For example, if “home” is coded as “1” and a QB plays at home, the regression equation adds the coefficient (i.e. coefficient*1); if he plays away, the coefficient isn’t added (i.e. coefficient*0).  For more information on dummy variables please see my previous blog post, part 3.  The following are the independent variables I selected:

Defense’s Pass Yards/Game

Defense’s Run Yards/Game

Temperature (F)

Wind (MPH)

Home/Away

Dome/Open (Stadium Type)

Inclement/Clear (Weather)

I mostly chose the same variables I have already used in previous blog posts (part 2 and part 3), with two notable exceptions.  I didn’t include the same team’s defense because it was unclear to me how much predictive power it had.  As I previously mentioned, it may be a result rather than a predictor of fantasy points.  I also didn’t include defense’s passer rating because of a phenomenon known as multicollinearity:

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if present in the data the statistical inferences made about the data may not be reliable.

Here, I was concerned that there was a strong intercorrelation between defense’s passing yards and passer rating as one is included in the other.  I chose the defense’s passing yards since it’s been consistently more significant.

For temperature and wind speed I had to input some educated guesses for a few rows of data.  For every game played in a dome, I put 0 mph for wind (for obvious reasons) and 71F for temperature, as it’s the average temperature of Super Bowls played in domes.  There were a few games where my data didn’t have a concrete temperature and wind speed.  Most of those were played in London, and I made the executive decision to delete them since not only would the weather be in question, those games don’t really have a home field.  The one remaining game was played at Mile High stadium, so I looked up the temperature and wind for that day and time and input that information.
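As a rough sketch, the cleaning rules above could look like this in Python.  The field names (`location`, `stadium`, `temp_f`, `wind_mph`) and the sample rows are made up for illustration; only the 71F and 0 mph dome values and the decision to drop London games come from the text.

```python
DOME_TEMP_F = 71.0    # average temperature of Super Bowls played in domes (from the text)
DOME_WIND_MPH = 0.0   # no wind indoors


def clean_games(games):
    """Drop London games and fill in weather for dome games.

    `games` is a list of dicts with hypothetical keys:
    'location', 'stadium' ('Dome' or 'Open'), 'temp_f', 'wind_mph'.
    """
    cleaned = []
    for game in games:
        if game["location"] == "London":
            continue  # no real home field, and the weather is in question
        game = dict(game)  # copy so the caller's data is left untouched
        if game["stadium"] == "Dome":
            game["temp_f"] = DOME_TEMP_F
            game["wind_mph"] = DOME_WIND_MPH
        cleaned.append(game)
    return cleaned


# Made-up rows, purely to exercise the function.
sample_games = [
    {"location": "New Orleans", "stadium": "Dome", "temp_f": None, "wind_mph": None},
    {"location": "London", "stadium": "Open", "temp_f": None, "wind_mph": None},
    {"location": "Green Bay", "stadium": "Open", "temp_f": 28.0, "wind_mph": 12.0},
]
cleaned_games = clean_games(sample_games)
print(cleaned_games)
```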

Another note on the data: I used 2013-2016 data to build the forecast.  The reason is that I wanted the opportunity to then test it on 2017-2018 data.  As a result, the data is missing a few high-profile quarterbacks who entered the league after 2016, such as Patrick Mahomes and Deshaun Watson.

I used all standard games again for the data (reminder, I chose games where a QB had >10 attempts as standard games).

The following is the regression output:

Fantasy Football Forecast Regression

Therefore, rounding the coefficients to three decimal places, the equation is as follows:

0.081*Def Pass Yards/Gm + 0.043*Def Rush Yards/Gm - 0.003*Temperature - 0.052*Wind + 1.284*Home + 1.121*Dome - 1.242*Inclement - 6.027
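For anyone who prefers code to spreadsheets, here is a minimal Python sketch of that equation.  The coefficients and intercept are copied from above; the dictionary keys and the example game are hypothetical, and I am assuming the dummies are coded 1 for home, dome and inclement (0 otherwise), as the variable names suggest.

```python
# Coefficients rounded to three decimals, copied from the equation above.
COEFFICIENTS = {
    "def_pass_yds_per_game": 0.081,
    "def_rush_yds_per_game": 0.043,
    "temperature_f": -0.003,
    "wind_mph": -0.052,
    "home": 1.284,        # dummy: 1 = home, 0 = away (assumed coding)
    "dome": 1.121,        # dummy: 1 = dome, 0 = open (assumed coding)
    "inclement": -1.242,  # dummy: 1 = inclement, 0 = clear (assumed coding)
}
INTERCEPT = -6.027


def expected_points(game):
    """Multiply each coefficient by its variable, sum, and add the intercept."""
    return INTERCEPT + sum(COEFFICIENTS[name] * game[name] for name in COEFFICIENTS)


# Made-up game conditions, purely for illustration.
example_game = {
    "def_pass_yds_per_game": 250,
    "def_rush_yds_per_game": 110,
    "temperature_f": 60,
    "wind_mph": 8,
    "home": 1,
    "dome": 0,
    "inclement": 0,
}
print(round(expected_points(example_game), 2))  # roughly 19.64
```

Flipping `home` from 1 to 0 in the example lowers the forecast by exactly the home coefficient (1.284), which is all a dummy variable does.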

Next, I input the regression into my spreadsheet for each quarterback.  I took out all of the QBs that played fewer than 7 games in the 4 years (I chose 7 because it was the fewest games played by a legitimate starting quarterback, Jared Goff).  I also took out outliers of fantasy football points for the remaining QBs from 2013-2016.  I did this using basic statistics.  Looking at a bell curve, 95% of the data falls within ~1.96 standard deviations of the mean in either direction.

The 68-95-99 Statistical Distribution

That funny looking letter in the middle of the X-axis is the Greek letter pronounced “myoo”.  It just means mean or average.  The letters to the right and left are the Greek letter “sigma”, which stands for standard deviation.  In layman’s terms, standard deviation measures the dispersion of a sample or population.  The example that was used at my alma mater was that in our class we had relatively similar net worths.  However, if the school’s namesake David Tepper (who is worth ~$10 billion) enters the room, the average would go up enormously.  Say there are 20 of us in the classroom; all of a sudden the average will be about $500 million.  However, you can find a group of 20 people in an exclusive country club who are worth $500 million on average.  These two data sets are not remotely the same.  So we need to measure the dispersion.  In my grad school’s classroom with David Tepper, the standard deviation would be enormous.  In the country club, where the range of net worths may be just between $400 million and $600 million, the standard deviation would be relatively small.  For more information on standard deviation please see:  Standard Deviation
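The classroom versus country club example is easy to check with Python’s built-in `statistics` module.  The individual net worths below are assumptions made only for illustration: 19 classmates at $100,000 each plus one person at $10 billion, and a country club split evenly between $400 million and $600 million.

```python
from statistics import mean, pstdev

# 19 classmates with a hypothetical $100,000 each, plus one ~$10 billion guest.
classroom = [100_000] * 19 + [10_000_000_000]

# 20 hypothetical club members, half worth $400 million and half $600 million.
country_club = [400_000_000, 600_000_000] * 10

for name, group in [("classroom", classroom), ("country club", country_club)]:
    # pstdev is the population standard deviation of the group.
    print(f"{name}: mean = ${mean(group):,.0f}, std dev = ${pstdev(group):,.0f}")
```

Both groups average roughly $500 million, but the classroom’s standard deviation is about $2.2 billion while the country club’s is $100 million, which is the difference in dispersion that the average alone hides.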

Going back to the bell curve, you can see that 95.4% of the data falls within 2 standard deviations of the mean, and 95% falls within ~1.96 standard deviations.  That is the usual threshold we use.  You may remember that we use a p-value threshold of 0.05, and that’s the complement of that 95% (1 - 0.95 = 0.05).

The outliers that I got rid of were the games where a quarterback’s fantasy points were more than 1.96 standard deviations above or below the mean.
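Here is a small Python sketch of that filter.  Two things are assumptions on my part: that the mean and standard deviation are computed over one QB’s own games, and that the sample standard deviation is used.  The scores are made up for illustration.

```python
from statistics import mean, stdev

Z_CUTOFF = 1.96  # ~95% of a bell curve falls within this many standard deviations


def drop_outliers(points, z=Z_CUTOFF):
    """Keep only the scores within z standard deviations of the mean."""
    mu = mean(points)
    sigma = stdev(points)  # sample standard deviation (an assumption)
    return [p for p in points if abs(p - mu) <= z * sigma]


# Made-up fantasy scores for one QB, with one monster game.
scores = [18, 22, 15, 25, 20, 19, 21, 17, 23, 55]
filtered_scores = drop_outliers(scores)
print(filtered_scores)  # the 55-point game is dropped, the other nine stay
```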

I was left with the most realistic fantasy points for the QBs.  For each QB, I took the average of these filtered fantasy points and compared it to the average score that a typical QB would be expected to get, according to the model, given the conditions of each game.  Then I divided the QB’s actual average by that expected average to get an index.  For example, Aaron Rodgers averaged ~25.24 fantasy points per game.  However, putting the conditions for each of his games into the model, getting a forecast for each game and taking the average produces an expected value of ~17.88.  This gave him an index of ~1.41 (25.24/17.88).  On average he overproduces the score expected from the game conditions by 41%.

In essence that’s the model.  Put the variables into the regression model and multiply the result by the QB’s index.
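The index step can be sketched in a few lines of Python.  The Aaron Rodgers averages come from the text above; the 20.0 “model output” at the end is a hypothetical number used only to show how the index is applied.

```python
def qb_index(actual_avg, expected_avg):
    """Ratio of a QB's actual average to the model's average for his games."""
    return actual_avg / expected_avg


def qb_forecast(model_points, index):
    """Scale the typical-QB regression forecast by the QB's index."""
    return model_points * index


rodgers_index = qb_index(25.24, 17.88)
print(round(rodgers_index, 2))  # 1.41

# Hypothetical: the regression says a typical QB scores 20.0 in this game.
rodgers_forecast = qb_forecast(20.0, rodgers_index)
print(round(rodgers_forecast, 2))
```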

The following is my model:  Fantasy Football Forecast Model

Interesting note:  A friend of mine suggested making the index the average fantasy points of a QB vs. the average fantasy points for all QBs.  After controlling for outliers and attempt volume I compared the two indexes for Aaron Rodgers and the result was nearly identical, with a ~0.3% difference.  For Alex Smith, it was ~0.6%.  This is an interesting finding, and I believe it signals that regression truly is a type of average in itself: in this case, how the average quarterback performs given the conditions outlined.

To test the model I used the test data from 2017 and 2018.  I took the average of the actual fantasy points for each QB and compared it to the average forecast from my model.  I took out outliers from this average as well.  The absolute value of the percentage difference between the forecast and the actual is what I call the variance (not to be confused with the statistical variance).  The reason I used the absolute value is that otherwise the average of these would be nonsensical.  One QB could have a +50% variance and another a -50% variance, and on average my model would look perfect.  In addition, I really don’t care in what direction my model is off.
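A quick Python sketch shows why the absolute value matters.  I am assuming the percentage difference is measured relative to the actual average, and the forecast and actual numbers are made up.

```python
def pct_diff(forecast, actual):
    """Signed percentage difference, relative to the actual value (assumed)."""
    return (forecast - actual) / actual


def mean_abs_pct_diff(pairs):
    """Average of the absolute percentage differences over (forecast, actual) pairs."""
    return sum(abs(pct_diff(f, a)) for f, a in pairs) / len(pairs)


# Two made-up QBs: one forecast 50% too high, the other 50% too low.
pairs = [(15.0, 10.0), (5.0, 10.0)]

signed_average = sum(pct_diff(f, a) for f, a in pairs) / len(pairs)
print(signed_average)            # 0.0, which makes the model look perfect
print(mean_abs_pct_diff(pairs))  # 0.5, i.e. off by 50% on average
```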

As a point of reference, I compared my model to what a layman might do.  A less sophisticated layman may just go back X number of years and average all of the fantasy points for a QB.  However, a more sophisticated layman may do what I did and filter out low-attempt games.  Therefore, I compared my model to both of these scenarios by looking at the average of fantasy football points scored by each QB in 2013-2016 (same data range as my model).

Finally, I tested my model against averaging for only the top 20 QBs from 2016.  (It was really 19 QBs because Colin Kaepernick wasn’t in the league in 2017 and 2018.)  There are two reasons why it was worth looking at the top 20 QBs.  The first is relevance: you have only so many draft picks, so you won’t necessarily draft the 50th best QB in the NFL.  The second is that these QBs tend to be more predictable and thus the variance would be lower.  The results are as follows:

Variance Summary

As you can see, for all QBs my model performs better than the more sophisticated layman’s average that takes out low-attempt games but worse than the less sophisticated layman’s average.  For the top 20 QBs, my model performs the worst.  Surprisingly, the average that includes all games is more accurate than the average that takes out low-attempt games in both instances.  However, it’s really splitting hairs.  All three methods fall within a range of 2% of each other for all QBs and 1% for the top 20 QBs.

In the end, what does this mean?

The NFL Network’s numbers guru Cynthia Frelund once said about her 10 point margin of victory prediction for a particular game, “that’s a higher margin of victory than my model usually likes”.  It’s unlikely anyone would consider 10 points a blowout.  However, her models are probably built off averages of one type or another, and averages tend to even out.  Taking a simple average of 2013-2016 fantasy points data is not too far off from doing a regression and then indexing based on average fantasy points in 2013-2016.  Both are averages from the same data set.  In fact, I tried to get the model data as close to average data as possible.  So in that sense, I succeeded.  If you think about it, anyone modeling fantasy points would use the fantasy points that were actually scored in the past as the standard.  Therefore these models would all probably look like average actual data from years past.  One way to combat that is adjusting for factors such as the team a quarterback plays for in the season being predicted.  Easier said than done.  (Also please note, this is in the aggregate.  While testing my model I noticed that, quite often, for an individual QB playing in one specific set of circumstances the model’s forecast differs pretty significantly from the simple average.  For example, Cam Newton playing the Buccaneers in week 9 of 2018 has 21.92 expected fantasy points based off 2013-2016 averages but 27.98 based off my model.)  Finally, there’s a caveat: this model is more for fun.  Individual fantasy points per week have too much variance and are subject to too much noise to predict accurately for a given QB in any given week.

So, in the end, I won’t be the next fantasy football millionaire.  However, this blog is for fun and I hope you had fun reading it.  Besides that, perhaps the interesting tidbit here is that you can just look at past years’ data and get as good a forecast as, or a better one than, you would by building a comprehensive model or maybe even listening to so-called experts.

Sources:

2016 Top 20 Fantasy QBs

Denver Weather 11-17-13

Super Bowl Dome Temperatures

QB Stats

Defensive Stats

Weather Information

 

 

 
