The shock UK election result capped a miserable year for pollsters, following botched predictions for the EU referendum and US Presidential election, but one of them correctly forecast a hung parliament where almost all others failed.
As LBC and Newsnight presenter James O’Brien put it: "The only real winner so far is YouGov."
The YouGov prediction was the result of a new statistical model called Multilevel Regression and Post-stratification (MRP) that was developed to produce estimates for small geographies, such as constituencies. It led the firm to correctly forecast the winner in 93 percent of seats, despite relying on an average sample size in each constituency of just 75 people.
The model was primarily developed by Professor Ben Lauderdale of the London School of Economics and YouGov's data science team, headed by Doug Rivers of Stanford University, who told Computerworld UK how it works.
"A poll of 75 people can easily be off by ten plus points," says Rivers. "The trick was we knew how many people voted Conservative and Labour and SNP in 2015, and we know how many voted to leave or remain in 2016. Those two things when you add them to the demographics are much more powerful predictors."
Previous voting behaviour is added to demographic information to reinforce the small sample sizes that typically leave a lot of room for error when predicting each constituency.
The model thereby compensates for sparse data and low response rates, allowing it to predict accurately which seats would swing.
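Rivers' "ten plus points" figure can be checked with the textbook margin-of-error formula for a simple random sample. The sketch below is illustrative only: the 40 percent vote share and the 95 percent confidence level are assumptions, not figures from YouGov.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(share * (1 - share) / sample_size)

# Assumed example: a party polling at 40 percent.
moe_constituency = margin_of_error(0.40, 75)      # one constituency, 75 respondents
moe_national = margin_of_error(0.40, 50_000)      # a national sample of 50,000

print(f"75 respondents:     +/- {moe_constituency * 100:.1f} points")
print(f"50,000 respondents: +/- {moe_national * 100:.1f} points")
```

With only 75 respondents the uncertainty on a single party's share is roughly eleven points either way, which is why the raw constituency samples cannot be used on their own.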
The YouGov MRP model
YouGov used poll data from the preceding seven days to relate the variables on respondent profiles to their current voting intentions. These variables include their constituency, demographics, past voter behaviour and interview date. The model then estimates the probability of each type of voter voting for a specific political party.
The Office for National Statistics (ONS) Annual Population Survey, the British Election Study, and the 2015 general election and EU referendum votes are then used to estimate how many of each voter type there are in every constituency. Combining the two, YouGov can then predict how many votes each party will receive in each constituency.
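The post-stratification step described above amounts to a weighted average: each voter type's predicted party support is weighted by how many people of that type live in the constituency. The following is a minimal sketch of that idea, not YouGov's code. The voter types, probabilities and head counts are invented for illustration.

```python
def poststratify(type_probs, type_counts):
    """Weight each voter type's party probabilities by its head count in a constituency."""
    total = sum(type_counts.values())
    parties = sorted({party for probs in type_probs.values() for party in probs})
    return {
        party: sum(
            type_counts[voter_type] * type_probs[voter_type].get(party, 0.0)
            for voter_type in type_counts
        ) / total
        for party in parties
    }

# Hypothetical model output: probability that each voter type backs each party.
type_probs = {
    "young_remain": {"Lab": 0.70, "Con": 0.30},
    "older_leave": {"Lab": 0.25, "Con": 0.75},
    "older_remain": {"Lab": 0.50, "Con": 0.50},
}

# Hypothetical census-style counts of each voter type in one constituency.
type_counts = {"young_remain": 20_000, "older_leave": 30_000, "older_remain": 10_000}

shares = poststratify(type_probs, type_counts)
print(shares)
```

A constituency with more students and remain voters would carry larger counts for the first type, pulling the estimated Labour share up even if few people there were actually interviewed.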
The model further compensates for the small number of interviews conducted in each electoral area by pooling data from respondents in other constituencies to augment the sample size and increase its accuracy. This works because voter profiles remain a fairly accurate predictor regardless of where they live.
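One simplified way to picture this pooling is as a weighted blend: the fewer local interviews there are, the more the estimate leans on what similar voters elsewhere are saying. The sketch below uses a basic shrinkage formula as a stand-in. It is not the multilevel regression itself, and the `prior_strength` parameter and all numbers are assumptions chosen for illustration.

```python
def partially_pooled(local_share, n_local, pooled_share, prior_strength=200):
    """Blend a noisy local estimate with a pooled one, weighting by local sample size."""
    weight = n_local / (n_local + prior_strength)
    return weight * local_share + (1 - weight) * pooled_share

# Hypothetical constituency: 75 interviews show 52 percent support for a party,
# while similar voters across the country suggest 40 percent.
estimate = partially_pooled(local_share=0.52, n_local=75, pooled_share=0.40)
print(f"Blended estimate: {estimate:.3f}")
```

The blended figure sits between the two inputs and moves towards the local number as more local interviews become available.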
The data is sent from YouGov's survey system to its in-house Crunch analytics database. The sample is then processed with Stan, a piece of open source probabilistic software developed by a team led by Columbia University statistician Andrew Gelman. It uses an algorithm known as Hamiltonian Monte Carlo to estimate the model.
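For readers curious about what Hamiltonian Monte Carlo does, here is a toy version in plain Python. It is a sketch only: Stan uses a far more sophisticated adaptive variant, and the target distribution here (a bell curve for one party's vote share, centred on 40 percent) is an assumption picked to keep the example small.

```python
import math
import random

# Hypothetical target: a normal posterior for a vote share (illustrative numbers).
MU, SIGMA = 0.40, 0.05

def log_prob(q):
    return -0.5 * ((q - MU) / SIGMA) ** 2

def grad_log_prob(q):
    return -(q - MU) / SIGMA ** 2

def hmc(start, n_samples=2000, step=0.01, n_leapfrog=10, seed=1):
    """Basic Hamiltonian Monte Carlo with a leapfrog integrator and unit mass."""
    rng = random.Random(seed)
    q = start
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                      # draw a fresh random momentum
        q_new, p_new = q, p
        p_new += 0.5 * step * grad_log_prob(q_new)   # half step for momentum
        for i in range(n_leapfrog):
            q_new += step * p_new                    # full step for position
            if i < n_leapfrog - 1:
                p_new += step * grad_log_prob(q_new) # full step for momentum
        p_new += 0.5 * step * grad_log_prob(q_new)   # final half step for momentum
        # Accept or reject based on the change in total energy.
        h_old = -log_prob(q) + 0.5 * p * p
        h_new = -log_prob(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples

draws = hmc(start=0.5)
posterior_mean = sum(draws) / len(draws)
print(f"Estimated mean vote share: {posterior_mean:.3f}")
```

The sampler uses the slope of the probability surface to propose long, informed jumps rather than small random steps, which is what makes the approach practical for models with many parameters.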
Against the odds?
The loss of Canterbury to Labour after 99 years in Tory hands was a shock to most, but YouGov saw it coming. Its prediction was primarily based on Canterbury having a large presence of remain voters and students.
"That was just what the data says," says Rivers. "The thing that you have to understand about this was it was taking eight hours on a forty core system at AWS to estimate the model, so we weren't going through and adjusting predictions at any place or doing anything special. It was the overall model. That was a constituency where the remain vote was a helpful predictor of what was going to happen this year."
The other pollsters likely overcompensated for their failure to predict a Tory majority in 2015, manipulating their data to fit their beliefs.
Ipsos Mori, for example, made a last-minute turnout adjustment that transformed its neck-and-neck prediction into a comfortable Conservative win.
"They just believed in their heart it was going to come out this way and they tortured their data to make that happen," says Rivers. "So I think one of the lessons is just to listen to your data."
Hits and misses
YouGov didn't get everything right. There were some surprises in Scotland, where it overrated the SNP's chances. The Conservatives' lead in vote share was also smaller than expected, although the party still won slightly more seats than the model predicted.
"The Conservatives won, in the end, a lot of close races," says Rivers. "The Amber Rudd seat was one where we actually predicted she would lose by a small amount.
"She won by a small amount, so we felt the quality of that prediction was pretty good. When it popped up people raised their eyebrows and said you're not really predicting the home secretary's going to lose. The test is does it tell you the race is close when you have a close race. You're going to be lucky in some places like we were in Canterbury and slightly unlucky in the Amber Rudd seat."
YouGov's MRP model had previously predicted the EU referendum result correctly, but it narrowly failed to predict Donald Trump's victory in the 2016 US election. YouGov has less experience operating in the US market, and covering a less densely populated country presented different challenges.
They correctly predicted that Hillary Clinton would narrowly win the popular vote, but were mistaken in forecasting that she would also edge the Electoral College. This was largely because the key Midwestern battleground states were too close to call, and a higher turnout of Trump supporters shifted some of them his way.
The methodology will next be applied in Germany for the first time at the federal elections in September. This will present a different challenge as the national number is more important. YouGov is also working to expand sample sizes and add more information on its respondents.
"We're moving to bigger data in terms of more people in our samples," says Rivers. "If we could go from 50,000 to 100,000 or 200,000 that would improve the quality of the predictions. It tracks with the software in terms of the time it takes just to model.
"The other thing we're doing is we're constantly adding more variables or data points. We collect thousands of pieces of information out of our panelists, and figuring out essentially how to utilise that more effectively to improve the quality of predictions is an area of research."
YouGov also ran a second poll for the 2017 election that used a traditional methodology and predicted a Conservative majority. The result predicted by the MRP model was dismissed by most upon its release, but the new method seems likely to replace its predecessor in the future.
"I think in a decade we're going to look at that as sort of old-fashioned and this kind of approach is going to be used as a matter of course," says Rivers. "It's incremental. You're not throwing away what was done before, you're adding to it."