Following its massive £7 billion merger last year, bookmaker Paddy Power Betfair has consolidated a lot of its monitoring, from security to IT operations, around the enterprise version of Splunk, moving away from less sophisticated options.
Speaking to Computerworld UK during Splunk's .conf2017 event in Washington D.C. this week, site reliability engineer (SRE) manager at Paddy Power Betfair, David Butler, explained why Splunk came out on top for log analytics, security and fraud monitoring, over some of its open source rivals.
At the time of the merger between the Anglo-Irish bookmakers, the two companies were using three tools for monitoring their IT systems: Splunk, ElkMonitor and Sumo Logic.
How Splunk won out
Each had its pros and cons. As Butler explained: "We had Splunk Enterprise in 2010 at Paddy Power then moved to ElkMonitor. Betfair moved off Splunk and on to Sumo Logic, so we had all three when we merged."
In terms of cost, Butler said that ElkMonitor being mostly open source meant the capital expenditure (CapEx) was perfectly reasonable, but because it's your own installation, the operational expenditure (OpEx) can get pretty high if you run at Paddy Power Betfair's scale - 26 billion time series data points a day, to be exact. The main thing with ElkMonitor though was that "the end users hated it", so that was out.
Sumo Logic, by comparison, came in cheaper on CapEx. However, the company charges for compute, so the OpEx became "a cost we couldn't control", according to Butler. When it came to the end users, they "liked it, but Sumo to me is like a cheap copy of Splunk", Butler said.
The company ran a proof of concept (PoC), and Splunk Cloud came out on top for time series data, so the organisation decided to consolidate around that software.
Now all technology teams - infrastructure, database, IT security, development, IT operations, fraud and prodops - use Splunk Cloud in some capacity.
Combatting fraud with Splunk
The fraud team at Paddy Power Betfair is a particularly big fan of the tool. "They now have access to data across the Paddy Power brand and the Betfair brand, giving them a lot more data to see things they wouldn’t have been able to see before and correlate things they wouldn't have been able to see before, which makes them far more effective," Butler said.
Butler said that what used to take the team 40 minutes in Sumo Logic, in terms of pulling a report and acting on it, is down to 90 seconds or less with Splunk, "so the speed that it works is far better, and that's with a much larger data set than Sumo ever had to deal with. This gives them more confidence of what they are shutting down, so that team is a big fan of Splunk now."
Next, Butler is looking to assess more areas where Splunk could be used.
"I suppose where we are at the moment is we did the merger and have everyone on the single platform, so we have people at different levels, utilising the data in different ways, so the idea is to keep looking at areas where we aren't utilising the data." This could include increased usage from the security and application reliability teams.
In general, monitoring and tooling are provided across the organisation as a service. Butler's team doesn't enforce tools on any teams across the organisation, so it doesn't set up and distribute dashboards for people, but it does look to offer best practice and support around certain tools.
So for security and IT operations the ideal stack looks roughly like this: Sensu at the top for real-time monitoring, Prometheus for time-series alerting, OpenTSDB as a metric store and Splunk for log analytics.
"What's interesting with those technologies is that they all overlap in one area or another," Butler explained. "That's in line with our monitoring strategy as we want them to overlap, as that means we aren't beholden to any one technology."
Butler's remit post-merger was "to make everything easier, do it for less money and remove all of the technical debt", and he initially thought a single platform approach would achieve this. However, he soon found that there was no magic tool to rule them all because "everyone had their own specific use cases".
So in the end he decided that the important thing was to harmonise the language everyone was speaking around a core set of tools.
While Butler would like everyone to use Sensu "because we want all product ops to have a single place to look at alerts and centralise that", a team that prefers to consume analytics directly through Splunk dashboards can do that instead.
"Rather than trying to restrict people we try to accommodate the most use cases that we can and offer it as a stack," he added.