Audi has been adopting a range of open source technologies to build a big data foundation for collecting growing data volumes from its latest luxury car models, as well as from the machinery in its production facilities.
Speaking to a packed room at the Dataworks Summit in Berlin last week, Matthias Graunitz and Carsten Herbe, two big data architects at Audi, explained how they built the data backend for these new data sources, and shared some lessons learned along the way.
Open source stack
Audi is a big Hadoop user, storing data in Hadoop Distributed File System (HDFS) since 2015. Laying out the roadmap, Graunitz said: "So we started with a small cluster by the end of 2015, started with clients to investigate how to build and run this system and if they fulfilled the business requirements. So we started with a small Hortonworks Data Platform (HDP) cluster with four nodes, 96 cores and 160TB of raw capacity."
Today that has grown into a production Hadoop cluster with 1PB of storage capacity, 288 cores across 12 nodes and 6TB of RAM, as well as a production Kafka cluster with four nodes, 128GB of RAM and 16TB of raw capacity.
That doesn't mean the integration was seamless, though, as Herbe noted: "Introducing Hadoop into an existing enterprise environment has challenges: it has to integrate with business systems, and there are security requirements."
This data foundation then led to two proofs of concept: one for the screwdrivers used in its production facilities, and one for car usage data transmitted from control units.
Regarding the screwdrivers: "We wanted to see if the process within the production facilities was working well or if we could predict any problems in the maintenance process up front," Graunitz said.
As for the car data, each new Audi model now transmits around 25,000 signals over the air into an HDFS store, where they are analysed by staff at the company.
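Audi has not published its message format, but a pipeline like this typically serializes each control-unit reading as a keyed Kafka record before it lands in HDFS. The sketch below illustrates the idea; the `SignalReading` fields, the signal name and the VIN are all hypothetical, not Audi's actual schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shape of one control-unit signal reading; Audi's real
# message format and signal names are not public.
@dataclass
class SignalReading:
    vin: str      # vehicle identifier, used as the Kafka message key
    signal: str   # one of the ~25,000 transmitted signals
    value: float
    ts_ms: int    # capture timestamp, epoch milliseconds

def to_kafka_record(reading: SignalReading) -> tuple[bytes, bytes]:
    """Serialize a reading as a (key, value) pair for a Kafka producer.

    Keying by VIN keeps all of one car's signals in the same partition,
    preserving per-vehicle ordering downstream in HDFS.
    """
    key = reading.vin.encode("utf-8")
    value = json.dumps(asdict(reading)).encode("utf-8")
    return key, value

key, value = to_kafka_record(
    SignalReading(vin="WAUZZZ4H0JN000001", signal="battery_voltage",
                  value=12.6, ts_ms=1523260800000)
)
```

In a real deployment the pair would then be handed to a Kafka client, e.g. `producer.send("car-signals", key=key, value=value)`; the topic name here is likewise an assumption.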
Audi then layers BI tools such as Tableau on top to give business users access to this information, informing design decisions on future models.
"A lot of business departments want insights from that data, so we had to rethink data ownership and data governance to define a data domain model with clear responsibility in each domain," Graunitz explained.
"Providing each shared service in a multi-tenant environment means we need lifecycle management for each service and strong collaboration with the projects that use it. Of course we also need defined SLAs, like 24/7 availability, and different development lifecycles."
The technology was first trialled using anonymised data from a test fleet of vehicles, before being rolled out for the automaker's latest model: the A8.
Now, Graunitz admits that cloud is the "missing piece" at Audi, saying: "Everything we presented today is built on premise or in our private cloud."
"Due to data gravity and legal requirements we need to get closer to the cars and where the data occurs, so we set up a second initiative where this data is first landed in a cloud environment. To have a hybrid approach, we decided to use a direct cloud connect to ingest data into our on-premise world, because our business processes are on-premise in our headquarters."
To do this, Audi turned to an HDFS connector from third-party vendor Confluent to bring that data via Kafka pipelines into HDFS, with a metadata catalogue also built in the cloud so that data can be located regardless of which store it resides in.
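Confluent's HDFS sink runs on the Kafka Connect framework and is driven by a JSON configuration submitted to the Connect REST API. A minimal sketch of such a config is shown below; the connector class and property names are Confluent's documented ones, while the connector name, topic and namenode address are placeholder assumptions, not Audi's actual values.

```json
{
  "name": "car-signals-hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "4",
    "topics": "car-signals",
    "hdfs.url": "hdfs://namenode.example.internal:8020",
    "flush.size": "10000"
  }
}
```

`flush.size` controls how many Kafka records are buffered before a file is committed to HDFS, trading ingest latency against the number of small files created.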
What Audi's staff will do with this data remains to be seen, but future models could well be shaped by the usage data streaming from A8 control units this very second.