Russian search engine provider Yandex is releasing its first open source machine learning library, as it looks to help developers and data scientists create more accurate machine learning models by embracing a technique known as gradient boosting.
The library, called CatBoost, is designed to work "out-of-the-box" for data scientists to create predictive models that combine multiple data types and sources.
Misha Bilenko, the fast-talking head of machine intelligence and research at Yandex, told Computerworld UK that despite there being a wealth of good open source algorithms, "there are not as many libraries out there for gradient boosting".
CatBoost "improves data scientists’ ability to create predictive models using a variety of data sources, such as sensory, historical and transactional data", he wrote in a blog post. "While most competing gradient boosting algorithms need to convert data descriptors to numerical form, CatBoost’s ability to support categorical data directly saves businesses time while increasing accuracy and efficiency."
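To make the "convert data descriptors to numerical form" step concrete, here is a minimal sketch of one-hot encoding, a common way of turning a categorical column into numbers before training. It is only an illustration of the preprocessing Bilenko refers to, not a description of how CatBoost or any other library works internally. The `payment_methods` column and the `one_hot_encode` helper are hypothetical names invented for this example.

```python
def one_hot_encode(values):
    """Map each distinct category to its own 0/1 indicator column."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # switch on the column for this category
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical feature from a transactional data set.
payment_methods = ["card", "cash", "card", "transfer"]
categories, encoded = one_hot_encode(payment_methods)

for value, row in zip(payment_methods, encoded):
    print(value, row)
```

Every distinct category adds a column, so a feature with thousands of distinct values becomes thousands of mostly-zero columns. That is the kind of manual step a library with direct categorical support would spare the data scientist.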
Gradient boosting typically combines a range of decision trees to 'boost' the predictive model. It essentially lets developers analyse different forms of data under a single master model. Bilenko calls gradient boosting the "duct tape of machine learning. Very few jobs don’t use it". Ben Gorman has a good explainer over at the Kaggle blog.
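As a rough illustration of the idea, the sketch below boosts very small decision trees (single-split "stumps") on a toy regression problem with squared-error loss. Each new stump is fitted to the errors left by the stumps before it. This is a textbook-style simplification under stated assumptions, not CatBoost's algorithm, and every function name, the toy data, and the `n_trees` and `learning_rate` values are made up for the example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Pick the one-threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return mean((y - p) ** 2 for y, p in zip(ys, predictions))


# Toy data: a noisy, roughly linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.1]

model = fit_boosted(xs, ys)
baseline_mse = mean_squared_error(ys, [mean(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict_boosted(model, x) for x in xs])
print("baseline MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

No single stump is a good model, but the sum of many small corrections fits the training data far better than the starting guess. Production libraries build on the same principle with deeper trees, other loss functions and safeguards against overfitting.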
Gradient boosting is a popular technique for tasks that demand high accuracy, such as recommendations, predictive modelling, fraud detection and ranking. Bilenko wrote: "It is especially powerful in two ways: it yields state-of-the-art results without extensive data training typically required by other machine learning methods, and it provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems."
The Russian search giant uses machine learning across a number of its products, from core search and personalisation to speech recognition for its translation service, and to the routing engine and self-driving car technology at Yandex.taxi, its car-hailing service and Uber rival.
Externally, CatBoost is already being used by data scientists at the European Organization for Nuclear Research (CERN) to improve the accuracy of its algorithms.
CatBoost is designed to be enterprise-ready, integrating with popular deep learning tools like Google’s TensorFlow and programming languages like Python.
Just as TensorFlow grew out of DistBelief, CatBoost is a second generation library, improving upon internal algorithms developed at Yandex under the title of MatrixNet, which Bilenko calls "a crown jewel for Yandex".
"When we decided to build the next generation gradient boosting platform we thought it would be useful in open source", Bilenko said.
Bilenko added that Yandex wants to learn from the teething issues Google experienced when open sourcing TensorFlow. "TensorFlow had a bad community reaction to start and it took iterations, and we feel it will be a journey. We will take feedback and there will be more releases later in the year for integrations and scaling up," he said.
Models trained by CatBoost can also be used in production via Apple’s Core ML framework, so that apps can be built with CatBoost-trained models.