Microsoft has released a machine learning library for Apache Spark. The new library is designed to boost the productivity of data scientists who work with open source big data tools.
Apache Spark users have had to work with low-level APIs for tasks such as indexing strings and assembling feature vectors. Although Spark is powerful enough to build scalable machine learning models, the lack of a higher-level machine learning library increases the time taken to implement machine learning algorithms.
The new library, called MMLSpark, simplifies many of these tasks. It comes with integrated algorithms for building models in PySpark. Data scientists can use the library to boost productivity and focus on the data science aspect of machine learning. The library takes care of creating tokens for the strings, converting them into vectors, and assembling and indexing the vectors as well as the label columns.
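To give a sense of the plumbing involved, the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MMLSpark or Spark code: the helper names (tokenize, hash_tokens, index_labels, featurize), the sample rows and the 16-slot hashed vector are all hypothetical and chosen only for demonstration.

```python
# Pure-Python illustration of the featurization steps described above.
# None of these helpers are Spark or MMLSpark APIs; all names are hypothetical.
import zlib


def tokenize(text):
    """Split a string into lowercase word tokens."""
    return text.lower().split()


def hash_tokens(tokens, num_features=16):
    """Convert tokens into a fixed-length count vector via the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(token.encode("utf-8")) % num_features] += 1
    return vector


def index_labels(labels):
    """Map each distinct string label to an integer index, most frequent first."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def featurize(rows, text_field, numeric_fields, label_field):
    """Assemble one feature vector per row and index the label column."""
    label_index = index_labels([row[label_field] for row in rows])
    features = []
    for row in rows:
        text_vector = hash_tokens(tokenize(row[text_field]))
        numeric_vector = [float(row[name]) for name in numeric_fields]
        # "Assembling" here simply means concatenating the partial vectors.
        features.append(text_vector + numeric_vector)
    targets = [label_index[row[label_field]] for row in rows]
    return features, targets, label_index


# Made-up sample data, for illustration only.
rows = [
    {"review": "Great product works well", "stars": 5, "sentiment": "positive"},
    {"review": "Stopped working after a week", "stars": 1, "sentiment": "negative"},
    {"review": "Works as described", "stars": 4, "sentiment": "positive"},
]
features, targets, label_index = featurize(rows, "review", ["stars"], "sentiment")
```

In this sketch each row becomes a 17-element vector (16 hashed token counts plus the numeric stars value), and the string labels are mapped to integers. This is the kind of bookkeeping the library is described as handling on the user's behalf.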
“MMLSpark uses DataFrames as its core datatype and integrates with SparkML pipelines for composability and modularity. It is implemented as Python bindings over Scala APIs, ensuring native JVM performance,” Microsoft’s Senior Program Manager Roope Astala and Principal Software Engineering Manager Sudarshan Raghunathan explain in a blog post.
Pre-trained neural network on board
One of the most notable features of MMLSpark is its pre-trained neural network. The library can extract features from images and then pass them on to a traditional machine learning model. It can also train a deep neural network (DNN) model if the pre-trained model is unsuitable for a particular domain. According to Microsoft, data scientists can use the APIs in MMLSpark to rapidly build image analysis and computer vision pipelines.
The new tool also comes with OpenCV-based image transformations that can help you read in and prepare the data.
Microsoft has released MMLSpark as an open source project on GitHub. The Redmond company is also encouraging developers to contribute to the project as it moves deeper into the emerging world of machine learning.