Compilation of the Best 13 Frameworks for Machine Learning
Over the past few years, machine learning has gone mainstream on an unprecedented scale. The trend is fueled not only by cheap cloud environments and the availability of powerful graphics cards for such computations, but also by a wealth of machine learning frameworks. Almost all of them are open source, but more importantly, they are designed to abstract away the hardest parts of machine learning, making these technologies accessible to a broad class of developers. In this article, we present a selection of machine learning frameworks, both newly created and revised over the past year.
- Apache Spark is best known as part of the Hadoop family. But this in-memory processing framework originated outside Hadoop and continues to build a reputation beyond that ecosystem. Spark has become a go-to tool for machine learning thanks to its growing library of algorithms that can be applied quickly to in-memory data.
- Spark has not stood still: its algorithms are constantly being expanded and revised. Release 1.5 added many new algorithms, improved existing ones, and strengthened MLlib support in Python, a major platform for math and statistics users. Spark 1.6 made it possible, among other things, to suspend and resume Spark ML jobs by means of persistent pipelines.
- Deep learning frameworks handle the heavy tasks of machine learning, such as natural language processing and image recognition. Singa, an open source framework designed to make it easier to train deep learning models on large volumes of data, was recently accepted into the Apache Incubator.
- Singa provides a simple programming model for training networks across a cluster of machines and supports many standard types of training jobs: convolutional neural networks, restricted Boltzmann machines, and recurrent neural networks. Models can be trained synchronously (one after another) or asynchronously (side by side), depending on what suits the problem best. Singa also simplifies cluster setup with Apache ZooKeeper.
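To make the difference between synchronous and asynchronous training more concrete, here is a minimal single-process sketch in plain Python. It does not use Singa's API; the function names, the toy loss, and the way staleness is simulated are all assumptions made for illustration. Each "worker" owns one data shard, represented by a single target value.

```python
def gradient(w, target):
    """Gradient of the toy loss 0.5 * (w - target)**2 for one worker's shard."""
    return w - target


def train_sync(targets, w=0.0, lr=0.1, rounds=50):
    """Synchronous: every worker computes its gradient on the same w,
    and the averaged gradient is applied once per round."""
    for _ in range(rounds):
        grads = [gradient(w, t) for t in targets]
        w -= lr * sum(grads) / len(grads)
    return w


def train_async(targets, w=0.0, lr=0.1, rounds=50):
    """Asynchronous (simulated): each worker applies its gradient as soon as
    it is ready, computed on the copy of w it read after its previous update,
    which may already be stale because other workers have updated w since."""
    stale = [w] * len(targets)  # each worker's last-read copy of w
    for _ in range(rounds):
        for i, t in enumerate(targets):
            w -= lr * gradient(stale[i], t)  # update from a possibly stale copy
            stale[i] = w                     # worker re-reads the latest value
    return w
```

With `targets = [1.0, 3.0]`, both variants end up close to the mean of the shards, 2.0. The synchronous version moves toward it directly, while the asynchronous one keeps wobbling slightly around it because each update is based on slightly outdated parameters. That trade-off between coordination cost and freshness of parameters is the reason a framework may offer both modes.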
- Caffe is a deep learning framework made “with expression, speed, and modularity in mind.” It was originally created for machine vision projects, but has since evolved and is now used for other tasks, including speech recognition and multimedia.
- Caffe's main advantage is speed. The framework is written entirely in C++, supports CUDA, and can switch processing between the CPU and the GPU as needed. The package includes a set of free and open source reference models for standard classification tasks, and many more models have been created by the Caffe user community.
- Given the huge amounts of data and processing power that machine learning requires, the cloud is an ideal environment for ML applications. Microsoft has equipped Azure with its own pay-as-you-go machine learning service, Azure ML Studio. It is available in monthly and hourly versions, as well as a free tier. Among other things, the HowOldRobot project was built with this system.
- Azure ML Studio lets you create and train models, then turn them into APIs that other services can consume. Up to 10 GB of storage is available per user account, although you can also connect your own Azure storage. A wide range of algorithms is available, created by Microsoft and by third parties. You do not even need an account to try the service: just log in anonymously and you can test-drive Azure ML Studio for up to eight hours.
- Amazon has a standard approach to cloud services: first it gives an interested audience the basic functionality, that audience builds something with it, and the company then finds out what people really need.
- The same can be said of Amazon Machine Learning. The service connects to data stored in Amazon S3, Redshift, or RDS, and can perform binary classification, multi-class categorization, and regression on that data to build a model. However, the service is tied to Amazon. Not only does it work solely with data stored in Amazon's own storage, but models cannot be imported or exported, and training datasets cannot be larger than 100 GB. Still, it is a good tool to start with, and it shows that machine learning is turning from a luxury into a practical tool.
Microsoft Distributed Machine Learning Toolkit (DMTK)
- The more computers you can throw at a machine learning problem, the better. But pulling together a large fleet of machines and building ML applications that run efficiently on them can be a challenge. The DMTK (Distributed Machine Learning Toolkit) framework is designed to solve the problem of distributing various ML operations across a cluster of systems.
- DMTK is positioned as a framework rather than a full-fledged out-of-the-box solution, so it ships with only a small number of algorithms. But its architecture allows it to be extended and to squeeze the most out of clusters with limited resources. For example, each cluster node has its own local cache, which reduces the amount of traffic with the central server node that provides the parameters for the job being run.
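The idea of a per-node cache in front of a central parameter node can be sketched in a few lines of plain Python. This is not DMTK's API: the class names, the `sync_every` setting, and the flush policy are hypothetical and chosen only to show why a local cache cuts down on round trips.

```python
class ParameterServer:
    """Central node that holds the shared model parameters."""

    def __init__(self, params):
        self.params = dict(params)
        self.requests = 0  # number of round trips made by workers

    def pull(self, key):
        self.requests += 1
        return self.params[key]

    def push(self, key, delta):
        self.requests += 1
        self.params[key] += delta


class Worker:
    """Cluster node with a local parameter cache and buffered updates."""

    def __init__(self, server, sync_every=4):
        self.server = server
        self.sync_every = sync_every  # hypothetical setting: updates per flush
        self.cache = {}               # local copies of parameters
        self.pending = {}             # updates accumulated but not yet sent
        self.steps = 0

    def get(self, key):
        # Contact the server only on a cache miss.
        if key not in self.cache:
            self.cache[key] = self.server.pull(key)
        return self.cache[key]

    def update(self, key, delta):
        self.cache[key] = self.get(key) + delta
        self.pending[key] = self.pending.get(key, 0.0) + delta
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.flush()

    def flush(self):
        # Send the accumulated updates in one push per key, then drop the
        # cache so that the next read picks up fresh values from the server.
        for key, delta in self.pending.items():
            self.server.push(key, delta)
        self.pending.clear()
        self.cache.clear()


def run_demo():
    server = ParameterServer({"w": 0.0})
    worker = Worker(server, sync_every=4)
    for _ in range(8):
        worker.update("w", 0.5)
    return server.params["w"], server.requests
```

In this sketch, `run_demo()` applies eight updates but contacts the server only four times (two pulls and two pushes), whereas a worker without a cache would need a pull and a push for every update. The price is that a worker may briefly compute on parameters that other nodes have already changed.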
- Like Microsoft DMTK, Google TensorFlow is a machine learning framework designed to distribute computations across a cluster. Like Google Kubernetes, it was developed to solve Google's internal problems, but the company eventually released it as an open source product.
- TensorFlow implements data flow graphs, in which batches of data (“tensors”) are processed by a series of algorithms described by a graph. The movement of data through the system is called a “flow,” hence the name. Graphs can be assembled in C++ or Python and processed on a CPU or a GPU. Google's long-term plan is to have third-party contributors help develop TensorFlow.
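A data flow graph is easier to picture with a toy example. The sketch below is plain Python and deliberately does not use TensorFlow's API; the `Node` class and the `constant`, `add`, `matmul`, and `evaluate` helpers are hypothetical names invented for this illustration. It shows the two phases such a system relies on: first the graph is described, then data flows through it when the graph is evaluated.

```python
class Node:
    """One operation in the graph; its inputs are other nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def matmul(a, b):
    return Node("matmul", (a, b))


def _matmul(x, y):
    # Matrices are nested lists: multiply each row of x by each column of y.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def _add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


KERNELS = {"add": _add, "matmul": _matmul}


def evaluate(node, cache=None):
    """Push data through the graph, computing each node at most once."""
    if cache is None:
        cache = {}
    if id(node) in cache:
        return cache[id(node)]
    if node.op == "const":
        result = node.value
    else:
        args = [evaluate(inp, cache) for inp in node.inputs]
        result = KERNELS[node.op](*args)
    cache[id(node)] = result
    return result


# Describe the graph y = x * w + b first; nothing is computed yet.
x = constant([[1, 2]])
w = constant([[3], [4]])
b = constant([[5]])
y = add(matmul(x, w), b)
```

Calling `evaluate(y)` then returns `[[16]]`. Because the graph is only a description until it is evaluated, a real framework is free to decide where each operation runs, for example on a CPU or a GPU.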
Microsoft Computational Network Toolkit
Hot on the heels of DMTK, Microsoft released another machine learning tool, CNTK.
- CNTK is similar to Google TensorFlow in that it lets you create neural networks by means of directed graphs. Microsoft compares the framework to products such as Caffe, Theano, and Torch. Its main advantage is speed, especially when multiple CPUs and GPUs are used in parallel. Microsoft claims that using CNTK together with GPU clusters on Azure sped up speech recognition training for the Cortana virtual assistant by an order of magnitude.
- CNTK was originally developed as part of a speech recognition research program and offered as an open source project; the company has since re-released it on GitHub under a much more liberal license.
- Samsung's Veles is a distributed platform for building deep learning applications. Like TensorFlow and DMTK, it is written in C++, although Python is used for automation and coordination between nodes. Datasets can be analyzed and automatically normalized before being fed to the cluster. A REST API lets you use trained models in production right away (provided your hardware is powerful enough).
- Veles uses Python for more than just glue code. For example, IPython (now Jupyter), a tool for visualizing and analyzing data, can display results from a Veles cluster. Samsung hopes that releasing the project as open source will stimulate further development, including ports to Windows and Mac OS X.
- The Brainstorm project was developed by graduate students at the Swiss institute IDSIA (Dalle Molle Institute for Artificial Intelligence). It was created “to make deep neural networks faster, more flexible, and more fun.” It already supports various recurrent neural network models, such as LSTM.
- Brainstorm uses Python to provide two “handlers,” or data management APIs: one for CPU computation via the NumPy library and one for GPUs via CUDA. Most of the work is done in Python scripts, so do not expect a lavish front-end interface unless you build one yourself. But the authors have far-reaching plans to “learn from earlier open source projects” and use “new design elements that are compatible with different platforms and computational backends.”
- Many machine learning projects use mlpack, a C++ library created in 2011 and designed for “scalability, speed, and ease of use.” You can work with mlpack through a cache of command-line executables for quick “black box” operations, or through its C++ API for more complex work.
- Version 2.0 of mlpack brought extensive refactoring, new algorithms, faster versions of existing ones, and the removal of inefficient old ones. For example, the Boost library's random number generator was dropped in favor of the native random number functions in C++11.
- A long-standing disadvantage of mlpack is the lack of bindings for any language other than C++. Programmers working in other languages cannot use mlpack until someone releases a suitable wrapper. MATLAB support has been added, but projects like this benefit most when they are directly usable in the main environments where machine learning is done.
- Another relatively recent product, Marvin is a neural network framework created by the Princeton Vision Group. It consists of just a few files written in C++ and the CUDA GPU framework. Despite its minimal code, Marvin comes with a good number of pre-trained models that can be reused with proper citation and contributed to via pull requests, just like the project's own code.
- Nervana is building a combined hardware and software platform for deep learning. It also offers the Neon framework as an open source project. Using pluggable modules, Neon can run the heavy computations on CPUs, GPUs, or Nervana's own hardware.
- Neon is written mostly in Python, with a few pieces in C++ and assembly. So if you do research work in Python, or use another framework that has Python bindings, you can start using Neon right away.
In conclusion, this is of course not a complete list of popular frameworks. You surely have a dozen favorite tools of your own in your toolbox. Do not be shy: share your findings in the comments on this article.