There are some cool Python libraries and machine learning platforms out there to support work in data science. In this article I list some of them, along with a few highlights of their core capabilities, and include links to further details for quick reference. Here they are, in no particular order:
statsmodels – Great for statistics around ordinary least squares (OLS), providing measures such as R-squared, skewness, kurtosis, and AIC and BIC scores on the data. It is also great for conducting statistical tests and statistical data exploration.
bokeh – This library is great for providing end users interactive data visualisation inside modern web browsers. Using bokeh, one can quickly and easily make interactive plots, dashboards, and data applications. It can generate highly customisable glyphs such as lines, step lines, multiple lines, and stacked lines, as well as stacked bars, hex tiles, and time series plots.
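A minimal sketch of building a bokeh line plot and rendering it to standalone HTML (the data here is invented for illustration):

```python
from bokeh.plotting import figure
from bokeh.embed import file_html
from bokeh.resources import CDN

# A simple interactive line plot with point markers
p = figure(title="Sample line plot", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4], [4, 7, 2, 5], line_width=2)
p.scatter([1, 2, 3, 4], [4, 7, 2, 5], size=8)

# Render as a self-contained HTML page (loads BokehJS from the CDN)
html = file_html(p, CDN, "demo")
```

In a script you would more commonly call `bokeh.plotting.show(p)` or `output_file(...)` to open the plot in a browser.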
Theano – It is used to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently, making use of GPUs for computation. It works with symbolic expressions defined in Python.
yellowbrick – Yellowbrick extends the scikit-learn API to make model selection and hyperparameter tuning easier. It builds on top of matplotlib. Its key visualizers cover feature analysis, classification, regression, clustering, model selection, target, and text visualization.
plotly – It is an interactive, open-source, and browser-based graphing library for Python. It supports:
basic charts – scatter, line, pie, bar and more
statistical charts – histograms, box plots, distplots…
scientific charts – contour, heatmaps, ternary plots…
financial charts – time series, candlestick, funnel chart…
maps, 3D charts, and subplots. It even supports animations.
Keras – A Python deep learning library. It supports CNNs and RNNs and can run seamlessly on CPUs and GPUs. It supports the Sequential model and the Model class used with the Keras functional API.
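As a minimal sketch of the Sequential API (assuming the Keras bundled with TensorFlow; the layer sizes are arbitrary):

```python
from tensorflow import keras

# A tiny binary classifier: 20 inputs -> 16 hidden units -> 1 output
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training then follows the usual `model.fit(X, y, epochs=...)` pattern; the functional API builds the same graph by calling layers on tensors instead.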
Scikit-learn – A sort of Swiss Army knife among libraries, covering many tasks including classification, regression, clustering, dimensionality reduction, preprocessing (transformers and pipelines), and model selection.
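A short sketch showing how a transformer and a classifier chain together in a pipeline, one of the patterns mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and estimator combined into one fit/predict object
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

Because the pipeline is a single estimator, it plugs straight into model-selection tools such as `GridSearchCV` and `cross_val_score`.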
Numpy – It is the fundamental package for scientific computing in Python. It is great for working with arrays and performing linear algebra operations on arrays. The broadcasting feature is extremely useful, making coding simpler.
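A small example of the broadcasting feature mentioned above: a `(2,)` vector is stretched across a `(3, 2)` matrix without any explicit loop.

```python
import numpy as np

# Centre each column of a matrix by subtracting its column mean
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

means = data.mean(axis=0)     # shape (2,)
centered = data - means       # (3, 2) - (2,) broadcasts across rows

print(centered.mean(axis=0))  # → [0. 0.]
```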
Pandas – It is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It can import and export data from and to a variety of file formats, such as CSV and Excel. It can be used to slice the data, subset it, merge/join/concatenate data from multiple sources, and remediate missing data. Pandas supports groupby, pivot tables, time series, and sparse datasets. It is one of the most essential tools for a data scientist, especially when needing to perform exploratory data analysis (EDA).
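A quick sketch of a few of those operations on a made-up DataFrame, covering missing-data remediation, groupby, and a pivot table:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Oslo", "Oslo", "Bergen", "Bergen"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "temp":  [-4.0, -3.5, 1.5, None],     # one missing value
})

# Remediate the missing value with the column mean
df["temp"] = df["temp"].fillna(df["temp"].mean())

# Aggregate with groupby, then reshape with a pivot table
summary = df.groupby("city")["temp"].mean()
pivot = df.pivot_table(index="city", columns="month", values="temp")
```

`pd.read_csv` and `pd.read_excel` cover the import side, with matching `to_csv`/`to_excel` methods for export.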
MXNet – A flexible and efficient library for deep learning from Apache. It supports multi-GPU and multi-host training. MXNet is deeply integrated into Python and also supports Scala, Julia, Clojure, Java, C++, R, and Perl.
PaddlePaddle – A popular deep learning framework developed and widely used in China. PaddlePaddle originated from industrial practice, with a strong commitment to industrialisation. It has been adopted across a wide range of sectors, including manufacturing, agriculture, and enterprise services, and serves more than 1.5 million developers.
Platforms, Ecosystems and Frameworks
TensorFlow – TensorFlow is an end-to-end open-source platform for machine learning. It has all the tools, libraries and resources to build and deploy machine learning-powered applications. Models built with TensorFlow can be deployed on desktop, mobile, web and cloud. This is an offering from Google.
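As a minimal end-to-end sketch of the TensorFlow workflow (synthetic data; a tiny linear model built with its Keras API):

```python
import tensorflow as tf

# Synthetic data: y is a known linear function of x
x = tf.random.normal((256, 4))
true_w = tf.constant([[1.0], [2.0], [3.0], [4.0]])
y = x @ true_w

# Define, compile, train, and evaluate in a few lines
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=20, verbose=0)
loss = model.evaluate(x, y, verbose=0)
```

From here the same SavedModel can be exported to TensorFlow Serving, TensorFlow Lite (mobile), or TensorFlow.js (web), which is what the end-to-end claim refers to.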
Caffe – A deep learning framework made with expression, speed, and modularity in mind. Models and optimization are defined through configuration rather than hard-coding. Its speed makes it well suited to research experiments and industry deployment alike. Caffe can be used to create and train CNN models and run inference with them.
PyTorch – An open-source machine learning framework that accelerates the path from research prototyping to production deployment. It provides an ecosystem of tools and libraries, along with deployment options for cloud platforms such as Alibaba Cloud, Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.
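A minimal sketch of PyTorch's define-by-run style: an explicit training loop fitting y = 3x with a single linear layer (toy data, chosen for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(100, 1)
y = 3 * x                       # target: a known linear relationship

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# The training loop is ordinary Python: autograd tracks the graph each pass
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()             # compute gradients
    opt.step()                  # update parameters
```

This eager, imperative loop is what makes research prototyping quick; `torch.jit`/`torch.export` then turn the same model into a deployable artifact.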
SciPy – A Python-based ecosystem of open-source software for mathematics, science, and engineering. The SciPy ecosystem includes general and specialized tools for data management and computation, productive experimentation, and high-performance computing. NumPy, Matplotlib, and Pandas are some of the libraries that form part of the SciPy ecosystem.
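Two small examples of the `scipy` library itself, touching its optimization and statistics subpackages (the sample data is invented):

```python
import numpy as np
from scipy import optimize, stats

# Root finding: solve x^2 - 2 = 0 on the bracket [0, 2]
root = optimize.brentq(lambda x: x**2 - 2, 0, 2)

# A one-sample t-test against a hypothesised mean of 2.0
sample = np.array([2.1, 1.9, 2.0, 2.2, 1.8])
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)
```

Other subpackages follow the same pattern: `scipy.linalg` for linear algebra, `scipy.signal` for signal processing, `scipy.sparse` for sparse matrices, and so on.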
CNTK – This is an open-source Microsoft offering called the Cognitive Toolkit. CNTK is also one of the first deep learning toolkits to support the Open Neural Network Exchange (ONNX) format, an open-source shared model representation for framework interoperability and shared optimization. Co-developed by Microsoft and supported by many others, ONNX allows developers to move models between frameworks such as CNTK, Caffe2, MXNet, and PyTorch.
These are some of the libraries, ecosystems, and platforms that support a great deal of data science and machine learning work. Spending time exploring and gaining experience with them can help build proficiency in the field.