Machine Learning with Aunsight

Data science is more than just statistics: machine learning can unlock the hidden potential in data by developing algorithms that extract complex predictive patterns from a dataset or classify an unseen data object by comparing its features with those of previously learned objects. Developing models is an intensive process, both in compute resources and in the human intelligence required, so it is essential that data scientists be able to focus on what they do best and leave the management of tools and infrastructure to the platform.

Aunsight enables digital transformation by providing a convenient environment where data scientists can explore data and train models on it, while the platform manages the infrastructure behind the scenes. Rather than worrying about how data is stored and managed, data scientists simply write modeling code in a notebook environment. Aunsight handles the complexity of managing the storage and compute environments: the notebooks run as Docker containers on a scalable, elastic compute environment. Terabytes of data can be fed into notebooks because all computation happens inside the Aunsight platform infrastructure. The resulting models can then be stored and deployed within Aunsight processes to serve in production data pipelines.

Developing Models in Data Labs

Aunsight's Data Lab provides a convenient dashboard for developing machine learning models. Data Labs are notebooks that provide a self-contained environment for machine learning experiments and production model training. Notebooks can load industry-standard tools like TensorFlow and PyTorch to perform modeling. Aunsight provides a number of base images that contain toolkits, libraries, and common packages pre-loaded and ready to use. Users need only select a base image and begin writing code; if a package they need is not available in one of the pre-built images, they can install it themselves.

Users can interact with data in the Aunsight platform using the platform SDKs for Python or R, and load the data into libraries like Pandas. Notebooks also feature a full Debian Linux subsystem with access to Toolbelt, Aunsight's command line interface. Using Toolbelt, whole datasets can be loaded into the notebook's storage volume for use in multiple notebooks.

Every notebook container has an attached NFS storage volume to allow data persistence between container restarts. It is also possible to mount other volumes both within Aunsight and from other locations (e.g. Amazon S3).
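
As a minimal sketch of what this persistence allows, the following saves intermediate training state to a directory and reads it back, as a notebook might do after a container restart. The helper names (save_state, load_state) are hypothetical, and the mount path of the storage volume is not stated here, so a temporary directory stands in for it.

```python
import json
import tempfile
from pathlib import Path


def save_state(volume_dir, name, state):
    """Write a JSON-serializable dict to <volume_dir>/<name>.json and return the path."""
    target = Path(volume_dir) / f"{name}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename it into place, so an
    # interrupted write is less likely to leave a half-written file behind.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(target)
    return target


def load_state(volume_dir, name, default=None):
    """Read previously saved state, or return `default` if none exists yet."""
    target = Path(volume_dir) / f"{name}.json"
    if not target.exists():
        return default
    return json.loads(target.read_text())


# Stand-in for the mounted volume; the real mount path is an assumption.
volume_dir = tempfile.mkdtemp(prefix="notebook_volume_")

save_state(volume_dir, "training_run", {"epoch": 3, "best_loss": 0.42})
resumed = load_state(volume_dir, "training_run")
```

In a real notebook, volume_dir would point at the attached volume, so resumed would still be available after the container is stopped and started again.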

At the end of model training, data scientists can deploy their models into the Model service, where up to thirty model versions can be stored for future deployment or comparison.

Deploying and Managing the Model Lifecycle

Aunsight Data Labs provide a convenient environment for experimentation and model development, but in most cases the resulting trained models can be deployed more efficiently in specialized production container environments. Aunsight's Model service provides a vehicle for storing and deploying models into production workflows. At the end of the model training process, data scientists can serialize their models (called "pickling" in Python) in the notebook and push that binary model data into the Aunsight platform as model objects. These objects are later loaded into a specialized Aunsight process and deserialized in order to score new data with the model.
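
The serialize-then-score round trip can be sketched with Python's standard pickle module. The toy model below (a dictionary of weights and a bias) and the function names are illustrative assumptions; the step that pushes the bytes into the Model service is omitted because the SDK calls for it are not described here.

```python
import pickle


def score(model, record):
    """Score one record with a toy linear model stored as plain parameters."""
    weights = model["weights"]
    return model["bias"] + sum(w * record.get(k, 0.0) for k, w in weights.items())


def serialize_model(model):
    """Turn the trained model into bytes ("pickling")."""
    return pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_model(blob):
    """Rebuild the model from bytes. Only unpickle data from a trusted source:
    unpickling can execute arbitrary code."""
    return pickle.loads(blob)


# In the notebook: a "trained" model (hypothetical parameters).
trained = {"weights": {"age": 0.5, "visits": 2.0}, "bias": 1.0}
blob = serialize_model(trained)  # binary model data, ready to be stored

# Later, in the scoring process: load the bytes and score new data.
restored = deserialize_model(blob)
new_record = {"age": 4.0, "visits": 3.0}
restored_score = score(restored, new_record)
```

Real models from libraries such as TensorFlow or PyTorch often have their own preferred save formats, so pickle is shown here only because it is the mechanism the text names.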

The Models dashboard in the web interface allows users to view and manage model versions in the platform. Each model object can store up to thirty different model versions. Storing multiple versions of a model allows the same dataset to be run through different versions to compare the results and gain insight into how each version's training data affected its performance. For example, if the most recently trained model is not performing as well as a previous version, or is generating an abnormal distribution of scores compared to previous versions, it may be worth considering whether the model has been "overtrained" and is in need of pruning. Comparing model versions can also help with explainability, as a significant change in one version's performance may indicate anomalies in that version's training data that help to explain the model's behavior.
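
The comparison described above can be sketched as follows: the same dataset is scored by two versions, and the newer version is flagged if its mean score drifts too far from the older one. The version functions, the dataset, and the tolerance are all hypothetical, and the drift check is one simple heuristic rather than a method prescribed by the platform.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Summarize a list of scores by mean and population standard deviation."""
    return {"mean": mean(scores), "stdev": pstdev(scores)}


def compare_versions(versions, dataset):
    """Score the same dataset with every version and summarize each result.

    `versions` maps a version name to a scoring function.
    """
    return {
        name: summarize([score_fn(row) for row in dataset])
        for name, score_fn in versions.items()
    }


def flag_shift(baseline, candidate, tolerance=0.5):
    """Return True if the candidate's mean score differs from the baseline's
    by more than `tolerance` baseline standard deviations."""
    gap = abs(candidate["mean"] - baseline["mean"])
    if baseline["stdev"] == 0:
        return gap > 0
    return gap > tolerance * baseline["stdev"]


# Hypothetical stand-ins for two stored model versions.
versions = {
    "v1": lambda x: 2.0 * x,
    "v2": lambda x: 2.0 * x + 5.0,
}
dataset = [1.0, 2.0, 3.0, 4.0]

report = compare_versions(versions, dataset)
needs_review = flag_shift(report["v1"], report["v2"])
```

A flagged version is only a prompt to look closer at its training data; a shift in scores can also reflect a genuine improvement.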

Getting Started

To get started with machine learning in Aunsight, users will want to know how to use the Data Labs dashboard to view, create, and manage notebook containers. Users can start and stop Data Labs and monitor resource usage through this interface, which provides a convenient way to perform computationally intensive model training without overrunning resource quotas. Most data scientists will be familiar with the Jupyter notebook environment or will at least have used similar tools like MATLAB, but Aunsight's Jupyter notebooks also contain special features that facilitate interaction with the Aunsight platform. Finally, managing models forms an important part of the machine learning lifecycle, so data scientists will want to know how to manage models using the Models dashboard in the web interface as well as how to interact with Aunsight model objects using the platform SDKs.