Databases or Datasets?

Aunsight is not a database; it is a platform for digital transformation that extracts insights from a wide variety of data sources. Whether those sources are traditional SQL-based relational databases, unstructured IoT data streams, natural language text, or JSON data objects, the process begins and ends with data. For this reason, Aunsight provides datasets, a flexible and open-ended mechanism for describing and managing data sources regardless of their format. The term "data lake" is often used to describe this kind of big data storage architecture, and Aunsight implements its data lakes through a data storage abstraction called a "dataset." This article explains what Aunsight datasets are and what users can do with them using the Aunsight web interface.

What is an Aunsight Dataset?

An Aunsight dataset is a type of JSON object for storing metadata about a data source. Internally, the Aunsight platform's Atlas service manages these records and interprets this metadata for other services in the platform. At its simplest, an Atlas record helps the Aunsight platform locate, load, and interpret data stored on hardware resources attached to an organization or project. Usually the underlying storage is simply a file on a Hadoop Distributed File System (HDFS) volume, but other filesystems and storage mechanisms are possible.

In addition to physical storage, Atlas records describe the kind of data stored in that file. What type of data does it contain: delimiter-separated values (DSV) or a JSON object store? What structural conventions does the file observe? For a DSV file, which character delimits the fields: commas (CSV), pipes (PSV), tabs (TSV), or something else? Does the file contain a header row with a title for each of the recordset's columns? And how are records separated: with Windows (\r\n) or macOS and Linux (\n) newline conventions?
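To make these questions concrete, the sketch below shows how a format description of this kind could drive parsing. It is only an illustration: the record layout, the field names (location, format, delimiter, header, newline), the HDFS path, and the read_rows helper are assumptions made for this example, not the actual structure of an Atlas record.

```python
import csv

# Hypothetical metadata record. The field names and the path are
# illustrative assumptions, not Aunsight's real Atlas record format.
record = {
    "name": "sales_2018",
    "location": "hdfs:///data/sales_2018.psv",
    "format": {
        "type": "dsv",
        "delimiter": "|",    # pipe-separated values (PSV)
        "header": True,      # first record holds the column titles
        "newline": "\r\n",   # Windows-style record separator
    },
}


def read_rows(raw_text, fmt):
    """Parse delimiter-separated text according to a format description."""
    # Normalize the declared record separator, then drop empty lines.
    lines = raw_text.replace(fmt["newline"], "\n").split("\n")
    lines = [line for line in lines if line]
    rows = list(csv.reader(lines, delimiter=fmt["delimiter"]))
    if fmt["header"]:
        titles, body = rows[0], rows[1:]
        # Pair each value with its column title.
        return [dict(zip(titles, row)) for row in body]
    return rows


sample = "id|city\r\n1|Fort Wayne\r\n2|Chicago\r\n"
rows = read_rows(sample, record["format"])
print(rows)
```

Every value comes back as a string here; the metadata says how to split the file into records and fields, but nothing yet about what each field means.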

Another aspect of datasets is the Atlas metadata for schemas. Schemas define the types of data in the recordset so that Aunsight tools know how to manipulate and operate on the data appropriately. For example, since dates are physically stored as text strings in most file formats, Aunsight needs to know that a field containing strings like "2018-07-09" holds dates and not arbitrary groupings of characters. Once Aunsight knows that a field contains dates, it can handle that data in appropriate ways. Subtracting one timestamp from another, for instance, is not as simple as subtracting the numbers stored between the delimiters. But if a schema instructs Aunsight to treat that string as a timestamp, it can apply specialized operations that account for seconds, minutes, hours, days, months, and years and return the amount of time between the two timestamps.

Another important function of Aunsight schemas is to provide instructions for preserving data integrity when passing data to software that may handle data types differently. For example, data processing platforms such as Apache Spark and Jethro each handle data types in their own way, and Aunsight sometimes needs to adjust its output to accommodate the differences between these systems. Schemas allow dataset managers to specify not just how data is handled within Aunsight, but also how to translate the data for use by other applications.
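The timestamp case can be illustrated with a short sketch. The schema layout, the type names, the timestamp format, and the apply_schema helper below are all assumptions made for this example; they are not Aunsight's schema syntax.

```python
from datetime import datetime

# Hypothetical schema mapping field names to type names.
# Illustrative only; this is not Aunsight's actual schema format.
schema = {
    "order_id": "integer",
    "ordered_at": "timestamp",
    "shipped_at": "timestamp",
}

# One conversion function per type name (timestamp format is an assumption).
CASTERS = {
    "integer": int,
    "string": str,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}


def apply_schema(row, schema):
    """Convert each raw string value to the type its schema entry names."""
    return {field: CASTERS[schema[field]](value) for field, value in row.items()}


raw = {
    "order_id": "42",
    "ordered_at": "2018-07-09 08:30:00",
    "shipped_at": "2018-08-01 17:45:00",
}
typed = apply_schema(raw, schema)

# With real timestamps, subtraction accounts for month lengths and
# clock arithmetic instead of comparing digit groups.
elapsed = typed["shipped_at"] - typed["ordered_at"]
print(elapsed)  # 23 days, 9:15:00
```

Without the schema, the two values are just strings and the subtraction is not defined at all.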

Because managing schemas is one of the most important parts of maintaining a dataset, Aunsight offers several tools that make this process easier. Users can run a schema auto-detector that reads a small portion (10 MB) of the data and infers a pattern from the values in that sample. Alternatively, users who have good documentation or information about the data's provenance can build a schema manually using the visual builder tool, or edit the schema JSON directly.
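The general idea behind detecting a schema from a sample can be sketched as follows. This is not Aunsight's auto-detector: the type names, the order in which types are tried, and the infer_type and infer_schema helpers are assumptions chosen for illustration, and the sketch works on rows that have already been parsed rather than on a 10 MB slice of a file.

```python
from datetime import datetime


def _parses(convert, value):
    """Return True if convert(value) succeeds without a ValueError."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


# Candidate types, tried from most to least specific (illustrative set).
CHECKS = [
    ("integer", lambda v: _parses(int, v)),
    ("float", lambda v: _parses(float, v)),
    ("date", lambda v: _parses(lambda s: datetime.strptime(s, "%Y-%m-%d"), v)),
]


def infer_type(values):
    """Pick the first candidate type that fits every non-empty sample value."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"
    for name, check in CHECKS:
        if all(check(v) for v in present):
            return name
    return "string"  # fallback when no narrower type fits


def infer_schema(rows):
    """Infer a type for each field from a sample of string-valued rows."""
    return {field: infer_type([row[field] for row in rows]) for field in rows[0]}


sample_rows = [
    {"id": "1", "price": "9.99", "sold_on": "2018-07-09", "city": "Fort Wayne"},
    {"id": "2", "price": "12", "sold_on": "2018-07-10", "city": "Chicago"},
]
detected = infer_schema(sample_rows)
print(detected)
```

A sample can mislead: a column that happens to hold only whole numbers in the first rows may contain decimals or text further down, which is one reason a detected schema may still need manual review and editing.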

Getting Started with Datasets

The first step in managing datasets is to learn about the datasets workspace in the Aunsight web interface. Using this workspace, users can create a new dataset, ingest data, define a schema, and inspect the data. Dataset management is a crucial component of the data analytics process, so becoming familiar with these skills will help users in many different roles on a data analytics team.