From Data Warehouse to Data Lake

Data comes in many forms. A bank or credit union generates a stream of data from account transactions. A durable goods manufacturer maintains a list of the names and addresses of warranty-registered product owners. A retailer generates customer purchase data through point-of-sale transactions and may have customer rewards account data linked to some of these purchases.

Some data possesses an easily defined structure: bank transaction records, for example, or a list of customer names and addresses. Storing and retrieving data in this form can usually be accomplished in-house using traditional data warehouse technologies such as a standard SQL database. Other forms of data are highly unstructured, such as transcribed text from call center recordings or notes from a physician's medical record. Deriving business value from these kinds of data requires a different approach to analysis: for example, natural language processing (NLP) and computer vision (CV) machine learning algorithms can operate on these kinds of data in ways that a database simply cannot.

Deriving value and actionable insights from all of these kinds of data requires analyzing them in an integrated way: all the data needs to come together on a platform where it can be governed and accessed by a data analytics team. That in turn requires a flexible storage environment that can accommodate data of different types and scale to keep up with the growing size of data. Aunsight does all of this by implementing a private data lake for its clients, where data is securely stored in massively scalable parallel storage clusters.

Aunsight Data Storage

Most Aunsight client data is stored on a Hadoop Distributed File System (HDFS). HDFS is a distributed filesystem developed by Apache for managing high-throughput, fault-tolerant storage clusters on commodity hardware.

Hadoop achieves high performance on very large volumes of data by distributing "chunks" of large files (called blocks in HDFS) across a number of storage hosts, or datanodes. A master server, the namenode, keeps track of which hosts hold each chunk and manages replication to make sure every chunk is stored on at least two hosts, which provides fault tolerance and increased access speeds. A request to store a very large file, say a CSV file containing millions of rows and measuring in the terabytes, can be handled easily because the file is broken up into individual pieces that are distributed to datanodes under the namenode's direction. Using this paradigm, data can be read and operated on at scale, since HDFS can distribute the work of retrieving data and even parallelize operations on data across the cluster. Moreover, because data is stored redundantly on multiple hosts, the failure of a host can usually be healed automatically by re-replicating the affected chunks from the surviving copies.
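The chunking and replication idea described above can be illustrated with a small simulation. This is only a toy sketch, not HDFS code: the function names, the round-robin placement rule, and the tiny sizes are assumptions made for illustration, and real HDFS placement is rack-aware and considerably more involved.

```python
def split_into_chunks(file_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that together cover the whole file."""
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def place_replicas(num_chunks: int, hosts: list[str], replication: int = 2) -> dict[int, list[str]]:
    """Assign every chunk to `replication` distinct hosts (simple round-robin)."""
    if replication > len(hosts):
        raise ValueError("not enough hosts for the requested replication")
    return {
        chunk_id: [hosts[(chunk_id + r) % len(hosts)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def readable_chunks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Chunks that still have at least one copy on a host that has not failed."""
    return {
        chunk_id
        for chunk_id, chunk_hosts in placement.items()
        if any(host not in failed for host in chunk_hosts)
    }


# Hypothetical example: a 1000-byte "file", 256-byte chunks, three hosts.
chunks = split_into_chunks(1000, 256)
placement = place_replicas(len(chunks), ["host-a", "host-b", "host-c"])
still_readable = readable_chunks(placement, failed={"host-a"})
```

With two copies of every chunk, losing any single host in this sketch leaves every chunk readable, which is the property that lets a cluster repair itself by copying the surviving replicas to healthy hosts.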

Atlas Records

Because Aunsight stores and accesses data through HDFS, there is no need to transform data in order to import it. Whatever its native type, data can be stored in Aunsight in any format that is useful for analysis. For text and numerical data, most users prefer JSON or one of the delimited formats: comma-separated values (CSV), tab-separated values (TSV), or pipe-separated values (PSV).
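The three delimited formats differ only in the separator character, so one reader can handle all of them. The following is a minimal sketch using Python's standard `csv` module; the `read_delimited` helper and the sample data are hypothetical and are not part of Aunsight.

```python
import csv
import io

# Separator used by each delimited format mentioned above.
DELIMITERS = {"csv": ",", "tsv": "\t", "psv": "|"}


def read_delimited(text: str, fmt: str) -> list[dict[str, str]]:
    """Parse delimited text with a header row into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[fmt])
    return [dict(row) for row in reader]


# Hypothetical pipe-separated sample.
rows = read_delimited("id|name\n1|Ada\n2|Grace\n", "psv")
```

Every value comes back as text at this stage; deciding which fields are numbers, dates, or booleans is a separate step.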

Aunsight uses a special type of metadata object called an Atlas record to track files that physically reside on platform HDFS storage. Atlas records store metadata about files and other resources of every type within Aunsight. Atlas records provide a convenient handle for the location and access permissions for a file, and also store dataset schema information about how to interpret the contents of that file. Not all Atlas data records have a schema, but defining a schema allows that data to be queried and operated upon intelligently by other Aunsight tools.
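As a rough mental model, an Atlas record can be pictured as a small metadata object that points at a file rather than containing it. The sketch below is purely illustrative: the class name, field names, and methods are assumptions and do not reflect the actual Atlas record format or any Aunsight API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AtlasRecordSketch:
    """Hypothetical stand-in for an Atlas record (not the real format)."""

    record_id: str                                    # handle used to refer to the resource
    location: str                                     # where the file physically resides
    readers: set[str] = field(default_factory=set)    # who may access the file
    schema: Optional[dict] = None                     # optional dataset schema

    def can_read(self, user: str) -> bool:
        """True if the user appears in the record's access list."""
        return user in self.readers

    def is_queryable(self) -> bool:
        """In this sketch, a record can be queried only once it has a schema."""
        return self.schema is not None


# Hypothetical record without a schema yet.
record = AtlasRecordSketch(
    record_id="sales-2020",
    location="/data/sales/2020.csv",
    readers={"analyst"},
)
```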

Dataset Schemas

Dataset schemas define a number of properties that can be used to interpret and process data in raw data files stored on an HDFS cluster. Schemas can be thought of as a kind of "header row" in a spreadsheet or a table description in a database, but Aunsight schemas contain more information and allow greater flexibility in dataset structure. In addition to field names, schemas define the implied data type used when evaluating expressions on data (e.g. is it a number, a date, text, or a boolean value?). Schemas can also describe complex data structures represented by arrays and JSON objects. These allow multidimensional data structures that relational databases represent as separate tables linked by keys, but without the inefficiencies caused by maintaining separate tables or executing JOIN statements to flatten the data into a single table.

Aunsight dataset schemas are natively stored as JSON objects inside a dataset's Atlas record. To aid in constructing schemas, Aunsight features several tools for automatic or guided creation of dataset schemas. Because Atlas stores only one schema per dataset, a change to a schema overwrites the previous version and cannot be reversed. For that reason, it is a good idea for data engineers and analysts to maintain good documentation on datasets and their sources.
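To make the idea of a schema concrete, the sketch below applies a JSON schema to a raw record, converting text values to typed values and handling a nested array of objects. The schema layout (`fields`, `name`, `type`, `items`), the type names, and the helper functions are all assumptions chosen for illustration; they are not the actual Aunsight schema format.

```python
import json
from datetime import date

# Hypothetical schema: field names, implied types, and one nested array.
SCHEMA_JSON = """
{
  "fields": [
    {"name": "customer_id", "type": "number"},
    {"name": "name", "type": "text"},
    {"name": "signup", "type": "date"},
    {"name": "rewards_member", "type": "boolean"},
    {"name": "purchases", "type": "array", "items": [
      {"name": "sku", "type": "text"},
      {"name": "amount", "type": "number"}
    ]}
  ]
}
"""

# How each scalar type in this sketch converts a raw text value.
CASTERS = {
    "number": float,
    "text": str,
    "date": date.fromisoformat,
    "boolean": lambda s: s.strip().lower() == "true",
}


def cast_value(value, spec: dict):
    """Convert one raw value according to its field specification."""
    if spec["type"] == "array":
        return [cast_record(item, spec["items"]) for item in value]
    return CASTERS[spec["type"]](value)


def cast_record(record: dict, fields: list) -> dict:
    """Apply the field specifications to every field of a raw record."""
    return {f["name"]: cast_value(record[f["name"]], f) for f in fields}


schema = json.loads(SCHEMA_JSON)
raw = {
    "customer_id": "42",
    "name": "Ada",
    "signup": "2020-01-15",
    "rewards_member": "true",
    "purchases": [
        {"sku": "A-100", "amount": "19.5"},
        {"sku": "B-200", "amount": "5.5"},
    ],
}
typed = cast_record(raw, schema["fields"])
total_spent = sum(p["amount"] for p in typed["purchases"])
```

Because the purchases travel inside the customer record, the total can be computed directly from one record, where a relational layout would typically need a second table and a JOIN.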