Ingesting Data

When an Aunsight dataset is created, it exists by default as an Atlas metadata record with no data associated with it. In fact, Aunsight will not even create a file in the storage infrastructure until some kind of data is written to the dataset. The first step in working with a dataset, then, is getting data into it through a process called data ingestion. An ingest operation is managed by the Aunsight metro-dispatcher service, which oversees the transfer of data from the web interface into the platform architecture and notifies Aunsight when the transfer is complete. This article explains how to perform this kind of data upload, or "ingestion," using the Aunsight web interface.

The Basic Process

After logging in to the web interface and selecting the context you wish to work in, click the "Datasets" icon (team icon) in the palette on the right.

[Image: datasets workspace]

Select the dataset you wish to upload data to from the list displayed on the left of the main view. This will bring up its record in the main view. Click the "Ingest" tab to continue.

[Image: new dataset creation dialog]

The "Ingest" tab presents the options needed to complete the data ingestion process. Users need to select a data source and specify a write mode for ingestion. When these settings have been specified, click "Submit" in the lower left to start an upload job that adds the submitted data to this Aunsight dataset. This process can take from several seconds to several hours depending on the size of the dataset and the storage resources used. During ingestion, Aunsight places a lock on the dataset to prevent other sources from trying to read from or write to it. When the job finishes, Aunsight releases the lock and notifies the user interface that the upload was successful. In the event of a failure, Aunsight provides an error message specifying the reason the upload failed. Information about upload jobs is also logged in the jobs workspace in the Aunsight web interface.

Understanding the Settings

Aunsight data ingestion jobs can be configured with a variety of options. The remainder of this article explains these options and what they do.

Selecting a Data Source

Data ingestion requires a data source. Aunsight supports two kinds of data sources, which are selected with the "Ingest Source" radio buttons.

  • The File option will allow the user to select a file from their computer's filesystem to upload to the platform. When selected, a file chooser button will appear at the bottom of the data ingest tab. Clicking this will bring up a file selector dialog where users can browse for the file they would like to upload.

  • The Text option opens a text editor at the end of the ingest form where users can enter properly formatted data (character delimited or JSON) directly into the web interface. Users can copy and paste an entire dataset if they wish, but this option is best for small datasets. The Text option is also useful for adding a few records to a dataset using the append write mode.
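
As an illustration of what "properly formatted" character-delimited text might look like, the following minimal Python sketch renders a few records as delimited text that could be pasted into the Text editor. The function name `records_to_delimited_text`, the sample records, and the choice to omit a header row are assumptions made for this example; the delimiter and row delimiter should be set to whatever the dataset's Ingest tab displays.

```python
import csv
import io


def records_to_delimited_text(records, fieldnames, delimiter=",", row_delimiter="\n"):
    """Render a list of dicts as character-delimited text (hypothetical helper).

    The csv module quotes any field that contains the delimiter, a quote
    character, or a newline, so the resulting rows stay well-formed.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter=delimiter,
        lineterminator=row_delimiter,
    )
    # No header row is written here; whether a dataset expects one is an
    # assumption to confirm against the dataset's own settings.
    writer.writerows(records)
    return buffer.getvalue()


# Sample records invented for illustration.
rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "New York"},
]
text = records_to_delimited_text(rows, ["id", "name", "city"])
print(text)
```

The printed text can then be pasted into the Text editor and submitted with the append write mode to add the two records to an existing dataset.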

Understanding the Write Modes

When ingesting data, Aunsight provides three write modes that determine how Aunsight will handle the incoming data.

  • Append will add data from the source at the end of the existing data in the target. None of the target's data will be lost in this process.

  • Create will write data to the dataset, but only if the dataset is empty. If the dataset already contains data, the upload job will fail and report a "DestinationExistsError."

  • Overwrite will overwrite the existing data in the target. This effectively recreates the contents of the dataset from scratch.
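
The three write modes can be summarized with a small in-memory model. This is only a sketch of the behavior described above, not Aunsight's implementation or API: the `ingest` function and the list-based "dataset" are hypothetical, and only the error name `DestinationExistsError` comes from the platform's reported message.

```python
class DestinationExistsError(Exception):
    """Raised when the create mode targets a dataset that already holds data."""


def ingest(existing_rows, incoming_rows, mode):
    """Return the dataset contents after an ingest in the given write mode.

    Hypothetical model: a dataset is represented as a plain list of rows.
    """
    if mode == "append":
        # Existing rows are kept; new rows are added at the end.
        return list(existing_rows) + list(incoming_rows)
    if mode == "create":
        # Only allowed when the dataset is still empty.
        if existing_rows:
            raise DestinationExistsError("dataset already contains data")
        return list(incoming_rows)
    if mode == "overwrite":
        # Existing rows are discarded and replaced by the incoming rows.
        return list(incoming_rows)
    raise ValueError(f"unknown write mode: {mode!r}")


stored = ["row-1", "row-2"]
incoming = ["row-3"]

appended = ingest(stored, incoming, "append")
overwritten = ingest(stored, incoming, "overwrite")
created = ingest([], incoming, "create")

try:
    ingest(stored, incoming, "create")
    create_failed = False
except DestinationExistsError:
    create_failed = True

print(appended, overwritten, created, create_failed)
```

In this model, overwrite is the only mode that discards existing rows, while create acts as a guard against writing into a dataset that is already populated.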

Caution

Dataset operations are irreversible; never overwrite essential data as it cannot be recovered.

Formats and Delimiters

Aunsight displays three options that are relevant to data ingestion operations but cannot be directly edited.

  • The Format option shows what file format Aunsight expects data to be in (e.g. CSV, TSV, JSON, etc.).
  • The Delimiter option shows what character delimiter Aunsight expects to separate fields in the source file.
  • The Row Delimiter option shows whether Aunsight expects rows to be delimited with Windows (\r\n) or macOS/Linux (\n) style newlines.

These options are locked because Aunsight does not allow data to be stored in "mixed" formats, since that could lead to data corruption. Displaying these immutable options on this page ensures that users are aware of the constraints on the kind of data this Aunsight dataset can receive. If your source data does not match the specified format, you can either transform the data in the source file or create a new dataset that matches your source file and upload the data into it.
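
Before uploading, it can help to check locally whether a source file matches the delimiter and row delimiter shown on the Ingest tab. The sketch below is one possible pre-flight check written with Python's standard library; the function name `check_delimited_text` and its heuristics are assumptions for illustration and do not reflect how Aunsight itself validates data.

```python
import csv
import io


def check_delimited_text(text, delimiter=",", row_delimiter="\n"):
    """Return a list of problems found in character-delimited text.

    Hypothetical pre-flight check. It does not account for quoted fields
    that legitimately contain newlines, and it cannot tell when a file
    uses an entirely different delimiter with a consistent field count.
    """
    problems = []

    # Row delimiter check: look for the newline style that is NOT expected.
    if row_delimiter == "\n" and "\r\n" in text:
        problems.append("found Windows (\\r\\n) newlines but expected \\n")
    if row_delimiter == "\r\n" and "\n" in text.replace("\r\n", ""):
        problems.append("found bare \\n newlines but expected \\r\\n")

    # Field count check: every non-empty row should have the same width
    # when parsed with the expected delimiter.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    widths = {len(row) for row in reader if row}
    if len(widths) > 1:
        problems.append(f"rows have differing field counts: {sorted(widths)}")

    return problems


clean = check_delimited_text("a,b\n1,2\n")
windows = check_delimited_text("a,b\r\n1,2\r\n")
ragged = check_delimited_text("a,b\n1,2,3\n")
print(clean, windows, ragged)
```

An empty list means no problem was detected by these checks. If problems are reported, the source file can be transformed to match, or a new dataset with matching settings can be created instead.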

Note

These options are also locked because data is stored in these literal formats. Once a record has been created, changing the format would require the entire dataset to be re-read and re-written in place, which is a resource-intensive operation.