SkySpark by SkyFoundry

4. Folio

Overview

Folio is the tag database used by SkySpark to store and organize your data. The tagging model allows you to easily design free-form, dynamic models of your data.

Folio organizes data into a three-level hierarchy: projects, records, and tags.

Projects

A given SkySpark server hosts one or more projects. Projects are used to group records together into a single "database". The following features operate at the project level:

Project names must be legal programmatic names of four characters or more. Names of three characters or fewer are reserved for SkyFoundry, as are names such as "skyspark", "folio", etc. To take advantage of future cloud support, it is highly recommended to use a unique project name for all your projects.

Projects are physically stored on the file system under a directory structured as follows:

{skyspark-home}/
  db/
    projA/
      data/             // mastered data
        proj.diffs      // diff log file
        password.props  // password storage
        bins/           // bins directory
          binXXXX/      // bins sharded into sub-dirs by age
          binXXXY/      // bins sharded into sub-dirs by age
      snapshots/        // default location of snapshot zips
      cache/            // used for various temporary cache files
    projB/
      ...

Use the Host App page to manage your projects.

Records

A record or rec is the basic unit of modeling in the Folio database. Records are defined as a map of tags (name/value pairs). The tags assigned to a record are free-form; you may add, update, or remove tags at any time. Different extensions define tag libraries for modeling data using standard conventions.

There are a couple of tags which have special meaning:

Tags

Tags are the name/value pairs stored in a record. The name of a tag must follow the standard rules for Naming. The value of the tag is one of the following scalar types:

Note that although Bool is supported, the convention is to use the presence of a marker tag instead.

Storage

Folio persists data to disk, but operates as an in-memory database. Most records and tags are read from disk on startup and stored in RAM for fast access. This is required to support the real-time nature of sensor data. But it also imposes limits on Folio, since most hardware tends to have less RAM than disk space.

Every project, record, and tag is stored in RAM during runtime with the exception of bins (discussed next). As you design your data models for the tag database, you should take care to limit what is stored in memory and utilize bin tags as necessary. For example, the time-series historian uses records/tags for indexing, but bins to actually store the time-series samples.

Persistence is managed using a file called "proj.diffs" for each project. This is a simple text-based, append-only log file. As diffs are committed, they are applied in-memory and appended to the log file. During restart, the log file is replayed from beginning to end to reconstruct the state of the database. This design is extremely robust; in the rare case of data corruption, the file is easily repaired with a text editor. However, the design does require periodic compression of the log file using compaction.
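The commit, replay, and compaction cycle described above can be sketched in a few lines of Python. This is only an illustration of the append-only log technique: the one-JSON-object-per-line format, the DiffLog class, and its method names are assumptions made for the example and are not the actual "proj.diffs" format or any Folio API.

```python
import json
import os
import tempfile


class DiffLog:
    """Toy append-only diff log: one JSON object per line (hypothetical format)."""

    def __init__(self, path):
        self.path = path
        self.recs = {}  # in-memory state: rec id -> dict of tags
        if os.path.exists(path):
            self._replay()

    def _apply(self, diff):
        # A diff names a record id plus tags to set and tags to remove.
        rec = self.recs.setdefault(diff["id"], {})
        rec.update(diff.get("set", {}))
        for name in diff.get("remove", []):
            rec.pop(name, None)

    def _replay(self):
        # Rebuild state by applying every logged diff from beginning to end.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    self._apply(json.loads(line))

    def commit(self, diff):
        # Apply in memory, then append to the log for durability.
        self._apply(diff)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(diff) + "\n")

    def compact(self):
        # Rewrite the log as a single diff per record, dropping history.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for rec_id, tags in self.recs.items():
                f.write(json.dumps({"id": rec_id, "set": tags}) + "\n")
        os.replace(tmp, self.path)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "toy.diffs")
    log = DiffLog(path)
    log.commit({"id": "a", "set": {"dis": "Site A", "site": True}})
    log.commit({"id": "a", "set": {"area": 1000}, "remove": ["site"]})
    reloaded = DiffLog(path)  # simulates a restart
    return log.recs, reloaded.recs
```

Calling demo() returns the live state and the state rebuilt from the log, which should be equal. Because the log is plain text, a damaged line can be located and fixed by hand, which is the robustness property the text refers to.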

Bins

Bins are special tags used to store blobs or files on disk (as opposed to other tags, which are stored in-memory during runtime). To future-proof for cloud/grid architectures, only stream access is provided to bins (no random access I/O). However, Folio does support an efficient model for append-only transactions to a bin.

The value of the bin tag in-memory specifies the MIME type of the file. The bin tag must be created before attempting to read/write data to the stream.

Because a bin is just a normal tag, a given record can have multiple bins. The standard tag for storing a normal file is file. But as an example, for an image file you might also store a thumbnail tag on the rec.
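As a rough illustration of the bin model, the sketch below keeps the MIME type in memory as the tag value and exposes only append and sequential read on the underlying file. The BinStore class, its method names, and the one-file-per-tag layout are hypothetical choices for this example; they do not reflect Folio's real API or its bins directory structure.

```python
import os
import tempfile


class BinStore:
    """Toy bin storage: tag value in memory is the MIME type, bytes live on disk."""

    def __init__(self, root):
        self.root = root
        self.recs = {}  # rec id -> dict of tags

    def _path(self, rec_id, tag):
        # Hypothetical layout: one file per (record, bin tag) pair.
        return os.path.join(self.root, f"{rec_id}.{tag}.bin")

    def _check(self, rec_id, tag):
        # The bin tag must exist before any read or write.
        if tag not in self.recs.get(rec_id, {}):
            raise KeyError(f"bin tag {tag!r} not defined on rec {rec_id!r}")

    def create(self, rec_id, tag, mime):
        self.recs.setdefault(rec_id, {})[tag] = mime
        open(self._path(rec_id, tag), "ab").close()

    def append(self, rec_id, tag, data):
        self._check(rec_id, tag)
        with open(self._path(rec_id, tag), "ab") as f:
            f.write(data)

    def read(self, rec_id, tag, chunk=4096):
        # Sequential stream only: yields chunks in order, no seeking.
        self._check(rec_id, tag)
        with open(self._path(rec_id, tag), "rb") as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                yield block


def demo():
    store = BinStore(tempfile.mkdtemp())
    store.create("img1", "file", "image/png")
    store.create("img1", "thumbnail", "image/png")  # second bin on the same rec
    store.append("img1", "file", b"abc")
    store.append("img1", "file", b"def")
    return store.recs["img1"], b"".join(store.read("img1", "file"))
```

Here a single record carries both a file bin and a thumbnail bin, mirroring the image example above.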

Naming

Programmatic names are used by Folio for features such as:

Programmatic names use camelCaseNaming as follows:

The name tag may be used to assign a well-known name to records. A given project has one namespace, so it is illegal to create multiple records with the same name tag value. Named records may be easily accessed by name in addition to their id. Funcs are uniquely named via the name tag.

Queries

The APIs for querying a Folio database are based on filters. Filters allow you to construct predicates using basic boolean logic and comparison operators. Filters support pathing: any tag with a Ref value may be traversed using the -> operator during the query operation.
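To make the pathing idea concrete, here is a small Python sketch that evaluates predicates over records and follows Ref values from one record to the next. The Ref class and the resolve, has, eq, and_, and read_all helpers are invented for this example; it models the concept only and is not how Folio parses or executes filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Tag value that points at another record by id."""
    id: str


def resolve(db, rec, path):
    """Follow a path such as ("equipRef", "siteRef", "dis") through Ref values.

    Returns the final tag value, or None if any step is missing.
    """
    val = None
    for i, name in enumerate(path):
        if rec is None or name not in rec:
            return None
        val = rec[name]
        if i < len(path) - 1:
            # Every intermediate step must be a Ref to another record.
            if not isinstance(val, Ref):
                return None
            rec = db.get(val.id)
    return val


def has(*path):
    return lambda db, rec: resolve(db, rec, path) is not None


def eq(value, *path):
    return lambda db, rec: resolve(db, rec, path) == value


def and_(a, b):
    return lambda db, rec: a(db, rec) and b(db, rec)


def read_all(db, pred):
    return [rid for rid, rec in db.items() if pred(db, rec)]


def demo():
    db = {
        "s1": {"site": True, "dis": "HQ"},
        "e1": {"equip": True, "siteRef": Ref("s1")},
        "p1": {"point": True, "equipRef": Ref("e1")},
        "p2": {"point": True},
    }
    # Models a filter with the shape: point and equipRef->siteRef->dis == "HQ"
    return read_all(db, and_(has("point"), eq("HQ", "equipRef", "siteRef", "dis")))
```

In the demo only p1 matches: p2 has no equipRef to traverse, and the site and equip records lack the point tag.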

The following types of queries are supported:

Indexing

All queries to a Folio project take the form of a predicate Filter which is used to match a set of records. In the simplest case, each record in a project is scanned and checked against the filter for a match. Because the records are stored in RAM this operation is very fast; a run-of-the-mill server can scan roughly 10K records every millisecond. Scan time scales linearly with the number of records in your database.
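The linear scaling is easy to put numbers on. The helper below simply divides the record count by the rough 10K-records-per-millisecond figure quoted above; the function name is made up for this example, and real throughput will depend on the hardware and the filter being evaluated.

```python
SCAN_RATE_RECS_PER_MS = 10_000  # rough figure quoted in the text


def estimate_full_scan_ms(num_recs, rate=SCAN_RATE_RECS_PER_MS):
    """Estimated milliseconds for an unindexed scan of num_recs records."""
    return num_recs / rate


if __name__ == "__main__":
    for n in (10_000, 300_000, 2_000_000):
        print(f"{n:>9,} recs -> ~{estimate_full_scan_ms(n):g} ms")
```

At this rate a two-million-record project would need on the order of 200 ms per unindexed query.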

Query Optimizer

To optimize performance as the number of records grows, Folio will build and maintain a memory-based index for "hot" tags. The query optimizer uses these indexes to avoid scanning every record. Let's take an example query:

site and equipRef==xxxx

This query has two tags which are used: site and equipRef. In the case of site we only care if a rec has the tag (we don't care about its value). In the case of equipRef we care if a rec has the tag and it has a specific Ref value. Folio indexing handles both cases: it indexes which records have a tag, but it also sub-indexes the tag values.

The query optimizer will select the best index to use for the scan. If we have 300 recs with the tag site (whatever the value might be) and we have 15 recs with the tag/value pair equipRef==xxxx, then the query optimizer will choose the smallest index. In this case it will choose the equipRef==xxxx bucket and we only have to scan 15 items before determining the result.
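The smallest-bucket strategy can be sketched as follows. The TagIndex class keeps one set of record ids per tag name and one per tag/value pair, then scans only the smallest applicable set. All names here are hypothetical and the structure is a simplification written for this example, not Folio's internal index layout.

```python
from collections import defaultdict


class TagIndex:
    """Toy index: ids by tag name, sub-indexed by (tag name, value)."""

    def __init__(self, recs):
        self.recs = recs
        self.by_tag = defaultdict(set)  # tag name -> ids that have the tag
        self.by_val = defaultdict(set)  # (tag name, value) -> ids with that pair
        for rid, rec in recs.items():
            for name, val in rec.items():
                self.by_tag[name].add(rid)
                self.by_val[(name, val)].add(rid)

    def query(self, has_tags=(), eq_tags=None):
        """Match recs that have every tag in has_tags and every name==value pair."""
        eq_tags = eq_tags or {}
        buckets = [self.by_tag.get(t, set()) for t in has_tags]
        buckets += [self.by_val.get((t, v), set()) for t, v in eq_tags.items()]
        # Scan the smallest bucket; fall back to a full scan if no tags are named.
        to_scan = min(buckets, key=len) if buckets else set(self.recs)
        matches = []
        for rid in to_scan:
            rec = self.recs[rid]
            if all(t in rec for t in has_tags) and all(
                t in rec and rec[t] == v for t, v in eq_tags.items()
            ):
                matches.append(rid)
        return sorted(matches), len(to_scan)


def demo():
    recs = {i: {"site": True} for i in range(300)}  # 300 recs with site
    for i in range(290, 305):  # 15 recs with equipRef==xxxx
        recs.setdefault(i, {})["equipRef"] = "xxxx"
    idx = TagIndex(recs)
    return idx.query(has_tags=["site"], eq_tags={"equipRef": "xxxx"})
```

With this synthetic data the query inspects the 15 records in the equipRef bucket rather than the 300 in the site bucket, and every candidate is still checked against the full filter.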

Auto Indexing

You do not need to configure anything special to use indexing. Folio always keeps track of what tags are being used by queries. As soon as it detects a "hot" tag, it will automatically build an index for that tag. The current algorithm indexes a tag once it has been used in 100 queries.
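A usage-counting scheme of this kind might look like the sketch below. Only the threshold of 100 queries comes from the text; the AutoIndexer class, its fields, and the way an index is built are assumptions for illustration, and the sketch does not keep an index up to date when records change.

```python
from collections import Counter

HOT_THRESHOLD = 100  # per the text: a tag is indexed once used in 100 queries


class AutoIndexer:
    """Toy tracker that builds an index for a tag once it becomes hot."""

    def __init__(self, recs):
        self.recs = recs
        self.uses = Counter()  # tag name -> number of queries that used it
        self.indexes = {}      # tag name -> set of ids having that tag

    def note_query(self, tags):
        # Count each tag at most once per query.
        for tag in set(tags):
            self.uses[tag] += 1
            if self.uses[tag] >= HOT_THRESHOLD and tag not in self.indexes:
                self.indexes[tag] = self._build(tag)

    def _build(self, tag):
        return {rid for rid, rec in self.recs.items() if tag in rec}


def demo():
    recs = {1: {"site": True}, 2: {"equip": True}, 3: {"site": True}}
    auto = AutoIndexer(recs)
    for _ in range(99):
        auto.note_query(["site"])
    before = "site" in auto.indexes
    auto.note_query(["site", "equip"])  # the 100th query that uses site
    return before, auto.indexes
```

In the demo, site is indexed on its 100th use, while equip has been used once and remains unindexed.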

Query Tuning

A full scan of a large database with millions of recs might take hundreds of milliseconds. This might be fast enough for populating user interface screens, but when distributed across multiple functions it can add up quickly. So when working with large databases, it is important to ensure that hot queries are utilizing the index.

The best way to analyze queries is with the Debug tab in the Folio app. This tab provides a wealth of statistics on tag indexes and the queries being run.

The Tag Index section lists statistics on all tags which have been analyzed for optimization. This list includes statistics on tags which have not been indexed yet. The columns for the table:

The Queries section lists statistics on all the queries which have been run for the project since boot time:

Note that the current query optimizer cannot use the index for NOT filters and OR filters. Examples:

ahu and not rooftop        // can use ahu index, but not rooftop index
ahu or chiller             // will be unoptimized
equip and (ahu or chiller) // will use equip index
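The rule behind these three examples can be expressed as a small recursive function over a filter tree. The Has, Not, And, and Or node classes and the usable_indexes function are invented for this sketch, and the rule itself is inferred only from the examples above, so treat it as an approximation of the optimizer's behavior rather than a description of its implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Has:
    tag: str


@dataclass(frozen=True)
class Not:
    term: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def usable_indexes(f):
    """Return the tag names whose index could narrow the scan for filter f."""
    if isinstance(f, Has):
        return {f.tag}
    if isinstance(f, And):
        # Either side of an AND can narrow the candidate set.
        return usable_indexes(f.left) | usable_indexes(f.right)
    # NOT and OR terms contribute no usable index in this sketch.
    return set()


def demo():
    return [
        usable_indexes(And(Has("ahu"), Not(Has("rooftop")))),
        usable_indexes(Or(Has("ahu"), Has("chiller"))),
        usable_indexes(And(Has("equip"), Or(Has("ahu"), Has("chiller")))),
    ]
```

An empty result means the query falls back to a full scan, which is why wrapping an OR inside an AND with an indexed tag, as in the third example, keeps the query optimized.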

Diffs

Modifications to a Folio database are encapsulated as diffs. Diffs are a set of changes to apply to a record. Diffs work just like a patch file in a version control system. Diffs include the ability:

Transient Diffs

In general, when a diff is committed, it is written to the log file for durability. However, if your application has rapidly changing real-time data, this can cause serious performance issues. To support real-time data, Folio supports the concept of transient diffs. Transient diffs are applied only to the in-memory representation of the records and are not serialized to the log file. Transient diffs do not update the mod tag of the record. Transient diffs are allowed to be committed even when the project is in read-only mode.

Transient Recs

Folio also supports transient recs, which are records never serialized to disk. Transient records are defined by specifying the transient marker tag during creation. Diffs applied to transient records are never written to disk, regardless of whether the diff itself is marked transient.
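The rules for transient diffs and transient recs can be summarized in a short sketch. The Proj class, the ReadonlyErr exception, the in-memory list standing in for the log file, and the integer clock standing in for the mod timestamp are all assumptions for this example. Two behaviors are also assumed because the text does not specify them: a non-transient diff on a transient rec still updates mod, and it is still rejected in read-only mode.

```python
class ReadonlyErr(Exception):
    """Hypothetical error for persistent commits in read-only mode."""


class Proj:
    def __init__(self):
        self.recs = {}         # rec id -> dict of tags
        self.log = []          # stand-in for the on-disk diff log
        self.readonly = False
        self._clock = 0        # integer stand-in for the mod timestamp

    def commit(self, rec_id, changes, transient=False):
        # Only transient diffs are accepted while the project is read-only.
        if self.readonly and not transient:
            raise ReadonlyErr("project is in read-only mode")
        rec = self.recs.setdefault(rec_id, {})
        rec.update(changes)
        if not transient:
            # Assumption: every non-transient diff bumps mod.
            self._clock += 1
            rec["mod"] = self._clock
            # Diffs to transient recs are never logged.
            if "transient" not in rec:
                self.log.append((rec_id, dict(changes)))
        return rec


def demo():
    p = Proj()
    p.commit("pt", {"point": True})                    # logged, mod set
    p.commit("pt", {"curVal": 72.5}, transient=True)   # memory only
    p.readonly = True
    p.commit("pt", {"curVal": 73.0}, transient=True)   # still allowed
    p.readonly = False
    p.commit("tmp", {"transient": True, "x": 1})       # transient rec, not logged
    return p.recs["pt"], len(p.log)
```

After the demo, the pt record holds the latest curVal with its mod unchanged from creation, and the log contains only the single persistent diff.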

Trash

Records are moved into the trash bin by adding the trash marker tag. Trash recs continue to operate in the database just like any other record with two exceptions:

Concurrency Control

All records support the required mod tag indicating the timestamp of their last persistent modification. This timestamp is used to implement optimistic concurrency control. This model allows queries and diffs to operate without explicit locking. When a diff is constructed, it is passed the version of the rec which was read. If during the diff commit the database detects that the record has been modified since the last read, then the commit fails with a ConcurrentChangeErr.

Diffs support the ability to force a commit to bypass concurrency control. This is typically used when updating status tags under complete control of a given application. Transient diffs do not update the mod tag; however, unless the force flag is used, they are still checked for concurrent change.
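The check-on-commit behavior can be sketched as below. Apart from the ConcurrentChangeErr name, which comes from the text, the Db class and its methods are hypothetical, and an integer counter stands in for the mod timestamp so the example is deterministic.

```python
class ConcurrentChangeErr(Exception):
    """Raised when a rec changed after the version the diff was built from."""


class Db:
    def __init__(self):
        self.recs = {}
        self._clock = 0  # integer stand-in for the mod timestamp

    def _tick(self):
        self._clock += 1
        return self._clock

    def add(self, rec_id, tags):
        self.recs[rec_id] = {**tags, "id": rec_id, "mod": self._tick()}
        return dict(self.recs[rec_id])

    def read(self, rec_id):
        # Return a copy so callers hold the version they read.
        return dict(self.recs[rec_id])

    def commit(self, old_rec, changes, force=False, transient=False):
        cur = self.recs[old_rec["id"]]
        # Compare the version the caller read against the current version.
        if not force and cur["mod"] != old_rec["mod"]:
            raise ConcurrentChangeErr(old_rec["id"])
        cur.update(changes)
        if not transient:
            cur["mod"] = self._tick()  # transient diffs leave mod untouched
        return dict(cur)


def demo():
    db = Db()
    db.add("r", {"dis": "A"})
    a = db.read("r")
    b = db.read("r")           # two clients read the same version
    db.commit(a, {"dis": "B"})  # first commit wins
    try:
        db.commit(b, {"dis": "C"})
        stale_failed = False
    except ConcurrentChangeErr:
        stale_failed = True
    forced = db.commit(b, {"status": "ok"}, force=True)  # bypasses the check
    return stale_failed, forced
```

The second client's commit fails because it was built from a stale read; the usual remedy is to re-read the rec and retry, or to use force when the application owns the tags being written.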

Compaction

You can run a compaction operation on a Folio project to compress the "proj.diffs" file. This can result in a smaller file size and speed up load time on restart. During a compaction operation the project is set into read-only mode.

The user must be su to perform a compaction.

Snapshots

Folio supports the ability to take a snapshot of a project during runtime. A snapshot is a zip file which includes an atomic backup of the records, tags, and all the bin files. During a snapshot, the project is set into read-only mode. Any diffs committed or attempts to write to a bin during a snapshot operation will fail.

The user must be su to perform a snapshot or restore.

Proj Meta

Every Folio project should have exactly one rec with the projMeta tag. This record is used to store project-wide settings. The following tags may be configured on the projMeta rec:

Many of these tags can be configured in the SettingApp, although some may require manual editing using the FolioApp.

In addition to the tags above, the system automatically maintains a version tag on the projMeta record (do not modify this tag). The projMeta rec is also used to configure tuning parameters for the Folio database.