Architecture

[Figure: aios-all-components]

Component Descriptions

Management Services

Parser

The Parser service is the gateway for creating blocks, vDAGs, registry components, and clusters: users submit a specification as a JSON document. By default it applies the standard specification, but users can also define their own specifications using templates from the registry and the policy system.

Documentation for Parser

Cluster Controller Gateway

The Cluster Controller Gateway manages clusters within a network. It initializes cluster infrastructure, schedules blocks (in coordination with the target cluster's controller), and acts as a proxy to execute management commands for blocks and clusters. It also applies policy configuration updates, sets up vDAG controller infrastructure, and implements pre-check policies.

Documentation for Cluster Controller Gateway

Resource Allocator

The Resource Allocator executes resource allocation policies. The Cluster Controller Gateway uses it to determine the appropriate cluster, node, and GPUs for block scheduling. Users can also perform manual dry-runs to identify the optimal deployment targets.
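
To make the selection step concrete, here is a minimal sketch of what a dry-run over an allocation policy might look like. It is not the Resource Allocator's actual API: the `Cluster`, `Node`, and `Placement` types, the `dry_run` function, and the "most free GPU memory wins" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    free_gpu_memory_mb: list[int]  # free memory per GPU, indexed by GPU id


@dataclass
class Cluster:
    cluster_id: str
    nodes: list[Node]


@dataclass
class Placement:
    cluster_id: str
    node_id: str
    gpu_ids: list[int]


def dry_run(clusters: list[Cluster], gpus_needed: int, min_free_mb: int) -> Optional[Placement]:
    """Hypothetical policy: pick the node whose chosen GPUs have the most free memory."""
    best: Optional[Placement] = None
    best_score = -1
    for cluster in clusters:
        for node in cluster.nodes:
            # Keep only GPUs that satisfy the per-GPU memory requirement.
            eligible = [
                (free, gpu_id)
                for gpu_id, free in enumerate(node.free_gpu_memory_mb)
                if free >= min_free_mb
            ]
            if len(eligible) < gpus_needed:
                continue
            chosen = sorted(eligible, reverse=True)[:gpus_needed]
            score = sum(free for free, _ in chosen)
            if score > best_score:
                best_score = score
                best = Placement(cluster.cluster_id, node.node_id, sorted(g for _, g in chosen))
    return best
```

Because the function only reads its inputs and returns a candidate placement, it can be called repeatedly with different requirements without scheduling anything, which is the point of a dry-run.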

Documentation for Resource Allocator

Policies System

The Policies System includes a database, registry service, executors, and Kubernetes interfaces. It allows storing, loading, querying, and executing policies, or deploying them as jobs or remote functions.

Documentation for Policies System

Spec and Template Registries

These registries store pre-built specifications and custom parser templates. The Parser service uses them to invoke custom templates for parsing and validation. Users can also query available specifications and templates.

Documentation for Template Registry
Documentation for Spec Registry

Container Registry

The Container Registry provides a distributed system for storing container images across clusters and remote machines. It maintains a central index to enable discovery and upload of container images.

Documentation for Container Registry

Assets Registry

The Assets Registry stores the assets that applications use, such as policy code archives, model files, images, and videos. These assets can be referenced within blocks. Like the Container Registry, it is distributed and can be hosted on any cluster or remote machine with object storage. The Assets DB Registry provides a centralized index of available asset registries.

Documentation for Assets Registry

Components Registry

The Components Registry stores components that can be initialized as blocks. Each component includes references to the AIOS instance container image, any sidecar container images, metadata such as author information, input/output templates, default parameters, and settings.
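
The fields listed above can be pictured as a single record per component. The sketch below is illustrative only: the `ComponentEntry` class and every field name in it are hypothetical and do not reflect the registry's actual schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ComponentEntry:
    """Hypothetical shape of one Components Registry record."""
    component_id: str
    container_image: str                                       # AIOS instance container image
    sidecar_images: list[str] = field(default_factory=list)    # optional sidecar images
    metadata: dict[str, str] = field(default_factory=dict)     # e.g. author information
    input_template: dict = field(default_factory=dict)
    output_template: dict = field(default_factory=dict)
    default_parameters: dict = field(default_factory=dict)
    default_settings: dict = field(default_factory=dict)


# Example values are placeholders, not real images or components.
entry = ComponentEntry(
    component_id="example-detector",
    container_image="registry.example/detector:1.0",
    metadata={"author": "example-author"},
    default_parameters={"threshold": 0.5},
)
record = asdict(entry)  # plain dict, ready to serialize as JSON
```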

Adhoc Inference Servers Registry

Inference servers can be deployed on any cluster. The Adhoc Inference Servers Registry maintains a list of all deployed inference servers, their public endpoints, and metadata, allowing users to discover and submit inference tasks.
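
As a rough illustration of the discovery flow, the sketch below keeps an in-memory list of servers and filters them by metadata. The `InferenceServer` and `InferenceServerRegistry` names, their fields, and the filtering rule are assumptions made for this example, not the registry's real interface.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceServer:
    server_id: str
    cluster_id: str
    public_endpoint: str
    metadata: dict[str, str] = field(default_factory=dict)


class InferenceServerRegistry:
    """Hypothetical in-memory stand-in for the registry service."""

    def __init__(self) -> None:
        self._servers: dict[str, InferenceServer] = {}

    def register(self, server: InferenceServer) -> None:
        # Re-registering the same id overwrites the previous entry.
        self._servers[server.server_id] = server

    def discover(self, **required_metadata: str) -> list[InferenceServer]:
        # Return servers whose metadata matches every requested key/value pair.
        return [
            s for s in self._servers.values()
            if all(s.metadata.get(k) == v for k, v in required_metadata.items())
        ]
```

A client would then call `discover(...)` with the metadata it cares about and submit its task to the `public_endpoint` of one of the returned servers.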

Documentation for Adhoc Inference Servers Registry

Metrics System

The Metrics System includes metrics collectors, local cluster-level metrics storage, and a centralized global metrics database. It collects and stores metrics from nodes, block instances, and vDAG controllers.

Documentation for Metrics System

Cluster Services

Cluster Controller

The Cluster Controller allocates blocks, provisions resources, monitors deployments and node health, and interacts with the Cluster Controller Gateway for initial scheduling.

Documentation for Cluster Controller

Cluster Local Metrics System

The Cluster Local Metrics System includes hardware metric daemons on each node, along with local databases to store metrics related to blocks, clusters, and vDAGs for efficient querying.

Documentation for Metrics System

Application Layer

Block

A Block is an instance of a component deployed using a container image built with the AIOS SDK. Blocks can be deployed on any cluster in the network. The SDK supports building custom computational logic for execution as a block. Each block includes services like auto-scaling, load balancing, health checking, and metrics integration for effective management and monitoring.

Documentation for Block

Adhoc Inference Server

Inference or general computational tasks can be submitted to blocks via inference servers. These servers can be deployed on any cluster and registered for public discovery through the registry.

Documentation for Adhoc Inference Server

vDAG and vDAG Controller

A vDAG (virtual Directed Acyclic Graph) spans multiple blocks across the network. The vDAG Controller enables submission of vDAG inference tasks and provides mechanisms for health checks, quota management, and quality audits via policies.
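
The graph structure itself can be sketched with the Python standard library. In the example below, each block is modeled as a plain function and the vDAG as a mapping from a block id to the ids of its upstream blocks; `run_vdag` and this representation are assumptions for illustration and are not how the vDAG Controller is implemented.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable


def run_vdag(
    upstream: dict[str, set[str]],
    blocks: dict[str, Callable[[list[Any]], Any]],
    source_input: Any,
) -> dict[str, Any]:
    """Run each block after all of its upstream blocks (hypothetical helper).

    `upstream` maps a block id to the ids of the blocks it consumes from.
    Blocks with no upstream receive `source_input`.
    """
    outputs: dict[str, Any] = {}
    # static_order() yields a valid execution order and raises
    # graphlib.CycleError if the graph is not acyclic.
    for block_id in TopologicalSorter(upstream).static_order():
        parents = sorted(upstream.get(block_id, ()))
        inputs = [outputs[p] for p in parents] or [source_input]
        outputs[block_id] = blocks[block_id](inputs)
    return outputs
```

In a real deployment the blocks would live on different clusters and the edges would be network calls, but the ordering constraint is the same: a block runs only once every upstream block has produced its output.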

Documentation for vDAG Controller

LLM System

The LLM System supports deploying large language models as blocks using the AIOS LLM SDK, which builds upon the base AIOS SDK. It allows model splitting and deployment across multiple nodes or clusters using the vDAG pipeline parallelism approach.
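
Pipeline parallelism assigns contiguous groups of model layers to successive stages, so each stage can run on a different node or cluster. The sketch below shows only the partitioning arithmetic; `split_layers` is a hypothetical helper, not part of the AIOS LLM SDK, and it assumes every layer costs about the same.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages: list[range] = []
    start = 0
    for i in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if i < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages
```

For example, `split_layers(10, 3)` gives stages covering layers 0-3, 4-6, and 7-9. Each stage could then be deployed as its own block, with the stages chained in order as a vDAG.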

Documentation for LLM Block Instance
Documentation for LLM Model Splitting and Deployment

Third-Party System Integration

Blocks can integrate with third-party services, whether external or internal to the network. These services can be deployed independently or co-located with blocks using init containers to ensure they start within the same cluster environment.

Documentation for Third-Party Blocks

Index:

1. Getting Started

1.1 Installation
1.2 Onboarding Cluster
1.3 Onboarding Node to a Cluster

2. Management Services

2.1 Parser

2.1.1 Parser Introduction
2.1.2 Creating a Block
2.1.3 Creating a Cluster
2.1.4 Creating a Component
2.1.5 Creating vDAGs
2.1.6 Executing Management Commands
2.1.7 Filtering and Search

2.2 Other Services

2.2.1 Policies System
2.2.2 Cluster Controller Gateway
2.2.3 Resource Allocator
2.2.4 Failure Policy Executor
2.2.5 Tasks Tracking and Tasks DB

3. Registries

4. Cluster Services

4.1 Cluster Controller
4.2 Metrics System

5. Application Layer

5.1 Block
5.2 AIOS Instance SDK
5.3 Inference Server
5.4 vDAG Controller

5.5 LLM Ecosystem

5.5.1 Method-1: LLM SDK
5.5.2 Method-2: Model Splitting and Distributed Inference
5.5.3 Third-party System Integration