Architecture
(Download the architecture diagram)
Component Descriptions
Management Services
Parser
The Parser service is the gateway for creating blocks, vDAGs, registry components, and clusters: users submit a specification as a JSON document. It applies a default standard specification, and users can also define their own specifications using templates from the registry and the policy system.
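As an illustration of what a submitted specification might look like, the sketch below builds and checks a small JSON document in Python. Every field name here (`kind`, `id`, `component`, `cluster`), the `example/component:v1` reference, and the helper functions are assumptions made for this example; they are not the actual Parser schema, which is described in the Parser documentation.

```python
import json

# Hypothetical set of entity kinds, taken from the list of things the Parser can create.
ALLOWED_KINDS = ("block", "vdag", "component", "cluster")


def build_block_spec(block_id, component_ref, cluster_id=None):
    """Assemble an illustrative block specification (field names are assumptions)."""
    spec = {"kind": "block", "id": block_id, "component": component_ref}
    if cluster_id is not None:
        # Optional target cluster; when omitted, placement is left to the platform.
        spec["cluster"] = cluster_id
    return spec


def validate_spec(spec):
    """Minimal structural check, standing in for a real parser template."""
    missing = [key for key in ("kind", "id") if key not in spec]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if spec["kind"] not in ALLOWED_KINDS:
        raise ValueError(f"unknown kind: {spec['kind']!r}")
    return True


spec = build_block_spec("demo-block", "example/component:v1")
validate_spec(spec)
payload = json.dumps(spec, indent=2)  # the JSON document that would be submitted
print(payload)
```

A custom template would replace `validate_spec` with its own rules; the point is only that a specification is plain JSON that can be checked before anything is created.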
Cluster Controller Gateway
The Cluster Controller Gateway manages clusters within a network. It initializes cluster infrastructure, schedules blocks (in coordination with the target cluster's controller), and acts as a proxy to execute management commands for blocks and clusters. It also handles policy configuration changes, vDAG controller infrastructure setup, and pre-check policy implementations.
Documentation for Cluster Controller Gateway
Resource Allocator
The Resource Allocator executes resource allocation policies. The Cluster Controller Gateway uses it to determine the appropriate cluster, node, and GPUs for block scheduling. Users can also perform manual dry-runs to identify the optimal deployment targets.
Documentation for Resource Allocator
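To make the idea of an allocation policy concrete, here is a minimal sketch of a best-fit policy that picks a cluster, node, and GPUs for a block. The `Node` and `Placement` types and the `allocate` function are hypothetical names for this example, not the Resource Allocator's API, and best-fit is only one possible strategy. Because the function never modifies its inputs, calling it is effectively a dry-run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    cluster_id: str
    node_id: str
    free_gpus: List[int]  # IDs of GPUs that are currently unallocated


@dataclass
class Placement:
    cluster_id: str
    node_id: str
    gpu_ids: List[int]


def allocate(nodes: List[Node], gpus_needed: int) -> Optional[Placement]:
    """Best-fit: choose the node with the fewest free GPUs that still fits the request."""
    candidates = [n for n in nodes if len(n.free_gpus) >= gpus_needed]
    if not candidates:
        return None  # no node in any cluster can host the block
    # Ties are broken by cluster and node ID so the result is deterministic.
    best = min(candidates, key=lambda n: (len(n.free_gpus), n.cluster_id, n.node_id))
    return Placement(best.cluster_id, best.node_id, sorted(best.free_gpus)[:gpus_needed])


nodes = [
    Node("cluster-a", "node-1", [0, 1, 2, 3]),
    Node("cluster-b", "node-1", [0, 1]),
]
print(allocate(nodes, gpus_needed=2))
```

Best-fit keeps larger nodes free for larger requests; a real policy could weigh latency, cost, or cluster load instead.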
Policies System
The Policies System includes a database, registry service, executors, and Kubernetes interfaces. It allows storing, loading, querying, and executing policies, or deploying them as jobs or remote functions.
Documentation for Policies System
Spec and Template Registries
These registries store pre-built specifications and custom parser templates. The Parser service uses them to invoke custom templates for parsing and validation. Users can also query available specifications and templates.
Documentation for Template Registry
Documentation for Spec Registry
Container Registry
The Container Registry provides a distributed system for storing container images across clusters and remote machines. It maintains a central index to enable discovery and upload of container images.
Documentation for Container Registry
Assets Registry
The Assets Registry stores the assets that applications use, such as policy code archives, model files, images, and videos. These assets can be referenced within blocks. Like the Container Registry, it is distributed and can be hosted on any cluster or remote machine with object storage. The Assets DB Registry provides a centralized index of available asset registries.
Documentation for Assets Registry
Components Registry
The Components Registry stores components that can be initialized as blocks. Each component includes references to the AIOS instance container image, any sidecar container images, metadata such as author information, input/output templates, default parameters, and settings.
Adhoc Inference Servers Registry
Inference servers can be deployed on any cluster. The Adhoc Inference Servers Registry maintains a list of all deployed inference servers, their public endpoints, and metadata, allowing users to discover and submit inference tasks.
Documentation for Adhoc Inference Servers Registry
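The discovery flow can be pictured with a small in-memory registry. This is a sketch under stated assumptions: the `InferenceServer` and `ServerRegistry` classes, the metadata keys, and the `example.com` endpoints are all invented for illustration, and the real registry is a network service rather than a Python object.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InferenceServer:
    server_id: str
    endpoint: str  # public endpoint that clients submit tasks to
    metadata: Dict[str, str] = field(default_factory=dict)


class ServerRegistry:
    """Hypothetical stand-in for the registry: register servers, then discover them."""

    def __init__(self) -> None:
        self._servers: Dict[str, InferenceServer] = {}

    def register(self, server: InferenceServer) -> None:
        # Re-registering the same ID overwrites the previous entry.
        self._servers[server.server_id] = server

    def discover(self, **filters: str) -> List[InferenceServer]:
        """Return servers whose metadata matches every given key/value pair."""
        return [
            s for s in self._servers.values()
            if all(s.metadata.get(key) == value for key, value in filters.items())
        ]


registry = ServerRegistry()
registry.register(InferenceServer("srv-1", "https://inference-1.example.com", {"region": "eu"}))
registry.register(InferenceServer("srv-2", "https://inference-2.example.com", {"region": "us"}))

for server in registry.discover(region="eu"):
    print(server.server_id, server.endpoint)
```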
Metrics System
The Metrics System includes metrics collectors, local cluster-level metrics storage, and a centralized global metrics database. It collects and stores metrics from nodes, block instances, and vDAG controllers.
Documentation for Metrics System
Cluster Services
Cluster Controller
The Cluster Controller allocates blocks, provisions resources, monitors deployments and node health, and interacts with the Cluster Controller Gateway for initial scheduling.
Documentation for Cluster Controller
Cluster Local Metrics System
The Cluster Local Metrics System includes hardware metric daemons on each node, along with local databases to store metrics related to blocks, clusters, and vDAGs for efficient querying.
Documentation for Metrics System
Application Layer
Block
A Block is an instance of a component deployed using a container image built with the AIOS SDK. Blocks can be deployed on any cluster in the network. The SDK supports building custom computational logic for execution as a block. Each block includes services like auto-scaling, load balancing, health checking, and metrics integration for effective management and monitoring.
Adhoc Inference Server
Inference or general computational tasks can be submitted to blocks via inference servers. These servers can be deployed on any cluster and registered for public discovery through the registry.
Documentation for Adhoc Inference Server
vDAG and vDAG Controller
A vDAG (virtual Directed Acyclic Graph) spans multiple blocks across the network. The vDAG Controller enables submission of vDAG inference tasks and provides mechanisms for health checks, quota management, and quality audits via policies.
Documentation for vDAG Controller
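Because a vDAG is a directed acyclic graph of blocks, a valid execution order is any topological order of that graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to compute one; the block names and the dictionary layout are assumptions for this example, not the vDAG specification format.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(vdag):
    """Return block IDs in an order where every block runs after its inputs.

    `vdag` maps each block ID to the set of block IDs it consumes output from.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(vdag).static_order())


# Hypothetical three-block pipeline: decoder -> detector -> tracker.
vdag = {
    "decoder": set(),
    "detector": {"decoder"},
    "tracker": {"detector"},
}
order = execution_order(vdag)
print(order)

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("not a valid vDAG: the graph contains a cycle")
```

Rejecting cyclic graphs up front is the property that makes the "acyclic" part of the name enforceable before any task is submitted.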
LLM System
The LLM System supports deploying large language models as blocks using the AIOS LLM SDK, which builds upon the base AIOS SDK. It allows model splitting and deployment across multiple nodes or clusters using the vDAG pipeline parallelism approach.
Documentation for LLM Block Instance
Documentation for LLM Model Splitting and Deployment
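Pipeline parallelism assigns contiguous groups of model layers to different stages, each of which could then run as its own block in a vDAG. The following is a minimal sketch of that partitioning step only; `split_layers` is a hypothetical helper, not part of the AIOS LLM SDK, and it assumes all layers cost roughly the same.

```python
def split_layers(num_layers, num_stages):
    """Partition layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if num_stages <= 0 or num_stages > num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage_index in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage_index < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for stage_index, layers in enumerate(split_layers(num_layers=10, num_stages=3)):
    print(f"stage {stage_index}: layers {layers.start}..{layers.stop - 1}")
```

In practice the split would also account for per-layer memory and the GPUs available on each node, which this sketch ignores.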
Third-Party System Integration
Blocks can integrate with third-party services, whether external or internal to the network. These services can be deployed independently or co-located with blocks using init containers to ensure they start within the same cluster environment.
Documentation for Third-Party Blocks
Index:
1. Getting Started
1.1 Installation
1.2 Onboarding Cluster
1.3 Onboarding Node to a Cluster
2. Management Services
2.1 Parser
2.1.1 Parser Introduction
2.1.2 Creating a Block
2.1.3 Creating a Cluster
2.1.4 Creating a Component
2.1.5 Creating vDAGs
2.1.6 Executing Management Commands
2.1.7 Filtering and Search
2.2 Other Services
2.2.1 Policies-System
2.2.2 Cluster-Controller-Gateway
2.2.3 Resource Allocator
2.2.4 Failure Policy Executor
2.2.5 Tasks Tracking and Tasks DB
3. Registries
3.1 Specification Registries
3.1.1 Spec Store
3.1.2 Template Store
3.2 Assets, Container, and Components Registries
3.2.1 Assets Registry
3.2.2 Container Registry
3.2.3 Components Registry
3.3 Runtime Registries
3.3.1 Introduction
3.3.2 Clusters Registry
3.3.3 Block Registry
3.3.4 vDAG Registry
4. Cluster Services
4.1 Cluster Controller
4.2 Metrics System
5. Application Layer
5.1 Block
5.2 AIOS Instance SDK
5.3 Inference Server
5.4 vDAG Controller
5.5 LLM Ecosystem
5.5.1 Method-1: LLM SDK
5.5.2 Method-2: Model Splitting and Distributed Inference
5.6 Third-Party System Integration