Skip to content

Clusters registry:

Clusters registry contains the information about the on-boarded clusters in the network.

Here is the schema of the cluster entry in clusters registry:

const clusterSchema = new Schema({
    // Unique ID for the cluster
    id: { type: String, required: true, unique: true },

    // Optional region or network ID where the cluster is deployed
    regionId: { type: String, required: false },

    status: { type: String, required: true },

    // Aggregated and per-node information for all nodes in the cluster
    nodes: {
        // Total number of nodes in the cluster
        count: { type: Number, required: true },
        // Detailed info for each individual node
        nodeData: [{
            // Unique ID for the node
            id: { type: String, required: true },
            // GPU details for the node
            gpus: {
                // Number of GPUs in the node
                count: { type: Number, required: true },
                // Total GPU memory in MB
                memory: { type: Number, required: true },
                // List of GPU models with individual memory sizes
                gpus: [{
                    modelName: { type: String, required: true }, // GPU model name
                    memory: { type: Number, required: true }     // Memory per GPU in MB
                }],
                // Optional GPU features (e.g., CUDA versions)
                features: [String],
                // List of distinct GPU model names
                modelNames: [String]
            },
            // Virtual CPU details
            vcpus: {
                count: { type: Number, required: true } // Number of vCPUs in the node
            },
            // Total memory in MB
            memory: { type: Number, required: true },
            // Total swap space in MB
            swap: { type: Number, required: true },
            // Storage info per node
            storage: {
                disks: { type: Number, required: true }, // Number of disks
                size: { type: Number, required: true }   // Total storage size in MB
            },
            // Network interface stats per node
            network: {
                interfaces: { type: Number, required: true },   // Number of network interfaces
                txBandwidth: { type: Number, required: true },  // Transmit bandwidth (MBps)
                rxBandwidth: { type: Number, required: true }   // Receive bandwidth (MBps)
            }
        }]
    },

    // Total GPU stats across all nodes
    gpus: {
        count: { type: Number, required: true },   // Total number of GPUs in the cluster
        memory: { type: Number, required: true }   // Total GPU memory in MB
    },

    // Total vCPU count across the cluster
    vcpus: {
        count: { type: Number, required: true }
    },

    // Total memory across the cluster in MB
    memory: { type: Number, required: true },

    // Total swap space across the cluster in MB
    swap: { type: Number, required: true },

    // Aggregated storage details for the cluster
    storage: {
        disks: { type: Number, required: true }, // Total number of disks
        size: { type: Number, required: true }   // Total storage size in MB
    },

    // Aggregated network configuration
    network: {
        interfaces: { type: Number, required: true },  // Total number of interfaces
        txBandwidth: { type: Number, required: true }, // Total TX bandwidth
        rxBandwidth: { type: Number, required: true }  // Total RX bandwidth
    },

    // Configuration used by the cluster controller
    config: {
        type: new Schema({
            policyExecutorId: { type: String, required: false, default: "" },      // Optional custom policy executor ID
            policyExecutionMode: { type: String, required: false, default: "local" }, // Execution mode for policies
            customPolicySystem: { type: Schema.Types.Mixed, required: false },     // Any custom policy logic/plugin
            publicHostname: { type: String, required: true },                      // Public hostname for the cluster
            useGateway: { type: Boolean, required: false, default: true },         // Whether to use a gateway for access
            actionsPolicyMap: { type: Schema.Types.Mixed, required: false },       // Mapping for policy actions
            // URLs to internal/external services in the cluster
            urlMap: {
                controllerService: { type: String, required: true },     // URL for controller service
                metricsService: { type: String, required: true },        // URL for metrics collection
                blocksQuery: { type: String, required: true },           // URL for querying blockchain blocks
                publicGateway: { type: String, required: true },         // Public-facing gateway URL
                parameterUpdater: { type: String, required: true }       // URL for model/cluster parameter updates
            }
        }),
        required: true
    },

    // List of user-defined tags or labels
    tags: { type: [String], required: true },

    // Human-readable metadata about the cluster
    clusterMetadata: { 
        type: new Schema({
            name: { type: String, required: true },                   // Friendly name of the cluster
            description: { type: String, required: true },            // Purpose or use-case of the cluster
            owner: { type: String, required: true },                  // Who owns or manages the cluster
            email: { type: String, required: false },                 // Optional contact email
            countries: { type: [String], required: false },           // Countries associated with this cluster
            miscContactInfo: { type: Schema.Types.Mixed, required: false },  // Additional contact or support info
            additionalInfo: { type: Schema.Types.Mixed, required: false }    // Any extra metadata as needed
        }),
        required: true
    },

    // Reputation score or reliability indicator for the cluster (not yet used anywhere in the system)
    reputation: { type: Number, required: false }
});

Example:

{
  "id": "cluster-west-vision-001",
  "regionId": "us-west-2",
  "status": "live",
  "nodes": {
    "count": 2,
    "nodeData": [
      {
        "id": "node-1",
        "gpus": {
          "count": 2,
          "memory": 32768,
          "gpus": [
            { "modelName": "NVIDIA A100", "memory": 16384 },
            { "modelName": "NVIDIA A100", "memory": 16384 }
          ],
          "features": ["fp16", "tensor_cores"],
          "modelNames": ["NVIDIA A100"]
        },
        "vcpus": { "count": 32 },
        "memory": 131072,
        "swap": 8192,
        "storage": {
          "disks": 2,
          "size": 1048576
        },
        "network": {
          "interfaces": 2,
          "txBandwidth": 10000,
          "rxBandwidth": 10000
        }
      },
      {
        "id": "node-2",
        "gpus": {
          "count": 1,
          "memory": 16384,
          "gpus": [
            { "modelName": "NVIDIA V100", "memory": 16384 }
          ],
          "features": ["fp16"],
          "modelNames": ["NVIDIA V100"]
        },
        "vcpus": { "count": 16 },
        "memory": 65536,
        "swap": 4096,
        "storage": {
          "disks": 1,
          "size": 524288
        },
        "network": {
          "interfaces": 1,
          "txBandwidth": 5000,
          "rxBandwidth": 5000
        }
      }
    ]
  },
  "gpus": {
    "count": 3,
    "memory": 49152
  },
  "vcpus": {
    "count": 48
  },
  "memory": 196608,
  "swap": 12288,
  "storage": {
    "disks": 3,
    "size": 1572864
  },
  "network": {
    "interfaces": 3,
    "txBandwidth": 15000,
    "rxBandwidth": 15000
  },
  "config": {
    "policyExecutorId": "policy-exec-007",
    "policyExecutionMode": "local",
    "customPolicySystem": {
      "name": "AdvancedPolicyRunner",
      "version": "2.1.0"
    },
    "publicHostname": "cluster-west-vision-001.company.net",
    "useGateway": true,
    "actionsPolicyMap": {
      "onScaleUp": "evaluate-gpu-availability",
      "onFailure": "notify-admin"
    },

    // these fields are populated by the system:
    "urlMap": {
      "controllerService": "http://cluster-west-vision-001.company.net:32000/controller",
      "metricsService": "http://cluster-west-vision-001.company.net:32000/metrics",
      "blocksQuery": "http://cluster-west-vision-001.company.net:32000/blocks",
      "publicGateway": "http://cluster-west-vision-001.company.net:32000",
      "parameterUpdater": "http://cluster-west-vision-001.company.net:32000/mgmt"
    }
  },
  "tags": ["gpu", "production", "ml", "vision", "us-west"],
  "clusterMetadata": {
    "name": "Sample cluster",
    "description": "Dedicated to serving large-scale computer vision models in production.",
    "owner": "AI Infrastructure Team",
    "email": "[email protected]",
    "countries": ["USA", "Canada"],
    "miscContactInfo": {
      "pagerDuty": "https://sample-website/ai-clusters",
      "slack": "#ml-infra"
    },
    "additionalInfo": {

    }
  },
  "reputation": 94
}

Creating a cluster:

For creating the cluster, refer to the documentation of Parser.

Cluster registry APIs:

Endpoint: /clusters/:id
Method: GET
Description:
Fetches a single cluster document by its unique id.

Example curl Command:

curl -X GET http://<server-url>/clusters/cluster-west-vision-001

Endpoint: /clusters/:id
Method: PUT
Description:
Updates a cluster document by its id using the payload provided in the request body. The body should use MongoDB-style update syntax.

Example curl Command:

curl -X PUT http://<server-url>/clusters/cluster-west-vision-001 \
  -H "Content-Type: application/json" \
  -d '{
    "$set": {
      "tags": ["gpu", "updated"],
      "reputation": 97
    }
  }'

Endpoint: /clusters/:id
Method: DELETE
Description:
Deletes the cluster document with the specified id.

Example curl Command:

curl -X DELETE http://<server-url>/clusters/cluster-west-vision-001

Endpoint: /clusters/query
Method: POST
Description:
Queries cluster documents using a MongoDB-style filter provided in the request body. Supports standard MongoDB operators such as $eq, $gt, $in, etc.

Example curl Command:

curl -X POST http://<server-url>/clusters/query \
  -H "Content-Type: application/json" \
  -d '{
    "gpus.count": { "$gte": 2 },
    "clusterMetadata.countries": { "$in": ["USA"] }
  }'