
ICOS FL

ICOS FL is a federated learning framework for real-time resource monitoring and prediction within the ICOS ecosystem. It enables distributed training of LSTM models to predict system metrics such as CPU usage, memory consumption, and power usage across ICOS nodes.

Overview

ICOS FL uses federated learning to train machine learning models across multiple decentralized edge devices or servers while keeping data private: only model updates are shared, never the raw data. This approach is particularly valuable for resource monitoring in cloud-edge-IoT environments, where data privacy and sovereignty are crucial.

The framework is built on the Flower federated learning framework and includes the following main components:

  • Server components: Coordinate federated learning and aggregate model updates
  • Client components: Train local models on system metrics data collected from the device
  • LSTM Models: Neural networks for time series prediction of resource usage
  • Data Management components: Store and retrieve time series data efficiently

Key Features

  • Federated Learning: Train models across distributed nodes while keeping data local
  • Real-time Monitoring: Track CPU, memory, and power consumption metrics
  • LSTM Prediction: Forecast resource usage with configurable time windows
  • Privacy Preservation: Raw metrics never leave the node where they're collected
  • Adaptive Training: Support for heterogeneous devices with varying capabilities
  • Fault Tolerance: Ability to handle nodes joining or leaving the federation

Architecture

The ICOS FL architecture follows a federated client-server design pattern, built on the Flower framework:

+----------------------+                +---------------------+
|   ICOS Controller    |                |     ICOS Agent      |
|                      |                |                     |
| ┌------------------┐ |                | ┌-----------------┐ |
| │    SuperLink     │ |                | │    SuperNode    │ |
| │   (FL Server)    │<------------------>│   (FL Client)   │ |
| └--------┬---------┘ |                | └-------┬---------┘ |
|          │           |                |         │           |
| ┌--------▼---------┐ |                | ┌-------▼---------┐ |
| │ Telemetruum Hub  │ |                | │ Telemetruum Leaf│ |
| │                  │ |                | │                 │ |
| └------------------┘ |                | └-----------------┘ |
+----------------------+                +---------------------+

The architecture consists of these key components:

Server Side (ICOS Controller)

  • SuperLink: The federated learning server component that:
      • Coordinates the training process
      • Aggregates model updates from clients
      • Distributes the global model
      • Handles client selection and participation

  • Telemetruum Hub: Collects and aggregates telemetry data from all connected agents

Client Side (ICOS Agent)

  • SuperNode: The federated learning client component that:
      • Trains local models on device-specific data
      • Sends model updates to the server
      • Receives and applies global model updates
      • Performs local evaluation

  • Telemetruum Leaf: Collects system metrics from the local device

Data Management Layer

  • TimeSeriesData: A DataClay-based storage component that:
      • Maintains a sliding window of recent metrics
      • Provides efficient access to time series data
      • Supports preprocessing for LSTM input
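
The sliding-window behaviour described above can be illustrated with a minimal in-memory sketch. This is not the DataClay-backed TimeSeriesData component; the class name `SlidingWindow` and its methods are hypothetical and only show the idea of keeping the most recent samples and discarding the oldest.

```python
from collections import deque


class SlidingWindow:
    """Hypothetical in-memory stand-in for a sliding window of metric samples."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        # A deque with maxlen silently drops the oldest sample when full.
        self._values: deque = deque(maxlen=max_size)

    def append(self, value: float) -> None:
        """Record one new metric sample."""
        self._values.append(float(value))

    def latest(self, n: int) -> list:
        """Return up to n of the most recent samples, oldest first."""
        if n <= 0:
            return []
        return list(self._values)[-n:]

    def __len__(self) -> int:
        return len(self._values)


# Seven CPU samples pushed into a window that holds only five.
window = SlidingWindow(max_size=5)
for cpu_usage in [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0]:
    window.append(cpu_usage)
```

After the loop, the window holds the five most recent samples, and `window.latest(3)` returns the last three in chronological order, which is the form a windowed LSTM input would be built from.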

LSTM Model Architecture

The LSTM model architecture is designed specifically for time series forecasting:

+------------------+     +-------------------+     +---------------+
| Input Sequence   |     | LSTM Layer(s)     |     | Linear Layer  |
| [batch, 1, time] |---->| hidden_size units |---->| output_size=1 |
+------------------+     +-------------------+     +---------------+

This design enables:

  • Processing of sequential data with temporal dependencies
  • Capture of both short-term and long-term patterns
  • Prediction of future values based on historical trends

Configuration

ICOS FL can be configured through the following options in pyproject.toml:

[tool.flwr.app.config]
# Server configuration
num-server-rounds = 10
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2

# LSTM model configuration
hidden-layer-size = 10
time-step = 10
num-layers = 1

# Resource metric to monitor and predict
metric = "cpu_usage"

# Local training configuration
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001
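
To make the configuration table concrete, the following sketch parses the same keys with Python's TOML reader and runs a few sanity checks. The helper names `load_app_config` and `check_app_config` are hypothetical, and the checks are assumptions about what a sensible configuration looks like, not rules enforced by ICOS FL. It only parses the text; how the values reach the running app is left to Flower.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    # Python 3.10: assumes the third-party `tomli` backport is installed.
    import tomli as tomllib

PYPROJECT_TEXT = """
[tool.flwr.app.config]
num-server-rounds = 10
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2
hidden-layer-size = 10
time-step = 10
num-layers = 1
metric = "cpu_usage"
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001
"""


def load_app_config(toml_text: str) -> dict:
    """Return the [tool.flwr.app.config] table as a plain dict."""
    return tomllib.loads(toml_text)["tool"]["flwr"]["app"]["config"]


def check_app_config(config: dict) -> list:
    """Collect human-readable problems; an empty list means none were found."""
    problems = []
    if config["min-available-clients"] < config["min-fit-clients"]:
        problems.append("min-available-clients is lower than min-fit-clients")
    if not 0.0 < config["train-test-split"] < 1.0:
        problems.append("train-test-split must lie strictly between 0 and 1")
    if config["time-step"] < 1 or config["num-layers"] < 1:
        problems.append("time-step and num-layers must be at least 1")
    if config["learning-rate"] <= 0:
        problems.append("learning-rate must be positive")
    return problems


config = load_app_config(PYPROJECT_TEXT)
problems = check_app_config(config)
```

With the values shown in this section, `problems` is empty.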

Federated Learning Process

The federated learning process in ICOS FL follows these steps:

  1. Data Collection: System metrics (CPU, memory, power) are collected by Telemetruum Leaf
  2. Local Training: Each client trains LSTM models on local metrics data
  3. Model Sharing: Only model updates (not raw data) are sent to the server
  4. Aggregation: The server combines updates from multiple clients into a single model
  5. Distribution: The improved global model is sent back to all clients
  6. Prediction: Trained models predict future resource usage at each node
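
The aggregation step (step 4) can be sketched as a weighted average of client parameters. The source does not state which aggregation strategy ICOS FL uses; the sketch below assumes federated averaging (FedAvg), where each client's contribution is weighted by its number of local training examples. The function name `federated_average` and the flattened-list representation of weights are illustrative choices only.

```python
def federated_average(client_weights: list, num_examples: list) -> list:
    """Average flattened client weights, weighted by local example counts.

    Assumes FedAvg-style aggregation; each entry of client_weights is one
    client's parameters flattened into a list of floats.
    """
    if not client_weights or len(client_weights) != len(num_examples):
        raise ValueError("need exactly one example count per client")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    size = len(client_weights[0])
    if any(len(weights) != size for weights in client_weights):
        raise ValueError("all clients must send the same number of weights")

    # Weighted mean, computed independently for every parameter position.
    return [
        sum(weights[i] * n for weights, n in zip(client_weights, num_examples)) / total
        for i in range(size)
    ]


# Two clients: the second trained on three times as much data,
# so the global weights end up closer to its values.
global_weights = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    num_examples=[100, 300],
)
```

Only the weight lists and example counts cross the network in this scheme; the metric samples used to produce them stay on each node.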

Integration with ICOS

ICOS FL integrates with other ICOS components:

  • Telemetruum: For collecting and storing system metrics
  • DataClay: For distributed data management across the continuum
  • Intelligence Coordination: For model registry and deployment
  • Continuum Management: For infrastructure awareness and optimization

The resource predictions provided by ICOS FL can be used by other ICOS components for:

  • Anomaly Detection: Identify unusual resource usage patterns
  • Capacity Planning: Forecast future resource requirements
  • Energy Optimization: Optimize power consumption based on predictions
  • Load Balancing: Distribute workloads based on predicted resource availability
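
As one illustration of how a consumer might use the predictions, the sketch below flags samples whose observed value deviates strongly from the predicted one. This is a hypothetical residual-threshold rule, not a documented ICOS component; the function name `flag_anomalies` and the threshold `k` are assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(observed: list, predicted: list, k: float = 2.0) -> list:
    """Return indices whose residual lies more than k standard deviations
    away from the mean residual (observed minus predicted)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    residuals = [o - p for o, p in zip(observed, predicted)]
    if len(residuals) < 2:
        return []  # not enough samples to estimate a spread
    centre = mean(residuals)
    spread = stdev(residuals)
    if spread == 0:
        return []  # all residuals identical, nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r - centre) > k * spread]


# Synthetic CPU usage with one spike at index 4.
observed = [50.0, 52.0, 51.0, 49.0, 95.0, 50.0, 51.0, 50.0]
predicted = [50.0] * len(observed)
anomalies = flag_anomalies(observed, predicted, k=2.0)
```

A single large outlier inflates the standard deviation, so a robust measure of spread may be preferable for real workloads.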

LSTM Models for Time Series Prediction

ICOS FL uses Long Short-Term Memory (LSTM) neural networks specifically designed for time series forecasting:

import torch
import torch.nn as nn


class LSTMModel(nn.Module):
    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        output_size: int = 1,
    ) -> None:
        super().__init__()

        self.hidden_layer_size = hidden_layer_size
        self.time_step = time_step
        self.num_layers = num_layers

        # LSTM layer: each window of time_step values is passed as the
        # feature vector of a single sequence step
        self.lstm = nn.LSTM(time_step, hidden_layer_size, num_layers, batch_first=True)

        # Linear layer to produce output prediction
        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        # input_seq has shape [batch, 1, time_step]
        lstm_out, _ = self.lstm(input_seq)
        # Use the output of the last sequence step for the prediction
        predictions = self.linear(lstm_out[:, -1, :])
        return predictions

Advanced Features

Custom Model Implementations

ICOS FL allows for custom model implementations that extend the base LSTM model:

import torch
import torch.nn as nn
from icos_fl.models.lstm import LSTMModel

class LSTMWithDropout(LSTMModel):
    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        dropout_rate: float = 0.2,
        output_size: int = 1,
    ) -> None:
        super().__init__(hidden_layer_size, time_step, num_layers, output_size)

        # Replace the existing LSTM with one that has dropout.
        # nn.LSTM applies dropout only between stacked layers and warns
        # when it is set with num_layers == 1, so disable it in that case.
        self.lstm = nn.LSTM(
            time_step,
            hidden_layer_size,
            num_layers,
            batch_first=True,
            dropout=dropout_rate if num_layers > 1 else 0.0,
        )

        # Add dropout before the linear layer
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(input_seq)
        lstm_out = self.dropout(lstm_out[:, -1, :])
        predictions = self.linear(lstm_out)
        return predictions

Multi-Step Forecasting

ICOS FL supports forecasting multiple steps into the future using a recursive approach:

import torch


def predict_multi_step(model, initial_sequence, steps=5):
    """Predict multiple steps ahead using a recursive approach.

    Expects initial_sequence of shape [1, 1, time_step], matching the
    model's input layout.
    """
    predictions = []
    curr_seq = initial_sequence.clone()
    model.eval()  # disable dropout during inference

    with torch.no_grad():
        for _ in range(steps):
            # Get next prediction, shape [1, 1]
            next_pred = model(curr_seq)

            # Add prediction to results
            predictions.append(next_pred.item())

            # Slide the window along the last (time) axis: drop the oldest
            # value and append the prediction
            curr_seq = torch.cat((curr_seq[:, :, 1:], next_pred.unsqueeze(1)), dim=2)

    return predictions

Data Processing Pipeline

The framework includes a comprehensive data processing pipeline that handles:

  • Time series normalization and standardization
  • Sequence creation with configurable window sizes
  • Train/validation splitting
  • Data loading with appropriate batching
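
The pipeline stages listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's implementation: the function names are hypothetical, min-max scaling stands in for whichever normalization ICOS FL applies, and the split is chronological so that later samples never appear in the training set.

```python
def min_max_normalize(values: list) -> tuple:
    """Scale values into [0, 1]; also return the bounds for later inversion."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return [0.0 for _ in values], low, high
    return [(v - low) / span for v in values], low, high


def make_windows(values: list, time_step: int) -> tuple:
    """Pair each run of time_step values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(values) - time_step):
        inputs.append(values[start:start + time_step])
        targets.append(values[start + time_step])
    return inputs, targets


def chronological_split(inputs: list, targets: list, train_fraction: float) -> tuple:
    """Split without shuffling so the test set lies strictly after the training set."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


def batches(inputs: list, targets: list, batch_size: int):
    """Yield consecutive (inputs, targets) batches."""
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size], targets[start:start + batch_size]


raw = [float(v) for v in range(20, 40)]  # 20 synthetic CPU samples
scaled, low, high = min_max_normalize(raw)
inputs, targets = make_windows(scaled, time_step=10)
(train_x, train_y), (test_x, test_y) = chronological_split(inputs, targets, 0.8)
train_batches = list(batches(train_x, train_y, batch_size=4))
```

Each window here is a list of `time_step` values; to match the `[batch, 1, time]` input layout shown in the architecture section, a batch of windows would be converted to a tensor and given an extra axis of size 1. For simplicity the sketch computes the scaling bounds over the whole series; computing them on the training portion only avoids leaking information from the test set.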

Installation and Requirements

ICOS FL requires:

  • Python 3.10 or newer
  • PyTorch 2.5.1
  • Flower 1.17.0 or newer
  • Pandas 2.2.3 or newer
  • DataClay 4.0.0

Install from source

git clone https://github.com/icos-project/icos-fl.git
cd icos-fl
pip install -e .
