ICOS FL

ICOS FL is a federated learning framework for real-time resource monitoring and prediction within the ICOS ecosystem. It enables distributed training of LSTM models to predict system metrics such as CPU usage, memory consumption, and power usage across ICOS nodes.

Overview

ICOS FL uses federated learning to train machine learning models across multiple decentralized edge devices or servers while keeping data private: only model updates are shared, never the raw metrics. This approach is particularly valuable for resource monitoring in cloud-edge-IoT environments, where data privacy and sovereignty are crucial.

The framework is built on the Flower federated learning framework and includes the following main components:

Server components: Coordinate federated learning and aggregate model updates
Client components: Train local models on system metrics data collected from the device
LSTM Models: Neural networks for time series prediction of resource usage
Data Management components: Store and retrieve time series data efficiently

Key Features

Federated Learning: Train models across distributed nodes while keeping data local
Real-time Monitoring: Track CPU, memory, and power consumption metrics
LSTM Prediction: Forecast resource usage with configurable time windows
Privacy Preservation: Raw metrics never leave the node where they're collected
Adaptive Training: Support for heterogeneous devices with varying capabilities
Fault Tolerance: Ability to handle nodes joining or leaving the federation

Architecture

The ICOS FL architecture follows a federated client-server design pattern, built on the Flower framework:

+----------------------+                +---------------------+
|   ICOS Controller    |                |     ICOS Agent      |
|                      |                |                     |
| ┌------------------┐ |                | ┌-----------------┐ |
| │    SuperLink     │ |                | │    SuperNode    │ |
| │   (FL Server)    │<------------------>│   (FL Client)   │ |
| └--------┬---------┘ |                | └-------┬---------┘ |
|          │           |                |         │           |
| ┌--------▼---------┐ |                | ┌-------▼---------┐ |
| │ Telemetruum Hub  │ |                | │ Telemetruum Leaf│ |
| │                  │ |                | │                 │ |
| └------------------┘ |                | └-----------------┘ |
+----------------------+                +---------------------+

The architecture consists of these key components:

Server Side (ICOS Controller)

SuperLink: The federated learning server component. It coordinates the training process, aggregates model updates from clients, distributes the global model, and handles client selection and participation.
Telemetruum Hub: Collects and aggregates telemetry data from all connected agents

Client Side (ICOS Agent)

SuperNode: The federated learning client component. It trains local models on device-specific data, sends model updates to the server, receives and applies global model updates, and performs local evaluation.
Telemetruum Leaf: Collects system metrics from the local device

Data Management Layer

TimeSeriesData: A DataClay-based storage component. It maintains a sliding window of recent metrics, provides efficient access to time series data, and supports preprocessing for LSTM input.
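The sliding-window behaviour described above can be illustrated with a minimal in-memory sketch. This is not the DataClay-based TimeSeriesData component: the class name `SlidingWindowBuffer` and its methods are hypothetical and only show the idea of keeping a bounded number of recent samples.

```python
from collections import deque
from typing import Deque, List


class SlidingWindowBuffer:
    """Hypothetical in-memory stand-in for a sliding window of recent metrics."""

    def __init__(self, max_points: int) -> None:
        if max_points <= 0:
            raise ValueError("max_points must be positive")
        # A deque with maxlen drops the oldest sample automatically when full.
        self._points: Deque[float] = deque(maxlen=max_points)

    def append(self, value: float) -> None:
        """Store one new metric sample."""
        self._points.append(float(value))

    def latest(self, n: int) -> List[float]:
        """Return up to the n most recent samples, oldest first."""
        if n <= 0:
            return []
        return list(self._points)[-n:]

    def __len__(self) -> int:
        return len(self._points)


# Seven samples pushed into a window that holds only five.
buffer = SlidingWindowBuffer(max_points=5)
for sample in [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0]:
    buffer.append(sample)

recent = buffer.latest(3)
```

Bounding the window keeps memory use constant on constrained edge devices, at the cost of discarding older history.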

LSTM Model Architecture

The LSTM model architecture is designed specifically for time series forecasting:

+------------------+     +-------------------+     +---------------+
| Input Sequence   |     | LSTM Layer(s)     |     | Linear Layer  |
| [batch, 1, time] |---->| hidden_size units |---->| output_size=1 |
+------------------+     +-------------------+     +---------------+

This design enables:

Processing of sequential data with temporal dependencies
Capture of both short-term and long-term patterns
Prediction of future values based on historical trends

Configuration

ICOS FL can be configured through the following options in pyproject.toml:

[tool.flwr.app.config]
# Server configuration
num-server-rounds = 10
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2

# LSTM model configuration
hidden-layer-size = 10
time-step = 10
num-layers = 1

# Resource metric to monitor and predict
metric = "cpu_usage"

# Training configuration
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001

Federated Learning Process

The federated learning process in ICOS FL follows these steps:

Data Collection: System metrics (CPU, memory, power) are collected by Telemetruum Leaf
Local Training: Each client trains LSTM models on local metrics data
Model Sharing: Only model updates (not raw data) are sent to the server
Aggregation: The server combines updates from multiple clients into a single model
Distribution: The improved global model is sent back to all clients
Prediction: Trained models predict future resource usage at each node
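The aggregation step can be sketched as a sample-weighted average of client parameters, in the style of federated averaging (FedAvg). The source does not state which aggregation strategy ICOS FL configures, so this is an assumption; the function name `federated_average` and the flattened-list representation of weights are also hypothetical simplifications.

```python
from typing import List, Sequence, Tuple

Weights = List[float]  # one flattened parameter vector per client


def federated_average(updates: Sequence[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its number of samples."""
    if not updates:
        raise ValueError("at least one client update is required")

    total_examples = sum(num_examples for _, num_examples in updates)
    if total_examples <= 0:
        raise ValueError("total number of examples must be positive")

    num_params = len(updates[0][0])
    aggregated = [0.0] * num_params

    for weights, num_examples in updates:
        if len(weights) != num_params:
            raise ValueError("all clients must send the same number of parameters")
        # Each client contributes in proportion to its share of the data.
        share = num_examples / total_examples
        for i, value in enumerate(weights):
            aggregated[i] += share * value

    return aggregated


client_updates = [
    ([0.0, 2.0], 100),  # client A trained on 100 samples
    ([1.0, 4.0], 300),  # client B trained on 300 samples
]
global_weights = federated_average(client_updates)
```

Weighting by sample count means a node with more local metrics has more influence on the global model; only the parameter vectors and counts are exchanged, never the metrics themselves.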

Integration with ICOS

ICOS FL integrates with other ICOS components:

Telemetruum: For collecting and storing system metrics
DataClay: For distributed data management across the continuum
Intelligence Coordination: For model registry and deployment
Continuum Management: For infrastructure awareness and optimization

The resource predictions provided by ICOS FL can be used by other ICOS components for:

Anomaly Detection: Identify unusual resource usage patterns
Capacity Planning: Forecast future resource requirements
Energy Optimization: Optimize power consumption based on predictions
Load Balancing: Distribute workloads based on predicted resource availability

LSTM Models for Time Series Prediction

ICOS FL uses Long Short-Term Memory (LSTM) neural networks specifically designed for time series forecasting:

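import torch
import torch.nn as nn


class LSTMModel(nn.Module):
    """LSTM followed by a linear layer that predicts the next metric value."""

    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        output_size: int = 1,
    ) -> None:
        super().__init__()

        self.hidden_layer_size = hidden_layer_size
        self.time_step = time_step
        self.num_layers = num_layers

        # LSTM layer: the whole window of `time_step` values is fed as the
        # feature vector of a single sequence step, so input_size=time_step
        self.lstm = nn.LSTM(time_step, hidden_layer_size, num_layers, batch_first=True)

        # Linear layer to produce output prediction
        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        # input_seq: [batch, 1, time_step] -> lstm_out: [batch, 1, hidden_layer_size]
        lstm_out, _ = self.lstm(input_seq)
        # Use the output of the last sequence step for the prediction
        predictions = self.linear(lstm_out[:, -1, :])
        return predictions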

Advanced Features

Custom Model Implementations

ICOS FL allows for custom model implementations that extend the base LSTM model:

import torch
import torch.nn as nn
from icos_fl.models.lstm import LSTMModel


class LSTMWithDropout(LSTMModel):
    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        dropout_rate: float = 0.2,
        output_size: int = 1,
    ) -> None:
        super().__init__(hidden_layer_size, time_step, num_layers, output_size)

        # Replace the existing LSTM with one that has dropout.
        # nn.LSTM applies dropout only between stacked layers, so it is
        # disabled for a single-layer model to avoid a PyTorch warning.
        self.lstm = nn.LSTM(
            time_step,
            hidden_layer_size,
            num_layers,
            batch_first=True,
            dropout=dropout_rate if num_layers > 1 else 0.0,
        )

        # Add dropout before the linear layer
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(input_seq)
        # Apply dropout to the output of the last sequence step
        lstm_out = self.dropout(lstm_out[:, -1, :])
        predictions = self.linear(lstm_out)
        return predictions

Multi-Step Forecasting

ICOS FL supports forecasting multiple steps into the future using a recursive approach:

import torch


def predict_multi_step(model, initial_sequence, steps=5):
    """Predict multiple steps ahead using a recursive approach.

    initial_sequence is expected to have shape [1, 1, time_step],
    matching the [batch, 1, time] input of the LSTM model.
    """
    predictions = []
    curr_seq = initial_sequence.clone()

    model.eval()  # disable dropout during inference
    for _ in range(steps):
        # Get next prediction, shape [1, 1]
        with torch.no_grad():
            next_pred = model(curr_seq)

        # Add prediction to results
        predictions.append(next_pred.item())

        # Update sequence by removing the oldest value and appending the
        # prediction along the last (time) dimension
        curr_seq = torch.cat((curr_seq[:, :, 1:], next_pred.view(1, 1, 1)), dim=2)

    return predictions

Data Processing Pipeline

The framework includes a comprehensive data processing pipeline that handles:

Time series normalization and standardization
Sequence creation with configurable window sizes
Train/validation splitting
Data loading with appropriate batching
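The four stages above can be sketched in plain Python. This is not the framework's actual pipeline: the function names are hypothetical, min-max scaling is assumed as the normalization method, the split is assumed to be chronological, and the series is synthetic.

```python
from typing import Iterator, List, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(values: List[float], time_step: int) -> Tuple[List[List[float]], List[float]]:
    """Pair each window of `time_step` values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(values) - time_step):
        inputs.append(values[start:start + time_step])
        targets.append(values[start + time_step])
    return inputs, targets


def chronological_split(inputs: list, targets: list, train_fraction: float):
    """Keep the earliest windows for training and the latest for testing."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


def batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


series = [float(v) for v in range(20, 45)]  # 25 synthetic CPU readings
scaled = min_max_normalize(series)
windows, targets = make_windows(scaled, time_step=10)
(train_x, train_y), (test_x, test_y) = chronological_split(windows, targets, 0.8)
train_batches = list(batches(train_x, batch_size=5))
```

For brevity the scaling bounds here are computed over the whole series; in practice they would usually be derived from the training portion only, so that no information from the test windows leaks into training.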

Installation and Requirements

ICOS FL requires:

Python 3.10 or newer
PyTorch 2.5.1
Flower 1.17.0 or newer
Pandas 2.2.3 or newer
DataClay 4.0.0

Install from source:

git clone https://github.com/icos-project/icos-fl.git
cd icos-fl
pip install -e .