ICOS FL¶
ICOS FL is a federated learning framework for real-time resource monitoring and prediction within the ICOS ecosystem. It enables distributed training of LSTM models to predict system metrics such as CPU usage, memory consumption, and power usage across ICOS nodes.
Overview¶
ICOS FL uses federated learning to train machine learning models across multiple decentralized edge devices or servers while keeping data private: only model updates are shared. This approach is particularly valuable for resource monitoring in cloud-edge-IoT environments where data privacy and sovereignty are crucial.
ICOS FL is built on the Flower federated learning framework and includes the following main components:
- Server components: Coordinate federated learning and aggregate model updates
- Client components: Train local models on system metrics data collected from the device
- LSTM Models: Neural networks for time series prediction of resource usage
- Data Management components: Store and retrieve time series data efficiently
Key Features¶
- Federated Learning: Train models across distributed nodes while keeping data local
- Real-time Monitoring: Track CPU, memory, and power consumption metrics
- LSTM Prediction: Forecast resource usage with configurable time windows
- Privacy Preservation: Raw metrics never leave the node where they're collected
- Adaptive Training: Support for heterogeneous devices with varying capabilities
- Fault Tolerance: Ability to handle nodes joining or leaving the federation
Architecture¶
The ICOS FL architecture follows a federated client-server design pattern, built on the Flower framework:
+----------------------+      +----------------------+
|   ICOS Controller    |      |      ICOS Agent      |
|                      |      |                      |
| ┌------------------┐ |      | ┌------------------┐ |
| │    SuperLink     │ |      | │    SuperNode     │ |
| │   (FL Server)    │<-------->│   (FL Client)    │ |
| └--------┬---------┘ |      | └--------┬---------┘ |
|          │           |      |          │           |
| ┌--------▼---------┐ |      | ┌--------▼---------┐ |
| │ Telemetruum Hub  │ |      | │ Telemetruum Leaf │ |
| │                  │ |      | │                  │ |
| └------------------┘ |      | └------------------┘ |
+----------------------+      +----------------------+
The architecture consists of these key components:
Server Side (ICOS Controller)¶
- SuperLink: The federated learning server component that:
  - Coordinates the training process
  - Aggregates model updates from clients
  - Distributes the global model
  - Handles client selection and participation
- Telemetruum Hub: Collects and aggregates telemetry data from all connected agents
Client Side (ICOS Agent)¶
- SuperNode: The federated learning client component that:
  - Trains local models on device-specific data
  - Sends model updates to the server
  - Receives and applies global model updates
  - Performs local evaluation
- Telemetruum Leaf: Collects system metrics from the local device
Data Management Layer¶
- TimeSeriesData: A DataClay-based storage component that:
  - Maintains a sliding window of recent metrics
  - Provides efficient access to time series data
  - Supports preprocessing for LSTM input
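The sliding-window behaviour described above can be illustrated without DataClay. The sketch below is a minimal in-memory stand-in, not the actual TimeSeriesData class; the name `SlidingWindowBuffer` and its methods are hypothetical and chosen only for illustration.

```python
from collections import deque


class SlidingWindowBuffer:
    """Hypothetical in-memory stand-in for a sliding window of metric samples."""

    def __init__(self, max_samples: int) -> None:
        # A deque with maxlen drops the oldest sample automatically when full.
        self._samples: deque[float] = deque(maxlen=max_samples)

    def append(self, value: float) -> None:
        self._samples.append(float(value))

    def latest(self, n: int) -> list[float]:
        """Return the n most recent samples, oldest first."""
        if n <= 0:
            return []
        return list(self._samples)[-n:]

    def __len__(self) -> int:
        return len(self._samples)


buffer = SlidingWindowBuffer(max_samples=5)
for cpu_usage in [10, 20, 30, 40, 50, 60, 70]:
    buffer.append(cpu_usage)

# Only the five newest samples are retained; take the last three as a window.
window = buffer.latest(3)
```

A bounded buffer like this keeps memory use constant on constrained edge devices, at the cost of discarding older history.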
LSTM Model Architecture¶
The LSTM model architecture is designed specifically for time series forecasting:
+------------------+     +-------------------+     +---------------+
|  Input Sequence  |     |   LSTM Layer(s)   |     | Linear Layer  |
| [batch, 1, time] |---->| hidden_size units |---->| output_size=1 |
+------------------+     +-------------------+     +---------------+
This design enables:
- Processing of sequential data with temporal dependencies
- Capture of both short-term and long-term patterns
- Prediction of future values based on historical trends
Configuration¶
ICOS FL can be configured through the following options in pyproject.toml:
[tool.flwr.app.config]
# Server configuration
num-server-rounds = 10
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2
# LSTM model configuration
hidden-layer-size = 10
time-step = 10
num-layers = 1
# Resource metric to monitor and predict
metric = "cpu_usage"
# Training configuration
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001
Federated Learning Process¶
The federated learning process in ICOS FL follows these steps:
- Data Collection: System metrics (CPU, memory, power) are collected by Telemetruum Leaf
- Local Training: Each client trains LSTM models on local metrics data
- Model Sharing: Only model updates (not raw data) are sent to the server
- Aggregation: The server combines updates from multiple clients into a single model
- Distribution: The improved global model is sent back to all clients
- Prediction: Trained models predict future resource usage at each node
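The aggregation step can be sketched as a weighted average of client parameters, in the style of federated averaging (FedAvg). The source does not state which aggregation strategy ICOS FL configures, so this is an assumption; the function name `federated_average` and the flat-list representation of model weights are simplifications for illustration.

```python
def federated_average(
    client_updates: list[tuple[list[float], int]],
) -> list[float]:
    """Average client weight vectors, weighting each by its number of examples.

    Each update is a (weights, num_examples) pair; all weight vectors are
    assumed to have the same length.
    """
    total_examples = sum(num_examples for _, num_examples in client_updates)
    if total_examples == 0:
        raise ValueError("no training examples reported by any client")

    num_weights = len(client_updates[0][0])
    aggregated = [0.0] * num_weights
    for weights, num_examples in client_updates:
        share = num_examples / total_examples  # this client's contribution
        for i, w in enumerate(weights):
            aggregated[i] += share * w
    return aggregated


# Two hypothetical clients: the second trained on three times as much data.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 6.0], 300),
]
global_weights = federated_average(updates)
```

Weighting by example count means a client with more local data pulls the global model further towards its own parameters, while raw metrics still never leave the node.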
Integration with ICOS¶
ICOS FL integrates with other ICOS components:
- Telemetruum: For collecting and storing system metrics
- DataClay: For distributed data management across the continuum
- Intelligence Coordination: For model registry and deployment
- Continuum Management: For infrastructure awareness and optimization
The resource predictions provided by ICOS FL can be used by other ICOS components for:
- Anomaly Detection: Identify unusual resource usage patterns
- Capacity Planning: Forecast future resource requirements
- Energy Optimization: Optimize power consumption based on predictions
- Load Balancing: Distribute workloads based on predicted resource availability
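As a rough illustration of the load-balancing use case, a consumer of the predictions might pick the node with the lowest forecast CPU usage. The source does not describe how other components query predictions, so the function name `pick_least_loaded`, the node identifiers, and the 90% cut-off below are all hypothetical.

```python
from typing import Optional


def pick_least_loaded(
    predicted_cpu: dict[str, float],
    max_cpu_percent: float = 90.0,
) -> Optional[str]:
    """Return the node with the lowest predicted CPU usage.

    Nodes predicted at or above max_cpu_percent are skipped; None is
    returned when no node qualifies.
    """
    candidates = {
        node: usage
        for node, usage in predicted_cpu.items()
        if usage < max_cpu_percent
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical per-node CPU forecasts in percent.
predicted_cpu = {"node-a": 72.5, "node-b": 41.0, "node-c": 88.0}
target = pick_least_loaded(predicted_cpu)
```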
LSTM Models for Time Series Prediction¶
ICOS FL uses Long Short-Term Memory (LSTM) neural networks specifically designed for time series forecasting:
import torch
import torch.nn as nn


class LSTMModel(nn.Module):
    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        output_size: int = 1,
    ) -> None:
        super().__init__()
        self.hidden_layer_size = hidden_layer_size
        self.time_step = time_step
        self.num_layers = num_layers
        # LSTM layer: each window of `time_step` values is fed as the feature
        # vector of a single sequence step, so inputs are [batch, 1, time_step]
        self.lstm = nn.LSTM(time_step, hidden_layer_size, num_layers, batch_first=True)
        # Linear layer to produce output prediction
        self.linear = nn.Linear(hidden_layer_size, output_size)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(input_seq)
        # Use the output of the last sequence step for the prediction
        predictions = self.linear(lstm_out[:, -1, :])
        return predictions
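The tensor shapes involved can be checked with plain PyTorch layers configured the same way as in the model above. This is a small shape walk-through under the assumption that inputs follow the `[batch, 1, time]` layout from the architecture diagram; the batch size and random input are arbitrary.

```python
import torch
import torch.nn as nn

time_step, hidden_size, batch_size = 10, 10, 4

# Same layer configuration as the model: the window length is the input size.
lstm = nn.LSTM(time_step, hidden_size, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_size, 1)

# One sequence step per sample, carrying the whole window as features.
x = torch.randn(batch_size, 1, time_step)   # [batch, 1, time_step]
lstm_out, _ = lstm(x)                       # [batch, 1, hidden_size]
pred = linear(lstm_out[:, -1, :])           # [batch, 1]
```

Because the sequence dimension has length 1, the LSTM sees each window as a single feature vector rather than stepping through it value by value.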
Advanced Features¶
Custom Model Implementations¶
ICOS FL allows for custom model implementations that extend the base LSTM model:
import torch
import torch.nn as nn

from icos_fl.models.lstm import LSTMModel


class LSTMWithDropout(LSTMModel):
    def __init__(
        self,
        hidden_layer_size: int,
        time_step: int,
        num_layers: int,
        dropout_rate: float = 0.2,
        output_size: int = 1,
    ) -> None:
        super().__init__(hidden_layer_size, time_step, num_layers, output_size)
        # Replace the existing LSTM with one that has dropout.
        # Note: nn.LSTM applies dropout only between stacked layers, so it
        # has no effect (and PyTorch warns) when num_layers is 1.
        self.lstm = nn.LSTM(
            time_step,
            hidden_layer_size,
            num_layers,
            batch_first=True,
            dropout=dropout_rate,
        )
        # Add dropout before the linear layer
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, input_seq: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(input_seq)
        lstm_out = self.dropout(lstm_out[:, -1, :])
        predictions = self.linear(lstm_out)
        return predictions
Multi-Step Forecasting¶
ICOS FL supports forecasting multiple steps into the future using a recursive approach:
import torch


def predict_multi_step(model, initial_sequence, steps=5):
    """Predict multiple steps ahead using recursive approach.

    Expects initial_sequence of shape [1, 1, time_step] and output_size=1.
    """
    predictions = []
    curr_seq = initial_sequence.clone()
    model.eval()  # disable dropout during inference
    for _ in range(steps):
        # Get next prediction, shape [1, 1]
        with torch.no_grad():
            next_pred = model(curr_seq)
        # Add prediction to results
        predictions.append(next_pred.item())
        # Update sequence by removing oldest value and adding prediction
        # along the last (time) dimension
        curr_seq = torch.cat((curr_seq[:, :, 1:], next_pred.view(1, 1, 1)), dim=2)
    return predictions
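The recursive idea is independent of PyTorch and can be shown with plain lists: each forecast is appended to the window and the oldest value is dropped before predicting again. In the sketch below, `recursive_forecast` and the toy predictor `last_plus_one` are hypothetical names; the toy predictor merely stands in for a trained model.

```python
from typing import Callable, Sequence


def recursive_forecast(
    predict_next: Callable[[Sequence[float]], float],
    initial_window: Sequence[float],
    steps: int,
) -> list[float]:
    """Forecast `steps` values by feeding each prediction back as input."""
    window = list(initial_window)
    forecasts: list[float] = []
    for _ in range(steps):
        next_value = predict_next(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return forecasts


def last_plus_one(window: Sequence[float]) -> float:
    """Toy stand-in for a trained model: continue a unit-slope trend."""
    return window[-1] + 1.0


forecasts = recursive_forecast(last_plus_one, [1.0, 2.0, 3.0], steps=3)
```

Since every step after the first consumes earlier predictions rather than observed values, errors can accumulate as the forecast horizon grows.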
Data Processing Pipeline¶
The framework includes a comprehensive data processing pipeline that handles:
- Time series normalization and standardization
- Sequence creation with configurable window sizes
- Train/validation splitting
- Data loading with appropriate batching
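The first three stages above can be sketched in plain Python. The source does not specify the normalization method or how the split is drawn, so min-max scaling and a chronological split are assumptions here, and the function names are hypothetical.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values to the range [0, 1]; a constant series maps to zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def make_windows(
    values: list[float], time_step: int
) -> tuple[list[list[float]], list[float]]:
    """Pair each window of `time_step` values with the value that follows it."""
    inputs, targets = [], []
    for start in range(len(values) - time_step):
        inputs.append(values[start:start + time_step])
        targets.append(values[start + time_step])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction: float):
    """Split without shuffling so the held-out data lies after the training data."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


series = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]  # e.g. CPU usage samples
scaled = min_max_normalize(series)
inputs, targets = make_windows(scaled, time_step=2)
train, test = chronological_split(inputs, targets, train_fraction=0.8)
```

In practice, the scaling bounds would usually be computed on the training portion only, so that no information from the held-out data leaks into training.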
Installation and Requirements¶
ICOS FL requires:
- Python 3.10 or newer
- PyTorch 2.5.1
- Flower 1.17.0 or newer
- Pandas 2.2.3 or newer
- DataClay 4.0.0
Install from source