Troubleshooting¶
This section collects some of the most frequent errors and issues that might appear when managing an ICOS Continuum.
On-Boarding¶
New ICOS Workers do not appear as resources in the ICOS Continuum¶
After the installation of a new ICOS Worker, it should be possible to see it in the Telemetry Dashboards or using the `icos-shell get resource` command.
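For instance, from a machine where the ICOS Shell CLI is configured, the check is simply the following command (its output, omitted here, should list the new Worker among the resources):
❯ icos-shell get resource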
If the new Worker does not show up, the first possible cause is that telemetry from the new ICOS Worker is not flowing correctly from the Worker to the ICOS Agent and from the ICOS Agent to the ICOS Controller.
As a first step to debug this issue, connect to the ICOS Worker and retrieve the logs of the pods whose name contains `...-telemetruum-leaf-otel-node-agent-...` (for the Kubernetes version) or of the container named `otelcol` (for the Docker version), and check whether there are exceptions or error messages. The following is a normal log of the telemetry containers:
2025-02-21T07:06:17.490Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 110, "data points": 477}
2025-02-21T07:06:23.910Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 65, "metrics": 725, "data points": 781}
2025-02-21T07:06:27.719Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 12, "data points": 50}
2025-02-21T07:06:33.953Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 44, "data points": 1879}
2025-02-21T07:07:10.216Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 6, "data points": 7}
2025-02-21T07:07:17.435Z info Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 110, "data points": 477}
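These logs can be retrieved with commands along the following lines (a sketch: the `icos-system` namespace and the exact pod name are assumptions and may differ in your installation):
❯ kubectl get pods -n icos-system | grep telemetruum-leaf-otel-node-agent   # Kubernetes Worker: find the leaf collector pod
❯ kubectl logs -n icos-system <pod-name> --tail=100                         # then tail its logs
❯ docker logs --tail 100 otelcol                                            # Docker Worker: tail the otelcol container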
Typical errors are related to the network, like `Connection refused`, `No such host`, `Timeout`, `Invalid Certificate` or `Expired Certificate`. These indicate that the Worker is not able to connect and send data to the Telemetry Gateway in the ICOS Agent.
This is a typical error log with network errors:
2025-02-21T06:18:27.835Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/gateway", "error": "failed to make an HTTP request: Post \"https://telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io/v1/metrics\": dial tcp: lookup telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io on 10.43.0.10:53: no such host", "interval": "4.277566092s"}
2025-02-21T06:18:29.669Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/gateway", "error": "failed to make an HTTP request: Post \"https://telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io/v1/metrics\": dial tcp: lookup telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io on 10.43.0.10:53: no such host", "interval": "11.172065255s"}
2025-02-21T06:18:31.310Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/gateway", "error": "failed to make an HTTP request: Post \"https://telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io/v1/metrics\": dial tcp: lookup telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io on 10.43.0.10:53: no such host", "interval": "9.043482127s"}
2025-02-21T07:03:23.988Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/gateway", "error": "failed to make an HTTP request: Post \"https://telemetry.ocm-agent-1.staging.10-160-3-14.sslip.io/v1/metrics\": read tcp 10.0.1.49:42008->10.160.3.14:443: read: connection reset by peer", "interval": "4.818103227s"}
...
2025-02-20T15:31:52.171Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/tlum-hub", "error": "failed to make an HTTP request: Post \"https://telemetry.controller.icos-stable.10-127-50-2.sslip.io:31000/v1/metrics\": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2025-02-20T15:31:52Z is after 2025-02-20T02:23:02Z", "interval": "25.655902211s"}
2025-02-20T15:31:53.259Z error internal/queue_sender.go:100 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/tlum-hub", "error": "no more retries left: failed to make an HTTP request: Post \"https://telemetry.controller.icos-stable.10-127-50-2.sslip.io:31000/v1/metrics\": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2025-02-20T15:31:53Z is after 2025-02-20T02:23:02Z", "dropped_items": 33}
Each error will have its own solution, but a general list of things to double check is:
- verify that the URL to which the Telemetry module in the Worker is trying to connect is correct. If it is not, the `url` value in the Helm values file is probably wrong. Also make sure that the URL is reachable from the Worker
- verify that the Telemetry Gateway in the ICOS Agent is listening correctly. For instance, the command `curl -k https://telemetry.my-agent-url.com/v1/metrics` should return `405 method not allowed, supported: [POST]`
- verify that the TLS certificate (if HTTPS is used) for the Telemetry service in the ICOS Agent is not expired (see the sketch after this list). If it is expired, some components of the ICOS CA could be not working (see the TLS errors section on this page)
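The certificate expiry can be checked from the Worker with a standard openssl probe (a sketch; replace the hostname with the Telemetry Gateway URL of your ICOS Agent):
❯ echo | openssl s_client -connect telemetry.my-agent-url.com:443 -servername telemetry.my-agent-url.com 2>/dev/null | openssl x509 -noout -dates
The `notAfter` date printed by the command must be in the future.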
If there are no errors in the ICOS Worker, the same check should be done in the ICOS Agent, looking at the pod whose name contains `...-telemetruum-gateway-otel-collector-...`. In this case, if there are errors, it indicates that the Telemetry module in the ICOS Agent is not able to send data to the ICOS Controller.
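The check is analogous to the one done on the Worker (again a sketch; the `icos-system` namespace and the pod name are assumptions):
❯ kubectl get pods -n icos-system | grep telemetruum-gateway-otel-collector
❯ kubectl logs -n icos-system <pod-name> --tail=100 | grep -iE "error|failed"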
TLS errors¶
ICOS services use TLS certificates to establish HTTPS connections. The certificates are issued by the ICOS CA running in the ICOS Core node.
If the logs of an ICOS service show errors related to TLS certificates, check the status of the certificates in the ICOS Core, Controller and Agent clusters.
❯ kubectl get certificate -n icos-system
NAME READY SECRET AGE
agent-nuvla-1-telemetry-ingress-tls True agent-nuvla-1-telemetry-ingress-tls 27d
agent1-telemetry-ingress-tls True agent1-telemetry-ingress-tls 70d
contrl1-grafana-ingress-tls True contrl1-grafana-ingress-tls 7d16h
contrl1-jobmanager-ingress-tls True contrl1-jobmanager-ingress-tls 27d
contrl1-shell-ingress-tls True contrl1-shell-ingress-tls 27d
contrl1-telemetruum-ingress-tls True contrl1-telemetruum-ingress-tls 27d
contrl1-test-polmangui-ingress-tls True contrl1-test-polmangui-ingress-tls 24d
If there are missing certificates, it might indicate that the Cert Manager component in the cluster is not working properly.
If there are certificates that are not `Ready` or are expired, it might indicate that the ICOS CA is not working properly or that the Step Issuer component in the cluster is not configured correctly.
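To understand why a specific certificate is not `Ready`, the standard cert-manager resources can be inspected (a sketch; the certificate name below is taken from the listing above and is only an example):
❯ kubectl describe certificate contrl1-grafana-ingress-tls -n icos-system
❯ kubectl get certificaterequests -n icos-system
The events of the Certificate and the status of the related CertificateRequest usually contain the issuer-side error message.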
Typical things to check are:
- check that in the ICOS Core node the CA pod is running fine, without errors in the logs, and that it is exposed at the expected URL/IP (this depends on the Helm values used in the ICOS Core installation)
- in the clusters with missing or invalid certificates:
    - check that the `cert-manager` pods are running fine without errors in the logs:

        ❯ kubectl get pods -n icos-system -l app.kubernetes.io/name=cert-manager
        NAME                                               READY   STATUS    RESTARTS        AGE
        contrl1-cert-manager-cainjector-5554548786-ds44g   1/1     Running   168 (21h ago)   73d
        contrl1-cert-manager-controller-57c565d4d5-fw66q   1/1     Running   142 (21h ago)   73d
        contrl1-cert-manager-webhook-7df445c8f7-vpddd      1/1     Running   181 (21h ago)   73d

    - check that the `step-issuer` pod is running fine without errors in the logs (see the sketch after this list). This component is the one responsible for connecting to the ICOS CA and requesting the issuing of certificates
    - check that the issuer configuration is ok. If not, double check that the `global.core.url`, `global.core.routing`, `global.core.ca.bundle`, `global.core.ca.issuerKid` and `global.core.ca.issuerPassword` values in the Helm values file are correct and correspond to the ones used in the ICOS Core Suite installation.
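The `step-issuer` checks can be done with commands along these lines (a sketch; the label selector and the CRD group depend on the installed step-issuer version and are assumptions here):
❯ kubectl logs -n icos-system -l app.kubernetes.io/name=step-issuer --tail=100
❯ kubectl get stepissuers.certmanager.step.sm -n icos-system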
Keycloak JWT Token or Credentials errors¶
TBD
ICOS Shell¶
Connection Refused exception while using any command¶
If the Lighthouse is exposed as an HTTPS service and you are getting the following errors:
Controller not defined, asking lighthouse for a controller...
Error fetching controllers: Get "http://localhost:8080/api/v3/controller/": dial tcp [::1]:8080: connect: connection refused
this is related to a known issue in the ICOS Shell CLI, which is not able to handle HTTPS endpoints. We are working to fix this issue.
In the meantime, the workaround is to expose the Shell Backend service as a NodePort and update the ICOS Shell CLI configuration to point to the new endpoint (which will be in the ip:port format and therefore supported).
- Add the following section to the ICOS Controller Helm values file:
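    A minimal sketch, assuming the chart exposes the Shell Backend service settings under a `shell-backend` key and that port 32500 (the port referenced in the last step below) is free on the nodes; the exact key names depend on the ICOS Controller chart and are an assumption here:

        shell-backend:
          service:
            type: NodePort      # expose the Shell Backend outside the cluster
            nodePort: 32500     # fixed port the CLI will connect to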
- Upgrade the ICOS Controller Helm release: `helm upgrade...`
- Modify the ICOS Shell CLI configuration file, removing the `lighthouse` section and adding the `controller: <controller-ip>:32500` value. For instance:
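    A sketch of the resulting configuration, assuming the file is YAML and keeping any other existing keys unchanged (`<controller-ip>` is a placeholder for the address of the node where the ICOS Controller runs):

        # lighthouse: ...                  <- section removed
        controller: <controller-ip>:32500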