Commit 0c24e48

yukirora, zhogu, and abuccts authored
Release - Lucia Training Platform v1.4 (#131)
**Description**
Merge bug fixes from v1.4 to dev branch.

**Major Revisions**
* ModelProxy: fix uncommitted changes (#122)
* Fix cicd errors when branch creation and building all images (#123)
* Bugfix - fix bug of image regex deployment (#125)
* Bugfix - fix bugs in alert manager test and fix kusto alert query issue (#126)
* Fix openpai runtime build on arm64 (#128)
* Fix fluentd build on arm64 (#129)
* Bugfix - bug fix for baremetal support (#127)
* Doc - Add Release note for v1.4.0 (#130)

---------

Co-authored-by: zhogu
Co-authored-by: Yifan Xiong
1 parent bf4eef9 commit 0c24e48

30 files changed, +1559 -320 lines changed

.github/workflows/build-all.yaml

Lines changed: 9 additions & 3 deletions
@@ -5,7 +5,9 @@ permissions:

 on:
   push:
-    branches: ['release/*']
+    branches: ["release/*"]
+  pull_request:
+    branches: ["release/*"]
   release:
     types: [published]
   workflow_dispatch:
@@ -56,8 +58,9 @@ jobs:
       - name: Install Package
         if: steps.all.outputs.services != ''
         run: |
-          DEBIAN_FRONTEND=noninteractive apt install -y python3 python-is-python3 pip git unzip docker-cli ca-certificates curl apt-transport-https lsb-release gnupg parallel
+          DEBIAN_FRONTEND=noninteractive apt install -y python3 python-is-python3 pip git unzip ca-certificates curl apt-transport-https lsb-release gnupg parallel
           curl -sL https://aka.ms/InstallAzureCLIDeb | bash
+          curl -fsSL https://get.docker.com | sh

       - name: Install python libs
         if: steps.all.outputs.services != ''
@@ -78,6 +81,9 @@ jobs:
           mv $GITHUB_WORKSPACE/config/auth-configuration /tmp/
           ls -l /tmp/auth-configuration

+      - name: Log in to GHCR
+        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
+
       - name: Build Images of Services
         if: steps.all.outputs.services != ''
         run: |
@@ -86,7 +92,7 @@ jobs:
           echo "--------------------------------"
           failed_services=""
           for service in $all_services; do
-            if [[ "$service" =~ alert-manager ]]; then
+            if echo "$service" | grep -q "alert-manager"; then
               echo "alert-manager is in the changed services"
               # Build specific images in alert-manager
               echo "Building specific alert-manager images"

.github/workflows/build-deploy-changes.yaml

Lines changed: 6 additions & 1 deletion
@@ -49,7 +49,12 @@ jobs:
             base_sha=$(git merge-base origin/${{ github.event.pull_request.base.ref }} ${{ github.event.pull_request.head.sha }})
             head_sha="${{ github.event.pull_request.head.sha }}"
           else
-            base_sha="${{ github.event.before }}"
+            if [ "${{ github.event.before }}" = "0000000000000000000000000000000000000000" ]; then
+              # Get the previous commit on branch
+              base_sha=$(git rev-parse ${{ github.sha }}^)
+            else
+              base_sha="${{ github.event.before }}"
+            fi
             head_sha="${{ github.sha }}"
           fi

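Note on the hunk above: on a push that creates a branch, GitHub reports `github.event.before` as forty zeros, so there is no earlier tip to diff against and the workflow falls back to the parent of the pushed commit. The following is a minimal Python sketch of that decision only, not the workflow's actual code; the function name `resolve_base_sha` and the constant `ZERO_SHA` are hypothetical, and the sketch returns a revision expression instead of calling `git rev-parse`.

```python
# Hypothetical helper mirroring the shell branch in the workflow hunk.
ZERO_SHA = "0" * 40  # value of `before` on a push that creates a branch


def resolve_base_sha(before: str, sha: str) -> str:
    """Return the revision to diff against for a push event."""
    if before == ZERO_SHA:
        # New branch: no previous tip exists, so use the parent commit.
        # The workflow resolves this expression with `git rev-parse`.
        return f"{sha}^"
    return before


print(resolve_base_sha(ZERO_SHA, "0c24e48"))   # prints 0c24e48^
print(resolve_base_sha("bf4eef9", "0c24e48"))  # prints bf4eef9
```
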
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+---
+slug: release-ltp-v1.4
+title: Releasing Lucia Training Platform v1.4
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.4.0**!
+
+## Lucia Training Platform v1.4.0 Release Notes
+
+This release brings significant improvements across platform stability, inference capabilities, CI/CD automation, security enhancements, and user experience.
+
+## Platform Features & Stability
+- Fixed Blobfuse2 installation failure
+- Fixed slow portal response issues
+- Fixed deployment issues on arm64 architecture and bare metal machines
+- Added image regex to support valid image limitation
+- Added webportal plugin for cluster local storage
+- Added tolerance and priority class in cluster-local-storage
+- Created PostgreSQL backend and support backend switch with Kusto
+
+## Inference Plugin
+- Inference plugin support for model proxy
+- Added support for training/inference job type in restserver
+- Added support for training/inference job type in webportal
+- Support inference job type in model proxy
+
+## CI/CD
+- Added CI/CD workflow to build all service images
+- Enabled imagelist argument for image build script
+- Fixed copy error in building image process
+- Fix cicd errors when branch creation and building all images
+
+## Security
+- Upgraded Dockerfiles with latest system updates
+- Fixed credentials exposure in source code
+- Updated Cilium to latest version
+
+## User Experience
+- Added comprehensive user manual documentation
+- Improved Copilot test experience
+

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+FROM prom/alertmanager:v0.29.0

src/alert-manager/deploy/alert-manager-deployment.yaml.template

Lines changed: 6 additions & 5 deletions
@@ -42,6 +42,7 @@ spec:
       containers:
       - name: alertmanager
         image: {{ cluster_cfg['cluster']['docker-registry']['prefix'] }}alertmanager:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}
+        imagePullPolicy: Always
         args:
           - '--config.file=/etc/alertmanager/config.yml'
           - '--storage.path=/alertmanager'
@@ -196,8 +197,8 @@ spec:
           value: {{ cluster_cfg["alert-manager"]["alert-parser"]["log-level"] | default(cluster_cfg["alert-manager"]["log-level"]) | default("INFO") }}
         - name: UPDATE_INTERVAL
           value: "{{ cluster_cfg["alert-manager"]["alert-parser"]["update-interval"] | default("10") }}"
-        - name: REST_SERVER_URI
-          value: {{ cluster_cfg["alert-manager"]["alert-parser"]["uri"] | default(cluster_cfg["alert-manager"]["prometheus-uri"]) }}
+        - name: PROMETHEUS_SERVER_URI
+          value: {{ cluster_cfg["prometheus"]["url"] }}
         - name: PAI_TOKEN
           value: {{ cluster_cfg["alert-manager"]["pai-bearer-token"] }}
         {% if cluster_cfg["alert-manager"]["alert-parser"].get("storage-backend", cluster_cfg["alert-manager"].get("storage-backend", "kusto")) == "kusto" %}
@@ -217,11 +218,11 @@ spec:
           value: "{{ cluster_cfg['alert-manager']['alert-parser'].get('alert-kusto-database', 'DefaultWorkspace-id-westus2') }}"
         - name: LTP_KUSTO_VM_INFO_TABLE_NAME
           value: {{ cluster_cfg["alert-manager"]["alert-parser"].get("kusto-vm-table", cluster_cfg["alert-manager"]["kusto"].get("kusto-vm-table", "")) }}
-        {% endif %}
         - name: LTP_VMSS_IDS
           value: {{ cluster_cfg["alert-manager"]["alert-parser"]["vmss_ids"] | default(cluster_cfg["alert-manager"]["vmss_ids"]) }}
         - name: AZURE_CLIENT_ID
           value: {{ cluster_cfg["alert-manager"]["alert-parser"]["vmss_client_id"] | default(cluster_cfg["alert-manager"]["vmss_client_id"]) }}
+        {% endif %}
         # Storage backend selection
         - name: LTP_STORAGE_BACKEND_DEFAULT
           value: "{{ cluster_cfg['alert-manager']['alert-parser'].get('storage-backend', cluster_cfg['alert-manager'].get('storage-backend', 'kusto')) }}"
@@ -332,7 +333,7 @@ spec:
         - name: LOG_LEVEL
           value: {{ cluster_cfg["alert-manager"]["node-failure-detection"]["monitor"]["log-level"] | default("INFO") }}
         - name: PROMETHEUS_SERVER_URI
-          value: {{ cluster_cfg["alert-manager"]["node-failure-detection"]["monitor"]["prometheus-server-uri"] }}
+          value: {{ cluster_cfg["prometheus"]["url"] }}
         - name: REST_SERVER_URI
           value: {{ cluster_cfg["rest-server"]["uri"] }}
         - name: PAI_TOKEN
@@ -405,7 +406,7 @@ spec:
         - name: REST_SERVER_URI
           value: {{ cluster_cfg["rest-server"]["uri"] }}
         - name: PROMETHEUS_SERVER_URI
-          value: {{ cluster_cfg["alert-manager"]["job-data-recorder"]["prometheus-uri"] | default(cluster_cfg["alert-manager"]["prometheus-uri"]) }}
+          value: {{ cluster_cfg["prometheus"]["url"] }}
         - name: PAI_TOKEN
           value: {{ cluster_cfg["alert-manager"]["pai-bearer-token"] }}
         {% if cluster_cfg["alert-manager"]["job-data-recorder"].get("storage-backend", cluster_cfg["alert-manager"].get("storage-backend", "kusto")) == "kusto" %}
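Note on the template above: the storage backend is looked up per component first, then at the `alert-manager` level, and finally defaults to `"kusto"`; the Kusto-specific variables are emitted only when the result is `"kusto"`. A minimal Python sketch of that lookup order follows. The function name `storage_backend` is hypothetical, the value `"postgresql"` is an assumed example backend name, and the sketch tolerates a missing component section, which the template's direct indexing does not.

```python
def storage_backend(alert_manager_cfg: dict, component: str) -> str:
    """Resolve a component's storage backend using the template's fallback order."""
    # Assumption: a missing component section is treated as empty here.
    component_cfg = alert_manager_cfg.get(component, {})
    return component_cfg.get(
        "storage-backend",
        alert_manager_cfg.get("storage-backend", "kusto"),
    )


# "postgresql" is an assumed example value, not confirmed by the diff.
cfg = {
    "storage-backend": "postgresql",
    "alert-parser": {},
    "job-data-recorder": {"storage-backend": "kusto"},
}

print(storage_backend(cfg, "alert-parser"))       # prints postgresql
print(storage_backend(cfg, "job-data-recorder"))  # prints kusto
print(storage_backend({}, "alert-parser"))        # prints kusto
```
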

src/alert-manager/deploy/redis-deployment.yaml.template

Lines changed: 3 additions & 1 deletion
@@ -72,11 +72,13 @@ spec:
           periodSeconds: 10
           successThreshold: 1
           failureThreshold: 3
+      imagePullSecrets:
+      - name: {{ cluster_cfg["cluster"]["docker-registry"]["secret-name"] }}
       volumes:
       - name: redis-data
         hostPath:
           path: /data/redis-data
           type: ""
       - name: redis-config
         configMap:
-          name: redis-config
+          name: redis-config

src/alert-manager/src/alert-parser/node_alert_monitor.py

Lines changed: 5 additions & 5 deletions
@@ -34,7 +34,7 @@ def __init__(self):
         """
         self.endpoint = os.getenv("CLUSTER_ID")
         self.update_interval = int(os.getenv("UPDATE_INTERVAL", 10))  # Default to 10 minutes
-        self.rest_server_uri = os.getenv("REST_SERVER_URI", "http://localhost:8080")
+        self.prometheus_uri = os.getenv("PROMETHEUS_SERVER_URI", "http://localhost:8080")
         self.is_running = False
         self.last_update_time = None
         self.tolerance_time = int(os.getenv("TOLERANCE_TIME", 300))
@@ -54,7 +54,7 @@ def query_availability_changes(self, end_time: int, time_offset: str, interval:
             'query?query=(avg_over_time(avg by (node_name) (pai_node_count{unschedulable="true",node_name!~"aks-.*"} or pai_node_count{unschedulable="false",node_name!~"aks-.*"}*0)'
             + f"[{time_offset}:{interval}] @ {end_time} )) >0")

-        result = RequestUtil.prometheus_query(query=query, data={}, uri=self.rest_server_uri)
+        result = RequestUtil.prometheus_query(query=query, data={}, uri=f"{self.prometheus_uri}/prometheus")

         if result is not None:
             result = result["result"]
@@ -68,7 +68,7 @@ def query_availability_changes(self, end_time: int, time_offset: str, interval:

         query = ('query?query=avg by (node_name) (pai_node_count{unschedulable="false",node_name!~"aks-.*"})'
                  + f"[{interval}:{interval}] @ {end_time}")
-        result = RequestUtil.prometheus_query(query=query, data={}, uri=self.rest_server_uri)
+        result = RequestUtil.prometheus_query(query=query, data={}, uri=f"{self.prometheus_uri}/prometheus")
         if result is not None:
             result = result["result"]
             for node_result in result:
@@ -84,7 +84,7 @@ def get_node_status_changes(self, node: str, end_time: int, time_offset: str, in
                  f'unschedulable="false",node_name="{node}"' + "}*0) " +
                  f"[{time_offset}:{interval}] @ {end_time}")

-        result = RequestUtil.prometheus_query(query=query, data={}, uri=self.rest_server_uri)
+        result = RequestUtil.prometheus_query(query=query, data={}, uri=f"{self.prometheus_uri}/prometheus")
         status_changes = {}
         raw_values = {}

@@ -279,7 +279,7 @@ def process_node_changes(self, node: str, changes: Dict[float, float], end_time:
             return
         # fetch alerts for the node
         alerts = self.alert_fetcher.get_node_alert_records(
-            end_time, f"{time_offset}s", endpoint=self.endpoint, nodes=[node]
+            end_time, f"{time_offset}s", nodes=[node], severity="error"
         )

         sorted_changes = sorted(changes.items(), key=lambda x: x[0])
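Note on the hunks above: the monitor now reads `PROMETHEUS_SERVER_URI` and appends `/prometheus` at each call site. Below is a minimal sketch of building that base URI in one place, under stated assumptions: the helper name `prometheus_query_base` is hypothetical, and stripping a trailing slash is an addition for illustration that the commit itself does not perform.

```python
import os
from typing import Mapping


def prometheus_query_base(env: Mapping[str, str] = os.environ) -> str:
    """Build the URI passed to the Prometheus query helper."""
    # Same variable name and default as in the diff.
    base = env.get("PROMETHEUS_SERVER_URI", "http://localhost:8080")
    # Assumption: strip a trailing slash so the join never yields "//prometheus".
    return f"{base.rstrip('/')}/prometheus"


print(prometheus_query_base({}))
# prints http://localhost:8080/prometheus
print(prometheus_query_base({"PROMETHEUS_SERVER_URI": "http://prom:9090/"}))
# prints http://prom:9090/prometheus
```

Centralizing the join would keep the three call sites in `node_alert_monitor.py` from drifting apart if the path prefix changes.
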

src/alert-manager/src/alert-parser/tests/conftest.py

Lines changed: 1 addition & 7 deletions
@@ -9,13 +9,7 @@
 def mock_env_vars():
     """Mock environment variables used in the tests"""
     env_vars = {
-        'CLUSTER_ID': 'test-cluster',
-        'LTP_KUSTO_CLUSTER_URI': 'test-kusto-cluster',
-        'LTP_KUSTO_DATABASE_NAME': 'test-db',
-        'KUSTO_NODE_STATUS_TABLE_NAME': 'NodeStatusRecord',
-        'KUSTO_NODE_STATUS_ATTRIBUTE_TABLE_NAME': 'NodeStatusAttributes',
-        'KUSTO_NODE_ACTION_TABLE_NAME': 'NodeActionRecord',
-        'KUSTO_NODE_ACTION_ATTRIBUTE_TABLE_NAME': 'NodeActionAttributes'
+        'CLUSTER_ID': 'test-cluster'
     }
     with pytest.MonkeyPatch.context() as m:
         for key, value in env_vars.items():
