weixian-zhang/GCC-CWHD


Azure Monitoring with Grafana

powered by Central Workload Health Dashboard (CWHD)

Warning: Important information for customers using Central Workload Health Dashboard
This solution, offered by the open-source community, receives neither contributions nor support from Microsoft.

Used by government agencies on GCC, CWHD uses Grafana to visualize the performance and health of Azure resources.



What are Tier 0, 1 & 2 dashboards?

Tier 1 Dashboard (tailor-made)

The aim is to cohesively group all dependent Azure resources into a Tier 1 dashboard. How you group resources is entirely up to you; the following is a general guideline:

  • Group by system
    For example, you have a system that uses App Services, VMs, Redis Cache, Azure SQL, Storage, Azure OpenAI Service and Azure Functions. If one or more of these services fail, your system is affected. The Tier 1 dashboard should monitor all the Azure resources that together support the functioning of your system.

    If you have 2 systems, Cloud Crafty and Pocket Geek, you will have two Tier 1 dashboards.

    Tier 1 / Cloud Crafty image
    Tier 1 / Pocket Geek image image
  • Group by Subscription/Resource Group
    The context could be a cloud admin monitoring shared resources in landing zones, where the shared resources are already grouped by Subscription or Resource Group. In this case, 1 subscription = 1 Tier 1 dashboard.
  • Group scattered resources
    You could also group Azure resources from different subscriptions and resource groups into a Tier 1 dashboard.
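The grouping options above can be sketched in a few lines of Python. This is only an illustration, not part of CWHD: the function names `parse_resource_id` and `group_resources` are hypothetical, and the sketch assumes resources are identified by standard Azure resource IDs of the form `/subscriptions/{id}/resourceGroups/{name}/providers/...`.

```python
from collections import defaultdict


def parse_resource_id(resource_id: str) -> dict:
    """Extract the subscription and resource group from an Azure resource ID."""
    parts = resource_id.strip("/").split("/")
    # Resource IDs alternate key/value segments:
    # subscriptions/<id>/resourceGroups/<name>/providers/<namespace>/...
    lookup = {parts[i].lower(): parts[i + 1] for i in range(0, len(parts) - 1, 2)}
    return {
        "subscription": lookup.get("subscriptions", ""),
        "resource_group": lookup.get("resourcegroups", ""),
    }


def group_resources(resource_ids, key="subscription"):
    """Group resource IDs by 'subscription' or 'resource_group' (hypothetical helper)."""
    groups = defaultdict(list)
    for rid in resource_ids:
        groups[parse_resource_id(rid)[key]].append(rid)
    return dict(groups)
```

Each resulting group would correspond to one candidate Tier 1 dashboard. Grouping by system or grouping scattered resources cannot be derived from the ID alone, so those remain a manual mapping.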

Tier 0 Dashboard (tailor-made)

This dashboard is a summary view of all Tier 1 dashboards.
As with Tier 1 dashboards, CWHD cannot offer a pre-built Tier 0 dashboard, because Tier 0 and 1 are fully customized and adapted to your specific grouping of resources.

For this reason, Tier 0 and 1 dashboards are the core delivery work I do for my customers, in addition to other custom requests, e.g. IIS App Pool start/stop.

image image

Tier 2 Dashboards (ready-to-use)

Activity Audit Dashboard

Shows you who made changes to Firewall rules, NSG rules, Key Vault operations and operations of all other Azure services

image

Application Gateway Dashboard

image

image

Firewall Dashboard

image image

API Management Dashboard

image

Key Vault Dashboard

Shows you Key Vault metrics and operations (a modified version from Azure Monitor)

image

Storage dashboard (a modified version from Azure Monitor)


Specialized Dashboards (ready-to-use)

WARA Dashboard

With version 0.2-wara-preview (docker pull wxzd/cwhd:v0.2.0-wara-preview_130325), CWHD runs the Azure WARA assessment on startup and then every 6 hours, to bring you the past and latest reliability states of your Azure environment.

Under the hood, on every WARA run, CWHD downloads the latest copies of collector.ps1 and analyzer.ps1 and executes these 2 scripts to produce the assessment result. The result is then formatted and published via CWHD-Backend REST APIs to be consumed by Grafana.

  • the dashboard requires Grafana 11 due to Business Table
  • the CWHD backend runs in a Windows container

image

image

You can select past or latest reports and filter by subscription

image

What is a Color-coded tile?

Color-coded tiles exist in Tier 0 and 1 dashboards only, and each Azure resource is represented by one color-coded tile.
Each tile displays one of 3 colors at any one time: Green, Amber or Red, each representing a different health status.

  • Green

    • health status from Azure Resource Health API is healthy
    • for App Service specifically, health status is determined by any one of the following data sources
      • Application Insights Availability Test
      • Network Watcher Connection Monitor
      • Azure Resource Health API
    • when all Tier 1 color-coded tiles are Green, Tier 0 summarizes system status as Green
  • Amber

    • affects Virtual Machine resources only. If a VM's CPU, Memory and/or Disk usage percentage hits its threshold, Amber is shown. See Deployment & Configuration
    • when any one VM in Tier 1 color-coded tiles is Amber, Tier 0 dashboard summarizes system status as Amber
  • Red

    • when Resource Health API returns an unhealthy result

    • for App Service specifically, if any one of the following returns an unhealthy status

      • Application Insights Availability Test
      • Network Watcher Connection Monitor
      • Azure Resource Health API
    • when any one resource in Tier 1 color-coded tiles is Red, Tier 0 dashboard summarizes system status as Red. Red takes precedence over Amber.
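The Tier 0 roll-up described above amounts to "the most severe color wins". A minimal sketch, assuming each Tier 1 tile has already been resolved to one of the three colors; the names `Status` and `summarize_tier0` are hypothetical and not taken from the CWHD code base:

```python
from enum import IntEnum


class Status(IntEnum):
    """Tile colors ordered by severity, so that Red > Amber > Green."""
    GREEN = 0
    AMBER = 1
    RED = 2


def summarize_tier0(tier1_tiles):
    """Return the Tier 0 summary color for a list of Tier 1 tile colors.

    Assumption: an empty list is summarized as Green; the README does not
    say what happens when a Tier 1 dashboard has no tiles.
    """
    return max(tier1_tiles, default=Status.GREEN)
```

For example, `summarize_tier0([Status.GREEN, Status.AMBER, Status.RED])` yields `Status.RED`, because Red outranks Amber.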


Tech Stack

  • Python 3.11
  • Azure Managed Grafana Standard - Grafana 10.4.11
  • Docker image

Logs Required

Logs required in the Workspace for each Grafana dashboard:

  • Tier 0 and Tier 1 resource-specific dashboards
    • App Service health signal - any one of the following logs
      • Application Insights Availability Test result
      • Network Watcher Connection Monitor
      • Resource Health API if the above are not available
    • Virtual Machine CPU, Memory and Disk usage percentage requires Performance Counters collected by Data Collection Rule / Data Sources / Performance Counters - Basic -> / Destination / Log Analytics Workspace
  • Tier 2 / Activity Audit dashboard
    • send Activity Log to Workspace
  • Tier 2 / Firewall dashboard
    • enable Firewall diagnostics settings
      • Azure Firewall Network Rule
      • Azure Firewall Application Rule
      • Azure Firewall Nat Rule
      • Azure Firewall Threat Intelligence
      • Azure Firewall IDPS Signature
      • Azure Firewall DNS query
  • Tier 2 / API Management dashboard
    • enable APIM diagnostics settings
      • Logs related to APIManagement Gateway
    • enable Application Insights linked to Workspace
  • Tier 2 / Application Gateway dashboard
    • enable App Gateway diagnostics settings
      • Application Gateway Performance Log
      • Application Gateway Firewall Log
    • enable Application Insights linked to Workspace
  • Tier 2 / Key Vault dashboard
    • enable Key Vault diagnostics settings
      • Audit Logs

Deployment & Configuration

  1. App Service for Containers

    • Publish = Container

    • Operating System = Linux

    • container image - Docker Hub image wxzd/cwhd:v1.1.1

    • App Service Plan - Standard S1, Premium v3 P0V3 or higher

    • Environment Variables

      • APPLICATIONINSIGHTS_CONNECTION_STRING={conn string}
      • HealthStatusThreshold={"metricUsageThreshold": { "vm": { "cpuUsagePercentage": 80, "memoryUsagePercentage": 80, "diskUsagePercentage": 80 } } }
      • QueryTimeSpanHour=2
      • WEBSITES_PORT=8000
      • Version=1.1.1
    • enable Managed Identity

      • add Azure role assignment (RBAC) for Managed Identity with Monitor Reader to:
        • Subscriptions containing resources under monitoring
        • Log Analytics Workspace (if workspace in different subscription from above)
    • Enable Application Insights

    • Set up Easy Auth with Microsoft Provider

    • Networking / Access Restrictions / Site access and rules (After Managed Grafana is deployed and configured)

      • Public network access = "Enabled from selected virtual networks and IP addresses"
      • Unmatched rule action = Deny
      • add 2 Grafana Static IP addresses found under "Deterministic outbound IP"
  2. Azure Managed Grafana

    • Sku = Standard
    • enable Managed Identity
      • add Azure role assignment (RBAC) for Grafana Managed Identity with Monitor Reader to:
        • Subscriptions containing resources under monitoring
        • Log Analytics Workspaces (if workspaces are in different subscription from above)
    • Infinity plugin
      • Plugin Management, add plugins
        • Infinity
        • Business Variable Select and hit "Save"
      • Configure Infinity data source authentication with Entra ID
      • Test if Infinity data source is able to authenticate with CWHD web app
      • Configuration / Deterministic outbound IP - Enable
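To illustrate how the `HealthStatusThreshold` and `QueryTimeSpanHour` environment variables listed above might be consumed, here is a minimal sketch. It is not the actual CWHD implementation: `load_settings` is a hypothetical name, and the fallback values (80 percent thresholds, 2 hours) are an assumption copied from the example values above.

```python
import json
import os

# Assumed defaults, taken from the example values in the deployment steps.
DEFAULT_VM_THRESHOLDS = {
    "cpuUsagePercentage": 80,
    "memoryUsagePercentage": 80,
    "diskUsagePercentage": 80,
}


def load_settings(env=None):
    """Read threshold and time-span settings from environment variables."""
    env = os.environ if env is None else env

    # HealthStatusThreshold holds a JSON document; an unset value means defaults.
    raw = env.get("HealthStatusThreshold", "")
    parsed = json.loads(raw) if raw else {}
    vm = parsed.get("metricUsageThreshold", {}).get("vm", {})

    return {
        # Values from the environment override the assumed defaults.
        "vm_thresholds": {**DEFAULT_VM_THRESHOLDS, **vm},
        "query_time_span_hour": int(env.get("QueryTimeSpanHour", "2")),
    }
```

Passing a plain dictionary as `env` makes the function easy to try out without touching the real process environment.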

CWHD Backend REST API Spec

  • GET /
    Root path returns "alive"
  • POST /RHRetriever
    Input param (JSON body):

      {
        "resources": [
          {
            "resourceId": "{resource id}",
            "standardTestName": "{App Insights standard test name}",
            "workspaceId": "{Log Analytics Workspace Id}",
            "network_watcher_conn_mon_test_group_name": "{network watcher connection monitor test group name}"
          }
        ]
      }

    standardTestName and network_watcher_conn_mon_test_group_name are optional params for getting App Service health; if they are not supplied, CWHD falls back to the Resource Health API.
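A request to `/RHRetriever` could be assembled as in the sketch below, using only the Python standard library. The helper names `build_payload` and `build_request` are hypothetical. The sketch assumes that optional fields are simply omitted when not supplied, treats `workspaceId` as optional too (an assumption), and leaves out the Entra ID bearer token that Easy Auth would require. The response format is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request

# Fields other than resourceId are treated as optional in this sketch.
OPTIONAL_FIELDS = (
    "standardTestName",
    "workspaceId",
    "network_watcher_conn_mon_test_group_name",
)


def build_payload(resources):
    """Build the JSON body, omitting optional fields that are empty or missing."""
    items = []
    for res in resources:
        item = {"resourceId": res["resourceId"]}
        for key in OPTIONAL_FIELDS:
            if res.get(key):
                item[key] = res[key]
        items.append(item)
    return {"resources": items}


def build_request(base_url, resources):
    """Create a POST request for the /RHRetriever path (not sent here)."""
    body = json.dumps(build_payload(resources)).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/RHRetriever",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request object can then be sent with `urllib.request.urlopen(request)` once an `Authorization` header has been added for your deployment.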

Architecture

image



CWHD Backend is a web app that curates telemetry from different data sources, including:

  • Azure Monitor REST API
    • App Service: health status is determined by any one of the following results
      • Kusto query - Application Insights Availability Test result (AppAvailabilityResults table)
      • Kusto query - Network Watcher Connection Monitor (NWConnectionMonitorTestResult table)
      • Resource Health API as the last option, if the above options are not available
    • VM: health status is determined by 2 factors
      • Resource Health availability status determines whether the VM is available, which maps to Green or Red status.
      • If the resource health status is Available/Green and a Log Analytics Workspace ID is provided, 3 additional metrics (CPU, Memory and Disk usage percentage) are monitored against a set of configurable thresholds. In Grafana, the VM Stat visualization shows Amber status if one or more of the 3 metrics reaches its threshold.
  • Azure Resource Health API - gets resource health for all resource types except App Service, which gets health status from an Application Insights Standard Test
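The per-resource decisions described above can be sketched as two small functions. This is an illustration under stated assumptions, not the backend's actual code: `vm_status` and `app_service_source` are hypothetical names, "reaches the threshold" is read as greater than or equal, and the order of preference when both an availability test and a connection monitor are supplied is an assumption.

```python
from typing import Optional


def vm_status(available: bool, usage: Optional[dict], thresholds: dict) -> str:
    """Resolve a VM tile color from Resource Health and usage metrics.

    `usage` is None when no Log Analytics Workspace ID was provided,
    in which case only Resource Health availability is considered.
    """
    if not available:
        return "Red"
    if usage is None:
        return "Green"
    for metric, limit in thresholds.items():
        # Amber as soon as any one metric reaches its threshold.
        if usage.get(metric, 0) >= limit:
            return "Amber"
    return "Green"


def app_service_source(standard_test_name: Optional[str] = None,
                       conn_mon_test_group: Optional[str] = None) -> str:
    """Pick the data source used for App Service health."""
    if standard_test_name:
        return "AppAvailabilityResults"
    if conn_mon_test_group:
        return "NWConnectionMonitorTestResult"
    # Last option when neither optional parameter is supplied.
    return "Resource Health API"
```

The threshold keys are expected to match the metric keys in `usage`, for example the `cpuUsagePercentage`, `memoryUsagePercentage` and `diskUsagePercentage` names used in `HealthStatusThreshold`.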