Investigation Pipeline

Pipeline Overview

Every BWTS alert investigation follows the same fixed sequence of phases. Two design principles keep it fast: Phase 1 runs four specialist agents in parallel to minimise investigation time, and gate-based dependency tracking wakes the Manager as soon as a phase's tasks finish, so no time is lost to polling.

sequenceDiagram
participant App as Dashboard App
participant AM as Alert Monitor
participant MGR as IOT Manager
participant DA as Data Agent
participant MA as Manual Agent
participant PMS as PMS Agent
participant CF as Casefile Agent
participant RPT as Report Agent
participant ENG as Engineering Team
App->>AM: Alert email + API record
activate AM
AM->>AM: Claim & deduplicate
AM->>MGR: Create alert task (alert_monitor_v1)
deactivate AM
activate MGR
Note over MGR: Parse alert, assess severity
MGR->>DA: Phase 1 — Sensor trends
MGR->>MA: Phase 1 — All possible causes
MGR->>PMS: Phase 1 — Maintenance history
MGR->>CF: Phase 1 — Past incidents
MGR->>MGR: Create Phase 1 Gate (blocked by all 4)
deactivate MGR
activate DA
activate MA
activate PMS
activate CF
Note over DA,CF: Running in parallel
DA-->>MGR: Trends, correlations, anomaly timeline
deactivate DA
MA-->>MGR: All causes with alarm codes
deactivate MA
PMS-->>MGR: Overdue status, service gaps
deactivate PMS
CF-->>MGR: Past incidents, fleet patterns
deactivate CF
Note over MGR: Gate resolves → auto-wake
activate MGR
Note over MGR: SYNTHESIS — Cross-reference all evidence — Confirm root causes — Assign confidence levels
MGR->>MA: Phase 2 — Targeted remediation
MGR->>MGR: Create Phase 2 Gate
deactivate MGR
activate MA
MA-->>MGR: Step-by-step procedures + safety warnings
deactivate MA
Note over MGR: Gate resolves → auto-wake
activate MGR
MGR->>RPT: Compile full context → Report
deactivate MGR
activate RPT
RPT->>RPT: Build HTML report
RPT->>ENG: Email report
RPT-->>MGR: Delivery confirmed
deactivate RPT
Note over MGR: Mark investigation complete ✓

Phase-by-Phase Detail

  1. Alert Detection

    The BWTS dashboard application monitors sensor data continuously. When a parameter crosses a threshold (e.g., UV intensity drops below IMO compliance level), the app creates an alert record in the API and sends a notification email.

    The Alert Monitor agent picks this up on its scheduled routine, claims the alert, deduplicates it, and fetches all unresolved alerts so that the Manager receives the full picture.

    graph LR
    A["Sensor Breach"] --> B["API Record"]
    A --> C["Email Notification"]
    B --> D["Alert Monitor"]
    C --> D
    D --> E["Claim + Deduplicate"]
    E --> F["Forward to Manager"]
    style A fill:#fef3c7,stroke:#f59e0b
    style D fill:#dbeafe,stroke:#3b82f6
    style F fill:#ede9fe,stroke:#8b5cf6
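The claim-and-deduplicate step can be sketched as follows. This is a minimal illustration, not the Alert Monitor's actual code: the `Alert` fields, the `claim_and_deduplicate` helper, and the assumption that the API record and the email carry the same alert ID are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alert_id: str   # hypothetical ID shared by the API record and the email
    parameter: str  # e.g. "uv_intensity"
    source: str     # "api" or "email"


def claim_and_deduplicate(incoming, claimed_ids):
    """Return alerts whose ID has not been claimed yet and mark them claimed."""
    fresh = []
    for alert in incoming:
        if alert.alert_id in claimed_ids:
            continue  # claimed in an earlier run, or a duplicate in this batch
        claimed_ids.add(alert.alert_id)
        fresh.append(alert)
    return fresh


claimed = set()
batch = [
    Alert("A-1", "uv_intensity", "api"),
    Alert("A-1", "uv_intensity", "email"),  # same breach via the second channel
    Alert("A-2", "flow_rate", "api"),
]
new_alerts = claim_and_deduplicate(batch, claimed)
```

Keeping the claimed IDs in one set means a second run over the same batch forwards nothing, which is the property the Manager relies on to avoid duplicate investigations.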
  2. Phase 1 — Parallel Investigation

    The IOT Manager receives the alert task and immediately dispatches 4 specialist agents in parallel:

    | Agent | Investigation Focus | Data Source |
    |---|---|---|
    | Data Analysis | Sensor trends, correlations, anomaly timeline | PostgreSQL (telemetry) |
    | Manual Agent | All possible causes for the alarm type | PureBallast 3.1 manual (621 pages) |
    | PMS Agent | Maintenance history, overdue components | PostgreSQL (maintenance log) |
    | Casefile Agent | Past similar incidents, fleet patterns | Task history + events DB |

    After dispatching, the Manager creates a Phase 1 Gate — a task blocked by all 4 subtask IDs. When the last specialist finishes, the gate automatically unblocks and wakes the Manager.
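As a rough sketch of the dispatch step, the Manager could create four subtasks and then one gate task that lists their IDs. The `Task` class, its field names, and the integer IDs are assumptions made for illustration; the real task system's schema is not shown in this document.

```python
import itertools
from dataclasses import dataclass, field

_next_id = itertools.count(1)  # hypothetical ID generator


@dataclass
class Task:
    title: str
    assignee: str
    status: str = "todo"
    blocked_by: list = field(default_factory=list)
    task_id: int = field(default_factory=lambda: next(_next_id))


# Agent names and briefs taken from the sequence diagram above.
PHASE_1_BRIEFS = {
    "Data Agent": "Sensor trends",
    "Manual Agent": "All possible causes",
    "PMS Agent": "Maintenance history",
    "Casefile Agent": "Past incidents",
}


def dispatch_phase_1():
    """Create one subtask per specialist, then a gate blocked by all of them."""
    subtasks = [
        Task(title=f"Phase 1: {brief}", assignee=agent)
        for agent, brief in PHASE_1_BRIEFS.items()
    ]
    gate = Task(
        title="Phase 1 Gate",
        assignee="IOT Manager",
        status="blocked",
        blocked_by=[t.task_id for t in subtasks],
    )
    return subtasks, gate


subtasks, gate = dispatch_phase_1()
```

The gate is created after the subtasks so that it can reference their IDs; the Manager can then go idle until the gate unblocks.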

  3. Synthesis

    The Manager itself cross-references all Phase 1 outputs:

    1. Check data evidence — Is the decline gradual (degradation) or sudden (failure)?
    2. Cross-check maintenance — Is the component overdue for service?
    3. Validate manual causes — Does the alarm code match what the data shows?
    4. Check history — Has this happened before? What worked last time?

    Each root cause gets a confidence rating:

    | Rating | Meaning | Criteria |
    |---|---|---|
    | HIGH | Very likely | 3+ independent sources agree |
    | MEDIUM | Probable | 2 sources agree |
    | LOW | Possible | Single source or conflicting evidence |

    This is the one step the Manager never delegates — it requires reasoning across all evidence sources simultaneously.
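The rating criteria in the table can be expressed as a small function. This is only a sketch of the thresholds: the function name, the source labels, and the `conflicting` flag are assumptions, and the Manager's actual synthesis involves reasoning that a lookup like this does not capture.

```python
def rate_confidence(supporting_sources, conflicting=False):
    """Map the number of agreeing evidence sources to a confidence rating."""
    count = len(set(supporting_sources))  # count each source once
    if conflicting or count <= 1:
        return "LOW"
    if count == 2:
        return "MEDIUM"
    return "HIGH"


# Hypothetical source labels for the four Phase 1 agents.
rating_three = rate_confidence(["data", "pms", "manual"])
rating_two = rate_confidence(["data", "casefile"])
rating_conflict = rate_confidence(["data", "pms", "manual"], conflicting=True)
```

Treating conflicting evidence as LOW regardless of the source count follows the last row of the table.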

  4. Phase 2 — Targeted Remediation

    The Manager sends the confirmed root causes to the Machinery Manual Agent for a second, targeted search:

    • Exact step-by-step repair procedures for each confirmed cause
    • Safety warnings (quoted verbatim from the manual)
    • Required tools and spare parts
    • Manual section and page references

  5. Report & Delivery

    The Final Report & Email Agent receives the complete investigation package and:

    1. Validates all required inputs are present
    2. Generates a structured HTML report using report_builder.py
    3. Applies urgency-based colour coding (red/amber/blue)
    4. Sends the report via Gmail SMTP to the engineering team

    The report contains:

    • Alert Summary — All active alerts in a table
    • Diagnosis — Confirmed root causes with confidence levels
    • Recommended Actions — Step-by-step remediation grouped by urgency
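The validation and colour-coding steps might look roughly like the sketch below. It does not reproduce the API of report_builder.py, which is not shown here: the key names, the urgency levels ("high", "medium", "low"), the hex values chosen for red, amber, and blue, and the sample package contents are all illustrative assumptions.

```python
from html import escape

# Assumed package keys, mirroring the three report sections listed above.
REQUIRED_INPUTS = ("alert_summary", "diagnosis", "recommended_actions")

# Assumed urgency names mapped to red / amber / blue.
URGENCY_COLOURS = {"high": "#ef4444", "medium": "#f59e0b", "low": "#3b82f6"}


def missing_inputs(package):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_INPUTS if not package.get(key)]


def render_action(action):
    """Render one recommended action as an HTML list item with an urgency colour."""
    colour = URGENCY_COLOURS[action["urgency"]]
    text = escape(action["text"])  # escape user-visible text before embedding it
    return f'<li style="border-left: 4px solid {colour}">{text}</li>'


package = {
    "alert_summary": [{"parameter": "UV intensity", "status": "below threshold"}],
    "diagnosis": [{"cause": "Lamp degradation", "confidence": "HIGH"}],
    "recommended_actions": [{"urgency": "high", "text": "Inspect lamp & sleeve"}],
}
problems = missing_inputs(package)
rows = [render_action(a) for a in package["recommended_actions"]]
```

Validating before rendering lets the agent stop early and report which input is missing instead of emailing an incomplete report.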

Gate-Based Wake Pattern

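Gates are the design feature that eliminates polling delays. Instead of the Manager repeatedly checking whether its subtasks have finished, each gate is a task blocked by those subtasks, and resolving the last one wakes the Manager automatically: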

graph TD
subgraph "Phase 1"
T1["Data Agent Task"]
T2["Manual Agent Task"]
T3["PMS Agent Task"]
T4["Casefile Agent Task"]
end
G1["Phase 1 Gate\nstatus: blocked\nblockedBy: T1, T2, T3, T4"]
T1 -- "done ✓" --> G1
T2 -- "done ✓" --> G1
T3 -- "done ✓" --> G1
T4 -- "done ✓" --> G1
G1 -- "All resolved → auto-wake" --> SYN["Manager: Synthesis"]
SYN --> T5["Manual Agent Phase 2"]
T5 --> G2["Phase 2 Gate\nblockedBy: T5"]
G2 -- "Resolved → auto-wake" --> RPT["Manager: Dispatch Report"]
style G1 fill:#fef3c7,stroke:#f59e0b
style G2 fill:#fef3c7,stroke:#f59e0b
style SYN fill:#fce7f3,stroke:#ec4899
style RPT fill:#fef2f2,stroke:#ef4444
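The wake pattern in the diagram can be sketched as an event-driven gate: each finished task notifies the gate, and the gate fires a callback once nothing is left in its blocked-by set. The `Gate` class, its method names, and the callback are hypothetical; the underlying task system presumably implements this differently.

```python
class Gate:
    """A task that stays blocked until every task it depends on is done."""

    def __init__(self, name, blocked_by, on_resolved):
        self.name = name
        self.blocked_by = set(blocked_by)
        self.on_resolved = on_resolved  # called once, when the gate resolves
        self.status = "blocked"

    def task_done(self, task_id):
        """Record a finished task and wake the owner if it was the last one."""
        self.blocked_by.discard(task_id)
        if not self.blocked_by and self.status == "blocked":
            self.status = "resolved"
            self.on_resolved(self.name)


woken = []  # stands in for "wake the Manager"
phase_1_gate = Gate("Phase 1 Gate", ["T1", "T2", "T3", "T4"], woken.append)

for finished in ["T3", "T1", "T4"]:  # specialists may finish in any order
    phase_1_gate.task_done(finished)
status_before_last = phase_1_gate.status

phase_1_gate.task_done("T2")  # last specialist finishes, gate resolves
status_after_last = phase_1_gate.status
```

Because the wake-up is triggered by the completion event itself, the Manager resumes immediately after the slowest specialist finishes, with no polling interval in between.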