AI & Safety · 20 pages

SoterAI: A Proven Safety-Critical AI System

Why Relying on LLMs Alone for Safety-Critical Tasks Introduces Unnecessary Risk

The first independent benchmark designed for real-world safety and risk management tasks shows SoterAI achieves 95.64% overall — outperforming every general-purpose LLM tested, including GPT, Claude, Gemini, and Grok.

SAFE-Bench results

SoterAI overall: 95.64%

Best LLM: 87.50%

Worst LLM: 77.07%

8 models tested — blinded evaluation by certified safety professionals

Executive summary

General-purpose AI models — ChatGPT, Claude, Grok, Gemini — excel at language tasks. But they are fundamentally unsuited for domains where omissions, ambiguity, and non-deterministic conclusions carry real-world consequences.

SAFE-Bench (safe-bench.com) is the first independent benchmark designed specifically to evaluate AI performance on real-world safety and risk management tasks. Model outputs were scored by a board of certified safety professionals using blinded protocols, and the results are unambiguous.

About SAFE-Bench

Traditional AI benchmarks measure language fluency, reasoning, and general problem-solving. They do not measure what matters most in safety work: the ability to identify all hazards, provide deterministic conclusions, and deliver auditable reasoning chains.

SAFE-Bench evaluates AI systems across seven critical domains. It is administered by an independent board of certified safety professionals: CSPs, loss prevention directors, HSE managers, and OSHA inspectors.

All evaluations were conducted under a blinded review process. Board members did not know which AI model produced each output — scoring was based strictly on accuracy, completeness, and production-readiness.

Seven domains of testing

Safety Intelligence

100 multiple-choice questions derived from CCSP-aligned materials covering OSHA regulations, hazard controls, incidence rate calculations, and emergency management.

Hazard Identification

Visual analysis of industrial, construction, warehouse, and maintenance images. Models must detect all meaningful hazards — explicit, implicit, and secondary.

Ergonomics

Video-based assessment of human motion, posture, force, and repetition. Models must classify severity using REBA or RULA frameworks and recommend targeted controls.

Policy Review

Analysis of written safety programs and procedures to identify deficiencies, missing elements, and non-compliant language relative to OSHA and international standards.

Incident Investigation

Structured root cause analysis using 5 Whys, TapRooT, and Fishbone methodologies from narrative descriptions, witness statements, and evidence packets.

Generative Materials

Production of toolbox talks, safety bulletins, and training materials accurate enough for frontline deployment.

RAMS (Risk Assessment and Method Statement)

Full risk assessment and method statement documents — hazard identification, risk ratings, hierarchy-of-controls proposals, and audit-ready output.
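The Safety Intelligence domain above lists incidence rate calculations among its topics. As an illustration of that kind of question (not an item taken from the benchmark), the sketch below computes the standard OSHA incidence rate, which normalizes recordable cases to 200,000 hours worked (100 full-time employees at 40 hours a week for 50 weeks). The function name and the sample figures are hypothetical.

```python
# Standard OSHA incidence rate: cases per 100 full-time workers per year.
# 200,000 = 100 employees x 40 hours/week x 50 weeks/year.
OSHA_BASE_HOURS = 200_000


def incidence_rate(recordable_cases: int, hours_worked: float) -> float:
    """Return recordable cases per 200,000 hours worked."""
    if hours_worked <= 0:
        raise ValueError("hours_worked must be positive")
    if recordable_cases < 0:
        raise ValueError("recordable_cases cannot be negative")
    return recordable_cases * OSHA_BASE_HOURS / hours_worked


if __name__ == "__main__":
    # Hypothetical site: 3 recordable cases over 250,000 hours worked.
    print(f"Incidence rate: {incidence_rate(3, 250_000):.2f}")
```

A question of this type has a single correct numeric answer, which is why it suits a multiple-choice format.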

Key findings

95.64%: SoterAI overall score, 8.14 points above the next-best LLM

100%: Perfect score in Safety Intelligence and Incident Investigation

60 pts: Gap in Hazard Identification between SoterAI (95%) and the worst LLM (35%)

8 models tested: GPT-5.1, GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro, Grok 4, and more

Benchmark results

SoterAI's 95.64% overall score represents an 8.14 percentage point advantage over the next-best model. The spread between best (SoterAI) and worst (Microsoft Copilot at 77.07%) is 18.57 points.

Domain	SoterAI	Best LLM	Worst LLM
Safety Intelligence	100%	98%	89%
Hazard Identification	95%	90%	35%
Policy Review	95%	95%	85%
Ergonomics	95%	92%	77.5%
Incident Investigation	100%	90%	60%
Generative Materials	95%	93%	72%
RAMS	88%	85%	76%
Overall	95.64%	87.50%	77.07%
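The gaps quoted in this section can be recomputed from the table. The sketch below transcribes the published scores and derives SoterAI's lead over the best and worst LLM in each row. The dictionary layout and function name are illustrative choices, and the "Overall" row is taken as published rather than recomputed, since the weighting behind it is not given here.

```python
# Scores transcribed from the benchmark table (percent).
# Tuple order: SoterAI, best LLM, worst LLM.
SCORES = {
    "Safety Intelligence": (100.0, 98.0, 89.0),
    "Hazard Identification": (95.0, 90.0, 35.0),
    "Policy Review": (95.0, 95.0, 85.0),
    "Ergonomics": (95.0, 92.0, 77.5),
    "Incident Investigation": (100.0, 90.0, 60.0),
    "Generative Materials": (95.0, 93.0, 72.0),
    "RAMS": (88.0, 85.0, 76.0),
    "Overall": (95.64, 87.50, 77.07),
}


def gaps(scores):
    """Return {row: (lead over best LLM, lead over worst LLM)} in points."""
    return {
        row: (round(soter - best, 2), round(soter - worst, 2))
        for row, (soter, best, worst) in scores.items()
    }


if __name__ == "__main__":
    for row, (vs_best, vs_worst) in gaps(SCORES).items():
        print(f"{row:<24} {vs_best:+6.2f} vs best   {vs_worst:+6.2f} vs worst")
```

Run as a script, this reproduces the 8.14 and 18.57 point overall gaps and shows that Hazard Identification has the widest spread, while Policy Review is a tie with the best LLM.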

The hazard identification gap

Hazard Identification shows the largest performance spread of any domain: 60 percentage points. Microsoft Copilot's 35% score means it misses nearly two-thirds of workplace hazards. In a facility with 100 potential hazards, that translates to 65 missed hazards — some minor, others potentially catastrophic.
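The arithmetic behind the 100-hazard example can be written out as a small helper. It follows the paragraph's own reading of the score as a per-hazard detection rate, which is an assumption: the scoring rubric is not described on this page. The function name is hypothetical.

```python
def missed_hazards(detection_rate_pct: float, total_hazards: int) -> int:
    """Expected number of hazards missed at a given detection rate (percent)."""
    if not 0 <= detection_rate_pct <= 100:
        raise ValueError("detection_rate_pct must be between 0 and 100")
    if total_hazards < 0:
        raise ValueError("total_hazards cannot be negative")
    return round(total_hazards * (100 - detection_rate_pct) / 100)


if __name__ == "__main__":
    # Hazard Identification scores from the results table.
    for label, rate in [("SoterAI", 95), ("Best LLM", 90), ("Worst LLM", 35)]:
        print(f"{label}: {missed_hazards(rate, 100)} of 100 hazards missed")
```

Under the same assumption, the 95% and 90% scores correspond to 5 and 10 missed hazards per 100.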

Why the gap is so large

Explicit hazards (obvious) — Most models identify these — e.g. a worker near unguarded machinery.

Implicit hazards (contextual) — Performance drops sharply — e.g. worker fatigue indicators suggesting increased error likelihood.

Secondary hazards (cascading) — Many models miss entirely — e.g. emergency egress obstruction due to material placement.

Practical implications

The data reveals a clear conclusion: relying on LLMs alone for safety-critical tasks introduces unnecessary operational and liability risk. Even the highest-performing AI systems require mandatory verification by qualified safety professionals before deployment.

SoterAI's specialized sub-agentic architecture, built from the ground up for safety-critical applications, matches or outperforms the best general-purpose model in every domain evaluated (Policy Review is a tie at 95%). The white paper presents the full methodology, domain breakdowns, and practical guidance for safety teams evaluating AI tools.

Download the full white paper

Get the complete 20-page report — full methodology, domain-by-domain breakdowns, and practical implications for safety teams evaluating AI tools.
