Methodology

How scoring works

Task extraction is structured, scoring is deterministic, and tool coverage reflects current practical workflows.

Pipeline contribution model

How each system layer contributes to the final benchmark output.

  • Task extraction: 30%
  • Scoring formula: 35%
  • Tool mapping: 20%
  • Confidence checks: 15%
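For concreteness, these layer contributions can be treated as a fixed configuration that must be exhaustive. The sketch below is illustrative only: the mapping name and the validation step are assumptions, and only the percentages come from the list above.

```python
# Illustrative layer weights for the pipeline contribution model.
# The names and structure are assumptions; only the percentages are from the text.
PIPELINE_WEIGHTS = {
    "task_extraction": 0.30,
    "scoring_formula": 0.35,
    "tool_mapping": 0.20,
    "confidence_checks": 0.15,
}

# A deterministic pipeline requires the contributions to cover the whole output.
assert abs(sum(PIPELINE_WEIGHTS.values()) - 1.0) < 1e-9
```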

Confidence thresholds

Confidence labels are assigned from deterministic quality thresholds.

  • High confidence: quality index >= 85
  • Medium confidence: quality index 65-84
  • Low confidence: quality index < 65
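Because the thresholds are deterministic, label assignment reduces to a pure function of the quality index. A minimal sketch, assuming a 0-100 quality index; the function name is hypothetical.

```python
def confidence_label(quality_index: float) -> str:
    """Map a 0-100 quality index to a confidence label.

    Thresholds follow the list above; the function name is illustrative.
    """
    if quality_index >= 85:
        return "high"
    if quality_index >= 65:
        return "medium"
    return "low"


assert confidence_label(91) == "high"
assert confidence_label(70) == "medium"
assert confidence_label(40) == "low"
```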

Core scoring weights

Task-level contribution weights used by the scoring engine.

  • Task structure: 30% weight
  • Repetition: 26% weight
  • Tool coverage: 22% weight
  • Oversight complexity: 22% weight
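The task-level score is then a fixed weighted sum over these four dimensions. A sketch under the assumption that each dimension is normalized to [0, 1]; the dimension keys mirror the list above, and everything else (names, example values) is illustrative.

```python
# Fixed task-level weights from the list above; all other names are assumptions.
TASK_WEIGHTS = {
    "task_structure": 0.30,
    "repetition": 0.26,
    "tool_coverage": 0.22,
    "oversight_complexity": 0.22,
}


def task_exposure_score(dimensions: dict[str, float]) -> float:
    """Weighted sum of normalized [0, 1] dimension scores."""
    return sum(TASK_WEIGHTS[name] * dimensions[name] for name in TASK_WEIGHTS)


# Example: a highly structured, repetitive task with good tool coverage.
print(task_exposure_score({
    "task_structure": 0.9,
    "repetition": 0.8,
    "tool_coverage": 0.7,
    "oversight_complexity": 0.3,
}))  # about 0.698
```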

Governance safeguards

Deterministic controls that enforce scoring integrity, compliance, and explainability.

  • Schema validation: 30% control weight
  • Deterministic replay: 28% control weight
  • Tool evidence checks: 24% control weight
  • Confidence gating: 18% control weight
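Of these controls, schema validation is the easiest to make concrete: model output missing required fields is rejected before it reaches the scoring engine. A minimal stdlib sketch; the field names and the choice of error type are assumptions.

```python
# Assumed top-level field names; the benchmark's actual schema is not public here.
REQUIRED_FIELDS = {"task_dimensions", "evidence_snippets"}


def validate_extraction(payload: dict) -> None:
    """Reject model output that is missing required top-level fields."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"extraction payload missing fields: {sorted(missing)}")
```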

Pipeline deep dive

Structured extraction and deterministic scoring pipeline

Four deterministic checkpoints convert user input into explainable role exposure with traceable confidence.

  1. Structured extraction (SE): signal strength 92%, auditability 98%

    Models return only structured task dimensions and evidence snippets (see the sketch after this list).

  2. Deterministic scoring (DS): signal strength 94%, auditability 100%

    Application code applies fixed weighted formulas for task and job exposure.

  3. Tool coverage mapping (TM): signal strength 86%, auditability 90%

    Task recommendations are mapped to curated tools with oversight labels.

  4. Confidence labeling (CL): signal strength 81%, auditability 88%

    Confidence reflects input quality, coverage depth, and pipeline completeness.

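To make the first checkpoint concrete, the sketch below shows one plausible shape for a structured-extraction payload: normalized task dimensions plus the verbatim evidence snippets that make scoring auditable. The class and field names are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TaskExtraction:
    """One plausible shape for a structured-extraction payload (names assumed)."""
    task: str
    # Normalized [0, 1] scores for the four weighted dimensions.
    task_dimensions: dict[str, float]
    # Verbatim snippets that justify each dimension score, kept for audit.
    evidence_snippets: list[str] = field(default_factory=list)


example = TaskExtraction(
    task="Reconcile monthly invoices",
    task_dimensions={
        "task_structure": 0.9,
        "repetition": 0.8,
        "tool_coverage": 0.7,
        "oversight_complexity": 0.3,
    },
    evidence_snippets=["I match invoices to purchase orders every week."],
)
```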

References

Academic papers, standards, and source material

Relevant references used to ground task extraction, deterministic scoring, confidence labeling, and labor-market interpretation.

  1. The Skill Content of Recent Technological Change: An Empirical Exploration

    Autor, Levy, and Murnane (2003) - NBER Working Paper 8337

  2. The Future of Employment: How Susceptible Are Jobs to Computerisation?

    Frey and Osborne (2013) - Oxford Martin School

  3. The Risk of Automation for Jobs in OECD Countries: A Comparative Analysis

    Arntz, Gregory, and Zierahn (2016) - OECD

  4. Robots and Jobs: Evidence from US Labor Markets

    Acemoglu and Restrepo (2017) - NBER Working Paper 23285

  5. What Can Machines Learn, and What Does It Mean for Occupations and the Economy?

    Brynjolfsson, Mitchell, and Rock (2018) - NBER Working Paper 24839

  6. Generative AI at Work

    Brynjolfsson, Li, and Raymond (2023) - NBER Working Paper 31161

  7. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models

    Eloundou et al. (2023) - arXiv:2303.10130

  8. GPT-4 Technical Report

    OpenAI (2023) - arXiv:2303.08774

  9. Training Language Models to Follow Instructions with Human Feedback

    Ouyang et al. (2022) - arXiv:2203.02155

  10. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei et al. (2022) - arXiv:2201.11903

  11. Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Wang et al. (2022) - arXiv:2203.11171

  12. Large Language Models are Zero-Shot Reasoners

    Kojima et al. (2022) - arXiv:2205.11916

  13. ReAct: Synergizing Reasoning and Acting in Language Models

    Yao et al. (2022) - arXiv:2210.03629

  14. Toolformer: Language Models Can Teach Themselves to Use Tools

    Schick et al. (2023) - arXiv:2302.04761

  15. Constitutional AI: Harmlessness from AI Feedback

    Bai et al. (2022) - arXiv:2212.08073

  16. Measuring Massive Multitask Language Understanding

    Hendrycks et al. (2020) - arXiv:2009.03300

  17. TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Lin, Hilton, and Evans (2021) - arXiv:2109.07958

  18. Holistic Evaluation of Language Models

    Liang et al. (2022) - arXiv:2211.09110

  19. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models

    BIG-bench authors (2022) - arXiv:2206.04615

  20. Model Cards for Model Reporting

    Mitchell et al. (2019) - arXiv:1810.03993

  21. Datasheets for Datasets

    Gebru et al. (2018) - arXiv:1803.09010

  22. On the Opportunities and Risks of Foundation Models

    Bommasani et al. (2021) - arXiv:2108.07258

  23. AI Risk Management Framework (AI RMF 1.0)

    NIST (2023) - Standards guidance

  24. NIST AI 600-1: Generative AI Profile

    NIST (2024) - AI RMF profile extension

  25. OECD AI Principles

    OECD AI Policy Observatory

  26. O*NET-SOC Taxonomy

    O*NET Resource Center

  27. O*NET Content Model

    O*NET Resource Center