Methodology
How scoring works
Task extraction is structured, scoring is deterministic, and tool coverage reflects current practical workflows.
Pipeline contribution model
How each system layer contributes to the final benchmark output.
- Task extraction: 30%
- Scoring formula: 35%
- Tool mapping: 20%
- Confidence checks: 15%
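The contribution model above can be sketched as a weighted sum over per-layer scores. This is a minimal illustration using the stated 30/35/20/15 weights; the layer names, score scale, and the aggregation rule itself are assumptions, not the benchmark's published implementation.

```python
# Illustrative sketch: combining per-layer quality scores (each in [0, 1])
# with the stated contribution weights. Names and the linear aggregation
# rule are assumptions for illustration only.

LAYER_WEIGHTS = {
    "task_extraction": 0.30,
    "scoring_formula": 0.35,
    "tool_mapping": 0.20,
    "confidence_checks": 0.15,
}

def combine_layers(layer_scores: dict[str, float]) -> float:
    """Weighted sum of per-layer scores; weights must sum to 1."""
    assert abs(sum(LAYER_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(LAYER_WEIGHTS[name] * layer_scores[name] for name in LAYER_WEIGHTS)

# Example usage with the signal-strength figures from the checkpoints below:
overall = combine_layers({
    "task_extraction": 0.92,
    "scoring_formula": 0.94,
    "tool_mapping": 0.86,
    "confidence_checks": 0.81,
})
```

Because the weights are fixed constants in application code, the same inputs always produce the same aggregate.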
Confidence thresholds
Confidence labels are assigned from deterministic quality thresholds.
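A deterministic threshold mapping of this kind can be sketched as a plain cutoff function. The cutoff values and label names below are illustrative assumptions; the document does not publish the actual thresholds.

```python
# Illustrative sketch: deterministic confidence labeling from a quality
# score in [0, 1]. The 0.85 / 0.60 cutoffs are assumptions, not the
# benchmark's actual thresholds.

def confidence_label(quality: float) -> str:
    """Same quality score in, same label out -- no model call involved."""
    if quality >= 0.85:
        return "high"
    if quality >= 0.60:
        return "medium"
    return "low"
```

The point of the deterministic rule is auditability: any assigned label can be reproduced from its input score alone.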
Core scoring weights
Task-level contribution weights used by the scoring engine.
Governance safeguards
Deterministic controls that enforce scoring integrity, compliance, and explainability.
Structured extraction (SE)
Models return only structured task dimensions and evidence snippets.
Signal strength: 92% · Auditability: 98%
Deterministic scoring (DS)
Application code applies fixed weighted formulas for task and job exposure.
Signal strength: 94% · Auditability: 100%
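A fixed weighted formula of the kind described can be sketched as follows. The dimension names, weights, and the mean-based job aggregation are illustrative assumptions, not the engine's published formula.

```python
# Illustrative sketch: fixed weighted formulas applied in application code,
# so no model output influences the arithmetic. Dimension names and weights
# are assumptions for illustration only.

TASK_WEIGHTS = {"automatability": 0.5, "frequency": 0.3, "criticality": 0.2}

def task_exposure(dims: dict[str, float]) -> float:
    """Weighted sum over extracted task dimensions (each in [0, 1])."""
    return sum(TASK_WEIGHTS[k] * dims[k] for k in TASK_WEIGHTS)

def job_exposure(task_scores: list[float]) -> float:
    """Job-level exposure as the mean of its task exposures (assumption)."""
    return sum(task_scores) / len(task_scores)
```

Models supply only the structured dimension values; the arithmetic itself lives in reviewable application code.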
Tool coverage mapping (TM)
Task recommendations are mapped to curated tools with oversight labels.
Signal strength: 86% · Auditability: 90%
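A curated mapping with oversight labels can be sketched as a static lookup table. The catalog entries, category names, and label vocabulary below are hypothetical; the point is that recommendations come from a reviewed list rather than free-form model output.

```python
# Illustrative sketch: curated task-to-tool lookup with oversight labels.
# Catalog contents and label names are hypothetical examples.

TOOL_CATALOG = {
    "data_entry": [("SpreadsheetBot", "low_oversight")],
    "drafting": [("DraftAssist", "human_review_required")],
}

def recommend_tools(task_category: str) -> list[tuple[str, str]]:
    """Unknown categories return no recommendation rather than a guess."""
    return TOOL_CATALOG.get(task_category, [])
```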
Confidence labeling (CL)
Confidence reflects input quality, coverage depth, and pipeline completeness.
Signal strength: 81% · Auditability: 88%
References
Academic papers, standards, and source material
Relevant references used to ground task extraction, deterministic scoring, confidence labeling, and labor-market interpretation.
- The Skill Content of Recent Technological Change: An Empirical Exploration
Autor, Levy, and Murnane (2003) - NBER Working Paper 8337
- The Future of Employment: How Susceptible Are Jobs to Computerisation?
Frey and Osborne (2013) - Oxford Martin School
- The Risk of Automation for Jobs in OECD Countries: A Comparative Analysis
Arntz, Gregory, and Zierahn (2016) - OECD
- Robots and Jobs: Evidence from US Labor Markets
Acemoglu and Restrepo (2017) - NBER Working Paper 23285
- What Can Machines Learn, and What Does It Mean for Occupations and the Economy?
Brynjolfsson, Mitchell, and Rock (2018) - NBER Working Paper 24839
- Generative AI at Work
Brynjolfsson, Li, and Raymond (2023) - NBER Working Paper 31161
- GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
Eloundou et al. (2023) - arXiv:2303.10130
- GPT-4 Technical Report
OpenAI (2023) - arXiv:2303.08774
- Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. (2022) - arXiv:2203.02155
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei et al. (2022) - arXiv:2201.11903
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
Wang et al. (2022) - arXiv:2203.11171
- Large Language Models are Zero-Shot Reasoners
Kojima et al. (2022) - arXiv:2205.11916
- ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al. (2022) - arXiv:2210.03629
- Toolformer: Language Models Can Teach Themselves to Use Tools
Schick et al. (2023) - arXiv:2302.04761
- Constitutional AI: Harmlessness from AI Feedback
Bai et al. (2022) - arXiv:2212.08073
- Measuring Massive Multitask Language Understanding
Hendrycks et al. (2020) - arXiv:2009.03300
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
Lin, Hilton, and Evans (2021) - arXiv:2109.07958
- Holistic Evaluation of Language Models
Liang et al. (2022) - arXiv:2211.09110
- Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models
BIG-bench authors (2022) - arXiv:2206.04615
- Model Cards for Model Reporting
Mitchell et al. (2019) - arXiv:1810.03993
- Datasheets for Datasets
Gebru et al. (2018) - arXiv:1803.09010
- On the Opportunities and Risks of Foundation Models
Bommasani et al. (2021) - arXiv:2108.07258
- AI Risk Management Framework (AI RMF 1.0)
NIST (2023) - Standards guidance
- NIST AI 600-1: Generative AI Profile
NIST (2024) - AI RMF profile extension
- OECD AI Principles
OECD AI Policy Observatory
- O*NET-SOC Taxonomy
O*NET Resource Center
- O*NET Content Model
O*NET Resource Center