humanEVALS

The people who teach a child determine what that child can think. Not the curriculum. Not the textbook. The person standing in front of them - what they notice, what they care about, what they know well enough to explain badly and then better.

AI is no different. The models being built today are the accumulated judgment of everyone who ever told them what was right, what was useful, what was good. That judgment is in there, invisibly, shaping every output.

Most of the industry is trying to scale the process of collecting that judgment. humanEVALS is trying to improve the quality of it.

Because scale without quality doesn't produce better AI. It produces more confident AI. And those are not the same thing.

We find the people whose judgment is worth having, verify that it is, and build the infrastructure to apply it.
See how we work →

Expert Network

Domain knowledge,
not crowdsourced
opinion.

AI evaluation requires people who have spent years inside a field. We work with practitioners: clinicians who see patients, engineers who write production code, attorneys who argue cases.

01-Medicine
Clinical & Medical
Physicians, specialists, and researchers who can evaluate diagnostic reasoning, treatment recommendations, and medical literature.
Internal Medicine · Oncology · Radiology · Psychiatry · Surgery
02-Law
Legal & Regulatory
Practicing attorneys evaluating legal reasoning, contract interpretation, case analysis, and jurisdictional nuance.
Corporate Law · IP · Litigation · Compliance · Tax
03-Engineering
Software & Systems
Engineers from production environments evaluating code correctness, architecture decisions, security posture, and debugging approaches.
Backend · ML Systems · Security · Distributed
04-Finance
Finance & Economics
Portfolio managers, analysts, and economists evaluating financial reasoning, risk models, and market analysis.
Equity Research · Risk · Macro · Derivatives
05-Science
Research & Academia
PhD researchers evaluating scientific claims, experimental methodology, and literature synthesis across disciplines.
Biology · Chemistry · Physics · Climate · Neuroscience
06-Strategy
Business & Strategy
Management consultants and operators evaluating strategic recommendations, market sizing, and organisational reasoning.
MBB · Operations · M&A · GTM
07-Language
Linguistics & Translation
Native speakers and linguists evaluating multilingual output for accuracy, register, cultural appropriateness, and idiomatic precision.
Hindi · Arabic · Mandarin · French · +40 more
08-Education
Pedagogy & Assessment
Educators and curriculum designers evaluating explanations, grade-level calibration, and the accuracy of instructional content.
K–12 · Higher Ed · STEM · Curriculum Design

How an expert enters the network.

We do not onboard quickly. The evaluation of experts is as rigorous as the evaluation of models.

01
Credential verification

We verify degrees, licences, publications, and institutional affiliations directly.

02
Domain calibration test

Each applicant completes a structured evaluation in their domain using tasks that already have known answers. We are measuring how they reason, not just whether they are correct. Borderline cases are reviewed by a senior expert in the same field.

03
Inter-rater reliability check

New experts evaluate the same tasks as established network members. We measure agreement rates and flag systematic divergence. Low agreement is not automatically disqualifying; it opens a conversation about where and why judgment differs.
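Agreement between a new expert and the network baseline can be summarised with Cohen's kappa, which corrects raw agreement for what two raters would agree on by chance. A minimal sketch in Python; the function name and the pass/fail labels are illustrative, not part of our tooling:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# A new expert and an established member labelling the same ten outputs.
new_expert = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
baseline   = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(new_expert, baseline), 2))  # → 0.74
```

Raw agreement here is 90%, but kappa lands lower because many matches are expected by chance alone when one label dominates.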

04
Ongoing quality monitoring

Every expert is re-evaluated periodically through blind calibration tasks embedded in live work. Performance that drifts from baseline is flagged and reviewed.
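The drift check itself can be as simple as comparing recent blind-task accuracy against an expert's established baseline. A hedged sketch; the function name, window size, and tolerance are illustrative assumptions, not our production thresholds:

```python
def flag_drift(gold_task_scores, baseline_accuracy, window=20, tolerance=0.10):
    """Return True when an expert's accuracy on the most recent `window`
    blind calibration tasks falls more than `tolerance` below their
    baseline. gold_task_scores: 1 (correct) / 0 (incorrect), oldest first."""
    if len(gold_task_scores) < window:
        return False  # not enough recent gold tasks to judge drift
    recent = sum(gold_task_scores[-window:]) / window
    return recent < baseline_accuracy - tolerance

# An expert with a 92% baseline who scored 15/20 on recent gold tasks.
print(flag_drift([1] * 15 + [0] * 5, baseline_accuracy=0.92))  # → True
```

A flag here triggers human review, not removal; drift can mean fatigue, a shifted task mix, or a genuinely contested standard.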

Active · Medical
Dr. Ananya Krishnan
Cardiologist, 14 years clinical practice · Chennai & remote

"I evaluate whether a model's recommendation would lead a real patient to harm. That requires knowing not just what the guidelines say, but when the guidelines don't apply."

  • Specialisation: Interventional Cardiology, Heart Failure
  • Credential verified: MBBS, MD (AIIMS Delhi) · MCI Registration confirmed
  • Evaluation domains: Diagnostic reasoning · Treatment planning · Patient communication · Drug interaction
  • Inter-rater agreement: Within network baseline (κ = 0.81)
  • Network since: March 2024
For companies
You need evaluators who know the field.

Tell us the domain, the task type, and the output format you need evaluated. We match you with verified experts and manage quality throughout.

Tell us what you need
For experts
Your judgment shapes how models think.

If you have deep domain expertise and want to apply it to one of the more consequential problems in technology right now, we want to hear from you.

Apply to the network

For companies

Tell us what you
need evaluated.

We match you with verified domain experts for your specific task: evaluation, annotation, red-teaming, or preference ranking. We manage quality throughout, not just at the point of delivery.

For experts

Your judgment is
worth something.

We are looking for people with deep, verifiable expertise who want to apply it to one of the more consequential problems in technology: shaping how AI understands the world.

Training Solutions

Better signal.
At every stage
of training.

A trained model is only as good as the feedback it receives. We provide that human signal across SFT, RLHF, active learning, and preference data, using verified domain experts who understand what correct actually looks like in each field.

Talk to us
For teams building domain-specific AI
AI labs fine-tuning for specific domains
Companies using GPT or open-source models
Teams auditing existing annotation pipelines
SFT
Supervised Fine-Tuning Data

High quality demonstration data written or reviewed by domain experts. A practitioner writing a solution thinks differently from a generalist annotator. That difference compounds in the model.

RLHF
Reward Modeling

Expert annotators compare model outputs and provide preference rankings with written justifications. We track inter-rater agreement and surface disagreement because disagreement often contains the most signal.
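Surfacing disagreement can be a simple pass over expert votes per comparison. A sketch under assumed data shapes; the vote format, function name, and threshold are hypothetical:

```python
def split_decisions(votes, threshold=0.8):
    """votes maps a comparison id to the option ("A" or "B") each expert
    preferred. Comparisons where the majority preference falls below
    `threshold` are routed to senior review instead of going straight
    into the reward-model training set."""
    flagged = []
    for pair_id, prefs in votes.items():
        top_share = max(prefs.count(option) for option in set(prefs)) / len(prefs)
        if top_share < threshold:
            flagged.append(pair_id)
    return flagged

votes = {
    "pair-1": ["A", "A", "A", "A", "A"],  # unanimous: keep as-is
    "pair-2": ["A", "B", "A", "B", "A"],  # 3/5 split: send to review
}
print(split_decisions(votes))  # → ['pair-2']
```

The flagged pairs, together with the written justifications, are where the evaluation criteria themselves get sharpened.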

Active Learning
Targeted Data Collection

Active learning identifies the highest-leverage examples, reducing annotation volume without sacrificing coverage.
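The core of targeted collection is an acquisition rule. One common choice is uncertainty sampling: annotate the examples the current model is least confident about. A minimal sketch, where `predict_proba` and the example pool are stand-ins for your model and data:

```python
def select_for_annotation(pool, predict_proba, budget):
    """Uncertainty sampling: rank unlabelled examples by how unsure the
    current model is (1 - max class probability) and return the top
    `budget` examples for expert annotation."""
    def uncertainty(example):
        return 1.0 - max(predict_proba(example))
    return sorted(pool, key=uncertainty, reverse=True)[:budget]

# Stand-in model: fixed class probabilities per example.
probs = {"x1": [0.90, 0.10], "x2": [0.55, 0.45], "x3": [0.70, 0.30]}
print(select_for_annotation(["x1", "x2", "x3"], probs.get, 2))  # → ['x2', 'x3']
```

With a fixed annotation budget, spending it on the near-coin-flip cases buys more training signal than labelling examples the model already handles.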

Preference Data
Preference Generation

Structured preference datasets built for your domain and use case. We work with you to define evaluation criteria, then produce preference data that reflects how real practitioners make judgments.

Red-teaming
Failure Testing

Domain experts attempt to elicit failure modes in your model - factual errors, reasoning gaps, dangerous outputs. Failures found by experts in a field are qualitatively different from failures found by generalist red-teamers.

Audit
Pipeline Review

We review your existing annotation guidelines, calibration process, and quality controls.

What most annotation pipelines get wrong.

These aren't edge cases. They're the standard state of annotation pipelines not designed with domain expertise in mind.

Without: Generalist annotators judge whether an output is "good" based on surface plausibility, not whether it is actually correct in the domain.
With humanEVALS: Domain experts evaluate whether the output is substantively correct: whether someone who knows the field would trust it, act on it, or stake their name on it.
Without: No distinction between a wrong answer and a correct answer framed badly. Both fail, for different reasons that require different fixes.
With humanEVALS: Failures are categorised: factual error, reasoning gap, overconfident output, register mismatch. Each type requires a different training intervention.
Without: Audience calibration is inconsistent; the model frames outputs the same way regardless of who is reading them or how the output will be used.
With humanEVALS: Experts calibrate responses by context. Preference data explicitly encodes what appropriate looks like for a given audience, use case, and domain.
Without: Annotator quality drifts undetected. Fatigue, inconsistency, and changing standards affect data quality invisibly.
With humanEVALS: Blind calibration tasks run throughout. Quality drift is caught before it corrupts training data.
01
Understand your model

We look at what your model gets wrong today - which task types, which edge cases. We scope the intervention from there, not from a generic template.

02
Match the experts

We identify and onboard the right specialists from our network for your domain, calibrating them against your specific task before any production annotation begins.

03
Run the pipeline

Annotation, preference collection, or red-teaming, whichever your training stage requires. You receive data with provenance, quality scores, and inter-rater statistics included.

04
Iterate with you

We review model outputs post-training, identify where signal needs to be refined, and adjust. This is not a one-time delivery - it's a working relationship tied to your model's trajectory.

Tell us what your
model gets wrong
today.

We work with teams at all stages, from labs designing their first RLHF pipeline to established teams auditing annotation processes that have been running for years. The conversation starts with what your model gets wrong today.

Get in touch