humanEVALS

The people who teach a child determine what that child can think. Not the curriculum. Not the textbook. The person standing in front of them - what they notice, what they care about, what they know well enough to explain badly and then better.

AI is no different. The models being built today are the accumulated judgment of everyone who ever told them what was right, what was useful, what was good. That judgment is in there, invisibly, shaping every output.

Most of the industry is trying to scale the process of collecting that judgment. humanEVALS is trying to improve the quality of it.

Because scale without quality doesn't produce better AI. It produces more confident AI. And those are not the same thing.

We find the people whose judgment is worth having, verify that it is, and build the infrastructure to apply it.
See how we work →

Expert Network

Domain knowledge,
not crowdsourced
opinion.

AI evaluation requires people who have spent years inside a field. We work with practitioners: clinicians who see patients, engineers who write production code, attorneys who argue cases.

01-Medicine
Clinical & Medical
Physicians, specialists, and researchers who can evaluate diagnostic reasoning, treatment recommendations, and medical literature.
Internal Medicine · Oncology · Radiology · Psychiatry · Surgery
02-Law
Legal & Regulatory
Practicing attorneys evaluating legal reasoning, contract interpretation, case analysis, and jurisdictional nuance.
Corporate Law · IP · Litigation · Compliance · Tax
03-Engineering
Software & Systems
Engineers from production environments evaluating code correctness, architecture decisions, security posture, and debugging approaches.
Backend · ML Systems · Security · Distributed
04-Finance
Finance & Economics
Portfolio managers, analysts, and economists evaluating financial reasoning, risk models, and market analysis.
Equity Research · Risk · Macro · Derivatives
05-Science
Research & Academia
PhD researchers evaluating scientific claims, experimental methodology, and literature synthesis across disciplines.
Biology · Chemistry · Physics · Climate · Neuroscience
06-Strategy
Business & Strategy
Management consultants and operators evaluating strategic recommendations, market sizing, and organisational reasoning.
MBB · Operations · M&A · GTM
07-Language
Linguistics & Translation
Native speakers and linguists evaluating multilingual output for accuracy, register, cultural appropriateness, and idiomatic precision.
Hindi · Arabic · Mandarin · French · +40 more
08-Education
Pedagogy & Assessment
Educators and curriculum designers evaluating explanations, grade-level calibration, and the accuracy of instructional content.
K–12 · Higher Ed · STEM · Curriculum Design

How an expert enters the network.

We do not onboard quickly. The evaluation of experts is as rigorous as the evaluation of models.

01
Credential verification

We verify degrees, licences, publications, and institutional affiliations directly.

02
Domain calibration test

Each applicant completes a structured evaluation in their domain using tasks that already have known answers. We are measuring how they reason, not just whether they are correct. Borderline cases are reviewed by a senior expert in the same field.

03
Inter-rater reliability check

New experts evaluate the same tasks as established network members. We measure agreement rates and flag systematic divergence. Low agreement is not automatically disqualifying; it opens a conversation about where and why judgment differs.
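Agreement between a new expert and the network baseline can be summarised with Cohen's kappa, which corrects raw agreement for what two raters would agree on by chance. A minimal sketch in Python; the function name and the pass/fail labels are illustrative, not part of our tooling:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# A new expert and an established member labelling the same ten outputs.
new_expert = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
baseline   = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(new_expert, baseline), 2))  # → 0.74
```

Raw agreement here is 90%, but kappa lands lower because many matches are expected by chance alone when one label dominates.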

04
Ongoing quality monitoring

Every expert is re-evaluated periodically through blind calibration tasks embedded in live work. Performance that drifts from baseline is flagged and reviewed.
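The drift check itself can be as simple as comparing recent blind-task accuracy against an expert's established baseline. A hedged sketch; the function name, window size, and tolerance are illustrative assumptions, not our production thresholds:

```python
def flag_drift(gold_task_scores, baseline_accuracy, window=20, tolerance=0.10):
    """Return True when an expert's accuracy on the most recent `window`
    blind calibration tasks falls more than `tolerance` below their
    baseline. gold_task_scores: 1 (correct) / 0 (incorrect), oldest first."""
    if len(gold_task_scores) < window:
        return False  # not enough recent gold tasks to judge drift
    recent = sum(gold_task_scores[-window:]) / window
    return recent < baseline_accuracy - tolerance

# An expert with a 92% baseline who scored 15/20 on recent gold tasks.
print(flag_drift([1] * 15 + [0] * 5, baseline_accuracy=0.92))  # → True
```

A flag here triggers human review, not removal; drift can mean fatigue, a shifted task mix, or a genuinely contested standard.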

Active · Medical
Dr. Ananya Krishnan
Cardiologist, 14 years clinical practice · Chennai & remote

"I evaluate whether a model's recommendation would lead a real patient to harm. That requires knowing not just what the guidelines say, but when the guidelines don't apply."

  • Specialisation: Interventional Cardiology, Heart Failure
  • Credential verified: MBBS, MD (AIIMS Delhi) · MCI Registration confirmed
  • Evaluation domains: Diagnostic reasoning · Treatment planning · Patient communication · Drug interaction
  • Inter-rater agreement: Within network baseline (κ = 0.81)
  • Network since: March 2024
For companies
You need evaluators who know the field.

Tell us the domain, the task type, and the output format you need evaluated. We match you with verified experts and manage quality throughout.

Tell us what you need
For experts
Your judgment shapes how models think.

If you have deep domain expertise and want to apply it to one of the more consequential problems in technology right now, we want to hear from you.

Apply to the network

For companies

Tell us what you
need evaluated.

We match you with verified domain experts for your specific task: evaluation, annotation, red-teaming, or preference ranking. We manage quality throughout, not just at the point of delivery.

For experts

Your judgment is
worth something.

We are looking for people with deep, verifiable expertise who want to apply it to one of the more consequential problems in technology: shaping how AI understands the world.

Training Solutions

Better signal.
At every stage
of training.

A trained model is only as good as the feedback it receives. We provide that human signal across SFT, RLHF, active learning, and preference data, using verified domain experts who understand what correct actually looks like in each field.

Talk to us
For teams building domain-specific AI
AI labs fine-tuning for specific domains
Companies using GPT or open-source models
Teams auditing existing annotation pipelines
SFT
Supervised Fine-Tuning Data

High quality demonstration data written or reviewed by domain experts. A practitioner writing a solution thinks differently from a generalist annotator. That difference compounds in the model.

RLHF
Reward Modeling

Expert annotators compare model outputs and provide preference rankings with written justifications. We track inter-rater agreement and surface disagreement because disagreement often contains the most signal.
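Surfacing disagreement can be a simple pass over expert votes per comparison. A sketch under assumed data shapes; the vote format, function name, and threshold are hypothetical:

```python
def split_decisions(votes, threshold=0.8):
    """votes maps a comparison id to the option ("A" or "B") each expert
    preferred. Comparisons where the majority preference falls below
    `threshold` are routed to senior review instead of going straight
    into the reward-model training set."""
    flagged = []
    for pair_id, prefs in votes.items():
        top_share = max(prefs.count(option) for option in set(prefs)) / len(prefs)
        if top_share < threshold:
            flagged.append(pair_id)
    return flagged

votes = {
    "pair-1": ["A", "A", "A", "A", "A"],  # unanimous: keep as-is
    "pair-2": ["A", "B", "A", "B", "A"],  # 3/5 split: send to review
}
print(split_decisions(votes))  # → ['pair-2']
```

The flagged pairs, together with the written justifications, are where the evaluation criteria themselves get sharpened.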

Active Learning
Targeted Data Collection

Active learning identifies the highest-leverage examples, reducing annotation volume without sacrificing coverage.
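The core of targeted collection is an acquisition rule. One common choice is uncertainty sampling: annotate the examples the current model is least confident about. A minimal sketch, where `predict_proba` and the example pool are stand-ins for your model and data:

```python
def select_for_annotation(pool, predict_proba, budget):
    """Uncertainty sampling: rank unlabelled examples by how unsure the
    current model is (1 - max class probability) and return the top
    `budget` examples for expert annotation."""
    def uncertainty(example):
        return 1.0 - max(predict_proba(example))
    return sorted(pool, key=uncertainty, reverse=True)[:budget]

# Stand-in model: fixed class probabilities per example.
probs = {"x1": [0.90, 0.10], "x2": [0.55, 0.45], "x3": [0.70, 0.30]}
print(select_for_annotation(["x1", "x2", "x3"], probs.get, 2))  # → ['x2', 'x3']
```

With a fixed annotation budget, spending it on the near-coin-flip cases buys more training signal than labelling examples the model already handles.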

Preference Data
Preference Generation

Structured preference datasets built for your domain and use case. We work with you to define evaluation criteria, then produce preference data that reflects how real practitioners make judgments.

Red-teaming
Failure Testing

Domain experts attempt to elicit failure modes in your model - factual errors, reasoning gaps, dangerous outputs. Failures found by experts in a field are qualitatively different from failures found by generalist red-teamers.

Audit
Pipeline Review

We review your existing annotation guidelines, calibration process, and quality controls.

What most annotation pipelines get wrong.

These aren't edge cases. They're the standard state of annotation pipelines not designed with domain expertise in mind.

Without: Generalist annotators judge whether an output is "good" based on surface plausibility, not whether it is actually correct in the domain.
With humanEVALS: Domain experts evaluate whether the output is substantively correct: whether someone who knows the field would trust it, act on it, or stake their name on it.
Without: No distinction between a wrong answer and a correct answer framed badly. Both fail, for different reasons that require different fixes.
With humanEVALS: Failures are categorised: factual error, reasoning gap, overconfident output, register mismatch. Each type requires a different training intervention.
Without: Audience calibration is inconsistent; the model frames outputs the same way regardless of who is reading them or how the output will be used.
With humanEVALS: Experts calibrate responses by context. Preference data explicitly encodes what appropriate looks like for a given audience, use case, and domain.
Without: Annotator quality drifts undetected. Fatigue, inconsistency, and changing standards affect data quality invisibly.
With humanEVALS: Blind calibration tasks run throughout. Quality drift is caught before it corrupts training data.
01
Understand your model

We look at what your model gets wrong today - which task types, which edge cases. We scope the intervention from there, not from a generic template.

02
Match the experts

We identify and onboard the right specialists from our network for your domain, calibrating them against your specific task before any production annotation begins.

03
Run the pipeline

Annotation, preference collection, or red-teaming, whichever your training stage requires. You receive data with provenance, quality scores, and inter-rater statistics included.

04
Iterate with you

We review model outputs post-training, identify where signal needs to be refined, and adjust. This is not a one-time delivery - it's a working relationship tied to your model's trajectory.

Tell us what your
model gets wrong
today.

We work with teams at all stages, from labs designing their first RLHF pipeline to established teams auditing annotation processes that have been running for years. The conversation starts with what your model gets wrong today.

Get in touch