Building 40+ AI Agents with LangChain for an Enterprise Recruitment Platform
Abhishek Sharma
Software Developer
When the spec for NCHRecruitPro landed on my desk, the AI requirements section was two pages long. Resume matching. Culture-fit scoring. Predictive retention analysis. Job description generation. Skill gap analysis. Org chart recommendations. Compensation benchmarking. Interview question generation. Bias detection in screening. The list went on. After counting, I had 40+ distinct AI capabilities to build for an enterprise recruitment platform.
The naive approach—one monolithic prompt that tries to do everything—fails catastrophically at this scale. The correct approach, which I'll walk through in detail, is an agent-based architecture where each capability is an independent, testable, composable unit. Here is how I built it with LangChain, how I kept costs near zero using Groq's free tier, and what I learned about orchestrating dozens of AI agents in production.
Why Specialized Agents Beat a Monolithic Prompt
Before I show any code, let me justify the architectural decision. I built a proof-of-concept with a single massive prompt that handled resume matching and job description generation. The problems surfaced immediately:
- Context window pollution. Instructions for resume matching interfered with JD generation. The model would start scoring resumes when asked to generate a job description because both instruction sets were in context simultaneously.
- Testing impossibility. How do you write unit tests for a 4,000-token system prompt that does 10 different things? You cannot isolate failures.
- Prompt versioning chaos. Improving the culture-fit scoring prompt accidentally degraded skill gap analysis because they shared preamble instructions.
- Token waste. Every call sent the full 4,000-token system prompt even if you only needed one capability. At scale, this is a real cost multiplier.
Specialized agents solve all of these. Each agent has a focused system prompt (200-500 tokens), its own tools, its own output parser, and its own test suite. They compose through a coordinator pattern rather than prompt concatenation.
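To make the "one focused unit per capability" idea concrete, here is a minimal sketch of how agents could be described and looked up by name. It is an illustration under assumptions, not the NCHRecruitPro code: AgentSpec, AgentRegistry, ModelCall, and the two sample agents are hypothetical names, and the model call is injected so the sketch runs without any LLM provider.

```typescript
// Hypothetical types for illustration; none of these names come from the real codebase.
type ModelCall = (systemPrompt: string, userMessage: string) => Promise<string>;

interface AgentSpec<I, O> {
  name: string;
  systemPrompt: string;                   // focused prompt for one capability only
  buildUserMessage: (input: I) => string; // turns typed input into the human message
  parse: (raw: string) => O;              // agent-specific output parser
}

class AgentRegistry {
  private readonly agents = new Map<string, AgentSpec<any, any>>();

  register<I, O>(spec: AgentSpec<I, O>): void {
    if (this.agents.has(spec.name)) {
      throw new Error(`Duplicate agent: ${spec.name}`);
    }
    this.agents.set(spec.name, spec);
  }

  get(name: string): AgentSpec<any, any> {
    const spec = this.agents.get(name);
    if (!spec) throw new Error(`Unknown agent: ${name}`);
    return spec;
  }

  // Only the requested agent's prompt is sent to the model.
  async run(name: string, input: unknown, callModel: ModelCall): Promise<unknown> {
    const spec = this.get(name);
    const raw = await callModel(spec.systemPrompt, spec.buildUserMessage(input));
    return spec.parse(raw);
  }
}

const registry = new AgentRegistry();

registry.register<{ title: string }, string>({
  name: "jd-generator",
  systemPrompt: "You write job descriptions.",
  buildUserMessage: (input) => `Write a job description for: ${input.title}`,
  parse: (raw) => raw.trim(),
});

registry.register<{ resume: string }, { score: number }>({
  name: "resume-match",
  systemPrompt: "You score resumes against job descriptions.",
  buildUserMessage: (input) => `RESUME:\n${input.resume}`,
  parse: (raw) => JSON.parse(raw) as { score: number },
});
```

Because each spec carries only its own prompt, a request for a job description never ships the resume-scoring instructions, and each parse function can be unit-tested with canned strings.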
System Architecture
+------------------------------------------------------------------+
|                       Next.js 16 Frontend                        |
+------------------------------------------------------------------+
                                 |
                     REST API / Server Actions
                                 |
+------------------------------------------------------------------+
|                        Agent Coordinator                         |
|  - Routes requests to appropriate agents                         |
|  - Manages multi-agent workflows                                 |
|  - Handles fallback and retry logic                              |
+------------------------------------------------------------------+
     |             |             |             |             |
 +--------+    +--------+    +--------+    +--------+    +--------+
 | Resume |    | JD Gen |    |Culture |    |Predict.|    | Skill  |
 | Match  |    | Agent  |    |  Fit   |    |Retent. |    |  Gap   |
 | Agent  |    |        |    | Agent  |    | Agent  |    | Agent  |
 +--------+    +--------+    +--------+    +--------+    +--------+
     |             |             |             |             |
+------------------------------------------------------------------+
|                         AI Model Factory                         |
|    Primary: Groq (llama-3.3-70b) | Fallback: Gemini | OpenAI     |
+------------------------------------------------------------------+
                                 |
+------------------------------------------------------------------+
|   PostgreSQL | Vector Store (pgvector) | LangSmith Monitoring    |
+------------------------------------------------------------------+
The AI Model Factory Pattern
The first infrastructure problem to solve was LLM provider abstraction. NCHRecruitPro uses Groq as the primary provider because Llama 3.3 70B on Groq's free tier is absurdly capable for structured tasks. But free tiers have rate limits, and production systems need reliability guarantees. The Model Factory handles transparent failover:
// lib/ai/model-factory.ts
import { ChatGroq } from "@langchain/groq";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { ChatOpenAI } from "@langchain/openai";
type ModelProvider = "groq" | "gemini" | "openai";
interface ModelConfig {
temperature?: number;
maxTokens?: number;
forceProvider?: ModelProvider;
}
const PROVIDER_CHAIN: ModelProvider[] = ["groq", "gemini", "openai"];
export class AIModelFactory {
private static failureCounts: Record<ModelProvider, number> = {
groq: 0,
gemini: 0,
openai: 0,
};
private static lastFailure: Record<ModelProvider, number> = {
groq: 0,
gemini: 0,
openai: 0,
};
static getModel(config: ModelConfig = {}) {
const { temperature = 0.1, maxTokens = 4096, forceProvider } = config;
if (forceProvider) {
return this.createProvider(forceProvider, temperature, maxTokens);
}
// Return the first provider that isn't in a cooldown period
for (const provider of PROVIDER_CHAIN) {
if (this.isProviderHealthy(provider)) {
return this.createProvider(provider, temperature, maxTokens);
}
}
// All providers degraded; fall back to OpenAI (paid, most reliable)
return this.createProvider("openai", temperature, maxTokens);
}
private static isProviderHealthy(provider: ModelProvider): boolean {
const COOLDOWN_MS = 60_000; // 1 minute cooldown after 3 failures
const MAX_FAILURES = 3;
if (this.failureCounts[provider] >= MAX_FAILURES) {
const elapsed = Date.now() - this.lastFailure[provider];
if (elapsed < COOLDOWN_MS) return false;
// Reset after cooldown
this.failureCounts[provider] = 0;
}
return true;
}
private static createProvider(
provider: ModelProvider,
temperature: number,
maxTokens: number,
) {
switch (provider) {
case "groq":
return new ChatGroq({
model: "llama-3.3-70b-versatile",
temperature,
maxTokens,
apiKey: process.env.GROQ_API_KEY,
});
case "gemini":
return new ChatGoogleGenerativeAI({
model: "gemini-1.5-flash",
temperature,
maxTokens,
apiKey: process.env.GOOGLE_API_KEY,
});
case "openai":
return new ChatOpenAI({
model: "gpt-4o-mini",
temperature,
maxTokens,
apiKey: process.env.OPENAI_API_KEY,
});
}
}
static recordFailure(provider: ModelProvider) {
this.failureCounts[provider]++;
this.lastFailure[provider] = Date.now();
}
static recordSuccess(provider: ModelProvider) {
this.failureCounts[provider] = Math.max(0, this.failureCounts[provider] - 1);
}
}
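The key insight here is the circuit breaker pattern. Once three failures have been recorded for Groq (usually rate limit errors on the free tier), the factory stops trying Groq for 60 seconds and routes to Gemini instead. This prevents cascading timeouts. Note that getModel only reads the counters: whatever code invokes the model has to call recordFailure and recordSuccess for the breaker to trip, and because recordSuccess decrements the count rather than resetting it, the three failures do not have to be strictly consecutive. In practice, about 90% of our requests are served by Groq's free tier, 8% by Gemini's free tier, and only 2% hit the paid OpenAI endpoint. Monthly AI cost for an application with 40+ agents: under $15.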
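The breaker logic is easier to test when it is separated from provider construction. The following is a minimal sketch under assumptions, not the factory's actual code: ProviderHealth and callWithBreaker are hypothetical names, the clock is injected so the cooldown can be tested without waiting, and a success resets the counter (the factory above decrements instead).

```typescript
// Hypothetical standalone version of the health tracking inside AIModelFactory.
type Provider = "groq" | "gemini" | "openai";

class ProviderHealth {
  private readonly failures = new Map<Provider, number>();
  private readonly lastFailure = new Map<Provider, number>();
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 60_000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  recordFailure(provider: Provider): void {
    this.failures.set(provider, (this.failures.get(provider) ?? 0) + 1);
    this.lastFailure.set(provider, this.now());
  }

  // Assumption: any success resets the count, so only consecutive failures trip the breaker.
  recordSuccess(provider: Provider): void {
    this.failures.set(provider, 0);
  }

  isHealthy(provider: Provider): boolean {
    if ((this.failures.get(provider) ?? 0) < this.maxFailures) return true;
    const elapsed = this.now() - (this.lastFailure.get(provider) ?? 0);
    if (elapsed < this.cooldownMs) return false;
    this.failures.set(provider, 0); // cooldown over: give the provider another chance
    return true;
  }

  // First healthy provider in the chain, or the fallback if all are cooling down.
  pick(chain: Provider[], fallback: Provider): Provider {
    return chain.find((p) => this.isHealthy(p)) ?? fallback;
  }
}

// Wraps a model call so every outcome is reported back to the breaker.
async function callWithBreaker<T>(
  health: ProviderHealth,
  chain: Provider[],
  call: (provider: Provider) => Promise<T>,
): Promise<T> {
  const provider = health.pick(chain, "openai");
  try {
    const result = await call(provider);
    health.recordSuccess(provider);
    return result;
  } catch (error) {
    health.recordFailure(provider);
    throw error;
  }
}
```

The point of callWithBreaker is that the provider choice and the outcome report happen in the same place, so a retry loop around it rotates providers without any extra bookkeeping.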
Agent Structure: Anatomy of the Resume Matching Agent
Every agent in the system follows the same pattern: a system prompt, input/output schemas, optional tools, and an output parser. Here is the complete resume matching agent:
// lib/ai/agents/resume-match.agent.ts
import { z } from "zod";
import { StructuredOutputParser } from "langchain/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { AIModelFactory } from "../model-factory";
// Strict output schema
const ResumeMatchSchema = z.object({
overallScore: z.number().min(0).max(100),
skillMatch: z.object({
score: z.number().min(0).max(100),
matched: z.array(z.string()),
missing: z.array(z.string()),
bonus: z.array(z.string()),
}),
experienceMatch: z.object({
score: z.number().min(0).max(100),
yearsRequired: z.number(),
yearsActual: z.number(),
relevantRoles: z.array(z.string()),
}),
educationMatch: z.object({
score: z.number().min(0).max(100),
meets_requirement: z.boolean(),
details: z.string(),
}),
reasoning: z.string(),
recommendation: z.enum(["STRONG_YES", "YES", "MAYBE", "NO", "STRONG_NO"]),
});
type ResumeMatchResult = z.infer<typeof ResumeMatchSchema>;
const parser = StructuredOutputParser.fromZodSchema(ResumeMatchSchema);
const SYSTEM_PROMPT = `You are an expert technical recruiter analyzing resume-to-job
compatibility. You evaluate candidates objectively based on skills, experience,
and qualifications.
SCORING RULES:
- Skill Match (40% weight): Exact tech stack matches score highest. Adjacent
technologies (e.g., React experience when Vue is required) score 50%.
- Experience (35% weight): Meeting the year requirement = 100%. Each year
under = -15 points. Over-qualification does not add points beyond 100%.
- Education (15% weight): Exact degree match = 100%. Related field = 75%.
No degree with equivalent experience = 60%.
- Bonus (10% weight): Open source contributions, certifications, publications.
BIAS PREVENTION:
- Do NOT factor in candidate name, gender, age, or location.
- Score based solely on demonstrated skills and experience.
- If two candidates have equivalent qualifications, they MUST receive
equivalent scores.
{format_instructions}`;
const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_PROMPT],
["human", "JOB DESCRIPTION:\n{jobDescription}\n\nRESUME:\n{resume}\n\nAnalyze this match."],
]);
export async function matchResume(
jobDescription: string,
resume: string,
): Promise<ResumeMatchResult> {
const model = AIModelFactory.getModel({ temperature: 0.1 });
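// prompt.partial() is async in LangChain JS, so resolve it before composing the chain
const promptWithFormat = await prompt.partial({
format_instructions: parser.getFormatInstructions(),
});
const chain = RunnableSequence.from([promptWithFormat, model, parser]);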
const result = await chain.invoke({ jobDescription, resume });
return result;
}
Several design decisions deserve explanation:
Zod schema for output parsing. LangChain's StructuredOutputParser with Zod schemas injects format instructions into the system prompt and validates the response. If the LLM returns malformed JSON (which happens roughly 3% of the time with Llama 3.3), the parser throws a typed error that the coordinator can catch and retry.
Low temperature (0.1). Resume matching is an evaluation task, not a creative one. We want consistent, reproducible scores. At temperature 0.7, the same resume scored against the same JD produced scores ranging from 62 to 84. At 0.1, the spread narrows to about plus or minus 3 points.
Bias prevention in the prompt. This is not theoretical. In early testing without the bias prevention section, candidates with Indian names scored 4-7 points lower on average than equivalent profiles with Western names. Adding explicit bias instructions eliminated the gap in our testing. We also strip names from resumes before passing them to the agent as an additional safeguard.
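The scoring rules in the system prompt are simple enough to recompute deterministically, which gives a cheap sanity check on the overallScore the model reports. The sketch below is an assumption-laden illustration rather than part of the agent: the numeric bonus sub-score, the floor of 0 on experience, and the 5-point tolerance are all hypothetical (the schema above has no numeric bonus field).

```typescript
// Weights as integer percentages, taken from the SCORING RULES in the prompt.
const WEIGHTS = { skill: 40, experience: 35, education: 15, bonus: 10 };

interface SubScores {
  skill: number;      // 0-100
  experience: number; // 0-100
  education: number;  // 0-100
  bonus: number;      // 0-100; hypothetical, the article's schema has no numeric bonus score
}

// Experience rule from the prompt: meeting the requirement scores 100 and each
// missing year costs 15 points. The floor at 0 is an assumption.
function experienceScore(yearsRequired: number, yearsActual: number): number {
  const missingYears = Math.max(0, yearsRequired - yearsActual);
  return Math.max(0, 100 - 15 * missingYears);
}

// Weighted sum of the four sub-scores, rounded to a whole number.
function weightedOverall(s: SubScores): number {
  const total =
    s.skill * WEIGHTS.skill +
    s.experience * WEIGHTS.experience +
    s.education * WEIGHTS.education +
    s.bonus * WEIGHTS.bonus;
  return Math.round(total / 100);
}

// Flags results whose reported overall score drifts from the recomputed one.
function isConsistent(reportedOverall: number, s: SubScores, tolerance = 5): boolean {
  return Math.abs(reportedOverall - weightedOverall(s)) <= tolerance;
}
```

For example, a candidate with 3 of 5 required years gets an experience score of 70 under this rule. A check like isConsistent could run after parsing and trigger a retry when the model's arithmetic disagrees with its own sub-scores.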
The Agent Coordinator Pattern
Many recruitment workflows require multiple agents working in sequence. For example, the "Full Candidate Assessment" workflow runs five agents in a specific order with data flowing between them:
// lib/ai/coordinator.ts
import { matchResume } from "./agents/resume-match.agent";
import { scoreCultureFit } from "./agents/culture-fit.agent";
import { predictRetention } from "./agents/retention.agent";
import { analyzeSkillGaps } from "./agents/skill-gap.agent";
import { generateInterviewQuestions } from "./agents/interview-gen.agent";
import { AIModelFactory } from "./model-factory";
interface CoordinatorInput {
jobDescription: string;
resume: string;
companyValues?: string[];
teamProfile?: object;
}
interface FullAssessment {
resumeMatch: Awaited<ReturnType<typeof matchResume>>;
cultureFit: Awaited<ReturnType<typeof scoreCultureFit>> | null;
retention: Awaited<ReturnType<typeof predictRetention>> | null;
skillGaps: Awaited<ReturnType<typeof analyzeSkillGaps>>;
interviewQuestions: string[];
aggregateScore: number;
processingTime: number;
}
export async function fullCandidateAssessment(
input: CoordinatorInput,
): Promise<FullAssessment> {
const start = Date.now();
const results: Partial<FullAssessment> = {};
// Phase 1: Resume match (must run first - gates everything else)
results.resumeMatch = await withRetry(
() => matchResume(input.jobDescription, input.resume),
"resume-match",
);
// Early exit: if resume match is STRONG_NO, skip expensive downstream agents
if (results.resumeMatch.recommendation === "STRONG_NO") {
return {
...results as FullAssessment,
cultureFit: null,
retention: null,
skillGaps: { gaps: [], recommendations: [] },
interviewQuestions: [],
aggregateScore: results.resumeMatch.overallScore,
processingTime: Date.now() - start,
};
}
// Phase 2: Run independent agents in parallel
const [cultureFit, skillGaps] = await Promise.allSettled([
withRetry(
() => scoreCultureFit(input.resume, input.companyValues || []),
"culture-fit",
),
withRetry(
() => analyzeSkillGaps(input.resume, input.jobDescription),
"skill-gap",
),
]);
results.cultureFit = cultureFit.status === "fulfilled" ? cultureFit.value : null;
results.skillGaps = skillGaps.status === "fulfilled"
? skillGaps.value
: { gaps: [], recommendations: [] };
// Phase 3: Retention prediction (depends on resume match + culture fit)
results.retention = await withRetry(
() => predictRetention(
input.resume,
results.resumeMatch!.overallScore,
results.cultureFit?.score ?? 50, // ?? keeps a legitimate score of 0; || would replace it with 50
),
"retention",
);
// Phase 4: Generate interview questions (depends on skill gaps)
results.interviewQuestions = await withRetry(
() => generateInterviewQuestions(
input.jobDescription,
results.skillGaps!.gaps,
results.resumeMatch!.skillMatch.missing,
),
"interview-gen",
);
// Aggregate score: weighted combination
results.aggregateScore = calculateAggregate(results as FullAssessment);
results.processingTime = Date.now() - start;
return results as FullAssessment;
}
async function withRetry<T>(
fn: () => Promise<T>,
agentName: string,
maxRetries: number = 2,
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const result = await fn();
return result;
} catch (error) {
console.error(`Agent ${agentName} failed (attempt ${attempt + 1}):`, error);
if (attempt === maxRetries) throw error;
// On retry, the Model Factory will naturally rotate providers
// if the current provider is failing
await new Promise((r) => setTimeout(r, 1000 * (attempt + 1)));
}
}
throw new Error("Unreachable");
}
The coordinator implements three critical patterns:
1. Early exit gating. If the resume match returns STRONG_NO, there is no point running culture-fit scoring, retention prediction, or interview question generation. Each downstream agent costs tokens and time. The early exit reduces average pipeline cost by roughly 30% because about a third of candidates are clear non-matches.
2. Parallel execution where possible. Culture-fit and skill-gap analysis are independent—neither needs the other's output. Running them in parallel with Promise.allSettled cuts wall-clock time by 40-50%. Using allSettled instead of all is intentional: if one agent fails, we still get the other's results. A null culture-fit score degrades the assessment gracefully rather than crashing the entire pipeline.
3. Typed retry with provider rotation. The withRetry wrapper catches failures and retries with a linearly increasing delay (1 second, then 2 seconds). Because the Model Factory tracks failure counts, retries shift to a different provider once the current one trips the breaker, provided failures are reported through recordFailure: Groq rate-limits repeatedly, the third recorded failure marks it unhealthy, and the next attempt routes to Gemini.
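The coordinator calls calculateAggregate without showing it. One plausible shape, sketched here under stated assumptions, is a weighted average that renormalizes when an optional agent returned null. The 60/25/15 weights, the ScoreInputs type, and the function name aggregateScore are hypothetical, chosen only to illustrate the idea.

```typescript
// Hypothetical inputs: plain 0-100 scores, with null for agents that failed or were skipped.
interface ScoreInputs {
  resumeMatch: number;
  cultureFit: number | null;
  retention: number | null;
}

// Hypothetical weights as integer percentages; the real weighting is not given in the article.
const AGGREGATE_WEIGHTS: Record<keyof ScoreInputs, number> = {
  resumeMatch: 60,
  cultureFit: 25,
  retention: 15,
};

// Weighted average over the scores that are present. Missing scores drop out and
// the remaining weights are renormalized, so a failed agent does not drag the
// aggregate toward an arbitrary default.
function aggregateScore(inputs: ScoreInputs): number {
  let weightedSum = 0;
  let totalWeight = 0;
  const keys = Object.keys(AGGREGATE_WEIGHTS) as (keyof ScoreInputs)[];
  for (const key of keys) {
    const value = inputs[key];
    if (value === null) continue;
    weightedSum += value * AGGREGATE_WEIGHTS[key];
    totalWeight += AGGREGATE_WEIGHTS[key];
  }
  // resumeMatch is never null, so totalWeight is always at least 60.
  return Math.round(weightedSum / totalWeight);
}
```

Renormalizing is a design choice, not the only option: substituting a neutral default such as 50 for a missing score is simpler but pulls strong and weak candidates toward the middle whenever an agent fails.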
Making LangChain Work with Groq's API
Groq's API is OpenAI-compatible, but there are subtle differences that cause LangChain integration issues. The two main pain points and their solutions:
// Problem 1: Groq's free tier has aggressive rate limits
// Solution: Custom rate limiter middleware
import { Semaphore } from "async-mutex";
const groqSemaphore = new Semaphore(5); // max 5 concurrent Groq requests
export async function rateLimitedGroqCall<T>(fn: () => Promise<T>): Promise<T> {
const [, release] = await groqSemaphore.acquire();
try {
return await fn();
} finally {
// Groq free tier: 30 req/min. Hold each of the 5 slots for 10s after the call
// finishes so throughput stays at or below 5 slots / 10s = 30 req/min.
setTimeout(() => release(), 10_000);
}
}
// Problem 2: Groq occasionally returns empty content with finish_reason "length"
// when the output exceeds the model's generation limit.
// Solution: Detect and retry with lower maxTokens or switch provider.
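// Custom error type so callers can distinguish truncation from other failures
export class GroqTruncationError extends Error {
constructor(message: string) {
super(message);
this.name = "GroqTruncationError";
}
}
export function handleGroqResponse(response: any) {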
if (
response?.generations?.[0]?.[0]?.generationInfo?.finish_reason === "length"
) {
throw new GroqTruncationError(
"Response truncated by Groq length limit. Retrying with shorter output."
);
}
return response;
}
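A semaphore bounds concurrency, while the free-tier limit quoted above is a per-minute budget. A sliding-window limiter expresses that budget directly. This is a minimal sketch, assuming the 30 requests per minute figure from the comment above; SlidingWindowLimiter is a hypothetical helper, and the clock is injected so the behaviour can be checked without real delays.

```typescript
// Hypothetical helper: allows at most maxRequests starts within any windowMs span.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly maxRequests: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(maxRequests: number, windowMs: number, now: () => number = () => Date.now()) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns 0 and records the request if it may start now;
  // otherwise returns how many milliseconds the caller should wait.
  tryAcquire(): number {
    const current = this.now();
    const cutoff = current - this.windowMs;
    // Drop request times that have left the window.
    this.timestamps = this.timestamps.filter((ts) => ts > cutoff);
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(current);
      return 0;
    }
    // The oldest recorded request leaves the window at oldest + windowMs.
    const oldest = this.timestamps[0] ?? current;
    return oldest + this.windowMs - current;
  }
}
```

A caller would loop on tryAcquire, sleeping for the returned number of milliseconds whenever it is non-zero, and could fall through to the next provider instead of waiting when the delay is too long for an interactive request.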
Cost optimization results: Over a 30-day production period processing approximately 2,400 candidate assessments (each triggering 3-5 agents), the total LLM cost was $11.40. Groq free tier handled 89% of requests, Gemini free tier handled 9%, and OpenAI (gpt-4o-mini at $0.15/1M input tokens) handled 2%. For context, running the same volume through GPT-4o exclusively would have cost approximately $380.
Monitoring with LangSmith
With 40+ agents, observability is not optional. LangSmith provides trace-level visibility into every chain execution. The integration is environment-variable-based, requiring zero code changes:
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__xxxx
LANGCHAIN_PROJECT=nchrecruitpro-production
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
With these set, every LangChain invocation is automatically traced. In the LangSmith dashboard, I can see: the exact prompt sent to each agent, the raw LLM response, the parsed output, latency per step, token counts, and which provider handled the request. When an agent produces a bad score, I can pull up the trace and see exactly what the model received and returned.
The most valuable LangSmith feature for a multi-agent system is the trace waterfall. For a full candidate assessment, the waterfall shows the coordinator starting the resume match agent, then forking into parallel culture-fit and skill-gap agents, then converging into retention prediction. You can see exactly where time is being spent and which agent is the bottleneck.
Real Accuracy Metrics
We validated agent accuracy against human recruiter decisions on a dataset of 500 historical candidate evaluations:
- Resume Match Agent: 87% agreement with senior recruiter ranking (top 10 candidates in same order). Score correlation (Spearman): 0.82.
- Culture Fit Agent: 71% agreement. This is the weakest agent because culture fit is inherently subjective. We use this as a flag for discussion, not a filter.
- Skill Gap Agent: 93% precision on identifying missing required skills. 78% recall (misses some niche skills that use unconventional naming).
- JD Generator: 89% of generated JDs were used with minor edits. 11% required significant rewrites (usually for senior/executive roles with nuanced requirements).
- Interview Question Agent: 94% of generated questions rated "useful" or "very useful" by interviewers in a blind evaluation.
These numbers are good enough for an augmentation tool (helping recruiters work faster) but not good enough for full automation (replacing recruiters). The system is positioned as an assistant that pre-screens and surfaces insights, not as an autonomous decision-maker. That positioning is both technically honest and legally safer.
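For readers who want to run the same kind of validation, the two statistics quoted above (precision/recall for skill detection and Spearman rank correlation for scores) can be computed in a few lines. This is a generic sketch of the standard definitions, not the evaluation script used for the figures above; the function names are hypothetical.

```typescript
// Precision and recall over skill names, compared case-insensitively.
function precisionRecall(
  predicted: string[],
  actual: string[],
): { precision: number; recall: number } {
  const norm = (s: string) => s.trim().toLowerCase();
  const predictedSet = new Set(predicted.map(norm));
  const actualSet = new Set(actual.map(norm));
  const truePositives = Array.from(predictedSet).filter((s) => actualSet.has(s)).length;
  return {
    precision: predictedSet.size === 0 ? 0 : truePositives / predictedSet.size,
    recall: actualSet.size === 0 ? 0 : truePositives / actualSet.size,
  };
}

// 1-based ranks; tied values share the average of the ranks they span.
function ranks(values: number[]): number[] {
  const order = values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => a.value - b.value);
  const result = new Array<number>(values.length).fill(0);
  let pos = 0;
  while (pos < order.length) {
    let end = pos;
    while (end + 1 < order.length && order[end + 1].value === order[pos].value) end++;
    const averageRank = (pos + end) / 2 + 1;
    for (let k = pos; k <= end; k++) result[order[k].index] = averageRank;
    pos = end + 1;
  }
  return result;
}

// Spearman correlation is the Pearson correlation of the two rank vectors.
// Returns NaN if either series is constant (zero rank variance).
function spearman(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length < 2) {
    throw new Error("need two equal-length series with at least 2 points");
  }
  const ra = ranks(a);
  const rb = ranks(b);
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const meanA = mean(ra);
  const meanB = mean(rb);
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < ra.length; i++) {
    const da = ra[i] - meanA;
    const db = rb[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Here a would hold the agent's scores and b the recruiter's scores for the same candidates; for skill gaps, predicted is the agent's list of missing skills and actual is the recruiter's.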
Handling Agent Failures Gracefully
In a 40+ agent system, something is always failing. The design principle is that no single agent failure should crash the user experience. Here is the error handling hierarchy:
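// lib/ai/error-handling.ts
import { z } from "zod";
import { StructuredOutputParser } from "langchain/output_parsers";
import type { BaseChatModel } from "@langchain/core/language_models/chat_models";
// Level 1: Output parsing retry
// If the LLM returns malformed JSON, retry once with a "repair" prompt
export async function parseWithRepair<T extends z.ZodTypeAny>(
rawOutput: string,
parser: StructuredOutputParser<T>,
model: BaseChatModel,
): Promise<z.infer<T>> {
try {
return await parser.parse(rawOutput);
} catch (parseError) {
// Ask the model to fix its own output
const repairPrompt = `The following JSON is malformed. Fix it to match
this schema and return ONLY the corrected JSON:\n\n${rawOutput}\n\n
Schema: ${parser.getFormatInstructions()}`;
const repaired = await model.invoke(repairPrompt);
// Message content can be a string or an array of content parts; the parser needs a string
if (typeof repaired.content !== "string") {
throw new Error("Repair response did not contain plain-text content");
}
// A second parse failure propagates to the caller, where Levels 2 and 3 take over
return await parser.parse(repaired.content);
}
}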
// Level 2: Provider fallback (handled by Model Factory)
// Level 3: Graceful degradation
// Each agent result is Optional in the assessment.
// The UI renders available data and shows "unavailable" for failed agents.
export function buildPartialAssessment(
results: Record<string, { status: string; value?: any; error?: string }>,
) {
return {
available: Object.entries(results)
.filter(([, r]) => r.status === "fulfilled")
.map(([name, r]) => ({ agent: name, result: r.value })),
failed: Object.entries(results)
.filter(([, r]) => r.status === "rejected")
.map(([name, r]) => ({ agent: name, error: r.error })),
completeness: `${
Object.values(results).filter((r) => r.status === "fulfilled").length
}/${Object.keys(results).length} agents completed`,
};
}
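buildPartialAssessment expects each entry to carry an error string, while Promise.allSettled reports rejections under reason. A small adapter bridges the two. This is a sketch under assumptions: AgentOutcome, Settled, toAgentRecord, and runAgents are hypothetical names, not part of the codebase shown above.

```typescript
// Shape consumed by buildPartialAssessment: fulfilled entries carry a value,
// rejected entries carry an error message.
type AgentOutcome =
  | { status: "fulfilled"; value: unknown }
  | { status: "rejected"; error: string };

// Structural equivalent of PromiseSettledResult, declared locally for clarity.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Pairs agent names with their settled results and normalizes rejections to strings.
function toAgentRecord(
  names: string[],
  settled: Settled<unknown>[],
): Record<string, AgentOutcome> {
  if (names.length !== settled.length) {
    throw new Error("names and results must have the same length");
  }
  const record: Record<string, AgentOutcome> = {};
  settled.forEach((result, i) => {
    const name = names[i];
    record[name] =
      result.status === "fulfilled"
        ? { status: "fulfilled", value: result.value }
        : {
            status: "rejected",
            error: result.reason instanceof Error ? result.reason.message : String(result.reason),
          };
  });
  return record;
}

// Runs every agent thunk and never rejects: failures end up in the record.
async function runAgents(
  agents: Record<string, () => Promise<unknown>>,
): Promise<Record<string, AgentOutcome>> {
  const names = Object.keys(agents);
  const settled = await Promise.allSettled(
    // Promise.resolve().then(...) also captures thunks that throw synchronously.
    names.map((name) => Promise.resolve().then(() => agents[name]())),
  );
  return toAgentRecord(names, settled);
}
```

The record returned by runAgents can be passed straight to buildPartialAssessment, so the UI receives one consistent shape regardless of which agents failed.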
Lessons for Building Multi-Agent Systems
1. Start with one agent, get it production-quality, then replicate the pattern. I built the resume matching agent first and spent two weeks refining its prompt, output schema, error handling, and test suite. Every subsequent agent was built from that template in 2-4 hours. The upfront investment in the pattern paid for itself by agent number five.
2. Zod schemas are your best friend. Every agent has a strict output schema. This catches LLM output drift immediately rather than letting malformed data propagate through the system. When Groq updated their Llama deployment and the output format subtly changed, Zod caught it on the first request and the retry mechanism handled it automatically.
3. The coordinator is the hardest part. Individual agents are straightforward. Orchestrating them—deciding execution order, handling partial failures, managing timeouts, aggregating results—is where the real engineering complexity lives. Invest proportionally.
4. Free-tier LLMs are production-viable with the right architecture. Llama 3.3 70B on Groq's free tier handles structured extraction and scoring tasks at a quality level comparable to GPT-4o-mini. The Model Factory pattern with circuit breakers makes the free tier reliable enough for production by gracefully falling back when limits are hit. The roughly 90/8/2 split across Groq, Gemini, and OpenAI kept our LLM costs under $15/month for a system serving about 80 assessments (several hundred agent calls) a day.
5. Monitor everything from day one. LangSmith traces have saved me dozens of debugging hours. When a recruiter reports that a candidate got an unexpectedly low score, I pull the trace, see the exact prompt and response, and can determine whether it was a prompt issue, a parsing issue, or actually correct. Without traces, debugging a multi-agent system is guesswork.
Multi-agent systems are not about making AI smarter. They are about making AI engineering manageable. Forty focused, testable, composable agents are far easier to build, debug, and maintain than one omniscient agent that tries to do everything. The LangChain ecosystem provides the primitives. The architecture (Model Factory, Coordinator, typed output schemas, circuit breakers) is what makes it production-ready.