Building a Plug-and-Play Digital Proctoring Solution: Scaling to 50K Concurrent Users
How we built an independent, plug-and-play proctoring solution that handles 50,000 concurrent users—and the technical decisions that made it possible.
Digital proctoring is essential for maintaining integrity in online assessments, but existing solutions often come with significant limitations: high costs, scalability issues, and lack of customization. This is how we architected a plug-and-play proctoring system that handles 50k concurrent users, costs approximately ₹0.35 per user, and provides full control over the experience.
The solution: A lightweight SDK that integrates into any web application via a simple script tag, with a scalable backend that can handle massive concurrent loads.
Use Cases
This proctoring solution is designed as a plug-and-play system that can be integrated into various educational and assessment platforms:
- Online Examinations: High-stakes competitive exams, certification tests, and academic assessments requiring strict integrity monitoring.
- Live Class Attentiveness: Monitor student engagement during live online classes by tracking presence, attention levels, and participation patterns.
- Remote Interviews: Ensure candidate authenticity during remote hiring processes.
- Training & Certification: Track completion and authenticity for professional development courses.
- Adaptive Learning Assessments: Monitor student behavior during adaptive learning sessions to ensure genuine engagement.
The system is designed to be non-intrusive, with minimal performance impact on the host application, making it suitable for long-duration sessions (3+ hours) without degrading user experience.
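As a rough illustration of what "plug-and-play via a script tag" means in practice, integration might look like the snippet below. The CDN URL, global name, and option names here are hypothetical, not the SDK's actual API:

```html
<!-- Hypothetical embed: the URL, global object, and option names are illustrative -->
<script src="https://cdn.example.com/proctor-sdk.min.js"></script>
<script>
  ProctorSDK.init({
    testId: "TEST123",
    candidateId: "CAND456",
    token: "<JWT issued by your backend>",
    onBlocked: (reason) => console.warn("Proctoring blocked:", reason),
  });
</script>
```

The host application only supplies identity and a token; everything else (permissions, capture, upload) is handled inside the SDK.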
The Architecture: Client-Side Capture, Server-Side Orchestration
We designed a system with clear separation of concerns:
┌─────────────────────────────────────────────────────────────────────┐
│ CANDIDATE BROWSER (Host Application) │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ PROCTOR SDK (Script Tag Injection) │ │
│ │ │ │
│ │ • Permission checks (camera, mic, screen) │ │
│ │ • Face captures every 30s (25KB JPEG) │ │
│ │ • Screen screenshots every 30s (25KB JPEG) │ │
│ │ • Audio recording (continuous, 12kbps Opus) │ │
│ │ • Event monitoring (tab switches, keystrokes) │ │
│ │ • IndexedDB buffer for offline resilience │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Media │ │ Event │ │ IndexedDB │ │ │
│ │ │ Capture │─▶│ Collector │─▶│ Buffer │ │ │
│ │ │ (Workers) │ │ │ │ (Offline) │ │ │
│ │ └────────────┘ └────────────┘ └─────┬──────┘ │ │
│ └─────────────────────────────────────────┼──────────────────────┘ │
└─────────────────────────────────────────────┼──────────────────────────┘
│
┌─────────────────────────┼──────────────────────────┐
│ │ │
▼ │ ▼
┌───────────────────────────────┐ │ ┌───────────────────────────┐
│ PROCTOR BACKEND SERVICE (Go) │ │ │ S3 (Direct Upload) │
│ │ │ │ ap-south-1 │
│ • GET /api/v1/config │ │ │ │
│ • POST /api/v1/credentials │◄────────────┘ │ /{date}/{test_id}/ │
│ • POST /api/v1/events │ │ {candidate_id}/ │
│ • GET /api/v1/dashboard/* │ │ camera/ (25KB) │
│ │ │ screen/ (25KB) │
│ │ │ audio/ (14KB/30s) │
│ │ └───────────────────────────┘
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Kafka │ │
│ │ topic: proctor_events │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Kafka Connect │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ ClickHouse │ │
│ │ table: proctor_events │ │
│ └──────────────────────────┘ │
└───────────────────────────────────┘
The key insight: media uploads go directly to S3, bypassing the backend entirely, while events flow through our backend for validation and publishing to Kafka; Kafka Connect then streams them into ClickHouse.
Critical Design Decision: Non-Blocking by Default
This was our most important architectural principle: permission issues block the test; infrastructure failures never do.
Blocking Scenarios (Candidate Must Act)
- Permission denied (camera, mic, screen)
- Full screen exited
- Screen share stopped
- Unsupported browser
These are compliance issues. The candidate must fix them to continue.
Non-Blocking Scenarios (Graceful Degradation)
- Backend API unavailable → Events buffered in IndexedDB
- S3 upload failure → Retry with exponential backoff
- Kafka/ClickHouse down → Backend buffers internally
- STS credential failure → Retry 3x, then disable S3 uploads
The golden rule: A candidate should NEVER be blocked from taking their test due to a failure in the proctoring infrastructure.
This design choice meant we had to build robust retry logic, offline buffering, and graceful degradation at every layer.
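As a concrete sketch of that retry logic (our own illustration, not the SDK's actual code), the delay schedule is capped exponential backoff with full jitter, so thousands of candidates retrying at once don't stampede the same endpoint:

```typescript
// Illustrative retry-delay schedule: exponential backoff with a cap.
// Parameter names and values are assumptions, not the SDK's actual ones.
function backoffDelayMs(
  attempt: number,   // 0-based retry attempt
  baseMs = 1000,     // first retry after ~1s
  capMs = 30000,     // never wait longer than 30s
): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full jitter spreads retries uniformly over [0, delay), avoiding a
// thundering herd when many candidates hit the same failure together.
function jitteredDelayMs(attempt: number, rng: () => number = Math.random): number {
  return Math.floor(rng() * backoffDelayMs(attempt));
}
```

The cap matters for long sessions: without it, a candidate who went offline for a few minutes could end up waiting many minutes before the next retry.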
The SDK: Performance-Critical Client-Side Code
The SDK runs in the candidate's browser for 3+ hours. It must have negligible impact on the test-taking experience.
Performance Budgets
| Metric | Budget | How We Achieved It |
|---|---|---|
| Main thread blocking | <16ms per frame | All heavy ops in Web Workers |
| Memory usage | <50MB steady state | Aggressive cleanup, limit buffers |
| CPU usage | <5% average | Throttle captures, use requestIdleCallback |
| Bundle size | <100KB gzipped | Tree-shaking, lazy loading |
| Network per minute | <200KB | Aggressive compression, batching |
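One budget-keeping trick worth spelling out: "throttle captures to prevent overlap" means a capture tick is dropped, not queued, if the previous capture is still running. A minimal sketch of that gate (names are our own, not the SDK's):

```typescript
// Illustrative capture scheduler: if a capture is still in flight when the
// next 30s tick fires, skip the tick instead of queueing work behind it.
class CaptureGate {
  private inFlight = false;
  skipped = 0;

  // Returns true if the capture may start now
  tryStart(): boolean {
    if (this.inFlight) {
      this.skipped++; // dropped tick: one missing frame beats a growing backlog
      return false;
    }
    this.inFlight = true;
    return true;
  }

  finish(): void {
    this.inFlight = false;
  }
}
```

Dropping ticks keeps memory and CPU bounded even when the device is briefly overloaded; the worst case is a slightly longer gap between captures.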
Thread Model
We moved all heavy computation off the main thread:
┌─────────────────────────────────────────────────────────┐
│ MAIN THREAD │
│ • Event listeners (minimal) │
│ • Permission prompts │
│ • Blocking overlays │
│ • State coordination │
│ │
│ NO heavy computation here! │
└─────────────────────────────────────────────────────────┘
│
│ Message passing
│
┌─────────────────────────────────────────────────────────┐
│ WEB WORKERS │
│ │
│ ┌──────────────────────┐ │
│ │ ImageWorker.js │ │
│ │ • JPEG compression │ │
│ │ • Canvas resizing │ │
│ │ • Face: 25KB │ │
│ │ • Screen: 25KB │ │
│ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ │
│ │ AudioWorker.js │ │
│ │ • Opus encoding │ │
│ │ • Chunk processing │ │
│ └──────────────────────┘ │
│ │
│ ┌──────────────────────┐ │
│ │ UploadWorker.js │ │
│ │ • S3 PUT requests │ │
│ │ • Retry logic │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Image Compression Strategy
We compress images aggressively in a Web Worker:
// ImageWorker.js - runs off the main thread
self.onmessage = async (e) => {
  const { imageData, type, maxWidth } = e.data;
  // Decode and compute scale, maintaining aspect ratio
  const bitmap = await createImageBitmap(imageData);
  const scale = Math.min(maxWidth / bitmap.width, 1);
  // OffscreenCanvas needs no DOM access
  const canvas = new OffscreenCanvas(bitmap.width * scale, bitmap.height * scale);
  const ctx = canvas.getContext("2d");
  ctx.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  bitmap.close(); // Release memory immediately
  // Compress to JPEG blob
  const blob = await canvas.convertToBlob({
    type: "image/jpeg",
    quality: type === "face" ? 0.4 : 0.35, // ~25KB for both face and screen
  });
  // Blobs are not transferable objects, so no transfer list here;
  // the structured clone of a Blob is cheap (it doesn't copy the bytes)
  self.postMessage({ blob, type });
};
Result: face captures and screen screenshots are each ~25KB. For a 3-hour test with captures every 30 seconds, that's 360 captures per stream, or 720 images × 25KB ≈ 18MB per candidate — orders of magnitude less than uncompressed frames.
Audio Recording: Real-Time Chunks
We upload audio in real-time chunks (every 30 seconds) rather than one large file at the end. This ensures minimal data loss if the candidate closes the tab.
class AudioRecorder {
  private readonly CHUNK_INTERVAL_MS = 30000; // 30 seconds per chunk
  private mediaRecorder?: MediaRecorder;

  async start(stream: MediaStream) {
    this.mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
      audioBitsPerSecond: 12000, // 12kbps cap; ~5MB total observed over 3 hours
    });
    this.mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size > 0) {
        await this.uploadChunk(e.data);
      }
    };
    // Fire ondataavailable every 30 seconds
    this.mediaRecorder.start(this.CHUNK_INTERVAL_MS);
  }

  private async uploadChunk(chunk: Blob) {
    // Hand off to UploadWorker for the S3 PUT (omitted)
  }
}
Each chunk comes out to ~14KB in practice — well under the ~45KB a 30-second chunk would reach at the full 12kbps rate, since Opus spends far fewer bits on quiet mic audio. For a 3-hour test: 360 chunks × 14KB ≈ 5MB total.
Benefits:
- Only last partial chunk lost on tab close (~30s max)
- No need for IndexedDB recovery of large files
- Smaller retry units if upload fails
Backend: Stateless and Scalable
The backend service is written in Go and designed to be horizontally scalable.
Key Endpoints
POST /api/v1/credentials - Issues STS temporary credentials for S3 uploads
func (s *STSService) GenerateCredentials(ctx context.Context, testID, candidateID string) (*Credentials, error) {
	date := time.Now().Format("2006-01-02")
	prefix := fmt.Sprintf("%s/%s/%s/", date, testID, candidateID)
	// Session policy scopes credentials to the candidate's folder only
	sessionPolicy := fmt.Sprintf(`{
		"Version": "2012-10-17",
		"Statement": [{
			"Effect": "Allow",
			"Action": "s3:PutObject",
			"Resource": "arn:aws:s3:::proctor-media-bucket/%s*"
		}]
	}`, prefix)
	input := &sts.AssumeRoleInput{
		RoleArn:         aws.String("arn:aws:iam::ACCOUNT_ID:role/ProctorMediaUploadRole"),
		RoleSessionName: aws.String(fmt.Sprintf("proctor-%s-%s", testID, candidateID)),
		DurationSeconds: aws.Int32(14400), // 4 hours (covers 3-hour test + buffer)
		Policy:          aws.String(sessionPolicy),
	}
	result, err := s.stsClient.AssumeRole(ctx, input)
	if err != nil {
		return nil, err
	}
	// ... map result.Credentials into our Credentials type and return
}
POST /api/v1/events - Ingests batched events from SDK
The backend validates events, enriches them with server timestamps, and publishes them to Kafka. Events are then consumed by Kafka Connect, which streams them to ClickHouse for real-time analytics. The response includes shutdown signals if proctoring is disabled:
func HandleEvents(w http.ResponseWriter, r *http.Request) {
	// Extracted from the JWT by auth middleware
	candidateID := r.Context().Value("candidate_id").(string)
	testID := r.FormValue("test_id")
	// Check flags BEFORE processing events
	config, err := db.GetTestConfig(testID, candidateID)
	if err != nil || !config.ProctoringEnabled || config.CandidateBypass {
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status": "shutdown",
			"reason": determineReason(config),
			"action": "graceful_shutdown",
		})
		return
	}
	// Decode the batched events from the request body, then process normally
	var events []ProctorEvent
	if err := json.NewDecoder(r.Body).Decode(&events); err != nil {
		http.Error(w, "invalid event payload", http.StatusBadRequest)
		return
	}
	processEvents(events)
}
This approach uses the existing events channel (called every 5s) to signal shutdown, avoiding the need for separate config polling or WebSocket infrastructure.
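On the SDK side, handling this piggybacked signal is a small piece of pure logic. A minimal sketch (the response shape follows the backend snippet above; the SDK-side names are our own assumptions):

```typescript
// Illustrative handling of the /api/v1/events response on the 5s cadence.
interface EventsResponse {
  status: "ok" | "shutdown";
  reason?: string;
  action?: "graceful_shutdown";
}

type SdkAction = "continue" | "stop_capture";

function handleEventsResponse(res: EventsResponse): SdkAction {
  if (res.status === "shutdown" && res.action === "graceful_shutdown") {
    // Stop capture loops, flush remaining buffers, release camera/mic —
    // but never interrupt the candidate's test.
    return "stop_capture";
  }
  return "continue";
}
```

Because the signal rides on a channel that already exists, disabling proctoring for a candidate takes effect within one events cycle (~5s) with zero extra infrastructure.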
Configuration: AWS AppConfig for Instant Propagation
We store all proctoring configuration in AWS AppConfig rather than a database. This gives us:
- Instant propagation - Config changes reflect immediately
- Built-in rollback - If something breaks, auto-rollback
- Feature flags - Native support for gradual rollouts
- No database dependency - One less system to manage
Configuration Profiles
Global Defaults:
{
"defaults": {
"proctoring_enabled": true,
"capture_intervals": {
"face_ms": 30000,
"screen_ms": 30000
},
"blocking": {
"on_permission_denied": true,
"on_fullscreen_exit": true
}
}
}
Test Overrides:
{
"tests": {
"TEST123": {
"proctoring_enabled": true,
"capture_intervals": {
"face_ms": 15000,
"screen_ms": 60000
}
}
}
}
Candidate Bypass:
{
"bypasses": {
"TEST123": {
"CAND456": {
"bypassed": true,
"reason": "Technical issues with camera",
"bypassed_by": "admin@example.com"
}
}
}
}
The backend resolves config in this order: global defaults → test overrides → candidate bypass. If bypass is enabled, the SDK receives a shutdown signal on its next events call.
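The resolution order can be sketched as a simple layered merge. This is our own illustration of the logic, not the backend's actual code; field names mirror the JSON profiles above:

```typescript
// Illustrative config resolution: global defaults → test overrides →
// candidate bypass, where a bypass wins over everything else.
interface EffectiveConfig {
  proctoringEnabled: boolean;
  faceMs: number;
  screenMs: number;
}

function resolveConfig(
  defaults: EffectiveConfig,
  testOverride: Partial<EffectiveConfig> | undefined,
  bypassed: boolean,
): EffectiveConfig {
  const merged = { ...defaults, ...testOverride };
  if (bypassed) merged.proctoringEnabled = false; // bypass always wins
  return merged;
}
```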
Scale: Handling 50K Concurrent Candidates
Let's break down the load:
Scenario: 50,000 candidates, 3-hour test, captures every 30 seconds
| Metric | Calculation | Value |
|---|---|---|
| Face captures per candidate | 3 hours × 2/min | 360 |
| Screen screenshots per candidate | 3 hours × 2/min | 360 |
| Audio chunks per candidate | 3 hours × 2/min (30s chunks) | 360 |
| Total face captures | 50,000 × 360 | 18,000,000 |
| Total screen screenshots | 50,000 × 360 | 18,000,000 |
| Total audio chunks | 50,000 × 360 | 18,000,000 |
| Other events per candidate | ~100 (tab switches, etc.) | 5,000,000 |
| Total events | ~41,000,000 | |
| Storage per test | (36M images × 25KB) + (50K × 5MB audio) | ~1.15 TB |
QPS Analysis
| Phase | Duration | Requests | QPS |
|---|---|---|---|
| Credential burst | 2 min (instruction window) | 50,000 | ~417 |
| Steady state events | 3 hours | 23M batched | ~2,130 |
| S3 uploads (face) | 3 hours | 18M | ~1,667 |
| S3 uploads (screen) | 3 hours | 18M | ~1,667 |
| S3 uploads (audio) | 3 hours | 18M | ~1,667 |
Total S3 uploads: ~54M PUT requests. But since uploads go directly from client to S3, the backend doesn't see this load.
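The per-stream upload rate in the table is straightforward arithmetic, shown here as a small worked calculation:

```typescript
// Steady-state S3 upload rate for one media stream: 50,000 candidates each
// uploading one object per 30s interval.
function uploadQps(candidates: number, intervalS: number): number {
  return candidates / intervalS; // objects per second for one stream
}

const perStream = uploadQps(50_000, 30);         // face, screen, OR audio: ~1,667 QPS
const totalPuts = 3 * 50_000 * (3 * 3600) / 30;  // all three streams over 3 hours
```

Each stream independently generates ~1,667 PUT/s, and the three streams together account for the ~54M PUT requests per test.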
Scaling Strategy
| Component | Strategy |
|---|---|
| Proctor Backend | Horizontal scaling (stateless), K8s HPA |
| S3 Uploads | Direct client-to-S3 (infinitely scalable) |
| Kafka | Kafka cluster with topic partitioning |
| Kafka Connect | Streams events from Kafka to ClickHouse |
| ClickHouse | ClickHouse cluster for analytics storage |
We use Kubernetes Horizontal Pod Autoscaler (HPA) to scale the backend based on CPU and memory:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: proctor-backend   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: proctor-backend
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Scaling Estimates (50k Candidates):
| Phase | Duration | Pods Needed | CPU Total |
|---|---|---|---|
| Test Start (burst) | 5 min | 15-20 | 15-20 vCPU |
| Steady State | 3 hours | 8-10 | 4-5 vCPU |
| Test End (burst) | 10 min | 10-12 | 5-6 vCPU |
| Idle (no test) | - | 3 | 0.15 vCPU |
Storage: S3 with Lifecycle Policies
We store all media in S3 with a date-prefixed structure for easy lifecycle management:
s3://proctor-media-bucket/
└── 2024-12-22/ # Date prefix
└── TEST123/
└── CAND456/
├── camera/ # Face captures (25KB each)
│ ├── base_truth_1703234567890.jpg
│ └── 1703234597890_uuid-1.jpg
├── screen/ # Screen screenshots (25KB each)
│ └── 1703234597890_uuid-2.jpg
└── audio/ # Audio chunks (~14KB each)
└── audio_1703234567890/
├── chunk_00000.webm
├── chunk_00001.webm
└── chunk_00359.webm
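Because the key layout matches the STS session-policy prefix exactly, building an object key is a one-liner. A sketch (the helper name is our own):

```typescript
// Illustrative key builder for the layout above. The date/test/candidate
// prefix must match the scoped STS policy, or the PUT is denied.
function mediaKey(
  date: string,        // YYYY-MM-DD, same value the backend used in the policy
  testId: string,
  candidateId: string,
  kind: "camera" | "screen" | "audio",
  file: string,
): string {
  return `${date}/${testId}/${candidateId}/${kind}/${file}`;
}
```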
Lifecycle Policy:
- 0-10 days: Standard storage ($0.023/GB)
- 10-30 days: Glacier Instant Retrieval ($0.004/GB)
- 30+ days: Deleted
This reduces storage costs by ~83% after 10 days.
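A bucket lifecycle configuration implementing this policy might look like the following sketch (the rule ID is our own; `GLACIER_IR` is S3's storage-class name for Glacier Instant Retrieval):

```json
{
  "Rules": [
    {
      "ID": "proctor-media-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 10, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 30 }
    }
  ]
}
```

The date prefix at the top of every key is what makes this a single bucket-wide rule rather than per-test cleanup jobs.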
Cost Estimation
Per Test (50k users):
| Item | Cost (INR) |
|---|---|
| S3 PUT requests (54M) | ₹15,000 |
| S3 Storage (10 days Standard) | ₹665 |
| S3 Storage (20 days Glacier IR) | ₹235 |
| Kafka (marginal) | ₹1,000 |
| Compute (burst handling) | ₹830 |
| Total per test | ~₹17,730 |
Cost per user: Approximately ₹0.35 per test session.
Annual Infrastructure (Monthly Baseline):
| Component | Monthly Cost |
|---|---|
| Proctor Backend | ₹8,000 |
| S3 Storage | ₹6,500 |
| Kafka | ₹3,000 |
| ClickHouse | ₹3,500 |
| AWS AppConfig | ₹500 |
| CDN (SDK) | ₹1,000 |
| Total monthly | ₹22,500 |
Total Annual: ~₹89,000 (per-test costs: 5 tests × ₹17,730) + ~₹2.7L (infrastructure baseline) = ~₹3.6L for 5 tests per year.
The system is designed to scale linearly, with costs primarily driven by storage and compute resources that can be optimized based on retention policies and usage patterns.
Key Learnings
1. Non-Blocking Design is Non-Negotiable
The most critical decision was making infrastructure failures non-blocking. Candidates should never be prevented from taking their test because our proctoring service is down. This required:
- Robust retry logic with exponential backoff
- IndexedDB buffering for offline resilience
- Graceful degradation at every layer
- Clear separation between "must block" (permissions) and "never block" (infrastructure)
2. Client-Side Performance Matters
When code runs in a candidate's browser for 3+ hours, every millisecond counts. We achieved <5% CPU overhead by:
- Moving all heavy computation to Web Workers
- Aggressive image compression (25KB per capture)
- Throttling captures to prevent overlap
- Using requestIdleCallback for non-critical operations
3. Direct S3 Uploads Scale Infinitely
By having clients upload directly to S3 (via STS credentials), we bypass the backend entirely for media uploads. This means:
- No backend bottleneck for uploads
- S3 handles the scale (it's designed for this)
- Backend only handles lightweight event batching
4. Configuration as Code (AppConfig)
Using AWS AppConfig instead of a database for configuration gives us:
- Instant propagation (no polling needed)
- Built-in rollback on errors
- Feature flag support out of the box
- One less database to manage
5. Real-Time Chunks > Single Upload
Uploading audio in 30-second chunks instead of one large file at the end means:
- Minimal data loss on tab close (~30s max)
- Smaller retry units if upload fails
- Better progress tracking
- No need for large IndexedDB buffers
What's Next
Phase 1 is complete and handling production load. Future phases will add:
- Phase 2: AI-based face matching and anomaly detection
- Phase 3: Real-time proctoring with live human monitors
- Phase 4: Advanced security (VM detection, remote desktop detection)
But the foundation is solid: a scalable, cost-effective, fully-controlled proctoring solution that handles 50k concurrent candidates without breaking a sweat.
Takeaways
If you're considering building vs. buying for a critical system:
- Do the math - At scale, vendor costs can be astronomical
- Design for failure - Non-blocking architecture is essential for user-facing systems
- Optimize client-side - Performance budgets matter when code runs for hours
- Leverage managed services - S3, AppConfig, and existing infrastructure reduce complexity
- Measure everything - We track CPU, memory, network, and upload success rates
The result: a plug-and-play system that scales to 50k concurrent users, provides full control over the proctoring experience, and integrates seamlessly into any web application with minimal overhead.