2025-12-24

Building a Plug-and-Play Digital Proctoring Solution: Scaling to 50K Concurrent Users

How we built an independent, plug-and-play proctoring solution that handles 50,000 concurrent users—and the technical decisions that made it possible.

Digital proctoring is essential for maintaining integrity in online assessments, but existing solutions often come with significant limitations: high costs, scalability issues, and lack of customization. This is how we architected a plug-and-play proctoring system that handles 50k concurrent users, costs approximately ₹0.35 per user, and provides full control over the experience.

The solution: A lightweight SDK that integrates into any web application via a simple script tag, with a scalable backend that can handle massive concurrent loads.


Use Cases

This proctoring solution is designed as a plug-and-play system that can be integrated into various educational and assessment platforms:

  1. Online Examinations: High-stakes competitive exams, certification tests, and academic assessments requiring strict integrity monitoring.
  2. Live Class Attentiveness: Monitor student engagement during live online classes by tracking presence, attention levels, and participation patterns.
  3. Remote Interviews: Ensure candidate authenticity during remote hiring processes.
  4. Training & Certification: Track completion and authenticity for professional development courses.
  5. Adaptive Learning Assessments: Monitor student behavior during adaptive learning sessions to ensure genuine engagement.

The system is designed to be non-intrusive, with minimal performance impact on the host application, making it suitable for long-duration sessions (3+ hours) without degrading user experience.


The Architecture: Client-Side Capture, Server-Side Orchestration

We designed a system with clear separation of concerns:

┌────────────────────────────────────────────────────────────────────┐
│                CANDIDATE BROWSER (Host Application)                │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │              PROCTOR SDK (Script Tag Injection)              │  │
│  │                                                              │  │
│  │  • Permission checks (camera, mic, screen)                   │  │
│  │  • Face captures every 30s (25KB JPEG)                       │  │
│  │  • Screen screenshots every 30s (25KB JPEG)                  │  │
│  │  • Audio recording (continuous, 12kbps Opus)                 │  │
│  │  • Event monitoring (tab switches, keystrokes)               │  │
│  │  • IndexedDB buffer for offline resilience                   │  │
│  │                                                              │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐              │  │
│  │  │   Media    │  │   Event    │  │  IndexedDB │              │  │
│  │  │  Capture   │─▶│ Collector  │─▶│   Buffer   │              │  │
│  │  │ (Workers)  │  │            │  │ (Offline)  │              │  │
│  │  └────────────┘  └────────────┘  └───────┬────┘              │  │
│  └──────────────────────────────────────────┼──────────────────┘  │
└──────────────────────────────────────────────┼─────────────────────┘
                                              │
                    ┌─────────────────────────┼──────────────────────────┐
                    │                         │                          │
                    ▼                         │                          ▼
┌───────────────────────────────┐             │       ┌───────────────────────────┐
│  PROCTOR BACKEND SERVICE (Go) │             │       │      S3 (Direct Upload)   │
│                               │             │       │        ap-south-1         │
│  • GET /api/v1/config         │             │       │                           │
│  • POST /api/v1/credentials   │◄────────────┘       │  /{date}/{test_id}/       │
│  • POST /api/v1/events        │                     │    {candidate_id}/        │
│  • GET /api/v1/dashboard/*    │                     │      camera/  (25KB)      │
│                               │                     │      screen/  (25KB)      │
│                               │                     │      audio/  (14KB/30s)   │
│                               │                     └───────────────────────────┘
│              │                │
│              ▼                │
│  ┌─────────────────────────┐  │
│  │        Kafka            │  │
│  │  topic: proctor_events  │  │
│  └───────────┬─────────────┘  │
│              │                │
│              ▼                │
│  ┌─────────────────────────┐  │
│  │     Kafka Connect       │  │
│  └───────────┬─────────────┘  │
│              │                │
│              ▼                │
│  ┌─────────────────────────┐  │
│  │      ClickHouse         │  │
│  │  table: proctor_events  │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘

The key insight: media uploads go directly to S3, bypassing the backend entirely, while events flow through the backend, which validates them and publishes them to Kafka; Kafka Connect then streams them into ClickHouse.


Critical Design Decision: Non-Blocking by Default

This was our most important architectural principle: permission issues block the test; infrastructure failures never do.

Blocking Scenarios (Candidate Must Act)

  • Permission denied (camera, mic, screen)
  • Full screen exited
  • Screen share stopped
  • Unsupported browser

These are compliance issues. The candidate must fix them to continue.

Non-Blocking Scenarios (Graceful Degradation)

  • Backend API unavailable → Events buffered in IndexedDB
  • S3 upload failure → Retry with exponential backoff
  • Kafka/ClickHouse down → Backend buffers internally
  • STS credential failure → Retry 3x, then disable S3 uploads

The golden rule: A candidate should NEVER be blocked from taking their test due to a failure in the proctoring infrastructure.
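In code, the split boils down to a small allow-list the SDK checks before deciding whether to overlay the test. A minimal sketch in TypeScript; the event names here are illustrative, not the SDK's actual identifiers:

```typescript
// Event names are illustrative, not the SDK's actual identifiers.
const BLOCKING_EVENTS = new Set([
  "permission_denied",    // camera, mic, or screen permission missing
  "fullscreen_exited",
  "screen_share_stopped",
  "unsupported_browser",
]);

// Infrastructure failures never appear in the set, so they degrade gracefully.
function shouldBlockTest(event: string): boolean {
  return BLOCKING_EVENTS.has(event);
}
```

Anything not in the set, such as a failed S3 upload or an unreachable backend, falls through to the retry-and-buffer path instead of an overlay.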

This design choice meant we had to build robust retry logic, offline buffering, and graceful degradation at every layer.
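The retry logic itself can be sketched as capped exponential backoff with full jitter. The base delay and cap below are illustrative defaults, not the production values:

```typescript
// Capped exponential backoff with full jitter.
// baseMs and maxMs are illustrative defaults, not the production values.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt); // 1s, 2s, 4s, ... capped
  return Math.random() * ceiling; // full jitter avoids synchronized retry storms
}
```

Full jitter matters at this scale: if the backend blips, 50,000 clients retrying in lockstep would knock it over again the moment it recovered.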


The SDK: Performance-Critical Client-Side Code

The SDK runs in the candidate's browser for 3+ hours. It must have negligible impact on the test-taking experience.

Performance Budgets

Metric                Budget              How We Achieved It
Main thread blocking  <16ms per frame     All heavy ops in Web Workers
Memory usage          <50MB steady state  Aggressive cleanup, limit buffers
CPU usage             <5% average         Throttle captures, use requestIdleCallback
Bundle size           <100KB gzipped      Tree-shaking, lazy loading
Network per minute    <200KB              Aggressive compression, batching
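One of those budget line items, throttling captures so they never overlap, can be sketched as a small guard (class and method names are illustrative):

```typescript
// Skip a capture if the previous one is still in flight or fired too recently.
// intervalMs mirrors the 30s capture cadence; names are illustrative.
class CaptureThrottle {
  private lastStart = -Infinity;
  private inFlight = false;

  constructor(private intervalMs: number) {}

  // Returns true if a new capture may begin at timestamp `now` (ms).
  tryStart(now: number): boolean {
    if (this.inFlight || now - this.lastStart < this.intervalMs) return false;
    this.inFlight = true;
    this.lastStart = now;
    return true;
  }

  finish(): void {
    this.inFlight = false;
  }
}
```

The guard keeps a slow compression or upload from stacking a second capture on top of the first, which is what would otherwise blow the CPU and memory budgets.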

Thread Model

We moved all heavy computation off the main thread:

┌─────────────────────────────────────────────────────────┐
│                    MAIN THREAD                          │
│  • Event listeners (minimal)                            │
│  • Permission prompts                                   │
│  • Blocking overlays                                    │
│  • State coordination                                   │
│                                                         │
│  NO heavy computation here!                             │
└─────────────────────────────────────────────────────────┘
                        │
                        │ Message passing
                        │
┌─────────────────────────────────────────────────────────┐
│                    WEB WORKERS                          │
│                                                         │
│  ┌──────────────────────┐                               │
│  │  ImageWorker.js      │                               │
│  │  • JPEG compression  │                               │
│  │  • Canvas resizing   │                               │
│  │  • Face: 25KB        │                               │
│  │  • Screen: 25KB      │                               │
│  └──────────────────────┘                               │
│                                                         │
│  ┌──────────────────────┐                               │
│  │  AudioWorker.js      │                               │
│  │  • Opus encoding     │                               │
│  │  • Chunk processing  │                               │
│  └──────────────────────┘                               │
│                                                         │
│  ┌──────────────────────┐                               │
│  │  UploadWorker.js     │                               │
│  │  • S3 PUT requests   │                               │
│  │  • Retry logic       │                               │
│  └──────────────────────┘                               │
└─────────────────────────────────────────────────────────┘

Image Compression Strategy

We compress images aggressively in a Web Worker:

// ImageWorker.js - runs off the main thread
self.onmessage = async (e) => {
  const { imageData, type, quality, maxWidth } = e.data;

  // Resize maintaining aspect ratio (never upscale)
  const bitmap = await createImageBitmap(imageData);
  const scale = Math.min(maxWidth / bitmap.width, 1);

  // OffscreenCanvas: no DOM access needed inside a worker
  const canvas = new OffscreenCanvas(bitmap.width * scale, bitmap.height * scale);
  const ctx = canvas.getContext("2d");
  ctx.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  bitmap.close(); // Release memory immediately

  // Compress to a JPEG blob
  const blob = await canvas.convertToBlob({
    type: "image/jpeg",
    quality: quality ?? (type === "face" ? 0.4 : 0.35), // both target ~25KB
  });

  // Blobs are not transferable objects, so no transfer list here;
  // structured-cloning a Blob is cheap (it does not copy the bytes)
  self.postMessage({ blob, type });
};

Result: face captures and screen screenshots are each ~25KB. For a 3-hour test with captures every 30 seconds, that's 360 capture pairs × 50KB ≈ 18MB per candidate, a fraction of what uncompressed frames would cost.

Audio Recording: Real-Time Chunks

We upload audio in real-time chunks (every 30 seconds) rather than one large file at the end. This ensures minimal data loss if the candidate closes the tab.

class AudioRecorder {
  private readonly CHUNK_INTERVAL_MS = 30000; // 30 seconds per chunk
  private mediaRecorder?: MediaRecorder;

  async start(stream: MediaStream) {
    this.mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
      audioBitsPerSecond: 12000, // 12kbps target; Opus VBR typically lands below this
    });

    this.mediaRecorder.ondataavailable = async (e) => {
      if (e.data.size > 0) {
        await this.uploadChunk(e.data); // S3 PUT with retry (implemented elsewhere)
      }
    };

    // Fire ondataavailable every 30 seconds
    this.mediaRecorder.start(this.CHUNK_INTERVAL_MS);
  }
}

Each chunk is ~14KB in practice. (The 12kbps setting is a target bitrate; Opus VBR output for speech with pauses typically comes in well under the ~45KB that 30 seconds at a constant 12kbps would produce.) For a 3-hour test: 360 chunks × 14KB ≈ 5MB total.

Benefits:

  • Only last partial chunk lost on tab close (~30s max)
  • No need for IndexedDB recovery of large files
  • Smaller retry units if upload fails
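The chunk arithmetic can be sanity-checked in a few lines, using the budgeted ~14KB-per-chunk figure from above:

```typescript
// 30-second chunks over a 3-hour test, at the budgeted ~14KB per chunk.
const TEST_SECONDS = 3 * 3600;
const CHUNK_SECONDS = 30;
const CHUNK_KB = 14;

const chunks = TEST_SECONDS / CHUNK_SECONDS; // 360 chunks per candidate
const totalKB = chunks * CHUNK_KB;           // 5,040 KB ≈ 5MB per candidate
```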

Backend: Stateless and Scalable

The backend service is written in Go and designed to be horizontally scalable.

Key Endpoints

POST /api/v1/credentials - Issues STS temporary credentials for S3 uploads

func (s *STSService) GenerateCredentials(ctx context.Context, testID, candidateID string) (*Credentials, error) {
    date := time.Now().Format("2006-01-02")
    prefix := fmt.Sprintf("%s/%s/%s/", date, testID, candidateID)

    // Session policy scopes credentials to the candidate's folder only
    sessionPolicy := fmt.Sprintf(`{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::proctor-media-bucket/%s*"
        }]
    }`, prefix)

    input := &sts.AssumeRoleInput{
        RoleArn:         aws.String("arn:aws:iam::ACCOUNT_ID:role/ProctorMediaUploadRole"),
        RoleSessionName: aws.String(fmt.Sprintf("proctor-%s-%s", testID, candidateID)),
        DurationSeconds: aws.Int32(14400), // 4 hours (covers a 3-hour test plus buffer)
        Policy:          aws.String(sessionPolicy),
    }

    result, err := s.stsClient.AssumeRole(ctx, input)
    if err != nil {
        return nil, err
    }
    // ... map result.Credentials into our Credentials type and return
}

POST /api/v1/events - Ingests batched events from SDK

The backend validates events, enriches them with server timestamps, and publishes them to Kafka. Events are then consumed by Kafka Connect, which streams them to ClickHouse for real-time analytics. The response includes shutdown signals if proctoring is disabled:

func HandleEvents(w http.ResponseWriter, r *http.Request) {
    // Extracted from the JWT by auth middleware
    candidateID := r.Context().Value("candidate_id").(string)
    testID := r.FormValue("test_id")

    // Check flags BEFORE processing events (config resolved from AppConfig)
    config, err := configStore.GetTestConfig(testID, candidateID)
    if err != nil || !config.ProctoringEnabled || config.CandidateBypass {
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status": "shutdown",
            "reason": determineReason(config),
            "action": "graceful_shutdown",
        })
        return
    }

    // Decode the batched events from the request body
    var events []ProctorEvent
    if err := json.NewDecoder(r.Body).Decode(&events); err != nil {
        http.Error(w, "invalid payload", http.StatusBadRequest)
        return
    }

    // Validate, enrich with server timestamps, publish to Kafka...
    processEvents(events)
}

This approach uses the existing events channel (called every 5s) to signal shutdown, avoiding the need for separate config polling or WebSocket infrastructure.
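On the SDK side, handling that piggybacked signal might look like this; the field names mirror the JSON above, while the function name is hypothetical:

```typescript
// Field names mirror the backend's JSON response; handleEventsResponse is
// a hypothetical SDK-side hook, not the SDK's actual API.
interface EventsResponse {
  status: string;
  reason?: string;
  action?: string;
}

// Returns true when the SDK should tear down: stop capture loops, flush
// buffered events, and remove any overlays.
function handleEventsResponse(res: EventsResponse): boolean {
  return res.status === "shutdown" && res.action === "graceful_shutdown";
}
```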


Configuration: AWS AppConfig for Instant Propagation

We store all proctoring configuration in AWS AppConfig rather than a database. This gives us:

  1. Instant propagation - Config changes reflect immediately
  2. Built-in rollback - If something breaks, auto-rollback
  3. Feature flags - Native support for gradual rollouts
  4. No database dependency - One less system to manage

Configuration Profiles

Global Defaults:

{
  "defaults": {
    "proctoring_enabled": true,
    "capture_intervals": {
      "face_ms": 30000,
      "screen_ms": 30000
    },
    "blocking": {
      "on_permission_denied": true,
      "on_fullscreen_exit": true
    }
  }
}

Test Overrides:

{
  "tests": {
    "TEST123": {
      "proctoring_enabled": true,
      "capture_intervals": {
        "face_ms": 15000,
        "screen_ms": 60000
      }
    }
  }
}

Candidate Bypass:

{
  "bypasses": {
    "TEST123": {
      "CAND456": {
        "bypassed": true,
        "reason": "Technical issues with camera",
        "bypassed_by": "admin@example.com"
      }
    }
  }
}

The backend resolves config in this order: global defaults → test overrides → candidate bypass. If bypass is enabled, the SDK receives a shutdown signal on its next events call.
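That resolution order can be sketched as a small merge; the shapes mirror the JSON profiles above, and the helper name is illustrative:

```typescript
// Shapes mirror the AppConfig profiles above; resolveConfig is illustrative.
interface CaptureIntervals {
  face_ms: number;
  screen_ms: number;
}

interface ProctorConfig {
  proctoring_enabled: boolean;
  capture_intervals: CaptureIntervals;
}

interface TestOverride {
  proctoring_enabled?: boolean;
  capture_intervals?: Partial<CaptureIntervals>;
}

// Resolution order: global defaults → test overrides → candidate bypass.
function resolveConfig(
  defaults: ProctorConfig,
  override: TestOverride | undefined,
  bypassed: boolean
): ProctorConfig {
  const resolved: ProctorConfig = {
    ...defaults,
    ...override,
    capture_intervals: {
      ...defaults.capture_intervals,
      ...override?.capture_intervals,
    },
  };
  // A candidate bypass wins over everything else.
  if (bypassed) resolved.proctoring_enabled = false;
  return resolved;
}
```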


Scale: Handling 50K Concurrent Candidates

Let's break down the load:

Scenario: 50,000 candidates, 3-hour test, captures every 30 seconds

Metric                            Calculation                    Value
Face captures per candidate       3 hours × 2/min                360
Screen screenshots per candidate  3 hours × 2/min                360
Audio chunks per candidate        3 hours × 2/min (30s chunks)   360
Total face captures               50,000 × 360                   18,000,000
Total screen screenshots          50,000 × 360                   18,000,000
Total audio chunks                50,000 × 360                   18,000,000
Other events (~100/candidate)     50,000 × 100                   5,000,000
Total events                      18M + 18M + 5M                 ~41,000,000
Storage per test                  (36M × 25KB) + (50K × 5MB)     ~1.15 TB
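The count rows of the table follow directly from the capture cadence; a quick sanity check:

```typescript
// Recomputing the per-test counts from the stated assumptions.
const CANDIDATES = 50_000;
const TEST_HOURS = 3;
const CAPTURE_INTERVAL_S = 30;

const capturesPerCandidate = (TEST_HOURS * 3600) / CAPTURE_INTERVAL_S; // 360
const totalFace = CANDIDATES * capturesPerCandidate;      // 18,000,000
const totalScreen = CANDIDATES * capturesPerCandidate;    // 18,000,000
const totalOther = CANDIDATES * 100;                      // 5,000,000
const totalEvents = totalFace + totalScreen + totalOther; // 41,000,000
```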

QPS Analysis

Phase                Duration                    Requests     QPS
Credential burst     2 min (instruction window)  50,000       ~417
Steady state events  3 hours                     23M batched  ~2,130
S3 uploads (face)    3 hours                     18M          ~1,667
S3 uploads (screen)  3 hours                     18M          ~1,667
S3 uploads (audio)   3 hours                     18M          ~1,667

Total S3 uploads: ~54M PUT requests. But since uploads go directly from client to S3, the backend doesn't see this load.

Scaling Strategy

Component        Strategy
Proctor Backend  Horizontal scaling (stateless), K8s HPA
S3 Uploads       Direct client-to-S3 (infinitely scalable)
Kafka            Kafka cluster with topic partitioning
Kafka Connect    Streams events from Kafka to ClickHouse
ClickHouse       ClickHouse cluster for analytics storage

We use Kubernetes Horizontal Pod Autoscaler (HPA) to scale the backend based on CPU and memory:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: proctor-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: proctor-backend   # backend Deployment name (illustrative)
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Scaling Estimates (50k Candidates):

Phase               Duration  Pods Needed  CPU Total
Test Start (burst)  5 min     15-20        15-20 vCPU
Steady State        3 hours   8-10         4-5 vCPU
Test End (burst)    10 min    10-12        5-6 vCPU
Idle (no test)      -         3            0.15 vCPU

Storage: S3 with Lifecycle Policies

We store all media in S3 with a date-prefixed structure for easy lifecycle management:

s3://proctor-media-bucket/
└── 2024-12-22/                         # Date prefix
    └── TEST123/
        └── CAND456/
            ├── camera/                 # Face captures (25KB each)
            │   ├── base_truth_1703234567890.jpg
            │   └── 1703234597890_uuid-1.jpg
            ├── screen/                 # Screen screenshots (25KB each)
            │   └── 1703234597890_uuid-2.jpg
            └── audio/                  # Audio chunks (~14KB each)
                └── audio_1703234567890/
                    ├── chunk_00000.webm
                    ├── chunk_00001.webm
                    └── chunk_00359.webm
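A small helper for building keys in this layout might look like the following (the function name is hypothetical; the layout matches the tree above):

```typescript
// Builds an object key matching the bucket layout above.
// mediaKey is a hypothetical helper, not part of the actual SDK.
type MediaKind = "camera" | "screen" | "audio";

function mediaKey(
  date: string, // e.g. "2024-12-22"
  testId: string,
  candidateId: string,
  kind: MediaKind,
  filename: string
): string {
  return `${date}/${testId}/${candidateId}/${kind}/${filename}`;
}
```

Because the STS session policy is scoped to `{date}/{test_id}/{candidate_id}/`, every key this produces stays inside the candidate's own prefix.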

Lifecycle Policy:

  • 0-10 days: Standard storage ($0.023/GB)
  • 10-30 days: Glacier Instant Retrieval ($0.004/GB)
  • 30+ days: Deleted

This reduces storage costs by ~83% after 10 days.
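The ~83% figure falls straight out of the two storage prices:

```typescript
// S3 Standard vs Glacier Instant Retrieval, USD per GB-month.
const STANDARD_PER_GB = 0.023;
const GLACIER_IR_PER_GB = 0.004;

const savings = 1 - GLACIER_IR_PER_GB / STANDARD_PER_GB; // ≈ 0.826, i.e. ~83%
```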


Cost Estimation

Per Test (50k users):

Item                             Cost (INR)
S3 PUT requests (54M requests)   ₹15,000
S3 Storage (10 days Standard)    ₹665
S3 Storage (20 days Glacier IR)  ₹235
Kafka (marginal)                 ₹1,000
Compute (burst handling)         ₹830
Total per test                   ~₹17,730

Cost per user: Approximately ₹0.35 per test session.
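That per-user figure is just the test total divided by the cohort:

```typescript
// Per-test total from the table, divided across the cohort.
const PER_TEST_INR = 17_730;
const USERS = 50_000;

const perUserINR = PER_TEST_INR / USERS; // ≈ ₹0.35
```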

Annual Infrastructure (Monthly Baseline):

Component        Monthly Cost
Proctor Backend  ₹8,000
S3 Storage       ₹6,500
Kafka            ₹3,000
ClickHouse       ₹3,500
AWS AppConfig    ₹500
CDN (SDK)        ₹1,000
Total monthly    ₹22,500

Total Annual: ~₹88,650 (per-test costs, 5 tests × ~₹17,730) + ~₹2.7L (infrastructure, 12 × ₹22,500) ≈ ₹3.6L for 5 tests per year.

The system is designed to scale linearly, with costs primarily driven by storage and compute resources that can be optimized based on retention policies and usage patterns.


Key Learnings

1. Non-Blocking Design is Non-Negotiable

The most critical decision was making infrastructure failures non-blocking. Candidates should never be prevented from taking their test because our proctoring service is down. This required:

  • Robust retry logic with exponential backoff
  • IndexedDB buffering for offline resilience
  • Graceful degradation at every layer
  • Clear separation between "must block" (permissions) and "never block" (infrastructure)

2. Client-Side Performance Matters

When code runs in a candidate's browser for 3+ hours, every millisecond counts. We achieved <5% CPU overhead by:

  • Moving all heavy computation to Web Workers
  • Aggressive image compression (25KB per capture)
  • Throttling captures to prevent overlap
  • Using requestIdleCallback for non-critical operations

3. Direct S3 Uploads Scale Infinitely

By having clients upload directly to S3 (via STS credentials), we bypass the backend entirely for media uploads. This means:

  • No backend bottleneck for uploads
  • S3 handles the scale (it's designed for this)
  • Backend only handles lightweight event batching

4. Configuration as Code (AppConfig)

Using AWS AppConfig instead of a database for configuration gives us:

  • Instant propagation (no polling needed)
  • Built-in rollback on errors
  • Feature flag support out of the box
  • One less database to manage

5. Real-Time Chunks > Single Upload

Uploading audio in 30-second chunks instead of one large file at the end means:

  • Minimal data loss on tab close (~30s max)
  • Smaller retry units if upload fails
  • Better progress tracking
  • No need for large IndexedDB buffers

What's Next

Phase 1 is complete and handling production load. Future phases will add:

  • Phase 2: AI-based face matching and anomaly detection
  • Phase 3: Real-time proctoring with live human monitors
  • Phase 4: Advanced security (VM detection, remote desktop detection)

But the foundation is solid: a scalable, cost-effective, fully-controlled proctoring solution that handles 50k concurrent candidates without breaking a sweat.


Takeaways

If you're considering building vs. buying for a critical system:

  1. Do the math - At scale, vendor costs can be astronomical
  2. Design for failure - Non-blocking architecture is essential for user-facing systems
  3. Optimize client-side - Performance budgets matter when code runs for hours
  4. Leverage managed services - S3, AppConfig, and existing infrastructure reduce complexity
  5. Measure everything - We track CPU, memory, network, and upload success rates

The result: a plug-and-play system that scales to 50k concurrent users, provides full control over the proctoring experience, and integrates seamlessly into any web application with minimal overhead.