Kitchen Surveillance with OpenClaw: Person Detection Under 50MB RAM
How I built a real-time kitchen surveillance agent on a Redmi Note 7 Pro using MobileNet-SSD, OpenCV, and TFLite — detecting people under 50MB RAM and sending Telegram photo alerts.
most computer vision demos assume you have a GPU. this one runs on a 6-year-old phone with 4GB RAM, LineageOS, and a cracked screen. and yes, it actually works.
this post is about the person-detection module i built on top of OpenClaw on my Redmi Note 7 Pro. if you haven't read the setup post yet, start there: how I installed OpenClaw on this same phone. it covers Termux, PM2, SSH, and 24/7 runtime basics. this one covers the surveillance layer on top.
end result: ~8 FPS real-time detection under 50MB RAM, plus Telegram alerts with annotated photos every time someone entered frame.
The Problem
normal surveillance stacks usually mean a Raspberry Pi, a dedicated camera module, and often a cloud subscription on top. i wanted cheaper and self-contained: use one phone, flash clean AOSP-based LineageOS, run an always-on agent, detect people, and push alerts to Telegram. hardware cost: under ₹12,000 (~$150).
main constraint: 4GB RAM total on device, shared by OS + agent runtime + Telegram. CV pipeline had to stay under 50MB peak working set.
Why MobileNet-SSD Instead of YOLO
YOLO is the obvious first choice. it's accurate, fast, and everywhere. i evaluated YOLOv5-nano for edge use. problem: ~200MB model footprint and 70-80MB RAM working set in my tests on this device.
MobileNet-SSD V2 (COCO INT8 quantized) was a better fit:
- Model size: 6.9MB (quantized)
- RAM working set: ~30-35MB including frame buffers
- Inference time: ~120ms on Snapdragon 660 CPU
- Person-class precision: ~72% on COCO
for surveillance alerts, mAP isn't the only thing. false negatives matter more. i set confidence threshold to 0.45 after testing across kitchen lighting. at that threshold, misses were about 1 in 15 clearly visible people (usually fast edge movement). false positives (chairs/bags) were reduced by requiring 2 consecutive detections within 500ms before alert.
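the two-consecutive-detections filter is a small debounce over timestamps. here's a minimal sketch of the idea — class and parameter names are mine for illustration, not the actual module code:

```python
import time
from collections import deque
from typing import Optional


class Debouncer:
    """Fire only after `hits` detections land inside a `window_s` span."""

    def __init__(self, hits: int = 2, window_s: float = 0.5):
        self.window_s = window_s
        self.times = deque(maxlen=hits)  # keeps only the last `hits` timestamps

    def detection(self, now: Optional[float] = None) -> bool:
        """Record one detection; return True when the alert should fire."""
        now = time.monotonic() if now is None else now
        self.times.append(now)
        # Oldest and newest of the last `hits` detections must be close together.
        return (
            len(self.times) == self.times.maxlen
            and self.times[-1] - self.times[0] <= self.window_s
        )
```

a lone chair-shaped false positive rarely repeats within 500ms, while a real person in frame triggers on the very next inference pass.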
Architecture
the pipeline ran as a background Android service and was exposed to OpenClaw through a local Unix socket.
Camera2 API → Frame buffer → OpenCV resize → MobileNet-SSD → Filter → Telegram alert
Camera capture: Camera2 via Python-for-Android (P4A), capture at 640×480, then downsample to 300×300 for model input. this avoids expensive full-res float32 allocation inside inference path.
OpenCV: resize, YUV→BGR conversion, and bounding-box rendering for alert image. building OpenCV-contrib for ARM32 with correct ABI/API level was honestly harder than the detection logic.
Inference: TensorFlow Lite Android .so loaded through ctypes.
why not PyTorch Mobile? TFLite delegate path on Snapdragon 660 offered better practical performance for this setup.
DSP offload wasn't clean on this device, but INT8 still ran ~40% faster than FP32 due to cache behavior.
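for context, TFLite's INT8 scheme stores each tensor as uint8 plus a scale and zero point, so activations stay one byte each until a single multiply-add dequantizes them — that quartered memory traffic is where the cache win comes from. a minimal sketch of the affine mapping (not the library's internal code):

```python
import numpy as np


def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map float32 to uint8 via TFLite's affine scheme: q = round(x/scale) + zp."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)


def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values: x = (q - zp) * scale."""
    return (q.astype(np.float32) - zero_point) * scale
```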
The Telegram Integration
on each fired detection:
- capture full-res annotated frame (1280×960) from separate capture thread
- encode JPEG at 85% (~60-90KB)
- send via the Telegram Bot API using `requests`
image includes box, confidence, and timestamp. rate limit: max one alert every 30s to avoid notification spam.
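the 30s rate limit is a thin wrapper around a monotonic clock; a minimal sketch (names are illustrative, not the actual module code):

```python
import time
from typing import Optional


class RateLimiter:
    """Allow at most one alert every `interval_s` seconds."""

    def __init__(self, interval_s: float = 30.0):
        self.interval_s = interval_s
        self._last = float("-inf")  # so the very first alert always fires

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True (and arm the cooldown) if an alert may fire now."""
        now = time.monotonic() if now is None else now
        if now - self._last >= self.interval_s:
            self._last = now
            return True
        return False
```

using `time.monotonic()` instead of wall-clock time matters on a phone: NTP corrections or timezone changes can't re-open or extend the cooldown window.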
```python
import cv2
import numpy as np
import requests


def send_alert(frame: np.ndarray, confidence: float) -> None:
    """
    Send annotated detection frame to Telegram.

    Draws bounding box and confidence before encoding.

    Args:
        frame: BGR frame from capture thread
        confidence: Detection confidence score (0-1)
    """
    annotated = draw_detection(frame, confidence)
    _, buf = cv2.imencode(".jpg", annotated, [cv2.IMWRITE_JPEG_QUALITY, 85])
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendPhoto",
        # (filename, bytes, content-type) tuple so the multipart part
        # carries a proper filename for Telegram.
        files={"photo": ("alert.jpg", buf.tobytes(), "image/jpeg")},
        data={"chat_id": CHAT_ID, "caption": f"Person detected ({confidence:.0%})"},
        timeout=10,
    )
```

RAM Budget
actual numbers during runtime:
| Component | RSS |
| -------------------- | ------ |
| Python runtime + P4A | ~12MB |
| OpenCV (ARM32 .so) | ~8MB |
| TFLite model (INT8) | ~7MB |
| Frame buffers (2×) | ~3MB |
| Telegram + requests | ~4MB |
| Misc / overhead | ~5MB |
| Total | ~39MB |
two-frame strategy (300×300 INT8 detection frame + full-res alert frame) kept memory predictable. python GC occasionally spiked near ~45MB, but 8-hour overnight test never crossed 50MB.
Results
72-hour real kitchen run:
- True positive rate: 94%
- False positive rate: ~3% (mostly strong backlight silhouettes)
- Alert delivery latency: 1.8s median
- System stability: no crashes, memory stable around 38-42MB RSS
this module ran inside the full OpenClaw stack (voice I/O, UI automation, agent networking) on the same phone. fitting everything into 4GB with headroom was the real challenge.
if you want broader system context, read my AI agent orchestration system post.
Written by Mohd Mursaleen — AI agent engineer based in Bengaluru, India. Building OpenClaw, Champion, and Claritel. Reach me at geekymd.me.