June 10, 2026 · 5 min read

Inside BoutiqueAI: How We Replaced Manual E-commerce Cataloging with a Computer Vision Tagging Engine

Ask any small business owner selling clothing online about their biggest daily friction point, and they will tell you: listing items. For a boutique owner, taking a photo of a dress is easy, but typing out its color, fabric, pattern, neck style, and embroidery details is a manual, boring chore. At BoutiqueAI, we resolved this bottleneck by implementing a high-throughput, dual-stage computer vision tagging engine. Instead of relying on expensive, slow, and non-deterministic Multimodal LLMs, we built a hybrid system combining local pixel-level analysis via Sharp and visual feature matching via a CLIP embedding similarity service.

Why We Bypassed Generative LLMs

When designing the cataloging system, the obvious modern trend was to pass images directly to a multimodal LLM like GPT-4V or Gemini. However, testing in production revealed three major flaws:

  • Latency & Cost: LLM inference takes 3 to 6 seconds per image and costs significant money at scale.
  • Taxonomy Hallucination: An LLM might describe a color as "dusty rose roseate" when our database filter strictly requires "Pastel Pink".
  • Structural Failures: LLMs struggle to calculate spatial color dominance or differentiate between a flat printed pattern and actual 3D embroidered threadwork.

Instead, we engineered a deterministic, two-tier architecture. Tier 1 (local Sharp) extracts low-level pixel features such as edge density and center-weighted colors. Tier 2 (CLIP microservice) classifies high-level semantic categories such as fabric type and pattern style.

Tier 1: Local Pixel-Level Feature Extraction

To extract precise color palettes and fabric texture density, we use the Node.js sharp library. This runs directly on our API server with sub-100ms execution times.

1. Laplacian Edge Convolution for Embroidery Density

To calculate embroidery density, we crop the image (focusing on the middle 84% to avoid background clutter), convert it to grayscale, and run a 3x3 Laplacian convolution filter kernel. The Laplacian kernel [-1, -1, -1, -1, 8, -1, -1, -1, -1] highlights rapid changes in pixel intensity, identifying threads, stitch contours, and textile edges.

sharp.ts — Local Convolution & Grid Variance
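// `cropped` is the sharp pipeline already cropped to the central region of the photo
const { data } = await cropped
  .resize(256, 256) // fixed working size so the grid math stays constant
  .grayscale()
  .convolve({
    width: 3,
    height: 3,
    kernel: [-1, -1, -1, -1, 8, -1, -1, -1, -1], // 3x3 Laplacian
  })
  .raw() // uncompressed pixel bytes instead of an encoded image
  .toBuffer({ resolveWithObject: true });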
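
To make the kernel's effect concrete, here is a minimal hand-rolled sketch of the same 3x3 Laplacian on a tiny grayscale image. It is not how sharp implements convolution; the function name `laplacian`, the zeroed border, and the clamp to the 8-bit range are assumptions made for illustration.

```typescript
// 3x3 Laplacian kernel from the article, in row-major order.
const LAPLACIAN: number[] = [-1, -1, -1, -1, 8, -1, -1, -1, -1];

// Convolve a row-major grayscale image (values 0..255) with the kernel.
// Border pixels are left at 0 to keep the sketch short (an assumption).
function laplacian(gray: Uint8Array, width: number, height: number): Uint8Array {
  const out = new Uint8Array(width * height);
  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      let acc = 0;
      for (let ky = -1; ky <= 1; ky++) {
        for (let kx = -1; kx <= 1; kx++) {
          const k = LAPLACIAN[(ky + 1) * 3 + (kx + 1)];
          acc += k * gray[(y + ky) * width + (x + kx)];
        }
      }
      // Clamp explicitly: assigning to a Uint8Array would otherwise wrap around.
      out[y * width + x] = Math.min(255, Math.max(0, acc));
    }
  }
  return out;
}

// 5x5 test image: flat gray (100) with a single bright pixel (200) in the center.
const img = new Uint8Array(25).fill(100);
img[12] = 200;
const edges = laplacian(img, 5, 5);
```

In this toy image the bright center pixel saturates at 255 while its neighbors clamp to 0, and a completely flat image produces all zeros. That is the behavior the embroidery detector relies on: smooth fabric contributes nothing, while sharp local intensity changes light up.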

We segment the resulting 256x256 convolved matrix into an 8x8 cell grid. For each 32x32px cell, we compute the local edge density and the statistical variance:

variance = (sumSq / cellPixels) - (mean * mean)

If a cell's edge density exceeds 10% and its variance is greater than 20, it is flagged as active. If the global active cell ratio (area score) and edge density meet specific thresholds, we classify the texture as heavy or medium embroidery; otherwise, it is labeled light.
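
The grid pass can be sketched as a pure function over the convolved buffer. The article does not define how per-cell edge density is measured, so this sketch assumes it is the fraction of pixels brighter than a hypothetical `EDGE_LEVEL`; the names `analyzeGrid`, `CellStats`, and `areaScore` are likewise illustrative. The 10% density and variance-above-20 rules come from the text.

```typescript
interface CellStats {
  density: number;  // fraction of edge pixels in the cell
  variance: number; // intensity variance within the cell
  active: boolean;
}

const SIZE = 256;         // convolved image is SIZE x SIZE, one byte per pixel
const GRID = 8;           // 8x8 cells
const CELL = SIZE / GRID; // 32 px per cell side
const EDGE_LEVEL = 40;    // hypothetical: intensity above which a pixel counts as an edge

function analyzeGrid(data: Uint8Array): { cells: CellStats[]; areaScore: number } {
  const cells: CellStats[] = [];
  const cellPixels = CELL * CELL;
  for (let gy = 0; gy < GRID; gy++) {
    for (let gx = 0; gx < GRID; gx++) {
      let sum = 0;
      let sumSq = 0;
      let edgeCount = 0;
      for (let y = gy * CELL; y < (gy + 1) * CELL; y++) {
        for (let x = gx * CELL; x < (gx + 1) * CELL; x++) {
          const v = data[y * SIZE + x];
          sum += v;
          sumSq += v * v;
          if (v > EDGE_LEVEL) edgeCount++;
        }
      }
      const mean = sum / cellPixels;
      // Same formula as in the text: E[v^2] - (E[v])^2
      const variance = sumSq / cellPixels - mean * mean;
      const density = edgeCount / cellPixels;
      cells.push({ density, variance, active: density > 0.1 && variance > 20 });
    }
  }
  // Global ratio of active cells (the "area score").
  const areaScore = cells.filter((c) => c.active).length / cells.length;
  return { cells, areaScore };
}
```

Requiring both conditions filters out two kinds of false positives: a cell with a few isolated bright specks (low density) and a cell that is uniformly bright after convolution (high density, low variance).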

2. Center-Weighted Color Quantization

Extracting a dominant color palette is notoriously tricky because models can get distracted by white walls, wooden hangers, or studio backgrounds. We solved this with a two-part algorithm:

  • Pixel Filtering: We discard pixels that are extremely white (brightness > 245), extremely dark (brightness < 20), flat wall-gray (saturation < 0.08 and brightness > 180), or strong greens (plants).
  • Center-Weighting: We resize the cropped image to 64x64 and compute the Euclidean distance of each pixel from the center coordinate. The pixel's weight decays linearly with that distance and is floored at 0.1 near the outer edges: weight = Math.max(0.1, 1 - dist / (64 * 0.7)).

We quantize each color channel into 32-value intervals, sum the pixel weights per quantized color, keep the top 5 colors by total weight, and assign a dominant color family (Warm, Cool, Neutral, Jewel) using HSL thresholds.
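
The filtering, weighting, and quantization steps can be sketched together as follows. The thresholds and the weight formula are taken from the text; the brightness and saturation definitions, the rule used for "strong greens", the choice of 31.5 as the center coordinate, and all function names are assumptions of this sketch. Color family assignment is left out because the article does not give its HSL thresholds.

```typescript
type RGB = [number, number, number];

const SIDE = 64;   // image is resized to SIDE x SIDE
const BUCKET = 32; // quantization interval per channel

// Background filter. Brightness = channel mean and saturation = (max - min) / max
// are assumed definitions; the green test is a hypothetical stand-in.
function isBackground([r, g, b]: RGB): boolean {
  const brightness = (r + g + b) / 3;
  const max = Math.max(r, g, b);
  const min = Math.min(r, g, b);
  const saturation = max === 0 ? 0 : (max - min) / max;
  if (brightness > 245 || brightness < 20) return true; // near white / near black
  if (saturation < 0.08 && brightness > 180) return true; // flat wall-gray
  return g > 1.4 * r && g > 1.4 * b; // hypothetical "strong green" rule
}

// rgb: interleaved RGB bytes for a SIDE x SIDE image.
function dominantColors(rgb: Uint8Array, top = 5): { color: RGB; weight: number }[] {
  const buckets = new Map<string, number>();
  const center = (SIDE - 1) / 2; // assumed center coordinate
  for (let y = 0; y < SIDE; y++) {
    for (let x = 0; x < SIDE; x++) {
      const i = (y * SIDE + x) * 3;
      const px: RGB = [rgb[i], rgb[i + 1], rgb[i + 2]];
      if (isBackground(px)) continue;
      const dist = Math.hypot(x - center, y - center);
      const weight = Math.max(0.1, 1 - dist / (SIDE * 0.7)); // formula from the text
      const key = px.map((v) => Math.floor(v / BUCKET) * BUCKET).join(",");
      buckets.set(key, (buckets.get(key) ?? 0) + weight);
    }
  }
  return Array.from(buckets.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, top)
    .map(([key, weight]) => ({ color: key.split(",").map(Number) as RGB, weight }));
}
```

Because weights are summed rather than pixels counted, a garment that fills the middle of the frame outranks an equally large patch of color near a corner.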

Tier 2: CLIP Similarity & Texture Discrimination

While local pixel filters excel at finding high-contrast borders, they cannot tell the difference between a high-contrast printed pattern (like block prints) and real 3D embroidery threads. Both create dense edges. To solve this, we query a Python-based CLIP microservice to retrieve a 512-dimension vector embedding of the image and match it against contrastive text prompts.

1. Print vs. Threadwork Contrastive Prompts

When the Sharp tier flags an image as having heavy or medium edge density, we pass the image to our CLIP service and measure similarity against dynamic text prompts:

productWorker.js — Contrastive Text Queries
// Print prompt format
`A colorful ${name} on flat ${categoryHint} fabric with smooth ink texture`

// Embroidery threadwork prompt format
`A detailed ${name} on a ${categoryHint} fabric with visible 3D thread texture and relief`

We compute the maximum threadwork similarity score (threadSum) and the maximum print similarity score (printSum). The Texture Confidence score is defined as:

confidence = threadSum - printSum

If the confidence score drops below -0.05, or sits in the borderline range below 0.08 while the print score is high, we override the Sharp density label. The item is classified as a flat print pattern, mapped to a light density label, and assigned the matching print embroidery type ID through Drizzle.
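
The override rule reads naturally as a small pure function. This is a sketch under stated assumptions: the -0.05 and 0.08 thresholds come from the text, but the article does not say what counts as a "high" print score, so `PRINT_HIGH` is a hypothetical cutoff, and `resolveTexture` and its types are illustrative names.

```typescript
type Density = "light" | "medium" | "heavy";

interface TextureDecision {
  confidence: number; // threadSum - printSum
  isPrint: boolean;   // true when CLIP overrides the Sharp label
  density: Density;   // final density label
}

const PRINT_HIGH = 0.25; // hypothetical cutoff for "the print score is high"

function resolveTexture(
  sharpDensity: Density,
  threadSum: number,
  printSum: number
): TextureDecision {
  const confidence = threadSum - printSum;
  // CLIP is only consulted for medium or heavy edge density.
  const isPrint =
    sharpDensity !== "light" &&
    (confidence < -0.05 || (confidence < 0.08 && printSum >= PRINT_HIGH));
  return { confidence, isPrint, density: isPrint ? "light" : sharpDensity };
}
```

For example, `resolveTexture("heavy", 0.20, 0.30)` yields a print classification with a light density label, while `resolveTexture("heavy", 0.30, 0.20)` keeps the heavy embroidery label from the Sharp tier.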

2. Composite Slot-Weighted Product Embeddings

Boutique items like salwar suits or lehengas consist of multiple components (e.g., Kameez, Dupatta, Salwar). A single image embedding fails to capture the product accurately because one camera frame often contains all three pieces at once. We solved this by allowing sellers to upload separate close-ups for each slot. We compute an individual 512-dimension CLIP embedding for each component and aggregate them into a single product vector using category-driven slot weights:

productWorker.js — Weighted CLIP Aggregation
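// Example weights: Kameez: 0.6, Dupatta: 0.3, Salwar: 0.1
// weightedSlots: [{ embed: number[] (length 512), weight: number }, ...]
// totalWeight is assumed here to be the sum of the weights of the uploaded slots
const totalWeight = weightedSlots.reduce((sum, slot) => sum + slot.weight, 0);
const mainEmbedding = new Array(512).fill(0);
for (let i = 0; i < 512; i++) {
  // Weighted average of dimension i across all slots
  mainEmbedding[i] = weightedSlots.reduce(
    (sum, slot) => sum + slot.embed[i] * slot.weight, 0
  ) / totalWeight;
}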

This weighted composite embedding is stored in our PostgreSQL database using Drizzle ORM. Because the main piece carries the largest weight, vector search results prioritize the main clothing piece while still matching details in the accessory pieces.

Decoupled Processing with BullMQ & Redis

Because running multiple local Sharp convolutions and making API calls to our CLIP microservice takes about 1.5 to 2 seconds, we run this process completely asynchronously. When a seller uploads their catalog images, the API server saves the raw record, returns a success code instantly, and publishes a job to a BullMQ queue backed by Redis. A background worker picks up the job, runs the dual-stage analysis, and updates the database using Drizzle ORM when complete.

Summary of Results

By shifting from a heavy LLM prompt pipeline to a local-first Sharp convolve and custom CLIP texture matching service, we achieved:

  • 90%+ Cost Reduction: We run no expensive generative vision APIs on upload.
  • Sub-2 Second Async Latency: Bulk uploads of entire collections are processed in seconds.
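  • Consistent Taxonomy: Every tag is drawn from a fixed label set, so the output always maps to a valid database filter value, and contrastive texture prompting helps keep flat prints from being labeled as embroidery.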

Understanding pixels and visual semantics directly allowed us to build a lightning-fast cataloging engine that feels like magic to boutique owners.

Thanks for reading.
