770 mentions in ten hours. Three were worth my time.
I pointed a keyword watch at #buildinpublic and got a firehose. Here's the scoring system that finds the 1% of mentions actually worth a founder's reply — intent, ICP fit, author quality, and freshness.

Last week I added #buildinpublic as a keyword watch on X. In the next ten hours, the system ingested 770 matching posts.
Seven hundred and seventy. In ten hours. From one keyword.
Here's the uncomfortable math: if I spent just sixty seconds judging each one — open it, read it, decide — that's nearly thirteen hours of triage for one day of one keyword. The mentions were not the bottleneck. The mentions were never the bottleneck. Every keyword-alert tool on earth will happily bury you in mentions for $15 a month.
The bottleneck is the question that comes after: which of these is worth me?
So I let the scorer run on the backlog, and watched what it did. Of 770 posts, it classified most as noise on sight — people using the hashtag as decoration, retweet chatter, "day 47 of building in public" diary posts with no problem and no question in them. A few dozen scored mid-range: topically close, wrong buyer. Exactly two scored above 85.
Two. I read both. Both were founders describing — in their own words, unprompted — the exact problem my product exists to solve. I replied to both. That was my founder-led GTM for the day: ten minutes, two conversations, zero scrolling.
This post is about the rubric in the middle, because the rubric is the product. Mentions are a commodity. Judgment is the thing you're actually short on.
A keyword match is not a lead
The original sin of social listening tools is the implicit promise that a mention equals an opportunity. It doesn't. A mention is a string match. The distance between "string matched" and "person worth replying to" is four separate judgments, and most tools make zero of them:
1. Intent. What is this person actually doing in this post? There's a hierarchy, and it's steep. Someone asking "what do you all use for X?" is gold — they are requesting the reply you want to write. Someone describing a pain ("this is eating my evenings") without asking is silver — nobody else is answering them, because they didn't flag themselves as a buyer. Someone mentioning your category in passing is bronze on a good day. Someone using your keyword in a meme is nothing. The same keyword surfaces all four.
2. ICP fit. Is this person your buyer, or just standing near your topic? A student asking hypothetically, an enterprise buyer who needs SSO yesterday, and a two-person bootstrapped team all phrase the same question almost identically. Context decides — what they're building, the scale they mention, the words they use for budget. This judgment requires knowing your product deeply, which is precisely why generic tools can't make it.
3. Author quality. A six-year-old account with real post history that replies to its commenters is a person. A three-day-old account that's posted the same question in nine subreddits is a pattern. Account age, karma, declared role, whether they engage back — these signals are sitting in public, and almost nobody reads them before replying.
4. Freshness — with an asterisk. Most replies earn their reads in the first hours of a thread. But threads that rank in search ("best tool for X") convert for years, slowly. Old plus high-search beats new plus ephemeral. Recency is a tiebreaker, not a king.
Score it or scroll forever
Here's the thing I'd tell any founder doing this manually: write your rubric down. Actually write it. Four columns in a note, score ten threads a day against it for two weeks. Two things happen. First, your triage gets fast — you stop reading thread number 40 to the end because the rubric kills it in line one. Second, you produce the most underrated marketing asset a startup can own: a written, falsifiable definition of who is about to buy from you.
The written rubric is also what lets you delegate — to a teammate, or to software. Judgment you can't articulate is judgment you can't scale.
What Thread Otter does is run that rubric automatically: every surfaced mention gets classified for intent, scored against your ICP (built from your actual product pages, not a form you filled out optimistically), weighted by author signals, and ranked. The 770 became two before I woke up. But the rubric works on paper too. The tool just removes the thirteen hours.
The numbers nobody puts on their landing page
The honest ratio from my ten-hour firehose: roughly 0.3% of keyword mentions were worth a founder reply. Not 30%. Not 3%. Zero point three.
That ratio is why "we'll send you every mention!" is a threat dressed as a feature, and why reply-volume tools that blast answers at everything are community poison — when 99.7% of matches aren't worth a reply, a tool that replies to all of them is wrong 99.7% of the time, publicly, in your name.
It's also why the right amount of founder-led GTM is small. Two great replies beat two hundred mediocre ones on every axis that matters: conversion, account health, community reputation, and whether you still have a soul at 11pm.
Watch fewer keywords. Score everything. Reply to almost nothing — but reply to it well.