I’ve been consuming a lot of LLM/agent content lately. Some of it is genuinely useful; most of it is noise.
This page is my attempt to keep a practical signal stack — sources + a mental model that help me build agents that hold up past the demo.
Current staples I already follow:
- Xiaohongshu “update” feeds (fast trend radar)
- Anthropic blog (Newsroom / updates)

What I’m doing now is turning that into a structured system.
The mental model (inspired by Anthropic’s “Building Effective Agents”)
The main framing I stole from Anthropic is simple:
Start with the simplest workflow that can solve the problem, then progressively “agentify” only when complexity demands it.
Very SDE-friendly: get something observable, testable, and debuggable before you crank up autonomy.
I bucket agent building into three layers:
1) Primitives (components)
These are the parts you’ll reuse across systems.
Tools / action space
Tool calling is table stakes. The real work is discovery, permissions, schema discipline, and keeping context from exploding when tool count grows.
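To make “schema discipline” concrete, here is a minimal sketch of keeping tool definitions strict and exposing only the tools a task actually needs. The names (`ToolSpec`, `ToolRegistry`, `search_docs`) are my own illustrations, not any framework's API.

```python
# Sketch: strict tool specs plus an allow-list, so the model only sees the
# tools a task needs (keeps context from exploding as the tool count grows).
# ToolSpec, ToolRegistry, and search_docs are illustrative names, not a real API.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]          # JSON Schema for the arguments
    handler: Callable[..., Any]
    requires_approval: bool = False     # crude permission flag

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def export(self, allowed: set[str]) -> list[dict[str, Any]]:
        """Return only the allow-listed tools, in the generic
        {name, description, parameters} shape most tool-calling APIs accept."""
        return [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in self._tools.values()
            if t.name in allowed
        ]

registry = ToolRegistry()
registry.register(ToolSpec(
    name="search_docs",
    description="Search internal docs. Returns at most 5 snippets.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,   # schema discipline: no silent extra args
    },
    handler=lambda query: f"results for {query!r}",
))
```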
Environment / state
Agents don’t live in chat logs; they live in an environment — browser, file system, UI state, terminal, DB.
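A minimal sketch of what treating the environment as explicit, inspectable state could look like. The fields are illustrative placeholders, not a real harness; a real one would snapshot whatever surfaces the agent actually touches.

```python
# Sketch: the environment as an explicit object the agent reads and writes,
# rather than something implied by chat history. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class EnvState:
    cwd: str = "/workspace"
    open_files: dict[str, str] = field(default_factory=dict)   # path -> contents
    browser_url: str | None = None
    last_terminal_output: str = ""
    db_dirty: bool = False      # did the agent mutate persistent state?

    def summary(self, max_chars: int = 500) -> str:
        """Compact view to put in the prompt instead of dumping everything."""
        text = (
            f"cwd={self.cwd}; files={list(self.open_files)}; "
            f"url={self.browser_url}; db_dirty={self.db_dirty}"
        )
        return text[:max_chars]
```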
2) Patterns (composable architectures)
Instead of jumping straight to a “general agent,” I think in patterns that compose cleanly:
- Prompt chaining / Routing / Parallelization
- Evaluator–Optimizer (review → improve → re-evaluate loops; see the sketch after this section)

3) Harness & evals (reliability)
This is the difference between a demo and something you can ship:
- drift control over long runs

If I can’t measure reliability, I don’t trust it.
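The Evaluator–Optimizer loop is the easiest of these patterns to sketch. A minimal version, assuming a generic `llm(prompt)` call; the rubric, threshold, and round cap are placeholders, not a prescribed setup.

```python
# Sketch of the Evaluator-Optimizer loop: draft -> critique -> revise, with a
# cap on rounds so long runs can't drift forever. `llm` is a stand-in for any
# chat-completion client; the grading format and threshold are placeholders.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def evaluate(task: str, draft: str) -> tuple[float, str]:
    """Ask the model to grade the draft; return (score in [0, 1], critique)."""
    reply = llm(
        f"Task: {task}\nDraft: {draft}\n"
        "Score the draft from 0.0 to 1.0, then give one concrete fix.\n"
        "Format: <score>|<critique>"
    )
    score_str, _, critique = reply.partition("|")
    try:
        return float(score_str.strip()), critique.strip()
    except ValueError:
        return 0.0, reply   # treat unparseable grades as failures, keep raw text

def evaluator_optimizer(task: str, max_rounds: int = 3, threshold: float = 0.8) -> str:
    draft = llm(f"Task: {task}\nWrite a first attempt.")
    for _ in range(max_rounds):
        score, critique = evaluate(task, draft)
        if score >= threshold:
            break
        draft = llm(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\nRevise the draft."
        )
    return draft
```

The round cap doubles as crude drift control: the loop cannot wander indefinitely, and the score trace gives me something to plot per run.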
My source list (ranked by “signal per minute”)
Rule of thumb:
- Tier 0: primary sources — subscribe and skim regularly
- Tier 1: translators — podcasts/newsletters that turn research into engineering lessons
- Tier 2: benchmarks/evals — calibration tools so I don’t fool myself

Tier 0 — Primary sources (must-follow)
- Anthropic (Research / Engineering): Building Effective AI Agents; Demystifying evals for AI agents
- OpenAI Docs (Agents + tool calling)
- LangChain / LangGraph / LangSmith
- Manus (context engineering, very practical): Context Engineering for AI Agents: Lessons from Building Manus
- Hugging Face (open-source ecosystem radar)
- Google DeepMind Blog (research trends)

Tier 1 — Podcasts / newsletters (weekly pick)

Tier 2 — Benchmarks / evals (calibration tools)
- AgentBench (LLM-as-agent benchmark)
- WebArena (web environment benchmark; great for browser/tool agents)

Chinese sources (fast + noisy, still useful)
I treat these as trend radar, not truth. The goal is “early signal,” then I verify via Tier 0.
Websites / cross-posted feeds

WeChat Official Accounts I actually keep an eye on
These are the ones that consistently surface papers, product updates, and industry moves quickly:
- 新智元 (often described as “智能+中国”, roughly “intelligence + China”; sometimes shorthanded as “new intelligence era”)
- 机器之心 (official account: almosthuman2014)

Xiaohongshu (my filter keywords)
MCP, LangGraph, agent eval, context engineering, tool calling, memory
YouTube (implementation > hype)

X (Twitter) — “early signal” accounts
Orgs
People

Hands-on anchors (so I don’t stay theoretical)

1) nanochat — a minimal end-to-end ChatGPT-style stack
When I feel like I’m consuming too much and building too little, I go back to one repo and follow the plumbing.
My notes template:
- How is data + tokenization handled?
- What are the critical engineering points in the training loop?
- What evals exist, and what’s the minimum viable eval?
- What does the inference/serving loop look like end-to-end?

2) Memory-first agents (MemGPT / Letta)
I’m increasingly convinced memory is a real separator for long-running agents — not because it’s fancy, but because it reduces repetition and improves continuity.
What I care about:
- memory tiers (working vs long-term vs externalized)
- write policies (when to commit memory)
- retrieval policies (what to pull back, and when)
- compression without breaking correctness

I keep this linked to a separate page where I run small experiments (same task over multiple days; measure steps/retries/tool errors).
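A minimal sketch of those policies, assuming nothing beyond the standard library. The write and retrieval policies here are deliberately dumb placeholders (prefix checks, keyword overlap), and the counters match what I track in the experiments above.

```python
# Sketch: a working buffer with a size cap, an explicit write policy (what gets
# committed to long-term memory on eviction), and a retrieval policy (what gets
# pulled back in). All names and heuristics are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class Memory:
    working: list[str] = field(default_factory=list)     # recent turns / events
    long_term: list[str] = field(default_factory=list)   # committed facts
    working_limit: int = 20

    def observe(self, event: str) -> None:
        self.working.append(event)
        if len(self.working) > self.working_limit:
            evicted = self.working.pop(0)
            if self._worth_keeping(evicted):              # write policy
                self.long_term.append(evicted)

    def _worth_keeping(self, event: str) -> bool:
        # Placeholder write policy: keep decisions and errors, drop chatter.
        return event.startswith(("DECISION:", "ERROR:"))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Placeholder retrieval policy: naive keyword overlap, newest first.
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(m.lower().split())), m)
            for m in reversed(self.long_term)
        ]
        return [m for score, m in sorted(scored, reverse=True)[:k] if score > 0]

@dataclass
class RunMetrics:
    # The per-run counters I compare across days for the same task.
    steps: int = 0
    retries: int = 0
    tool_errors: int = 0
```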
My low-friction weekly cadence
- Daily (10 min): skim Tier 0 (Anthropic / OpenAI / LangChain / Manus)
- Weekly (1 hr): 1–2 podcasts (Latent Space or High Agency)
- Weekly (30 min): 1 eval/benchmark paper or an eval-focused post
- Biweekly: write a short “what I learned + how I’ll apply it” note

Personal reminder (so I don’t drift)
- Don’t chase new buzzwords — chase new failure modes and how people fix them
- Prefer postmortems and production stories over “top 10 frameworks” lists
- If I can’t measure reliability, I’m not done building