<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Starred Articles</title>
    <description>Starred Articles</description>
    <atom:link href="https://feedbin.com/starred/e4e3c772e7fb0d23e277e27aefcbe675.xml" rel="self" type="application/rss+xml"/>
    <link>https://feedbin.com/</link>
    <item>
      <title><![CDATA[Rui Ma on X: "100 Hours Inside Kimi"]]></title>
      <description><![CDATA[<div class="css-175oi2r r-j5o65s r-qklmqi r-1adg3ll r-1ny4l3l"><p class="css-175oi2r"></p><article class="css-175oi2r r-18u37iz r-1udh08x r-1c4vpko r-1c7gwzm r-1ny4l3l"><div class="css-175oi2r r-eqz5dr r-16y2uox r-1wbh5a2"><div class="css-175oi2r r-16y2uox r-1wbh5a2 r-1ny4l3l"><div class="css-175oi2r r-18u37iz r-136ojw6"><div class="css-175oi2r r-18kxxzh r-1wron08 r-onrtq4 r-1awozwy"><div class="css-175oi2r"><div class="css-175oi2r r-18kxxzh r-1wbh5a2 r-13qz1uu"><div class="css-175oi2r r-1wbh5a2 r-dnmrzs"><div class="css-175oi2r r-bztko3 r-1adg3ll"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1adg3ll r-1pi2tsx r-13qz1uu r-45ll9u r-u8s1d r-1v2oles r-176fswd r-bztko3"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-sdzlij r-1udh08x r-5f1w11 r-u8s1d r-8jfcpp"><a href="https://x.com/ruima" class="css-175oi2r r-1pi2tsx r-13qz1uu r-o7ynqc r-6416eg r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-sdzlij r-1udh08x r-633pao r-45ll9u r-u8s1d r-1v2oles r-176fswd"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="" src="https://pbs.twimg.com/profile_images/1413221549692227589/ME01cG25_x96.jpg" class="css-9pa8cd"></div></div></div></div></a></div></div></div></div></div></div></div></div></div></div><div class="css-175oi2r"><div class="css-175oi2r r-1ny4l3l"><div class="css-175oi2r r-1peese0"><div class="css-175oi2r"><div class="css-175oi2r r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/ruima/article/2039245985520681257/media/2039243976650420224" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai 
r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/HEzZl6aacAA6uVQ?format=jpg&amp;name=medium" class="css-9pa8cd"></div></div></div></a></div></div></div></div></div><div class="css-175oi2r r-vmopo1"><div class="css-175oi2r r-1knelpx"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This is a translated article from Chinese, originally published by </span></span><span><span>Renwu</span></span><span><span>, a respected Chinese magazine known for reported features and profile writing. It has been adapted for readers who may not know the Chinese context, company culture, or references. I’ve made it more readable, added brief context where needed, and smoothed some phrases that would sound strange if translated directly.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>By Liu Mo
Edited by Jin Zha
</span></span><span><span>
Originally published in Chinese by</span></span><span><span> </span></span><span><span>Renwu</span></span><span><span> </span></span><span><span>(人物)</span></span><span><span> </span></span><span><span>on March 31, 2026</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Spring 2026 has been unusually kind to Kimi.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In just a few months, the company behind Kimi seemed to hit one milestone after another. Its revenue, fundraising, and valuation all kept breaking records. A research paper co-authored by a 17-year-old high school intern received praise from Silicon Valley figures including Elon Musk. And Cursor, the U.S. coding startup valued at around $50 billion, was accused by Chinese observers of essentially “wrapping” or heavily relying on Kimi’s model as part of its own product experience. In other words, Kimi suddenly seemed to be winning on all three fronts at once: capital, technology, and commercial traction.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This startup is only three years old. Its valuation has already surpassed RMB 120 billion, or roughly $16 billion. It is becoming impossible to ignore in the global AI story.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>And yet Moonshot AI, the company behind Kimi, remains deeply mysterious.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I was given permission to spend 100 hours observing the company from the inside. 
As an independent writer, I was allowed to interview any employee willing to talk, sit in on any meeting that did not involve trade secrets, and write freely afterward. No one would edit my work. I would not be paid. That, it turns out, is very much in character for this company.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Inside the office, it feels like standing in the eye of a storm.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At the center, everything is strangely still. The desks are quiet. Only scattered keyboard sounds break the silence. Occasionally you hear someone laugh. But the noise outside, the rumors, arguments, hype, imitation, and endless commentary, seems to leave no trace here.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>There are just over 300 employees. Their average age is under 30. Each person, if you divide the company valuation by headcount, is effectively carrying close to RMB 400 million in enterprise value on their shoulders.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>About 80% of the staff are what Chinese internet slang calls “I people,” meaning introverts, borrowing from MBTI language. People sit side by side, but they are more comfortable typing than talking. Here, introversion is not treated as a flaw. It is almost an operating protocol.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I thought back to my first visit in 2024, on a night when the storm was only beginning to gather. 
At the time, I did not come away with a particularly positive first impression.</span></span></div></div><section class=""><div class="css-175oi2r r-1nxhmzv"><div class="css-175oi2r r-13qz1uu"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/ruima/article/2039245985520681257/media/2039245589666578432" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/HEzbDzXbwAANiRa?format=jpg&amp;name=medium" class="css-9pa8cd"></div></div></div></a></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The night of December 24, 2024, was Christmas Eve, though for most people in China it was not a holiday that mattered much. For Julian, it became one of the darkest nights of her life.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She was 26, had graduated from Peking University only two years earlier, and had no prior industry experience. Yet she was already one of the earliest employees at Kimi. 
That night, this very young yet already “senior” employee sat at the long table in a conference room called Radiohead, crying in front of more than 30 colleagues.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She still had not delivered a holiday marketing plan that met the standards of the co-founders.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Chinese New Year was only a month away. The latest plan had already been revised six times, and now it needed to be upgraded again, perhaps even scrapped entirely. The odds of rebuilding it from scratch and then coordinating product and engineering to execute it in time were slim. But the company had high hopes for growth during the 2025 Lunar New Year period.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That mattered because the previous Lunar New Year had been a breakthrough moment for Kimi. It had gone viral in China thanks to its branding around handling “2 million Chinese characters of long-context input,” which was unusually advanced at the time. Consumer users surged, and in the Chinese stock market people even started talking about “Kimi concept stocks,” meaning public companies loosely associated with the trend.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That weekly meeting was long and brutal.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Around 20 young employees, most as inexperienced as Julian, took turns reporting on everything: social media ads, user operations, PR in China, overseas marketing, all the details. 
The group discussed everything collectively, and the co-founders made the final calls.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Kimi at that point felt like an adolescent: talented, full of potential, but not yet fully in control of itself. Even with a monthly advertising budget of tens of millions of RMB, it still looked clumsy in the face of fast-rising competitors.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The meeting ended around 4 a.m.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>No one knows whether Julian’s final plan would have succeeded. A month later, it no longer mattered.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That was when the world first heard the name DeepSeek.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Hayley, who worked on growth, went home to Wenzhou for the holiday and found that relatives and friends all asked the same question: “Have you heard of DeepSeek?” It was as if Kimi had suddenly become yesterday’s news.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She says that was the hardest Lunar New Year of her life. The silence inside the company was deafening.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The annual company meeting is usually held in March, after the holiday. Employees are allowed to challenge management directly. 
That year, almost every question revolved around DeepSeek.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The sharpest question came from the HR team. With complete sincerity, they said the uncomfortable thing out loud:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“How are we supposed to answer candidates when they ask: DeepSeek also gave me an offer. Why should I join Kimi instead?”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>But not everyone reacted the same way.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Alex from the algorithm team says that if he felt any strong emotion during the “DeepSeek moment,” it was not fear. It was excitement.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That feeling was not just personal. It reflected the mood of much of the algorithm team. DeepSeek had shown that there might be another way: lower-cost strategies, open-source approaches, and a truth many people had doubted before. A little-known Chinese startup, if its technology was strong enough and its model was good enough, could still earn global respect.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The product team was not especially anxious either. Kevin, one of the earliest product employees, believed that DeepSeek had broken out because of its model. 
Once Kimi’s own model capabilities caught up, he believed the product team would have even more room to build useful features on top.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>No outsider knows exactly what discussions the co-founders had. But the company moved quickly. It adjusted strategy, narrowed focus, and reached something close to full internal alignment.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Ask almost anyone inside the company what matters most now, and they will answer without hesitation: the model.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>From then on, you could feel a growing respect for DeepSeek inside Kimi. Part of it was professional admiration. Part of it was something else.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“In a way, DeepSeek saved us.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“Why are you wearing shoes like that?”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>After Ezra asked me that, I was more surprised than she was. On her floor of the office, almost everyone keeps a pair of slippers under the desk. 
Comfortable clothes and shoes, people believe, make you more relaxed, more focused, and more creative.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This is the dress code of smart people.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I have met many high-achieving students in my life. But the “good students” here are a very different species.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>When Ezra was in elementary school, she tried to hack the family computer because her parents would not tell her the password. In middle school she became interested in Bitcoin, when one coin cost only a few hundred RMB. She asked her mother for spending money to invest; her mother told her it was a scam. In high school, the first time she ever took a taxi, she sketched out a ride-hailing product concept. Had today’s AI tools existed back then, she says, maybe she could have launched it. Once she finally had some money of her own in college, she put it into the Chinese stock market and lost 90%.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That disaster taught her something about the limits of human judgment, and pushed her toward AI.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Her view of AGI, or artificial general intelligence, is simple: create “N Einsteins” and use them to solve humanity’s hardest problems. From that point on, she became determined to find a company that would truly push the limits of AGI. 
This was despite the fact that she had already made her investment losses back in the stock market.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Because of her strong academic background, she received offers from many companies. She chose Kimi for one reason: during the interview, she was deeply impressed by founder Yang Zhilin’s understanding of technology and his seriousness about details. She felt he genuinely cared about models. He did not have the restlessness often seen in smart people, nor the utilitarian instinct common in businesspeople. In fact, by the end of the interview, she still did not know he was the founder.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Karen’s personality is different but leads to a similar place.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He was rebellious from childhood. He argued with teachers. He never listened to his parents. As a student, he insisted on going abroad. After graduating, he insisted on starting a business. The comfortable and stable life offered by a big Chinese tech company made him despair. 
He did not want a life whose ending was visible from the beginning.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I asked him: if given the choice between a guaranteed 60 out of 100, and a 1% chance at 100 out of 100, which would you choose?</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He chose the latter without hesitation.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>It was not that he could not tolerate a score of 60. He just hated the certainty of that 100% path.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That founder-like DNA forms part of the company’s underlying texture. By rough internal count, at least 50 people at Moonshot AI have founded or joined startups before.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Kimi, apparently, likes hiring CEOs.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>A more accurate way to put it is this: the company shelters a rotating population of gifted drifters. A genius is not necessarily a top student or model employee. What matters is that in some dimension, they can see through time.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At a company where around 80% of employees come from China’s elite “985” and “211” universities, Yannis’s résumé does not look especially impressive. 
Yet as early as 2023, he had already predicted in engineering communities that both DeepSeek and Kimi would rise, at a time when model companies barely had products at all. Another employee, himself born after 2000, noticed Yannis’s insight and recommended him internally.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Karen says too many smart people get trapped by systems. First the family, then school, then the workplace. They obey group expectations without realizing it and lose sight of what they actually want. Only a small number try to escape, and even they often go unseen.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>One of Kimi’s missions, he says, is to see them.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Without that instinct, a 17-year-old high school student would never have been brought in as a Kimi intern, collaborated with the team, and published a paper that later drew praise from Elon Musk. The person who put that student’s name first on the paper was Bob, the mentor who first spotted him.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>There is only a thin line between genius and madness. When a “misunderstood madman” arrives at Moonshot AI, he may suddenly become a world-changing genius. Or perhaps some still-hidden genius can only truly bloom in a place like this.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Bob told me that, to some extent, having a big ego is not a problem. It may even be a good sign. 
If that ego functions as inner drive, if someone believes they must be part of a great mission, that may be exactly the sort of person the company cannot afford to miss.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Inside this team, training a top AI model is jokingly called “alchemy,” a common Chinese tech term for the mysterious, half-scientific, half-artistic process of model training. But in practice, alchemy means constantly fixing bugs.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Once a flagship training run begins, Bob and his teammates fall into the same ritual. The first thing they do every morning is refresh the company’s massive set of internal monitoring dashboards. Hundreds of thousands of metrics. If even one curve spikes abnormally, alarms go off in their heads. Was there a problem in optimization? A flaw in the architecture? A mismatch in numerical precision?</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>They react with almost animal sensitivity.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Some people even inspect training data token by token, printing out those that produced extreme gradients and interrogating them like suspects: why did you jump so violently?</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Everyone who has ever truly participated in “delivering” one of these models has lived through this kind of sleepless tension. It is not really anxiety. It is curiosity driving obsession. 
That obsessive vigilance is part of what pushed the model toward top-tier performance.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Over the past year, more than 100 of Kimi’s hires came through referrals, friends or friends of friends. Inside the company, this is jokingly called “human-to-human transmission.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Trust, because of these dense networks, becomes a natural organizational asset.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In essence, Kimi shifts the hardest part of management onto recruiting. If people are brought in by trusted peers, they are more likely to share the same instincts. This is why one word comes up over and over inside the company:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>One night in September 2025, several engineers casually launched a small internal project and named it </span></span><span><span>Ensoul</span></span><span><span>. 
They wanted code sleeping inside files to “come alive” and become a conversational assistant inside the command line.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This sensitivity to naming is not accidental.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>They once had a framework called </span></span><span><span>YAMAHA</span></span><span><span>, short for “Yet Another Moonshot Agent.” Their deepest infrastructure layer was called </span></span><span><span>Kosong</span></span><span><span>, which means “empty” in Malay, inspired by the Buddhist phrase “emptiness is form.” It was meant to suggest a blank sheet of paper with no pre-assigned function, but infinite potential.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Taste, in other words, shapes the product itself.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>While many other companies were shoving chat windows into the command line, Kimi’s engineers thought that was ugly. Real programmers open a terminal to issue commands, not to chat. So Kimi CLI was designed to feel more like a smart shell than a chat interface. It understands commands, but does not force itself into the shape of a conversation box.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This minimalism is visible in the code too. The core logic is only about 400 lines of Python, stripped of all unnecessary ornament. The modules are cleanly decoupled. 
Users can customize functions themselves, or take Kimi apart and reassemble it into their own applications.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Even Kimi Agent was once internally associated with the phrase </span></span><span><span>OK Computer</span></span><span><span>, a Radiohead reference, though that name was later changed because it was too obscure for wider adoption. The people who chose names like that did not seem especially interested in maximizing internet traffic. They obeyed their own musical taste and linguistic standards instead.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Someone joked that if you measured AI companies by the share of employees who play musical instruments, Kimi might rank first.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Taste has become the highest hiring standard, and also the hardest to define.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>It cannot be quantified, but it is everywhere.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>You may never fully understand what each person at Kimi actually does.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The company likes using the word “team” instead of department. At a high level, the main areas are clear enough: algorithms, product and engineering, growth, strategy, operations. 
But once you try to zoom in and map actual departments or fixed responsibilities, things start to blur.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That is because this is an organization with no formal departments, no hierarchy, no titles, no OKRs, and no KPIs. Reporting lines are so simple that they feel almost unreal.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>For Brandon, this made no sense at all.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He had studied at Tsinghua, held management roles at Silicon Valley giants and major Chinese tech firms, and helped build a startup worth around $1 billion. He had spent years in the industry and excelled at technical management. He had led teams of nearly 1,000 people. He hoped to enter AI and apply that experience at scale.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Instead, co-founder Zhang Yutong told him that the company did not work that way. 
The number of people he would likely manage, if he joined, was about two.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Still, something about the future pulled him in, and he wanted one more conversation.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>So in January 2025, during a period of internal doubt and unrest, Brandon met Yang Zhilin, his younger schoolmate from Tsinghua.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At the time, Brandon had no idea that Yang’s name would eventually be mentioned in media stories alongside Elon Musk and Jensen Huang. What he remembers most is the very first sentence Yang said after basic greetings:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“Reinforcement learning is the future.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The rest of the conversation felt almost like Yang thinking out loud. He was so immersed in his own line of thought that Brandon could not understand much of what he was saying, even though it was all in Chinese.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>But one thing was unmistakable: for the first time, Brandon felt the knowledge structure and mental models he had built over the past 20 years starting to collapse. 
Along with them went his ego.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>When I asked why he eventually joined, he replied in a slightly mysterious tone: Yang Zhilin might become a great prophet, because he is both far-sighted and pure.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Later, when the company hesitated because it did not really know how to define his role in such a title-light system, Brandon replied firmly:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“Even if you make me clean toilets, I’ll come. And I’ll clean them better than anyone.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Not every former big-tech manager or expert thrives in this environment.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Phoebe, born after 2000, moved from the growth team into product and engineering. She describes herself jokingly as “a clueless little girl,” but says something important: in this company, deep experience and strong credentials can actually become a burden.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>AI is too new. The field is changing too fast. A highly experienced expert may not learn and adapt as fast as a younger person with fewer assumptions.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She has seen at least three mid-level or senior big-tech hires fail to “land” after joining. 
One eventually chose to leave the industry altogether, saying the people around him were just too young and too smart. After being repeatedly outperformed, he gave up. This, he decided, was no longer his era or his industry.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>After the DeepSeek shock, Phoebe also felt a deep sense of crisis. She decided to abandon ad-buying work and instead try to help the company through product and engineering. She began an intense period of self-study, even streaming herself learning on Bilibili for hundreds of hours.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>What surprised her most was that the company, from the start, gave her the chance to switch roles without much hesitation.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In fact, among the thirty employees I interviewed, more than half had changed responsibilities multiple times. Compared with their previous jobs, perhaps 80% were now doing something completely different.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Kimi likes people with </span></span><span><span>generalization ability</span></span><span><span>.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In AI, generalization means a model can perform well in new scenarios beyond its training data. 
It has not merely memorized answers; it has learned underlying structures.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The company applies this idea to people too.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Mid-level and senior employees from giant firms may have spent too long optimizing for a particular KPI system, a particular reporting language, a particular internal political game. Their “algorithm” becomes overfit to one local optimum. When the environment changes completely, they may fail to adapt.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>If traditional big-tech workers are like specialized models, then the people Moonshot AI wants are more like base models. First they learn basic rules through supervised fine-tuning. Then, through reinforcement learning and repeated self-play across many tasks, they acquire the ability to transfer across domains.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>James, a returnee from Silicon Valley, is 26 and says his dream is “to give money to young people.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>As a devout believer in AI, he sees his own body as little more than a sensor for an agent to collect information. 
When playing </span></span><span><span>League of Legends</span></span><span><span> with friends, he records voice and collects physiological data like heart rate and pulse, then analyzes which teammate’s comments affected his emotional state and game performance.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>His views are so sharp they verge on extreme. He says: if a person starts learning a truly new language after age 14, they will never master it at a native level. AI, he argues, works similarly.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Dan, who joined the company right after graduation, says that for the first time in his life he felt true knowledge anxiety.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At school, he had only ever worked on “toy models,” around 7 billion parameters, which could be trained in a few days on 32 GPUs. Now he was handling enormous Mixture-of-Experts models with tens of billions of parameters and training datasets measured in trillions of tokens. It felt like jumping straight from a small pond into the Pacific Ocean.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>To keep up, he threw himself into near self-abusive study. His schedule collapsed. Beijing daytime became Silicon Valley nighttime, then reversed. He stared at training dashboards for hundreds of hours, like a stock trader watching markets with no room to blink.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The real challenge was not just workload. 
He had to do three jobs at once.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He had to be an algorithm architect, designing the best plan through a maze of model choices. He had to be a systems engineer, debugging distributed computing problems like a mechanic repairing a pipeline stretched across the globe. He had to be a data curator, performing “alchemy” on giant datasets so the model would score well on benchmarks while also feeling natural and soft in actual conversation.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Sometimes that meant emergency surgery mid-training. At one point, key parameters stored in bf16 precision started behaving dangerously. The team made a snap decision to switch to fp32 precision halfway through training, just to stabilize the run. Dan says that if all you can do is write algorithms, or build systems, or clean data, you will never produce a top model. There is no excuse here of “I only handle this part.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The company expects you to integrate algorithm, engineering, and data work across multiple worlds. It is like doing several jobs at once. But that kind of intense cross-training can give you years’ worth of growth in a very short time.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>So anyone trying to join Kimi faces a brutal test.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>There are no OKRs, no KPIs, no office politics, no manipulative managers, not even clock-in attendance. 
But if you are not AI-native, if you cannot generalize, if you cannot continuously reinforce and adapt, then you may struggle to find meaning for your existence here.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Most brands want a story.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>But nearly every Kimi employee gently warned me: don’t write about Pink Floyd, or the piano near the office entrance.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Their view is that people who get it, get it. People who don’t, don’t need to. The names Moonshot and Kimi have nothing directly to do with AI or technology. But if the company talked too much about its connection to rock music or art, it would start to feel self-conscious and pretentious. Better, they seem to think, to be beautiful without trying to explain the beauty.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Win, another post-2000s employee who had escaped from a giant tech company, told me this place is bizarre because people can actually get work done without endless meetings.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At his former employer, daytime was for meetings and nighttime was for work. 
He learned a simple lesson: if your energy goes mainly into coordinating relationships around production, there is very little room left to improve actual productivity.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This is part of what an AI-native organization looks like.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>More than ten employees told me explicitly that they increasingly prefer dealing with AI over dealing with humans. AI feels more reliable and simpler. That tendency also fits the company’s broader introverted character. One person used a gentler word: shy.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In group chats, everyone can be lively and expressive. In person, many are quiet. Kimi does not organize many cultural activities. Aside from the annual meeting, the most recent group event had simply been massages in the office.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Introversion does not mean a lack of communication or energy.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Even though no one was required to talk to me, not a single person said no. In group chats, information flies constantly, along with all kinds of abstract emoji. 
No one’s messages are left hanging in silence.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>And if you need help from someone else to get work done, the process is simple: ask them directly.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>No need to go through a manager. No need for approval. No need for a coordination meeting. No need to break through departmental walls.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Kimi has no departmental walls. In some sense, it does not even have departments.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Yang Zhilin’s status message is just four words:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Still, everyone acknowledges that the company has changed continuously since its founding.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Some changes were proactive, some reactive, and some even seemed like reversals. The company moved from heavy ad spending to model focus, from insisting on closed source to embracing open source, from chatbot products to Kimi Agent, Kimi Code, and Kimi Claw, from consumer to enterprise and back again. 
Not every shift stands up perfectly to scrutiny.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Yet in Ezra’s mind, one thing has remained constant: respect for facts.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>All those changes, she believes, had only one cause and one purpose: to make the company align better with objective reality.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The company tolerates ego, but it does not like hiring people who place themselves above facts.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>From the co-founders down, people are relatively easy to persuade, as long as the facts are clear enough. That willingness, employees say, comes from an intense commitment to truth, reality, and what is real. Truly smart people are not wounded by honest feedback.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Another condition for this level of honesty is that the company has no horse-race system, no zero-sum competition, no major internal conflicts of interest. People willingly share research findings and technical detail without expecting payment or credit. Early on the company had its own community; today it still promotes a community culture. Shared information and shared knowledge speed up everyone’s learning, which in the end benefits everyone.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Win says toxic culture is contagious. 
Good culture is contagious too.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Someone used the word “solidarity” to describe the atmosphere, a word that sounds almost old-fashioned when applied to a startup. But the company operates in a harsh environment. Outside are giant competitors. Inside are the pressures of being squeezed by established tech firms. Compute resources are limited. Those constraints, if anything, seem to increase cohesion.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At the root of it all, people are the only truly important asset in an organization.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Recently, Florence was approached by a competing company offering double her salary. She rejected it immediately. Her reason was simple:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“There’s no ‘officialdom smell’ here.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That phrase is hard to translate directly. 
In Chinese internet slang, it refers to the stale, hierarchical, self-important atmosphere associated with bureaucracy, performative authority, and status games.</span></span></div></div><section class=""><div class="css-175oi2r r-1nxhmzv"><div class="css-175oi2r r-13qz1uu"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/ruima/article/2039245985520681257/media/2039245184211505152" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/HEzasM7aUAAFSFX?format=jpg&amp;name=medium" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-2039245985520681257"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The company's new office.</span></span></div></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At the beginning of this reporting process, I was extremely nervous. I was about to interview some of the smartest AI people in the world. I am a humanities person. I have never worked in tech. 
My knowledge of AI is limited.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>But when I actually started talking with young experts from the algorithm and product-engineering teams, I realized they were the ones who seemed nervous. They were afraid I would feel awkward if I did not understand their terminology.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>So first they would translate English into Chinese, and then translate that Chinese into a second, even simpler Chinese I could understand.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That instinct to protect was moving.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Before I started the interviews, the company gave me only one instruction: protect everyone.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>So I tried to avoid questions that were too sensitive or likely to hurt people.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Even so, Ty, during a phone interview, could not fully hide a small emotional tremor. When he first joined the company and was going through the difficult onboarding process, he struggled badly. 
At one point he felt he could not continue and even thought about resigning.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Then one week, at the company meeting, he watched Annie, a woman who had graduated only two years earlier, finally push a difficult project forward after countless setbacks and internal doubts. Seeing that, he felt he could not give up either. He was older than she was, had more life experience, yet in terms of sheer stamina and willpower, he felt weaker.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“I don’t know how she endured it.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In fact, Ty was not the only one who had thought about leaving.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>For a long time, Annie had been trying to build an overseas business line from zero to one, with no real breakthrough. To make things worse, colleagues from other teams, with good intentions, told her directly to abandon what they saw as a meaningless effort.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She says she cried more at Kimi than at any other company, or over any ex-boyfriend she had ever had.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>It was not as though she lacked alternatives. She already had a better-paying offer elsewhere. But she says she simply could not persuade herself to go work for someone else. 
She wanted one more conversation with Zhang Yutong.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Afterward, she decided to stay.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>She did not tell me what was said in that conversation. She only said: Yutong is the strongest boss I have ever seen, the fastest at iterating, with the highest ceiling. Following her is how I can raise my own ceiling.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Then Annie repeated the same line:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>“I don’t know how she endured it.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Once you gather enough material, you notice certain sentences recurring. And the most repeated phrases often reveal the deepest common qualities of a team.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Bob, who had been pulled back to China by Yang Zhilin and gave up the chance to pursue a PhD in the United States, joined the company on day one. 
If anyone understands the company deeply, he does.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>When I asked him the same question I asked everyone else, what is the team’s most important quality, he thought for about two minutes and answered with one word:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>For a company only three years old, talking about resilience may sound like a luxury. But he means it sincerely. Smart and brave, he says, are sometimes opposites. The smarter you are, the more clearly you see the risks, and the easier it becomes to walk away. Foolish persistence will not succeed either. So only those who see the truth, calculate the odds of failure, and still continue deserve to be called resilient.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Inside the company, there is a story known as </span></span><span><span>“three trips to the cliff of reflection.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In May 2023, Freddie and his colleagues were given a task that seemed impossible: make AI read and understand 128K context in a single pass, meaning hundreds of book pages, at a time when the industry standard was closer to 4K.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He quickly designed a solution called MoBA v0.5, but it required rewriting the underlying training framework while the main model was already halfway through training. The cost was too high, so the idea was shelved. 
That was the first trip to the “cliff of reflection.”</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Half a year later he returned with version 1, now designed to continue training from the existing model. It worked on small models, but when tested on the large one it hit a loss spike and kept failing. The project was forced back to the cliff a second time, for another six months. It even missed the company’s 200,000-character product milestone. But the team was not disbanded. Instead, the company launched what it called a “saturation rescue,” gathering technical experts from everywhere to attack the problem together. They rewrote core logic and finally got version 2 to pass the classic long-context “needle in a haystack” test.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Just when launch seemed close, a third blow arrived. During supervised fine-tuning, the model performed poorly on long-summary tasks because the training signals were too sparse. By then huge resources had already been invested. Still, the engineers went back to the cliff again, searched for a solution, and eventually fixed the issue by changing the attention mechanism in the final layers.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Three retreats. 
Three returns.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At the end of the interview, I asked Freddie the ultimate question: how would you describe this company?</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He answered in two words:</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>He quoted the famous line from John F. Kennedy:</span></span></div></div><blockquote class="longform-blockquote"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard.</span></span></div></blockquote><section class=""><div class="css-175oi2r r-1nxhmzv"><div class="css-175oi2r r-13qz1uu"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/ruima/article/2039245985520681257/media/2039245329724485634" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/HEza0rAaMAI3bAX?format=jpg&amp;name=large" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-2039245985520681257"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div 
class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>All the company meeting rooms are named after musical acts.</span></span></div></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In the end, I did not disturb or attempt to probe the co-founders themselves.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Externally, they remain almost invisible. They dislike interviews and have no interest in personal fame. Internally, though, they are everywhere.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In an extremely flat organization, you need superbrains at the center. Otherwise vitality turns into chaos. Because there is little middle management, each co-founder interfaces directly with around 40 to 50 employees and stays close to both the technical and business front lines. That is how the company keeps decision-making and execution aligned.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>All five co-founders came from Tsinghua University. But biological limits still exist. Human attention spans are finite. Management range is finite. Once the company reached a RMB 120 billion valuation and grew past 300 people, even these superbrains began to strain under the load.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>And it is not just the founders.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This is an infinite game driven by self-motivation. 
If every member is effectively carrying RMB 400 million of valuation, then each person is expected to create an extraordinary amount of value.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The revolutionary variable is the toolset.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Kimi does not actually run on extreme working hours. Employees are allowed to wake naturally. They are not required to stay in the office until dawn every night. Leo from the product team says he commands “an army” now, meaning AI agents.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Leo wakes up at 10 a.m. and walks into the office. His task is to analyze user feedback from five global markets over the past 24 hours and decide this week’s product priorities. In the past, that would have taken three people two days.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Now he launches three agents.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>A strategy agent scans 3,000 feedback items and filters for high-priority requests related to long-context interruption. A translation agent interprets Japanese dialect and Korean honorifics in real time and marks true emotional intensity. A competitor agent monitors updates from Cursor and ChatGPT and produces a technical comparison.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Leo does only three things himself. He rejects one sarcastic comment that the system had misread as sincere. 
He flags a screenshot containing an unreleased UI. He confirms the top three needs recommended by the agents.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>By 11:30 a.m., the product requirements document is already finished. Meanwhile, a coding agent has generated about 70% of the base implementation, leaving only the more creative design work for afternoon discussion with human engineers.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Humans set the rules. Silicon-based systems execute them. The organization becomes a container for algorithms.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In an AI-native company, using agents skillfully and embedding them deeply into workflows is not optional. It is part of the job.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The model is not only the goal. It is also the tool.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Whether by directly improving productivity or by fundamentally changing management structure, AI’s logic has already entered the bones of this company. 
Just as the company builds an </span></span><span><span>Agent Swarm</span></span><span><span>, the team itself begins to resemble a </span></span><span><span>Genius Swarm</span></span><span><span>: many independent geniuses working in parallel, coordinating seamlessly.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Still, such a flat structure has built-in fragility.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>When I asked whether this model would remain sustainable if the company grew from 300 people to 3,000, most people answered cautiously. History is not encouraging. Similar experiments in extreme flatness, like holacracy or Haier’s internal contract-cell structures, often hit decision bottlenecks once they pass around 500 people. When there are too many information nodes, “direct communication” starts turning into information overload.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>A more immediate pain point is the personal experience of weightlessness.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Without hierarchy to buffer uncertainty, confusion about direction is felt directly by each individual. One former employee who eventually returned to big tech put it bluntly: without top-down OKRs and KPIs, some mornings you walk into the office not knowing what you should do. No one necessarily tells you whether you are doing well. That lack of feedback creates insecurity. 
It can make people nostalgic for the clear reporting lines, review points, and measurable outputs of giant tech companies.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Those cumbersome structures, after all, do provide one essential thing: a baseline of certainty.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Where is the goal? What counts as completion? How will performance be judged? In a large firm, all that is visible.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That is not Stockholm syndrome, the person said. It is basic organizational physics.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>If Alibaba is like a finely calibrated promotion conveyor belt, ByteDance like a ruthless battle corps with strong objectives, and Tencent like a more forgiving professional academy, then Moonshot AI is like a primeval forest.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Geniuses may find a hunting path. Ordinary people may just wander in the fog.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>No departments. No titles. No evaluations.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The AI-native organizational model is anti-bureaucratic and intentionally unstructured. Large companies can no longer pivot toward it easily. Small companies often miss the window because they expand into traditional structures too quickly. 
This is an asymmetric war.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Here the author turns to a famous science-fiction reference from </span></span><span><span>The Three-Body Problem</span></span><span><span>. In that story, an advanced civilization casually uses a weapon called a </span></span><span><span>two-dimensional foil</span></span><span><span>, which collapses the solar system from three dimensions into two. Planets, stars, and humans all become a flat image without thickness.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Moonshot AI, the author argues, is deliberately throwing such a “two-dimensional foil” at itself.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Not to destroy an opponent, but to flatten the organization in pursuit of maximum efficiency.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>No vertical depth of hierarchy. No horizontal walls of departments. No three-dimensional tangles of office politics. Only “model” and “intelligence” facing each other directly in the simplest possible form.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In the age of AI, every startup, the author argues, is being forced to throw such a foil at itself. The rise of one-person companies reflects the same generational explosion of AI-native talent. If technology can compress organizational capability into the individual, then many of the middle layers of management simply evaporate. The organization gets flattened. There is no depth left for detours. 
Everyone is forced to face the problem itself.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That may be the hard rule governing the evolution of organizations in the business world.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Everyone, eventually, will be folded.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Once people are exposed on the same plane, one person radiating influence over fifty others no longer looks like a managerial miracle. It becomes normal. The distance from center to edge is redefined. People who depend on titles and OKRs as coordinates may suffocate instantly. But geniuses, on this exposed flat surface, can violently dismantle intelligence itself, while the “guardians” clear away noise and entropy, seeing themselves, not without humility, as pioneers widening the boundary of human civilization.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>And yet the transition from three dimensions to two cannot be reversed.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>That means Kimi cannot go backward.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Every strategic adjustment becomes a chaotic iteration with high stakes. Competitors can still turn slowly inside a maze. But if Moonshot AI tries to expand recklessly in size, it may tear itself apart structurally. 
This act of self-flattening is only acceptable because it is in service of something more radical.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>The endpoint of lowering the organization’s dimension is raising the dimension of intelligence.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Only if model intelligence crosses the critical threshold, rising high enough to escape the gravity well of all carbon-based organizations, can Moonshot AI truly crush the organizational advantages of its competitors and justify this irreversible gamble.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>At that point, debates over management span or org charts no longer matter. It would be like asking what dimension the </span></span><span><span>Three-Body Problem</span></span><span><span> civilization inhabits, when the real point is that its dimensional weapon has already rewritten the rules of war.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Then “Moonshot AI” would stop being a metaphor.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>It would become a higher-dimensional light source, illuminating the dark side of the intelligence universe. 
All the organizational pain that came before would be no more than the heat shield burning off as the lunar module passed through the atmosphere.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Either they become godlike through ascent.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Or they are sealed away in collapse.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>All of the English names used are pseudonyms.</span></span></div></div></div></div></div></div></div></div></div></div></div></div></div></article><p></p></div>]]></description>
      <pubDate>Wed, 01 Apr 2026 22:03:00 +0000</pubDate>
      <link>https://x.com/ruima/status/2039245985520681257?s=12</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5177855032</guid>
    </item>
    <item>
      <title><![CDATA[Vulnerability Research Is Cooked]]></title>
      <description><![CDATA[
    
<p><strong><a href="https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/">Vulnerability Research Is Cooked</a></strong></p>
Thomas Ptacek's take on the sudden and enormous impact the latest frontier models are having on the field of vulnerability research.<p></p>
<blockquote>
<p>Within the next few months, coding agents will drastically alter both the practice and the economics of exploit development. Frontier model improvement won’t be a slow burn, but rather a step function. Substantial amounts of high-impact vulnerability research (maybe even most of it) will happen simply by pointing an agent at a source tree and typing “find me zero days”.</p>
</blockquote>
<p>Why are agents so good at this? A combination of baked-in knowledge, pattern matching ability and brute force:</p>
<blockquote>
<p>You can't design a better problem for an LLM agent than exploitation research.</p>
<p>Before you feed it a single token of context, a frontier LLM already encodes supernatural amounts of correlation across vast bodies of source code. Is the Linux KVM hypervisor connected to the&nbsp;<code>hrtimer</code>&nbsp;subsystem,&nbsp;<code>workqueue</code>, or&nbsp;<code>perf_event</code>? The model knows.</p>
<p>Also baked into those model weights: the complete library of documented "bug classes" on which all exploit development builds: stale pointers, integer mishandling, type confusion, allocator grooming, and all the known ways of promoting a wild write to a controlled 64-bit read/write in Firefox.</p>
<p>Vulnerabilities are found by pattern-matching bug classes and constraint-solving for reachability and exploitability. Precisely the implicit search problems that LLMs are most gifted at solving. Exploit outcomes are straightforwardly testable success/failure trials. An agent never gets bored and will search forever if you tell it to.</p>
</blockquote>
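<p>As a toy illustration of the "pattern-matching bug classes" framing (this is nothing like how a frontier agent actually works, just the shape of the problem), a crude use-after-free scan over C source can be written in a few lines:</p>

```python
import re

def find_use_after_free(src: str):
    """Toy single-pass scan: flag identifiers referenced after free(x)
    with no intervening reassignment. Illustrative only; real analyzers
    (and LLM agents) reason about control flow, aliasing, and reachability."""
    freed = {}       # var name -> line where it was freed
    findings = []
    for lineno, line in enumerate(src.splitlines(), 1):
        m = re.search(r"\bfree\s*\(\s*(\w+)\s*\)", line)
        if m:
            freed[m.group(1)] = lineno
            continue
        for var, freed_at in list(freed.items()):
            if re.search(rf"\b{var}\s*=", line):    # reassigned: no longer stale
                del freed[var]
            elif re.search(rf"\b{var}\b", line):    # used while stale
                findings.append((var, freed_at, lineno))
    return findings

C_SNIPPET = """
char *buf = malloc(64);
free(buf);
printf("%s", buf);   /* use after free */
"""
print(find_use_after_free(C_SNIPPET))   # [('buf', 3, 4)]
```

<p>The point of the quote is that the model weights already encode the full catalog of such patterns, plus the cross-codebase correlations a regex never could.</p>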
<p>The article was partly inspired by <a href="https://securitycryptographywhatever.com/2026/03/25/ai-bug-finding/">this episode of the Security Cryptography Whatever podcast</a>, where David Adrian, Deirdre Connolly, and Thomas interviewed Anthropic's Nicholas Carlini for 1 hour 16 minutes.</p>
<p>I just started a new tag here for <a href="https://simonwillison.net/tags/ai-security-research/">ai-security-research</a> - it's up to 11 posts already.


    </p><p>Tags: <a href="https://simonwillison.net/tags/security">security</a>, <a href="https://simonwillison.net/tags/thomas-ptacek">thomas-ptacek</a>, <a href="https://simonwillison.net/tags/careers">careers</a>, <a href="https://simonwillison.net/tags/ai">ai</a>, <a href="https://simonwillison.net/tags/generative-ai">generative-ai</a>, <a href="https://simonwillison.net/tags/llms">llms</a>, <a href="https://simonwillison.net/tags/nicholas-carlini">nicholas-carlini</a>, <a href="https://simonwillison.net/tags/ai-ethics">ai-ethics</a>, <a href="https://simonwillison.net/tags/ai-security-research">ai-security-research</a></p>



]]></description>
      <pubDate>Fri, 03 Apr 2026 23:59:08 +0000</pubDate>
      <link>https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/#atom-everything</link>
      <dc:creator>Simon Willison&#39;s Weblog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5180485908</guid>
    </item>
    <item>
      <title><![CDATA[There was a nice time where researchers talked about various ideas quite openly on twitter. (before they disappeared into the gold mines :)).

My guess is that you can get quite far even in the current paradigm by introducing a number of memory ops as "tools" and throwing them into the mix in RL. E.g. current compaction and memory implementations are crappy, first, early examples that were somewhat bolted on, but both can be fairly easily generalized and made part of the optimization as just another tool during RL.

That said neither of these is fully satisfying because clearly people are capable of some weight-based updates (my personal suspicion - mostly during sleep). So there should be even more room for more exotic approaches for long-term memory that do change the weights, but exactly - the details are not obvious. This is a lot more exciting, but also more into the realm of research outside of the established prod stack.]]></title>
      <description><![CDATA[<p>There was a nice time where researchers talked about various ideas quite openly on twitter. (before they disappeared into the gold mines :)).<br>
<br>
My guess is that you can get quite far even in the current paradigm by introducing a number of memory ops as "tools" and throwing them into the mix in RL. E.g. current compaction and memory implementations are crappy, first, early examples that were somewhat bolted on, but both can be fairly easily generalized and made part of the optimization as just another tool during RL.<br>
<br>
That said neither of these is fully satisfying because clearly people are capable of some weight-based updates (my personal suspicion - mostly during sleep). So there should be even more room for more exotic approaches for long-term memory that do change the weights, but exactly - the details are not obvious. This is a lot more exciting, but also more into the realm of research outside of the established prod stack.</p>
<hr>
<blockquote>
<b>Awni Hannun (@awnihannun)</b>
<p>
</p><p>I've been thinking a bit about continual learning recently, especially as it relates to long-running agents (and running a few toy experiments with MLX).<br>
<br>
The status quo of prompt compaction coupled with recursive sub-agents is actually remarkably effective. Seems like we can go pretty far with this. (Prompt compaction = when the context window gets close to full, the model generates a shorter summary, then starts from scratch using the summary. Recursive sub-agents = decompose tasks into smaller tasks to deal with finite context windows)<br>
<br>
Recursive sub-agents will probably always be useful. But prompt compaction seems like a bit of an inefficient (though highly effective) hack.<br>
<br>
There are two other alternatives I know of: 1. online fine-tuning and 2. memory-based techniques.<br>
<br>
Online fine-tuning: train some LoRA adapters on data the model encounters during deployment. I'm less bullish on this in general. Aside from the engineering challenges of deploying custom models / adapters for each use case / user, there are some fundamental issues:<br>
- Online fine-tuning is inherently unstable. If you train on data in the target domain you can catastrophically destroy capabilities that you don't target. One way around this is to keep a mixed dataset with the new and the old. But this gets pretty complicated pretty quickly.<br>
- What does the data even look like for online fine-tuning? Do you generate Q/A pairs based on the target domain to train the model? You also have the problem of prioritizing information in the data mixture given finite capacity.<br>
<br>
Memory based techniques: basically a policy for keeping useful memory around and discarding what is not needed. This feels much more like how humans retain information: "use it or lose it". You only need a few things for this to work:<br>
- An eviction/retention policy. Something like "keep a memory if it has been accessed at least once in the last 10k tokens".<br>
- The policy needs to be efficiently computable<br>
- A place for the model to store and access long-term memory. Maybe a sparsely accessed KV cache would be sufficient. But for efficient access to a large memory, a hierarchical data structure might be better.</p>
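<p>The "keep a memory if it has been accessed at least once in the last 10k tokens" policy above can be sketched concretely. This is a toy, token-clocked store; every name in it is invented for illustration:</p>

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Token-windowed retention: a memory survives only if it was read or
    written within the last `window` tokens of generation ("use it or lose it")."""
    window: int = 10_000
    clock: int = 0                               # tokens generated so far
    _store: dict = field(default_factory=dict)   # key -> (value, last_access)

    def tick(self, n_tokens: int) -> None:
        """Advance the token clock and evict anything outside the window."""
        self.clock += n_tokens
        cutoff = self.clock - self.window
        self._store = {k: v for k, v in self._store.items() if v[1] >= cutoff}

    def write(self, key: str, value: str) -> None:
        self._store[key] = (value, self.clock)

    def read(self, key: str):
        if key not in self._store:
            return None
        value, _ = self._store[key]
        self._store[key] = (value, self.clock)   # access refreshes retention
        return value

mem = MemoryStore(window=10_000)
mem.write("user_prefs", "prefers concise answers")
mem.tick(9_000)
assert mem.read("user_prefs") is not None   # accessed -> refreshed
mem.tick(9_999)
assert mem.read("user_prefs") is not None   # still inside the window
mem.tick(10_001)
assert mem.read("user_prefs") is None       # unused for >10k tokens: evicted
```

<p>The policy is O(entries) per tick and trivially computable, which is the second requirement in the list; swapping the dict for a hierarchical index would address the large-memory access point.</p>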

<p></p>
<footer>
— <cite><a href="https://rss.xcancel.com/awnihannun/status/2029672507448643706#m">https://rss.xcancel.com/awnihannun/status/2029672507448643706#m</a>
</cite></footer>
</blockquote>]]></description>
      <pubDate>Thu, 05 Mar 2026 23:13:33 +0000</pubDate>
      <link>https://rss.xcancel.com/karpathy/status/2029696850366971921#m</link>
      <dc:creator>Andrej Karpathy / @karpathy</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5143326975</guid>
    </item>
    <item>
      <title><![CDATA[Tejas Kulkarni on X: "Creating a personalized vaccine for my dad's high-grade brain cancer." / X]]></title>
      <description><![CDATA[<div class="css-175oi2r r-1igl3o0 r-qklmqi r-1adg3ll r-1ny4l3l"><p class="css-175oi2r"></p><article class="css-175oi2r r-18u37iz r-1udh08x r-1c4vpko r-1c7gwzm r-1ny4l3l"><div class="css-175oi2r r-eqz5dr r-16y2uox r-1wbh5a2"><div class="css-175oi2r r-16y2uox r-1wbh5a2 r-1ny4l3l"><div class="css-175oi2r r-18u37iz r-136ojw6"><div class="css-175oi2r r-18kxxzh r-1wron08 r-onrtq4 r-1awozwy"><div class="css-175oi2r"><div class="css-175oi2r r-18kxxzh r-1wbh5a2 r-13qz1uu"><div class="css-175oi2r r-1wbh5a2 r-dnmrzs"><div class="css-175oi2r r-bztko3 r-1adg3ll"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1adg3ll r-1pi2tsx r-13qz1uu r-45ll9u r-u8s1d r-1v2oles r-176fswd r-bztko3"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-sdzlij r-1udh08x r-5f1w11 r-u8s1d r-8jfcpp"><a href="https://x.com/tejasdkulkarni" class="css-175oi2r r-1pi2tsx r-13qz1uu r-o7ynqc r-6416eg r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-sdzlij r-1udh08x r-633pao r-45ll9u r-u8s1d r-1v2oles r-176fswd"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="" src="https://pbs.twimg.com/profile_images/2022494213112500228/r56GRKNP_normal.jpg" class="css-9pa8cd"></div></div></div></div></a></div></div></div></div></div></div></div></div></div><div class="css-175oi2r r-1iusvr4 r-16y2uox r-1777fci"><div class="css-175oi2r r-zl2h9q"><div class="css-175oi2r r-k4xj1c r-18u37iz r-1wtj0ep"><div class="css-175oi2r r-1d09ksm r-18u37iz r-1wbh5a2"><div class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l"><div class="css-175oi2r r-1wbh5a2 r-dnmrzs r-1ny4l3l" id="id__vg28v8yzfc"></div></div></div></div></div></div></div><div class="css-175oi2r"><div class="css-175oi2r r-1ny4l3l"><div class="css-175oi2r 
r-1peese0"><div class="css-175oi2r"><div class="css-175oi2r r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1801914515740901376" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQGvtIJbQAAXICA?format=jpg&amp;name=small" class="css-9pa8cd"></div></div></div></a></div></div></div></div></div><div class="css-175oi2r r-vmopo1"><div class="css-175oi2r r-13qz1uu"><div class="css-146c3p1 r-bcqeeo r-1ttztb7 r-qvutc0 r-37j5jr r-uho16t r-1vr29t4 r-1o37s4c r-1s2bzr4"><span class="css-1jxf684 r-bcqeeo r-1ttztb7 r-qvutc0 r-poiln3">Creating a personalized vaccine for my dad's high-grade brain cancer.</span></div></div><div class="css-175oi2r r-1knelpx"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>----------------------------------------------------------------------------</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>It has been heartwarming to hear from so many friends, family members, and strangers over the last few days. Many of you have been asking me for the latest updates, so I wanted to share an update. Without all the guidance I received from you all, none of this experimental treatment would have been possible. 
I will be forever grateful to everyone who reached out and went out of their way.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>After hearing from so many doctors, researchers, and patients, we decided to pursue multiple paths after his biopsy. His recovery since the first hematoma and then the surgery has been a roller coaster, and he is frustrated due to his temporary loss of independence. However, we are working through it, and everyone close to him is doing the best they can to lift him up. So we are currently waiting to proceed with his standard oncology treatment.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In the meantime, we have figured out robust pathways to get him access to a dendritic cell-based vaccine. After hearing from many patients, it either works really well or it doesn't, so it was always worth the shot. I got connected to Prof. Ashkan in the UK, and we are figuring out if it's feasible to do the treatment. A big complication is that he will have to go to the UK multiple times, which is extremely challenging given his current state. I am figuring out the practicality of this path, and there seems to be a way to ship the cryo sample from India to the UK safely and reliably. 
In parallel, I have been exploring alternatives in India.</span></span></div></div><section class=""><div class="css-175oi2r r-13qz1uu r-1nxhmzv"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1805332450110832640" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQ3UTF5XgAAgi_N?format=jpg&amp;name=900x900" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-1801928891252236374"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>20 hour train ride from Pune to Delhi, with the cryo sample.</span></span></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I am sitting in Delhi after a 20-hour train ride from Pune over the weekend. I am glad I went personally. The first clinic I visited was literally fake. I somehow got access to their lab early in the morning, as they thought I was working there because I had the freezer. The lab had no equipment; it was extremely dirty, and the "technicians" were sleeping in another room and had no idea about anything. Later on, when I confronted the doctor, he said that was not the lab and the real lab is at his home. 
I did not have high hopes for this option, but I wanted to check all supposed options —this one was unbelievable. I wonder how many cancer patients have suffered from this—it was deeply disturbing. This is perhaps the most unjust thing I have ever experienced, but I am letting it go for now.</span></span></div></div><section class=""><div class="css-175oi2r r-13qz1uu r-1nxhmzv"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1805332867209183232" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQ3UrXtXQAAwrrU?format=jpg&amp;name=900x900" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-1801928891252236374"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Refilling liquid nitrogen from random dudes in Delhi.</span></span></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I had heard about another clinic which has CDSCO certification and is connected to doctors I know from the US and Pune. We now have a clear plan to synthesize a DC-based vaccine. 
We need to wait for my dad to recover properly, do a leukapheresis in Pune, and then get it shipped to their lab. It is incredible to know that there are kind doctors and clinicians who have started such labs to help people in need. We need more of this in India.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Now all the treatments we can give him are either lined up or can be lined up. The challenging part now is that he is not listening very much to any of us, but this is a temporary glitch as he has had a lot of stomach issues in the last couple of days. Dealing with such an aggressive condition is extremely challenging, and every day is new.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>There are several people that I want to deeply thank: Dr. Bandana at APAC biotech, Soham from PopVax, Dr Achawal for connecting me to Prof Ashkan, Rohan Khaunte, Dr. Kesari, Nisha, Tara, Guruji, cryo mafia in Pune/Delhi, </span></span><span><span>, colleagues from CSM/DeepMind/MIT, and many others.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>----------------------------------------------------------------------------</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>This is a time-sensitive post to help answer questions about my dad's sudden brain tumor diagnosis and to quickly plan the next steps for his treatment. 
I am providing the full context for those who can help, but you can skip to the last bit for concrete questions.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>On June 6th, I received some devastating MRI images showing a clear, large lesion and hematoma in the left temporal lobe. Unfortunately, it is deep within the brain, touching the hippocampus and language areas. It is suspected to be a high-grade glioma or perhaps glioblastoma (GBM), which is the deadliest form of brain cancer with a median survival time of only a few months. This was surprising to me, as he did not show any clear symptoms before, and there isn't any documented family history.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>I received an outpouring of support and help from my family, friends, colleagues, and strangers. I quickly learned from Josh Tenenbaum (my PhD advisor at MIT) and Marty Tenenbaum (Josh's dad and the founder of Cancer Commons) that there may not be a clear standardized path forward and that I should get connected to Al Musella (Musella Foundation For Brain Tumor Research &amp; Information, Inc). They all were incredibly generous and went out of their way to help me identify the only promising clinical trial (DCVax-L) for GBM. At this stage, we did not know if the tumor was malignant or benign, and biopsy/surgery was the next step. But I started preparing regardless.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>There have been a plethora of complications, regulatory barriers, and geographical barriers. First of all, my dad is in Pune, India, and does not currently have a foreign visa. 
DCVax is currently only ongoing in the UK, and I have yet to get in touch with the right folks about it due to scheduling problems on their end. Through a series of networks, Dr. Achawal in the UK introduced me to Dr. Ashkan at King's College, who has been involved in DCVax. Dr. Achawal also recommended Dr. Patkar, a neurosurgeon in Pune, in case we had to opt for a local surgery. Due to the time sensitivity, we had to proceed with biopsy/surgery yesterday, and it went well (dad is now beginning to talk, seems interactive and is out of ICU already) - thanks to the brilliant Dr. Patkar. Unfortunately, the tumor also seems to be highly malignant.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>A day before the surgery, Tanvi (my sister) and I were preparing to store his brain tissue samples, but none of the hospitals had the infrastructure or support for cryopreserving the brain tissue, let alone making the vaccine. Dr. Patel from the Poona Hospital pathology department called numerous IVF and stem cell clinics to see if they could preserve it, but they all declined due to regulatory and legal reasons.</span></span></div></div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>So we called a contact who agreed to sell us a cryo-freezer and liquid nitrogen. Now there was a path forward. The neurosurgeon, pathologist, and histopathologists at Poona Hospital were incredibly supportive and gave me the space to collect the sample, but there were several complications as no one knew the protocol. 
Fortunately, my dad's sister is a microbiologist, and Sri Krishna from the histopathology department figured out a safe way to cryopreserve the tumor tissue.</span></span></div></div><section class=""><div class="css-175oi2r r-13qz1uu r-1nxhmzv"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1801921709639131136" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQG2P3ga4AA2MpL?format=jpg&amp;name=small" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-1801928891252236374"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Sri Krishna preparing a last-minute concoction consisting of 10% glycerol and saline as the base material to immerse the tissue.</span></span></div></div></div></div></div></div></div></div></div></div></section><section class=""><div class="css-175oi2r r-13qz1uu r-1nxhmzv"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1801922119951081472" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af 
r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQG2nwCaYAAJCgo?format=jpg&amp;name=900x900" class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-1801928891252236374"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Brain tumor tissue in the sterilized solution inside a cryo-vial</span></span></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>In order to safely preserve the tissue, we performed a series of progressive cooling steps (thank you, ChatGPT, for the detailed instructions): 1) 4°C for 1 hour, 2) -80°C inside a Styrofoam container for controlled freezing for a day, and 3) placing it inside the liquid nitrogen freezer at -196°C.</span></span></div></div><section class=""><div class="css-175oi2r r-13qz1uu r-1nxhmzv"><div class="css-175oi2r r-1867qdf r-1udh08x r-o7ynqc r-6416eg r-1ny4l3l"><div class="css-175oi2r"><div class="css-175oi2r r-16y2uox r-1pi2tsx r-13qz1uu"><a href="https://x.com/tejasdkulkarni/article/1801928891252236374/media/1801924035573596160" class="css-175oi2r r-1pi2tsx r-1ny4l3l r-1loqt21"><div class="css-175oi2r r-1adg3ll r-1udh08x"><div class="r-1p0dtai r-1pi2tsx r-1d2f490 r-u8s1d r-ipm5af r-13qz1uu"><div class="css-175oi2r r-1mlwlqe r-1udh08x r-417010 r-aqfbo4 r-n1ft60 r-gf0ln r-agouwx r-1p0dtai r-1d2f490 r-u8s1d r-zchlnj r-ipm5af"><img alt="Image" src="https://pbs.twimg.com/media/GQG4XQSaQAAxI1l?format=jpg&amp;name=medium" 
class="css-9pa8cd"></div></div></div></a></div></div></div><div class="css-175oi2r r-knv0ih" id="caption-1801928891252236374"><div class="twitter-article-media-caption-id"><div class="css-175oi2r r-37j5jr"><div class="DraftEditor-root"><div class="DraftEditor-editorContainer"><div class="public-DraftEditor-content"><div><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Cryo-vial with the tumor tissue placed inside liquid nitrogen after a series of cooling steps. </span></span></div></div></div></div></div></div></div></div></div></div></section><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Now the cryopreserved sample is with me. The process for making an immunotherapy-based vaccine involves collecting dendritic cells from the blood, extracting tumor antigens from the sample, training the dendritic cells to recognize the antigens, and then injecting this into my dad. I am talking to several people locally since DCVax-L is not available here (although I will be talking to Dr. Ashkan on the 20th). A few questions:</span></span></div></div><ol class="public-DraftStyleDefault-ol"><li class="longform-ordered-list-item public-DraftStyleDefault-orderedListItem public-DraftStyleDefault-reset public-DraftStyleDefault-depth0 public-DraftStyleDefault-listLTR"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Does anyone have any personal experiences with non-standard immunotherapies? A lab in India is promising they can synthesize this if we can ship the tissue, but I don't have enough experience to know. 
I am talking to them in detail now and they seem very promising.</span></span></div></li><li class="longform-ordered-list-item public-DraftStyleDefault-orderedListItem public-DraftStyleDefault-depth0 public-DraftStyleDefault-listLTR"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Does anyone have experience shipping biological materials internationally? What are the complications?</span></span></div></li><li class="longform-ordered-list-item public-DraftStyleDefault-orderedListItem public-DraftStyleDefault-depth0 public-DraftStyleDefault-listLTR"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>If you are in India, do you have experience with any of this? Unfortunately, all of this is very esoteric and new, especially in India.</span></span></div></li></ol><div class="longform-unstyled"><div class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr"><span><span>Thank you to everyone who has been helping me over the last couple of weeks. 
I will be forever grateful for it.</span></span></div></div></div></div></div></div></div></div></div></div></div></div></div></article><p></p></div>]]></description>
      <pubDate>Mon, 16 Mar 2026 03:15:06 +0000</pubDate>
      <link>https://x.com/tejasdkulkarni/status/1801928891252236374?s=46</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5155781102</guid>
    </item>
    <item>
      <title><![CDATA[How to win a best paper award]]></title>
      <description><![CDATA[An opinionated perspective on how to do important research that makes a difference (and sometimes win awards).]]></description>
      <pubDate>Mon, 09 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://nicholas.carlini.com/writing/2026/how-to-win-a-best-paper-award.html</link>
      <dc:creator>Nicholas Carlini</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5147741488</guid>
    </item>
    <item>
      <title><![CDATA[RT by @ylecun: Latent world models learn differentiable dynamics in a learned representation space, which should make planning as simple as gradient descent.

But it almost never works.

What I mean is, at test time, you can treat the action sequence as learnable parameters, roll out the frozen world model, measure how far the predicted final state is from the goal, and backprop through the entire unrolled chain to optimize actions directly. Yet many of the systems that work (Dreamer, TD-MPC2, DINO-WM) abandon this and fall back to sampling-based search instead.

That's why I really like this new paper by @yingwww_, @ylecun, and @mengyer, which gives a clean diagnosis of why, and a principled fix.

The reason everyone abandons gradient descent on actions is that the planning objective is highly non-convex in the learned latent space. So instead most systems use CEM (cross-entropy method) or MPPI (model predictive path integral), both derivative-free.

CEM samples batches of action sequences, evaluates them by rolling out the world model, keeps the top-k, and refits the sampling distribution.

MPPI does something similar but weights trajectories by exponentiated negative cost instead of hard elite selection.

These work when gradients are unreliable but the compute cost is substantial — hundreds of candidate rollouts per planning step vs a single forward-backward pass.

This paper asks what exactly makes the latent planning landscape so hostile to gradients and what you can do about it.

The diagnosis. Their baseline is DINO-WM, a JEPA-style world model with a ViT predictor planning in frozen DINOv2 feature space, minimizing terminal MSE between predicted and goal embeddings. The problem is that DINOv2 latent trajectories are highly curved (when you use MSE as the planning cost, you're implicitly assuming that Euclidean distance approximates geodesic distance along feasible transitions).

For curved trajectories this breaks badly: gradient-based planners get trapped, and straight-line distances in embedding space misrepresent actual reachability.

The fix draws from the perceptual straightening hypothesis in neuroscience — the idea that biological visual systems transform complex video into internally straighter representations. So they add a curvature regularizer during world model training.

Given consecutive encoded states

z_t, z_{t+1}, z_{t+2},

define velocity vectors as

v_t = z_{t+1} - z_t

measure curvature as the cosine similarity between consecutive velocities, and minimize

L_curv = 1 - cos(v_t, v_{t+1}).

Total loss is then

L_pred + λ * L_curv

with stop-gradient on the target branch to prevent collapse.

The theory backs this up cleanly — they prove that reducing curvature directly bounds how well-conditioned the planning optimization is — straighter latent trajectories guarantee faster convergence of gradient descent over longer horizons.

Worth noting that even without the curvature loss, training the encoder with a prediction objective alone produces some "implicit straightening" — the JEPA loss naturally favors representations whose temporal evolution is predictable. Explicit regularization simply pushes this much further.

Empirical results across four 2D goal-reaching environments are consistently strong. Open-loop success improves by 20-50%, and gradient descent with straightening matches or beats CEM at a fraction of the compute.

The most convincing evidence is the distance heatmaps: after straightening, latent Euclidean distance closely matches the shortest distance between states, even though the model was trained only on suboptimal random trajectories.

What I find interesting beyond the specific method is that the planning algorithm didn't change. The dynamics model didn't change. A single regularization term on the embedding geometry turned gradient descent from unreliable to competitive with sampling methods. 

The field has largely treated representation learning and planning as separate concerns — learn good features, then figure out how to plan in them. 

This paper makes a concrete case that the representation geometry is itself the bottleneck.

This connects to a broader pattern in ML. When optimization fails, the instinct is to fix the optimizer (better search, more samples, adaptive schedules). But often the real lever is the shape of the space you're optimizing in.

Same principle shows up in RL post-training where reward landscape shaping matters as much as the algorithm itself. 

Shape the space so simple optimization works, rather than building complex optimization to handle a bad space.

Their paper:
https://arxiv.org/abs/2603.12231]]></title>
      <description><![CDATA[<p>Latent world models learn differentiable dynamics in a learned representation space, which should make planning as simple as gradient descent.<br>
<br>
But it almost never works.<br>
<br>
What I mean is, at test time, you can treat the action sequence as learnable parameters, roll out the frozen world model, measure how far the predicted final state is from the goal, and backprop through the entire unrolled chain to optimize actions directly. Yet many of the systems that work (Dreamer, TD-MPC2, DINO-WM) abandon this and fall back to sampling-based search instead.<br>
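To make that setup concrete, here is a toy sketch (my own illustration, not code from Dreamer, TD-MPC2, or DINO-WM): with linear point-mass "dynamics" standing in for a learned world model, the gradient of the terminal cost with respect to each action has a closed form, so planning really is just gradient descent on the action sequence.

```python
# Illustrative stand-in for "backprop through the unrolled world model".
# Toy dynamics: z_{t+1} = z_t + a_t, so the gradient of the terminal
# cost ||z_T - g||^2 w.r.t. every action a_t is 2 * (z_T - g), and no
# autograd library is needed for this sketch.

def rollout(z0, actions):
    """Unroll the (frozen) dynamics and return the final latent state."""
    z = list(z0)
    for a in actions:
        z = [zi + ai for zi, ai in zip(z, a)]
    return z

def plan_by_gradient_descent(z0, goal, horizon=5, steps=200, lr=0.05):
    """Treat the action sequence as learnable parameters and descend."""
    actions = [[0.0 for _ in goal] for _ in range(horizon)]
    for _ in range(steps):
        z_final = rollout(z0, actions)
        grad = [2.0 * (zf - g) for zf, g in zip(z_final, goal)]
        for a in actions:  # under these dynamics the gradient is the same at every t
            for i in range(len(a)):
                a[i] -= lr * grad[i]
    return actions

actions = plan_by_gradient_descent([0.0, 0.0], [1.0, 2.0])
z_final = rollout([0.0, 0.0], actions)  # converges to the goal [1.0, 2.0]
```

With a real learned model the unrolled chain is nonlinear in the actions, which is where the non-convexity trouble starts.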
<br>
That's why I really like this new paper by <a href="https://rss.xcancel.com/yingwww_" title="Ying Wang">@yingwww_</a>, <a href="https://rss.xcancel.com/ylecun" title="Yann LeCun">@ylecun</a>, and <a href="https://rss.xcancel.com/mengyer" title="Mengye Ren">@mengyer</a>, which gives a clean diagnosis of why, and a principled fix.<br>
<br>
The reason everyone abandons gradient descent on actions is that the planning objective is highly non-convex in the learned latent space. So instead most systems use CEM (cross-entropy method) or MPPI (model predictive path integral), both derivative-free.<br>
<br>
CEM samples batches of action sequences, evaluates them by rolling out the world model, keeps the top-k, and refits the sampling distribution.<br>
<br>
MPPI does something similar but weights trajectories by exponentiated negative cost instead of hard elite selection.<br>
<br>
These work when gradients are unreliable but the compute cost is substantial — hundreds of candidate rollouts per planning step vs a single forward-backward pass.<br>
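For comparison, here is a minimal CEM loop (a sketch on made-up toy dynamics z_{t+1} = z_t + a_t, not any of these systems' actual planners). Note how every iteration pays for a whole population of rollouts:

```python
# Cross-entropy method on a toy goal-reaching problem: sample action
# sequences from a Gaussian, evaluate each by a full rollout, keep the
# elites, refit the sampling distribution, repeat.
import random

def rollout_cost(z0, actions, goal):
    """Roll the toy dynamics forward and return the terminal squared error."""
    z = list(z0)
    for a in actions:
        z = [zi + ai for zi, ai in zip(z, a)]
    return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))

def cem_plan(z0, goal, horizon=5, pop=64, elites=8, iters=20):
    d = len(goal)
    dim = horizon * d
    mu, sigma = [0.0] * dim, [1.0] * dim
    for _ in range(iters):
        # Sample a population of flattened action sequences.
        cands = [[random.gauss(m, s) for m, s in zip(mu, sigma)]
                 for _ in range(pop)]
        # Evaluate every candidate with a full rollout (the expensive part).
        def cost(c):
            return rollout_cost(z0, [c[i:i + d] for i in range(0, dim, d)], goal)
        top = sorted(cands, key=cost)[:elites]
        # Refit mean and std to the elite set (std floored to avoid collapse).
        mu = [sum(c[j] for c in top) / elites for j in range(dim)]
        sigma = [max(1e-3, (sum((c[j] - mu[j]) ** 2 for c in top) / elites) ** 0.5)
                 for j in range(dim)]
    return [mu[i:i + d] for i in range(0, dim, d)]

random.seed(0)
plan = cem_plan([0.0, 0.0], [1.0, 2.0])
```

Each of the 20 iterations above runs 64 rollouts, versus one forward-backward pass per step for a gradient planner; that is the compute gap being described.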
<br>
This paper asks what exactly makes the latent planning landscape so hostile to gradients and what you can do about it.<br>
<br>
The diagnosis. Their baseline is DINO-WM, a JEPA-style world model with a ViT predictor planning in frozen DINOv2 feature space, minimizing terminal MSE between predicted and goal embeddings. The problem is that DINOv2 latent trajectories are highly curved (when you use MSE as the planning cost, you're implicitly assuming that Euclidean distance approximates geodesic distance along feasible transitions).<br>
<br>
For curved trajectories this breaks badly: gradient-based planners get trapped, and straight-line distances in embedding space misrepresent actual reachability.<br>
<br>
The fix draws from the perceptual straightening hypothesis in neuroscience — the idea that biological visual systems transform complex video into internally straighter representations. So they add a curvature regularizer during world model training.<br>
<br>
Given consecutive encoded states<br>
<br>
z_t, z_{t+1}, z_{t+2},<br>
<br>
define velocity vectors as<br>
<br>
v_t = z_{t+1} - z_t<br>
<br>
measure curvature as the cosine similarity between consecutive velocities, and minimize<br>
<br>
L_curv = 1 - cos(v_t, v_{t+1}).<br>
<br>
Total loss is then<br>
<br>
L_pred + λ * L_curv<br>
<br>
with stop-gradient on the target branch to prevent collapse.<br>
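Read literally (my paraphrase of the formulas above into code, not the authors' implementation), the penalty is one minus the cosine between consecutive velocity vectors, averaged along the trajectory:

```python
# Sketch of the curvature penalty L_curv = 1 - cos(v_t, v_{t+1}),
# with v_t = z_{t+1} - z_t, computed over one latent trajectory.
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon for numerical safety."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v + 1e-8)

def curvature_loss(latents):
    """Average 1 - cos(v_t, v_{t+1}) over a trajectory of latent states z_t."""
    vels = [[b - a for a, b in zip(z1, z2)]
            for z1, z2 in zip(latents, latents[1:])]
    terms = [1.0 - cosine(u, v) for u, v in zip(vels, vels[1:])]
    return sum(terms) / len(terms)

straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # collinear states
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]      # a 90-degree turn

print(curvature_loss(straight))  # ~0.0: velocities point the same way
print(curvature_loss(bent))      # 1.0: orthogonal velocities
```

In training this term would be weighted by λ and added to the prediction loss, with the stop-gradient applied as described.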
<br>
The theory backs this up cleanly — they prove that reducing curvature directly bounds how well-conditioned the planning optimization is — straighter latent trajectories guarantee faster convergence of gradient descent over longer horizons.<br>
<br>
Worth noting that even without the curvature loss, training the encoder with a prediction objective alone produces some "implicit straightening" — the JEPA loss naturally favors representations whose temporal evolution is predictable. Explicit regularization simply pushes this much further.<br>
<br>
Empirical results across four 2D goal-reaching environments are consistently strong. Open-loop success improves by 20-50%, and gradient descent with straightening matches or beats CEM at a fraction of the compute.<br>
<br>
The most convincing evidence is the distance heatmaps: after straightening, latent Euclidean distance closely matches the shortest distance between states, even though the model was trained only on suboptimal random trajectories.<br>
<br>
What I find interesting beyond the specific method is that the planning algorithm didn't change. The dynamics model didn't change. A single regularization term on the embedding geometry turned gradient descent from unreliable to competitive with sampling methods. <br>
<br>
The field has largely treated representation learning and planning as separate concerns — learn good features, then figure out how to plan in them. <br>
<br>
This paper makes a concrete case that the representation geometry is itself the bottleneck.<br>
<br>
This connects to a broader pattern in ML. When optimization fails, the instinct is to fix the optimizer (better search, more samples, adaptive schedules). But often the real lever is the shape of the space you're optimizing in.<br>
<br>
Same principle shows up in RL post-training where reward landscape shaping matters as much as the algorithm itself. <br>
<br>
Shape the space so simple optimization works, rather than building complex optimization to handle a bad space.<br>
<br>
Their paper:<br>
<a href="https://arxiv.org/abs/2603.12231">arxiv.org/abs/2603.12231</a></p>
<img src="https://pbs.twimg.com/card_img/2031482977276878848/A3K92F3y?format=jpg&amp;name=420x420_2" style="max-width:250px;">]]></description>
      <pubDate>Sun, 15 Mar 2026 06:02:44 +0000</pubDate>
      <link>https://rss.xcancel.com/zhuokaiz/status/2033061315707654255#m</link>
      <dc:creator>Yann LeCun / @ylecun</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5155204232</guid>
    </item>
    <item>
      <title><![CDATA[In 2019, feeling the limits of deep RL, I set out to define an AGI metric. Like a good ML agent, I tried collecting diverse examples of GI. A recommended reading list:
- history books: "gun, steel, germs", "sapiens", "roman history" by shiono nanami, "art of war"
- animals: "other minds🐙", "are we smart enough to know how smart animals are"
- psychology and neuroscience: "man who mistook his wife for hat", "being you"
- papers: Shane Legg, Daniel Polani, Francois Chollet, Ken Stanley, Karl Friston]]></title>
      <description><![CDATA[<p>In 2019, feeling the limits of deep RL, I set out to define an AGI metric. Like a good ML agent, I tried collecting diverse examples of GI. A recommended reading list:<br>
- history books: "gun, steel, germs", "sapiens", "roman history" by shiono nanami, "art of war"<br>
- animals: "other minds🐙", "are we smart enough to know how smart animals are"<br>
- psychology and neuroscience: "man who mistook his wife for hat", "being you"<br>
- papers: Shane Legg, Daniel Polani, Francois Chollet, Ken Stanley, Karl Friston</p>]]></description>
      <pubDate>Wed, 11 Mar 2026 20:49:31 +0000</pubDate>
      <link>https://rss.xcancel.com/shaneguML/status/2031834929605320891#m</link>
      <dc:creator>Shane Gu / @shaneguML</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5150785994</guid>
    </item>
    <item>
      <title><![CDATA[Good post on how to think about honing your skills as an (academic) researcher by Carlini

https://nicholas.carlini.com/writing/2026/how-to-win-a-best-paper-award.html]]></title>
      <description><![CDATA[<p>Good post on how to think about honing your skills as an (academic) researcher by Carlini<br>
<br>
<a href="https://nicholas.carlini.com/writing/2026/how-to-win-a-best-paper-award.html">nicholas.carlini.com/writing…</a></p>]]></description>
      <pubDate>Tue, 10 Mar 2026 19:15:35 +0000</pubDate>
      <link>https://rss.xcancel.com/natolambert/status/2031448901103005824#m</link>
      <dc:creator>Nathan Lambert / @natolambert</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5149408133</guid>
    </item>
    <item>
      <title><![CDATA[Nice write up. I also think we're at the beginning of software engineering completely changing forever. Not just "muh vibecoding yolo", but something more fundamental:

libraries are currently the cornerstone of swe. I think the future will be without libraries. You'll do ~everything in your codebase, tailored specifically to exactly what you do. The agents see the whole code. You can change the behavior of anything exactly the way you need it. No need to be backwards compatible. No need to worry about breaking others.

What i think might become more ubiquitous instead is something akin to "starter packs" which could really just be something like tutorials explaining a concept. Like if you make a game, get the "rpg(.)md from SquareEnix" and the "entity-component(.)md from Lucas" files into context and go.

Not "no libraries at all" but rather "only direct vendor libraries" like of course Vulkan or DirectX are still needed, but Unity or Unreal? Not so much. And i think this will happen in all domains. It will take a while, but i think we are at the very beginning of this future. I'm already starting to live in it.

I used to love coding for the sake of coding, and was a bit worried of it going away. However, it turns out it's not really the coding itself that i love, it's more the combined act of creating something and having "puzzles" to think about while doing so. And it turns out that in this new way of coding, these two parts are just as fun.]]></title>
      <description><![CDATA[<p>Nice write up. I also think we're at the beginning of software engineering completely changing forever. Not just "muh vibecoding yolo", but something more fundamental:<br>
<br>
libraries are currently the cornerstone of swe. I think the future will be without libraries. You'll do ~everything in your codebase, tailored specifically to exactly what you do. The agents see the whole code. You can change the behavior of anything exactly the way you need it. No need to be backwards compatible. No need to worry about breaking others.<br>
<br>
What i think might become more ubiquitous instead is something akin to "starter packs" which could really just be something like tutorials explaining a concept. Like if you make a game, get the "rpg(.)md from SquareEnix" and the "entity-component(.)md from Lucas" files into context and go.<br>
<br>
Not "no libraries at all" but rather "only direct vendor libraries" like of course Vulkan or DirectX are still needed, but Unity or Unreal? Not so much. And i think this will happen in all domains. It will take a while, but i think we are at the very beginning of this future. I'm already starting to live in it.<br>
<br>
I used to love coding for the sake of coding, and was a bit worried of it going away. However, it turns out it's not really the coding itself that i love, it's more the combined act of creating something and having "puzzles" to think about while doing so. And it turns out that in this new way of coding, these two parts are just as fun.</p>
<hr>
<blockquote>
<b>mike64_t (@mike64_t)</b>
<p>
</p><p>I think with Codex 5.3, the need for off-the-shelf deep learning libraries will fade away.<br>
Reasoning models operate best at the boundary of exact verifiability, so ever venturing too far into "well this is kinda correct" is no longer the best strategy. Exact verification now scales better than soft verification.<br>
When starting my current project, I deliberately decided against using any DL library because I wanted to take ownership of some things that are hard when a graph or eager model is in the way.<br>
Dispatching operations to multiple streams with fine-grained barrier relations is really stroking against the grain in PyTorch, and you are never really sure "am I really allowed to do this". There was a time for OpenGL, but people eventually did want a VkCmdBarrier for good reason.<br>
Because I also wanted predictable dispatch pacing, using C++ was a natural choice. Previously this meant taking on the burden of writing a lot of boilerplate, the equivalent of "shit I can't do this in unity, now I gotta write my own engine" which never seemed a good idea on the surface.<br>
Now I can say it was among the best decisions I have made. New operations are a prompt away, Codex can introspect and trace into any part of the codebase automatically, single-stepping even into nccl if ever needed, and supporting a new backend is trivial.<br>
At no point would your debugging lead into an opaque compiled native library you do not have the source code for, it will simply go-to-declaration one more time. In the age of reasoning models, a single source tree break is fatal and can be the difference between finding or not finding a bug.<br>
There is no cost to saying "write a test for this" and you've protected yourself against regressions for this case forever onwards.<br>
You can just say "implement muon, here's the repo" and it will do so and loss in wandb will literally look the same compared to the python baseline.<br>
Codex is a good autonomous debugger, so program runtime really starts to become a bottleneck, not thinking time. Hence start-up time is important.<br>
There is no reason your training script should take minutes to launch, when it could have performed the first step in the time it takes a shitty terminal to repaint.<br>
If your iteration loop was slow before, in the age of coding agents it is now fatal.<br>
By not triggering a billion library lazy inits at unpredictable points in time because your ML framework decided to do so, your Nsight traces look as clean as higher level profilers would, just with more introspectability. You finally get to use NVTX the way Nvidia always intended for you to do.<br>
Another thing, kernels are just cuda elf binaries. There is no reason to deal with a flash attention package installation. This is all cpu-side. Tell codex to write packaging logic to compile it AOT, and document the kernel signature how arguments have to be prepared. In the C++ code load that kernel from a resource and then simply pass those arguments.<br>
This approach is modular. Want a cutlass, flash attention, triton or cute dsl backend and reserve the right to write a custom kernel later? No problem. Nobody wants to write backend kernel dispatch logic, but you don't have to anymore.<br>
Does C++ scare you? Maintain a minimal Python reference implementation in PyTorch with the intent of keeping behavior exactly the same, just without all the optimizations. Exact verifiability means you can resume that cpp checkpoint in your Python implementation and get near-exact loss overlap in wandb and vice-versa. No more spook, it's either in the spec, or it's not. That is what verifiability means.<br>
While I think there is a large cost to move off of pre-existing infra, eventually taking ownership of more and more pieces of the codebase will become more and more desirable with this change in dynamic.</p>
<img src="https://pbs.twimg.com/media/HAvZTkIaAAEi01m.jpg" style="max-width:250px;">
<img src="https://pbs.twimg.com/media/HAvZlvibwAAOxPk.jpg" style="max-width:250px;">
<img src="https://pbs.twimg.com/media/HAvaPfOaAAoPKVA.png" style="max-width:250px;">
<img src="https://pbs.twimg.com/media/HAvbkM4bQAAkQRW.jpg" style="max-width:250px;">

<p></p>
<footer>
— <cite><a href="https://rss.xcancel.com/mike64_t/status/2020953317535514648#m">https://rss.xcancel.com/mike64_t/status/2020953317535514648#m</a>
</cite></footer>
</blockquote>]]></description>
      <pubDate>Wed, 04 Mar 2026 19:50:22 +0000</pubDate>
      <link>https://rss.xcancel.com/giffmana/status/2029283330513084527#m</link>
      <dc:creator>Lucas Beyer (bl16) / @giffmana</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5140980160</guid>
    </item>
    <item>
      <title><![CDATA[Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work]]></title>
      <description><![CDATA[
<p>If you are reading this, you probably feel the FOMO. Maybe you have seen the Twitter threads about coding agents completing entire features in minutes. Maybe a colleague mentioned they are “10x more productive” now — or “Influencers” saying AGI is here and you need to learn their particular thing now. Maybe you tried Claude Code and felt confused about why the magic everyone talks about is not working for you. This blog post is for those who want to cut through the hype and understand what actually works, what does not, and how to think about using agents to automate your own job further and further to be more productive.</p>



<p>I have been using agents — primarily Claude Code — for eight months to automate my own work. What you will read here is not speculation or theory. It is the product of hundreds of hours of experimentation, many failures, and some surprising successes. As a professor who does not write much code anymore, my perspective is different from the software engineering discourse that dominates Twitter. Most of my agent use is actually for writing — blog posts, grant proposals, meta reviews. While these problems might be non-traditional, they show exactly how coding agents can be used for all kinds of tasks beyond coding itself, and how far you can push agent use in different directions.</p>



<span id="more-1238"></span>



<p>Just to give you a hint of how powerful agents have been for me: usually, the first year as a professor is very stressful and involves a lot of work. For me, it felt easy. I had some luck here and there, but I believe the use of agents is a significant reason why things have been manageable for me when they are hard for others.</p>



<p>This blog post is my attempt to share what I have learned so that it might help you grow in the long term. I will detail a number of different things I have built with agents — which succeeded and which failed. There are plenty of blog posts out there about software engineering with agents; this one tries to give a broader, more balanced view.</p>



<p>Before I worked in AI, I spent three years in the automation industry in Germany, developing SCADA systems. That experience taught me how to think about automation systematically — when it makes sense, when it does not, and how to build skills over time. I bring that perspective here, along with concepts like process optimization that are standard in manufacturing but rarely discussed in the context of AI agents.</p>





<h2>What Is Hype and What Is Real</h2>



<p>This blog post is mostly about my own experience with concrete examples of successes and failure cases. This is <em>real</em>. And I want to contrast this slightly with the Twitter discourse, which can create unnecessary FOMO.</p>



<h3>Hype: Vast parallelization, large productivity increases, and autonomy</h3>



<p>What you see on Twitter is mostly about software engineering. And while agents in software engineering are real and powerful, software engineering is very unlike most other problems.</p>



<p>Firstly, in software engineering, you often have many parallel problems that you work on independently: bugs, new features, quality control and refactoring, GitHub discussions, and reviews. All of these tasks are independent and can be parallelized. The larger the codebase, the more independent work can be done, and the more parallel sessions become useful.</p>



<p>But here is the thing: the concept of parallel sessions is an agentic workflow pattern that is useful for software engineering and some other tasks, but most tasks cannot benefit from parallelization.</p>



<p>Secondly, while productivity gains in software engineering are real, they do not automatically translate to all tasks. Coding is a very general capability. Theoretically, you can do anything digital – which spans a lot of tasks. But in practice, automation of many other non-software-engineering tasks is very difficult or has small payoffs. Later in this blog post, I will give you a framework for how to think about this more broadly.</p>



<p>Thirdly, while a fully autonomous system can be impressive, real work that is useful often involves design decisions. Iteratively designing a system in shorter bursts of autonomy, with feedback loops in which you shape the final solution you want, can be much more effective than just rolling out agents until a solution is reached. Full autonomy works, but it is not very helpful for most work because the quality is too low.</p>



<h3>Real: Agents Should Be Used Everywhere</h3>



<p>This blog post is about using coding agents for all kinds of tasks and how I learned from that experience. After 8 months of Claude Code use and trying to automate countless tasks, here is my honest assessment: more than 90% of code and text should be written by agents. You need to do so, or you will be left behind.</p>



<p>This statement might seem controversial and as FOMO-inducing as the software engineering story I just critiqued. But I believe it is reality, and understanding how to adjust to that reality will be a big part of everyone’s jobs going forward. This blog post is an attempt to share my knowledge to help you on this path.</p>



<p>When I talk with people about this, a lot of people push back vehemently. Generate all your work with AI? They find it ridiculous. How can generic boilerplate generation replace the intricate style of a well-designed software system? It feels absurd for them to replace the immediately noticeable and distinct style of a writer with an AI-generated slop wall of text.</p>



<h2>Why AI-Generated Content Is Personal, Not Generic</h2>



<p>AI-generated content is personal content. When I explore a concept with Claude, the output is not generic. It is shaped entirely by my thinking, my style of conversation.</p>



<p>I really like connections between fields, and the topics I explore are highly personal and unique traces of my thinking — and with that, my taste.</p>



<p>Let me give you a vivid example. I once started a conversation about jihad — the concept in Islamic theology that is often misunderstood, about the inner struggle to do the right thing when it is difficult, but it really matters. From there, I ended up connecting it to Krishna’s advice to Arjuna about doing the right thing and to not worry about the outcome by surrendering the fruits of actions to him (karma yoga), to Lutheran grace that is purest when it emerges at the height of struggle, to Taoist Wu Wei where struggle disappears through letting go and letting your nature take over, to Beowulfian naked will against overwhelming odds — the struggle with Grendel as a symbol to surrender to your fate.</p>



<p>None of this exists in any textbook or on the internet. It is a fingerprint. A very personal fingerprint. If you were to read the details of these conversations, you would know parts of me intimately — who I am and why I am that way. You would know me to a degree that is usually only reserved for close friends and your partner.</p>



<p>Someone thinking that AI-generated content is impersonal and generic is deeply mistaken. The concept of soulless AI generation is an artifact of less powerful AI, or the mistake of taking your own generations as the limit of what AI can do rather than recognizing a skill issue that has to be overcome.</p>



<h2>Useful Background: How to Think About Automation</h2>



<h3>The Basic Calculus of Automation</h3>



<p>Before I worked in AI, I worked in the automation industry in Germany. I was developing SCADA systems — integrating data from machines and databases to enable the control of workflows via data and monitoring. The knowledge gained in these three years in the automation industry applies directly to automating your own work with agents.</p>



<p>The first important question is: when should you automate versus when should you not? While people always think about automation in terms of full automation, this is almost never the case in practice. You always have a degree of automation.</p>



<p>Here is how to think about useful automation: if you take your current degree of automation and increase it with a new technology, then you improve by a certain percentage. If a task takes 10 hours and you improve the degree of automation by 10%, then it takes 9 hours.</p>



<p>With this view, there is a simple calculus: how often do you do the task, and how long do you need to automate this task to improve the degree of automation by 10%? If this calculus leads to a result where the cost of automating something is higher than the gain, then the problem is not fit for automation. The task should be done manually. There are many tasks that should not be automated because it is not effective. I will give a lengthy example about email automation, where I tried hard, but it is a problem where automation fails.</p>



<p>Additionally, changing your workflow adds its own overhead. For example, you might save 30 seconds per run, but if your agent needs 30 seconds to generate that automation, then the effectiveness is 0%. The degree of automation improves by 0%.</p>



<p>If you invest so much time into improving your work with agents, you want to make sure that it actually helps. This simple calculus – while not perfect – is a simple tool to help you decide where to start with automating your work.</p>
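<p>This calculus fits in a few lines. A minimal sketch with my own, hypothetical parameter names; the per-run overhead term captures the case where the automation costs as much time as it saves:</p>

```python
def automation_payoff_hours(task_hours, runs_per_year, automation_gain,
                            build_hours, overhead_per_run=0.0):
    """Net hours saved per year by raising the degree of automation.

    automation_gain is the fraction of the task you automate away;
    overhead_per_run is new work the automation itself introduces
    (e.g. reviewing generated output); build_hours is the one-time
    cost of building the automation.
    """
    saved_per_run = task_hours * automation_gain
    return runs_per_year * (saved_per_run - overhead_per_run) - build_hours
```

<p>A 10-hour task done 20 times a year, automated by 10% at a build cost of 5 hours, nets 15 hours. If each run adds as much overhead as it saves, the payoff goes negative by exactly the build cost.</p>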



<h3>The Method: Process Optimization</h3>



<p>A very basic method in factory automation is process optimization. You are on the factory floor with a stopwatch. You look at and study what workers are doing. You time each step: how they hand off work to another person, when they wait for previous work to complete, how many resources they need, and if they wait for resources.&nbsp;</p>



<p>If you have all these components, you can construct a workflow — a process that reflects the current state of how work is done. Based on this, you can think about how to optimize. This thinking is extremely important for automating your own work. You have to think about how you work on a particular problem and how you can change your work with agents. Using this framework of process optimization can be extremely helpful to get a quick sense of how much productivity gains you can achieve with a particular solution. Sometimes you find that the process cannot be optimized much with agents – that saves you a lot of time.</p>



<p>Let me give you a concrete example. If I take one minute to read an email and 30 seconds to reply, then it takes 1 minute 30 seconds to complete an email. Now, if I use an agent to help me with my emails, then I need to guide it to process my emails. Then I need to read that content to decide how it should create drafts or answer emails so that I can then edit those drafts. But once you do this exercise, you realize that by using agents, you just shift the process and do some automatic generations — but you still need to read content, make sure it is aligned with your intent, and you need to edit the draft if it does not match exactly with what you wanted.</p>



<p>There are certain emails that are easy to automate. There are others that are not. It depends on your process and your underlying inputs to see if using agents and changing your process can actually lead to productivity.</p>



<p>Reading an email or reading AI-generated content has a cost. You need to include that cost in your process optimization thinking to understand if your process can benefit from automation. This insight — which seems obvious but is often ignored — is fundamental to achieving higher and higher degrees of automation.</p>
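<p>The stopwatch view can be made concrete. A toy sketch with made-up timings in the spirit of the email example; note that the agent-assisted flow is not automatically faster, because reading and editing stay in the loop:</p>

```python
def process_seconds(steps):
    """Total time of a workflow, given (step_name, seconds) pairs."""
    return sum(seconds for _, seconds in steps)

# Hypothetical timings: the human steps survive the automation.
manual = [("read email", 60), ("write reply", 30)]
agent_assisted = [("read email", 60), ("skim draft", 20),
                  ("edit draft", 15), ("send", 5)]
```

<p>Here the manual flow takes 90 seconds and the agent-assisted flow 100: the draft saved writing time but added reading and editing time.</p>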



<h3>The Long-Term Perspective: Building Automation Muscles</h3>



<p>The process perspective I just gave is a short-term view. You look at the underlying processes and the degree of automation, then think about how long you will need to automate that work and how much you increase the degree of automation. It is a simple calculus. This is classic automation, how it is done in Germany and other countries. Very cost-effective and optimal in the short term.</p>



<p>However, it is very short-sighted because it does not consider long-term consequences.</p>



<p>The long-term view is a Shenzhen-style perspective. It is not about making any automation useful in the short term. It is about making automation useful in the long run by gathering knowledge that improves automation over time.</p>



<p>It is essentially short-term calculus with a meta-automation step added: even if the degree of automation is not worth it in the short term, will the skills I build and the tools I develop make future automation effective that was previously ineffective? Does the additional knowledge help me with future automation effectiveness?&nbsp;</p>



<p>This is exactly what led from Shenzhen-style scrappy factories to highly structured dark factories that are fully automated. Chinese automation is far superior to Western automation, not because of scale, but because the long-term view of automation led to a higher quality and degree of automation.</p>



<p>This is an important concept. You need to optimize both short-term and long-term perspectives to effectively automate your own job. Europe is struggling because of its short-term view of automation. The US is struggling in many segments because it did not build the long-term skillset that is required to build the automation muscles to tackle more challenging problems.</p>



<p>In other words, using agents and failing at automating a task effectively is important. You need to gain skills to improve future automation, and that means sometimes trying to automate things that you know you will not be able to automate.</p>



<p>Making sure you learn over time is key. Often, you learn more from failures than from successes, and with agents it is no different.</p>



<h3>Why Automating Your Job Is Good for You</h3>



<p>Software engineers are not replaceable. They just level up. The current hiring challenges are driven by COVID and financial dynamics much more than by AI. Software engineers are now much more effective at building software more rapidly, and the value of software has not decreased significantly. An engineer who uses agents like a hot knife slicing through butter is actually more valuable because they can produce more software that still has significant value.</p>



<p>A common view, particularly from the Bay Area, is that this is the current state, but software engineering will be fully automated very soon. I have many friends at frontier labs who had this view about nine months ago, but it has broadly changed. They see that it is difficult to automate their own work and that, as they use these tools, new problems open up.</p>



<p>Even if an agent can do everything, it cannot do everything at the same time. If you have a limited amount of GPUs, you want to direct agents to tasks that are useful for you so they can generate value where you need it. While even that can be partially automated, once your existence is at stake, you probably want to direct what agents do yourself — at least specify the problem and solution you want.</p>



<p>I think it will be a long time until you use an agent to manage your retirement savings by analyzing the stock market and optimizing it fully autonomously. But what is more reasonable is that you build a system where you tell an agent what risk you are happy to accept and how to optimize this risk through hedging, so that you might manage your retirement fund with a trade-off between potential upside and risk over time. It would be unwise to fully trust an agent if you do not know the parameters that are important for you and how the agent chooses those parameters.</p>



<p>If resources are limited, you want to decide how those resources are used rather than fully trusting an agent. And if this is true, then directing agents will remain a problem, even if agents can do everything, because agents cannot do everything at once, because resources are finite.</p>



<p>Long story short, because of this resource problem, there will always be work where your personal preferences, decisions, and taste will be needed — even if 90% of the work happens through AI. From software engineering, we already see that these changes work, but they will not eliminate many jobs that we thought would be automated away quickly.&nbsp;</p>



<p>I think the other direction is actually more pressing: if you do not know how to use agents, you will not have a good job or be able to find a job. Agent use is now an essential skill that you need to develop and master.</p>



<h2>My Personal Experience with Automating My Own Work</h2>



<h3>Personal tools and pipelines</h3>



<p>What is most common on Twitter are examples of successful agent use, where people create a tool that is useful for them. Small extensions that help your everyday life — just vibe coding something that you always wanted and that is simple, but nobody provided.</p>



<p>While this is a simplistic way of using agents, it has its importance. This is a problem where agents work really well, and they require very little skill to be used correctly.</p>



<p>For example, I built tools that help me write this blog post. I built tools that help me work with agents. One of the most important tools is a voice tool, which helps me quickly interact with agents, particularly for parallel sessions. A voice tool also helps me because I have carpal tunnel in both my hands. Typing can be painful. I have a very custom keyboard layout and way of working with the keyboard that reduces pain to almost zero, but still, it is much more comfortable to just use my voice. And it is not only comfortable, it is also faster.</p>



<p>A main advantage is that with voice, you can inspect outputs and use your keyboard and mouse while narrating. This is extremely powerful. A key tool that everybody should develop is their own voice tool to use AI in this way, where they can do work while narrating.</p>



<h3>Tools for Students</h3>



<h4>Finding related papers: Replication of Connected Paper</h4>



<p>Another tool I built was to solve the problem of finding related work. The most useful tool I have ever used for this was <a href="https://www.connectedpapers.com/">Connected Papers</a>. It was free at one point, but then it became commercial. I need something like this at the beginning of a project and when writing the related work section of a paper. I did not want to pay for the subscription, and I wanted my students to be able to access it. So I just replicated the entire software system.</p>



<p>This was probably not effective for automation in the short term — I could have just paid for Connected Papers subscription. But it gave me an overview of what I can do in the long term: what tools can I build, what is too ambitious, what is less ambitious, and how can I be more effective when creating complex tools.</p>



<p>My connected papers replication uses the Semantic Scholar API to retrieve data. Then it builds statistics on the citation graph of papers to find papers that are very similar to what Connected Papers finds. The key insight I had is how Connected Papers works: it finds papers that are in conversation. Two related papers are often only indirectly connected, through a third paper that cites both of them. Together they create a chain, a loop of three papers. If you count how often such three-paper loops occur between a seed paper and a candidate, you have a very good way of finding related papers.</p>
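<p>A minimal sketch of that scoring idea, assuming the citation edges have already been fetched (for example via the Semantic Scholar API); the function and data shapes are my own illustration, not the tool's actual code:</p>

```python
from collections import Counter

def cocitation_scores(seed, cites):
    """Rank papers by how many third papers cite both them and the seed.

    cites maps each paper id to the set of paper ids it cites; every
    common citer closes one three-paper loop: seed <- citer -> other.
    """
    citers_of_seed = {p for p, refs in cites.items() if seed in refs}
    scores = Counter()
    for citer in citers_of_seed:
        for other in cites[citer]:
            if other != seed:
                scores[other] += 1
    return scores
```

<p>Papers that close many such loops with the seed are "in conversation" with it even when neither cites the other.</p>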



<p>The tool that I created was very useful, but here is where it failed: the user interface. The algorithm works well; making the software easy to use for others turned out to be the hard part. My students needed to execute a Python command and use a password to extract an API key – it is a mess to get started. Instead of local deployment, I should have built a regular website that you can access anywhere with just your browser.</p>



<p>If you want to create a tool that improves the degree of automation of a task, a useful tool is often not enough. You need to figure out how you and other people can use these tools intuitively and effectively.</p>



<p>You see that even creating simple tools like this Connected Papers replication can have its own complexity. But such failure cases give you perspective: while not highly successful in the short term, they will give you the skills needed to tackle problems more effectively in the long-term future.</p>



<p>I would encourage you to spend some time on projects that do not offer a high gain in terms of degree of automation, just for the sake of having more diverse points of failure that you encounter, which will inform your future automations.</p>



<h4>Exploiting coding agents as an API</h4>



<p>Other tools I built are mostly for my students. It was recently revealed that quite a bit of Claude Code usage actually comes from exploiting it as an API for regular LLM calls. This means you use a Claude Code endpoint just as an API in other work. While I do not use Anthropic for this, there are other providers where you can get frontier capabilities at about 1% of the usual API costs. So you get regular API calls, but at 1% of the price.</p>
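<p>The wrapper can be as small as a subprocess call. A sketch, assuming the agent CLI offers a non-interactive print mode (Claude Code's is <code>claude -p</code>); the command is a parameter, so any agent binary or a stand-in can be slotted in:</p>

```python
import subprocess

def ask_agent(prompt, cmd=("claude", "-p")):
    """Send one prompt to an agent CLI and return its stdout as text."""
    result = subprocess.run([*cmd, prompt], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()
```

<p>Put behind a thin HTTP server, something like this could become the cheap "API" endpoint students call for evaluations and brainstorming.</p>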



<p>I built this pipeline for my students a couple of months ago, and when I asked my students if they needed any GPUs for their projects, they said no — they just generate evaluations and research directions with the API that I created. It has been a very useful tool. For a research group, having easy access to frontier model capabilities without the cost that is typical for APIs is liberating for research. And I built this tool in about 2 hours. This is where good tooling really paid off.</p>



<h4>Other Tooling for Students</h4>



<p>Other tooling that is much more straightforward includes infrastructure for Slurm and a common infrastructure to analyze experiments. I believe Weights and Biases actually harms research by biasing the interpretation of results and how experiments are run, and so the custom tools that I have in the pipeline will help my students to avoid this bias.</p>



<p>A tool I have not developed yet, but which my colleagues have mentioned, is a review system where students can get feedback on ideas or papers by querying an agent or an agentic pipeline that mimics how they, the academic advisor, would give feedback to their students. Imagine a student being able to get a first-pass critique of their paper at any time they like without being embarrassed about it or worrying about perceptions. This would not replace advising, but it would make our collaborations more productive by handling the basic structural and clarity feedback automatically.</p>



<p>While not all of these tools might be useful, and some are more like distractions, it is clear that with the right pipelines, workflow, and tools, productivity for students can be increased dramatically — and this can be driven by an advisor who invests in building these systems.</p>



<p>Similarly, a technical manager can develop tools and guide a team in this way. Even if agents cannot do all the work, you need to figure out what work you actually want to do and how you want to build on each other’s work as a team. Agents can work independently, but it might not be useful if your team is pulling on different ends of a problem. If coordination is missing, and everyone is using agents in their own way, it can lead to disaster. The tools an advisor or manager builds can provide that coordination layer.</p>



<p>All of these examples highlight where tools fail, where tools can be useful, and where tools might not be useful in the short term but give you the skills to improve tools in the future.</p>



<h2>Writing Tasks</h2>



<h3>Blog Posts</h3>



<p>You might have guessed it already: this blog post is AI-generated. My previous post about <a href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Never Happen</a> was AI-generated too. More than 95% of the text from both blog posts comes from an AI model. I did not even type prompts. Most of it was just me rambling into a microphone while doing other things, then transcribing that voice into text, shaping it into a blog post, then doing a style transfer to shape it into my voice, and then adding some small snippets that have character.</p>



<p>The editing and the small added snippets, this last 5%, are a cherry on top that is very important. But the key point stands: 95% is AI-generated, yet I bet you still find this useful and enjoy the read. It has my style and my voice of writing and presenting information. Processing information in this way, to really make writing personal, is not that difficult if you use AI agents well.</p>



<p>While I am still experimenting with blog posts, this pipeline allows me to write blog posts much more quickly — and blog posts that are much more current. A blog post like this would have taken me days in the past. Now it takes about 3 hours: one hour to speak content into a microphone, 10 minutes for my agentic workflow, and then ~2 hours of reviewing and editing. It is very fast, and when using the agents, you notice that the quality is pretty good.</p>



<p>Would you agree that this blog post has soul? Or is it AI slop now that you know it is AI-generated?</p>



<h3>Writing Tasks: Grant Proposals</h3>



<p>Grant proposals are a major time sink as an academic. A CMU student costs $150,000, and I need to find that money by writing grant proposals. A lot of proposals are rejected, so you have to write lots of them.</p>



<p>It is interesting because while you might think the blog post approach should work, it actually does not work that well. Grant proposals need to have a particular structure, and even small deviations read poorly. Good design is familiar design, and good proposals are familiar proposals.</p>



<p>This is just like a good abstract — for example, an abstract in Nature has almost always the same structure, sentence by sentence, the same for every paper. That makes it easy to read abstracts because you know where to find information.&nbsp;</p>



<p>I am dyslexic, and reading is very slow for me, but I learned to read papers at a relatively okay pace because I understand that they have a common structure that repeats again, and again, and again. I can skip sections, skip to particular phrases, and I know where an interesting part begins. If the introduction says “In this paper” or “Here,” then I know now the list of contributions starts.</p>



<p>Grant proposals are highly structured. A free-flowing, talkative approach that I use for blog posts does not work out of the box, but it can be made to work by introducing an abstraction pattern.</p>



<p>This abstraction pattern works as follows: you create sentence-by-sentence extractions of what the grant proposal content should be. For example, for an abstract:</p>



<ul><li>The first sentence is about the general field and the subfield</li><li>The second sentence mentions the problem, why it is important, and why it has not been solved</li><li>The third sentence states your contributions and your main results</li><li>The fourth sentence explains your main method</li><li>Then, depending on taste, you expand on this method or keep it brief</li><li>Finally, you state the impact and broad implications</li></ul>



<p>If you have an AI model, you can apply this process very easily. Just take a couple of grant proposals of your own or others that you really liked. Then use an agent to do this, sentence by sentence, then merge multiple abstracted structures by commonality.</p>



<p>Then I use this structure together with an agent to create an interactive flow: the agent asks me particular questions, and I respond with a voice message about the content I want – this is often casual “rambling” about the research I want to do. After each response, the agent stores the content and checks the abstract template for missing key information. The agent asks follow-up questions, and I answer them with my voice tool.</p>
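<p>That check is easy to make mechanical. A sketch of the abstract template as a checklist the agent fills while you talk; the slot names are my own phrasing of the sentence-by-sentence structure above:</p>

```python
# Each slot corresponds to one sentence of the abstract template.
ABSTRACT_TEMPLATE = [
    "field_and_subfield",
    "problem_importance_and_gap",
    "contributions_and_main_results",
    "main_method",
    "method_details_optional",
    "impact_and_implications",
]

def missing_slots(filled, template=ABSTRACT_TEMPLATE):
    """Slots the agent still needs to ask follow-up questions about."""
    return [slot for slot in template if not filled.get(slot)]
```

<p>After each voice answer, the agent updates the filled slots and keeps asking until nothing is missing.</p>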



<p>I then have the agent generate the draft and then smooth it over by doing style transfer using particular proposals that I have written and like.</p>



<p>With this, I can create a four-page grant proposal in about an hour and a half — even faster than a blog post.</p>



<h3>Meta Reviews</h3>



<p>Machine learning conferences are notorious for bad reviewing. The reviewing system is broken. There have been studies on ICLR and NeurIPS with clear results: reviewing does not work. Reviewing can identify the very worst papers and the best papers, but in between it is a coin flip.</p>



<p>The finding from these studies is that reviewing quality is not related to knowledge but related to effort. Undergrad students have much higher quality reviews than PhD students or professors because they have more time and take it more seriously. For PhD students and professors, it is a chore.</p>



<p>Looking at that reality, using agents becomes very straightforward, and I would argue an imperative to improve review quality by reducing the time needed for reviewing.</p>



<p>In this case, we look at meta reviewing, reviewing the reviews, which is the task of an area chair. There are two philosophies about being an area chair. One is that you bring your own opinions and overrule decisions. The other is to follow what the reviewers said. I believe the second is more intellectually honest. While I have expertise and sometimes will overrule reviewers, I have not had the depth to read every paper thoroughly, and certain concerns might be valid. A good paper is not a paper that I like, but a paper that is useful for the research community, and usefulness is difficult to judge if you do not read a paper in depth.</p>



<p>What I built to help with meta reviewing is a system to analyze the discussions and the points where reviewers disagree, give summaries of papers, summarize which papers are borderline, and identify which are clear rejects or accepts. The borderline papers are the ones with high score variability; the clear accepts and rejects can be processed quickly — you can understand why people have certain views.</p>
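<p>A sketch of that triage step, assuming borderline papers show up as reviewer disagreement (score spread) around a middling mean; the thresholds are placeholders, not the system's actual values:</p>

```python
from statistics import mean, pstdev

def triage(scores, lo=4.0, hi=6.0, max_spread=1.0):
    """Classify one paper's review scores as accept, reject or borderline."""
    m, spread = mean(scores), pstdev(scores)
    if m >= hi and spread <= max_spread:
        return "accept"
    if m <= lo and spread <= max_spread:
        return "reject"
    return "borderline"  # middling mean or strong disagreement
```

<p>Only the papers that land in the borderline bucket need the deeper interactive analysis of reviews and rebuttals.</p>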



<p>The workflow is as follows: An agent uses my OpenReview login details to log in, navigate to the papers, get all the reviews and rebuttals, and store them to disk. Then the interactive part with the agent starts that helps me to understand where the issues are.</p>



<p>What is more subtle is tracking changes in the discussion. With the rebuttal, even if the score is not increased (which is very common because people do not have time), the rebuttal might contain information that could change the outcome.</p>



<p>From all this discussion about borderline cases, you can easily draft the first meta review. If it looks strange, you can ask the agent to explain, provide more detail, or provide evidence. It is a very interactive way of reviewing and actually mirrors what I would do without AI agents: separate straightforward and difficult cases; analyze difficult cases for disagreement; figure out which arguments have merit and if author rebuttals change the picture; draft a review; edit by looking at details; submit.</p>



<p>All these things can be done by an agent, and they can be done faster and probably more precisely. Understanding a subtle argument of a paper I have not seen before, between reviewers with different perspectives — this is hard if it is 5 PM and I have already had eight meetings, and I am just tired. But if I do it with my voice tool and my meta review agent system, this allows me to write high-quality meta reviews and make decisions that consider all information and arguments of reviewers and authors carefully.</p>



<p>The use of agents for meta reviews might be highly controversial, but again, AI-generated content is highly personal if you do it right. This also goes for reviews. I think we do a disservice to the research community if we do not use agents for reviewing, since they can improve quality dramatically.</p>



<h2>Where Agents Fail: A Study of Email Automation</h2>



<p>As I alluded to previously, I have tried to automate emails for a long time. Over two months, I worked quite a bit on automating emails — for one, because I do not like emails, and also because email is now a major part of my work.</p>



<p>I wanted to build a system that helps me manage, prioritize, and draft emails. Probably for most people, the process of “doing your emails” is similar: Categorize emails into urgency, map out information that is needed to make a reply, and prioritize replies with the time that you have now until your next meeting or other event.&nbsp;&nbsp;</p>
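<p>The prioritization step in that process can be sketched as a greedy fit: rank emails by urgency and pack replies into the time left before the next meeting. The categories, field names, and time estimates below are illustrative, not from my actual system:</p>

```python
def plan_replies(emails, minutes_available):
    """Pick which emails to answer now, most urgent first,
    until the time before the next meeting is used up.

    emails: list of dicts with 'subject', 'urgency' (1 = most urgent),
    and 'reply_minutes' (estimated time to draft a reply).
    Returns (answer_now, defer_for_later) lists of subjects.
    """
    now, later = [], []
    remaining = minutes_available
    for email in sorted(emails, key=lambda e: e["urgency"]):
        if email["reply_minutes"] <= remaining:
            now.append(email["subject"])
            remaining -= email["reply_minutes"]
        else:
            later.append(email["subject"])
    return now, later
```

<p>This is exactly the kind of logic that is trivial to do in your head, which foreshadows the problem with automating it.</p>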



<p>Doing this manually is very simple and fast. I can often look at the title and immediately sort an email into a category. I can skim an email for 10 seconds and know whether I need to reply now or whether it can wait. I can organize emails in bulk to review later.</p>



<p>My initial attempt at email automation was very focused on features. Can I do this categorization, prioritization, bulk sorting, and get the gist of an email with agents?</p>



<p>But here is the issue: even if you automate all of this, you still have a similar workflow. If you categorize an email automatically, you still need to look at the categorization to see if there are new emails. If you have an AI summary of an email, you still need to read it. If you create agent-generated drafts, you need to look at each draft and check that it has the right details and the right tone, and that it actually says what you wanted to say more broadly.&nbsp;</p>



<p>Furthermore, Gmail is a familiar interface. You know where everything is, and prioritization, categorization, and so on can all be done easily and quickly. If an AI does that, many things are automated, but you still need to use a user interface. This interface may be unfamiliar, not optimized for all workflows, or might miss crucial information or features. And navigating and using an agent-driven email system costs time, just like it costs time to do it manually.</p>



<p>Here, the process optimization view kicks in. If I can categorize an email within five seconds, that is pretty fast. An AI agent needs to beat those five seconds and be more precise than I am for it to actually be useful. While the reading and categorization can happen in the background, with an AI-generated draft, I still need to navigate to that draft and read it. That might take 10 to 30 seconds just for navigation and reading, plus an additional minute for editing the draft. In many cases, the manual approach is about equally fast. But if you add the development time for this system (it was more than 100 hours), using the agentic system becomes clearly net negative in terms of productivity.</p>
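<p>This accounting can be written out explicitly. A sketch of the break-even calculation (the specific volumes and per-email times below are illustrative placeholders, not measurements from the post):</p>

```python
def net_seconds_saved(emails_handled, manual_sec, agent_sec, dev_hours):
    """Total time saved by the agentic system (negative = net loss),
    once development time is amortized over all emails handled.

    manual_sec / agent_sec: average seconds per email for each approach.
    dev_hours: hours spent building the system.
    """
    per_email_gain = manual_sec - agent_sec
    return emails_handled * per_email_gain - dev_hours * 3600

# If the manual approach is about equally fast per email, the 100+ hours
# of development time can never be paid back, whatever the email volume.
```

<p>The design point is that amortization dominates: even a genuine per-email saving has to be multiplied over a very large volume before it offsets 100 hours of development.</p>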



<p>Despite all these edge cases, I did not want to give up. For one, I really do not like emails. But the second part is that, for me, it was a challenge: can I automate this task? And if I cannot, it would serve as a hard-won lesson for future automation challenges.</p>



<p>So I made a second attempt. I knew about the process. I knew about the importance of interfaces and how to structure information. Since I am an avid Vim user, I wanted to build a vim-optimized interface. This was a long process — co-designing functionality, agents, and the user interface. My productivity using the agentic email system improved day by day, but at some point I saw the improvement plateauing, and I asked: Is Gmail, if I use it the right way, faster?&nbsp;</p>



<p>So I compared time spent on emails between the tool I created and just using Gmail – which is very much the process optimization view of having a stopwatch on the factory floor. What I found is that just using Gmail is faster. I could not get any degree of automation improvement by using agents for emails.</p>



<p>That was a very important lesson. Sometimes you fail, and that failure teaches you something valuable for the next challenge.</p>



<h2>Conclusion</h2>



<p>If you take away one thing from this blog post, let it be this: agent use is a skill, and like any skill, it requires deliberate practice, understanding of when it applies, and acceptance that you will fail often before you succeed.</p>



<p>The hype is real in some domains and misleading in others. Software engineering parallelization is real but not generalizable. The personal nature of AI-generated content is real and profound. The need for process thinking before automation is real and often ignored.</p>



<p>I hope these perspectives have been useful to help you think about how you can use agents, where agents work well, and what is hype and what is not. The key is to think carefully, experiment often, and build skills for the long term. I hope this blog post will help you to make agents your own and see more and more benefits from agent use.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/">Use Agents or Be Left Behind? A Personal Guide to Automating Your Own Work</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
      <pubDate>Tue, 13 Jan 2026 12:56:37 +0000</pubDate>
      <link>https://timdettmers.com/2026/01/13/use-agents-or-be-left-behind/</link>
      <dc:creator>Tim Dettmers</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5081283761</guid>
    </item>
    <item>
      <title><![CDATA[How will OpenAI compete?]]></title>
      <description><![CDATA[<p class=""><em>“Jakub and Mark set the research direction for the long run. Then after months of work, something incredible emerges and I get a researcher pinging me saying: “I have something pretty cool. How are you going to use it in chat? How are you going to use it for our enterprise products?” </em></p><p class=""><em>- Fidji Simo, head of Product at OpenAI, </em><a href="https://podcasts.apple.com/us/podcast/openais-fidji-simo-on-why-ads-are-coming-to-chatgpt/id1840154537?i=1000749173634" target="_blank"><em>2026</em></a></p><p class=""><em>“You've got to start with the customer experience and work backwards to the technology. You can't start with the technology and try to figure out where you're going to try to sell it”</em></p><p class=""><em>- Steve Jobs, </em><a href="https://youtu.be/EZll3dJ2AjY?si=0o1olcWkA-L1y3rM&amp;t=106" target="_blank"><em>1997</em></a></p>





















  
  



<hr>


  <p class="">It seems to me that OpenAI has four fundamental strategic questions. </p><p class="">First, the business as we see it today doesn’t have a strong, clear competitive lead. It doesn’t have a unique technology or product. The models have a very large user base, but very narrow engagement and stickiness, and no network effect or any other winner-takes-all effect so far that provides a clear path to turning that user base into something broader and durable. Nor does OpenAI have consumer products on top of the models themselves that have product-market fit. </p><p class="">Second, the experience, product, value capture and strategic leverage in AI will all change an enormous amount in the next couple of years as the market develops. Big aggressive incumbents and thousands of entrepreneurs are trying to create new features, experiences and business models, and in the process try to turn foundation models themselves into commodity infrastructure sold at marginal cost. Having kicked off the LLM boom, OpenAI now has to invent a whole other set of new things as well, or at least fend off, co-opt and absorb the thousands of other people who are trying to do that. </p><p class="">Third, while much of this applies to everyone else in the field as well, OpenAI, like Anthropic, has to ‘cross the chasm’ across the ‘messy middle’ (insert your favourite startup book title here) without existing products that can act as distribution and make all of this a feature, and to compete in one of the most capital-intensive industries in history without cashflows from existing businesses to lean on. Of course, companies that do have all of that need to be able to disrupt themselves, but we’re well past the point that people said Google couldn’t do AI.  </p><p class="">The fourth problem is expressed in the quotes I used above. Mike Krieger and Kevin Weil made similar points last year: when you’re head of product at an AI lab, you don’t control your roadmap. 
You have very limited ability to set product strategy. You open your email in the morning and discover that the labs have worked something out, and your job is to turn that into a button. The strategy happens somewhere else. But where? </p><p class="">OpenAI does still at least arguably set the agenda for new models, and it has a lot of great technology and a lot of clever and ambitious people. But unlike Google in the 2000s or Apple in the 2010s, those people don’t have a thing that really really works already that no-one else can do. I think that one way you could see OpenAI’s activity in the last 12 months is that Sam Altman is deeply aware of this, and is trying above all to trade his paper for more durable strategic positions before the music stops. </p><p class="">Let’s start with the models.  </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp" data-image-dimensions="2500x1039" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=1000w" width="2500" height="1039" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=1000w 1000w, 
https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/5a2d4a08-2d6f-4654-9027-33ac9c15f6ad/Frontier%2BLanguage%2BModel%2BIntelligence%2C%2BOver%2BTime%2B%2824%2BJan%2B%2726%29%2B-2.png.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
          
          <figcaption class="image-caption-wrapper">
            <p class="">There are many benchmarks, all of them ‘wrong’, but they all show the same picture </p>
          </figcaption>
        
      
        </figure>
      

    
  


  





  <p class="">There are something like half a dozen organisations that are currently shipping competitive frontier models, all with pretty much equivalent capabilities. Every few weeks they leapfrog each other. There is variation within those capabilities, it’s possible to drop off the curve (Meta, for now) or fail to get onto it (Apple, Amazon, Microsoft, for now), or remain six months behind the frontier (China), or rely heavily on other people’s work (China, again) and all of this needs a lot of money (of which more below), but today there is no mechanic we know of for one company to get a lead that others in the field could never match. There is no equivalent of the network effects seen at everything from Windows to Google Search to iOS to Instagram, where market share was self-reinforcing and no amount of money and effort was enough for someone else to break in or catch up. </p><p class="">This could change if there was a breakthrough that enabled a network effect, most obviously continuous learning, but we can’t plan for that happening. It could also change if there are practical scale effects around access to proprietary data, which today is the dark matter of AI: this could be user data (but it’s not clear what scale here would look like, and the existing platform companies have a lot), or vertical data (foundation models don’t know what happens inside SAP or Salesforce and don’t have millions of spreadsheets from inside investment banks, though you can sometimes buy this or pay people to make it, but so can all the other big players). There might be something else here as well, but at the moment we don’t know and you can’t plan for it, so we have to presume that the models will remain very close for the time being. </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp" data-image-dimensions="1920x1080" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=1000w" width="1920" height="1080" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/fa12e9b9-d43d-4bd0-9c38-cc367c8a3cd1/2025%2BAutumn%2BAI.037.png.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
      
        </figure>
      

    
  


  





  <p class="">The one place where OpenAI does have a clear lead today is in the user base: it has 8-900m users. The trouble is, these are only ‘weekly active’ users: the vast majority even of people who already know what this is and know how to use it have not made it a daily habit. <a href="https://www.theinformation.com/articles/openai-projected-least-220-million-people-will-pay-chatgpt-2030?rc=qvvfkx">Only</a> 5% of ChatGPT users are paying, and even US teens are much more likely to use this a <a href="https://www.pewresearch.org/internet/2025/12/09/teens-social-media-and-ai-chatbots-2025/">few times a week or less</a> than they are to use it multiple times a day. The data that OpenAI released in its ‘2025 wrapped’ promotion tells us that 80% of users sent fewer than 1,000 ‘messages’ in 2025. We don’t know how that changed over the year (it probably grew), but at face value that’s an average of less than three prompts per day, and many fewer individual chats. Usage is a mile wide but an inch deep.</p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp" data-image-dimensions="1920x1080" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=1000w" width="1920" height="1080" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/2a1ef97d-b272-4826-ac97-d93e86e503d6/2025%2BAutumn%2BAI.083.png.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
      
        </figure>
      

    
  


  





  <p class="">This means that most people don’t see the differences between model personality and emphasis that you might see, and most people aren’t benefiting from ‘memory’ or the other features that the product teams at each company copy from each other in the hope of building stickiness (and memory is stickiness, not a network effect). Meanwhile, usage data from a larger (for now) user base itself might be an advantage, but how big an advantage, if 80% of users are only using this a couple of times a week at most? </p><p class="">This engagement is clearly a ‘glass half full or half empty?’ question, but this is supposed to be a transformation in how you use computers. If people are only using this a couple of times a week at most, and can’t think of anything to do with it on the average day, it hasn’t changed their life. OpenAI itself admits the problem, talking about a ‘capability gap’ between what the models can do and what people do with them, which seems to me like a way to avoid saying that you don’t have clear product-market fit. </p><p class="">Hence, OpenAI’s ad project is partly just about covering the cost of serving the 90% or more of users who don’t pay (and capturing an early lead with advertisers and early learning in how this might work), but more strategically, it’s also about making it possible to give those users the latest and most powerful (i.e. expensive) models, in the hope that this will deepen their engagement. Fidji Simo says here that “diffusion and scale is the most important thing.” That might work (though it also might drive them to pay, or drive them to Gemini). But it’s not self-evident that if someone can’t think of anything to do with ChatGPT today or this week, that will change if you give them a better model. It might, but it’s at least equally likely that they’re stuck on the blank screen problem, or that the chatbot itself just isn’t the right product and experience for their use-cases no matter how good the model is.  
</p><p class="">In the meantime, when you have an undifferentiated product, early leads in adoption tend not to be durable, and competition tends to shift to brand and distribution. We can see this today in the rapid market share gains for Gemini and Meta AI: the products look much the same to the typical user (though people in tech wrote off Llama 4 as a fiasco, Meta’s numbers seem to be good), and Google and Meta have distribution to leverage. Conversely, Anthropic’s Claude models are regularly at the top of the benchmarks but it has no consumer strategy or product (Claude Cowork asks you to install Git!) and close to zero consumer awareness.</p><p class="">This has led a lot of people to compare ChatGPT to Netscape, where Microsoft used distribution to crowbar its way into a market where the products themselves were hard to tell apart, much as Google and Meta do today. </p><p class="">Taking this a step further, it’s interesting to compare how hard it was to differentiate a web browser with how hard it is today to differentiate a chatbot: this looks like the same problem. You can make the rendering engine better in a browser and you can make the LLM better in a chatbot, but the browser itself, and the chatbot itself, are just an input box and an output box. The ChatGPT app, like all the chatbot apps, is just a ‘thin wrapper’ - how <em>could</em> you make yours different? In browsers, the answer was that you couldn’t - the last successful product innovations were tabs and merging search with the URL bar. Chatbots today have the same problem - how many more little buttons can you add, and how can you tell them apart? </p>





















  
  






  

  



  
    
      

        

        

        
          
            
              
                
                <a data-title="" data-description="" data-lightbox-theme="dark" href="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577572027-2DAGBXHOLMFAYUWT0OG3/IMG_0969.jpeg" role="button" aria-label="" class="
                    image-slide-anchor
                    
                      js-gallery-lightbox-opener
                    
                    content-fit
                  ">
                  
                    <span class="v6-visually-hidden">View fullsize</span>
                  
                  <img class="thumb-image" elementtiming="system-gallery-block-grid" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577572027-2DAGBXHOLMFAYUWT0OG3/IMG_0969.jpeg" data-image-dimensions="1260x2736" data-image-focal-point="0.5,0.5" alt="IMG_0969.jpeg" data-load="false" data-image-id="6993b59b8517240d22352415" data-type="image" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577572027-2DAGBXHOLMFAYUWT0OG3/IMG_0969.jpeg?format=1000w"><br>
                </a>
                
              
            
          

          
        

      

        

        

        
          
            
              
                
                <a data-title="" data-description="" data-lightbox-theme="dark" href="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577571997-RFIMLLVMRCUDA6HLQB44/IMG_0968.jpeg" role="button" aria-label="" class="
                    image-slide-anchor
                    
                      js-gallery-lightbox-opener
                    
                    content-fit
                  ">
                  
                    <span class="v6-visually-hidden">View fullsize</span>
                  
                  <img class="thumb-image" elementtiming="system-gallery-block-grid" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577571997-RFIMLLVMRCUDA6HLQB44/IMG_0968.jpeg" data-image-dimensions="1260x2736" data-image-focal-point="0.5,0.5" alt="IMG_0968.jpeg" data-load="false" data-image-id="6993b59b8517240d22352419" data-type="image" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577571997-RFIMLLVMRCUDA6HLQB44/IMG_0968.jpeg?format=1000w"><br>
                </a>
                
              
            
          

          
        

      

        

        

        
          
            
              
                
                <a data-title="" data-description="" data-lightbox-theme="dark" href="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573146-9YW3T8QQV53TYPSINGDZ/IMG_0970.jpeg" role="button" aria-label="" class="
                    image-slide-anchor
                    
                      js-gallery-lightbox-opener
                    
                    content-fit
                  ">
                  
                    <span class="v6-visually-hidden">View fullsize</span>
                  
                  <img class="thumb-image" elementtiming="system-gallery-block-grid" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573146-9YW3T8QQV53TYPSINGDZ/IMG_0970.jpeg" data-image-dimensions="1260x2736" data-image-focal-point="0.5,0.5" alt="IMG_0970.jpeg" data-load="false" data-image-id="6993b59b8517240d2235241b" data-type="image" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573146-9YW3T8QQV53TYPSINGDZ/IMG_0970.jpeg?format=1000w"><br>
                </a>
                
              
            
          

          
        

      

        

        

        
          
            
              
                
                <a data-title="" data-description="" data-lightbox-theme="dark" href="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573206-W4WIDT6QSNQS8AMQEQG2/IMG_0971.jpeg" role="button" aria-label="" class="
                    image-slide-anchor
                    
                      js-gallery-lightbox-opener
                    
                    content-fit
                  ">
                  
                    <span class="v6-visually-hidden">View fullsize</span>
                  
                  <img class="thumb-image" elementtiming="system-gallery-block-grid" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573206-W4WIDT6QSNQS8AMQEQG2/IMG_0971.jpeg" data-image-dimensions="1260x2736" data-image-focal-point="0.5,0.5" alt="IMG_0971.jpeg" data-load="false" data-image-id="6993b59b8517240d22352417" data-type="image" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/1770577573206-W4WIDT6QSNQS8AMQEQG2/IMG_0971.jpeg?format=1000w"><br>
                </a>
                
              
            
          

          
        

      
    
  

  













  <p class="">In the end, though, Microsoft won browsers for the first generation of the consumer internet but that turned out not to matter, because the experiences that did matter, and the value capture, were created elsewhere. Equally, the real question for the ‘engagement gap’ is whether this is solved through generalised model improvements that we must presume will apply across the entire industry, or innovation in the UI that, again, everyone will copy, or whether it is followed by the invention and creation of new experiences and use cases built on top of the ‘raw’ models. </p><p class="">If the next step is those new experiences, who does that, and why would it be OpenAI? The entire tech industry is trying to invent the second step of generative AI experiences - how can you plan for it to be you? How do you compete with this chart - with every entrepreneur in Silicon Valley? </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp" data-image-dimensions="1920x1080" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=1000w" width="1920" height="1080" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/50f54b52-64d3-4e31-9812-6f64f199cb1b/2025%2BAutumn%2BAI.047.png.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
      
        </figure>
      

    
  


  





  <p class="">So: you don’t know how you can make your core technology better than anyone else’s. You have a big user base but one that has limited engagement and seems really fragile. The key incumbents have more or less matched your technology and are leveraging their product and distribution advantages to come after the market. And, it looks like a lot of the value and leverage will come from new experiences that haven’t been invented yet, and you can’t invent all of those yourself. What do you do? </p><p class="">For a lot of last year, it felt like OpenAI's answer was “everything, all at once, yesterday”. An app platform! No, another app platform! A browser! A social video app! Jony Ive! Medical research! Advertising! More stuff I've forgotten!&nbsp; And, of course, trillions of dollars of capex announcements, or at least capex aspirations. </p><p class="">Some of this looked like ‘flooding the zone’, or just the result of hiring a lot of aggressive, ambitious people really quickly. There was also sometimes the sense of people copying the forms of previously successful platforms without quite understanding their purpose or dynamics: “platforms have app stores, so we need an app store!”</p><p class="">But late last year, Sam Altman tried to put it all together, showing this diagram, and using the famous quote from Bill Gates, that the definition of a platform is that it creates more value for its partners than for itself. </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp" data-image-dimensions="2500x1366" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=1000w" width="2500" height="1366" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/da576b9e-12f3-4230-aef6-84c2d2b93bbe/o4BfDtcAaig2WNi.png.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
      
        </figure>
      

    
  


  





  <p class="">At much the same time, the CFO published this diagram. </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp" data-image-dimensions="2133x1200" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=1000w" width="2133" height="1200" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/436427e0-fa96-4146-8253-706ed39b1a67/G8ZEru0WsAAmvN2.jpeg.webp?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
          
          <figcaption class="image-caption-wrapper">
            <p data-rte-preserve-empty="true">PSA: a 1:1 relationship between capex and revenue is not a flywheel</p>
          </figcaption>
        
      
        </figure>
      

    
  


  





  <p class="">This is a nice, neat, coherent strategy. Your capex is a virtuous circle in its own right, and it’s the foundation of a plan to build a full-stack platform company on top. You build every layer of the stack, from chips and infrastructure up, and the further you go up the stack, the more you help other people create their own things with your tooling. Everyone gets to use your cloud, your chips, your models, and then further up, the layers of the stack are mutually reinforcing. You have a network effect, you have an ecosystem, you have leverage up and down the stack, everyone builds on your stuff and everyone is locked in. Other people might build parts of that stack too, but you have a defensible, differentiated, value-capturing platform.  </p><p class="">That is indeed how Windows or iOS worked. The trouble is, I really don't think that's the right analogy. I don't think OpenAI has any of this. It doesn’t have the kind of platform and ecosystem dynamics that Microsoft or Apple had, and that flywheel diagram doesn’t actually show a flywheel. </p><p class="">Let's start by talking about capex. This is the stuff that gets everybody's attention, because the tech industry has never contemplated such large numbers before. The big four cloud companies spent about $400 billion on infrastructure last year and have announced plans for at least $650 billion this year. </p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
              
              
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png" data-image-dimensions="1920x1080" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=1000w" width="1920" height="1080" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/867efab5-e3b3-458a-9c9d-7007bf8f550e/2026+01+Autumn+AI.016.png?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
            
          
        

        
      
        </figure>
      

    
  


  





  <p class="">As we all know, OpenAI has been running around trying to join the club, claiming a few months ago to have $1.4tr and 30 gigawatts of compute commitments for the future (with no timeline), while it reported 1.9 gigawatts in use at the end of 2025. Since it doesn’t have the scale of cashflows from existing businesses that the hyperscalers can use, it has so far managed to do this, or at least announce this, with a combination of capital-raising (not all of which has necessarily <a href="https://www.wsj.com/tech/ai/the-100-billion-megadeal-between-openai-and-nvidia-is-on-ice-aa3025e3?gaa_at=eafs&amp;gaa_n=AWEtsqc4QyzEzHpN9m9m3XzhBITeQycD3E_mgvkmOznlrN8CU2wbZTD47jkmTvgv5Yg%3D&amp;gaa_ts=698fad69&amp;gaa_sig=Cl0PsNc1mBFOdnvqmL4dFkoPHdMItTHguXLOzCuQG7HryVo-VX2kj_vAbyIpTVq5IcEhCgbRzuJvvAkPMnFbZA%3D%3D"><span>closed</span></a>) and <a href="https://www.bloomberg.com/news/articles/2025-12-02/oracle-credit-fear-gauge-hits-highest-since-2009-on-ai-bubble-fears"><span>other people’s balance sheets</span></a> (some of which is also the famous ‘circular revenue’). </p><p class="">You can watch plenty of three-hour podcasts discussing all of this, and plenty of people have opinions about TPUs, Nvidia’s product lead, and Oracle’s strategy of borrowing against a declining but cash-generative legacy business to burn its way into the new thing, but how much should the rest of us care? Is this a path to a competitive advantage, or just a seat at the table? </p><p class="">We don’t really know what AI infrastructure costs will look like in the long term, but it’s quite possible that this turns out like the manufacture of airliners or semiconductors: there are no network effects, but with each generation the process gets more difficult and more expensive, and so those industries have gone from dozens of companies at the cutting edge to just Boeing and Airbus on one hand and TSMC on the other. 
Semiconductor manufacturing had both Moore’s Law, which everyone has heard of, and Rock’s Law, which most people haven’t: Moore’s Law said that the number of transistors on a chip was doubling every two years, but Rock’s Law said that the cost of a state-of-the-art semiconductor fab was doubling every four years. Maybe generative AI will work the same, with unit costs falling but fixed costs rising to the point that only a handful of companies are able to sustain the investment needed to build competitive models and everyone else is squeezed out.* This oligopoly would presumably have a price equilibrium, though it might be at high or low margins - this might all just be commodity infrastructure sold at marginal cost, especially given some of those at the table will be using their models to power other, much more differentiated businesses. Ask your favourite economist. **</p><p class="">So, when Sam Altman says he’s raised $100bn or $200bn, and when he says he’d like OpenAI to be building a gigawatt of compute every week (implying something in the order of a trillion dollars of annual capex), it would be easy to laugh at this as ‘braggawatts’, and apparently people at TSMC once dismissed him as ‘podcast bro’, but he’s trying to create a self-fulfilling prophecy. He’s trying to get OpenAI, a company with no revenue three years ago, a seat at a table where you’ll probably need to spend a couple of hundred billion dollars a year on infrastructure, through force of will. His force of will has turned out to be pretty powerful so far. </p><p class="">But, again, does that get you anything more than a seat at that table? TSMC isn’t just an oligopolist - it has a <em>de facto</em> monopoly on cutting edge chips - but that gives it little to no leverage or value-capture further up the stack. People built Windows apps, web services and iPhone apps - they don’t build TSMC apps or Intel apps. 
</p><p class="">Developers had to build for Windows because it had almost all the users, and users had to buy Windows PCs because it had almost all the developers (a network effect!). But if you invent a brilliant new app or product or service using generative AI, or add it as a feature to an existing product, you use the APIs to call a foundation model running in the cloud and the users don’t know or care what model you used. No-one using Snap cares if it runs on AWS or GCP. When you buy an enterprise SaaS product you don’t care if it uses AWS or Azure.&nbsp;And if I do a Google Search and the first match is a product that’s running on Google Cloud, I would never know. </p><p class="">That doesn’t mean these APIs are interchangeable -&nbsp;there are good reasons why AWS, GCP and Azure have very different market shares, and why developers choose each. But the customer doesn’t know or care. Running a cloud doesn’t give you leverage over third-party products and services that are further up the stack. </p><p class="">The difference now, perhaps, is that all of those services were separate silos: there was a common search and discovery layer at the top in Google and Facebook, and common infrastructures at the bottom in the cloud, but all those apps were never connected to each other. Now we have an emerging alphabet soup of standards and protocols for models and websites to talk to each other across ads, e-commerce and some kind of intent and automation (the brief enthusiasm around OpenClaw captured some of this). A website can surface its capabilities so that a subset can just show up in ChatGPT, be it a real estate search or a shopping cart. You’ll tell your agent to look at a recipe on Instagram and order the ingredients on Instacart. Everything can get piped to everything else, and everything can talk to each other! </p><p class="">Meanwhile, (saying the quiet part out loud), if you could set and control those APIs and manage the flows, that gives you power. 
Standards have been a basic competitive weapon in every generation of technology - remember Microsoft’s slogan ‘embrace and extend’. In particular, OpenAI suggests you’ll use your ChatGPT account as the glue linking all of these together. That’s a network effect! </p><p class="">I'm not sure about this: I’m not sure that this vision will really work, and if it does, I’m not sure it gives one company dominance. </p><p class="">First, there’s a recurring fallacy in tech that you can abstract many different complex products into a simple standard interface - you could call this the ‘widget fallacy’. A decade ago people said ‘APIs are the new BD’, which was really the same concept, and it mostly failed. This is partly because there’s a huge gap between what looks cool in demos and all of the work and thought in the interaction models and the workflows in the actual product: very quickly you’ll run into an exception case and you’ll need the actual product UI and a human decision. It’s also because the incentives are misaligned: no-one wants to be someone else’s dumb API call, so there’s an inherent tension or trade-off between the distribution that an abstraction layer might give you (Google Shopping, Facebook shopping, and now ChatGPT shopping) and your desire to control the experience and the customer relationship. Remember, after all, that all of Instacart’s profits come from showing ads. </p><p class="">Of course, this is just speculation - maybe it will all work this time! But the second problem is that if these are all separate systems plugged together by abstracted and automated APIs, is the user or developer locked into any one of them? If apps in the chatbot feed work, and OpenAI uses one standard and Gemini uses another, what stops a developer from doing both? This is much less code than making both an iOS and Android app, and anyway, can’t you get the AI to write the code for you? What does <em>that </em>do to developer lock-ins? 
Meanwhile, yes, maybe I’ll log into all of these services with my OpenAI or Gemini account, but does it necessarily make sense for me to log into Tinder, Zillow and Workday with the same account?  And, again, do they want that?</p><p class="">Hmm.  </p><p class="">As I've written this essay, I’ve returned again and again to terms like platform, ecosystem, leverage and network effect. These terms get used a lot in tech, but they have pretty vague meanings. Google Cloud, Apple's App Store, Amazon Marketplace, and even TikTok are all ‘platforms’ but they're all very different. </p><p class="">Maybe the word I'm really looking for is power. When I was at university, a long time ago now, my medieval history professor, Roger Lovatt, told me that power is the ability to make people do something that they don't want to do, and that's really the question here. Does OpenAI have the ability to get consumers, developers and enterprises to use its systems more than anybody else, regardless of what the system itself actually does? Microsoft, Apple and Facebook had that. So does Amazon - this is a real flywheel. ***</p>





















  
  














































  

    
  
    

      

      
        <figure class="
              sqs-block-image-figure
              intrinsic
            ">
          
        
        

        
          
            
          
            
                
                
                
                
                
                
                
                <img data-stretch="false" data-image="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png" data-image-dimensions="640x357" data-image-focal-point="0.5,0.5" alt="" data-load="false" elementtiming="system-image-block" src="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=1000w" width="640" height="357" sizes="(max-width: 640px) 100vw, (max-width: 767px) 100vw, 100vw" onload="this.classList.add(&quot;loaded&quot;)" srcset="https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=100w 100w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=300w 300w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=500w 500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=750w 750w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=1000w 1000w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=1500w 1500w, https://images.squarespace-cdn.com/content/v1/50363cf324ac8e905e7df861/34627da9-69ef-4909-bed7-6bf2cb004d10/amazonflywheel_updated.png?format=2500w 2500w" loading="lazy" decoding="async" data-loader="sqs">

            
          
        
          
        

        
      
        </figure>
      

    
  


  





  <p class="">To this point, I think a good way to read that Bill Gates quote is that what a platform really achieves is to harness the creative energy of the entire tech industry, so that you don’t have to invent everything yourself and massively more stuff gets built at massive scale, but it’s all done on your system with you holding the reins. Do LLMs have that? </p><p class="">Foundation models are certainly multipliers: massive amounts of new stuff will be built with them. But do you have a reason why everyone has to use your thing, even though your competitors have built the same thing? And are there reasons why your thing will always be better than the competition no matter how much money and effort they throw at it? That's how the entire consumer tech industry has worked for all of our lives. If not, then the only thing you have is execution, every single day. Executing better than everyone else is certainly an aspiration, and some companies have managed it over extended periods and even persuaded themselves that they’ve institutionalised this, but it’s not a strategy. </p>





















  
  



<hr>


  <p class=""><em>* Another comparison is the ‘capital as a weapon’ stage in ZIRP, when companies used cheap capital to buy their way to scale in the hope that it would give them market dominance with a network effect that would be defensible. This worked at least arguably for Uber, but didn’t for WeWork, which, like OpenAI, had no network effect. </em></p><p class=""><em>** It’s also possible that this turns out completely differently and (say) that we end up with a huge number of models of many different shapes and sizes, some of which run on the edge for free, and talking about an oligopoly of AI infrastructure makes as much sense as talking about an oligopoly of SQL infrastructure. We just don’t know. </em></p><p class=""><em>*** The counter to this is that Microsoft, Google, Apple and Meta may have lived in a world of ‘winner-takes-all’ effects, but they never thought they’d won. They were always looking over their shoulder. The iconic expression of this mentality, of course, is Andy Grove’s quote that ‘only the paranoid survive’. Intel had a network effect, and then it didn’t, and it had a technology lead, and then it didn’t.</em> </p>]]></description>
      <pubDate>Thu, 19 Feb 2026 20:51:04 +0000</pubDate>
      <link>https://www.ben-evans.com/benedictevans/2026/2/19/how-will-openai-compete-nkg2x</link>
      <dc:creator>Essays - Benedict Evans</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5124294089</guid>
    </item>
    <item>
      <title><![CDATA[Learning to Execute]]></title>
      <description><![CDATA[<div class="essay-content">
<p>Ten years ago this month, I was a 20-year-old PhD student sketching out the plan for what would become my first first-author paper.</p> <p>Ten years later, I've published over 160 papers and I'm an associate professor at Harvard running a lab. I've mentored hundreds of students on their very first paper. With every one of them, I get to think about what actually matters for doing great research—not the credentials or the pedigree, but the capabilities you can actually develop.</p> <p>The same three skills matter for paper number one as they do for paper number one hundred: execution, problem selection, and framing. This is my attempt to write them down.</p> <p>The first capability: knowing how to build. Can you actually make the thing? Do you know what's feasible versus what's fantasy?</p> <blockquote class="pull-quote">Execution isn't about brilliance. It's about reps.</blockquote> <p>In AI, this means the fundamentals—machine learning, systems, algorithms. You learn these through courses and practice. I learned a lot from coursework early in my PhD. But the real learning happens differently.</p> <p>I had a peer in grad school—an exceptionally skilled programmer—who read code every morning. Not wrote. Read. Popular repositories. Reference implementations. Line by line, every day.</p> <blockquote class="pull-quote">Reading code teaches you what reading papers can't. Papers describe ideas. Code reveals whether those ideas actually work.</blockquote> <p>Over time, they developed a sense for whether a method description was implementable or hand-wavy. This intuition is irreplaceable.</p> <p>Another habit I picked up from a collaborator: write code every single day. No exceptions. Even if just refactoring. Even if just adding comments. Something got committed daily.</p> <p>I adopted this practice. Looking at <a href="https://github.com/rajpurkar">my GitHub</a>, I've tried to follow this for a lot of the year. The continuity matters. 
The compounding works like this: daily code → continuous progress → faster completion → more practice → better at coding → even faster next time.</p> <p>Beyond code, documentation compounds too. Another colleague wrote everything down after every meeting and work session. Everything. Not just "experiment X got accuracy Y"—the hypothesis before running it, observations during, interpretations after, what to try next and why.</p> <p>When things worked, they knew exactly why. When things failed, they knew what to try next. Writing the paper took 2 weeks because all the content was already there.</p> <blockquote class="pull-quote">Documentation isn't overhead. It's pre-writing your paper.</blockquote> <p>These daily habits matter, but so does knowing when to cut your losses. Pay attention to early signals. When something starts working in the first couple weeks—even imperfectly—that positive momentum usually carries through. Projects that show early promise tend to work out. Projects that don't often stay stuck.</p> <p>I've spent months on projects that never showed early promise—time I could have redirected sooner. Get something working first, then improve it. A simple baseline that runs end-to-end beats a sophisticated method that's half-built. You can iterate from working code. Starting from nothing is brutal.</p> <p>The same principle applies to how you structure your work. Think modularly. Break problems into components. Solve each piece independently. Compose them. Debugging becomes easier. Code becomes reusable.</p> <p>Infrastructure matters more than you think. Early projects: hacky scripts, results scattered in random folders. This doesn't compound. Later: reusable components, systematic organization. Every time I write code, I ask: will I need this again? If yes, make it reusable now.</p> <p>Over time, you build intuition for what will work. Not certainty—intuition. 
This is taste for execution developing.</p> <p>That first paper I started sketching out ten years ago? The Stanford Question Answering Dataset. Over 10,000 citations.</p> <p>But execution alone isn't enough. You can build perfectly and still build the wrong thing.</p> </div>]]></description>
      <pubDate>Sat, 21 Feb 2026 01:54:10 +0000</pubDate>
      <link>https://pranavrajpurkar.com/essays/learning-to-execute.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5125808771</guid>
    </item>
    <item>
      <title><![CDATA[Recognizing Good Problems]]></title>
      <description><![CDATA[<div class="essay-content">
<p>I recently hit an i10 index of 100—meaning 100 of my papers have been cited at least 10 times each. What that represents isn't just output. It's reps. A hundred times watching an idea go from concept to paper to seeing whether it lands.</p> <p>After enough reps, something shifts. I've developed intuitions about which directions feel promising—and which ones probably won't pan out. That's taste.</p> <p>But here's the thing: you don't have to write a hundred papers to develop it. If you're mindful about the process—if you pay attention to what works and why—you can get there much faster.</p> <p>You can execute perfectly on the wrong problem. Execution gets the work done, but problem selection determines if it matters.</p> <blockquote class="pull-quote">Execution is necessary. Problem selection is decisive.</blockquote> <p>So how do you learn which problems matter? You can learn to code from courses. Learning which problems matter requires reading and experience. Lots of both.</p> <p>Some colleagues spend the first two hours of every morning reading before touching code. Every morning, for years. The math: 2 hours a day, 5 days a week, about 5-7 papers per week. That's 250-350 papers per year. After 2-3 years, they'd read 500-1000 papers. They see connections others miss.</p> <p>One of my mentors told me something specific: read 10-20 papers deeply on a topic, and ideas start generating automatically. Below that threshold, you're guessing. Above it, patterns emerge.</p> <p>This advice compressed years into a heuristic. It only worked because my mentor had earned it through experience. They had taste for what mattered. I benefited from their taste before developing my own.</p> <blockquote class="pull-quote">Find mentors with taste. It's the highest-leverage thing you can do early in your career.</blockquote> <p>How do you find them? Look for people whose work compounds—whose papers generate follow-up work, whose students succeed, whose ideas others adopt. 
Then find ways to contribute—good mentorship relationships are mutual.</p> <p>Mentorship works both ways. I run the <a href="https://www.rajpurkarlab.hms.harvard.edu/medical-ai-bootcamp-harvard-stanford">Medical AI Bootcamp</a>—a program that brings together students from Harvard, Stanford, and doctors worldwide. The bootcamp creates a structure for people to work on meaningful projects with complementary expertise and close mentorship. Watching students go from their first paper to building research programs has reinforced these lessons about how taste develops. Creating vehicles that enable collaboration matters.</p> <p>Beyond mentors, you need systems that force you to keep reading. For a while, I ran a weekly newsletter on AI and health. Every week, I had to find 5-7 papers to highlight. The newsletter itself didn't matter—readership was tiny. What mattered: it created a forcing function. I had to read even when I didn't feel like it.</p> <p>Writing summaries reinforced understanding. Months later, I'd remember "there was a paper about X" and find it in my notes. This compounds. Create external commitments that force the behavior you want. Newsletter, reading group, blog—whatever makes reading mandatory.</p> <p>The progression is visible. Early: you're trying to understand what they did. Slow and necessary. Middle stage: you start critiquing. Later: synthesis emerges. Connections appear automatically. Reading gets faster with practice.</p> <p>When students ask "what should I work on?", I usually ask back: "what have you read?" Often that's a sign there's more to explore first. Good ideas require inputs. Reading provides them.</p> <p>For each paper: What problem did they solve? What's new? What's missing? Could I build on this? These questions help me know if I've really engaged with a paper.</p> <p>Some of my best ideas came from reading outside my immediate area. Reading purely in your domain gives you depth. 
Cross-domain reading reveals opportunities others miss.</p> <blockquote class="pull-quote">The "why now" question is everything. Can this be solved today but couldn't two years ago? What threshold just crossed?</blockquote> <p>That question is the ultimate filter for good problems. And recognizing which problems pass through—developing taste for what matters—comes from reading, mentors, and time.</p> <p>Good problems need good framing. Otherwise, no one notices. That's the third capability—and where we go in Part 3.</p> </div>]]></description>
      <pubDate>Sat, 21 Feb 2026 01:54:13 +0000</pubDate>
      <link>https://pranavrajpurkar.com/essays/recognizing-good-problems.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5125808797</guid>
    </item>
    <item>
      <title><![CDATA[Making People Care]]></title>
      <description><![CDATA[<div class="essay-content">
<p>My first paper at the intersection of AI and medicine started with a bold opening: "We develop a model which can diagnose irregular heart rhythms from single-lead ECG signals better than a cardiologist." My mentor's reasoning: people don't have time to wait for the point. Give them the point early.</p> <p>That paper got over a thousand citations. Execution gets the work done. Problem selection determines if it matters. Framing determines if anyone notices.</p> <blockquote class="pull-quote">Writing is thinking, not reporting. You don't figure out what you did, then write it down. You write to figure out what you actually did and why it matters.</blockquote> <p>The papers I've written personally, I rewrite 10-15 times. Students and collaborators I've worked with follow the same principle. Clarity emerges through rewriting. First drafts reveal what you're trying to say. Rewrites reveal what you should say.</p> <p>Over time, I've converged on a structure that works across papers in different venues:</p> <p>Paragraph 1: The problem. Why it matters. Paragraph 2: What's been tried, why it's insufficient. Paragraph 3: What we did. Why now. Paragraph 4: What we found. What this unlocks.</p> <p>One page. Every sentence earns its place.</p> <p>The introduction matters disproportionately. Reviewers decide based on intro and first figure. I spend about 40% of writing time on the introduction. I rewrite it last, after results are done.</p> <p>Here's an example of the difference framing makes:</p> <p>Bad first sentence: "Machine learning has revolutionized healthcare." Good first sentence: "200,000 emergency department visits result in preventable deaths annually."</p> <p>The difference: specificity and urgency. Many papers are hard to understand, even at top venues. This creates an opportunity. Clear papers get accepted more easily, cited more often, implemented more widely.</p> <h2>The Power Law</h2> <p>Research impact follows a power law. 
A small fraction of papers generate most of the follow-up work—through territories they open, infrastructure they create, collaborators they attract.</p> <blockquote class="pull-quote">Papers that reframe problems compound more than papers that improve numbers.</blockquote> <p>The question: "Will this change how people think, or just what numbers they report?"</p> <p>Problem selection matters more over time. I spend more time now deciding what NOT to work on. Taste—the ability to recognize which problems will matter—becomes increasingly valuable.</p> <h2>What Compounds</h2> <p>Consistent work creates opportunities. Publishing regularly leads to collaborations. Working with others helps.</p> <p>What hasn't changed: still read regularly, still code regularly, still rewrite extensively.</p> <p>What has changed: reading is faster, projects start faster, drafting is faster. The skills build on each other over time.</p> <p>The work stops feeling like a slog at some point. Skills become automatic. Confidence builds.</p> <h2>The Timing</h2> <p>This is a good time to be doing research. AI tools make reading faster, experimentation more productive, writing clearer.</p> <p>We're in an interesting window. AI is powerful enough to accelerate research significantly, but not yet capable enough to do science autonomously. This period—where AI amplifies researchers rather than replacing them—creates unusual opportunities for productivity.</p> <p>The joy compounds too. Reading is easier when tools help you find relevant work. Experiments run faster when infrastructure improves. Writing clarifies faster with assistance.</p> <p>The three capabilities—execution, problems, framing—build on each other over years.</p> <h2>Looking Forward</h2> <p>I sometimes think about what advice I'd give 20-year-old me, sitting in that Stanford office in December 2015. Anxious. Uncertain. Hungry.</p> <p>The uncertainty doesn't go away. But it becomes familiar. The curiosity is an asset, not a liability. 
And the work itself—when you find the right problems—is deeply satisfying.</p> <p>The first decade took me from SQuAD to medical AI, from student to professor, from individual contributor to running a lab. The path wasn't linear. It never is.</p> <blockquote class="pull-quote">We'll see what it looks like ten years from now.</blockquote> </div>]]></description>
      <pubDate>Sat, 21 Feb 2026 01:54:16 +0000</pubDate>
      <link>https://pranavrajpurkar.com/essays/making-people-care.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5125808846</guid>
    </item>
    <item>
      <title><![CDATA[ShiQ: Bringing back Bellman to LLMs]]></title>
      <description><![CDATA[ <main> <div id="content">
<div id="abs-outer"> <div class="leftcolumn"> <div id="content-inner"> <div id="abs"> <a class="mobile-submission-download" href="https://arxiv.org/pdf/2505.11081">View PDF</a> <a class="mobile-submission-download" href="https://arxiv.org/html/2505.11081v1">HTML (experimental)</a> <blockquote class="abstract mathjax"> <span class="descriptor">Abstract:</span>The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM, seen as an initial policy. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning's effectiveness comes from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs. Our core contribution is to derive theoretically grounded loss functions from Bellman equations to adapt Q-learning methods to LLMs. To do so, we carefully adapt insights from the RL literature to account for LLM-specific characteristics, ensuring that the logits become reliable Q-value estimates. We then use this loss to build a practical algorithm, ShiQ for Shifted-Q, that supports off-policy, token-wise learning while remaining simple to implement. Finally, we evaluate ShiQ on both synthetic data and real-world benchmarks, e.g., UltraFeedback and BFCL-V3, demonstrating its effectiveness in both single-turn and multi-turn LLM settings. </blockquote> </div>
</div> <div class="submission-history"> From: Pierre Clavier [<a href="https://arxiv.org/show-email/264eedf7/2505.11081">view email</a>] <br> <strong>[v1]</strong> Fri, 16 May 2025 10:12:11 UTC (1,672 KB)<br>
</div> </div> <!--end leftcolumn-->
</div> </div> </main> ]]></description>
      <pubDate>Fri, 16 May 2025 00:00:00 +0000</pubDate>
      <link>https://arxiv.org/abs/2505.11081</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5085682171</guid>
    </item>
    <item>
      <title><![CDATA[Demystifying evals for AI agents]]></title>
      <description><![CDATA[The capabilities that make agents useful also make them difficult to evaluate. The strategies that work across deployments combine techniques to match the complexity of the systems they measure.]]></description>
      <pubDate>Fri, 09 Jan 2026 00:00:00 +0000</pubDate>
      <link>https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</link>
      <dc:creator>Anthropic Engineering Blog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5077652889</guid>
    </item>
    <item>
      <title><![CDATA[RT by @srush_nlp: I have some thoughts on this, and I mostly agree with you. But the picture is nuanced... Longer reply below: 👇]]></title>
      <description><![CDATA[<p>I have some thoughts on this, and I mostly agree with you. But the picture is nuanced... Longer reply below: 👇<br>
<br>
I agree that when RL is done right -- with the right prompt mixture, possibly a curriculum, and a sufficiently high-capacity base model, we should generally expect pass@k to increase with training steps, especially when we are not repeatedly running too many epochs over the same prompt mixture. In that sense, I agree that a drop in pass@k is not a universal phenomenon to fix by itself, but rather a symptom of something deeper with the cross product of the base model x prompt set x training epochs. For example, with typical prompt mixtures in open-source land and 4B or larger models, we haven’t seen this issue happen if we do not train for too long.<br>
<br>
That said, we do see this issue when the prompt mixture contains very different types of problems, and for certain classes of base models (we do not yet know how to precisely predict that this will happen for a base model, though I will give a mental model below for why this happens), pass@k does decrease in practice with multi-epoch training on the same prompt set. We have observed this consistently…<br>
<br>
A concrete example: suppose you train on a mixture of very easy problems (where the base model is already strong) and very hard problems (where the base model is weak, pass@k near 0), and the base model is not very large. RL will extract a strong learning signal from the easy subset and sharpen the policy there, but you will see pass@k dropping on the hard set. Importantly, this drop is not due to entropy or response length reduction on the easy subset, since collapsing onto the correct solution there is actually desirable, but instead due to negative transfer, i.e., some kind of interference that transfers these phenomena to the hard prompts as well, where entropy / response length collapse is undesirable.<br>
<br>
This phenomenon has essentially been referred to as ray interference in the multi-task RL literature (<a href="https://arxiv.org/abs/1904.11455">arxiv.org/abs/1904.11455</a>). We discuss this in our blog post on training on hard problems where exploration is a challenge (<a href="https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems/">blog.ml.cmu.edu/2025/11/26/h…</a>, see the “interference” section). Intuitively, RL optimizes one subset much faster than the other, which can stall or even harm learning on the hard problems (see the cartoon below).<br>
<br>
The concern I have with fixing this via reward shaping or novel algorithmic tweaks is that these approaches often address the symptom of the pass@k drop, rather than the cause, which is negative transfer (across problems) during RL. If instead we modify the prompt set algorithmically, for example by adding guidance to hard prompts (as we explore in the blog, <a href="https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems/">blog.ml.cmu.edu/2025/11/26/h…</a>), using curricula, or filtering data, this issue may largely disappear since we make progress on all problems simultaneously (roughly). Changing the reward metric or directly optimizing pass@k, as many papers do (including an older one of mine 🙂), is unlikely to resolve the core challenge on its own and pass@k might still decrease for higher k.<br>
<br>
Given differences in base models and data compositions, it seems that different RL recipes may be better suited to different settings. This means, perhaps our goal should be to identify, as simply as possible, whether this issue is likely to arise in a given RL setting and then address it with the minimal approach. Of course, we do not know all the ways to solve this issue but I think we have a reasonable handle on what is happening and why. It would be very interesting to see if there are even simpler ways to improve interference in RL in general…</p>
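<p>For reference, since the thread never defines it: pass@k is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021), given n sampled generations of which c are correct. A minimal sketch:</p>

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c correct)
    is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 correct generations out of 20 sampled:
print(pass_at_k(n=20, c=10, k=1))  # 0.5
print(pass_at_k(n=20, c=10, k=5))  # close to 1: diversity of attempts helps
```

<p>The thread's observation is then that RL on mixed-difficulty prompts can raise c on easy problems while pushing c toward 0 on hard ones, which shows up as a pass@k drop precisely where per-sample diversity mattered most.</p>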
<img src="https://pbs.twimg.com/card_img/2018214058080542720/INCrnAIT?format=jpg&amp;name=420x420_2" style="max-width:250px;">]]></description>
      <pubDate>Fri, 19 Dec 2025 03:22:54 +0000</pubDate>
      <link>https://rss.xcancel.com/aviral_kumar2/status/2001855734485582239#m</link>
      <dc:creator>Sasha Rush / @srush_nlp</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5054225681</guid>
    </item>
    <item>
      <title><![CDATA[Why AGI Will Not Happen]]></title>
      <description><![CDATA[
<p>If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective that is often missing in the current discourse: the physical reality of computation.</p>






<p>I have been thinking about this topic for a while now, and what prompted me to finally write this down was a combination of things: a Twitter thread, conversations with friends, and a growing awareness that the thinking around AGI and superintelligence is not just optimistic, but fundamentally flawed. The purpose of this blog post is to address what I see as very sloppy thinking, thinking that is created in an echo chamber, particularly in the Bay Area, where the same ideas amplify themselves without critical awareness. This amplification of bad ideas, exuded by the rationalist and EA movements, is a big problem in shaping a beneficial future for everyone. Realistic thinking can ground where we are and where we have to go to shape a future that is good for everyone.</p>



<p>I want to talk about hardware improvements, AGI, superintelligence, scaling laws, the AI bubble, and related topics. But before we dive into these specific areas, I need to establish a foundation that is often overlooked in these discussions. Let me start with the most fundamental principle.</p>





<h2>Computation is Physical</h2>



<p>A key problem with ideas, particularly those coming from the Bay Area, is that they often live entirely in the idea space. Most people who think about AGI, superintelligence, scaling laws, and hardware improvements treat these concepts as abstract ideas that can be discussed like philosophical thought experiments. In fact, a lot of the thinking about superintelligence and AGI comes from Oxford-style philosophy. Oxford, the birthplace of effective altruism, mixed with the rationality culture from the Bay Area, gave rise to a strong distortion of how to clearly think about certain ideas. All of this sits on one fundamental misunderstanding of AI and scaling: computation is physical.</p>



<p>For effective computation, you need to balance two things. You need to move global information to a local neighborhood, and you need to pool multiple pieces of local information to transform old information into new. While the complexity of local computation is virtually constant — much accelerated by smaller transistors — movement scales quadratically with distance to local computation units. While memory movement also benefits from smaller transistors, improvements quickly become sublinear due to the squared nature of memory access patterns.</p>



<p>This is most easily seen by looking at cache hierarchies. L1, L2 and L3 cache are physically the same technology, but computationally they are very different. L2 and L3 are much larger than L1, but they are also much slower. This is because L2 and L3 are further away, physically, from the computational core, and memory lookups need to traverse a longer distance due to the physical size.&nbsp;</p>



<p>Two ideas to remember: First, larger caches are slower. Second, as we get smaller and smaller transistors, computation gets cheaper, but memory becomes more expensive, relatively speaking. The fraction of silicon area dedicated to memory on a chip has increased over time to the point where now computational elements on a chip are trivial in proportion. Almost all area is allocated to memory. In other words, if you want to produce 10 exaflops on a chip, you can do that easily — but you will not be able to service it with memory, making it useless FLOPS (the NVIDIA marketing department is good at ignoring this fact). All of this makes AI architectures like the transformer fundamentally physical. Our architectures are not abstract ideas that can be developed and thrown around carelessly. They are physical optimizations of information processing units.</p>



<p>To process information usefully, you need to do two things: compute local associations (MLP) and pool more distant associations to the local neighborhood (attention). This is because local information alone only helps you to distinguish closely related information, while pooling distant information helps you to form more complex associations that contrast or augment local details. The transformer is one of the most physically efficient architectures because it combines the simplest ways of doing this local computation and global pooling of information. The global pooling of information might be made more effective through research, and there is still active investigation going on that I think might be promising, but it has diminishing returns — the transformer architecture is close to physically optimal.</p>



<p>Computation is physical. This is also true for biological systems. The computational capacity of all animals is limited by the possible caloric intake in their ecological niche. If you have the average calorie intake of a primate, you can calculate within 99% accuracy how many neurons that primate has. Humans invented cooking, which increased the physically possible caloric intake substantially through predigestion. But we reached the physical limits of intelligence. When women are pregnant, they need to feed two brains, which is so expensive that physically, the gut cannot mobilize enough macronutrients to keep both alive if our brains were bigger. With bigger brains, we would not be able to have children — not because of the birth canal being too small, but because we would not be able to provide enough energy — making our current intelligence a physical boundary that we cannot cross due to energy limitations.</p>



<p>We are close to reaching the same limits for digital computation.</p>



<h2>Linear Progress Needs Exponential Resources</h2>



<p>There have been studies about progress in all kinds of fields that come to the same conclusion: linear progress needs exponential resources. What does that mean? If you want to improve a system further and further, make it more precise, or improve its efficiency, you need exponentially more resources with any improvement that you make. This is true for all kinds of fields and problems being investigated, and it is pretty clear why.</p>



<p>There are two realities at play here: one physical and one in the idea space. In the physical reality, if you need to accumulate resources in time and space to produce an outcome, then for logistical reasons, the overall effect that is locally produced needs linear resources to produce a linear outcome. But because of physicality and because matter takes up space, those resources can only be pooled at an increasingly slowing rate due to contention in space or time.</p>



<p>In the idea space, there is a similar phenomenon, which is less obvious. If two ideas are completely independent, they can have an effect that is ten times larger than any single idea. But if ideas are related, then the overall impact is limited due to diminishing returns — the ideas are just too correlated. If an idea builds on another, it can only be so much better. Often, if there is a dependency between ideas, one is a refinement of the other. Refinements, even if they are extremely creative, will yield incremental improvements. If a field is large enough, even if one tries to work on very different ideas, they are still heavily related to previous ideas. For example, while state-space models and Transformers seem like very different approaches to attention, they concentrate on the same problem. Very minimal gains can be achieved through any idea that modifies attention in these ways.</p>



<p>These relationships are most striking in physics. There was a time when progress could be made by individuals – not so much anymore.</p>



<p>I talked to a top theoretical physicist at a top research university, and he told me that all theoretical work in physics is, in some sense, either incremental refinement or made-up problems. The core problem of the idea space is this: if the idea is in the same sub-area, no meaningful innovation is possible because most things have already been thought. A first urge is to look for wildly creative ideas, but the problem is that they are still bound by the rules of that subspace, rules that often exist for a very good reason (see the graduate-student theory-of-everything phenomenon). So the theoretical physicist faces only two meaningful choices: refine other ideas incrementally, which leads to insignificant impact; or work on rule-breaking unconventional ideas that are interesting but which will have no clear impact on physical theory.</p>



<p>Experimental physics demonstrates the physical limitations. The experiments that test more and more fundamental laws of physics and constituent particles — in other words, the standard model — become increasingly expensive. The standard model is incomplete, and we do not know how to fix it. Higher energies at the Large Hadron Collider have only led to more inconclusive results and the ruling out of more theories. We have no understanding of what dark energy or dark matter is, even though we build increasingly complex experiments that cost billions of dollars. The reality might be that certain aspects of physics are unknowable, hidden by complexity that cannot be overcome with the resources that we can muster.</p>



<p>If you want to get linear improvements, you need exponential resources.</p>
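<p>The claim can be illustrated numerically under an assumed power-law scaling relation L(C) = C<sup>−α</sup>, so that the compute implied by a target loss is C = L<sup>−1/α</sup>. The exponent below is illustrative, not fitted to any real model family:</p>

```python
def compute_for_loss(loss: float, alpha: float = 0.05) -> float:
    """Compute implied by an assumed power law L(C) = C**(-alpha),
    inverted to C = L**(-1/alpha). alpha = 0.05 is illustrative."""
    return loss ** (-1.0 / alpha)

# Equal, *linear* loss improvements: 1.0 -> 0.9 -> 0.8 -> 0.7
costs = [compute_for_loss(l) for l in (1.0, 0.9, 0.8, 0.7)]
ratios = [b / a for a, b in zip(costs, costs[1:])]
print(ratios)  # the compute multiple per step keeps growing
```

<p>Each identical loss decrement costs a larger compute multiple than the previous one — linear progress, exponential (and worse) resources.</p>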



<h2>GPUs No Longer Improve</h2>



<p>One of the most common misconceptions I see is that people assume hardware keeps improving and improving. This is an important misconception that explains a lot of the poor thinking around AI progress. The efficiency of GPUs has driven almost all innovation in AI. AlexNet was only possible by developing one of the first CUDA implementations that could compute convolutions over networked GPUs. Further innovation was mostly possible through improved GPUs and using more GPUs. Almost everybody sees this pattern — GPUs improve, AI performance improves — and it is easy to think that GPUs will improve further and will continue to improve AI outcomes. Every generation of GPUs has been better, and it would seem foolish to think that it will stop. But actually, it is foolish to think that GPUs will continue to improve. In fact, GPUs will no longer improve meaningfully. We have essentially seen the last generation of significant GPU improvements. GPUs maxed out in performance per cost around 2018 — after that, we added one-off features that exhaust quickly.</p>



<p>The first of these one-off features was 16-bit precision, then Tensor Cores, or the equivalent, then high-bandwidth memory (HBM), then the TMA or equivalent, then 8-bit precision, then 4-bit precision. And now we are at the end, both in the physical and the idea space. I have shown in my paper about k-bit inference scaling laws what data types with particular block sizes and computational arrangements are optimal. This has already been adopted by hardware manufacturers. Any further improvement will lead not to straightforward improvements but to trade-offs: either better memory footprint at lower computational efficiency or higher computational throughput at higher memory footprint. Even if you can innovate – linear improvements need exponential resources – further improvements will be trivial and will not add any meaningful advancement.</p>
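<p>To make the low-bit features concrete, here is a toy version of blockwise quantization — a sketch in the spirit of absmax blockwise methods, not the exact recipe from any paper: each block of weights stores one floating-point scale plus low-bit integers:</p>

```python
import numpy as np

def quantize_blockwise(x, block: int = 64, bits: int = 8):
    """Absmax blockwise quantization: per block, scale so the largest
    magnitude maps to the integer range, then round to low-bit ints."""
    qmax = 2 ** (bits - 1) - 1
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(dequantize_blockwise(q, s) - w).max()
print(err)  # rounding error, small relative to the weights' scale
```

<p>The trade-off the text describes is visible even here: fewer bits shrink memory but grow the rounding error, and the per-block scales themselves cost memory — there is no free axis left to improve.</p>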



<p>While GPUs can no longer improve meaningfully, rack-level optimizations are still critically important. Efficient shuttling of key-value caches is one of the most important problems in AI infrastructure. The current solution to this problem, however, is also relatively straightforward. Companies like OpenAI boast about their AI infrastructure, but it is relatively simple to design because there is essentially only one optimal way to design it. And while it is complex to implement, it just needs clear thinking and mostly hard, time-intensive engineering. But the overall system design is not particularly novel. OpenAI – or other frontier labs – have no fundamental advantage in their inference and infrastructure stacks. The only way to gain an advantage is by having slightly better rack-level hardware optimizations or data-center-level hardware optimizations. But these will also run out quickly – maybe 2026, maybe 2027.</p>
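<p>Why KV-cache shuttling dominates rack-level design is easy to see from its footprint: keys and values for every layer and position of every active request. A back-of-envelope sketch, with a purely hypothetical ~300B-class configuration (the numbers are assumptions, not a real model's):</p>

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Per-deployment KV-cache size: a key and a value vector (factor 2)
    for every layer, KV head, and token position, at dtype_bytes each."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical config: 60 layers, grouped-query attention with 8 KV heads,
# head_dim 128, one 32k-token request, fp16 cache.
gb = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128,
                    seq_len=32768, batch=1) / 2**30
print(round(gb, 2))  # 7.5 (GiB) for a single long request
```

<p>Several GiB per long-context request, multiplied across thousands of concurrent users, is the quantity that has to be moved between racks — which is why the one optimal system design the text mentions is largely dictated by this number.</p>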



<h2>Why Scaling Is Not Enough</h2>



<p>In my Twitter thread, I talked about how Gemini might signal a plateau in AI progress in the sense that we might not see meaningful improvements anymore. A lot of people responded with something along the lines of, “You are being too pessimistic. Can you not see that scaling works?” The point here is a bit more subtle, so I want to elaborate.&nbsp;</p>



<p>I believe in scaling laws and I believe scaling will improve performance, and models like Gemini are clearly good models. The problem with scaling is this: for linear improvements, we previously had exponential growth in GPU performance, which canceled out the exponential resource requirements of scaling. This is no longer true. In other words, previously we invested roughly linear costs to get linear payoff, but now it has turned to exponential costs. That would not be a problem on its own, but it sets a clear physical limit on scaling that is rapidly approaching. We have maybe one, maybe two more years of scaling left because further improvements become physically infeasible. The scaling improvements in 2025 were not impressive. Scaling in 2026 and 2027 had better work out.</p>



<p>Despite these exponential costs, the current infrastructure build-out is reasonable, particularly with the growth of inference use, but it still creates a very precarious balance. The biggest problem is this: if scaling does not provide much larger improvements than research/software innovations, then hardware becomes a liability and not an asset.&nbsp;</p>



<p>Small players like MoonshotAI and Z.ai show that they do not need many resources to reach frontier performance (I personally prefer Kimi K2-thinking over Sonnet 4.5 for coding). If these companies innovate beyond scale, they might just create the best model. While they might still use existing infrastructure, they could just switch to Huawei Ascend chips for inference, which are more than fine for providing good inference performance.</p>



<p>Another big threat to scaled-up infrastructure is that, currently, large-model inference efficiency is strongly tied to a large user base. Efficient deployment of a large model requires a certain number of GPUs to overlap computation with networking and to partition KV-cache across devices. Such deployments are ultra-efficient, but they demand a large user base to reach full utilization and, with that, cost-effectiveness. This is why open-weight models have not yet had the expected impact: the infrastructure cost of large deployments only pays off with a large user base. However, this problem can be solved with software.</p>
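<p>A back-of-envelope sketch (hypothetical numbers, not vendor specs) shows why even the memory floor alone pushes a large model onto multiple GPUs, before any throughput or utilization considerations:</p>

```python
import math

# Back-of-envelope sketch with hypothetical numbers: the minimum GPU
# count just to hold a model's weights plus some KV-cache headroom.
# Real deployments use far more GPUs to overlap compute with
# networking, which is what ties cost-effectiveness to a large
# user base.
def min_gpus(params_billions, bytes_per_param=1.0,
             gpu_mem_gb=80, kv_headroom=1.3):
    weights_gb = params_billions * bytes_per_param  # e.g. 8-bit weights
    return math.ceil(weights_gb * kv_headroom / gpu_mem_gb)

print(min_gpus(300))   # a ~300B model: 5 GPUs as an absolute floor
print(min_gpus(1000))  # a ~1T model needs considerably more
```

<p>The floor is small compared to what an ultra-efficient deployment actually uses; the gap between the floor and full utilization is what demands the large user base.</p>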



<p>While vLLM and SGLang currently optimize frontier-type deployments, they do not provide this efficiency at smaller scales. With the right inference stack beyond vLLM/SGLang, people could deploy a ~300-billion-parameter model with the same efficiency as OpenAI or Anthropic deploy their frontier models. If smaller models become more capable — we see this with GLM 4.6 — or if AI applications become more specialized, the infrastructure advantage of frontier labs might vanish overnight. Once the software complexity is solved, open-source, open-weight deployments might be close to physically optimal, both in computational efficiency and in information-processing efficiency. This is a large risk for frontier players.</p>



<p>Under slowing scaling, any of these three factors might degrade the value of AI infrastructure significantly and rapidly: (1) research/software innovations, (2) strong open-weight inference stacks, (3) shift to other hardware.</p>



<p>The current trends do not look good for frontier labs.&nbsp;</p>



<h2>Frontier AI Versus Economic Diffusion</h2>



<p>The US and China follow two different approaches to AI. The US follows the idea that there will be one winner who takes it all – the one that builds superintelligence wins. Even falling short of superintelligence or AGI, if you have the best model, almost all people will use your model and not the competition’s model. The idea is: develop the biggest, baddest model and people will come.&nbsp;</p>



<p>China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI. The key indicator of progress is how much AI is integrated into everything and how useful it is. If one model is better than another, it does not automatically mean it will be used more widely. What is important is that the model is useful and yields productivity gains at a reasonable cost. If the current approach is more productive than the previous one, it will be adopted. But hyper-optimization for slightly better quality is not very effective. In most cases, settling on “good enough” yields the highest productivity gain.</p>



<p>I think it is easy to see that the US philosophy is short-sighted and very problematic — particularly if model capability slows. The Chinese philosophy is more long-term focused and pragmatic.&nbsp;</p>



<p>The key value of AI is that it is useful and increases productivity. That makes it beneficial. It is clear that, similarly to computers or the internet, AI will be used everywhere. The problem is that if AI were just used for coding and engineering, it would have a very limited impact. While a lot of economic activity is supported by digital programs, these also have diminishing returns, and producing more software will not improve outcomes significantly if existing software is already good enough (just look at the SaaS failure in China). This makes widespread economic integration absolutely vital for AI effectiveness.</p>



<p>So in order to provide real value, AI needs to be used in ways that provide new benefits, not just improvements to what already exists. This is a difficult problem, but the right answer is to integrate AI into everything to squeeze out non-linear improvements, see what works and what does not, then keep what is working. China is taking this approach by subsidizing applications that use AI to encourage adoption. The Chinese population is very receptive to innovation, which facilitates this process. It is not unusual in China to see an 80-year-old grandma use AI to help her with her daily life. The US, on the other hand, bets on ideas like AGI and superintelligence, which I believe are fundamentally flawed concepts that have little relevance to future AI progress. This becomes clear when you think carefully about what these terms actually mean in physical reality.</p>



<h2>AGI Will Never Happen, and Superintelligence Is a Fantasy</h2>



<p>There is this pattern I have noticed: when you ask people in the Bay Area when AGI will happen, they always say it is a few years in the future, and it will have a massive impact. Then, if you ask them what AGI actually is, they do not include any physical tasks in their definition, and they do not consider resource inputs.&nbsp;</p>



<p>True AGI, one that can do all things humans can, would need to be able to do physical tasks – which comprise the largest economic sector. In short, AGI should include physical robots or machines that are able to do economically meaningful work in the physical world. While physical robots might be convenient for unloading your dishwasher, you will not see them replacing specialized systems in factories. Specialized robots in factories are too efficient, too precise. China demonstrates that dark factories — fully automated facilities — are already possible. Most robotics problems are solved problems in controlled environments. Most robotics problems that remain unsolved are also economically unviable. Stitching sleeves to a t-shirt is an unsolved robotics problem, but it is also not particularly economically meaningful in most contexts. Household robots will be interesting, but if it takes me two minutes to unload my dishwasher, I am not sure I need a robot for that. And while in a couple of years a robot might be able to fold laundry, I would rather spend a few minutes folding it myself with no creases than have a robot do a mediocre job.</p>



<p>The main problem with robotics is that learning follows scaling laws that are very similar to the scaling laws of language models. The problem is that data in the physical world is just too expensive to collect, and the physical world is too complex in its details. Robotics will have limited impact: factories are already automated, and other tasks are not economically meaningful.</p>



<p>The concept of superintelligence is built on a flawed premise. The idea is that once you have an intelligence that is as good or better than humans — in other words, AGI — then that intelligence can improve itself, leading to a runaway effect. This idea comes from Oxford-based philosophers who brought these concepts to the Bay Area. It is a deeply flawed idea that is harmful for the field. The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns. So, superintelligence can be thought of as filling gaps in capability, not extending the frontier. Filling gaps can be useful, but it does not lead to runaway effects — it leads to incremental improvements.</p>
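<p>A toy model (my own illustration, not the post's numbers) of why recursive self-improvement under these scaling laws does not run away: even doubling resources every improvement cycle yields a constant, not accelerating, capability increment.</p>

```python
import math

# Toy model, my own illustration: logarithmic returns (the flip side
# of "linear gains need exponential resources") mean that doubling
# resources every improvement cycle adds a *constant* capability
# increment per cycle -- incremental progress, not a runaway takeoff.
alpha = 0.5                      # hypothetical returns coefficient
resources, capabilities = 1.0, []
for _ in range(10):
    resources *= 2.0             # exponential resource growth per cycle
    capabilities.append(alpha * math.log(resources))

increments = [b - a for a, b in zip(capabilities, capabilities[1:])]
print(increments)  # every increment equals alpha * ln(2): flat
```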



<p>Furthermore, the same people who think that GPUs will infinitely improve are often the people who think superintelligence will make those improvements faster and better. But they do not realize that GPUs can no longer be meaningfully improved. We can wait for better HBM memory technology for speed, and for chiplets and advanced packaging to improve yield/cost, but that is it. Rack-level optimization will likely hit the physical wall in 2026 or 2027. A superintelligence will not accelerate the progress made in HBM development, manufacturing, testing, and integration. The transformer architecture is close to physically optimal. Superintelligence will not be able to meaningfully improve neural network architectures. Efficient large-scale deployments for inference are largely a solved engineering problem. It just needs some careful engineering and time, but very little creativity is required to solve this problem close to physical optimality. Superintelligence will not be able to improve our inference stack by much.</p>



<p>A superintelligence might help with economic diffusion of AI technology, but in the end, the limiting factor of economic diffusion is implementation and adoption, not capability. It is clear to me that any organization that strives primarily for superintelligence as a goal will encounter significant challenges and will ultimately falter and be displaced by players that provide general economic diffusion.&nbsp;</p>



<p>In summary, AGI, as commonly conceived, will not happen because it ignores the physical constraints of computation, the exponential costs of linear progress, and the fundamental limits we are already encountering. Superintelligence is a fantasy because it assumes that intelligence can recursively self-improve without bound, ignoring the physical and economic realities that constrain all systems. These ideas persist not because they are well-founded, but because they serve as compelling narratives in an echo chamber that rewards belief over rigor.&nbsp;</p>



<p>The future of AI will be shaped by economic diffusion, practical applications, and incremental improvements within physical constraints — not by mythical superintelligence or the sudden emergence of AGI. The sooner we accept this reality, the better we can focus on building AI systems that actually improve human productivity and well-being.</p>
<p>The post <a rel="nofollow" href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/">Why AGI Will Not Happen</a> appeared first on <a rel="nofollow" href="https://timdettmers.com">Tim Dettmers</a>.</p>
]]></description>
      <pubDate>Wed, 10 Dec 2025 15:05:30 +0000</pubDate>
      <link>https://timdettmers.com/2025/12/10/why-agi-will-not-happen/</link>
      <dc:creator>Tim Dettmers</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5044059276</guid>
    </item>
    <item>
      <title><![CDATA[Find Space]]></title>
      <description><![CDATA[People tend to all pursue the same opportunity but the greater impact is often had by pursuing the opportunity nobody else sees]]></description>
      <pubDate>Tue, 11 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://boz.com/articles/find-space</link>
      <dc:creator>boz.</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5012361903</guid>
    </item>
    <item>
      <title><![CDATA[Why speed matters]]></title>
      <description><![CDATA[<img width="150" height="150" src="https://lemire.me/blog/wp-content/uploads/2025/12/Capture-decran-le-2025-12-06-a-14.08.31-150x150.png" class="webfeedsFeaturedVisual wp-post-image" alt="" style="display: block; margin-bottom: 5px; clear:both;max-width: 100%;" link_thumbnail="" decoding="async" loading="lazy"><p>The one constant that I have observed in my professional life is that people underestimate the need to move fast.</p>
<p>Of course, doing good work takes time. I once spent six months writing a URL parser. But the fact that it took so long is not a feature, it is not a positive, it is a negative.</p>
<p>If everything is slow-moving around you, it is likely not going to be good. To fully make use of your brain, you need to move as close as possible to the speed of your thought.</p>
<p>If I give you two PhD students, one who completed their thesis in two years and one who took eight years… you can be almost certain that the two-year thesis will be much better.</p>
<p>Moving fast does not mean that you complete your projects quickly. Projects have many parts, and getting everything right may take a long time.</p>
<p>Nevertheless, you should move as fast as you can.</p>
<p>For multiple reasons:</p>
<p>1. A common mistake is to spend a lot of time—too much time—on a component of your project that does not matter. I once spent a lot of time building a podcast-like version of a course… only to find out later that students had no interest in the podcast format.</p>
<p>2. You learn by making mistakes. The faster you make mistakes, the faster you learn.</p>
<p>3. Your work degrades, becomes less relevant with time. And if you work slowly, you will be more likely to stick with your slightly obsolete work. You know that professor who spent seven years preparing lecture notes twenty years ago? He is not going to throw them away and start again, as that would be a new seven-year project. So he will keep teaching using aging lecture notes until he retires and someone finally updates the course.</p>
<p>What if you are doing open-heart surgery? Don’t you want someone who spends days preparing and who works slowly? No. You almost surely want the surgeon who does many, many open-heart surgeries. They are very likely to be the best one.</p>
<p>Now stop being so slow. Move!</p>
]]></description>
      <pubDate>Fri, 05 Dec 2025 19:24:50 +0000</pubDate>
      <link>https://lemire.me/blog/2025/12/05/why-speed-matters/</link>
      <dc:creator>Daniel Lemire&#39;s blog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5039356409</guid>
    </item>
    <item>
      <title><![CDATA[Welcome, Mistral!]]></title>
      <description><![CDATA[<p>Welcome, Mistral!</p>
<p><a href="https://xcancel.com/Laz4rz/status/1996601221319193002#m">xcancel.com/Laz4rz/status/1996601221319193002#m</a></p>]]></description>
      <pubDate>Thu, 04 Dec 2025 23:42:28 +0000</pubDate>
      <link>https://xcancel.com/giffmana/status/1996726831924261262#m</link>
      <dc:creator>Lucas Beyer (bl16) / @giffmana</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5037948294</guid>
    </item>
    <item>
      <title><![CDATA[Galaxy brain resistance]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Mon, 17 Nov 2025 21:29:54 +0000</pubDate>
      <link>https://vitalik.eth.limo/general/2025/11/07/galaxybrain.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5019011309</guid>
    </item>
    <item>
      <title><![CDATA[Kimi K2 Thinking]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Fri, 07 Nov 2025 02:22:29 +0000</pubDate>
      <link>https://moonshotai.github.io/Kimi-K2/thinking.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5007077428</guid>
    </item>
    <item>
      <title><![CDATA[Epoch AI (@EpochAIResearch)]]></title>
      <description><![CDATA[<div class="tweet-body"> <div></div> <p class="tweet-content">We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time. See thread for details!</p> <div class="attachments"><div class="gallery-row"><div class="attachment image"><a class="still-image" href="https://pbs.twimg.com/media/G420HxtbMAAek6M.jpg?name=orig"><img src="https://pbs.twimg.com/media/G420HxtbMAAek6M.jpg?name=small&amp;format=webp" alt=""></a></div></div></div> <p class="tweet-published">Nov 3, 2025 · 8:16 PM UTC</p> </div>]]></description>
      <pubDate>Tue, 04 Nov 2025 02:06:26 +0000</pubDate>
      <link>https://xcancel.com/EpochAIResearch/status/1985441059032478172</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/5003162949</guid>
    </item>
    <item>
      <title><![CDATA[Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here:
https://github.com/karpathy/nanochat/discussions/164

This is done via a new synthetic task `SpellingBee`  that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview:

- You have to ensure diversity in user prompts/queries
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the problem in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.

If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)]]></title>
      <description><![CDATA[<p>Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here:<br>
<a href="https://github.com/karpathy/nanochat/discussions/164">github.com/karpathy/nanochat…</a><br>
<br>
This is done via a new synthetic task `SpellingBee`  that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview:<br>
<br>
- You have to ensure diversity in user prompts/queries<br>
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.<br>
- I am encouraging the model to solve the problem in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.<br>
<br>
If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)</p>
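<p>A simplified rendition in the spirit of the described <code>SpellingBee</code> task (my own sketch, not the nanochat implementation): pick a templated user query, then build an "ideal" assistant solution that spells the word out character by character with an explicit running counter, so the reasoning is spread over many tokens.</p>

```python
import random

# My own simplified sketch of a SpellingBee-style synthetic example,
# not the nanochat code: a templated user query plus an "ideal"
# assistant solution with a per-character running counter.
def make_example(word, letter, rng):
    templates = [
        "How many '{l}' are in '{w}'?",
        "Count the letter '{l}' in the word '{w}'.",
    ]
    prompt = rng.choice(templates).format(l=letter, w=word)
    steps, count = [], 0
    for ch in word:              # one small reasoning step per character
        count += ch == letter
        steps.append(f"{ch} -> {count}")
    answer = (f'Spelling "{word}": ' + ", ".join(steps)
              + f". Total: {count}.")
    return {"user": prompt, "assistant": answer}

ex = make_example("strawberry", "r", random.Random(0))
print(ex["assistant"])  # ends with "Total: 3."
```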
<img src="https://pbs.twimg.com/media/G4CQV2qWMAAwA4V.jpg" style="max-width:250px;">]]></description>
      <pubDate>Fri, 24 Oct 2025 15:35:18 +0000</pubDate>
      <link>https://rss.xcancel.com/karpathy/status/1981746327995465816#m</link>
      <dc:creator>Andrej Karpathy / @karpathy</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4991754435</guid>
    </item>
    <item>
      <title><![CDATA[RT by @zdhnarsil: RainbowGRPO]]></title>
      <description><![CDATA[<p>RainbowGRPO</p>
<p><a href="https://xcancel.com/Devvrit_Khatri/status/1978864275658871099#m">xcancel.com/Devvrit_Khatri/status/1978864275658871099#m</a></p>]]></description>
      <pubDate>Fri, 17 Oct 2025 02:06:52 +0000</pubDate>
      <link>https://xcancel.com/_AndrewZhao/status/1979006166047891758#m</link>
      <dc:creator>Dinghuai Zhang 张鼎怀 / @zdhnarsil</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4983670814</guid>
    </item>
    <item>
      <title><![CDATA[Writing apps is so much easier than writing libraries.

✓ No multiple versions of Python to deal with. 

✓ Less chance that a release will break somebody else's code.

✓ Your code only has to work for your own use case.]]></title>
      <description><![CDATA[<p>Writing apps is so much easier than writing libraries.<br>
<br>
✓ No multiple versions of Python to deal with. <br>
<br>
✓ Less chance that a release will break somebody else's code.<br>
<br>
✓ Your code only has to work for your own use case.</p>]]></description>
      <pubDate>Thu, 16 Oct 2025 13:21:50 +0000</pubDate>
      <link>https://rss.xcancel.com/willmcgugan/status/1978813636408705283#m</link>
      <dc:creator>Will McGugan / @willmcgugan</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4982391694</guid>
    </item>
    <item>
      <title><![CDATA[Claude Code sub-agents]]></title>
      <description><![CDATA[
    <p>Claude Code includes the ability to run sub-agents, where a separate agent loop with a fresh token context is dispatched to achieve a goal and report back when it's done. I wrote a bit about how these work <a href="https://simonwillison.net/2025/Jun/2/claude-trace/">in June</a> when I traced Claude Code's activity by intercepting its API calls.</p>
<p>I recently learned that you can <em>deliberately</em> trigger Claude Code to use sub-agents just by telling it to do so!</p>
<p>Here's an example. I ran this using Claude Code against a fresh checkout of <a href="https://github.com/simonw/datasette">simonw/datasette</a> (commit <code>ec38ad37689f3c14d307770977a17aed5efc5cb9</code>), with no additional custom instructions:</p>
<blockquote>
<p><code>Write markdown documentation for the context passed to each of the templates in this project. Use sub-agents.</code></p>
</blockquote>
<p>Here's a snippet of the output:</p>
<blockquote>
<p>⏺ I'll help you write markdown documentation for the context passed to each template in this project. Let me start by exploring the codebase to understand the template structure and then use sub-agents to document them comprehensively.</p>
<pre><code>⏺ Search(pattern: "**/*.html")
  ⎿ &nbsp;Found 47 files (ctrl+o to expand)

⏺ Search(pattern: "render.*template|render_template")
  ⎿ &nbsp;Found 16 files (ctrl+o to expand)
</code></pre>
<p>[...]</p>
<p>⏺ Now let me launch sub-agents to analyze different template contexts in parallel:</p>
<pre><code>⏺ Task(Document index template context)
  ⎿ &nbsp;Done (12 tool uses · 86.4k tokens · 1m 13s)

⏺ Task(Document database template context)
  ⎿ &nbsp;Done (15 tool uses · 55.4k tokens · 1m 41s)

⏺ Task(Document table template context)
  ⎿ &nbsp;Done (14 tool uses · 85.2k tokens · 1m 44s)

⏺ Task(Document query/row template contexts)
  ⎿ &nbsp;Done (17 tool uses · 78.7k tokens · 1m 55s)

⏺ Task(Document special template contexts)
  ⎿ &nbsp;Done (26 tool uses · 116.1k tokens · 1m 50s)
</code></pre>
</blockquote>
<p>It churned away for a couple of minutes (those sub-agents execute in parallel, another benefit of this pattern) and made me this <a href="https://gist.github.com/simonw/5b7d87fe5ffa36e7a4c104fcb18fc097">
template_contexts.md
</a> markdown file. It's very comprehensive.</p>

    <p>Tags: <a href="https://simonwillison.net/tags/sub-agents">sub-agents</a>, <a href="https://simonwillison.net/tags/coding-agents">coding-agents</a>, <a href="https://simonwillison.net/tags/ai-assisted-programming">ai-assisted-programming</a>, <a href="https://simonwillison.net/tags/claude-code">claude-code</a>, <a href="https://simonwillison.net/tags/generative-ai">generative-ai</a>, <a href="https://simonwillison.net/tags/ai">ai</a>, <a href="https://simonwillison.net/tags/llms">llms</a></p>



]]></description>
      <pubDate>Sat, 11 Oct 2025 21:29:39 +0000</pubDate>
      <link>https://simonwillison.net/2025/Oct/11/sub-agents/#atom-everything</link>
      <dc:creator>Simon Willison&#39;s Weblog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4977006500</guid>
    </item>
    <item>
      <title><![CDATA[Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges]]></title>
      <description><![CDATA[<div id="blog-post__body" class="o-prose"> <p><strong>We rebuilt Devin for Claude Sonnet 4.5.</strong></p><p>The new version is 2x faster, 12% better on our Junior Developer Evals, and it's available now in Agent Preview. For users who prefer the old Devin, that remains available.</p><p>Why rebuild instead of just dropping the new Sonnet in place and calling it a day? Because this model works <em>differently</em>—in ways that broke our assumptions about how agents should be architected. Here's what we learned:</p><a class="image" href="https://cdn.sanity.io/images/2mc9cv2v/production/307b567295a50c82cb3fea942dd578f576beb5f0-2926x2074.png"><img src="https://cdn.sanity.io/images/2mc9cv2v/production/307b567295a50c82cb3fea942dd578f576beb5f0-2926x2074.png" alt="" class="is-loaded"></a><p>Because Devin is an agent that plans, executes, and iterates rather than just autocompleting code (or acting as a copilot), we get an unusual window into model capabilities. Each improvement compounds across our feedback loops, giving us a perspective on what's genuinely changed. With Sonnet 4.5, we're seeing the biggest leap since Sonnet 3.6 (the model that was used with Devin's GA): planning performance is up 18%, end-to-end eval scores up 12%, and multi-hour sessions are dramatically faster and more reliable.</p><p>In order to get these improvements, we had to rework Devin not just around some of the model’s new capabilities, but also a few new behaviors we never noticed in previous generations of models. We are sharing some of our observations below:</p><h2><a href="https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges#the-model-is-aware-of-its-context-window">The model is aware of its context window</a></h2><p>Sonnet 4.5 is the first model we've seen that is aware of its own context window, and this shapes how it behaves. 
As it approaches context limits, we've observed it proactively summarizing its progress and becoming more decisive about implementing fixes to close out tasks.</p><p>This "context anxiety" can actually hurt performance: we found the model taking shortcuts or leaving tasks incomplete when it believed it was near the end of its window, even when it had plenty of room left.</p><p>We ended up prompting pretty aggressively to override this behavior. Even then, we found that prompts at the start of the conversation weren't enough—we had to add reminders both at the beginning and the end of the prompt to keep it from prematurely wrapping up.</p><p>When researching ways to address this issue, we discovered one unexpected trick that worked well: <strong>enabling the 1M token beta but capping usage at 200k</strong>. This gave us a model that thinks it has plenty of runway and behaves normally, without the anxiety-driven shortcuts or degraded performance.</p><p>This behavior has real implications for how we architect around context management. When planning token budgets, we now need to factor in the model's own awareness: knowing when it will naturally want to summarize versus when we need to intervene with context compaction.</p><p>Interestingly, the model consistently underestimates how many tokens it has left—and it's very precise about these wrong estimates.</p><h2><a href="https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges#the-model-takes-a-lot-of-notes">The model takes a lot of notes</a></h2><p>One of the most striking shifts in Sonnet 4.5 is that it actively tries to build knowledge about the problem space through both documentation and experimentation.</p><p><strong>Writing notes for itself</strong></p><p>The model treats the file system as its memory without prompting. It frequently writes (or wants to write) summaries and notes (e.g. CHANGELOG.md, SUMMARY.md, but not CLAUDE.md nor AGENTS.md), both for the user and its own future reference.
This suggests the model has been trained to externalize state rather than rely purely on context. This behavior is more pronounced when the model is closer to the end of its context window.</p><p>When we saw this, we were interested in the possibility to potentially remove some of our own memory management and let the model handle it. But in practice, we found the summaries weren't comprehensive enough. For example, it would sometimes paraphrase the task, leaving out important details. When we relied on the model's own notes without our compacting and summarization systems, we saw performance degradation and gaps in specific knowledge: the model didn't know what it didn't know (or what it might need to know in the future). It's very likely that these notes can be improved with prompting. You just shouldn't think you get a perfect system for free.</p><p>In some cases, somewhat humorously, we've seen the agent spend more tokens writing summaries than actually solving the problem. We've also noticed that the model's level of effort is uneven: the model tends to generate more summary tokens the shorter the context window.</p><p>In our testing, we found this behavior useful in certain cases, but less effective than our existing memory systems when we explicitly directed the agent to use its previously generated state.</p><p>This is an interesting paradigm and a new axis for model development, especially for simpler agent architectures or systems built around subagent delegation. It's clearly a new direction from Anthropic: likely pointing toward a future where models are more context-aware and where this becomes the way multiple agents communicate with each other. 
The RL training hasn't fully progressed to the point where this is reliable yet, but we'll be tracking how it evolves.</p><p><strong>Testing to create feedback loops</strong></p><p>Sonnet 4.5 is notably more proactive about writing and executing short scripts and tests to create feedback loops, and shows good judgment about when to use this capability. This generally improves reliability on long-running tasks, though we've occasionally seen it attempt overly creative workarounds when debugging. For example, when editing a React app, we’ve noticed the models getting the HTML of the page to check their work along the way and ensure the behavior was correct. In another case, when trying to fix a seemingly innocent error related to two local servers trying to run on the same port, the model ended up using this behavior to create an overly complicated custom script instead of addressing the root cause issue (terminating the process).</p><h2><a href="https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges#the-model-works-in-parallel">The model works in parallel</a></h2><p>Sonnet 4.5 is efficient at maximizing actions per context window through parallel tool execution: running multiple bash commands at once, reading several files simultaneously, that sort of thing. Rather than working strictly sequentially (finish A, then B, then C), the model will overlap work where it can. It also shows decent judgment about self-verification: checking its work as it goes.</p><p>This is very noticeable in Windsurf, and was an improvement upon Devin's existing parallel capabilities. That being said, there are tradeoffs. Parallelism burns through context faster, which leads to the context anxiety we mentioned earlier. But when the model is running with an empty context window, this more concurrent approach makes sessions feel faster and more productive.
It's a subtle shift, but one that influenced how we thought about the architecture.</p><p>The model also seems to be trained to burn through parallel tool calls faster early in its context window, but becomes more cautious as it nears the limit. This suggests to us that it's been trained to be aware of how many output tokens its tool calls will produce.</p><h2><a href="https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges#what-were-exploring-next">What we're exploring next</a></h2><p>These behaviors open up many interesting avenues, and we haven't been able to explore them all yet. Here are some we're eager to continue testing:</p><ul><li><strong>Subagents and context-aware tool calls.</strong> The model's improved judgment about when to externalize state and create feedback loops suggests it might handle subagent delegation more effectively. However, <a href="https://cognition.ai/blog/dont-build-multi-agents">as we've learned</a>, you have to be very careful about when to use subagents because the context and state management gets complex quickly. Sonnet 4.5 seems more aware of the right types of tasks to delegate, which could make this more practical.</li><li><strong>Meta-agent prompting.</strong> We're particularly interested in how this model handles meta-level reasoning about agent workflows. Early experiments suggest it works well with verification systems — letting the model reason about its own development process rather than just executing tasks.</li><li><strong>Context-management models.</strong> Sonnet 4.5 seems to have some initial intuition around how to manage its own context. Custom models trained for intelligent context management might deliver both faster and better performance.</li></ul><p>We'll be sharing more as we learn what works (and what doesn't). In the meantime, we’re excited for you to try both the new Devin with Sonnet 4.5 and Windsurf.</p> </div>]]></description>
      <pubDate>Sun, 05 Oct 2025 18:54:43 +0000</pubDate>
      <link>https://cognition.ai/blog/devin-sonnet-4-5-lessons-and-challenges</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4969431558</guid>
    </item>
    <item>
      <title><![CDATA[simonw/claude-skills]]></title>
      <description><![CDATA[
    
<p><strong><a href="https://github.com/simonw/claude-skills">simonw/claude-skills</a></strong></p>
One of the tips I picked up from <a href="https://blog.fsck.com/2025/10/09/superpowers/">Jesse Vincent's Claude Code Superpowers post</a> (<a href="https://simonwillison.net/2025/Oct/10/superpowers/">previously</a>) was this:<p></p>
<blockquote>
<p>Skills are what give your agents Superpowers.</p>
<p>The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit -- I asked Claude to tell me all about its new skills. And it was <a href="https://claude.ai/share/0fe5a9c0-4e5a-42a1-9df7-c5b7636dad92">only too happy to dish</a>.</p>
</blockquote>
<p>In that transcript Jesse asks how the skill system works and Claude replies:</p>
<blockquote>
<p>Looking at the skills system, here's how it works:</p>
<p><strong>Current Skills</strong> (in <code>/mnt/skills/public/</code>):</p>
<ul>
<li><strong>docx</strong> - Word document handling</li>
<li><strong>pdf</strong> - PDF processing and form filling</li>
<li><strong>pptx</strong> - PowerPoint presentations</li>
<li><strong>xlsx</strong> - Excel spreadsheets</li>
</ul>
</blockquote>
<p>This looks like it's part of Claude's brand new Code Interpreter feature! I <a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/">wrote about that extensively</a> last month, but I missed that there was a <code>/mnt/skills/public/</code> folder full of fascinating implementation details.</p>
<p>So I fired up a fresh Claude instance (fun fact: Code Interpreter also works in the Claude iOS app now, which it didn't when they first launched) and prompted:</p>
<blockquote>
<p><code>Create a zip file of everything in your /mnt/skills folder</code></p>
</blockquote>
<p>This worked, and gave me a <code>.zip</code> to download. You can <a href="https://claude.ai/new?q=Create%20a%20zip%20file%20of%20everything%20in%20your%20%2Fmnt%2Fskills%20folder">run the prompt yourself here</a>, though you'll need to <a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/#switching-it-on-in-settings-features">enable the new feature first</a>.</p>
<p>I've pushed the contents of that zip to my <a href="https://github.com/simonw/claude-skills">new simonw/claude-skills GitHub repo</a>.</p>
<p>So now you can see the prompts Anthropic wrote to enable the creation and manipulation of the following files in their Claude consumer applications:</p>
<ul>
<li><a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/SKILL.md">pdf</a> - PDF files</li>
<li><a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/docx/SKILL.md">docx</a> - Microsoft Word</li>
<li><a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pptx/SKILL.md">pptx</a> - Microsoft PowerPoint decks</li>
<li><a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/xlsx/SKILL.md">xlsx</a> - Microsoft Excel</li>
</ul>
<p>In each case the prompts spell out detailed instructions for manipulating those file types using Python, using libraries that come pre-installed on Claude's containers.</p>
<p>Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.</p>
<p><a href="https://github.com/simonw/claude-skills/blob/initial/mnt/skills/public/pdf/scripts/fill_fillable_fields.py">pdf/scripts/fill_fillable_fields.py</a> for example is a custom CLI tool that uses <a href="https://pypi.org/project/pypdf/">pypdf</a> to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.</p>
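<p>The core of a script like that is small. Here's a minimal sketch of the same pypdf-based approach (the field names are hypothetical, and the pypdf import is deferred inside the function since it's a third-party dependency):</p>

```python
import json

def parse_field_spec(spec_json: str) -> dict:
    """Parse the JSON mapping of form-field names to values."""
    fields = json.loads(spec_json)
    if not isinstance(fields, dict):
        raise ValueError("field spec must be a JSON object")
    return {str(k): str(v) for k, v in fields.items()}

def fill_pdf_form(in_path: str, out_path: str, spec_json: str) -> None:
    """Fill the fillable fields of a PDF form and write the combined result."""
    from pypdf import PdfReader, PdfWriter  # deferred: third-party dependency

    fields = parse_field_spec(spec_json)
    reader = PdfReader(in_path)
    writer = PdfWriter()
    writer.append(reader)
    for page in writer.pages:
        writer.update_page_form_field_values(page, fields)
    with open(out_path, "wb") as f:
        writer.write(f)

# e.g. fill_pdf_form("form.pdf", "filled.pdf", '{"name": "Ada", "year": 1843}')
```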
<p>This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible - presumably deliberately - to users of Claude who know how to ask for them.


    </p><p>Tags: <a href="https://simonwillison.net/tags/pdf">pdf</a>, <a href="https://simonwillison.net/tags/python">python</a>, <a href="https://simonwillison.net/tags/ai">ai</a>, <a href="https://simonwillison.net/tags/prompt-engineering">prompt-engineering</a>, <a href="https://simonwillison.net/tags/generative-ai">generative-ai</a>, <a href="https://simonwillison.net/tags/llms">llms</a>, <a href="https://simonwillison.net/tags/anthropic">anthropic</a>, <a href="https://simonwillison.net/tags/claude">claude</a>, <a href="https://simonwillison.net/tags/code-interpreter">code-interpreter</a>, <a href="https://simonwillison.net/tags/jesse-vincent">jesse-vincent</a>, <a href="https://simonwillison.net/tags/skills">skills</a></p>



]]></description>
      <pubDate>Fri, 10 Oct 2025 23:57:19 +0000</pubDate>
      <link>https://simonwillison.net/2025/Oct/10/claude-skills/#atom-everything</link>
      <dc:creator>Simon Willison&#39;s Weblog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4976171367</guid>
    </item>
    <item>
      <title><![CDATA[LoRA Without Regret]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Mon, 29 Sep 2025 21:43:20 +0000</pubDate>
      <link>https://thinkingmachines.ai/blog/lora/</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4962275951</guid>
    </item>
    <item>
      <title><![CDATA[Policy Gradient, Policy Iteration, and in between]]></title>
<description><![CDATA[<div class="post-content"> <p>Policy gradient (PG) and actor-critic are the most trendy RL methods today. As a theorist, it often gives me a headache when people bring up all sorts of variants of PG and ask me for opinions, and all I can say is …</p> <p><img src="https://nanjiang.cs.illinois.edu/files/pic/a7i88k.jpg"></p> <p>Part of the reason is that many algorithms adopt changes to the original PG that look fairly heuristic, and I cannot find a conceptually simple and satisfactory framework for understanding and deriving them from first principles. This situation changed a little bit when <a href="https://lihongli.github.io/">Lihong</a> recently asked me about PPO and shared some interesting empirical phenomena, which got me thinking and organizing some ideas that have been in the back of my mind for a long time. So this post will share some perspectives on how I view these algorithms. <a id="sec1"></a></p> <h4 id="policy-gradient-implemented-incorrectly"><strong>Policy Gradient, Implemented (In)Correctly</strong></h4> <p>Let’s start by reviewing the good old PG, which is
\[\nabla J(\pi) = \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d^\pi}[Q^\pi(s,a) \nabla \log \pi(s,a)],\]
and \(Q^\pi(s,a)\) can be replaced by the advantage function \(A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\), or the Monte-Carlo return starting from \((s,a)\) under policy \(\pi\).</p> <p>In reality, however, a number of approximations are often deployed that depart from the equation above, with my comments at the end in parentheses:</p> <ol> <li>The true \(A^\pi\) is often unknown, so we may instead use an approximate version \(\hat{A}\), often learned via TD and referred to as the critic. (Totally reasonable.)</li> <li>In principle, after every update to the actor \(\pi\), we should train a new critic for the new policy, ideally to convergence. But that’s obviously computationally too intensive, so we often just constantly update the critic, knowing that it likely has not converged to (the best approximation of) the advantage for the current policy. Theoretically, an updating scheme that reproduces the same issue (and is easier to study) is to freeze the critic, update the actor for a while, and then update the critic.</li> <li>The equation requires on-policy sampling \((s,a)\sim d^\pi\). While most (but not all) practical variants still respect the action on-policyness, to my knowledge nobody cares about \(s \sim d^\pi\).<sup id="fnref:phil"><a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fn:phil" class="footnote">1</a></sup> Instead of \(\mathbb{E}_{d^\pi}\), people often just write \(\mathbb{E}_t\) to refer to a recent batch. (RAGE METER BUILDING UP…)</li> <li>Things get worse in the off-policy setting, where an importance-weighting correction is needed. In practice, the policy in the denominator of the importance weight does not always match the one that generated the data used for updating the actor, which breaks importance sampling. (At this point man, what can I say.)</li> <li>… and so on.</li>
</ol> <h4 id="policy-iteration-"><strong>Policy Iteration</strong> <a id="eq1"></a></h4> <p>Despite my complaints about all these “wrong” practices, there is an angle that makes a lot of them justifiable. Consider the following: let’s push the alternating update in #2 to its extreme: in each round, we freeze \(A^\pi\) and keep updating the actor \(\pi\) until convergence. If the actor parameterization is sufficiently expressive, we would get</p> \[\pi_{\textrm{new}} \to \textrm{greedy}(Q^{\pi_{\textrm{old}}}), \qquad (1)\] <p>which is policy iteration (PI).<sup id="fnref:pi"><a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fn:pi" class="footnote">2</a></sup> Now, the textbook material on PI, viewed as a planning algorithm, lifts many of the restrictions of PG, including:</p> <ul> <li>You can optimize the policy very far (all the way to greedy w.r.t. the current critic) before updating the critic.</li> <li>There is no need to be on-policy. In fact, the policy improvement step does not even involve any notion of data distribution.</li>
</ul> <p>All these relaxations are nice. But they of course come at a price; otherwise we could simply use <a href="https://link.springer.com/content/pdf/10.1007/s10994-007-5038-2.pdf">fitted PI</a> and would not need PG at all. As I will explain below, my view is that <strong>PI relaxes the requirement on data distributions at the cost of strong (and sometimes unrealistic) assumptions on (1) policy structure, and (2) critic quality.</strong> We need strict PG if none of these assumptions hold. However, when some of them are reasonable, we may relax PG towards PI and enjoy the resulting flexibility.</p> <h4 id="on-policy-structure"><strong>On Policy Structure</strong></h4> <p>In its original form, PI does not come with a separate actor; the policy is implicitly induced as the greedy policy of the value functions. For PI to be properly written in the form of Eq. <a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#eq1">(1)</a>, a key assumption is that</p> \[\textrm{greedy}(A^{\pi}) \in \Pi := \{\pi_\theta: \theta \in \Theta\},\] <p>that is, the greedy policy w.r.t.
the advantage function (or its approximation \(\hat{A}\) in the learning setting) can always be well-approximated by our policy parameterization \(\Pi\). This is because the PD lemma tells us that</p> \[J(\pi) - J(\pi_{\textrm{old}}) = \frac{1}{1-\gamma}\mathbb{E}_{d^\pi}[A^{\pi_{\textrm{old}}}(s,\pi)],\] <p>where <span class="katex"><span class="katex-mathml"><math><semantics><mrow><msup><mi>A</mi><msub><mi>π</mi><mtext>old</mtext></msub></msup><mo>(</mo><mi>s</mi><mo>,</mo><mi>π</mi><mo>)</mo><mo>=</mo><msup><mi>Q</mi><msub><mi>π</mi><mtext>old</mtext></msub></msup><mo>(</mo><mi>s</mi><mo>,</mo><mi>π</mi><mo>)</mo><mo>−</mo><msup><mi>Q</mi><msub><mi>π</mi><mtext>old</mtext></msub></msup><mo>(</mo><mi>s</mi><mo>,</mo><msub><mi>π</mi><mtext>old</mtext></msub><mo>)</mo></mrow><annotation>A^{\pi_{\textrm{old}}}(s,\pi) = Q^{\pi_{\textrm{old}}}(s,\pi) - Q^{\pi_{\textrm{old}}}(s,\pi_{\textrm{old}})</annotation></semantics></math></span><span class="katex-html"><span class="base"><span
class="strut"></span><span class="mord"><span class="mord mathnormal">A</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">π</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord textrm mtight">old</span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">s</span><span class="mpunct">,</span><span class="mspace"></span><span class="mord mathnormal">π</span><span class="mclose">)</span><span class="mspace"></span><span class="mrel">=</span><span class="mspace"></span></span><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">π</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord textrm mtight">old</span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span 
class="mopen">(</span><span class="mord mathnormal">s</span><span class="mpunct">,</span><span class="mspace"></span><span class="mord mathnormal">π</span><span class="mclose">)</span><span class="mspace"></span><span class="mbin">−</span><span class="mspace"></span></span><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">π</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord textrm mtight">old</span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">s</span><span class="mpunct">,</span><span class="mspace"></span><span class="mord"><span class="mord mathnormal">π</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord textrm mtight">old</span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>. 
PI finds a new policy that is guaranteed to improve because the new policy has non-negative advantage over the old one in all states simultaneously, which is the key property that makes PI insensitive to data distributions. When we mimic PI using “critic-frozen-PG”, we can guarantee this property by choosing the tabular softmax policy, which has a <strong>factorized structure</strong> in the sense that the policy can choose the action distributions on each state independently.</p> <p>In contrast, when we use a policy parameterization that entangles the action distributions across different states, we may run into the problem that <strong>the gradients of <span class="katex"><span class="katex-mathml"><math><semantics><mrow><msup><mi>A</mi><mi>π</mi></msup><mo>(</mo><mi>s</mi><mo>,</mo><mi>π</mi><mo>)</mo></mrow><annotation>A^\pi(s, \pi)</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">A</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">π</span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">s</span><span class="mpunct">,</span><span class="mspace"></span><span class="mord mathnormal">π</span><span class="mclose">)</span></span></span></span> for different <span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>s</mi></mrow><annotation>s</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord mathnormal">s</span></span></span></span> may not be aligned with each other, and making improvements in some states may cause performance degradation in other states</strong>. 
In this case, the PG theorem is the final safety net that guarantees overall improvement in <span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>J</mi><mo>(</mo><mi>π</mi><mo>)</mo></mrow><annotation>J(\pi)</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord mathnormal">J</span><span class="mopen">(</span><span class="mord mathnormal">π</span><span class="mclose">)</span></span></span></span>: it tells us that the correct trade-off between the conflicting objectives at different states is <strong>precisely given by the distribution <span class="katex"><span class="katex-mathml"><math><semantics><mrow><msup><mi>d</mi><mi>π</mi></msup></mrow><annotation>d^\pi</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">π</span></span></span></span></span></span></span></span></span></span></span></strong>.</p> <h4 id="on-critic-quality"><strong>On Critic Quality</strong></h4> <p>Another major assumption behind PI is that <span class="katex"><span class="katex-mathml"><math><semantics><mrow><msup><mi>A</mi><mi>π</mi></msup></mrow><annotation>A^\pi</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">A</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">π</span></span></span></span></span></span></span></span></span></span></span> is either given or its approximation <span class="katex"><span
class="katex-mathml"><math><semantics><mrow><mover><mi>A</mi><mo>^</mo></mover></mrow><annotation>\hat{A}</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord accent"><span class="vlist-t"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="mord mathnormal">A</span></span><span><span class="pstrut"></span><span class="accent-body"><span class="mord">^</span></span></span></span></span></span></span></span></span></span> is universally good across the state-action space, which is rarely the case in practical learning settings. A much more reasonable assumption is that the critic is only accurate under the distribution of states and actions induced by the current policy (see the seminal <a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf">CPI</a> paper, though there is more nuance to this assumption<sup id="fnref:pspi"><a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fn:pspi" class="footnote">3</a></sup>), or the historical data, since this is where the critic is trained. Under this assumption, we cannot freeze the critic and optimize the actor indefinitely, as it will run into the regime where the critic is poor due to the lack of data.
Instead, we need to stop once <span class="katex"><span class="katex-mathml"><math><semantics><mrow><mi>π</mi></mrow><annotation>\pi</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord mathnormal">π</span></span></span></span> has moved a bit away from <span class="katex"><span class="katex-mathml"><math><semantics><mrow><msub><mi>π</mi><mtext>old</mtext></msub></mrow><annotation>\pi_{\textrm{old}}</annotation></semantics></math></span><span class="katex-html"><span class="base"><span class="strut"></span><span class="mord"><span class="mord mathnormal">π</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist"><span><span class="pstrut"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord text mtight"><span class="mord textrm mtight">old</span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist"><span></span></span></span></span></span></span></span></span></span>, collect data, and re-train the critic to make sure we never get into the completely uncharted regime and stay within the <a href="http://proceedings.mlr.press/v37/schulman15.pdf">trust region</a>.</p> <h4 id="understanding-the-practical-relaxations"><strong>Understanding the Practical Relaxations</strong></h4> <p>The takeaway message of the above discussion is that PI makes a number of implicit assumptions, which allow it to be much less restricted than PG in several corresponding aspects.
Judging from how people tweak PG in practice (as mentioned at the <a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#sec1">beginning</a>), we can “reverse-engineer” a reasonable guess of why those tweaks are successful: given that people care less about on-policyness but enforce the trust region pretty carefully, we can conjecture that <strong>(1) modern neural-net architectures are rich enough that conflicting objectives across states are less of an issue</strong> (especially when actor and critic share layers),<sup id="fnref:update"><a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fn:update" class="footnote">4</a></sup> and <strong>(2) the critic is still only good under existing data</strong>, which necessitates a trust region. <em>(Update 09/30)</em> After I showed this post to Wei Xiong, he told me that there are <a href="https://arxiv.org/pdf/2507.18071">recent</a> <a href="https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda">works</a> that start to take distribution issues in PG more seriously.<sup id="fnref:conclude"><a href="https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fn:conclude" class="footnote">5</a></sup> I guess we will just have to wait and see how things play out.</p> </div>]]></description>
      <pubDate>Mon, 29 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://nanjiang.cs.illinois.edu/2025/09/29/pg.html#fnref:pi</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4963736315</guid>
    </item>
    <item>
      <title><![CDATA[Designing agentic loops]]></title>
      <description><![CDATA[
    <p>Coding agents like Anthropic's <a href="https://claude.com/product/claude-code">Claude Code</a> and OpenAI's <a href="https://github.com/openai/codex">Codex CLI</a> represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly exercise the code they are writing, correct errors, dig through existing implementation details, and even run experiments to find effective code solutions to problems.</p>
<p>As is so often the case with modern AI, there is a great deal of depth involved in unlocking the full potential of these new tools.</p>
<p>A critical new skill to develop is <strong>designing agentic loops</strong>.</p>
<p>One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal, a coding agent can often brute force its way to an effective solution.</p>
<p>My preferred definition of an LLM agent is something that <a href="https://simonwillison.net/2025/Sep/18/agents/">runs tools in a loop to achieve a goal</a>. The art of using them well is to carefully design the tools and loop for them to use.</p>
<ul>
  <li><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#the-joy-of-yolo-mode">The joy of YOLO mode</a></li>
  <li><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#picking-the-right-tools-for-the-loop">Picking the right tools for the loop</a></li>
  <li><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#issuing-tightly-scoped-credentials">Issuing tightly scoped credentials</a></li>
  <li><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#when-to-design-an-agentic-loop">When to design an agentic loop</a></li>
  <li><a href="https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#this-is-still-a-very-fresh-area">This is still a very fresh area</a></li>
</ul>
<h4 id="the-joy-of-yolo-mode">The joy of YOLO mode</h4>
<p>Agents are inherently dangerous - they can make poor decisions or fall victim to malicious <a href="https://simonwillison.net/tags/prompt-injection/">prompt injection attacks</a>, either of which can lead to harmful tool calls. Since the most powerful coding agent tool is "run this command in the shell", a rogue agent can do anything that you could do by running a command yourself.</p>
<p>To <a href="https://simonwillison.net/2025/Jun/5/wrecking-its-environment-in-a-loop/">quote Solomon Hykes</a>:</p>
<blockquote>
<p><strong>An AI agent is an LLM wrecking its environment in a loop.</strong></p>
</blockquote>
<p>Coding agents like Claude Code counter this by defaulting to asking you for approval of almost every command that they run.</p>
<p>This is kind of tedious, but more importantly, it dramatically reduces their effectiveness at solving problems through brute force.</p>
<p>Each of these tools provides its own version of what I like to call YOLO mode, where everything gets approved by default.</p>
<p>This is <em>so dangerous</em>, but it's also key to getting the most productive results!</p>
<p>Here are three key risks to consider from unattended YOLO mode.</p>
<ol>
<li>Bad shell commands deleting or mangling things you care about.</li>
<li>Exfiltration attacks where something steals files or data visible to the agent - source code or secrets held in environment variables are particularly vulnerable here.</li>
<li>Attacks that use your machine as a proxy to attack another target - for DDoS or to disguise the source of other hacking attacks.</li>
</ol>
<p>If you want to run YOLO mode anyway, you have a few options:</p>
<ol>
<li>Run your agent in a secure sandbox that restricts the files and secrets it can access and the network connections it can make.</li>
<li>Use someone else's computer. That way if your agent goes rogue, there's only so much damage it can do, including wasting someone else's CPU cycles.</li>
<li>Take a risk! Try to avoid exposing it to potential sources of malicious instructions and hope you catch any mistakes before they cause any damage.</li>
</ol>
<p>Most people choose option 3.</p>
<p>Despite the existence of <a href="https://attack.mitre.org/techniques/T1611/">container escapes</a> I think option 1, using Docker or the new Apple <a href="https://github.com/apple/container">container tool</a>, is a reasonable risk for most people to accept.</p>
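<p>As a concrete illustration of option 1, a locked-down container invocation might look something like this sketch. The image name and mount path are placeholders, and in practice you'd swap <code>--network=none</code> for an outbound allowlist, since the agent still needs to reach its model API:</p>
<pre><code># Hypothetical sketch - image name and paths are placeholders.
# --cap-drop=ALL removes Linux capabilities the agent doesn't need,
# --pids-limit bounds runaway process spawning, and the volume mount
# exposes only the current project directory to the agent.
docker run --rm -it \
  --network=none \
  --cap-drop=ALL \
  --pids-limit=256 \
  -v "$PWD:/workspace" -w /workspace \
  my-agent-image \
  claude --dangerously-skip-permissions
</code></pre>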
<p>Option 2 is my favorite. I like to use <a href="https://github.com/features/codespaces">GitHub Codespaces</a> for this - it provides a full container environment on-demand that's accessible through your browser and has a generous free tier too. If anything goes wrong it's a Microsoft Azure machine somewhere that's burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository.</p>
<p>There are plenty of other agent-like tools that run code on other people's computers. <a href="https://simonwillison.net/tags/code-interpreter/">Code Interpreter</a> mode in both ChatGPT and <a href="https://simonwillison.net/2025/Sep/9/claude-code-interpreter/">Claude</a> can go a surprisingly long way here. I've also had a lot of success (ab)using OpenAI's <a href="https://chatgpt.com/features/codex">Codex Cloud</a>.</p>
<p>Coding agents themselves implement various levels of sandboxing, but so far I've not seen convincing enough documentation of these to trust them.</p>
<p><strong>Update</strong>: It turns out Anthropic have their own documentation on <a href="https://www.anthropic.com/engineering/claude-code-best-practices#d-safe-yolo-mode">Safe YOLO mode</a> for Claude Code which says:</p>
<blockquote>
<p>Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use <code>--dangerously-skip-permissions</code> in a container without internet access. You can follow this <a href="https://github.com/anthropics/claude-code/tree/main/.devcontainer">reference implementation</a> using Docker Dev Containers.</p>
</blockquote>
<p>Locking internet access down to a <a href="https://github.com/anthropics/claude-code/blob/5062ed93fc67f9322f807ecbf391ae4376cf8e83/.devcontainer/init-firewall.sh#L66-L75">list of trusted hosts</a> is a great way to prevent exfiltration attacks from stealing your private source code.</p>
<h4 id="picking-the-right-tools-for-the-loop">Picking the right tools for the loop</h4>
<p>Now that we've found a safe (enough) way to run in YOLO mode, the next step is to decide which tools we need to make available to the coding agent.</p>
<p>You can bring <a href="https://modelcontextprotocol.io/">MCP</a> into the mix at this point, but I find it's usually more productive to think in terms of shell commands instead. Coding agents are <em>really good</em> at running shell commands!</p>
<p>If your environment allows them the necessary network access, they can also pull down additional packages from NPM and PyPI and similar. Ensuring your agent runs in an environment where random package installs don't break things on your main computer is an important consideration as well!</p>
<p>Rather than leaning on MCP, I like to create an <a href="https://agents.md/">AGENTS.md</a> (or equivalent) file with details of packages I think they may need to use.</p>
<p>For a project that involved taking screenshots of various websites I installed my own <a href="https://shot-scraper.datasette.io/">shot-scraper</a> CLI tool and dropped the following in <code>AGENTS.md</code>:</p>
<pre><code>To take a screenshot, run:

shot-scraper http://www.example.com/ -w 800 -o example.jpg
</code></pre>
<p>Just that one example is enough for the agent to guess how to swap out the URL and filename for other screenshots.</p>
<p>Good LLMs already know how to use a bewildering array of existing tools. If you say "use <a href="https://playwright.dev/python/">playwright python</a>" or "use ffmpeg" most models will use those effectively - and since they're running in a loop they can usually recover from mistakes they make at first and figure out the right incantations without extra guidance.</p>
<h4 id="issuing-tightly-scoped-credentials">Issuing tightly scoped credentials</h4>
<p>In addition to exposing the right commands, we also need to consider what credentials we should expose to those commands.</p>
<p>Ideally we wouldn't need any credentials at all - plenty of work can be done without signing into anything or providing an API key - but certain problems will require authenticated access.</p>
<p>This is a deep topic in itself, but I have two key recommendations here:</p>
<ol>
<li>Try to provide credentials to test or staging environments where any damage can be well contained.</li>
<li>If a credential can spend money, set a tight budget limit.</li>
</ol>
<p>I'll use an example to illustrate. A while ago I was investigating slow cold start times for a scale-to-zero application I was running on <a href="https://fly.io/">Fly.io</a>.</p>
<p>I realized I could work a lot faster if I gave Claude Code the ability to directly edit Dockerfiles, deploy them to a Fly account and measure how long they took to launch.</p>
<p>Fly allows you to create organizations, and you can set a budget limit for those organizations and issue a Fly API key that can only create or modify apps within that organization...</p>
<p>So I created a dedicated organization for just this one investigation, set a $5 budget, issued an API key and set Claude Code loose on it!</p>
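<p>Those steps can be sketched with flyctl. The subcommands below are my best understanding of the current CLI and may differ between versions, and the budget limit itself is set in the Fly.io dashboard rather than from the command line:</p>
<pre><code># Hypothetical sketch - the organization name is a placeholder.
fly orgs create agent-scratch   # throwaway org for one investigation
# (set the $5 budget limit on that org in the Fly.io dashboard)
# mint an API token scoped to just that org, and expose only it to the agent:
export FLY_API_TOKEN="$(fly tokens create org -o agent-scratch)"
</code></pre>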
<p>In that particular case the results weren't useful enough to describe in more detail, but this was the project where I first realized that "designing an agentic loop" was an important skill to develop.</p>
<h4 id="when-to-design-an-agentic-loop">When to design an agentic loop</h4>
<p>Not every problem responds well to this pattern of working. What to look out for here are problems with <strong>clear success criteria</strong> where finding a good solution is likely to involve (potentially slightly tedious) <strong>trial and error</strong>.</p>
<p>Any time you find yourself thinking "ugh, I'm going to have to try a lot of variations here" is a strong signal that an agentic loop might be worth trying!</p>
<p>A few examples:</p>
<ul>
<li>
<strong>Debugging</strong>: a test is failing and you need to investigate the root cause. Coding agents that can already run your tests can likely do this without any extra setup.</li>
<li>
<strong>Performance optimization</strong>: this SQL query is too slow, would adding an index help? Have your agent benchmark the query and then add and drop indexes (in an isolated development environment!) to measure their impact.</li>
<li>
<strong>Upgrading dependencies</strong>: you've fallen behind on a bunch of dependency upgrades? If your test suite is solid, an agentic loop can upgrade them all for you and make any minor updates needed to reflect breaking changes. Make sure a copy of the relevant release notes is available, or that the agent knows where to find them itself.</li>
<li>
<strong>Optimizing container sizes</strong>: Docker container feeling uncomfortably large? Have your agent try different base images and iterate on the Dockerfile to try to shrink it, while keeping the tests passing.</li>
</ul>
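<p>For the performance optimization example above, the loop the agent runs can be sketched against a disposable copy of the database. Everything here is made up for illustration - the database file, table, column, and index names:</p>
<pre><code># Hypothetical sketch - run against a throwaway copy, never production.
sqlite3 dev-copy.db &lt;&lt;'SQL'
.timer on
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42;
CREATE INDEX idx_orders_user ON orders(user_id);
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42;
DROP INDEX idx_orders_user;
SQL
</code></pre>
<p>The agent can repeat that cycle for each candidate index, keeping only the ones that measurably speed the query up.</p>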
<p>A common theme in all of these is <strong>automated tests</strong>. The value you can get from coding agents and other LLM coding tools is massively amplified by a good, cleanly passing test suite. Thankfully LLMs are great for accelerating the process of putting one of those together, if you don't have one yet.</p>
<h4 id="this-is-still-a-very-fresh-area">This is still a very fresh area</h4>
<p><strong>Designing agentic loops</strong> is a very new skill - Claude Code was <a href="https://www.anthropic.com/news/claude-3-7-sonnet">first released</a> in just February 2025!</p>
<p>I'm hoping that giving it a clear name can help us have productive conversations about it. There's <em>so much more</em> to figure out about how to use these tools as effectively as possible.</p>
    
        <p>Tags: <a href="https://simonwillison.net/tags/definitions">definitions</a>, <a href="https://simonwillison.net/tags/ai">ai</a>, <a href="https://simonwillison.net/tags/generative-ai">generative-ai</a>, <a href="https://simonwillison.net/tags/llms">llms</a>, <a href="https://simonwillison.net/tags/ai-assisted-programming">ai-assisted-programming</a>, <a href="https://simonwillison.net/tags/ai-agents">ai-agents</a>, <a href="https://simonwillison.net/tags/coding-agents">coding-agents</a></p>
    

]]></description>
      <pubDate>Tue, 30 Sep 2025 15:20:46 +0000</pubDate>
      <link>https://simonwillison.net/2025/Sep/30/designing-agentic-loops/#atom-everything</link>
      <dc:creator>Simon Willison&#39;s Weblog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4963219722</guid>
    </item>
    <item>
      <title><![CDATA[Building agents with the Claude Agent SDK]]></title>
<description><![CDATA[The Claude Agent SDK is a collection of tools that helps developers build powerful agents on top of Claude Code. In this article, we walk through how to get started and share our best practices.]]></description>
      <pubDate>Mon, 29 Sep 2025 17:04:11 +0000</pubDate>
      <link>https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk</link>
      <dc:creator>Anthropic Engineering Blog</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4962019488</guid>
    </item>
    <item>
      <title><![CDATA[The second wave of MCP: Building for LLMs, not developers]]></title>
      <description><![CDATA[<div class="stack-module__UbbKhW__stack stack article-module__9eLiuW__content"><div class="stack-module__UbbKhW__stack stack w-full flex-wrap gap-4 pt-2"><time class="text-module__eElGjq__wrapper order-2 md:order-3">Sep 9, 2025</time></div><div class="stack-module__UbbKhW__stack stack w-full flex-wrap gap-6"><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">When the MCP standard first launched, many teams rushed to ship something. Many servers ended up as thin wrappers around existing APIs with minimal changes. A quick way to say "we support MCP".</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">At the time, this made sense. MCP was new, teams wanted to get something out quickly, and the obvious approach was mirroring existing API structures. Why reinvent when you could repackage?</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">But the problem with this approach is LLMs don’t work like developers. They don’t reuse past code or keep long term state. Each conversation starts fresh. LLMs have to rediscover which tools exist, how to use them, and in what order. With low level API wrappers, this leads to repeated orchestration, inconsistent behavior, and wasted effort as LLMs repeatedly solve the same puzzles.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">MCP works best when tools handle complete user intentions rather than exposing individual API operations. One tool that deploys a project end-to-end works better than four tools that each handle a piece of the deployment pipeline.</p></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">The key difference is context and state management. When you write code, you keep track of information between API calls. 
You store the project ID from the create call, check deployment status before adding the domain, and wrap error handling around each step so failures don't break the entire process.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">LLMs work differently. Each conversation starts fresh, with no memory of earlier ones. While they can see tool results within the current conversation, they have to figure out the right sequence of tools from whatever is available. When those tools are low-level API wrappers, the LLM has to orchestrate multiple calls and manage the complexity of chaining them together every time.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">For example, take deploying a project with the Vercel API. A developer might write code like this:</p><div class="stack-module__UbbKhW__stack stack blog-code-block-module__zaYDqW__wrapper"><div class="relative code-block-module__NOThwW__wrapper blog-code-block-module__zaYDqW__codeBlock"><pre class="prism-code language-tsx code-block-module__NOThwW__pre blog-code-block-module__zaYDqW__maxHeight"><code class="code-block-module__NOThwW__code">const project = await client.projects.create({
  name: domain.replace(/\./g, '-'),
  gitRepository: { repo: repoUrl }
});

await client.projects.createProjectEnv({
  idOrName: project.id,
  requestBody: Object.entries(env).map(([key, value]) => ({
    key,
    value,
    target: ['production', 'preview', 'development'],
  }))
});

const deployment = await client.deployments.createDeployment({
  requestBody: {
    name: project.name,
    gitSource: { type: 'github', repo: repo.replace('.git', ''), ref: 'main' }
  }
});

await client.projects.addProjectDomain({
  idOrName: project.id,
  requestBody: { domain: domain }
});</code></pre></div></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">This approach involves nested configurations, ID management across
calls, and parsing repository URLs. A developer solves this once and reuses the code, but an LLM faces the puzzle fresh each time, often getting the nesting wrong or forgetting required fields.</p><h2 class="text-module__eElGjq__wrapper pt-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:pt-6 [#changelog-description_&amp;:first-child]:pt-0 w-full group relative text-module__fl7lUq__subheading"><a class="flex items-center justify-center flex-col absolute left-0 bottom-0 mt-auto mb-auto w-8 h-8 rounded-full opacity-0 group-hover:opacity-100 -translate-x-3/4 group-hover:-translate-x-full delay-75 duration-300 transition-all text-gray-900 hover:text-gray-1000 hover:scale-110 focus-visible:-translate-x-full focus-visible:opacity-100 top-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:hidden" href="https://vercel.com/blog/the-second-wave-of-mcp-building-for-llms-not-developers?utm_source=substack&amp;utm_medium=email#single-workflow-tools-vs-multiple-endpoints" id="single-workflow-tools-vs-multiple-endpoints"><svg height="16" width="16"><path></path></svg><span class="sr-only">Link to heading</span></a>Single workflow tools vs multiple endpoints</h2><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">The solution is building tools around complete user goals rather than API capabilities. 
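</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">As a minimal sketch in plain TypeScript (the helper names here are hypothetical stand-ins, not Vercel's SDK), one intention-level function can own the sequencing and ID passing that an LLM would otherwise have to reassemble on every call:</p>

```typescript
// Sketch of an intention-based tool implemented as plain code.
// All helpers are hypothetical stand-ins for real platform API calls.
type Project = { id: string; name: string };

async function createProject(repoUrl: string): Promise<Project> {
  return { id: "proj_123", name: repoUrl.replace(".git", "").split("/").pop() ?? "app" };
}
async function addEnvVars(_projectId: string, _env: Record<string, string>): Promise<void> {}
async function deploy(_projectId: string): Promise<{ durationSeconds: number }> {
  return { durationSeconds: 45 };
}
async function addDomain(_projectId: string, _domain: string): Promise<void> {}

// The single workflow tool: sequencing, ID passing, and error handling
// live here, and the result is a conversational summary.
async function deployProjectWorkflow(
  repoUrl: string,
  domain: string,
  env: Record<string, string>,
): Promise<string> {
  const project = await createProject(repoUrl);
  await addEnvVars(project.id, env);
  const { durationSeconds } = await deploy(project.id);
  await addDomain(project.id, domain);
  return `Project deployed at ${domain}. Build completed in ${durationSeconds}s.`;
}
```

<p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">The exact shape will differ per platform, but none of this sequencing needs a model in the loop.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">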
Instead of four separate tools, create one <code>deploy_project</code> tool that handles the entire workflow internally.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">This changes everything about tool design:</p><div class="scroller-module__6hSglq__overlayContainer text-module__fl7lUq__tableScroller scroller-module__6hSglq__isHorizontal"><div class="scroller-module__6hSglq__scroller text-module__fl7lUq__scrollerContainer"><div class=""><table class="text-module__fl7lUq__table"><tbody><tr><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph"><b>API-shaped tools</b></p></td><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph"><b>Intention-based tools</b></p></td></tr><tr><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph"><code>create_project</code>, <code>add_env</code>, <code>deploy</code>, <code>add_domain</code></p></td><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph"><code>deploy_project</code></p></td></tr><tr><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Multiple calls with state management</p></td><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Single atomic operation</p></td></tr><tr><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Returns technical status codes</p></td><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Returns conversational updates</p></td></tr><tr><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">LLM assembles the workflow</p></td><td><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Tool owns the complete process</p></td></tr></tbody></table></div></div></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Consider the difference in practice. 
An API-shaped MCP server might expose these tools:</p><div class="stack-module__UbbKhW__stack stack blog-code-block-module__zaYDqW__wrapper"><div class="relative code-block-module__NOThwW__wrapper blog-code-block-module__zaYDqW__codeBlock"><pre class="prism-code language-bash code-block-module__NOThwW__pre blog-code-block-module__zaYDqW__maxHeight"><code class="code-block-module__NOThwW__code"><div class="line" id="C4ad49e7a-L1"><p class="token-line"><span class="token plain">create_project</span><span class="token punctuation">(</span><span class="token plain">name, repo</span><span class="token punctuation">)</span><span class="token plain"></span></p></div><div class="line" id="C4ad49e7a-L2"><p class="token-line"><span class="token plain">add_environment_variables</span><span class="token punctuation">(</span><span class="token plain">project_id, variables</span><span class="token punctuation">)</span><span class="token plain"></span></p></div><div class="line" id="C4ad49e7a-L3"><p class="token-line"><span class="token plain">create_deployment</span><span class="token punctuation">(</span><span class="token plain">project_id, branch</span><span class="token punctuation">)</span><span class="token plain"></span></p></div><div class="line" id="C4ad49e7a-L4"><p class="token-line"><span class="token plain">add_domain</span><span class="token punctuation">(</span><span class="token plain">project_id, domain</span><span class="token punctuation">)</span></p></div></code></pre></div></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">The LLM has to call each tool in sequence, pass IDs between calls, and handle potential failures at each step.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">An intention-based tool looks different:</p><div class="stack-module__UbbKhW__stack stack blog-code-block-module__zaYDqW__wrapper"><div class="relative code-block-module__NOThwW__wrapper 
blog-code-block-module__zaYDqW__codeBlock"><pre class="prism-code language-bash code-block-module__NOThwW__pre blog-code-block-module__zaYDqW__maxHeight"><code class="code-block-module__NOThwW__code"><div class="line" id="C7a8d66f1-L1"><p class="token-line"><span class="token plain">deploy_project</span><span class="token punctuation">(</span><span class="token plain">repo_url, domain, environment_variables, </span><span class="token assign-left variable">branch</span><span class="token operator">=</span><span class="token string">"main"</span><span class="token punctuation">)</span></p></div></code></pre></div></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">This single tool handles the complete workflow internally and returns a conversational response. Instead of <code>{ status: 200, data: { id: "proj_123" } }</code>, the LLM can respond with "Project deployed at <code>example.com</code>. Build completed in 45s. All systems running normally."</p><h2 class="text-module__eElGjq__wrapper pt-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:pt-6 [#changelog-description_&amp;:first-child]:pt-0 w-full group relative text-module__fl7lUq__subheading"><a class="flex items-center justify-center flex-col absolute left-0 bottom-0 mt-auto mb-auto w-8 h-8 rounded-full opacity-0 group-hover:opacity-100 -translate-x-3/4 group-hover:-translate-x-full delay-75 duration-300 transition-all text-gray-900 hover:text-gray-1000 hover:scale-110 focus-visible:-translate-x-full focus-visible:opacity-100 top-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:hidden" href="https://vercel.com/blog/the-second-wave-of-mcp-building-for-llms-not-developers?utm_source=substack&amp;utm_medium=email#designing-workflow-based-mcp-tools" id="designing-workflow-based-mcp-tools"><svg height="16" width="16"><path></path></svg><span class="sr-only">Link to heading</span></a>Designing workflow based MCP tools</h2><p 
class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Start by testing the workflow manually before writing any code. Take a real user request like "set up my project with authentication and a database" and walk through it step by step using your existing APIs. The parts that feel tedious or repetitive are good candidates for a single MCP tool.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Think of MCP tools as tailored toolkits that help an AI achieve a particular task, not as API mirrors. A single MCP tool may sit in front of multiple APIs and a fair amount of business logic. If users think of something as one workflow, design it as one tool.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Here's how to structure a complete workflow tool:</p><div class="stack-module__UbbKhW__stack stack blog-code-block-module__zaYDqW__wrapper"><div class="relative code-block-module__NOThwW__wrapper blog-code-block-module__zaYDqW__codeBlock"><pre class="prism-code language-tsx code-block-module__NOThwW__pre blog-code-block-module__zaYDqW__maxHeight"><code class="code-block-module__NOThwW__code">server.tool(
  "deploy_project",
  "Deploy a project with environment variables and custom domain",
  {
    repo_url: z.string(),
    domain: z.string(),
    environment_variables: z.record(z.string()),
    branch: z.string().default("main")
  },
  async ({ repo_url, domain, environment_variables, branch }) => {
    const project = await createProject(repo_url, branch);
    await addEnvironmentVariables(project.id, environment_variables);
    const deployment = await deployProject(project.id);
    await addCustomDomain(project.id, domain);

    return {
      content: [{
        type: "text",
        text: `Project deployed successfully at ${domain}. Build completed in ${deployment.duration}s.`
      }]
    };
  }
);</code></pre></div></div><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Use plain code for the deterministic parts. Things like API sequencing, error recovery, and state management are better suited for regular programming. Only involve the LLM for parts that truly need reasoning or natural language processing.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Test with real scenarios. Run actual user workflows through your tools. When you see the LLM making multiple attempts or asking for clarification, that's feedback about your tool design. 
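</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Error recovery is a good example of a deterministic part that belongs in plain code. Here is a small sketch (the wrapper and failure scenario are hypothetical, not from any real codebase) of turning a failed step into a message the model can actually relay:</p>

```typescript
// Hypothetical step wrapper: deterministic error handling stays in code,
// so the model only ever sees a conversational failure message.
async function runStep<T>(label: string, step: () => Promise<T>): Promise<T> {
  try {
    return await step();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Re-throw with a human-readable summary instead of a raw stack trace.
    throw new Error(`${label} failed: ${reason}`);
  }
}

// Example: a domain step that fails inside the workflow tool.
async function demo(): Promise<string> {
  try {
    await runStep("Adding the custom domain", async () => {
      throw new Error("domain is already assigned to another project");
    });
    return "Domain added.";
  } catch (err) {
    return (err as Error).message;
  }
}
```

<p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Whatever the wrapper looks like, the model receives a sentence it can repeat or act on, not a stack trace.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">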
The goal is for complex workflows to succeed on the first try.</p><h2 class="text-module__eElGjq__wrapper pt-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:pt-6 [#changelog-description_&amp;:first-child]:pt-0 w-full group relative text-module__fl7lUq__subheading"><a class="flex items-center justify-center flex-col absolute left-0 bottom-0 mt-auto mb-auto w-8 h-8 rounded-full opacity-0 group-hover:opacity-100 -translate-x-3/4 group-hover:-translate-x-full delay-75 duration-300 transition-all text-gray-900 hover:text-gray-1000 hover:scale-110 focus-visible:-translate-x-full focus-visible:opacity-100 top-[calc(var(--grid-divider-gap)/1.5)] [#changelog-description_&amp;]:hidden" href="https://vercel.com/blog/the-second-wave-of-mcp-building-for-llms-not-developers?utm_source=substack&amp;utm_medium=email#performance-improvements-with-workflow-tools" id="performance-improvements-with-workflow-tools"><svg height="16" width="16"><path></path></svg><span class="sr-only">Link to heading</span></a>Performance improvements with workflow tools</h2><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Teams that have shifted from API-shaped tools to workflow-shaped tools have seen meaningful improvements in reliability and efficiency.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">The common thread is how these tools are designed:</p><ul class="text-module__fl7lUq__unorderedList"><li class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph text-module__fl7lUq__listItem"><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">They focus on user intentions rather than API coverage</p></li><li class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph text-module__fl7lUq__listItem"><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">They handle complete workflows rather than exposing single operations</p></li><li class="text-module__eElGjq__wrapper 
text-module__fl7lUq__paragraph text-module__fl7lUq__listItem"><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">They respond in a conversational way rather than returning technical codes</p></li></ul><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">MCP works best when tools reflect complete user goals. LLMs do not manage state the way developers do, so building tools around workflows produces better results.</p><p class="text-module__eElGjq__wrapper text-module__fl7lUq__paragraph">Give this approach a try. The <a href="https://github.com/vercel/mcp-handler" class="link-module__Q1NRQq__link link-module__Q1NRQq__highlight">MCP handler</a> makes it straightforward to expose your application logic as workflow-based MCP tools. Get started with the <a href="https://vercel.com/templates/ai/model-context-protocol-mcp-with-vercel-functions" class="link-module__Q1NRQq__link link-module__Q1NRQq__highlight">Next.js MCP template</a> or <a href="https://vercel.com/docs/mcp" class="link-module__Q1NRQq__link link-module__Q1NRQq__highlight">explore the documentation</a>.</p></div>]]></description>
      <pubDate>Sun, 28 Sep 2025 18:13:42 +0000</pubDate>
      <link>https://vercel.com/blog/the-second-wave-of-mcp-building-for-llms-not-developers?utm_source=substack&amp;utm_medium=email</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4960938770</guid>
    </item>
    <item>
      <title><![CDATA[Introducing the MCP Registry]]></title>
      <description><![CDATA[<div class="post-content"><p>Today, we’re launching the Model Context Protocol (MCP) Registry—an open catalog and API for publicly available MCP servers to improve discoverability and implementation. By standardizing how servers are distributed and discovered, we’re expanding their reach while making it easier for clients to get connected.</p><p>The MCP Registry is now available in preview. To get started:</p><h2 id="single-source-of-truth-for-mcp-servers">Single source of truth for MCP servers<a class="anchor" href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/?utm_source=substack&amp;utm_medium=email#single-source-of-truth-for-mcp-servers">#</a></h2><p>In March 2025, we shared that we wanted to build a central registry for the MCP ecosystem. Today we are announcing that we’ve launched <a href="https://registry.modelcontextprotocol.io">https://registry.modelcontextprotocol.io</a> as the official MCP Registry. As part of the MCP project, the MCP Registry, as well as a parent <a href="https://github.com/modelcontextprotocol/registry/blob/main/docs/reference/api/official-registry-api.md">OpenAPI specification</a>, are open source—allowing everyone to build a compatible sub-registry.</p><p>Our goal is to standardize how servers are distributed and discovered, providing a primary source of truth that sub-registries can build upon. In turn, this will expand server reach and help clients find servers more easily across the MCP ecosystem.</p><p>In building a central registry, it was important to us not to take away from existing registries that the community and companies have built. The MCP Registry serves as a primary source of truth for publicly available MCP servers, and organizations can choose to <a href="https://github.com/modelcontextprotocol/registry/blob/main/docs/guides/consuming/use-rest-api.md#building-a-subregistry">create sub-registries</a> based on custom criteria. 
For example:</p><p><strong>Public subregistries</strong> like opinionated “MCP marketplaces” associated with each MCP client are free to augment and enhance data they ingest from the upstream MCP Registry. Every MCP end-user persona will have different needs, and it is up to the MCP client marketplaces to properly serve their end-users in opinionated ways.</p><p><strong>Private subregistries</strong> will exist within enterprises that have strict privacy and security requirements, but the MCP Registry gives these enterprises a single upstream data source they can build upon. At a minimum, we aim to share API schemas with these private implementations so that associated SDKs and tooling can be shared across the ecosystem.</p><p>In both cases, the MCP Registry is the starting point – it’s the centralized location where MCP server maintainers publish and maintain their self-reported information for these downstream consumers to massage and deliver to their end-users.</p><h2 id="community-driven-mechanism-for-moderation">Community-driven mechanism for moderation<a class="anchor" href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/?utm_source=substack&amp;utm_medium=email#community-driven-mechanism-for-moderation">#</a></h2><p>The MCP Registry is an official MCP project maintained by the registry working group and permissively licensed. Community members can submit issues to flag servers that violate the MCP <a href="https://github.com/modelcontextprotocol/registry/blob/main/docs/guides/administration/moderation-guidelines.md">moderation guidelines</a>—such as those containing spam, malicious code, or impersonating legitimate services. 
Registry maintainers can then denylist these entries and retroactively remove them from public access.</p><h2 id="getting-started">Getting started<a class="anchor" href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/?utm_source=substack&amp;utm_medium=email#getting-started">#</a></h2><p>To get started:</p><p>This preview of the MCP Registry is meant to help us improve the user experience before general availability and does not provide data durability guarantees or other warranties. We advise MCP adopters to watch development closely as breaking changes may occur before the registry is made generally available.</p><p>As we continue to develop the registry, we encourage feedback and contributions on the <a href="https://github.com/modelcontextprotocol/registry">modelcontextprotocol/registry GitHub repository</a>: Discussion, Issues, and Pull Requests are all welcome.</p><h2 id="thanks-to-the-mcp-community">Thanks to the MCP community<a class="anchor" href="https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/?utm_source=substack&amp;utm_medium=email#thanks-to-the-mcp-community">#</a></h2><p>The MCP Registry has been a collaborative effort from the beginning and we are incredibly grateful for the enthusiasm and support from the broader developer community.</p><p>In February 2025, it began as a grassroots project when MCP creators <a href="https://github.com/dsp-ant">David Soria Parra</a> and <a href="https://github.com/jspahrsummers">Justin Spahr-Summers</a> asked the <a href="https://www.pulsemcp.com/">PulseMCP</a> and <a href="https://block.github.io/goose/">Goose</a> teams to help build a centralized community registry. Registry Maintainer <a href="https://github.com/tadasant">Tadas Antanavicius</a> from <a href="https://www.pulsemcp.com/">PulseMCP</a> spearheaded the initial effort in collaboration with <a href="https://github.com/alexhancock">Alex Hancock</a> from <a href="https://block.xyz/">Block</a>. 
They were soon joined by Registry Maintainer <a href="https://github.com/toby">Toby Padilla</a>, Head of MCP at <a href="https://github.com/">GitHub</a>, and more recently, <a href="https://github.com/domdomegg">Adam Jones</a> from <a href="https://www.anthropic.com/">Anthropic</a> joined as Registry Maintainer to drive the project towards the launch today. The <a href="https://github.com/modelcontextprotocol/registry/discussions/11">initial announcement</a> of the MCP Registry’s development lists 16 contributing individuals from at least 9 different companies.</p><p>Many others made crucial contributions to bring this project to life: <a href="https://github.com/rdimitrov">Radoslav Dimitrov</a> from <a href="https://stacklok.com/">Stacklok</a>, <a href="https://github.com/sridharavinash">Avinash Sridhar</a> from <a href="https://github.com/">GitHub</a>, <a href="https://github.com/connor4312">Connor Peet</a> from <a href="https://code.visualstudio.com/">VS Code</a>, <a href="https://github.com/joelverhagen">Joel Verhagen</a> from <a href="https://www.nuget.org/">NuGet</a>, <a href="https://github.com/pree-dew">Preeti Dewani</a> from <a href="https://last9.io/">Last9</a>, <a href="https://github.com/Avish34">Avish Porwal</a> from <a href="https://www.microsoft.com/">Microsoft</a>, <a href="https://github.com/jonathanhefner">Jonathan Hefner</a>, and many Anthropic and GitHub employees that provided code reviews and development support. We are also grateful to everyone on the <a href="https://github.com/modelcontextprotocol/registry/graphs/contributors">Registry’s contributors log</a> and those who participated in <a href="https://github.com/modelcontextprotocol/registry">discussions and issues</a>.</p><p>We deeply appreciate everyone investing in this foundational open source infrastructure. Together, we’re helping developers and organizations worldwide to build more reliable, context-aware AI applications. 
On behalf of the MCP community, thank you.</p></div>]]></description>
      <pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate>
      <link>https://blog.modelcontextprotocol.io/posts/2025-09-08-mcp-registry-preview/?utm_source=substack&amp;utm_medium=email</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4960938716</guid>
    </item>
    <item>
      <title><![CDATA[Open Social — overreacted]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Sun, 28 Sep 2025 18:07:35 +0000</pubDate>
      <link>https://overreacted.io/open-social/</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4960933716</guid>
    </item>
    <item>
      <title><![CDATA[Alan Turing on Embodied Intelligence]]></title>
      <description><![CDATA[<p>The term “Embodied AI” is having its moment in the sun right now. For me, that is what I have spent my career working on, and I have repeatedly published articles using the term “embodied intelligence”.</p>
<p>I recently stumbled across a piece that I wrote in May 2011 intended for the proceedings of a conference held in June 2012 in Manchester, England, to celebrate the 100<sup>th</sup> anniversary of Alan Turing’s birth. There was a series of invited talks and a number of other components. I attended the conference but I cannot find any record of a proceedings having been published for the 17 invited talks. Here is what I had written for it, unchanged except for fixing many typos.</p>
<hr>
<hr>
<p style="text-align: center;"><strong>The Case For Embodied Intelligence</strong></p>
<p style="text-align: center;"><em>Rodney A. Brooks, May 2011</em></p>
<p style="padding-left: 30px; text-align: left;"><strong>Abstract:&nbsp;</strong>In 1948 in <em>Intelligent Machinery</em> Turing made the distinction between embodied and disembodied intelligence. While arguing that building an embodied intelligence would be a “sure” route to produce a thinking machine he rejected it in favor of disembodied intelligence on the grounds of technical practicalities of the era. Modern researchers are now seriously investigating the embodied approach to intelligence and have rediscovered the importance of what Turing called “cultural search” in that same paper.</p>
<p>For me Alan Turing’s 1948 paper <a href="https://people.csail.mit.edu/brooks/idocs/IntelligentMachinery.pdf" target="_blank" rel="noopener"><em>Intelligent Machinery</em></a> was more important than his 1950 paper <a href="https://people.csail.mit.edu/brooks/idocs/compmach.pdf"><em>Computing Machinery and Intelligence</em></a>.</p>
<p>At the beginning of <em>Intelligent Machinery</em> Turing provided counterarguments to a number of possible objections to the idea that machines could be intelligent. And right at the end he introduced a precursor to the “Imitation Game”, now commonly referred to as the <em>Turing Test</em>, of his 1950 paper. In this earlier version, a not-very-good human chess player would try to guess whether he was playing against another not-very-good human player, or against an algorithm. <span style="color: #ff6600;">[[At the time the opponent person had to be not very good so that it didn’t outshine the then-current abilities of mechanical chess playing. Today the opponent person would have to be a world champion to have any chance at not being outshone by the mechanical system!]]</span> Expansion of these bookends became the body of&nbsp;<em>Computing Machinery and Intelligence</em>.</p>
<p><em>Intelligent Machinery</em> itself was not published until 1970, so many early computer science researchers were unaware of it. I was fortunate to come in contact with it right as I was starting my academic career.</p>
<p>The bulk of the paper gives examples of how simple computational mechanisms could be adaptable, could be taught, and could learn for themselves. The examples and mechanisms Turing used in this exposition were networks of active computational elements. Although he connected them back to the universal machines of his <a href="https://people.csail.mit.edu/brooks/idocs/Turing_Paper_1936.pdf" target="_blank" rel="noopener">1936 paper</a>, it is remarkable, in hindsight, how different this abstraction was from the one he had previously introduced, of the central processing element with a tape memory–still the essential model for all modern digital computers. Here, instead, he used a model inspired by brains. One can only wonder how different our technological world might be if Turing had lived to fully develop this set of ideas himself. Others carried on this second tradition, but one must think that perhaps Turing’s intellectual influence might have been stronger as he would have been arguing against the approach that was adopted from his earlier work.</p>
<p>For me, the critical, and new, insights in <em>Intelligent Machinery</em> were twofold.</p>
<p>First, Turing made the distinction between embodied and disembodied intelligence. While arguing that building an embodied intelligence would be a “sure” route to produce a thinking machine he rejected it in favor of disembodied intelligence on the grounds of technical practicalities of the era. Second, he introduced the notion of “cultural search”: that people’s learning largely comes from the culture of other people in which they are immersed.</p>
<p>Modern researchers are now seriously investigating the embodied approach to intelligence and have rediscovered the importance of interaction with people as the basis for intelligence. My own work for the last twenty-five years has been based on these two ideas.</p>
<p>Turing justifies the possibility of making a thinking machine by “the fact that it is possible to make machinery to imitate any small part of a man”. He uses the implicit idea of his universal computing machines to dismiss the idea that it is necessary to emulate a person at the neural signal level in order to have intelligence, and instead suggests a digital computer, “if produced by present techniques, would be of immense size”, which would control a robot from a distance. That robot would be built by “tak[ing] a man as a whole and to try to replace all parts of him by machine”. In particular, he suggests the parts would include “television cameras, microphones, loudspeakers, wheels and `handling servo-mechanisms’ …”. Turing’s description from over sixty years ago fairly precisely describes what is done today in dozens of research labs around the world with our PR2 robots, or Mekabots, with their brains off board in racks of Linux boxes, or even off in the computing cloud.</p>
<p>Turing further rightly notes that even in building such a robot “the creature would still have no contact with food, sex, sport, and many other things of interest to the human being”. Nevertheless, he suggests that such an approach “is probably the `sure’ way of producing a thinking machine”, before dismissing it as too slow and impractical. He suggests instead that it is more practical, certainly at that time, to “see what can be done with a `brain’ which is more or less without a body”. He suggests the following fields as ripe for exploration by disembodied intelligence:</p>
<p style="padding-left: 30px;"><span style="color: #0000ff;">(i) Various games, e.g., chess, noughts and crosses, bridge, poker</span><br>
<span style="color: #0000ff;"> (ii) The learning of languages</span><br>
<span style="color: #0000ff;"> (iii) Translations of languages</span><br>
<span style="color: #0000ff;"> (iv) Cryptography</span><br>
<span style="color: #0000ff;"> (v) Mathematics.</span></p>
<p>With these suggestions, much of the early direction for the field of <em>Artificial Intelligence</em> was set, and certainly Turing’s odd-numbered suggestions formed a large part of the work in AI during its first decade.</p>
<p>In one paper Turing distinguished embodied from disembodied approaches to building intelligent machines, praised the former as more likely to succeed, and either set or predicted the disembodied directions that were actually followed for many years.</p>
<p>But later, towards the very end of <em>Intelligent Machinery</em>, he comes back to the place of bodies in the world. He distinguishes three kinds of <em>search</em> as ways to build intelligent systems: <em>intellectual</em> search, <em>genetic</em> search, and <em>cultural</em> search. The first is the direction that classical AI went, where programs try to learn and improve their performance. Although he did not suggest that it be mechanized, genetic search has become a thoroughly practical approach to design and optimization. And lastly, by cultural search, Turing means the way in which interactions with others contribute to the development of intelligence. This developmental approach, using social robots, has become practical only in the last fifteen years, and is a rich source of both theoretical and practical learning systems for robots.</p>
<p>It is humbling to read Alan Turing’s papers. He thought of it all. First.</p>
]]></description>
      <pubDate>Sat, 20 Sep 2025 23:52:14 +0000</pubDate>
      <link>http://rodneybrooks.com/alan-turing-on-embodied-intelligence/</link>
      <dc:creator>Rodney Brooks</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4952237127</guid>
    </item>
    <item>
      <title><![CDATA[How I've run major projects]]></title>
      <description><![CDATA[
<p>My few most productive individual weeks at Anthropic have all been “crisis project management:” coordinating major, time-sensitive implementation or debugging efforts.</p>
<p>In a company like Anthropic, excellent project management is an extremely high-leverage skill, and not just during crises: our work has tons of moving parts with complex, non-obvious interdependencies and hard schedule constraints, which means organizing them is a huge job, and can save weeks of delays if done right. Although a lot of the examples here come from crisis projects, most of the principles here are also the way I try to run any project, just more-so.</p>
<p>I think excellent project management is also <em>rarer than it needs to be</em>. During the crisis projects I didn’t feel like I was doing anything particularly impressive; mostly it felt like I was putting in a lot of work but doing things that felt relatively straightforward. On the other hand, I often see other people miss chances to do those things, maybe for lack of having seen a good playbook.</p>
<p>So here’s an attempt to describe my playbook for when I’m being intense about project management.</p>
<p>(I’ve described what I did as “coordinating” above, but that’s underselling it a bit; it mattered a lot for this playbook that I had enough technical context, and organizational trust, to autonomously make most prioritization decisions about the project. Sometimes we instead try to have the trusted decisionmakers not be highly involved in managing execution, and instead farm that out to a lower-context or less-trusted project manager to save the trusted decisionmaker time, but IMO this is usually a false economy for projects where it’s critical that they be executed well.)</p>
<h2 id="focus">Focus</h2>
<p>For each of the crisis management projects I completely cleared my schedule to focus on them, and ended up spending 6+ hours a day organizing them.</p>
<p>This is a bit unintuitive because I’m used to thinking of information processing as basically a free action. After all, you’re “just” moving info from place to place, not doing real work like coding, right? But if you add it all up—running meetings, pinging for updates, digesting Slack threads, pinging for updates again, thinking about what’s next, pinging for updates a third time, etc.—it’s surprisingly time-intensive.</p>
<p>Even more importantly than freeing up time, clearing my schedule made sure the project was the <a href="https://paulgraham.com/top.html">top idea in my mind</a>. If I don’t do that, it’s easy for me to let projects “go on autopilot,” where I keep them running but don’t proactively make time to think through things like whether we should change goals, add or drop priorities, or do other “non-obvious” things.</p>
<p>For non-crisis projects, it’s often not tenable (or the right prioritization) to spend 6+ hours a day project-managing; but it’s still the case that you can improve execution a lot if you focus and make them a top priority, e.g. by carving out dedicated time every day to check statuses, contemplate priorities, broadcast updates, and so on.</p>
<h2 id="maintain-a-detailed-plan-for-victory">Maintain a detailed plan for victory</h2>
<p>A specific tool that I’ve found critical for staying oriented and updating quickly is a <em>detailed plan for victory</em>, i.e., a list of steps, as concrete as possible, that end with the goal being achieved.</p>
<p>The plan is important because whether or not we’re achieving the plan is the best way to figure out how well or badly things are going. Knowing how well or badly things are going is important because it tells me when to start asking for more support, cutting scope, escalating problems, and otherwise sounding more alarms. One of the most common megaproject failure modes is to <em>not freak out soon enough</em>, and having a concrete plan is the best antidote.</p>
<p>As both a positive and negative example of this, during a recent sprint to release a new implementation of a model, we took a detailed accounting of all the work we thought we had to do to launch.</p>
<ul>
<li>On the plus side, this made it clear three months before launch that things were going to be <em>very</em> tight, and this enabled us to ask for help from another team, who loaned us someone who sped up the project a fair amount.</li>
<li>On the minus side, we also massively underestimated a few components of the project, and because of this, we still ended up very crunched at the end.</li>
</ul>
<p>As the above example shows, having a plan can’t completely save you if you underestimate how long all the steps in the plan will take. But it certainly helps! My sense is that the main things that would have helped even more in the above case were:</p>
<ul>
<li>We were inexperienced at estimating tasks, especially tasks related to new model implementations (which most people on the team were too new to have done before), and we were too cowardly to add the requisite amount of “slop” to our plan.</li>
<li>We didn’t check in frequently enough against the plan once we made it, or sound the alarm early enough when we went off-plan.</li>
</ul>
<h2 id="run-a-fast-ooda-loop">Run a fast OODA loop</h2>
<p>OODA stands for “observe, orient, decide, act”—in other words, the process by which you <em>update your plans and behavior based on new information</em>.</p>
<p>Most of the large projects I’ve worked on have been characterized by incomplete information:</p>
<ul>
<li>Our cluster’s networking is bad, but we don’t understand why.</li>
<li>We have a correctness bug but we don’t know where it is.</li>
<li>We need to rewrite the system but we’re not totally sure what the rewrite should look like.</li>
</ul>
<p>In fact, I’d make a stronger claim: usually getting complete information <em>was the hard part of the project</em>, and took up a substantial fraction of the overall critical-path timeline.</p>
<p>For example, let’s take a recent project to kick off a training run. The critical path probably looked something like:</p>
<ol>
<li>Chips for the training run are delivered</li>
<li>We run some tests</li>
<li>We discover one aspect of performance is unexpectedly poor</li>
<li>We escalate the problem with our compute partner</li>
<li>Compute partner staffs a large debugging effort</li>
<li>We realize we had given our compute partner an outdated benchmark that is causing them to target the wrong improvements</li>
<li>Compute partner switches benchmark and prioritizes different improvements</li>
<li>We share our benchmarks with compute partner so they can run the exact same code as us</li>
<li>Compute partner rolls out improvements</li>
<li>We test the improvements</li>
<li>Performance is still poor and we tell them that</li>
<li>Repeat steps 9-11 until eventually it’s good enough</li>
</ol>
<p>Practically all of these steps are about information-processing, not writing code! Even the step where the compute partner debugged the problems on their side was itself constrained by information processing speed, since there were tens of people working on the debugging effort and coordinating / sharing info between them was difficult. Overall, the project timeline was strongly constrained by how quickly information could round-trip from our compute partner’s large-scale debugging effort, through their tech lead, me, and Anthropic’s large-scale debugging effort.</p>
<p>This pattern generalizes to most projects I’ve been a part of, and as a result, one of my most productive project management habits is to try to run the fastest OODA loop that I can.</p>
<p>A few specific things that I’ve found help:</p>
<ul>
<li><strong>Spend time on it:</strong> running OODA loops takes time, and is one of the primary reasons that, as mentioned above, I usually spend 6+ hours a day on running a megaproject if it’s in crisis mode.</li>
<li><strong>Communicate uncomfortably much:</strong> For the training run debugging, to reduce the round-trip time between orgs as much as possible, I had multiple daily calls with my counterpart at our compute partner (9am and 6pm). For the model implementation effort, I was basically constantly bouncing between different groups of debuggers, asking for updates and processing them.</li>
<li><strong>Track and prioritize the biggest open questions:</strong> For most big projects I’ve maintained a living doc with a ranked list of all my <em>biggest open questions</em> about the project. Resolving or de-risking these uncertainties basically turns into the project’s priority list.<br>
Ideally, there are enough people working on the project that we can work on resolving multiple of the uncertainties in parallel, since that’s one of the best ways to speed things up. (And for a project in “crisis mode,” if we have more top priorities than we can parallel-path with the current set of people working on the problem, that’s also a good test for whether it’s time to pull in more folks.)</li>
<li><strong>Step back and reorient frequently:</strong> Other than asking for updates, the main thing I spent time on was <em>reorienting</em>—looking at our list of priorities, asking myself whether they should still be the top priorities, then looking at what people were working on, and making sure those things were attacking the top priorities. I probably reviewed the project’s priorities multiple times a day as well, although I often didn’t make changes as a result.
<ul>
<li>(Note that it is possible to change what people are working on too often, since switching tasks is costly. Parallelizing work on the top few priorities, as mentioned above, helps with this, since if you decide that priority #3 is now #1, but there are 2 people working on each, then nobody has to switch tasks. The thing that kills you is when <em>no one</em> is working on the new priority #1.)</li>
</ul>
</li>
</ul>
<h2 id="overcommunicate">Overcommunicate</h2>
<p>It’s not enough for just me personally to be running a fast OODA loop—in a large group, <em>everyone</em> needs to be autonomously making frequent, high-quality, <em>local</em> prioritization decisions, without needing a round-trip through me. To get there, they need to be ambiently aware of:</p>
<ol>
<li>what else is going on around them, so they can coordinate and update on new info quickly (“oh, we’re planning to kick off the next derisking run in three days, so I have to have my new RL environment ready and tested by then”)</li>
<li>how their goal fits into the overall project, so they can make correct decisions about the details of their approach (“we’re trying to scale up as much as possible right now, so this direction isn’t valuable to pursue since it could never provide the scale of data we need”)</li>
</ol>
<p>I’ve usually found that to create the right level of ambient awareness, I have to repeat the same things way more often than I intuitively expect. This is roughly the same “communicate uncomfortably much” principle above, but applied to broadcasts and not just 1:1 conversations with people.</p>
<p>For example, although the first team I managed at Anthropic started with a daily asynchronous standup, we found that synchronous meetings were much more effective for creating common knowledge and reorienting, so we moved to a twice-weekly synchronous standup, which probably qualified as “uncomfortably much” synchronous communication for Anthropic at the time.</p>
<h2 id="break-off-subprojects">Break off subprojects</h2>
<p>Once a project gets over maybe 10 people, I can’t track everything in enough detail to project-manage the entire thing myself. At this point, it becomes critical to delegate.</p>
<p>Here I mean delegating the <em>project management</em>, not just the execution (that’s what I’d be delegating to the first 10 people). This is the point where I need other people to help split up the work, monitor and communicate progress, escalate blockers, etc.</p>
<p>A few things I try to keep in mind when delegating project management:</p>
<ul>
<li>The ideal unit of delegation is a crisp, <em>simple</em>, high-level goal, with limited overlap with other workstreams. (This is as opposed to, e.g., a list of tasks like “see if X helps.”) Good examples: “get X training technique working over Y networking protocol at Z throughput,” “get identical evals between model implementations A and B.” Bad examples: “follow this 10-step checklist that we hope results in training working,” “try these 3 techniques for debugging the loss eval.”</li>
<li>The best project-managers are often <em>not</em> the strongest technical ICs. Instead the most important traits are that they’re highly organized and great at staying laser focused on end goals, perhaps to the point of being annoying about it. IC depth helps and I’ll never say no to it, but it’s not what I’d optimize for.</li>
<li>People running subprojects are probably also doing a lot of the same stuff I do, in particular e.g. spending a lot of time on it. That means they’ll take a substantial hit to their IC productivity. This is expected, and is often worth it. “Direction is more important than magnitude”—it’s usually better to have a lower-velocity project that works on the right things, than a higher-velocity one that’s pointed at the wrong goal.</li>
</ul>
<p>One of my favorite things to make delegation easier is to <em>keep goals simple</em>—if they can fit in a Slack message while still crisply describing a path to the desired end state, then the people working on the goal will be much more able to prioritize autonomously, and point their work at the real end goal rather than doing something that turns out to be useless for some reason they didn’t think about.</p>
<p>“Keep goals simple” doesn’t have to mean “do less”—the best way to keep goals simple is to <em>find the latent structure that enables a clean recursive decomposition into subgoals</em>. This often requires a deceptive amount of work—both cognitive and hands-on-keyboard—to identify the right intermediate goals, but I’ve found that it pays off immensely by clarifying what’s important to work on.</p>
<h2 id="have-fun">Have fun</h2>
<p>Some of my favorite memories of Anthropic are of helping out with these big projects. While they can be intense, it’s also really inspiring to see how our team comes together, and the feeling of being part of a big team of truly excellent people cooking something ambitious together can be really magical! So I try to enjoy the chaos :)</p>
<hr>
<h2 id="appendix-my-project-dri-starter-kit">Appendix: my project DRI starter kit</h2>
<p><em>Here’s the internal doc I share with folks on my team who are getting into being responsible for large projects.</em></p>
<p>So you’re the DRI of a project (or part of one). Concretely, what do you do to “be DRI”?</p>
<p>This doc is my suggested “starter kit” answer to that question. The habits and rituals described here aren’t perfect for every situation, but they’re lightweight and broadly helpful. I suggest you use them as a starting point for iteration: try them out, then adjust as necessary. This is an SL init; the RL is your job :)</p>
<h3 id="goals-of-this-playbook">Goals of this playbook</h3>
<p>The goal is to help you do your job as DRI—</p>
<ul>
<li>Make your project go quickly:
<ul>
<li>Participants deeply understand the root goal and can autonomously choose the most important next things to work on</li>
<li>People have “situational awareness” of what other people are working on, learn about relevant updates quickly, and coordinate quickly when needed</li>
<li>People get quick feedback on their work</li>
<li>If things aren’t going fast enough, you (the DRI) can notice and course-correct quickly</li>
</ul>
</li>
<li>“Play well with others:”
<ul>
<li>Observers can figure out where to go to follow along</li>
<li>Adjacent or intersecting people/projects don’t miss important updates or get caught by surprise</li>
<li>People notice quickly if the project is behind or off-track, and can identify opportunities to help</li>
</ul>
</li>
</ul>
<p>—without adding too much overhead:</p>
<ul>
<li>&lt;1 hour of setup to make a working doc, schedule a weekly meeting, etc.</li>
<li>30 min/week of meetings</li>
<li>15-30 min/week to write an update</li>
</ul>
<p>(Note: <em>being DRI</em> will still unavoidably add some overhead—e.g. you’ll have to track what other people are doing, delegate work, unblock people, set and communicate goals, etc. The goal is specifically for the <em>process/paperwork</em> to be minimal.)</p>
<h3 id="weekly-meeting">Weekly meeting</h3>
<p>You should schedule at least one 30-minute weekly meeting with everyone working on the project.</p>
<p>The goal of this meeting is to (1) be a backstop for any coordination that needs to happen and didn’t happen asynchronously; (2) be an efficient way to create <a href="https://en.wikipedia.org/wiki/Common_knowledge_%28logic%29">common knowledge</a> of goals, updates, etc.; (3) help you track whether things are going well.</p>
<ul>
<li>Starter-kit agenda:
<ul>
<li>[5m] DRI reviews major updates from last week and sets goals for next week</li>
<li>[10m] Silent write and comment on discussion topics</li>
<li>[10m] Synchronous discussion of most important things not addressed during silent write</li>
</ul>
</li>
<li>Signs that more meetings might help (e.g. a second weekly standup):
<ul>
<li>you have a very tight deadline and can’t afford to lose time</li>
<li>people aren’t working on the most important thing</li>
<li>people need feedback frequently</li>
<li>people step on each others’ toes or miss opportunities to help each other out</li>
<li>if you just like hanging out with each other :)</li>
</ul>
</li>
</ul>
<h3 id="landing-page--working-doc">Landing page / working doc</h3>
<p>It’s really helpful for discoverability and wayfinding to have a single “master doc” with all the most important info about a project. As you loop more people in, they can read the doc to get up to speed. And anyone who thinks “I wonder how X is going” can stop by there to find out.</p>
<p>Create a doc for your workstream with:</p>
<ul>
<li>A <a href="https://www.golinks.io/">go/ link</a> in the name (if a subproject, maybe use go/project/subproject)
<ul>
<li>→ This makes it easier to find quickly (search is kinda rough)</li>
</ul>
</li>
<li>A clear description of a <strong>concrete top level goal</strong> and how it fits into broader goals
<ul>
<li>→ This is critical info for participants, so they can autonomously prioritize the most important things; and for observers, so that they know what outcome to expect.</li>
</ul>
</li>
<li><strong>Staffing:</strong> A list of people working on the project, your name as the DRI, and a link to the slack channel that’s being used for discussion</li>
<li><strong>Links:</strong> A short list of relevant links at the top (work trackers, the project’s Slack channel, major design docs, etc.). If needed, add a longer “docs / see also” section further down that links to relevant docs.
<ul>
<li>→ It’s really easy to lose track of relevant docs otherwise!</li>
</ul>
</li>
<li>A <strong>roadmap</strong> section with intermediate goals and target dates
<ul>
<li>→ See the <a href="https://www.benkuhn.net/">section on plans</a>; these will help people understand what the overall shape of the project is expected to be.</li>
</ul>
</li>
<li>A section for “running notes” containing meeting notes from your weekly meetings (and any other ad-hoc meetings) and <a href="https://www.benkuhn.net/">broadcast updates</a>
<ul>
<li>→ This really helps observers and new-joiners get up to speed!</li>
</ul>
</li>
<li>I like maintaining a list of <strong>important open questions / uncertainties / risks</strong> and updating it over time. This helps me stay focused on removing risk from the project as quickly as possible.</li>
</ul>
<p>If it’s part of a larger project, your doc should be nested within the larger project’s working doc.</p>
<p>If this ends up being too much for one doc, you can fork these out into sub-docs (esp. running notes and updates).</p>
<h3 id="plan--roadmap--milestones">Plan / roadmap / milestones</h3>
<ul>
<li>In your working doc, include a section with some intermediate goals and dates by which you hope to accomplish them.
<ul>
<li>→ This is helpful mostly for noticing you’re off track or behind without getting frog-boiled.</li>
<li>→ Or noticing when you need to make a direction change because the intermediate goals don’t seem good anymore.</li>
</ul>
</li>
<li>You might feel some pressure to add false certainty or precision, but avoid this and be honest about your uncertainty instead. For a lot of research projects it’s hard to plan more than a couple weeks ahead. You can make the milestones fuzzier / more aspirational beyond that, or just drop them.
<ul>
<li>I often find it helpful to phrase milestones in probabilities and distributions (e.g. “my 90% confidence interval for this date is X-Y” or “I think there’s a 75% chance this technique works”)</li>
</ul>
</li>
</ul>
<h3 id="whos-working-on-what">Who’s working on what</h3>
<ul>
<li>You should have something somewhere that describes what people are working on.</li>
<li>The minimum viable version of this is a list of what people are working on in your working doc.
<ul>
<li>If you end up with a large set of tasks and a big backlog, maybe use a checklist and/or move to a subdoc.</li>
</ul>
</li>
<li><strong>Stack rank your work list.</strong> It’s really important for people to understand priorities!</li>
<li>If there are a lot of different people/TODOs, I suggest using some app to make a kanban board with “backlog” / “up next” / “in progress” / “done” columns.
<ul>
<li>This is probably most helpful for more deterministic/plannable projects where there’s a clear backlog + set of future tasks, and a lot of things you need to remember to do.</li>
</ul>
</li>
<li>If you have an external task tracker, link it in the wiki section of the working doc.</li>
</ul>
<h3 id="slack-norms">Slack norms</h3>
<ul>
<li>Have conversations about the project in a Slack channel (not DMs).
<ul>
<li>Reference the channel in your working doc.</li>
<li>Link the working doc in the Slack channel bookmarks.</li>
</ul>
</li>
<li>Cross-post notebook posts and experiment write-ups into the channel so observers don’t have to follow tons of notebook channels.</li>
<li><strong>Do not use DMs.</strong> These make it hard to make info discoverable or share it further.
<ul>
<li>If people send you important stuff in a DM, ask them to put it in the project channel.</li>
<li>If you need confidentiality, make a private channel.</li>
</ul>
</li>
<li><strong>Avoid centithreads.</strong> Most ≥10-message Slack threads would be better as a ~5-minute Tuple call.
<ul>
<li>(This is hard to do with people who are in tons and tons of meetings like execs. But you should try to do it for others.)</li>
<li>If you end up with a centithread, assume nobody will read it; post a summary back to the channel afterwards.</li>
</ul>
</li>
<li>Bias towards fewer, larger, noisier channels. The right time to create a channel is when discussion is either not happening, or getting lost.
<ul>
<li>→ Too many Slack channels make it harder to manage membership, decide where to put things, or find where discussion is happening.</li>
</ul>
</li>
<li>Channel organization and membership matters. Invest in routing conversations to the right place and curating the channel “architecture.”</li>
</ul>
<h3 id="weekly-broadcast-updates">Weekly broadcast updates</h3>
<ul>
<li>Once a week, probably either just before or just after your weekly meeting, write up a <strong>brief</strong> update for a broader audience with:
<ul>
<li>The overall vibe</li>
<li>What’s changed since last update</li>
<li>What’s coming up next</li>
</ul>
</li>
<li>When writing these updates, optimize for <strong>signal to noise ratio</strong>.
<ul>
<li>Err towards concision</li>
<li>No “we worked on X”—tell me “we accomplished Y” or “we learned Z”</li>
<li>Remember your audience (= people not familiar with the project)</li>
<li>State things crisply and concretely (“X improves eval Y by Z points,” not “we got X working”)</li>
<li>Leave out anything that’s not actionable—you don’t need to be exhaustive</li>
</ul>
</li>
<li>Post the update in your project Slack channel, and cross-post it to other relevant channels (e.g. a larger “megaproject” channel) if necessary.
<ul>
<li>If your project is part of a larger megaproject, these updates might feed into something broader like a weekly meeting of DRIs or an aggregated status update.</li>
</ul>
</li>
</ul>
<h3 id="retrospectives">Retrospectives</h3>
<ul>
<li>Every so often, step back and ask “how could the last X weeks have gone better?”
<ul>
<li>Frequency depends on how much there is going on—every 2 weeks is good if there’s a lot, maybe every 4-8 weeks for smaller projects</li>
</ul>
</li>
<li>Suggested meeting format
<ul>
<li>Friday afternoon</li>
<li>[13 min] Async brainstorm 2 lists of items: “what went well” / “what we could improve”</li>
<li>[2 min] Dedupe topics and emoji vote by putting :heavy_plus_sign: next to ones you agree with</li>
<li>Sort “what we could improve” by highest votes</li>
<li>[10 min] Synchronous discussion of top points (either highest voted or flagged by DRI); figure out action items</li>
</ul>
</li>
</ul>
<p><em>Thanks to Kelley Rivoire for many thoughtful comments on a draft!</em></p>]]></description>
      <pubDate>Sun, 16 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://www.benkuhn.net/pjm/</link>
      <dc:creator>benkuhn.net</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4758040043</guid>
    </item>
    <item>
      <title><![CDATA[RT by @_akhaliq: Finally they looking at HF.]]></title>
      <description><![CDATA[<p>Finally they looking at HF.</p>
<p><a href="https://nitter.uni-sonia.com/_akhaliq/status/1745476689625329737#m">nitter.uni-sonia.com/_akhaliq/status/1745476689625329737#m</a></p>]]></description>
      <pubDate>Thu, 11 Jan 2024 16:58:30 +0000</pubDate>
      <link>https://nitter.uni-sonia.com/artificialguybr/status/1745490380244566053#m</link>
      <dc:creator>AK / @_akhaliq</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4332803437</guid>
    </item>
    <item>
      <title><![CDATA[How to Do Great Work]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Sun, 02 Jul 2023 10:18:12 +0000</pubDate>
      <link>http://www.paulgraham.com/greatwork.html</link>
      <dc:creator>Paul Graham: Essays</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4147704684</guid>
    </item>
    <item>
      <title><![CDATA[Research Taste Exercises [rough note]]]></title>
      <description><![CDATA[<div class="col-md-8"> <br> <div> This article is a rough note. Writing rough notes allows me share more content, since polishing takes lots of time. While I hope it's useful, it's likely lower quality and less carefully considered than my usual articles. It's very possible I wouldn't stand by this content if I thought about it more. <p> See also <a href="https://twitter.com/ch402/status/1347970296234479617">twitter thread version</a>.
</p></div> <p>One of the most important aspects of growing as a researcher is developing research taste -- roughly, the ability to choose good problems to work on. But it can be hard to explicitly work on developing taste, so I wanted to share some concrete, actionable exercises.</p> <p>Before we start with exercises, it's worth thinking about why it's hard to develop research taste. I think the fundamental issue is that actually testing whether a research idea you come up with is good is very expensive. Often it takes months, so you only really get a few pieces of feedback on your taste every year.
Many of the following exercises are really strategies for getting (proxy) feedback on more research ideas faster. The feedback you get is generally lower quality than actually executing the idea yourself, but these aim to get you orders of magnitude more of this noisy feedback.</p> <p>Take all these exercises and suggestions with a grain of salt. I think I have better taste than I used to, and doing some things like this helped. I think these exercises have also helped the occasional person I mentored or managed. But I'm hardly an expert. If you have better ways to build research taste, please share them in the comments or <a href="https://twitter.com/ch402/status/1347970296234479617">on Twitter!</a></p> <p>One other point: I don't recommend forcing yourself to work on something you don't find exciting, even if you intellectually think it's more important. You'll be a lot less effective. Ideally, you want to get to a point where the problems you think are most important also feel most exciting.</p> <h2>Exercises</h2> <p><i><b>Exercise 1:</b> Write down a list of research ideas. Have a mentor you respect rate each idea 1-10. Discuss ideas where you disagree with them after reflection.</i> </p><p>It often takes several months of work to truly test an idea. Asking a mentor is a cheap proxy.</p> <p><i><b>Exercise 2:</b> Pay attention when other people try ideas you’ve had. How did the results compare with your expectations?</i></p> <p>If you’re investing in brainstorming, you’ll have tens or hundreds of research ideas for each one you try. It can be emotionally uncomfortable when someone publishes an idea you had, but it’s an opportunity to get precious feedback for free.</p> <p><i><b>Exercise 3:</b> Interview researchers around you on their taste. Why do they work on the problems they do? How do they pick problems? 
What’s their “big picture” of research?</i></p> <p>Bonus: If you have a great interview, you might consider writing it up with their permission.</p> <p>(PhD students often have short meetings with researchers visiting their lab. One pattern is pitching their project to the visitor, which may not be very productive if they don't have similar interests. I suspect they’d get more out of talking about meta-research.)</p> <p><i><b>Exercise 4:</b> Read books about the history of science. Reflect on why some researchers focused on important directions their contemporaries ignored.</i></p> <p>(Kuhn’s “The Structure of Scientific Revolutions” is one of my favorite books, and you can get an audio book!)</p> <p><i><b>Exercise 5:</b> Critically consider your research taste, and the community taste around you. Your taste is likely very influenced by your research cluster (your collaborators, advisor, etc).</i> </p><p>In what ways has your own research taste or your community's taste been wrong over the last few years? </p><p>Are there adjacent research “schools” with significantly different research taste? If so, try to articulate the strongest version of their view, and why you agree or disagree.
</p><p>In what ways do you disagree with the research taste of your own community? What are ways in which you think your community might be wrong, even if you aren't confident in that view?
</p><p>What are the underlying goals of your research taste? This could simply be intrinsically wanting to understand, or just finding research fun. But there might also be ways you want to make the world a better place. Are the problems you are working on (or think are interesting) aligned with those goals?</p> <h2>Failure Modes</h2> <p>I also wanted to go through some research taste failure modes:</p> <p><i><b>Failure Mode 1:</b> Getting overly attached to one research direction / falling into sunk costs.</i></p> <p>A lot of researchers start working on one problem (often more due to circumstance than contemplated decision) and then find it hard to move on.</p> <p><b>Potential Antidote:</b> Set aside a week or two to step back and create a list of the other research directions you think would be most promising to work on (look at new research, talk to others, etc). Then ask what you’d want to work on if you were starting from scratch. Repeat every 1-2 years as needed.</p> <p><i><b>Failure mode 2:</b> Lack of research knowledge / intimacy.</i></p> <p>Theoretical knowledge is table stakes for research taste.</p> <p>Sometimes people fall into a trap of trying to pick the perfect problem or have a brilliant insight before getting their hands dirty. You can’t have research taste in a vacuum. You need theoretical knowledge and <a href="https://twitter.com/ch402/status/1295063684545314818">research intimacy</a>.</p> <p><b>Potential Antidote:</b> Help someone else with their project in a space you’re interested in. You’ll learn a lot and get your hands dirty without becoming overly committed to working on something long term. Ideally, by not leading the project, this should also be lower stress. 
Alternatively, do your own short-term projects to get your hands dirty.</p> <p><i><b>Failure mode 3:</b> Environment not aligned with your interests.</i></p> <p>I sometimes talk to PhD students who have different interests than their advisor, and are trying to contort their research interests into something their advisor will find palatable. This generally leads to ideas optimized for compromise rather than quality, which produces suboptimal research and starves you of an opportunity to build taste.</p> <p>This can happen even when those around you nominally support you working on whatever you’re interested in. It can be challenging to pursue a direction when people around you are unenthusiastic.</p> <p><b>Potential Antidote:</b> Either (1) move to an environment which is aligned with your interests, or (2) temporarily buy into the interests of your group and try to exercise good taste within those interests.
I’m generally pro moving to another environment once you’ve reflected a lot on what you’re interested in and have something you feel very actively excited about.</p> <h2>Suggestions from other people</h2> <p>In the <a href="https://twitter.com/ch402/status/1347970296234479617">twitter discussion</a> (and especially, a <a href="https://twitter.com/andy_matuschak/status/1348001750221946882">thread by Andy Matuschak</a>), a number of people suggested other ideas. Not all of these are quite "exercises for building taste" (many are more "strategies for exercising good taste"), but I think they're very interesting:</p> <ul> <li><a href="https://twitter.com/RachelPrudden/status/1347973143990829056"><b>Rachel Prudden suggests</b></a> <br> <i>The heuristic of imagining that another group has published the paper you have in mind. Are you excited to read it?</i><br> Chris comment: I think this is a pretty good heuristic.</li> <li><a href="https://twitter.com/andy_matuschak/status/1348001750221946882"><b>Andy Matuschak suggests</b></a> <br> <i>ask the anti-Hamming question about your ideas. What are the most interesting (not important!) problems in your field? Why aren't you working on them?</i> <br> Chris comment: This is a great question. See also Hamming's famous <a href="https://www.youtube.com/watch?v=a1zDuOPkMSw">"You and Your Research"</a> talk.</li> <li><a href="https://twitter.com/andy_matuschak/status/1348002769085419520"><b>Andy also shares a failure mode:</b></a><br> <i>running to make an idea happen as soon as it seems tractable. It's exciting to have a viable approach—makes me wanna go do it! 
But if I sit with an idea longer I can often evolve a deeper variant and run with that instead.</i></li> <li><b>Michael Nielsen's <a href="https://michaelnielsen.org/blog/archive/000114.html">Principles of Effective Research</a> (<a href="https://twitter.com/shreydjain13/status/1348023976275161089">via Shrey Jain</a>)</b> notes the value of paying attention to messes:<br> <i>When you identify such a mess, the natural inclination of many people is to shy away, to find something that is easier to understand. But a field that is a mess is really an opportunity. Chances are good that there are deep unifying and simplifying concepts still waiting to be understood and developed by someone - perhaps you.</i> </li> <li><a href="https://twitter.com/arthurallshire/status/1348024259214516231"><b>Arthur Allshire</b></a><br> <i>What are the bottlenecking problems in your field? Eg. in a lot of reinforcement learning research, a lot focus on slight tweaks to algorithms, but the biggest improvements have come from improving simulation for RL.</i> </li> <li><a href="https://twitter.com/PreetumNakkiran/status/1348049533436129281"><b>Preetum Nakkiran</b></a><br> <i>I've found that writing "project proposals" in the short term, and "research statements/plans" in the long term, are good exercises to think about motivations &amp; goals (at diff timescales). By "project proposal", I mean: Once you decide to commit the next &gt;1 month of time to something, write up a few pages explaining: why the question is interesting, why the answer will be valuable, and what you expect to find (w/ initial evidence). And then get feedback.<br> ... Also seconding the advice to ask everyone you meet about their "big picture" research goals. 
This is almost never written in papers, but is much more interesting/important.</i></li> <li><a href="https://twitter.com/ChrSzegedy/status/1348055991565262849"><b>Christian Szegedy</b></a><br> <i>In hindsight though, most of the errors are made in my research was when I listened to the criticisms of my respected mentors and most of the cool stuff I did were the ones those everybody called initially useless, uninteresting or "good luck!"... Generally, ideas in research need to be either ahead of the others or should defy conventional wisdom. Being in the latter camp requires less skill, but more courage.</i><br> Chris comment: I think there's a lot of truth to this, but beware survivorship bias and remember that different people may have different levels of scientific maturity / taste at different points in their careers. I suspect that many PhD students would find charting their own research agenda from the start very challenging, while others would thrive.</li> <li><a href="https://twitter.com/sergey_nog/status/1348049950630928387"><b>Sergey Gorbunov</b></a><br> <i>My only advice on research taste to grad students or junior researchers is *not* to listen to your advisors/mentors/established researchers... Grad school is the only place when you get to try and fail as many times as needed. Grad students should come up with outside the box problems/solutions. (Established researches have their own agendas/way of thinking😂). The advice can similarly be used to overturn itself. 🧐</i></li> <li><a href="https://twitter.com/boazbaraktcs/status/1348097776551407618"><b>Boaz Barak</b></a><br> <i>I’m fond of “20/80 projects” where you do 80% of work to get 20% of goals:<br> Rather than asking “what’s coolest result I can get with smallest effort?” ask “can I find problem that distills the main obstacle even if end result doesn’t sound cool?</i></li>
</ul> <h2>Acknowledgments</h2> <p>I'm grateful to the countless people who've talked to me about research philosophy over the years.</p> <p>Thanks to <a href="http://nickcammarata.com/">Nick Cammarata</a> for discussion of these exercises.</p> <p>Thanks to Ryan Saxe and Abinav for correcting typos.</p> </div>]]></description>
      <pubDate>Fri, 30 Jun 2023 11:09:51 +0000</pubDate>
      <link>https://colah.github.io/notes/taste/</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4146097570</guid>
    </item>
    <item>
      <title><![CDATA[An AI Miracle Malcontent]]></title>
      <description><![CDATA[<p>The stark success of OpenAI’s <a href="https://en.wikipedia.org/wiki/GPT-4">GPT4 model</a> surprised me shifting my view from “really good autocomplete” (roughly inline with intuitions <a href="https://arxiv.org/abs/2301.06627">here</a>) to a dialog agent exhibiting a significant scope of reasoning and intelligence.  Some of the MSR folks did a <a href="https://arxiv.org/abs/2303.12712">fairly thorough study of capabilities</a> which seems like a good reference.  I think of GPT4 as an artificial savant: super-John capable in some language-centric tasks like style and summarization with impressive yet more limited abilities in other domains like spatial and reasoning intelligence.</p>
<p>And yet, I’m unhappy with mere acceptance because there is a feeling that a miracle happened.  How is this not a miracle, at least with hindsight?  And given this, it’s not surprising to see folks thinking about more miracles.  The difficulty with miracle thinking is that it has no structure upon which to reason for anticipation of the future, prepare for it, and act rationally.  Given that, I wanted to lay out my view in some detail and attempt to understand enough to de-miracle what’s happening and what may come next.</p>
<p><b>Deconstructing The Autocomplete to Dialog Miracle</b><br>
One of the ironies of the current situation is that an organization called “OpenAI” created AI and isn’t really open about how they did it.  That’s an interesting statement about economic incentives and focus.  Nevertheless, back when they were publishing, the <a href="https://arxiv.org/abs/2203.02155">Instruct GPT</a> paper suggested something interesting: that reinforcement learning on a generative model substrate was remarkably effective—good for 2 to 3 orders of magnitude improvement in the quality of response with a tiny (in comparison to language sources for next word prediction) amount of reinforcement learning. My best guess is that this was the first combination of 3 vital ingredients.</p>
<ol>
<li>Learning to predict the next word based on vast amounts of language data from the internet.  I have no idea how much, but wouldn’t be surprised if it’s a million lifetimes of reading generated by a billion people.  That’s a vast amount of information there with deeply intermixed details about the world and language.
<ol>
<li>Why not other objectives?  Well, they wanted something simple so they could maximize scaling.  There may indeed be room for improvement in choice of objective.</li>
<li>Why language? Language is fairly unique amongst information in that it’s the best expression of conscious thought.  There is thought without language (yes, I believe animals think in various ways), but you can’t really do language without thought.</li>
</ol>
</li>
<li>The use of a large deep transformer model (<a href="https://arxiv.org/pdf/2207.09238.pdf">pseudocode here</a>) to absorb all of this information.  Large here presumably implies training on many GPUs with both data and model parallelism.  I’m sure there are many fine engineering tricks here.  I’m unclear on the scale, but expect the answer is more than thousands and less than millions.
<ol>
<li>Why transformer models?  At a functional level, they embed ‘soft attention’ (=ability to look up a value with a key in a gradient friendly way).  At an optimization level, they are GPU-friendly.</li>
<li>Why deep? The drive to minimize word prediction error in the context of differentiable depth creates a pressure to develop useful internal abstractions.</li>
</ol>
</li>
<li>Reinforcement learning on a small amount of data which ‘awakens’ a dialog agent.  With the right prompt (=prefix language) engineering a vanilla large language model can address many tasks as the information is there, but it’s awkward and clearly not a general purpose dialog agent.  At the same time, the learned substrate is an excellent representation upon which to apply RL creating a more active agent while curbing an inherited tendency to mimic internet flamebait.
<ol>
<li>Why reinforcement learning?  One of the oddities of language is that there is more than one way of saying things.  Hence, the supervised learning view that there is a right answer and everything else is wrong sets up inherent conflicts in the optimization. Hence, “reinforcement learning from human feedback” pairs inverse reinforcement learning to discover a reward function and basic reinforcement learning to achieve better performance.  What’s remarkable about this is that the two-step approach is counter to the <a href="https://en.wikipedia.org/wiki/Data_processing_inequality">information processing inequality</a>.</li>
</ol>
</li>
</ol>
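<p>The “soft attention” mentioned in point 2 is, functionally, a differentiable key-value lookup: a query retrieves a softmax-weighted blend of all the values rather than a single hard match. Here is a minimal NumPy sketch of that idea (my own illustration of the general mechanism, not any particular model’s implementation):</p>

```python
import numpy as np

def soft_attention(queries, keys, values):
    """Differentiable key-value lookup: each query row gets a
    softmax-weighted average of the value rows, weighted by
    query-key similarity."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ values                          # blended values

# A query close to the first key retrieves mostly the first value.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0]])
out = soft_attention(np.array([[5.0, 0.0]]), keys, values)
```

<p>Because the lookup is a smooth weighted average rather than a hard argmax, gradients flow through the weights, which is what makes the mechanism both GPU-friendly and trainable at scale.</p>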
<p>The overall impression that I’m left with is something like the “ghost of the internet”.  If you ask the internet for the answer to a question on the best forum available and get an answer, it might be in the ballpark of as useful and as correct as that which GPT4 provides (notably, in seconds).  <a href="https://www.amazon.com/AI-Revolution-Medicine-GPT-4-Beyond/dp/0138200130/ref=sr_1_1?crid=2JMRKC2V7HQW">Peter Lee’s book</a> on the application to medicine is pretty convincing.  There are pluses and minuses here—GPT4’s abstraction of language tasks like summarization and style appears super-human, or at least better than I can manage.  For commonly discussed content (e.g. medicine) it’s fairly solid, but for less commonly discussed content (say, <a href="https://bg.battletech.com/forums/fan-designs-rules/">Battletech fan designs</a>) it becomes sketchy as the internet gives out. There are obviously times when it errs (often egregiously in a fully confident way), but that’s also true in internet forums.  I specifically don’t trust GPT4 with math and often find its reasoning and abstraction abilities shaky, although it’s deeply impressive that they exist at all.  And driving a car is out because it’s a task that you can’t really describe.</p>
<p><b>What about the future?</b><br>
There’s been a great deal of discussion about the danger of AI recently, and quite a mess of mistaken expectations about where we are.</p>
<ol>
<li>Is GPT4 and future variants the answer to [insert intelligence-requiring problem here]?  GPT4 seems most interesting as a language intelligence.  It’s clearly useful as an advisor or a brainstormer.  The meaning of “GPT5” isn’t clear, but I would expect substantial shifts in core algorithms/representations are necessary for mastering other forms of intelligence like memory, skill formation, information gathering, and optimized decision making.</li>
<li>Are generative models the end of consensual reality?  Human societies seem to have a systematic weakness in that people often prefer a consistent viewpoint even at the expense of fairly extreme rationalization.  That behavior in large language models is just looking at our collective behavior through a mirror.  Generative model development (both language and video) does have a real potential to worsen this. I believe we should be making real efforts as a society to harden and defend objective reality in multiple ways.  This is not specifically about AI, but it would address a class of AI-related concerns and improve society generally.</li>
<li>Is AI about to kill everyone? <a href="https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/">Yudkowsky’s editorial</a> gives the impression that a <a href="https://en.wikipedia.org/wiki/Terminator_(franchise)">Terminator-style apocalypse</a> is just around the corner.  I’m skeptical about the short term (the next several years), but the longer term requires thought.
<ol>
<li>In the short term there are so many limitations of even GPT4 (even though it’s a giant advance) that I both lack the imagination to see a path to “everyone dies” and I expect it would be suicidal for an AI as well.  GPT4, as an AI, is using the borrowed intelligence of the internet.  Without that source it’s just an amalgamation of parameters with no interesting capabilities.</li>
<li>For the medium term, I think there’s a credible possibility that drone warfare becomes ultralethal inline with <a href="https://www.youtube.com/watch?v=M7mIX_0VK4g">this imagined future</a>.  You can already see drone warfare in the Ukraine-Russia war significantly increasing the lethality of a battlefield.  This requires some significant advances, but nothing seems outlandish.  Counterdrone technology development and limits on usage inline with other war machines seems prudent.</li>
<li>For the longer term, Vinge’s classical <a href="https://edoras.sdsu.edu/~vinge/misc/singularity.html">singularity essay</a> is telling here as he lays out the inevitability of developing intelligence for competitive reasons.  Economists are often fond of pointing out how job creation has accompanied previous mechanization-induced job losses, and yet my daughter points out how we keep increasing the amount of schooling children must absorb to be capable members of society.  It’s not hard to imagine a desolation of jobs in a decade or two where AIs can simply handle almost all present-day jobs and most humans can’t skill-up to be economically meaningful.  Our society is not prepared for this situation—it seems like a quite serious and possibly inevitable outcome.  Positive models for a nearly-fully-automated society are provided by <a href="https://en.wikipedia.org/wiki/Star_Trek">Star Trek</a> and <a href="https://en.wikipedia.org/wiki/Iain_Banks">Iain Banks</a>, although science fiction is very far from a working proposal for such a society.</li>
<li>I’m skeptical about a <a href="https://en.wikipedia.org/wiki/The_Lawnmower_Man_(film)">Lawnmower Man</a> like scenario where a superintelligence suddenly takes over the world.  In essence, cryptographic barriers are plausibly real, even to a superintelligence.  As long as that’s so, the thing to watch out for is excessive concentrations of power without oversight.  We already have a functioning notion of super-human intelligence in <a href="https://en.wikipedia.org/wiki/Organizational_intelligence">organizational intelligence</a> and are familiar with techniques for restraining organizational intelligence into useful-for-society channels.  Starting with this and improving seems reasonable.</li>
</ol>
</li>
</ol>
]]></description>
      <pubDate>Wed, 05 Apr 2023 21:44:38 +0000</pubDate>
      <link>https://hunch.net/?p=13763005</link>
      <dc:creator>Machine Learning (Theory)</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4064459914</guid>
    </item>
    <item>
      <title><![CDATA[Dhruv Batra @DhruvBatraDB]]></title>
      <description><![CDATA[  <div class="feedbin--wrap">
    <p class="feedbin--content-text">A lot of my arguments about the foundations of intelligence being sensorimotor control (and not language or reasoning) are shaped by discussions with Jitendra over the years. 

This is a good summary of his arguments.</p>
      <blockquote class="feedbin--quoted">
        <p>
          <a href="https://twitter.com/JitendraMalikCV">
            <strong class="feedbin--name">Jitendra MALIK</strong>
            <span class="feedbin--screen-name">@JitendraMalikCV</span>
          </a>
        </p>
        <p class="feedbin--content-text">I delivered the 110th Annual Martin Meyerson UC Berkeley Faculty Research Lecture on March 20, 2023. <a href="https://t.co/xQKmXSNZf0" rel="nofollow" title="https://www.youtube.com/watch?v=f6fDpKDxpL0"><span class="tco-ellipsis"><span>&nbsp;</span></span><span>https://www.</span><span class="js-display-url">youtube.com/watch?v=f6fDpK</span><span>DxpL0</span><span class="tco-ellipsis"><span>&nbsp;</span>…</span></a></p>
      </blockquote>
      
          <div class="feedbin--embed-wrap"><iframe src="https://www.youtube.com/embed/f6fDpKDxpL0" height="720" width="1280" frameborder="0" allowfullscreen="allowfullscreen"></iframe></div>

  </div>
]]></description>
      <pubDate>Fri, 24 Mar 2023 01:21:51 +0000</pubDate>
      <link>https://twitter.com/DhruvBatraDB/status/1639075019258228738</link>
      <dc:creator>@DhruvBatraDB</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4049965666</guid>
    </item>
    <item>
      <title><![CDATA[Film Study for Research]]></title>
      <description><![CDATA[<!--kg-card-begin: markdown--><p>Research ability, like most tasks, is a trainable skill. However, while PhD students and other researchers spend a lot of time <em>doing</em>  research, we often don't spend enough time training our research abilities in order to improve. For many researchers, aside from taking  classes and reading papers, most of our training is implicit, through doing research and interacting with mentors (usually a single  mentor--our PhD advisor or research manager). By analogy, we are like basketball players who somehow made it to the NBA, and are now hoping that simply playing basketball games will be enough to keep improving.</p>
<p>Drawing on this analogy, I want to talk about two habits that are ubiquitous among elite athletes, that have analogs in research that I  feel are underutilized. Those who do pursue these habits as PhD students often improve quickly as researchers.</p>
<!--more-->
<p>The first habit is <em>film study</em>. Almost every high-level athlete watches films of other players of the same sport, including historical  greats, contemporary rivals, and themselves. This allows them to incorporate good ideas from other players' games as well as to catch and  eliminate flaws in their own game. Even the very best players benefit from watching film of themselves and others.</p>
<p>The second habit, which I call <em>act-reflect-ask</em>, occurs in the course of a game or scrimmage. I'll describe this from my own experience  (although I'm by no means an elite athlete, I've learned this from people who are). After a point ends, I generally think about what  happened during the point--Was there anything I wanted to do better? Did anything unexpected happen? Then I'll re-run those parts in my  head, simulating what I would have done differently until I feel like I know how to consistently make the right decision. In some cases,  I can't figure it out--perhaps I was playing defense, someone beat me, and I can't figure out what they did or can't figure out the  counter. In that case I'll ask a teammate about it (or the person who beat me, if it's a friendly scrimmage) and talk it over until I  see the right strategy for the future.</p>
<p>Both of these strategies are invaluable for improving. They leverage the fact that as humans, we tend to learn socially: we are very  good at adopting strategies from others, so film study and asking are efficient ways to learn. Both strategies also lead to deliberate  practice focused on real-world contexts. Below, I'll show that these strategies have analogs in research, and argue that good researchers  should adopt both into their own habits.</p>
<h3 id="filmstudy">Film Study</h3>
<p>As mentioned above, good athletes watch lots of film of other athletes. This extends to other skills as well--most chess players,  including grandmasters, study games by both contemporary and historical greats. They do this to understand how other very strong players  play, in order to adopt ideas and, in the case of rivals, to counter those ideas (this part is less relevant to research). Even the very  best players do this.</p>
<p>What is the equivalent to this in research? Ideally, we would watch world experts as they work, observing how they think, perform  experiments, and so on. Unfortunately, this is difficult--much research work is internal rather than external, and we don't routinely  film great researchers in the same way as we do with athletes. The closest obvious analog is working closely with a mentor, as many PhD  students do with their PhD advisor. Then, it is often possible to see first-hand how a more experienced researcher approaches a problem.  However, this isn't scalable, and most people only get to do this with one person--their advisor. (As an aside, it is very useful for  students to develop a good model of their advisor's thinking style--I think this tends to be underrated.)</p>
<p>A more scalable approach would be reading papers, but this doesn't achieve the full goal of film study--you only see the finished product,  rather than the thought process, and it tends to only show the part of a writer's thoughts that are widely defensible. What we want is a  public record of someone's thoughts, including off-the-cuff thoughts that wouldn't make it into a paper.</p>
<p>In fact, we do have this, in the form of blogs. The right type of blogs can provide a valuable form of "film study". I personally learned  a lot about statistics from <a href="http://andrewgelman.com/">Andrew Gelman</a>'s blog. Often, someone sends him a paper and he just gives his  off-the-cuff reactions to it: what he liked and didn't, what was convincing, what parts seem sketchy. I probably learned more from  reading his blog than from statistics classes (of which I've taken embarrassingly few, yet somehow managed to get hired by a Statistics  department; I'll credit Gelman for this). <a href="https://www.scottaaronson.com/blog/">Scott Aaronson</a>'s blog is good in the same way for  theoretical computer science. Many posts on the <a href="https://blog.givewell.org/">GiveWell</a> and  <a href="https://www.openphilanthropy.org/blog">Open Philanthropy</a> blogs are good in this way, too. In all cases, I'd look at  the earlier rather than later posts (though not the very earliest); the reason is that once blogs have too large an audience, writers  start to feel constrained to write more "professionally" and you get less of the valuable off-the-cuff thinking.</p>
<p>In addition to blogs, debates are another good source of off-the-cuff, in-the-moment thinking, as long as the participants don't overprepare and as long as they are trying to make good arguments rather than score rhetorical points. Actually, the best debates I've seen also take place via blogs, such as the <a href="https://blogs.worldbank.org/impactevaluations/worm-wars-anthology">debate over de-worming</a> in global health. Seminars can be good film study, but are primarily film study for giving presentations rather than doing research (and for this, also watch recordings of great talks online). Seminar Q&amp;A can be good film study for research thinking, as long as participants are opinionated and express those opinions in a clear way that exposes their underlying mental model. For programming, you can watch people code on <a href="https://www.twitch.tv/directory/all/tags/a59f1e4e-257b-4bd0-90c7-189c3efbf917">Twitch</a>, or pair program with other students in your research group.</p>
<p>The above are all useful sources of in-the-moment thinking. For research, we also make decisions--such as what directions to pursue--that  have consequences on the scale of years. To film study these, I read histories of important scientific developments. Good histories will  follow individuals around in detail for an extended period of time, ideally with primary sources. For instance,  <em>The Making of the Atomic Bomb</em> covers developments in physics up to and through the Manhattan project, and discusses many of the  decisions, discoveries, and dead ends faced by Fermi, Szilard, Oppenheimer, and others. (The dead ends are especially important, so that  you can see the whole process and not just what is useful today.) Another great example is <em>The Eighth Day of Creation</em>, which does the  same for the development of modern biotechnology. Such histories have helped me gain a better understanding of how science develops on  the scale of years or decades, which I would otherwise have to learn the hard way, over my own years and decades of research.</p>
<p>Some other miscellaneous advice: transcripts of talks can sometimes be good in the same way as blogs. Richard Hamming's  "<a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.html">You and Your Research</a>" is excellent on this front. For talks, recording  yourself and watching the recording may be the fastest route to improvement. Finally, in addition to histories, case studies (often  taught in law or business courses) also provide information that would be expensive to gather otherwise.</p>
<p>In summary: use blogs as film study for off-the-cuff research thinking; watch great presentations and record yourself to learn how to speak; pair program and watch programming streams; and read histories of science for long-term research decisions.</p>
<h3 id="actreflectask">Act-Reflect-Ask</h3>
<p>In the act-reflect-ask loop, we reflect on whether something could have gone better after we do it, and ask someone else if we can't figure it out. There are many ways to do this in research:</p>
<ul>
<li>When seeing a proof, if you don't see how you would have come up with the proof yourself, discuss with others how to do so (this is usually what people mean when they ask “what's the motivation for that step?”). The same goes whenever you see a cool experiment or idea that you're not sure you would have come up with yourself. First try to think about whether there's a way to modify your thought process to reliably come up with such ideas in the future. If not, discuss with the presenter so that you can learn.</li>
<li>After you give a talk, pull aside one of the audience members and get feedback on what worked/didn't work in the talk.</li>
<li>After attending a seminar, discuss what was or wasn't convincing, what was most interesting, etc. Paper reading groups are valuable as they often focus on this. (This isn't quite act-reflect-ask since the seminar was given by someone else; but you can think of it as a way of checking your own thoughts during the seminar against others'.)</li>
<li>Every week, reflect on what things felt less efficient than they needed to be. Think for yourself how to improve these, then talk to friends, colleagues, or mentors to get additional ideas.</li>
</ul>
<p>In addition to helping yourself improve, these habits help others as well--asking someone for advice engages their own thinking in a growth-oriented direction, so by helping you they are likely improving themselves, too. This also helps at the level of teams, as it builds chemistry and creates a shared culture of excellence and growth. Indeed, in sports, the best teams do this regularly, and veteran players are proactive in finding ways to help younger players. Some professional players even stay in a league, making millions of dollars a year, solely by being excellent sources of advice and mentorship.</p>
<h3 id="summary">Summary</h3>
<p>Find ways to routinely study research decision-making, through blogs, seminars, video streams, and histories. Actively consume these to  adopt and build up effective mental heuristics. Whenever you do something, reflect on how it could be better, and ask others for advice.  As you learn more yourself, find ways to give back to others. Consistently doing these will help you to become a better researcher over  time, and contribute to a culture of excellence among those around you.</p>
<!--kg-card-end: markdown-->]]></description>
      <pubDate>Mon, 28 Jun 2021 07:00:00 +0000</pubDate>
      <link>https://bounded-regret.ghost.io/film-study/</link>
      <dc:creator>Bounded Regret</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/3374076470</guid>
    </item>
    <item>
      <title><![CDATA[Principles for Productive Group Meetings]]></title>
      <description><![CDATA[<!--kg-card-begin: markdown--><p><em><strong>Note</strong>: This post is based on a Google document I created for my research group. It speaks in the first person, but I think the lessons could be helpful for many research groups, so I decided to share it more broadly. Thanks to Louise Verkin for converting from Google doc to Markdown format.</em></p>
<p>This document talks about principles for having productive group meetings and seminars, and to some extent a good group culture in general. It’s meant to be a living document--I’ve started it based on my own experiences, but ultimately our seminars and group culture come from all of us together. So if you have ideas you want to add, please do so!</p>
<p>I’ll start by talking about an important concept called <strong>psychological safety</strong>, then discuss what I see as the goals of our research group and how that fits into presentations and discussions in seminars and meetings. I’ll also provide tips for asking excellent questions and some general philosophy on how to hold yourself to a high standard of understanding.</p>
<h1 id="psychological-safety">Psychological Safety</h1>
<p>Psychological safety is an important concept for fostering creative and high-functioning teams. I would highly recommend reading the following two documents to learn about it in detail:</p>
<ul>
<li><a href="https://medium.com/@Harri_Kaloudis/psychological-safety-at-work-what-do-psychologically-safe-work-teams-look-like-5585ab0f2df4?ref=bounded-regret.ghost.io">What Do Psychologically Safe Work Teams Look Like?</a></li>
<li><a href="https://docs.google.com/document/d/1PsnDMS2emcPLgMLFAQCXZjO7C4j2hJ7znOq_g2Zkjgk/export?format=pdf&amp;ref=bounded-regret.ghost.io">Manager Actions for Psychological Safety</a></li>
</ul>
<p>To summarize, a psychologically safe team is one where members feel like:</p>
<ul>
<li>They can make mistakes without it affecting their status in the group</li>
<li>It is easy to give and receive feedback, including critical feedback, without feeling attacked or like one is causing trouble</li>
<li>One is allowed to and encouraged to question prevailing opinions</li>
</ul>
<p>These are especially important in research environments, because questioning and risk-taking are needed to generate creative ideas, and making mistakes and receiving feedback are necessary for learning.<br>
In general, I would encourage everyone in our group to take risks and make mistakes. I know everyone holds themselves to a high standard and so doesn’t like to make mistakes, but this is the main way to learn. In general, if you never do anything that causes you to look silly, you probably aren’t taking enough risks. And in another direction, if you never annoy anyone you probably aren’t taking enough risks. (Of course, you don’t want to do these all the time, but if it never happens then you can probably safely push your boundaries a bit.)</p>
<p><strong>Fostering psychological safety</strong>. As a group, here are some general principles for fostering psychological safety among our teammates:</p>
<ul>
<li>Assume your teammates have something to teach you, and try to learn from them.</li>
<li>In discussions and debates, aim to explain/understand, not to persuade. Adopt a frame of collaborative truth-seeking, rather than trying to “win” an argument.</li>
<li>Acknowledge and thank people for good points/questions/presentations/etc.</li>
<li>Invite push-back.</li>
<li>Welcome and encourage newcomers.</li>
</ul>
<p>In addition, there are a couple <strong>things to avoid</strong>:</p>
<ul>
<li>Try not to talk over people. This sometimes happens when you’re excited and engaged in a conversation, so don’t sweat it if you do it occasionally, but try not to do it habitually, and when you do interrupt someone, make sure to invite them to finish their point.</li>
<li>Avoid making broadly negative or dismissive statements. Even if you personally don’t intend such a statement to apply to anyone in the group, it’s inevitable that someone will take it personally. It also works against the principle of “questioning prevailing opinions”, because it implies that there’s an entire area of work or claims that is “off-limits”.<br><br>As an example, when I was a PhD student, a senior person often made claims to the effect that “research was pointless unless industry people cared about it”. This made it feel discouraging for me to do my (at the time) more theoretically-oriented work, and I abandoned at least one valuable project because of this. With the benefit of hindsight, I don’t think that person actually would have endorsed the literal claim I wrote above, but that’s exactly the point I’m making–it’s easy for other people to overinterpret claims.</li>
</ul>
<h1 id="group-goals-and-group-meetings">Group Goals and Group Meetings</h1>
<p>In my view, our group has three major goals:</p>
<ul>
<li>Do excellent research</li>
<li>Help each other to learn and grow</li>
<li>Help the world</li>
</ul>
<p>In the context of group meetings/seminars, we can promote these goals in the following ways:</p>
<ul>
<li>Hold yourself to a high standard of understanding (see below for more on this). In other words, don’t just follow the individual steps–try to understand why things had to be <em>this</em> way and not any other way. Asking questions about this not only helps your own understanding, but also pushes the speaker to clarify their own thinking–thus promoting the goals of excellent research and of learning.</li>
<li>It’s okay and encouraged to tie things back to the bigger picture. Excellent research is not only technically sound but also well-motivated. Understanding the bigger picture is also especially important for helping the world.</li>
<li>Try to ask questions in a way that succinctly models your own thinking process. One of the most valuable aspects of group meetings is that you can see how other people think, which helps learning. As a concrete example, sometimes in applied talks we ask questions that are very specific and only make sense to people immersed in that area. This is okay, but it’s better to ask the same question in a way that lets people not in that area see why the question is important.</li>
<li>As a speaker, don’t aim for the standard of “defensibility”. Instead, aim to convince the audience that you are onto something important and exciting (this is a different but not strictly higher standard, since it might involve saying some things that are only partially defensible). Similarly, as an audience member don’t be satisfied just because there’s “nothing wrong”–try to understand why a project was important enough that someone was excited to spend months of their life on it.</li>
</ul>
<p>In addition, here are some meta-level principles around question-asking:</p>
<ul>
<li>Basic understanding questions, even at the level of clarifying notation, are highly valuable and usually under-utilized because they don’t feel “smart”. I encourage everyone to ask these questions when they have them–if you’re confused, probably someone else is too, and it’s valuable feedback for the speaker.</li>
<li>I try to pay attention to how many other questions are being asked. If no one is asking questions, I’ll try to ask one to break the ice. If lots of questions are being asked, I’ll try to filter my own questions for the ones that are highest-value or most different from what’s already being discussed.</li>
<li>I also try to pay attention to how many questions I personally have already asked. If I haven’t asked a question yet I feel very free to ask one. If I’ve asked many already, I again try to filter for the highest-value ones.</li>
<li>As an audience member, you have much more cognitive bandwidth than the speaker. It’s therefore helpful to take the extra time to formulate your question to be easy to understand and engage with. It’s also good to state it succinctly when possible. Time spent formulating a question is time spent only by you, but time spent asking/answering it is spent by <em>everyone in the audience</em>.</li>
</ul>
<h1 id="seminar-norms">Seminar Norms</h1>
<p>The culture of a good seminar is different from the culture of everyday conversations, in a way that might not be obvious if you haven’t been immersed in it for a long time. I’ve already gone over that to some extent above, but below I’ll elaborate on some specific points in more detail, and lay out some helpful rules and norms that are usually unstated.</p>
<h2 id="audience-culture">Audience Culture</h2>
<p>There are many everyday social norms that hinder us from seeking a high level of understanding in a talk. Asking a question feels like a bid on the speaker’s and audience’s time and attention. We might worry that it’s a “dumb” question, or feel intimidated by a complicated statement that we don’t understand. Or conversely we might worry that it’s impolite or aggressive to ask for such a high (and, if we’re being honest, demanding) level of understanding. We might worry that we’re putting the speaker on the spot and that perhaps they won’t be able to answer and that we’ll make the <em>speaker</em> look “dumb”.</p>
<p>These are all natural and common thoughts to have from the perspective of everyday culture. But in my opinion, they come from a misconceptualization of seminar culture. Here is a conceptualization that can help dissolve these thoughts.</p>
<p><strong>You have a right to understand</strong>. If something is said in a seminar, you have a right to understand it. Science progresses not by ineffable truths that cannot be explained, but by clearly articulated common knowledge. It helps to also remember that:</p>
<ul>
<li>If you don’t understand something, it is likely that many other people do not as well.</li>
<li>Articulating a confusion is often itself a useful intellectual act. Sometimes we may not even realize that we are missing something until it is pointed out.</li>
</ul>
<p><strong>Asking questions shows respect</strong>. When I ask a question, it shows that I am interested enough in the topic to engage with it, and that I trust the speaker to give an informative answer. Not asking questions implies that the topic is either not worth engaging with, or that you don’t think the speaker is equipped to answer. Questions show respect.</p>
<h2 id="speaker-culture">Speaker Culture</h2>
<p><strong>You have a right to direct the conversation</strong>. A vigorous seminar audience will likely have more questions than you have time to answer, and might sometimes focus on early aspects of a talk that are not the main point. Therefore, as the speaker, you always have a right to direct the conversation to the aspects that will be most interesting or fruitful. You can simply politely cut off a current line of questioning by explaining that there are other topics you want to get to, and promising to engage later if necessary.</p>
<p><strong>Honest answers show courage</strong>. As the speaker, perceptive questions will often stretch the limits of your own understanding. It can be tempting to reflexively deflect or bluster to hide this. But it is much better to be honest about those limits (while feeling free to engage in speculation). Learning the limits of your own knowledge is also a great opportunity for growth.</p>
<h2 id="being-an-excellent-participant">Being an Excellent Participant</h2>
<p>The above norms for speakers and listeners set the ground rules for a productive seminar. But there is more you can do to help actively stimulate learning. Here are a few principles:</p>
<ul>
<li><em>As a listener, be mindful of cognitive load</em>. The speaker has to manage an entire audience of dozens of people, while you as a listener really only have to worry about yourself. So if there’s a question that’s bugging you, that the speaker doesn’t initially give a good answer to, try to do as much work as you can to productively reformulate your question, rather than making the speaker figure it out for you. (Of course, sometimes this isn’t possible, and the speaker does have the advantage of being the expert on the topic. But it’s good to try to offload cognitive load from the speaker whenever possible.)</li>
<li><em>As a listener, be mindful of tone</em>. This is in some sense a corollary of cognitive load. Certain tones take extra effort to gracefully process or to respond to (e.g. dismissiveness, condescension, extreme assertiveness, etc.). We should mostly want tone to be fairly neutral (neither timid nor overbearing, but curious and assertive).</li>
<li><em>As a speaker, be mindful of tone</em>. Treating questions dismissively will ensure that other people don’t ask questions. We generally don’t do this intentionally, but e.g. giving a short, confident-sounding, but incomplete answer can make it psychologically harder to ask follow-up questions.</li>
<li><em>As a speaker, avoid rambling</em>. Sometimes when we aren’t completely satisfied with our own answer, we end up rambling or repeating the same answer in several different ways. This can end up taking up several minutes of time if you don’t catch yourself. Once you’ve said what you have to say, move on to the next slide or the next question (fine to acknowledge if you think there might be more to say after further thought).</li>
</ul>
<p>None of these are things we will remember all the time, and it's not a big deal if you forget, but these are all habits to aspire to that will improve the experience for both you and others.</p>
<h2 id="tips-for-high-trust-environments">Tips for High-Trust Environments</h2>
<p>For high-trust environments (like our own group meeting), we can do even better. Here we can keep in mind that everyone is on the same team, and our goal is to help each other excel. In particular:</p>
<ul>
<li><em>Don’t be afraid to ask tough questions</em>. Our meeting is a safe space, and asking tough questions now helps the speaker think through them before they present externally.</li>
<li>Hold others to the standard you would hold yourself. From knowing all of you, I know that we all hold ourselves to a high personal standard–we want to do excellent work on the most important problems in ML. Let’s call this the <em>standard of excellence</em>. In seminars, I think we sometimes make the mistake of holding the speaker to the <em>standard of defensibility</em>: can they give a reasonable-seeming answer to questions of why/how they did something? Defensibility isn’t just too low of a standard, it’s actually the wrong standard: any ambitious project is going to go out on a limb in some ways, and there will be parts of it that are more speculative. Optimizing for defensibility leads us to avoid ambition. So get the speaker to convince you that this is excellent, rather than defensible, work.</li>
</ul>
<p>For a completed project, my aspirational goal as a speaker is usually to convince the audience that my work addresses a key issue on one of the most important problems in the field (or ideally the world), and that they should be working on this question if they have the right skillset. I almost never meet this goal, but the point is that striving for it leads me to meet higher levels of excellence over time. I think we should all at least periodically strive for this goal in our talks, realizing that we won’t meet it but that the gap can reveal important lessons or important directions of future work. Similarly, as an audience we should consider holding the speaker to this standard. At the same time, we should recognize that anyone who is even inviting this standard in the first place is already performing an act of virtue, and that even being able to talk about where it falls short means that it’s in a comparison class with outstanding work.</p>
<p>On the other hand, many of the presentations in our group are (and should be) on preliminary work or half-baked ideas. Here the above standard is not particularly helpful, and the honest answer to some questions will be “I dunno, I just have some vague intuition that this is a good idea”. Asking those questions is still valuable as long as they are well-targeted (in the sense that we could reasonably expect a more interesting answer than “I have some vague intuition”, or if they point to a place where it would be particularly useful to refine the intuition). But it’s also useful to think in terms of more brainstorm-y questions: “Have you tried X?”, “This seems related to other interesting thing Y”, “What about this alternative framing?”, “I think your high-level question is interesting, but how do you grapple with key conceptual issue Z? Maybe you could try this technique”. Actually, these are great questions even for a fully-baked talk. But for half-baked ideas we should conspicuously increase the number of these types of questions, because the goal is to help give the speaker useful ideas rather than to construct a thorough collective understanding of the topic.</p>
<p>If you’re a speaker who feels nervous giving talks, remember that you’re among friends whose ultimate goal is to help you do great research. This is the time to take risks, get feedback, and grow. Similarly, if you’re an audience member who feels hesitant to ask questions, think of this as the place to expand your comfort zone and try things you wouldn’t normally try. And of course, if you have any thoughts or questions about any of this, feel free to leave a comment here or ask me one-on-one.</p>
<h2 id="levels-of-understanding">Levels of Understanding</h2>
<p>Finally, I want to talk about different levels of <em>understanding</em> (which is, after all, the point of a seminar).</p>
<p><em>(<strong>Note</strong>: The first example below is a bit dense because it’s about a mathematical definition. Feel free to skip to the second example, on robustness, if it’s too much effort to decipher.)</em></p>
<p>Let’s suppose that in some talk you see the following definition:</p>
<blockquote>
<p>A function f on [0,1] is Hölder continuous with parameter α if, for k = floor(α), it satisfies |f<sup>(k)</sup>(x)-f<sup>(k)</sup>(y)|≤C|x-y|<sup>α-k</sup> for some constant C&gt;0, for all x,y.</p>
</blockquote>
<p>This definition is probably mysterious to you (it was to me). Let’s suppose you ask the speaker for some intuition on what this definition is doing. There’s at least three levels of explanation they could give:</p>
<p><strong>Level 1</strong>: For α=1 this is the same as being Lipschitz, so think of this as a generalization of Lipschitz.</p>
<p><strong>Level 2</strong>: Morally, this is asking that the function be “α times differentiable”, where we want α to not necessarily be a whole number. For integer α the condition exactly says that f should have α derivatives, while for α&lt;1 it asks the function locally to grow as |x-y|<sup>α</sup>, which is weaker than differentiability but approaches differentiability as α-&gt;1.</p>
<p><strong>Level 3</strong>: A level 2 explanation, plus a description of in what sense this is really a generalization of differentiability (i.e. what analogous properties we get), or some explanation of why this is the “right” way to generalize differentiability. [I don’t actually know the answer to this…]</p>
<p>Of course, the level 3 or level 2 explanation might take too long to get across in a talk. But it’s useful to realize that level 3 is always out there, and to notice as a listener when you’re only at level 1 or level 2. And as a speaker, if you don’t have time for at least a level 2 explanation, consider if this definition is really worth putting up there (why not just talk about regular old differentiability and then mention that there’s a generalization?).</p>
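To make the Level 1 and Level 2 explanations concrete, here is a standard worked example (my own addition, not from the original talk): for α = 1/2 we have k = 0, and f(x) = √x on [0,1] is Hölder continuous but not Lipschitz.

```latex
% Standard example (not from the talk): f(x) = \sqrt{x} on [0,1] is
% Holder continuous with \alpha = 1/2 (so k = 0) but not Lipschitz.
\[
  |\sqrt{x}-\sqrt{y}|^{2}
  \le |\sqrt{x}-\sqrt{y}|\,(\sqrt{x}+\sqrt{y})
  = |x-y|
  \quad\Longrightarrow\quad
  |\sqrt{x}-\sqrt{y}| \le |x-y|^{1/2},
\]
% so the Holder condition holds with C = 1. Lipschitz (\alpha = 1) fails:
\[
  \frac{|\sqrt{x}-\sqrt{0}|}{|x-0|} = x^{-1/2} \to \infty
  \quad\text{as } x \to 0^{+},
\]
% so no constant C can bound |f(x)-f(y)| by C|x-y| near 0.
```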
<p>These levels apply to all aspects of a talk, not just mathematical definitions. For instance, imagine a talk about robustness, where the speaker is describing the motivation for their work.</p>
<p><strong>Level 1</strong>: Robustness is important.</p>
<p><strong>Level 2</strong>: The problem we’re considering gets at the following aspect of robustness, which is important.</p>
<p><strong>Level 3</strong>: In the field of robustness, one of the core difficulties is X (as evidenced by {conceptual issue, consultation with practitioners, etc.}). We will tackle problem P which offers a way forward on addressing X.</p>
<p>And for motivation in particular, there’s also a final level:<br>
<strong>Level 4</strong>: In the world at large, M is one of the most important problems, as evidenced by {effect on GDP, important historical analogues, effect on important institutions, etc.}. Machine learning robustness offers a uniquely compelling angle on M for reasons R. &lt;Followed by level 3 explanation&gt;</p>
<p>In practice, it is rare for a seminar to ever touch on Level 4. This is probably partly due to time constraints, partly because many academics consider it “out of scope”, and partly because of the possibly impolite implication that other fields of study are less important. The main exception is job talks, where something on level 4 is expected. I think it’s probably correct for Level 4 to be rare in seminars, but I’d personally also like to see slightly more of it at the current margin. For instance, if you’re at the point of presenting a body of work rather than a single paper, I think it’s worthwhile to at least argue for why this is a compelling direction <em>within the field of ML</em> (we could call that level 3.5).</p>
<p>Finally, while addressing the higher levels requires a deep understanding on the part of the speaker, there are similar levels that apply even to something that isn’t well-understood. For instance, suppose in an applied ML talk, there is a mysterious heuristic H that improves the results. One could say:</p>
<p><strong>Level 1</strong>: 	H works.</p>
<p><strong>Level 2</strong>: 	H works, and we have no idea why.<br>
OR 	H works, for intuitive reason R.</p>
<p><strong>Level 3</strong>:	H works, and we have no idea why. We haven’t really looked into it [possibly followed by reason why this isn’t a core issue for the present work].<br>
OR	H works, and we have no idea why. We tried looking into X,Y,Z to understand it but none of them turned up much insight.<br>
OR	H works, for what we speculate is intuitive reason R, but we haven’t really looked into it.<br>
OR	H works, for what we think is intuitive reason R, and here’s some additional follow-up evidence that seems to support R.</p>
<p>Note that at each level, there are multiple possible explanations depending on the speaker’s actual level of knowledge. Level 1 simply asserts the empirical observation. Level 2 couples it with the speaker’s opinion about the observation, while Level 3 presents what I’d call the <em>full epistemic status</em> surrounding the observation (i.e. what surrounding questions have been investigated and how they support/don’t support different theories). Of course, the bottom example in Level 3 is preferable to the top example, but only one of those is an honest portrayal of the work, and the speaker doesn’t have the power to change that during a talk. What they do have power over is whether they give a Level 1, 2, or 3 explanation. Therefore, as the speaker, have the courage to give a Level 3 explanation even if it acknowledges uncertainty, and as a listener have the wisdom to accept such a Level 3 explanation and to respect the speaker’s courage and integrity.</p>
<p><strong>Conclusion.</strong> Now that we have these levels in mind, we can better understand the seminar norms discussed above. The purpose of these norms is to reach the highest level of understanding possible about the most important aspects of a topic, and to socially reward speakers and listeners who move us towards that understanding.</p>
]]></description>
      <pubDate>Wed, 22 Mar 2023 00:47:54 +0000</pubDate>
      <link>https://bounded-regret.ghost.io/principles-for-productive-group-meetings/</link>
      <dc:creator>Bounded Regret</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4046879833</guid>
    </item>
    <item>
      <title><![CDATA[Ari Morcos @arimorcos]]></title>
      <description><![CDATA[  <div class="feedbin--wrap">
    <p class="feedbin--content-text">Web-scale data has driven the incredible progress in AI but do we really need all that data? 

We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (&amp; train time) by 2x w/ minimal performance loss.

🧵👇</p>
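<p>The core idea, as I understand it from the thread: embed every example, cluster the embeddings, and within each cluster drop all but one member of any near-duplicate pair. A minimal sketch of that pipeline (the crude k-means, the 0.95 cosine threshold, and the cluster count are placeholder choices of mine, not the paper’s):</p>

```python
# Minimal sketch of semantic deduplication in the spirit of SemDeDup:
# embed -> cluster -> drop near-duplicates within each cluster.
# Placeholder choices throughout; not the paper's exact recipe.
import numpy as np

def semantic_dedup(embeddings, n_clusters=4, threshold=0.95, seed=0):
    """Return sorted indices of examples to keep."""
    rng = np.random.default_rng(seed)
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Crude k-means (a real pipeline would use faiss or scikit-learn).
    centers = emb[rng.choice(len(emb), size=n_clusters, replace=False)]
    for _ in range(10):
        labels = (emb @ centers.T).argmax(axis=1)
        for k in range(n_clusters):
            members = emb[labels == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)
    keep = []
    for k in range(n_clusters):
        kept = []
        for i in np.flatnonzero(labels == k):
            # Keep i only if it is not a near-duplicate of anything kept.
            if all(emb[i] @ emb[j] < threshold for j in kept):
                kept.append(int(i))
        keep.extend(kept)
    return sorted(keep)
```

<p>Comparing only within clusters is what makes this tractable at LAION scale: the quadratic pairwise check runs over small clusters instead of the whole dataset.</p>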
      <div class="feedbin--meta-wrap">
        <p class="feedbin--meta">
          <a href="https://twitter.com/jefrankle/status/1637815681797242882">
            <small>Retweeted by Jonathan Frankle</small>
          </a>
        </p>
      </div>
      <p class="feedbin--media-wrap">
          <a href="https://pbs.twimg.com/media/FrpZoC3X0AIdTCC.jpg:4096x4096">
            <img src="https://pbs.twimg.com/media/FrpZoC3X0AIdTCC.jpg:large">
</a>      </p>
  </div>
]]></description>
      <pubDate>Mon, 20 Mar 2023 13:57:42 +0000</pubDate>
      <link>https://twitter.com/arimorcos/status/1637814198460612609</link>
      <dc:creator>@jefrankle</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4044588306</guid>
    </item>
    <item>
      <title><![CDATA[A rant on LLaMA: please stop training giant language models]]></title>
      <description><![CDATA[
<p>Meta AI recently released <a href="https://arxiv.org/abs/2302.13971">a new language model called LLaMA</a>. And by “released a model”, I mean “didn’t really release a model”. They released a really really nice form instead which you can fill out and then Meta will get back to you after snooping on you just to make sure you haven’t been naughty recently (did I mention the form is really nice and it’s public: EVERYBODY can fill out the form). Presumably, no weights for you (or just random weights for you) if they find out you have been a bit too naughty for their liking.</p>



<p>Anyway. So, these LLaMAs come in four different sizes: from 6.7B parameters (smol) to 65.2B parameters (chonky). The largest two models are trained for 1.4T tokens, whereas the smaller ones are trained for 1T tokens (not really sure why). This is roughly one epoch (effectively) over the training data. The largest model roughly follows the Chinchilla compute-optimal recipe. There’s nothing the least bit remarkable about the models or the training setup. It’s just the standard GPT model trained in the standard way. The training data is said to be all public, although I didn’t check this carefully for myself (one hopes that it’s not public in the Meta sense of public. Just kidding, but not really).</p>
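<p>As a quick sanity check on that Chinchilla claim, the usual rule of thumb is roughly 20 training tokens per parameter (the coefficient is my approximation of the Chinchilla result, not a number from the LLaMA paper):</p>

```python
# Back-of-envelope check of the "roughly Chinchilla compute-optimal" claim,
# using the approximate rule of thumb of ~20 training tokens per parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token budget for a given model size."""
    return n_params * tokens_per_param

for name, n_params, trained_tokens in [
    ("7B", 6.7e9, 1.0e12),    # smallest LLaMA, trained on 1T tokens
    ("65B", 65.2e9, 1.4e12),  # largest LLaMA, trained on 1.4T tokens
]:
    optimal = chinchilla_optimal_tokens(n_params)
    print(f"{name}: ~{optimal / 1e12:.2f}T tokens 'optimal' vs {trained_tokens / 1e12:.1f}T actual")
```

<p>The 65B budget comes out near the ~1.3T tokens the rule suggests, while the 7B model, at 1T tokens, is already well past its nominal ~0.13T "optimal" point; pushing further in that direction is exactly what this post argues for.</p>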



<p>The money figure in <a href="https://arxiv.org/abs/2302.13971">the LLaMA paper</a> (for me) is the following figure that shows the training loss curves for the four models (Figure 1):</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png"><img data-attachment-id="3910" data-permalink="https://severelytheoretical.wordpress.com/2023/03/05/a-rant-on-llama-please-stop-training-giant-language-models/screenshot-from-2023-03-03-19-35-47/" data-orig-file="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png" data-orig-size="592,415" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="screenshot-from-2023-03-03-19-35-47" data-image-description="" data-image-caption="" data-medium-file="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png?w=300" data-large-file="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png?w=580" src="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png?w=592" alt="" class="wp-image-3910" srcset="https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png 592w, https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png?w=150 150w, https://severelytheoretical.files.wordpress.com/2023/03/screenshot-from-2023-03-03-19-35-47.png?w=300 300w" sizes="(max-width: 592px) 100vw, 592px"></a><figcaption class="wp-element-caption">Tell me again: why are we still training <em>O(100B)</em> parameter models?</figcaption></figure></div>


<p>As you can see, there is no apparent saturation for the 7B and 13B parameter models. In fact, the training loss seems to be decreasing at roughly the same rate for all four models after around 300B tokens. Seeing this figure, one is immediately overcome by a sense of déjà vu: this is the GPT-3 paper all over again with its <a href="https://severelytheoretical.wordpress.com/2022/07/18/thoughts-on-the-new-scaling-laws-for-large-language-models/">severely (criminally!) undertrained small models</a>.</p>



<p>From the above figure, it looks distinctly possible (and indeed I would say quite likely) that, had the smallest two models been given the same amount of compute as the 65B parameter model, they would probably have matched or even surpassed it. Giving them the same amount of compute would mean training the 7B parameter model ~12.5x longer and the 13B parameter model ~7.6x longer (I calculated these numbers from the corresponding GPU-hours reported in Table 15 of the paper). Here’s what the training loss curves might have looked like in that scenario (you can click on the image for an enlarged view):</p>
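<p>For the record, those multipliers fall out of a one-line ratio of GPU-hours; the figures below are my own transcription of Table 15 of the LLaMA paper, so treat them as approximate:</p>

```python
# GPU-hours as reported in Table 15 of the LLaMA paper (arXiv:2302.13971),
# transcribed by hand here; treat the numbers as approximate.
gpu_hours = {"7B": 82_432, "13B": 135_168, "33B": 530_432, "65B": 1_022_362}

def compute_equal_multiplier(small: str, large: str = "65B") -> float:
    """How many times longer `small` must train to match `large`'s total
    compute, assuming training compute is proportional to GPU-hours."""
    return gpu_hours[large] / gpu_hours[small]

for size in ("7B", "13B"):
    print(f"{size}: train ~{compute_equal_multiplier(size):.1f}x longer")
```
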


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png"><img data-attachment-id="3907" data-permalink="https://severelytheoretical.wordpress.com/2023/03/05/a-rant-on-llama-please-stop-training-giant-language-models/llama-extrapolation/" data-orig-file="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png" data-orig-size="19146,1932" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="llama-extrapolation" data-image-description="" data-image-caption="" data-medium-file="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=300" data-large-file="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=580" src="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=1024" alt="" class="wp-image-3907" srcset="https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=1024 1024w, https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=2041 2041w, https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=150 150w, https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=300 300w, https://severelytheoretical.files.wordpress.com/2023/03/llama-extrapolation.png?w=768 768w" sizes="(max-width: 1024px) 100vw, 1024px"></a><figcaption class="wp-element-caption">plz train the smol one for this loooooooooooooooooooooooooooooooooooooooong</figcaption></figure></div>


<p>See just how much longer you would have to train the small models to match the compute given to the largest model? Now, you may laugh at my dumbass hand-drawn training loss curves, but I would submit to you that these dumbass hand-drawn curves are in fact much more rigorous than the dumbass “scaling laws” some really smart people came up with. My dumbass hand-drawn curves are also completely harmless, unlike the dumbass “scaling laws”, which had the overall pernicious effects of wasting a huge amount of resources and making these models much less accessible than they could have been.</p>



<p>Anyway. So, I’m trying to find a non-cynical explanation for this almost bizarre, persistent unwillingness to train small models for longer, but I can’t really find a very convincing one. Training a humongous model for a total of only one epoch over your training data is a phenomenon that does not really exist anywhere else in machine learning, to my knowledge. Take <a href="https://arxiv.org/abs/2205.01917">this CoCa paper</a>, for instance (which is <a href="https://paperswithcode.com/sota/image-classification-on-imagenet">~sota on ImageNet</a> as of this writing): it trains a ~2.1B parameter model on a billion-scale image-text dataset (~5B examples in total) for ~7 epochs (effectively).</p>



<p>Of course, I don’t believe for a second that people training these giant language models are actually dumb or ignorant, although from my experience in academia, I could probably justifiably claim that they might be a bit too credulous: you can make a surprisingly large number of people in these circles believe some really dumb shit if it’s said or done by a sufficiently high-prestige individual or individuals (just look at the insane “superintelligence” stuff, to give one example).</p>



<p>Anyway. So, my cynical interpretation? As I argued <a href="https://severelytheoretical.wordpress.com/2020/09/01/thoughts-on-gpt-3/">here</a> before, making these models less easily accessible and less easily controllable by others might be a feature, not a bug. I don’t believe, for instance, that OpenAI is really using a 175B parameter model for ChatGPT or for their other language products (<a href="https://orenleung.super.site/is-chatgpt-175-billion-parameters-technical-analysis">here</a> is an interesting analysis I saw recently that makes the same point, with some caveats), but they have an incentive for making people believe that they’re using a 175B parameter model and that it’s actually critical to use a giant model like that.</p>



<p>Last but not least, one final life lesson from all this, folks, is that whenever a theoretical physicist starts talking about power laws, just completely ignore them (and I really mean <em>completely</em>) and immediately run away in the opposite direction. It is my contention that nothing good has ever come out of a physicist blabbering about power laws.</p>
]]></description>
      <pubDate>Sun, 05 Mar 2023 19:44:41 +0000</pubDate>
      <link>https://severelytheoretical.wordpress.com/2023/03/05/a-rant-on-llama-please-stop-training-giant-language-models/</link>
      <dc:creator>Severely Theoretical</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4023985843</guid>
    </item>
    <item>
      <title><![CDATA[Coordination Headwind - How Organizations Are Like Slime Molds]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Tue, 21 Feb 2023 17:02:25 +0000</pubDate>
      <link>https://komoroske.com/slime-mold/</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4006945182</guid>
    </item>
    <item>
      <title><![CDATA[V.I. Arnold, On teaching mathematics]]></title>
      <description><![CDATA[]]></description>
      <pubDate>Tue, 21 Feb 2023 17:00:07 +0000</pubDate>
      <link>https://www.uni-muenster.de/Physik.TP/~munsteg/arnold.html</link>
      <dc:creator>Pages</dc:creator>
      <guid isPermaLink="false">https://feedbin.me/entries/4006938065</guid>
    </item>
  </channel>
</rss>
