Starred Articles

RT by @srush_nlp: I have some thoughts on this, and I mostly agree with you. But the picture is nuanced... Longer reply below: 👇 I agree that when RL is done right -- with the right prompt mixture, possibly a curriculum, and a sufficiently high-capacity base model, we should generally expect pass@k to increase with training steps, especially when we are not repeatedly running too many epochs over the same prompt mixture. In that sense, I agree that a drop in pass@k is not a universal phenomenon to fix by itself, but rather a symptom of something deeper with the cross product of the base model x prompt set x training epochs. For example, with typical prompt mixtures in open-source land and 4B or larger models, we haven’t seen this issue happen if we do not train for too long. That said, we do see this issue when the prompt mixture contains very different types of problems, and for certain classes of base models (we do not yet know how to precisely predict that this will happen for a base model, though I will give a mental model below for why this happens), pass@k does decrease in practice with multi-epoch training on the same prompt set. We have observed this consistently… A concrete example: suppose you train on a mixture of very easy problems (where the base model is already strong) and very hard problems (where the base model is weak, pass@k near 0), and the base model is not very large. RL will extract a strong learning signal from the easy subset and sharpen the policy there, but you will see pass@k dropping on the hard set. Importantly, this drop is not due to entropy or response length reduction on the easy subset, since collapsing onto the correct solution there is actually desirable, but instead due to negative transfer, i.e.. some kind of interference that transfers these phenomena to the hard prompts as well where entropy / response length collapse is undesirable. This phenomenon has essentially been referred to as ray interference in the multi-task RL literature (https://arxiv.org/abs/1904.11455). We discuss this in our blog post on training on hard problems where exploration is a challenge (https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems/, see the “interference” section). Intuitively, RL optimizes one subset much faster than the other, which can stall or even harm learning on the hard problems (see the cartoon below). The concern I have with fixing this via reward shaping or novel algorithmic tweaks is that these approaches often address the symptom of the pass@k drop, rather than the cause, which is negative transfer (across problems) during RL. If instead we modify the prompt set algorithmically, for example by adding guidance to hard prompts (as we explore in the blog, https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems/), using curricula, or filtering data, this issue may largely disappear since we make progress on all problems simultaneously (roughly). Changing the reward metric or directly optimizing pass@k, as many papers do (including an older one of mine 🙂), is unlikely to resolve the core challenge on its own and pass@k might still decrease for higher k. Given differences in base models and data compositions, it seems that different RL recipes may be better suited to different settings. This means, perhaps our goal should be to identify, as simply as possible, whether this issue is likely to arise in a given RL setting and then address it with the minimal approach. Of course, we do not know all the ways to solve this issue but I think we have a reasonable handle on what is happening and why. It would be very interesting to see if there are even simpler ways to improve interference in RL in general…

Sasha Rush / @srush_nlp — Fri, 19 Dec 2025 03:22:54 +0000

I have some thoughts on this, and I mostly agree with you. But the picture is nuanced... Longer reply below: 👇

I agree that when RL is done right -- with the right prompt mixture, possibly a curriculum, and a sufficiently high-capacity base model, we should generally expect pass@k to increase with training steps, especially when we are not repeatedly running too many epochs over the same prompt mixture. In that sense, I agree that a drop in pass@k is not a universal phenomenon to fix by itself, but rather a symptom of something deeper with the cross product of the base model x prompt set x training epochs. For example, with typical prompt mixtures in open-source land and 4B or larger models, we haven’t seen this issue happen if we do not train for too long.

That said, we do see this issue when the prompt mixture contains very different types of problems, and for certain classes of base models (we do not yet know how to precisely predict that this will happen for a base model, though I will give a mental model below for why this happens), pass@k does decrease in practice with multi-epoch training on the same prompt set. We have observed this consistently…

A concrete example: suppose you train on a mixture of very easy problems (where the base model is already strong) and very hard problems (where the base model is weak, pass@k near 0), and the base model is not very large. RL will extract a strong learning signal from the easy subset and sharpen the policy there, but you will see pass@k dropping on the hard set. Importantly, this drop is not due to entropy or response length reduction on the easy subset, since collapsing onto the correct solution there is actually desirable, but instead due to negative transfer, i.e.. some kind of interference that transfers these phenomena to the hard prompts as well where entropy / response length collapse is undesirable.

This phenomenon has essentially been referred to as ray interference in the multi-task RL literature (arxiv.org/abs/1904.11455). We discuss this in our blog post on training on hard problems where exploration is a challenge (blog.ml.cmu.edu/2025/11/26/h…, see the “interference” section). Intuitively, RL optimizes one subset much faster than the other, which can stall or even harm learning on the hard problems (see the cartoon below).

The concern I have with fixing this via reward shaping or novel algorithmic tweaks is that these approaches often address the symptom of the pass@k drop, rather than the cause, which is negative transfer (across problems) during RL. If instead we modify the prompt set algorithmically, for example by adding guidance to hard prompts (as we explore in the blog, blog.ml.cmu.edu/2025/11/26/h…), using curricula, or filtering data, this issue may largely disappear since we make progress on all problems simultaneously (roughly). Changing the reward metric or directly optimizing pass@k, as many papers do (including an older one of mine 🙂), is unlikely to resolve the core challenge on its own and pass@k might still decrease for higher k.

Given differences in base models and data compositions, it seems that different RL recipes may be better suited to different settings. This means, perhaps our goal should be to identify, as simply as possible, whether this issue is likely to arise in a given RL setting and then address it with the minimal approach. Of course, we do not know all the ways to solve this issue but I think we have a reasonable handle on what is happening and why. It would be very interesting to see if there are even simpler ways to improve interference in RL in general…

Why AGI Will Not Happen

Tim Dettmers — Wed, 10 Dec 2025 15:05:30 +0000

If you are reading this, you probably have strong opinions about AGI, superintelligence, and the future of AI. Maybe you believe we are on the cusp of a transformative breakthrough. Maybe you are skeptical. This blog post is for those who want to think more carefully about these claims and examine them from a perspective that is often missing in the current discourse: the physical reality of computation.

I have been thinking about this topic for a while now, and what prompted me to finally write this down was a combination of things: a Twitter thread, conversations with friends, and a growing awareness that the thinking around AGI and superintelligence is not just optimistic, but fundamentally flawed. The purpose of this blog post is to address what I see as very sloppy thinking, thinking that is created in an echo chamber, particularly in the Bay Area, where the same ideas amplify themselves without critical awareness. This amplification of bad ideas and thinking exhuded by the rationalist and EA movements, is a big problem in shaping a beneficial future for everyone. Realistic thought can be used to ground where we are and where we have to go to shape a future that is good for everyone.

I want to talk about hardware improvements, AGI, superintelligence, scaling laws, the AI bubble, and related topics. But before we dive into these specific areas, I need to establish a foundation that is often overlooked in these discussions. Let me start with the most fundamental principle.

Computation is Physical

A key problem with ideas, particularly those coming from the Bay Area, is that they often live entirely in the idea space. Most people who think about AGI, superintelligence, scaling laws, and hardware improvements treat these concepts as abstract ideas that can be discussed like philosophical thought experiments. In fact, a lot of the thinking about superintelligence and AGI comes from Oxford-style philosophy. Oxford, the birthplace of effective altruism, mixed with the rationality culture from the Bay Area, gave rise to a strong distortion of how to clearly think about certain ideas. All of this sits on one fundamental misunderstanding of AI and scaling: computation is physical.

For effective computation, you need to balance two things. You need to move global information to a local neighborhood, and you need to pool multiple pieces of local information to transform old information into new. While the complexity of local computation is virtually constant — much accelerated by smaller transistors — movement scales quadratically with distance to local computation units. While memory movement also benefits from smaller transistors, improvements become quickly sublinear due to the squared nature of memory access patterns.

This is most easily seen by looking at cache hierarchies. L1, L2 and L3 cache are physically the same technology, but computationally they are very different. L2 and L3 are much larger than L1, but they are also much slower. This is because L2 and L3 are further away, physically, from the computational core, and memory lookups need to traverse a longer distance due to the physical size.

Two ideas to remember: First, larger caches are slower. Second, as we get smaller and smaller transistors, computation gets cheaper, but memory becomes more expensive, relatively speaking. The fraction of silicon area dedicated to memory on a chip has increased over time to the point where now computational elements on a chip are trivial in proportion. Almost all area is allocated to memory. In other words, if you want to produce 10 exaflops on a chip, you can do that easily — but you will not be able to service it with memory, making it useless FLOPS (the NVIDIA marketing department is good at ignoring this fact). All of this makes AI architectures like the transformer fundamentally physical. Our architectures are not abstract ideas that can be developed and thrown around carelessly. They are physical optimizations of information processing units.

To process information usefully, you need to do two things: compute local associations (MLP) and pool more distant associations to the local neighborhood (attention). This is because local information alone only helps you to distinguish closely related information, while pooling distant information helps you to form more complex associations that contrast or augment local details. The transformer is one of the most physically efficient architectures because it combines the simplest ways of doing this local computation and global pooling of information. The global pooling of information might be made more effective through research, and there is still active investigation going on that I think might be promising, but it has diminishing returns — the transformer architecture is close to physically optimal.

Computation is physical. This is also true for biological systems. The computational capacity of all animals is limited by the possible caloric intake in their ecological niche. If you have the average calorie intake of a primate, you can calculate within 99% accuracy how many neurons that primate has. Humans invented cooking, which increased the physically possible caloric intake substantially through predigestion. But we reached the physical limits of intelligence. When women are pregnant, they need to feed two brains, which is so expensive that physically, the gut cannot mobilize enough macronutrients to keep both alive if our brains were bigger. With bigger brains, we would not be able to have children — not because of the birth canal being too small, but because we would not be able to provide enough energy — making our current intelligence a physical boundary that we cannot cross due to energy limitations.

We are close to reaching the same limits for digital computation.

Linear Progress Needs Exponential Resources

There have been studies about progress in all kinds of fields that come to the same conclusion: linear progress needs exponential resources. What does that mean? If you want to improve a system further and further, make it more precise, or improve its efficiency, you need exponentially more resources with any improvement that you make. This is true for all kinds of fields and problems being investigated, and it is pretty clear why.

There are two realities at play here: one physical and one in the idea space. In the physical reality, if you need to accumulate resources in time and space to produce an outcome, then for logistical reasons, the overall effect that is locally produced needs linear resources to produce a linear outcome. But because of physicality and because matter takes up space, those resources can only be pooled at an increasingly slowing rate due to contention in space or time.

In the idea space, there is a similar phenomenon, which is less obvious. If two ideas are completely independent, they can have an effect that is ten times larger than any single idea. But if ideas are related, then the overall impact is limited due to diminishing returns — the ideas are just too correlated. If an idea builds on another, it can only be so much better. Often, if there is a dependency between ideas, one is a refinement of the other. Refinements, even if they are extremely creative, will yield incremental improvements. If a field is large enough, even if one tries to work on very different ideas, they are still heavily related to previous ideas. For example, while state-based models and Transformers seem like very different approaches to attention, they concentrate on the same problem. Very minimal gains can be achieved through any idea that modifies attention in these ways.

These relationships are most striking in physics. There was a time when progress could be made by individuals – not so much anymore.

I talked to a top theoretical physicist at a top research university, and he told me that all theoretical work in physics is, in some sense, either incremental refinement or made-up problems. The core problem of the idea space is this: if the idea is in the same sub-area, no meaningful innovation is possible because most things have already been thought. A first urge is to look for wildy creative ideas, but the problem is that are still bound by the rules of that subspace that often exist for a very good reason (see graduate-student-theory-of-everything-phenomenon). So the theoretical physicist faces only two meaningful choices: refine other ideas incrementally, which leads to insignificant impact; or work on rule-breaking unconventional ideas that are interesting but which will have no clear impact on physical theory.

The experimental physics demonstrates the physical limitations. The experiments that test more and more fundamental laws of physics and constituent particles — in other words, the standard model — become increasingly expensive. The standard model is incomplete, and we do not know how to fix it. Higher energies at the Large Hadron Collider have only led to more inconclusive results and the ruling out of more theories. We have no understanding of what dark energy or dark matter is, even though we build increasingly complex experiments that cost billions of dollars. The reality might be that certain aspects of physics are unknowable, hidden by complexity that cannot be attained with the resources that we can muster.

If you want to get linear improvements, you need exponential resources.

GPUs No Longer Improve

One of the most common misconceptions I see is that people assume hardware keeps improving and improving. This is an important misconception that explains a lot of the poor thinking around AI progress. The efficiency of GPUs has driven almost all innovation in AI. AlexNet was only possible by developing one of the first CUDA implementations that could compute convolutions over networked GPUs. Further innovation was mostly possible through improved GPUs and using more GPUs. Almost everybody sees this pattern — GPUs improve, AI performance improves — and it is easy to think that GPUs will improve further and will continue to improve AI outcomes. Every generation of GPUs has been better, and it would seem foolish to think that it will stop. But actually, it is foolish to think that GPUs will continue to improve. In fact, GPUs will no longer improve meaningfully. We have essentially seen the last generation of significant GPU improvements. GPUs maxed out in performance per cost around 2018 — after that, we added one-off features that exhaust quickly.

The first of these one-off features was 16-bit precision, then Tensor Cores, or the equivalent, then high-bandwidth memory (HBM),then the TMA or equivalent, then 8-bit precision, then 4-bit precision. And now we are at the end, both in the physical and the idea space. I have shown in my paper about k-bit inference scaling laws what data types with particular block sizes and computational arrangements are optimal. This has already been adopted by hardware manufacturers. Any further improvement will lead not to straightforward improvements but to trade-offs: either better memory footprint at lower computational efficiency or higher computational throughput at higher memory footprint. Even if you can innovate – linear improvements, need exponential resources – further improvements will be trivial and will not add any meaningful advancement.

While GPUs can no longer improve meaningfully, rack-level optimizations are still critically important. Efficient shuttling of key-value caches is one of the most important problems in AI infrastructure. The current solution to this problem, however, is also relatively straightforward. Companies like OpenAI boast about their AI infrastructure, but it is relatively simple to design because there is essentially only one optimal way to design it. And while it is complex to implement, it just needs clear thinking and mostly hard, time-intensive engineering. But the overall system design is not particularly novel. OpenAI – or other frontier labs – have no fundamental advantage in their inference and infrastructure stacks. The only way to gain an advantage is by having slightly better rack-level hardware optimizations or data-center-level hardware optimizations. But these will also run out quickly – maybe 2026, maybe 2027.

Why Scaling Is Not Enough

In my Twitter thread, I talked about how Gemini might signal a plateau in AI progress in the sense that we might not see meaningful improvements anymore. A lot of people responded with something along the lines of, “You are being too pessimistic. Can you not see that scaling works?” The point here is a bit more subtle, so I want to elaborate.

I believe in scaling laws and I believe scaling will improve performance, and models like Gemini are clearly good models. The problem with scaling is this: for linear improvements, we previously had exponential growth as GPUs which canceled out the exponential resource requirements of scaling. This is no longer true. In other words, previously we invested roughly linear costs to get linear payoff, but now it has turned to exponential costs. That would not be a problem on its own, but it sets a clear physical limit on scaling that is rapidly approaching. We have maybe one, maybe two more years of scaling left because further improvements become physically infeasible. The scaling improvements in 2025 were not impressive. Scaling in 2026 and 2027 better work out better.

Despite these exponential costs, the current infrastructure build-out is reasonable, particularly with the growth of inference use, but it still creates a very precarious balance. The biggest problem is this: if scaling does not provide much larger improvements than research/software innovations, then hardware becomes a liability and not an asset.

Small players like MoonshotAI and Z.ai show that they do not need many resources to reach frontier performance (I personally prefer Kimi K2-thinking over Sonnet 4.5 for coding). If these companies innovate beyond scale, they might just create the best model. While they might still use existing infrastructure, they could just switch to Huawei Ascend chips for inference, which are more than fine for providing good inference performance.

Another big threat to scale-up-infrastructure is that, currently, large-model inference efficiency is strongly related to a large user base due to network scaling. The problem is that efficient deployments of a large model needs a certain amount of GPUs to be efficient enough to overlap computation with networking and KV-cache length partitioning. Such deployments are ultra-efficient but demand a large user base to unlock full utilization and with that, cost-effectiveness. That is why open-weight models currently have not had the expected impact, because the infrastructure cost of large deployments need a large user-base. However, this problem can be solved with software.

While vLLM and SGLang currently try to optimize frontier-type deployments, they do not provide this efficiency at smaller scales. With the right inference stack beyond vLLM/SGLang, people would be able to deploy a ~300-billion-parameter model with the same efficiency as OpenAI or Anthropic deploys their frontier models. If smaller models become more capable — we see this with GLM 4.6 — or if AI applications become more specialized, the infrastructure advantage of frontier labs might vanish overnight. The software complexity evaporates, and open-source, open-weight deployments might be close to physically optimal, both in terms of computational efficiency and information processing efficiency. This is a large risk for frontier players.

Under slowing scaling, any of these three factors might degrade the value of AI infrastructure significantly and rapidly: (1) research/software innovations, (2) strong open-weight inference stacks, (3) shift to other hardware.

The current trends do not look good for frontier labs.

Frontier AI Versus Economic Diffusion

The US and China follow two different approaches to AI. The US follows the idea that there will be one winner who takes it all – the one that builds superintelligence wins. Even coming short of superintelligence of AGI, if you have the best model, almost all people will use your model and not the competition’s model. The idea is: develop the biggest, badest model and people will come.

China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI. The key indicator of progress is how much AI is integrated into everything and how useful it is. If one model is better than another, it does not automatically mean it will be used more widely. What is important is that the model is useful and yields productivity gains at a reasonable cost. If the current approach is more productive than the previous one, it will be adopted. But hyper-optimization for slightly better quality is not very effective. In most cases, settling on “good enough” yields the highest productivity gain.

I think it is easy to see that the US philosophy is short-sighted and very problematic — particularly if model capability slows. The Chinese philosophy is more long-term focused and pragmatic.

The key value of AI is that it is useful and increases productivity. That makes it beneficial. It is clear that, similarly to computers or the internet, AI will be used everywhere. The problem is that if AI were just used for coding and engineering, it would have a very limited impact. While a lot of economic activity is supported by digital programs, these also have diminishing returns, and producing more software will not improve outcomes significantly if existing software is already good enough (just look at the SAAS failure in China). This makes wide-spread economic integration absolutely vital for AI effectiveness.

So in order to provide real value, AI needs to be used in ways that provide new benefits, not just improvements to what already exists. This is a difficult problem, but the right answer is to integrate AI into everything to squeeze out non-linear improvements, see what works and what does not, then keep what is working. China is taking this approach by subsidizing applications that use AI to encourage adoption. The Chinese population is very receptive to innovation, which facilitates this process. It is nothing unusual in China to see an 80-year-old grandma use AI to help her with their daily life. The US, on the other hand, bets on ideas like AGI and superintelligence, which I believe are fundamentally flawed concepts that have little relevance to future AI progress. This becomes clear when you think carefully about what these terms actually mean in physical reality.

AGI Will Never Happen, and Superintelligence Is a Fantasy

There is this pattern I have noticed: when you ask people in the Bay Area when AGI will happen, they always say it is a few years in the future, and it will have a massive impact. Then, if you ask them what AGI actually is, they do not include any physical tasks in their definition, and they do not consider resource inputs.

True AGI, that can do all things human, would need to be able to physical tasks – which comprises the largest economic sector. In short, AGI should include physical robots or machines that are able to do economically meaningful work in the physical world. While physical robots might be convenient for unloading your dishwasher, you will not see them replacing specialized systems in factories. Specialized robots in factories are too efficient, too precise. China demonstrates that dark factories — fully automated facilities — are already possible. Most robotics problems are solved problems in controlled environments. Most existing robotics problems that remain unsolved are also economically unviable. Stitching sleeves to a t-shirt is an unsolved robotics problem, but it is also not particularly economically meaningful in most contexts. Household robots will be interesting, but if it takes me two minutes to unload my dishwasher, I am not sure I need a robot for that. And while in a couple of years a robot might be able to fold laundry, I would rather spend a few minutes folding it myself with no creases than have a robot do a mediocre job.

The main problem with robotics is that learning follows scaling laws that are very similar to the scaling laws of language models. The problem is that data in the physical world is just too expensive to collect, and the physical world is too complex in its details. Robotics will have limited impacts. Factories are already automated and other tasks are not economically meaningful.

The concept of superintelligence is built on a flawed premise. The idea is that once you have an intelligence that is as good or better than humans — in other words, AGI — then that intelligence can improve itself, leading to a runaway effect. This idea comes from Oxford-based philosophers who brought these concepts to the Bay Area. It is a deeply flawed idea that is harmful for the field. The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns. So, superintelligence can be thought of as filling gaps in capability, not extending the frontier. Filling gaps can be useful, but it does not lead to runaway effects — it leads to incremental improvements.

Furthermore, the same people who think that GPUs will infinitely improve are often the people who think superintelligence will make those improvements faster and better. But they do not realize that GPUs can no longer be meaningfully improved. We can wait for better HBM memory technology for speed, and for chiplets and advanced packaging to improve yield/cost, but that is it. Rack-level optimization will likely hit the physical wall in 2026 or 2027. A superintelligence will not accelerate the progress made in HBM development, manufacturing, testing, and integration. The transformer architecture is close to physically optimal. Superintelligence will not be able to meaningfully improve neural network architectures. Efficient large-scale deployments for inference are largely a solved engineering problem. It just needs some careful engineering and time, but very little creativity is required to solve this problem close to physical optimality. Superintelligence will not be able to improve our inference stack by much.

A superintelligence might help with economic diffusion of AI technology, but in the end, the limiting factor of economic diffusion is implementation and adoption, not capability. It is clear to me that any organization that strives primarily for superintelligence as a goal will encounter significant challenges and will ultimately falter and be displaced by players that provide general economic diffusion.

In summary, AGI, as commonly conceived, will not happen because it ignores the physical constraints of computation, the exponential costs of linear progress, and the fundamental limits we are already encountering. Superintelligence is a fantasy because it assumes that intelligence can recursively self-improve without bound, ignoring the physical and economic realities that constrain all systems. These ideas persist not because they are well-founded, but because they serve as compelling narratives in an echo chamber that rewards belief over rigor.

The future of AI will be shaped by economic diffusion, practical applications, and incremental improvements within physical constraints — not by mythical superintelligence or the sudden emergence of AGI. The sooner we accept this reality, the better we can focus on building AI systems that actually improve human productivity and well-being.

The post Why AGI Will Not Happen appeared first on Tim Dettmers.

Find Space

boz. — Tue, 11 Nov 2025 00:00:00 +0000

People tend to all pursue the same opportunity but the greater impact is often had by pursuing the opportunity nobody else sees

Why speed matters

Daniel Lemire's blog — Fri, 05 Dec 2025 19:24:50 +0000

The one constant that I have observed in my professional life is that people underestimate the need to move fast.

Of course, doing good work takes time. I once spent six months writing a URL parser. But the fact that it took so long is not a feature, it is not a positive, it is a negative.

If everything is slow-moving around you, it is likely not going to be good. To fully make use of your brain, you need to move as close as possible to the speed of your thought.

If I give you two PhD students, one who completed their thesis in two years and one who took eight years… you can be almost certain that the two-year thesis will be much better.

Moving fast does not mean that you complete your projects quickly. Projects have many parts, and getting everything right may take a long time.

Nevertheless, you should move as fast as you can.

For multiple reasons:

1. A common mistake is to spend a lot of time—too much time—on a component of your project that does not matter. I once spent a lot of time building a podcast-like version of a course… only to find out later that students had no interest in the podcast format.

2. You learn by making mistakes. The faster you make mistakes, the faster you learn.

3. Your work degrades, becomes less relevant with time. And if you work slowly, you will be more likely to stick with your slightly obsolete work. You know that professor who spent seven years preparing lecture notes twenty years ago? He is not going to throw them away and start again, as that would be a new seven-year project. So he will keep teaching using aging lecture notes until he retires and someone finally updates the course.

What if you are doing open-heart surgery? Don’t you want someone who spends days preparing and who works slowly? No. You almost surely want the surgeon who does many, many open-heart surgeries. They are very likely to be the best one.

Now stop being so slow. Move!

Welcome, Mistral!

Lucas Beyer (bl16) / @giffmana — Thu, 04 Dec 2025 23:42:28 +0000

Welcome, Mistral!

xcancel.com/Laz4rz/status/1996601221319193002#m

Galaxy brain resistance

Pages — Mon, 17 Nov 2025 21:29:54 +0000

Kimi K2 Thinking

Pages — Fri, 07 Nov 2025 02:22:29 +0000

Epoch AI (@EpochAIResearch)

Pages — Tue, 04 Nov 2025 02:06:26 +0000

We looked at OSWorld, a popular evaluation of AI computer use capabilities. Our findings: tasks are simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time. See thread for details!

Nov 3, 2025 · 8:16 PM UTC

Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here: https://github.com/karpathy/nanochat/discussions/164 This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview: - You have to ensure diversity in user prompts/queries - For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc. - I am encouraging the model to solve the model in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with. If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)

Andrej Karpathy / @karpathy — Fri, 24 Oct 2025 15:35:18 +0000

Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here:
github.com/karpathy/nanochat…

This is done via a new synthetic task `SpellingBee` that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview:

- You have to ensure diversity in user prompts/queries
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the model in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.

If nanochat was a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because nanochat d32 "brain" is the size of a ~honeybee, if we want it to count r's in strawberry, we have to do it by over-representing it in the data, to encourage the model to learn it earlier. But it works! :)

RT by @zdhnarsil: RainbowGRPO

Dinghuai Zhang 张鼎怀 / @zdhnarsil — Fri, 17 Oct 2025 02:06:52 +0000

RainbowGRPO

xcancel.com/Devvrit_Khatri/status/1978864275658871099#m

Writing apps is so much easier than writing libraries. ✓ No multiple versions of Python to deal with. ✓ Less chance that a release will break somebody else's code. ✓ Your code only has to work for your own use case.

Will McGugan / @willmcgugan — Thu, 16 Oct 2025 13:21:50 +0000

Writing apps is so much easier than writing libraries.

✓ No multiple versions of Python to deal with.

✓ Less chance that a release will break somebody else's code.

✓ Your code only has to work for your own use case.

Claude Code sub-agents

Simon Willison's Weblog — Sat, 11 Oct 2025 21:29:39 +0000

Claude Code includes the ability to run sub-agents, where a separate agent loop with a fresh token context is dispatched to achieve a goal and report back when it's done. I wrote a bit about how these work in June when I traced Claude Code's activity by intercepting its API calls.

I recently learned that you can deliberately trigger Claude Code to use sub-agents just by telling it to do so!

Here's an example. I ran this using Claude Code against a fresh checkout of simonw/datasette (commit ec38ad37689f3c14d307770977a17aed5efc5cb9), with no additional custom instructions:

Write markdown documentation for the context passed to each of the templates in this project. Use sub-agents.

Here's a snippet of the output:

⏺ I'll help you write markdown documentation for the context passed to each template in this project. Let me start by exploring the codebase to understand the template structure and then use sub-agents to document them comprehensively.
⏺ Search(pattern: "**/*.html")
  ⎿  Found 47 files (ctrl+o to expand)

⏺ Search(pattern: "render.*template|render_template")
  ⎿  Found 16 files (ctrl+o to expand)
[...]

⏺ Now let me launch sub-agents to analyze different template contexts in parallel:
⏺ Task(Document index template context)
  ⎿  Done (12 tool uses · 86.4k tokens · 1m 13s)

⏺ Task(Document database template context)
  ⎿  Done (15 tool uses · 55.4k tokens · 1m 41s)

⏺ Task(Document table template context)
  ⎿  Done (14 tool uses · 85.2k tokens · 1m 44s)

⏺ Task(Document query/row template contexts)
  ⎿  Done (17 tool uses · 78.7k tokens · 1m 55s)

⏺ Task(Document special template contexts)
  ⎿  Done (26 tool uses · 116.1k tokens · 1m 50s)

It churned away for a couple of minutes (those sub-agents execute in parallel, another benefit of this pattern) and made me this template_contexts.md markdown file. It's very comprehensive.

Tags: sub-agents, coding-agents, ai-assisted-programming, claude-code, generative-ai, ai, llms

Rebuilding Devin for Claude Sonnet 4.5: Lessons and Challenges

Pages — Sun, 05 Oct 2025 18:54:43 +0000

We rebuilt Devin for Claude Sonnet 4.5.

The new version is 2x faster, 12% better on our Junior Developer Evals, and it's available now in Agent Preview. For users who prefer the old Devin, that remains available.

Why rebuild instead of just dropping the new Sonnet in place and calling it a day? Because this model works differently—in ways that broke our assumptions about how agents should be architected. Here's what we learned:

Because Devin is an agent that plans, executes, and iterates rather than just autocompleting code (or acting as a copilot), we get an unusual window into model capabilities. Each improvement compounds across our feedback loops, giving us a perspective on what's genuinely changed. With Sonnet 4.5, we're seeing the biggest leap since Sonnet 3.6 (the model that was used with Devin's GA): planning performance is up 18%, end-to-end eval scores up 12%, and multi-hour sessions are dramatically faster and more reliable.

In order to get these improvements, we had to rework Devin not just around some of the model’s new capabilities, but also a few new behaviors we never noticed in previous generations of models. We are sharing some of our observations below:

The model is aware of its context window

Sonnet 4.5 is the first model we've seen that is aware of its own context window, and this shapes how it behaves. As it approaches context limits, we've observed it proactively summarizing its progress and becoming more decisive about implementing fixes to close out tasks.

This "context anxiety" can actually hurt performance: we found the model taking shortcuts or leaving tasks incomplete when it believed it was near the end of its window, even when it had plenty of room left.

We ended up prompting pretty aggressively to override this behavior. Even then, we found that prompts at the start of the conversation weren't enough—we had to add reminders both at the beginning and the end of the prompt to keep it from prematurely wrapping up.

When researching ways to address this issue, we discovered one unexpected trick that worked well: enabling the 1M token beta but cap usage at 200k. This gave us a model that thinks it has plenty of runway and behaves normally, without the anxiety-driven shortcuts or degraded performance.

This behavior has has real implications for how we architect around context management. When planning token budgets, we now need to factor in the model's own awareness: knowing when it will naturally want to summarize versus when we need to intervene with context compaction.

Interestingly, the model consistently underestimates how many tokens it has left—and it's very precise about these wrong estimates.

The model takes a lot of notes

One of the most striking shifts in Sonnet 4.5 is that it actively tries to build knowledge about the problem space through both documentation and experimentation.

Writing notes for itself

The model treats the file system as its memory without prompting. It frequently writes (or wants to write) summaries and notes (e.g. CHANGELOG.md, SUMMARY.md, but not CLAUDE.md nor AGENTS.md), both for the user and its own future reference. This suggests the model has been trained to externalize state rather than rely purely on context. This behavior is more pronounced when the model is closer to the end of its context window.

When we saw this, we were interested in the possibility to potentially remove some of our own memory management and let the model handle it. But in practice, we found the summaries weren't comprehensive enough. For example, it would sometimes paraphrase the task, leaving out important details. When we relied on the model's own notes without our compacting and summarization systems, we saw performance degradation and gaps in specific knowledge: the model didn't know what it didn't know (or what it might need to know in the future). It's very likely that these notes can be improved with prompting. You just shouldn't think you get a perfect system for free.

In some cases, somewhat humorously, we've seen the agent spend more tokens writing summaries than actually solving the problem. We've also noticed that the model's level of effort is uneven: the model tends to generate more summary tokens the shorter the context window.

In our testing, we found this behavior useful in certain cases, but less effective than our existing memory systems when we explicitly directed the agent to use its previously generated state.

This is an interesting paradigm and a new axis for model development, especially for simpler agent architectures or systems built around subagent delegation. It's clearly a new direction from Anthropic: likely pointing toward a future where models are more context-aware and where this becomes the way multiple agents communicate with each other. The RL training hasn't fully progressed to the point where this is reliable yet, but we'll be tracking how it evolves.

Testing to create feedback loops

Sonnet 4.5 is notably more proactive about writing and executing short scripts and tests to create feedback loops, and shows good judgment about when to use this capability. This generally improves reliability on long-running tasks, though we've occasionally seen it attempt overly creative workarounds when debugging. For example, when editing a React app, we’ve noticed the models getting the HTML of the page in order to check their work along the way in order to ensure that the behavior was correct. In another case, when trying to fix a seemingly innocent error related to two local servers trying to run on the same port, the model ended up using this behavior to create an overly complicated custom script instead of addressing the root cause issue (terminating the process).

The model works in parallel

Sonnet 4.5 is efficient at maximizing actions per context window through parallel tool execution -running multiple bash commands at once, reading several files simultaneously, that sort of thing. Rather than working strictly sequentially (finish A, then B, then C), the model will overlap work where it can. It also shows decent judgment about self-verification: checking its work as it goes.

This is very noticeable in Windsurf, and was an improvement upon Devin's existing parallel capabilities. That being said, there are tradeoffs. Parallelism burns through context faster, which leads to the context anxiety we mentioned earlier. But when the model is running with an empty context window, this more concurrent approach makes sessions feel faster and more productive. It's a subtle shift, but one that influenced how we thought about the architecture.

The model also seems to be trained to burn through parallel tool calls faster when it is earlier on in its context window, but takes more cautious as it nears the limit. This suggests to us that its been trained to be aware of how many output tokens its tool calls will produce.

What we're exploring next

These behaviors open up many interesting avenues, and we haven't been able to explore them all yet. Here are some we're eager to continue testing:

Subagents and context-aware tool calls. The model's improved judgment about when to externalize state and create feedback loops suggests it might handle subagent delegation more effectively. However, as we've learned you have to be very careful about when to use subagents because the context and state management gets complex quickly. Sonnet 4.5 seems more aware of the right types of tasks to delegate, which could make this more practical.
Meta-agent prompting. We're particularly interested in how this model handles meta-level reasoning about agent workflows. Early experiments suggest it works well with verification systems — letting the model reason about its own development process rather than just executing tasks.
Context-management models. Sonnet 4.5 seems to have some initial intuition around how to mange its own context. It might be possible that custom-trained models for intelligent context management could both result in faster and better performance.

We'll be sharing more as we learn what works (and what doesn't). In the meantime, we’re excited for you to try both the new Devin with Sonnet 4.5 and Windsurf.

simonw/claude-skills

Simon Willison's Weblog — Fri, 10 Oct 2025 23:57:19 +0000

simonw/claude-skills

One of the tips I picked up from Jesse Vincent's Claude Code Superpowers post (previously) was this:

Skills are what give your agents Superpowers.

The first time they really popped up on my radar was a few weeks ago when Anthropic rolled out improved Office document creation. When the feature rolled out, I went poking around a bit -- I asked Claude to tell me all about its new skills. And it was only too happy to dish.

In that transcript Jesse asks how the skill system works and Claude replies:

Looking at the skills system, here's how it works:

Current Skills (in /mnt/skills/public/):

docx - Word document handling

pdf - PDF processing and form filling

pptx - PowerPoint presentations

xlsx - Excel spreadsheets

This looks like it's part of Claude's brand new Code Interpreter feature! I wrote about that extensively last month, but I missed that there was a /mnt/skills/public/ folder full of fascinating implementation details.

So I fired up a fresh Claude instance (fun fact: Code Interpreter also works in the Claude iOS app now, which it didn't when they first launched) and prompted:

Create a zip file of everything in your /mnt/skills folder

This worked, and gave me a .zip to download. You can run the prompt yourself here, though you'll need to enable the new feature first.

I've pushed the contents of that zip to my new simonw/claude-skills GitHub repo.

So now you can see the prompts Anthropic wrote to enable the creation and manipulation of the following files in their Claude consumer applications:

pdf - PDF files
docx - Microsoft Word
pptx - Microsoft PowerPoint decks
xlsx - Microsoft Excel

In each case the prompts spell out detailed instructions for manipulating those file types using Python, using libraries that come pre-installed on Claude's containers.

Skills are more than just prompts though: the repository also includes dozens of pre-written Python scripts for performing common operations.

pdf/scripts/fill_fillable_fields.py for example is a custom CLI tool that uses pypdf to find and then fill in a bunch of PDF form fields, specified as JSON, then render out the resulting combined PDF.

This is a really sophisticated set of tools for document manipulation, and I love that Anthropic have made those visible - presumably deliberately - to users of Claude who know how to ask for them.

Tags: pdf, python, ai, prompt-engineering, generative-ai, llms, anthropic, claude, code-interpreter, jesse-vincent, skills

LoRA Without Regret

Pages — Mon, 29 Sep 2025 21:43:20 +0000

Policy Gradient, Policy Iteration, and in between

Pages — Mon, 29 Sep 2025 00:00:00 +0000

Policy gradient (PG) and actor critic are the most trendy RL methods today. As a theorist, it often gives me headache when people bring up all sorts of variants of PG and ask me for opinions, and all I can say is …

Part of the reason is that many algorithms adopt changes to the original PG that look fairly heuristic, and I cannot find a conceptually simple and satisfactory framework for understanding and deriving them from first principles. This situation changed a little bit when Lihong recently asked me about PPO and shared some interesting empirical phenomenon, which got me thinking and organizing together some ideas that have been in the back of my mind for a long time. So this post will share some perspectives on how I view these algorithms.

Policy Gradient, Implemented (In)Correctly

Let’s start by reviewing the good-old PG, which is $\nabla J (π) = \frac{1}{1 - γ} E_{(s, a) \sim d^{π}} [Q^{π} (s, a) \nabla \log π (s, a)],$ and $Q^{π} (s, a)$ can be replaced by the advantage function $A^{π} (s, a) = Q^{π} (s, a) - V^{π} (s)$ , or the Monte-Carlo return starting from $(s, a)$ under policy $π$ .

In reality, however, a number of approximations is often deployed that departs from the equation above, with my comments at the end in parentheses:

The true $A^{π}$ is often unknown and we may use instead an approximate version $\hat{A}$ often learned via TD, often referred to as the critic. (Totally reasonable.)
In principle, after every update to the actor $π$ , we should train a new critic for the new policy, ideally to convergence. But that’s obviously computationally too intensive, so we often just constantly update the critic, knowing that it likely has not converged to (the best approximation of) the advantage for the current policy. Theoretically, an updating scheme that reproduces the same issue (and is easier to study) is to freeze the critic and update the actor for a while, and then update the critic.
The equation requires on-policy sampling $(s, a) \sim d^{π}$ . While most (but not all) practical variants still respect the action on-policyness, to my knowledge nobody cares about $s \sim d^{π}$ .¹ Instead of $E_{d^{π}}$ people often just write $E_{t}$ to refer to a recent batch (RAGE METER BUILDING UP…)
Things get worse in off-policy setting, where an importance-weighting correction is needed. In practice, the policy on the denominator of importance weight does not always match the data used for updating the actor, which breaks importance sampling. (At this point man, what can I say.)
… and so on.

Policy Iteration

Despite my complaints about all these “wrong” practices, there is an angle that makes a lot of them justifiable. Consider the following: let’s push the alternating update in #2 to its extreme: in each round, we freeze $A^{π}$ and keep updating the actor $π$ until convergence. If the actor parameterization is sufficiently expressive, we would get

π_{new} \to greedy (Q^{π_{old}}), (1)

which is policy iteration (PI).² Now, the textbook material on PI, viewed as a planning algorithm, just lifts so many restrictions of PG, including

You can optimize the policy very far (all the way to greedy w.r.t. current critic) before updating critic.
There is no need to be on-policy. In fact, the policy improvement step does not even talk about any notion of data distribution.

All these relaxations are nice. But they of course come at a price, otherwise we can simply use fitted PI and do not need PG at all. As I will explain below, my view is that PI relaxes the requirement on data distributions at the cost of strong (and sometimes unrealistic) assumptions on (1) policy structure, and (2) critic quality. We need strict PG if none of these assumptions hold. However, when some of them are reasonable, we may relax PG towards PI and enjoy the resulting flexibility.

On Policy Structure

In its original form, PI does not come with a separate actor and the policy is implicitly induced as the greedy policy of the value functions. For PI to be properly written in the form of Eq.(1), a key assumption is that

greedy (A^{π}) \in Π : = {π_{θ} : θ \in Θ},

that is, the greedy policy w.r.t. the advantage function (or its approximation $\hat{A}$ in the learning setting) can always be well-approximated by our policy parameterization $Π$ . This is because the PD lemma tells us that

J (π) - J (π_{old}) = \frac{1}{1 - γ} E_{d^{π}} [A^{π_{old}} (s, π)],

where $A^{π_{old}} (s, π) = Q^{π_{old}} (s, π) - Q^{π_{old}} (s, π_{old})$ . PI finds a new policy that is guaranteed to improve because the new policy has non-negative advantage over the old one in all states simultaneously, which is the key property that makes PI insensitive to data distributions. When we mimic PI using “critic-frozen-PG”, we can guarantee this property by choosing the tabular softmax policy, which has a factorized structure in the sense that the policy can choose the action distributions on each state independently.

In contrast, when we use a policy parameterization that entangles the action distributions across different states, we may run into the problem that the gradients of $A^{π} (s, π)$ for different $s$ may not be aligned with each other, and making improvements in some states may cause performance degradation in other states. In this case, PG theorem is the final safe net that guarantees overall improvement in $J (π)$ : it tells us that the correct trade-off between the conflicting objectives at different states is precisely given by the distribution $d^{π}$ .

On Critic Quality

Another major assumption behind PI is that $A^{π}$ is either given or its approximation $\hat{A}$ is universally good across the state-action space, which is rarely the case in practical learning settings. A much more reasonable assumption is that critic is only accurate under the distribution of states and actions induced by the current policy (see the seminal CPI paper, though there is more nuance to this assumption³), or the historical data, since this is where the critic is trained. Under this assumption, we cannot freeze the critic and optimize the actor indefinitely, as it will run into the regime where critic is poor due to the lack of data. Instead, we need to stop after $π$ is a bit far from $π_{old}$ , collect data, and re-train the critic to make sure we never get into the completely uncharted regime and stay within the trust region.

Understanding the Practical Relaxations

The takeaway message of the above discussion is that PI makes a number of implicit assumptions, which allow it to have much less restriction than PG in several corresponding aspects. Judging from how people tweak PG in practice (as mentioned at the beginning), we can “reverse-engineer” and have some reasonable guess of why those tweaks are successful: given that people care less about on-policyness but enforce trust region pretty carefully, we can conjecture that (1) modern neural-net architectures are rich enough that conflicting objectives across states are less of an issue (especially when actor and critic share layers),⁴ and (2) critic is still only good under existing data, which necessitates trust region. (Update 09/30) After I show this post to Wei Xiong, he told me that there are recent works that start to take distribution issues in PG more seriously.⁵ I guess we will just have to wait and see how things play out.

Designing agentic loops

Simon Willison's Weblog — Tue, 30 Sep 2025 15:20:46 +0000

Coding agents like Anthropic's Claude Code and OpenAI's Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly exercise the code they are writing, correct errors, dig through existing implementation details, and even run experiments to find effective code solutions to problems.

As is so often the case with modern AI, there is a great deal of depth involved in unlocking the full potential of these new tools.

A critical new skill to develop is designing agentic loops.

One way to think about coding agents is that they are brute force tools for finding solutions to coding problems. If you can reduce your problem to a clear goal and a set of tools that can iterate towards that goal a coding agent can often brute force its way to an effective solution.

My preferred definition of an LLM agent is something that runs tools in a loop to achieve a goal. The art of using them well is to carefully design the tools and loop for them to use.

The joy of YOLO mode

Agents are inherently dangerous - they can make poor decisions or fall victim to malicious prompt injection attacks, either of which can result in harmful results from tool calls. Since the most powerful coding agent tool is "run this command in the shell" a rogue agent can do anything that you could do by running a command yourself.

To quote Solomon Hykes:

An AI agent is an LLM wrecking its environment in a loop.

Coding agents like Claude Code counter this by defaulting to asking you for approval of almost every command that they run.

This is kind of tedious, but more importantly, it dramatically reduces their effectiveness at solving problems through brute force.

Each of these tools provides its own version of what I like to call YOLO mode, where everything gets approved by default.

This is so dangerous, but it's also key to getting the most productive results!

Here are three key risks to consider from unattended YOLO mode.

Bad shell commands deleting or mangling things you care about.
Exfiltration attacks where something steals files or data visible to the agent - source code or secrets held in environment variables are particularly vulnerable here.
Attacks that use your machine as a proxy to attack another target - for DDoS or to disguise the source of other hacking attacks.

If you want to run YOLO mode anyway, you have a few options:

Run your agent in a secure sandbox that restricts the files and secrets it can access and the network connections it can make.
Use someone else's computer. That way if your agent goes rogue, there's only so much damage they can do, including wasting someone else's CPU cycles.
Take a risk! Try to avoid exposing it to potential sources of malicious instructions and hope you catch any mistakes before they cause any damage.

Most people choose option 3.

Despite the existence of container escapes I think option 1 using Docker or the new Apple container tool is a reasonable risk to accept for most people.

Option 2 is my favorite. I like to use GitHub Codespaces for this - it provides a full container environment on-demand that's accessible through your browser and has a generous free tier too. If anything goes wrong it's a Microsoft Azure machine somewhere that's burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository.

There are plenty of other agent-like tools that run code on other people's computers. Code Interpreter mode in both ChatGPT and Claude can go a surprisingly long way here. I've also had a lot of success (ab)using OpenAI's Codex Cloud.

Coding agents themselves implement various levels of sandboxing, but so far I've not seen convincing enough documentation of these to trust them.

Update: It turns out Anthropic have their own documentation on Safe YOLO mode for Claude Code which says:

Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use --dangerously-skip-permissions in a container without internet access. You can follow this reference implementation using Docker Dev Containers.

Locking internet access down to a list of trusted hosts is a great way to prevent exfiltration attacks from stealing your private source code.

Picking the right tools for the loop

Now that we've found a safe (enough) way to run in YOLO mode, the next step is to decide which tools we need to make available to the coding agent.

You can bring MCP into the mix at this point, but I find it's usually more productive to think in terms of shell commands instead. Coding agents are really good at running shell commands!

If your environment allows them the necessary network access, they can also pull down additional packages from NPM and PyPI and similar. Ensuring your agent runs in an environment where random package installs don't break things on your main computer is an important consideration as well!

Rather than leaning on MCP, I like to create an AGENTS.md (or equivalent) file with details of packages I think they may need to use.

For a project that involved taking screenshots of various websites I installed my own shot-scraper CLI tool and dropped the following in AGENTS.md:

To take a screenshot, run:

shot-scraper http://www.example.com/ -w 800 -o example.jpg

Just that one example is enough for the agent to guess how to swap out the URL and filename for other screenshots.

Good LLMs already know how to use a bewildering array of existing tools. If you say "use playwright python" or "use ffmpeg" most models will use those effectively - and since they're running in a loop they can usually recover from mistakes they make at first and figure out the right incantations without extra guidance.

Issuing tightly scoped credentials

In addition to exposing the right commands, we also need to consider what credentials we should expose to those commands.

Ideally we wouldn't need any credentials at all - plenty of work can be done without signing into anything or providing an API key - but certain problems will require authenticated access.

This is a deep topic in itself, but I have two key recommendations here:

Try to provide credentials to test or staging environments where any damage can be well contained.
If a credential can spend money, set a tight budget limit.

I'll use an example to illustrate. A while ago I was investigating slow cold start times for a scale-to-zero application I was running on Fly.io.

I realized I could work a lot faster if I gave Claude Code the ability to directly edit Dockerfiles, deploy them to a Fly account and measure how long they took to launch.

Fly allows you to create organizations, and you can set a budget limit for those organizations and issue a Fly API key that can only create or modify apps within that organization...

So I created a dedicated organization for just this one investigation, set a $5 budget, issued an API key and set Claude Code loose on it!

In that particular case the results weren't useful enough to describe in more detail, but this was the project where I first realized that "designing an agentic loop" was an important skill to develop.

When to design an agentic loop

Not every problem responds well to this pattern of working. The thing to look out for here are problems with clear success criteria where finding a good solution is likely to involve (potentially slightly tedious) trial and error.

Any time you find yourself thinking "ugh, I'm going to have to try a lot of variations here" is a strong signal that an agentic loop might be worth trying!

A few examples:

Debugging: a test is failing and you need to investigate the root cause. Coding agents that can already run your tests can likely do this without any extra setup.
Performance optimization: this SQL query is too slow, would adding an index help? Have your agent benchmark the query and then add and drop indexes (in an isolated development environment!) to measure their impact.
Upgrading dependencies: you've fallen behind on a bunch of dependency upgrades? If your test suite is solid an agentic loop can upgrade them all for you and make any minor updates needed to reflect breaking changes. Make sure a copy of the relevant release notes is available, or that the agent knows where to find them itself.
Optimizing container sizes: Docker container feeling uncomfortably large? Have your agent try different base images and iterate on the Dockerfile to try to shrink it, while keeping the tests passing.

A common theme in all of these is automated tests. The value you can get from coding agents and other LLM coding tools is massively amplified by a good, cleanly passing test suite. Thankfully LLMs are great for accelerating the process of putting one of those together, if you don't have one yet.

This is still a very fresh area

Designing agentic loops is a very new skill - Claude Code was first released in just February 2025!

I'm hoping that giving it a clear name can help us have productive conversations about it. There's so much more to figure out about how to use these tools as effectively as possible.

Tags: definitions, ai, generative-ai, llms, ai-assisted-programming, ai-agents, coding-agents

Building agents with the Claude Agent SDK

Anthropic Engineering Blog — Mon, 29 Sep 2025 17:04:11 +0000

The Claude Agent SDK is a collection of tools that helps developers build powerful agents on top of Claude Code. In this article, we walk through how to get started and share our best practices.

The second wave of MCP: Building for LLMs, not developers

Pages — Sun, 28 Sep 2025 18:13:42 +0000

Sep 9, 2025

When the MCP standard first launched, many teams rushed to ship something. Many servers ended up as thin wrappers around existing APIs with minimal changes. A quick way to say "we support MCP".

At the time, this made sense. MCP was new, teams wanted to get something out quickly, and the obvious approach was mirroring existing API structures. Why reinvent when you could repackage?

But the problem with this approach is LLMs don’t work like developers. They don’t reuse past code or keep long term state. Each conversation starts fresh. LLMs have to rediscover which tools exist, how to use them, and in what order. With low level API wrappers, this leads to repeated orchestration, inconsistent behavior, and wasted effort as LLMs repeatedly solve the same puzzles.

MCP works best when tools handle complete user intentions rather than exposing individual API operations. One tool that deploys a project end-to-end works better than four tools that each handle a piece of the deployment pipeline.

The key difference is context and state management. When you write code, you keep track of information between API calls. You store the project ID from the create call, check deployment status before adding the domain, and wrap error handling around each step so failures don't break the entire process.

LLMs work differently. Each conversation starts fresh with no memory of previous conversations. While they can see tool results within the current conversation, they have to figure out the right sequence of tools to use based on what's available. When those tools are low-level API wrappers, the LLM has to orchestrate multiple calls and manage the complexity of chaining them together each time.

For example, take deploying a project with the Vercel API. A developer might write code like this:

const project = await client.projects.create({
 name: domain.replace(/\./g, '-'),
 gitRepository: { repo: repoUrl }
await client.projects.createProjectEnv({
 requestBody: Object.entries(env).map(([key, value]) => ({
 target: ['production', 'preview', 'development'],
const deployment = await client.deployments.createDeployment({
 gitSource: { type: 'github', repo: repo.replace('.git', ''), ref: 'main' }
await client.projects.addProjectDomain({
 requestBody: { domain: domain }

This approach involves nested configurations, ID management across calls, and parsing repository URLs. A developer will solve for this once and re-use. But an LLM faces this puzzle fresh each time, often getting the nesting wrong or forgetting required fields.

Link to headingSingle workflow tools vs multiple endpoints

The solution is building tools around complete user goals rather than API capabilities. Instead of four separate tools, create one deploy_project tool that handles the entire workflow internally.

This changes everything about tool design:

API-shaped tools	Intention-based tools
`create_project`, `add_env`, `deploy`, `add_domain`	`deploy_project`
Multiple calls with state management	Single atomic operation
Returns technical status codes	Returns conversational updates
LLM assembles the workflow	Tool owns the complete process

Consider the difference in practice. An API-shaped MCP server might expose these tools:

create_project(name, repo)
add_environment_variables(project_id, variables)
create_deployment(project_id, branch)
add_domain(project_id, domain)

The LLM has to call each tool in sequence, pass IDs between calls, and handle potential failures at each step.

An intention-based tool looks different:

deploy_project(repo_url, domain, environment_variables, branch="main")

This single tool handles the complete workflow internally and returns a conversational response. Instead of { status: 200, data: { id: "proj_123" } }, the LLM can respond with "Project deployed at example.com. Build completed in 45s. All systems running normally."

Link to headingDesigning workflow based MCP tools

Start by testing the workflow manually before writing any code. Take a real user request like "set up my project with authentication and a database" and walk through it step by step using your existing APIs. The parts that feel tedious or repetitive are good candidates for a single MCP tool.

Think of MCP tools as tailored toolkits that help an AI achieve a particular task, not as API mirrors. There may be multiple APIs and business logic behind a single MCP tool. If users think of something as one workflow, design it as one tool.

Here's how to structure a complete workflow tool:

 "Deploy a project with environment variables and custom domain",
 environment_variables: z.record(z.string()),
 branch: z.string().default("main")
 async ({ repo_url, domain, environment_variables, branch }) => {
 const project = await createProject(repo_url, branch);
 await addEnvironmentVariables(project.id, environment_variables);
 const deployment = await deployProject(project.id);
 await addCustomDomain(project.id, domain);
 text: `Project deployed successfully at ${domain}. Build completed in ${deployment.duration}s.`

Use plain code for the deterministic parts. Things like API sequencing, error recovery, and state management are better suited for regular programming. Only involve the LLM for parts that truly need reasoning or natural language processing.

Test with real scenarios. Run actual user workflows through your tools. When you see the LLM making multiple attempts or asking for clarification, that's feedback about your tool design. The goal is for complex workflows to succeed on the first try.

Link to headingPerformance improvements with workflow tools

Teams that have shifted from API shaped tools to workflow shaped tools have seen meaningful improvements in reliability and efficiency.

The common thread is how these tools are designed:

They focus on user intentions rather than API coverage
They handle complete workflows rather than exposing single operations
They respond in a conversational way rather than returning technical codes

MCP works best when tools reflect complete user goals. LLMs do not manage state the way developers do, so building tools around workflows produces better results.

Give this approach a try. The MCP handler makes it straightforward to expose your application logic as workflow based MCP tools. Get started with the Next.js MCP template or explore the documentation.

Introducing the MCP Registry

Pages — Mon, 08 Sep 2025 00:00:00 +0000

Today, we’re launching the Model Context Protocol (MCP) Registry—an open catalog and API for publicly available MCP servers to improve discoverability and implementation. By standardizing how servers are distributed and discovered, we’re expanding their reach while making it easier for clients to get connected.

The MCP Registry is now available in preview. To get started:

Single source of truth for MCP servers#

In March 2025, we shared that we wanted to build a central registry for the MCP ecosystem. Today we are announcing that we’ve launched https://registry.modelcontextprotocol.io as the official MCP Registry. As part of the MCP project, the MCP Registry, as well as a parent OpenAPI specification, are open source—allowing everyone to build a compatible sub-registry.

Our goal is to standardize how servers are distributed and discovered, providing a primary source of truth that sub-registries can build upon. In turn, this will expand server reach and help clients find servers more easily across the MCP ecosystem.

In building a central registry, it was important to us not to take away from existing registries that the community and companies have built. The MCP Registry serves as a primary source of truth for publicly available MCP servers, and organizations can choose to create sub-registries based on custom criteria. For example:

Public subregistries like opinionated “MCP marketplaces” associated with each MCP client are free to augment and enhance data they ingest from the upstream MCP Registry. Every MCP end-user persona will have different needs, and it is up to the MCP client marketplaces to properly serve their end-users in opinionated ways.

Private subregistries will exist within enterprises that have strict privacy and security requirements, but the MCP Registry gives these enterprises a single upstream data source they can build upon. At a minimum, we aim to share API schemas with these private implementations so that associated SDKs and tooling can be shared across the ecosystem.

In both cases, the MCP Registry is the starting point – it’s the centralized location where MCP server maintainers publish and maintain their self-reported information for these downstream consumers to massage and deliver to their end-users.

Community-driven mechanism for moderation#

The MCP Registry is an official MCP project maintained by the registry working group and permissively licensed. Community members can submit issues to flag servers that violate the MCP moderation guidelines—such as those containing spam, malicious code, or impersonating legitimate services. Registry maintainers can then denylist these entries and retroactively remove them from public access.

Getting started#

To get started:

This preview of the MCP Registry is meant to help us improve the user experience before general availability and does not provide data durability guarantees or other warranties. We advise MCP adopters to watch development closely as breaking changes may occur before the registry is made generally available.

As we continue to develop the registry, we encourage feedback and contributions on the modelcontextprotocol/registry GitHub repository: Discussion, Issues, and Pull Requests are all welcome.

Thanks to the MCP community#

The MCP Registry has been a collaborative effort from the beginning and we are incredibly grateful for the enthusiasm and support from the broader developer community.

In February 2025, it began as a grassroots project when MCP creators David Soria Parra and Justin Spahr-Summers asked the PulseMCP and Goose teams to help build a centralized community registry. Registry Maintainer Tadas Antanavicius from PulseMCP spearheaded the initial effort in collaboration with Alex Hancock from Block. They were soon joined by Registry Maintainer Toby Padilla, Head of MCP at GitHub, and more recently, Adam Jones from Anthropic joined as Registry Maintainer to drive the project towards the launch today. The initial announcement of the MCP Registry’s development lists 16 contributing individuals from at least 9 different companies.

Many others made crucial contributions to bring this project to life: Radoslav Dimitrov from Stacklok, Avinash Sridhar from GitHub, Connor Peet from VS Code, Joel Verhagen from NuGet, Preeti Dewani from Last9, Avish Porwal from Microsoft, Jonathan Hefner, and many Anthropic and GitHub employees that provided code reviews and development support. We are also grateful to everyone on the Registry’s contributors log and those who participated in discussions and issues.

We deeply appreciate everyone investing in this foundational open source infrastructure. Together, we’re helping developers and organizations worldwide to build more reliable, context-aware AI applications. On behalf of the MCP community, thank you.

Open Social — overreacted

Pages — Sun, 28 Sep 2025 18:07:35 +0000

Alan Turing on Embodied Intelligence

Rodney Brooks — Sat, 20 Sep 2025 23:52:14 +0000

The term “Embodied AI” is having its moment in the sun right now. For me, that is what I have spent my career working on, and I have repeatedly published articles using the term “embodied intelligence”.

I recently stumbled across a piece that I wrote in May 2011 intended for the proceedings of a conference held in June 2012 in Manchester, England, to celebrate the 100^th anniversary of Alan Turing’s birth. There was a series of invited talks and a number of other components. I attended the conference but I cannot find any record of a proceedings having been published for the 17 invited talks. Here is what I had written for it, unchanged except for fixing many typos.

The Case For Embodied Intelligence

Rodney A. Brooks, May 2011

Abstract: In 1948 in Intelligent Machinery Turing made the distinction between embodied and disembodied intelligence. While arguing that building an embodied intelligence would be a “sure” route to produce a thinking machine he rejected it in favor of disembodied intelligence on the grounds of technical practicalities of the era. Modern researchers are now seriously investigating the embodied approach to intelligence and have rediscovered the importance of what Turing called “cultural search” in that same paper.

For me Alan Turing’s 1948 paper Intelligent Machinery was more important than his 1950 paper Computing Machinery and Intelligence.

At the beginning of Intelligent Machinery Turing provided counter arguments to a number of possible objections to the idea that machines could be intelligent. And right at the end he introduced a precursor to the “Imitation Game”, now commonly referred to as the Turing Test, of his 1950 paper. In this earlier version, one human not very good chess player would try to guess whether he was playing against another human not very good chess player, or against an algorithm. [[At the time the opponent person had to be not very good so that it didn’t outshine the then current abilities of mechanical chess playing. Today the opponent person would have to be a world champion to have any chance at not being outshone by the mechanical system!]] Expansion of these bookends became the body of Computing Machinery and Intelligence.

Intelligent Machinery itself was not published until 1970, so many early computer science researchers were unaware of it. I was fortunate to come in contact with it right as I was starting my academic career.

The bulk of the paper gives examples of how simple computational mechanisms could be adaptable, could be taught, and could learn for themselves. The examples and mechanisms Turing used in this exposition were networks of active computational elements. Although he connected them back to the universal machines of his 1936 paper, it is remarkable, in hindsight, how different this abstraction was than the one he had previously introduced, of the central processing element with a tape memory–still the essential model for all modern digital computers. Here, instead, he used a model inspired by brains. One can only wonder how different our technological world might be if Turing had lived to fully develop this set of ideas himself. Others carried on this second tradition, but one must think that perhaps Turing’s intellectual influence might have been stronger as he would have been arguing against the approach that was adopted from his earlier work.

For me, the critical, and new, insights in Intelligent Machinery were two fold.

First, Turing made the distinction between embodied and disembodied intelligence. While arguing that building an embodied intelligence would be a “sure” route to produce a thinking machine he rejected it in favor of disembodied intelligence on the grounds of technical practicalities of the era. Second, he introduced the notion of “cultural search”: that people’s learning largely comes from the culture of other people in which they are immersed.

Modern researchers are now seriously investigating the embodied approach to intelligence and have rediscovered the importance of interaction with people as the basis for intelligence. My own work for the last twenty five years has been based on these two ideas.

Turing justifies the possibility making a thinking machine by “the fact that it is possible to make machinery to imitate any small part of a man”. He uses the implicit idea of his universal computing machines to dismiss the idea that it is necessary to emulate a person at the neural signal level in order to have intelligence, and instead suggests a digital computer, “if produced by present techniques, would be of immense size”, which would control a robot from a distance. That robot would be built by “tak[ing] a man as a whole and to try to replace all parts of him by machine”. In particular he suggests the parts would include “television cameras, microphones, loudspeakers, wheels and `handling servo-mechanisms’ …”. Turing’s description from over sixty years ago, fairly precisely describes what is done today in dozens of research labs around the world with our PR2 robots, or Mekabots, with their brains off board in racks of Linux boxes, or even off in the computing cloud.

Turing further rightfully notes that even in building such a robot “the creature would still have no contact with food, sex, sport, and many other things of interest to the human being”. Nevertheless he suggests that such an approach “is probably the `sure’ way of producing a thinking machine”, before dismissing it as too slow and impractical. He suggests instead that it is more practical, certainly at that time, to “see what can be done with a `brain’ which is more or less without a body”. He suggests the following fields as ripe for exploration by disembodied intelligence:

(i) Various games, e.g., chess, noughts and crosses, bridge, poker
(ii) The learning of languages
(iii) Translations of languages
(iv) Cryptography
(v) Mathematics.

With these suggestions much of the early directions for the field of Artificial Intelligence were set, and certainly the odd numbered of Turing’s suggestions formed a large part of the work in AI during its first decade.

In one paper Turing both distinguished embodied versus disembodied approaches to building intelligent machines, praised the former as more likely to succeed and either set or predicted the disembodied directions that were actually followed for many years.

But later, towards the very end of Intelligent Machinery he comes back to the place of bodies in the world. He distinguishes three kinds of search as ways to build intelligent systems: intellectual search, genetic search, and cultural search. The first is the direction that classical AI went, where programs try to learn and improve their performance. Although he did not suggest that it be mechanized, genetic search has become a thoroughly practical approach to design and optimization. And lastly by cultural search, Turing means the way in which interactions with others contributes to the development of intelligence. This developmental approach, using social robots, has only now become practical in the last fifteen years, and is a rich source of both theoretical and practical learning systems for robots.

It is humbling to read Alan Turing’s papers. He thought of it all. First.

How I've run major projects

benkuhn.net — Sun, 16 Mar 2025 00:00:00 +0000

My few most productive individual weeks at Anthropic have all been “crisis project management:” coordinating major, time-sensitive implementation or debugging efforts.

In a company like Anthropic, excellent project management is an extremely high-leverage skill, and not just during crises: our work has tons of moving parts with complex, non-obvious interdependencies and hard schedule constraints, which means organizing them is a huge job, and can save weeks of delays if done right. Although a lot of the examples here come from crisis projects, most of the principles here are also the way I try to run any project, just more-so.

I think excellent project management is also rarer than it needs to be. During the crisis projects I didn’t feel like I was doing anything particularly impressive; mostly it felt like I was putting in a lot of work but doing things that felt relatively straightforward. On the other hand, I often see other people miss chances to do those things, maybe for lack of having seen a good playbook.

So here’s an attempt to describe my playbook for when I’m being intense about project management.

(I’ve described what I did as “coordinating” above, but that’s underselling it a bit; it mattered a lot for this playbook that I had enough technical context, and organizational trust, to autonomously make most prioritization decisions about the project. Sometimes we instead try to have the trusted decisionmakers not be highly involved in managing execution, and instead farm that out to a lower-context or less-trusted project manager to save the trusted decisionmaker time, but IMO this is usually a false economy for projects where it’s critical that they be executed well.)

Focus

For each of the crisis management projects I completely cleared my schedule to focus on them, and ended up spending 6+ hours a day organizing them.

This is a bit unintuitive because I’m used to thinking of information processing as basically a free action. After all, you’re “just” moving info from place to place, not doing real work like coding, right? But if you add it all up—running meetings, pinging for updates, digesting Slack threads, pinging for updates again, thinking about what’s next, pinging for updates a third time, etc.—it’s surprisingly time-intensive.

Even more importantly than freeing up time, clearing my schedule made sure the project was the top idea in my mind. If I don’t do that, it’s easy for me to let projects “go on autopilot,” where I keep them running but don’t proactively make time to think through things like whether we should change goals, add or drop priorities, or do other “non-obvious” things.

For non-crisis projects, it’s often not tenable (or the right prioritization) to spend 6+ hours a day project-managing; but it’s still the case that you can improve execution a lot if you focus and make them a top priority, e.g. by carving out dedicated time every day to check statuses, contemplate priorities, broadcast updates, and so on.

Maintain a detailed plan for victory

A specific tool that I’ve found critical for staying oriented and updating quickly is a detailed plan for victory, i.e., a list of steps, as concrete as possible, that end with the goal being achieved.

The plan is important because whether or not we’re achieving the plan is the best way to figure out how well or badly things are going. Knowing how well or badly things are going is important because it tells me when to start asking for more support, cutting scope, escalating problems, and otherwise sounding more alarms. One of the most common megaproject failure modes is to not freak out soon enough, and having a concrete plan is the best antidote.

As a both positive and negative example of this, during a recent sprint to release a new implementation of a model, we took a detailed accounting of all the work we thought we had to do to launch.

On the plus side, this made it clear three months before launch that things were going to be very tight, and this enabled us to ask for help from another team, who loaned us someone who sped up the project a fair amount.
On the minus side, we also massively underestimated a few components of the project, and because of this, we still ended up very crunched at the end.

As the above example shows, having a plan can’t completely save you if you underestimate how long all the steps in the plan will take. But it certainly helps! My sense is that the main things that would have helped even more in the above case were:

We were inexperienced at estimating tasks, especially tasks related to new model implementations (which most people on the team were too new to have done before), and we were too cowardly to add the requisite amount of “slop” to our plan.
We didn’t check in frequently enough against the plan once we made it, or sound the alarm early enough when we went off-plan.

Run a fast OODA loop

OODA stands for “observe, orient, decide, act”—in other words, the process by which you update your plans and behavior based on new information.

Most of the large projects I’ve worked on have been characterized by incomplete information:

Our cluster’s networking is bad, but we don’t understand why.
We have a correctness bug but we don’t know where it is.
We need to rewrite the system but we’re not totally sure what the rewrite should look like.

In fact, I’d make a stronger claim: usually getting complete information was the hard part of the project, and took up a substantial fraction of the overall critical-path timeline.

For example, let’s take a recent project to kick off a training run. The critical path probably looked something like:

Chips for the training run are delivered
We run some tests
We discover one aspect of performance is unexpectedly poor
We escalate the problem with our compute partner
Compute partner staffs a large debugging effort
We realize we had given our compute partner an outdated benchmark that is causing them to target the wrong improvements
Compute partner switches benchmark and prioritizes different improvements
We share our benchmarks with compute partner so they can run the exact same code as us
Compute partner rolls out improvements
We test the improvements
Performance is still poor and we tell them that
Repeat steps 8-10 until eventually it’s good enough

Practically all of these steps are about information-processing, not writing code! Even the step where the compute partner debugged the problems on their side was itself constrained by information processing speed, since there were tens of people working on the debugging effort and coordinating / sharing info between them was difficult. Overall, the project timeline was strongly constrained by how quickly information could round-trip from our compute partner’s large-scale debugging effort, through their tech lead, me, and Anthropic’s large-scale debugging effort.

This pattern generalizes to most projects I’ve been a part of, and as a result, one of my most productive project management habits is to try to run the fastest OODA loop that I can.

A few specific things that I’ve found help:

Spend time on it: running OODA loops takes time, and is one of the primary reasons that, as mentioned above, I usually spend 6+ hours a day on running a megaproject if it’s in crisis mode.
Communicate uncomfortably much: For the training run debugging, to reduce the round-trip time between orgs as much as possible, I had multiple daily calls with my counterpart at our compute partner (9am and 6pm). For the model implementation effort, I was basically constantly bouncing between different groups of debuggers, asking for updates and processing them.
Track and prioritize the biggest open questions: For most big projects I’ve maintained a living doc with a ranked list of all my biggest open questions about the project. Resolving or de-risking these uncertainties basically turns into the project’s priority list.
Ideally, there are enough people working on the project that we can work on resolving multiple of the uncertainties in parallel, since that’s one of the best ways to speed things up. (And for a project in “crisis mode,” if we have more top priorities than we can parallel-path with the current set of people working on the problem, that’s also a good test for whether it’s time to pull in more folks.)
Step back and reorient frequently: Other than asking for updates, the main thing I spend time on was reorienting—looking at our list of priorities, asking myself whether they should still be the top priorities, then looking at what people were working on, and making sure those things were attacking the top priorities. I probably reviewed the project’s priorities multiple times a day as well, although I often didn’t make changes as a result.
- (Note that it is possible to change what people are working on too often, since switching tasks is costly. Parallelizing work on the top few priorities, as mentioned above, helps with this, since if you decide that priority #3 is now #1, but there are 2 people working on each, then nobody has to switch tasks. The thing that kills you is when no one is working on the new priority #1.)

Overcommunicate

It’s not just enough for me personally to be running a fast OODA loop—in a large group, everyone needs to be autonomously making frequent, high-quality, local prioritization decisions, without needing a round-trip through me. To get there, they need to be ambiently aware of:

what else is going on around them, so they can coordinate and update on new info quickly (“oh, we’re planning to kick off the next derisking run in three days, so I have to have my new RL environment ready and tested by then”)
how their goal fits into the overall project, so they can make correct decisions about the details of their approach (“we’re trying to scale up as much as possible right now, so this direction isn’t valuable to pursue since it could never provide the scale of data we need”)

I’ve usually found that to create the right level of ambient awareness, I have to repeat the same things way more often than I intuitively expect. This is roughly the same “communicate uncomfortably much” principle above, but applied to broadcasts and not just 1:1 conversations with people.

For example, although the first team I managed at Anthropic started with a daily asynchronous standup, we found that synchronous meetings were much more effective for creating common knowledge and reorienting, so we moved to a twice-weekly synchronous standup, which probably qualified as “uncomfortably much” synchronous communication for Anthropic at the time.

Break off subprojects

Once a project gets over maybe 10 people, I can’t track everything myself in enough detail to project-manage the entire thing myself. At this point, it becomes critical to delegate.

Here I mean delegating the project management, not just the execution (that’s what I’d be delegating to the first 10 people). This is the point where I need other people to help split up the work, monitor and communicate progress, escalate blockers, etc.

A few things I try to keep in mind when delegating project management:

The ideal unit of delegation is a crisp, simple, high-level goal, with limited overlap with other workstreams. (This is as opposed to, e.g., a list of tasks like “see if X helps.“) Good examples: “get X training technique working over Y networking protocol at Z throughput,” “get identical evals between model implementations A and B.” Bad examples: “follow this 10-step checklist that we hope results in training working,” “try these 3 techniques for debugging the loss eval.”
The best project-managers are often not the strongest technical ICs. Instead the most important traits are that they’re highly organized and great at staying laser focused on end goals, perhaps to the point of being annoying about it. IC depth helps and I’ll never say no to it, but it’s not what I’d optimize for.
People running subprojects are probably also doing a lot of the same stuff I do, in particular e.g. spending a lot of time on it. That means they’ll take a substantial hit to their IC productivity. This is expected, and is often worth it. “Direction is more important than magnitude”—it’s usually better to have a lower-velocity project that works on the right things, than a higher-velocity one that’s pointed at the wrong goal.

One of my favorite things to make delegation easier is to keep goals simple—if they can fit in a Slack message while still crisply describing a path to the desired end state, then the people working on the goal will be much more able to prioritize autonomously, and point their work at the real end goal rather than doing something that turns out to be useless for some reason they didn’t think about.

“Keep goals simple” doesn’t have to mean “do less”—the best way to keep goals simple is to find the latent structure that enables a clean recursive decomposition into subgoals. This often requires a deceptive amount of work—both cognitive and hands-on-keyboard—to identify the right intermediate goals, but I’ve found that it pays off immensely by clarifying what’s important to work on.

Have fun

Some of my favorite memories of Anthropic are of helping out with these big projects. While they can be intense, it’s also really inspiring to see how our team comes together, and the feeling of being part of a big team of truly excellent people cooking something ambitious together can be really magical! So I try to enjoy the chaos :)

Appendix: my project DRI starter kit

Here’s the internal doc I share with folks on my team who are getting into being responsible for large projects.

So you’re the DRI of a project (or part of one). Concretely, what do you do to “be DRI”?

This doc is my suggested “starter kit” answer to that question. The habits and rituals described here aren’t perfect for every situation, but they’re lightweight and broadly helpful. I suggest you use them as a starting point for iteration: try them out, then adjust as necessary. This is an SL init; the RL is your job :)

Goals of this playbook

The goal is to help you do your job as DRI—

Make your project go quickly:
- Participants deeply understand the root goal and can autonomously choose the most important next things to work on
- People have “situational awareness” of what other people are working on, learn about relevant updates quickly, and coordinate quickly when needed
- People get quick feedback on their work
- If things aren’t going fast enough, you (the DRI) can notice and course-correct quickly
“Play well with others:”
- Observers can figure out where to go to follow along
- Adjacent or intersecting people/projects don’t miss important updates or get caught by surprise
- People notice quickly if the project is behind or off-track, and can identify opportunities to help

—without adding too much overhead:

<1 hour of setup to make a working doc, schedule a weekly meeting, etc.
30 min/week of meetings
15-30 min/week to write an update

(Note: being DRI will still unavoidably add some overhead—e.g. you’ll have to track what other people are doing, delegate work, unblock people, set and communicate goals, etc. The goal is specifically for the process/paperwork to be minimal.)

Weekly meeting

You should schedule at least one 30-minute weekly meeting with everyone working on the project.

The goal of this meeting is to (1) be a backstop for any coordination that needs to happen and didn’t happen asynchronously; (2) be an efficient way to create common knowledge of goals, updates, etc.; (3) help you track whether things are going well.

Starter-kit agenda:
- [5m] DRI reviews major updates from last week and sets goals for next week
- [10m] Silent write and comment on discussion topics
- [10m] Synchronous discussion of most important things not addressed during silent write
Signs that more meetings might help (e.g. a second weekly standup):
- you have a very tight deadline and can’t afford to lose time
- people aren’t working on the most important thing
- people need feedback frequently
- people step on each others’ toes or miss opportunities to help each other out
- if you just like hanging out with each other :)

Landing page / working doc

It’s really helpful for discoverability and wayfinding to have a single “master doc” with all the most important info about a project. As you loop more people in, they can read the doc to get up to speed. And anyone who thinks “I wonder how X is going” can stop by there to find out.

Create a doc for your workstream with:

A go/ link in the name (if a subproject, maybe use go/project/subproject)
- → This makes it easier to find quickly (search is kinda rough)
A clear description of a concrete top level goal and how it fits into broader goals
- → This is critical info for participants, so they can autonomously prioritize the most important things; and for observers, so that they know what outcome to expect.
Staffing: A list of people working on the project, your name as the DRI, and a link to the slack channel that’s being used for discussion
Links: A short list of relevant links at the top (work trackers, the project’s Slack channel, major design docs, etc.). If needed, a longer “docs / see also” section later links to relevant docs.
- → It’s really easy to lose track of relevant docs otherwise!
A roadmap section with intermediate goals and target dates
- → See the section on plans; these will help people understand what the overall shape of the project is expected to be.
A section for “running notes” containing meeting notes from your weekly meetings (and any other ad-hoc meetings) and broadcast updates
- → This really helps observers and new-joiners get up to speed!
I like maintaining a list of important open questions / uncertainties/ risks and updating it over time. This helps me stay focused on removing risk from the project as quickly as possible.

If it’s part of a larger project, your doc should be nested within the larger project’s working doc.

If this ends up being too much for one doc, you can fork these out into sub-docs (esp. running notes and updates).

Plan / roadmap / milestones

In your working doc, include a section with some intermediate goals and dates by which you hope to accomplish them.
- → This is helpful mostly for noticing you’re off track or behind without getting frog-boiled.
- → Or noticing when you need to make a direction change because the intermediate goals don’t seem good anymore.
You might feel some pressure to add false certainty or precision, but avoid this and be honest about your uncertainty instead. For a lot of research projects it’s hard to plan more than a couple weeks ahead. You can make the milestones fuzzier / more aspirational beyond that, or just drop them.
- I often find it helpful to phrase milestones in probabilitis and distributions (e.g. “my 90% confidence interval for this date is X-Y” or “I think there’s a 75% chance this technique works”)

Who’s working on what

You should have something somewhere that describes what people are working on.
The minimum viable version of this is a list of what people are working on in your working doc.
- If you end up with a large set of tasks and a big backlog, maybe use a checklist and/or move to a subdoc.
Stack rank your work list. It’s really important for people to understand priorities!
If there’s more different people/TODOs, I suggest using some app to make a kanban board with “backlog” / “up next” / “in progress” / “done” columns.
- This is probably most helpful for more deterministic/plannable projects where there’s a clear backlog + set of future tasks, and a lot of things you need to remember to do.
If you have an external task tracker, link it in the wiki section of the working doc.

Slack norms

Have conversations about the project in a Slack channel (not DMs).
- Reference the channel in your working doc.
- Link the working doc in the Slack channel bookmarks.
Cross-post notebook posts and experiment write-ups into the channel so observers don’t have to follow tons of notebook channels.
Do not use DMs. These make it hard to make info discoverable or share it further.
- If people send you important stuff in a DM, ask them to put it in the project channel.
- If you need confidentiality, make a private channel.
Avoid centithreads. Most ≥10-message Slack threads would be better as a ~5-minute Tuple.
- (This is hard to do with people who are in tons and tons of meetings like execs. But you should try to do it for others.)
- If you end up with a centithread, assume nobody will read it; post a summary back to the channel afterwards.
Bias towards fewer, larger, noisier channels. The right time to create a channel is when discussion is either not happening, or getting lost.
- → Too many slack channels makes it harder to manage membership, decide where to put things, or find where discussion is happening.
Channel organization and membership matters. Invest in routing conversations to the right place and curating the channel “architecture.”

Weekly broadcast updates

Once a week, probably either just before or just after your weekly meeting, write up a brief update for a broader audience with:
- The overall vibe
- What’s changed since last update
- What’s coming up next
When writing these updates, optimize for signal to noise ratio.
- Err towards concision
- No “we worked on X”—tell me “we accomplished Y” or “we learned Z”
- Remember your audience (= people not familiar with the project)
- State things crisply and concretely (“X improves eval Y by Z points,” not “we got X working”)
- Leave out anything that’s not actionable—you don’t need to be exhaustive
Post the update in your project Slack channel, and cross-post it to other relevant channels (e.g. a larger “megaproject” channel) if necessary.
- If your project is part of a larger megaproject, these updates might feed into something broader like a weekly meeting of DRIs or an aggregated status update.

Retrospectives

Every so often, step back and ask “how could the last X weeks have gone better?”
- Frequency depends on how much there is going on—every 2 weeks is good if there’s a lot, maybe every 4-8 weeks for smaller projects
Suggested meeting format
- Friday afternoon
- [13 min] Async brainstorm 2 lists of items: “what went well” / “what we could improve”
- [2 min] Dedupe topics and emoji vote by putting :heavy_plus_sign: next to ones you agree with
- Sort “what we could improve” by highest votes
- [10 min] Synchronous discussion of top points (either highest voted or flagged by DRI); figure out action items

Thanks to Kelley Rivoire for many thoughtful comments on a draft!

RT by @_akhaliq: Finally they looking at HF.

AK / @_akhaliq — Thu, 11 Jan 2024 16:58:30 +0000

Finally they looking at HF.

nitter.uni-sonia.com/_akhaliq/status/1745476689625329737#m

How to Do Great Work

Paul Graham: Essays — Sun, 02 Jul 2023 10:18:12 +0000

Research Taste Exercises [rough note]

Pages — Fri, 30 Jun 2023 11:09:51 +0000

This article is a rough note. Writing rough notes allows me share more content, since polishing takes lots of time. While I hope it's useful, it's likely lower quality and less carefully considered than my usual articles. It's very possible I wouldn't stand by this content if I thought about it more.

Exercises

Exercise 1: Write down a list of research ideas. Have a mentor you respect rate each idea 1-10. Discuss ideas where you disagree with them after reflection.

It often takes several months of work to truly test an idea. Asking a mentor is a cheap proxy.

Exercise 2: Pay attention when other people try ideas you’ve had. How did the results compare with your expectations?

If you’re investing in brainstorming, you’ll have tens or hundreds of research ideas for each one you try. It can be emotionally uncomfortable when someone publishes an idea you had, but it’s an opportunity to get precious feedback for free.

Exercise 3: Interview researchers around you on their taste. Why do they work on the problems they do? How do they pick problems? What’s their “big picture” of research?

Bonus: If you have a great interview, you might consider writing it up with their permission.

(PhD students often have short meetings with researchers visiting their lab. One pattern is pitching their project to the visitor, which may not be very productive if they don't have similar interests. I suspect they’d get more out of talking about meta-research.)

Exercise 4: Read books about the history of science. Reflect on why some researchers focused on important directions their contemporaries ignored.

(Kuhn’s “The Structure of Scientific Revolutions” is one of my favorite books, and you can get an audio book!)

Exercise 5: Critically consider your research taste, and the community taste around you. Your taste is likely very influenced by your research cluster (your collaborators, advisor, etc).

In what ways has your own research taste or your community's taste been wrong over the last few years?

Are there adjacent research “schools” with significantly different research taste? If so, try to articulate the strongest version of their view, and why you agree or disagree.

In what ways to you disagree with the research taste of your own community? What are ways in which you think your community might be wrong even if you aren't confident in that view?

What are the underlying goals of your research taste? This could simply be intrinsically wanting to understand, or just finding research fun. But there might also be ways you want to make the world a better place. Are the problems you are working on (or think are intersting) algined with those goals?

Failure Modes

I also wanted to go through some research taste failure modes:

Failure Mode 1: Getting overly attached to one research direction / falling into sunk costs.

A lot of researchers start working on one problem (often more due to circumstance than contemplated decision) and then find it hard to move on.

Potential Antidote: Set aside a week or two to step back and create a list of the other research directions you think would be most promising to work on (look at new research, talk to others, etc). Then ask what you’d want to work on if you were starting from scratch. Repeat every 1-2 years as needed.

Failure mode 2: Lack of research knowledge / intimacy.

Theoretical knowledge is table stakes for research taste. You can’t have research taste in a vacuum.

Sometimes people fall into a trap of trying to pick the perfect problem or have a brilliant insight before getting their hands dirty. You can’t have research taste in a vacuum. You need theoretical knowledge and research intimacy.

Potential Antidote: Help someone else with their project in a space you’re interested in. You’ll learn a lot and get your hands dirty without becoming overly committed to working on something long term. Ideally, by not leading the project, this should also be lower stress. Alternatively, do your own short-term projects to get your hands dirty.

Failure mode 3: Environment not aligned with your interests.

I sometimes talk to PhD students who have different interests than their advisor, and are trying to contort their research interests into something their advisor will find palatable. This generally leads to ideas optimized for compromise rather than quality. This leads to suboptimal resarch, and starves you of an opportunity to build taste.

This can happen even when those around you nominally support you working on whatever you’re interested in. It can be challenging to pursue a direction when people around you are unenthusiastic.

Potential Antidote: Either (1) move to an environment which is aligned with your interests, or (2) temporarily buy into the interests of your group and try to exercise good taste within those interests. I’m generally pro moving to another environment once you’ve reflected a lot on what you’re interested in and have something you feel very actively excited about.

Suggestions from other people

In the twitter discussion (and especially, a thread by Andy Matuschak), a number of people suggested other ideas. Note all of these are quite "exercises for building taste" (many are more "strategies for exercising good taste"), but I think they're very interesting:

Rachel Prudden suggests
The heuristic of imagining that another group has published the paper you have in mind. Are you excited to read it?
Chris comment: I think this is a pretty good heuristic.
Andy Matuschak suggests
ask the anti-Hamming question about your ideas. What are the most interesting (not important!) problems in your field? Why aren't you working on them?
Chris comment: This is a great question. See also Hamming's famous "You and Your Research" talk.
Andy also shares a failure mode:
running to make an idea happen as soon as it seems tractable. It's exciting to have a viable approach—makes me wanna go do it! But if I sit with an idea longer I can often evolve a deeper variant and run with that instead.
Michael Nielsen's Principles of Effective Resarch (via Shrey Jain) notes the value of paying attention to messes:
When you identify such a mess, the natural inclination of many people is to shy away, to find something that is easier to understand. But a field that is a mess is really an opportunity. Chances are good that there are deep unifying and simplifying concepts still waiting to be understood and developed by someone - perhaps you.
Arthur Allshire
What are the bottlenecking problems in your field? Eg. in a lot of reinforcement learning research, a lot focus on slight tweaks to algorithms, but the biggest improvements have come from improving simulation for RL.
Preetum Nakkiran
I've found that writing "project proposals" in the short term, and "research statements/plans" in the long term, are good exercises to think about motivations & goals (at diff timescales). By "project proposal", I mean: Once you decide to commit the next >1 month of time to something, write up a few pages explaining: why the question is interesting, why the answer will be valuable, and what you expect to find (w/ initial evidence). And then get feedback.
... Also seconding the advice to ask everyone you meet about their "big picture" research goals. This is almost never written in papers, but is much more interesting/important.
Christian Szegedy
In hindsight though, most of the errors are made in my research was when I listened to the criticisms of my respected mentors and most of the cool stuff I did were the ones those everybody called initially useless, uninteresting or "good luck!"... Generally, ideas in research need to be either ahead of the others or should defy conventional wisdom. Being in the latter camp requires less skill, but more courage.
Chris comment: I think there's a lot of truth to this, but beware survivorship bias and remember that different people may have different levels of scientific maturity / taste at different points in their careers. I suspect that many PhD students would find charting their own research agenda from the start very challenging, while others would thrive.
Sergey Gorbunov
My only advice on research taste to grad students or junior researchers is *not* to listen to your advisors/mentors/established researchers... Grad school is the only place when you get to try and fail as many times as needed. Grad students should come up with outside the box problems/solutions. (Established researches have their own agendas/way of thinking😂). The advice can similarly be used to overturn itself. 🧐
Boaz Barak
I’m fond of “20/80 projects” where you do 80% of work to get 20% of goals:
Rather than asking “what’s coolest result I can get with smallest effort?” ask “can I find problem that distills the main obstacle even if end result doesn’t sound cool?

Acknowledgments

I'm grateful to the countless people who've talked to me about research philosophy over the years.

Thanks to Nick Cammarata for discussion of these exercises.

Thanks to Ryan Saxe and Abinav for correcting typos.

An AI Miracle Malcontent

Machine Learning (Theory) — Wed, 05 Apr 2023 21:44:38 +0000

The stark success of OpenAI’s GPT4 model surprised me shifting my view from “really good autocomplete” (roughly inline with intuitions here) to a dialog agent exhibiting a significant scope of reasoning and intelligence. Some of the MSR folks did a fairly thorough study of capabilities which seems like a good reference. I think of GPT4 as an artificial savant: super-John capable in some language-centric tasks like style and summarization with impressive yet more limited abilities in other domains like spatial and reasoning intelligence.

And yet, I’m unhappy with mere acceptance because there is a feeling that a miracle happened. How is this not a miracle, at least with hindsight? And given this, it’s not surprising to see folks thinking about more miracles. The difficulty with miracle thinking is that it has no structure upon which to reason for anticipation of the future, prepare for it, and act rationally. Given that, I wanted to lay out my view in some detail and attempt to understand enough to de-miracle what’s happening and what may come next.

Deconstructing The Autocomplete to Dialog Miracle
One of the ironies of the current situation is that an organization called “OpenAI” created AI and isn’t really open about how they did it. That’s an interesting statement about economic incentives and focus. Nevertheless, back when they were publishing, the Instruct GPT paper suggested something interesting: that reinforcement learning on a generative model substrate was remarkably effective—good for 2 to 3 orders of magnitude improvement in the quality of response with a tiny (in comparison to language sources for next word prediction) amount of reinforcement learning. My best guess is that this was the first combination of 3 vital ingredients.

Learning to predict the next word based on vast amounts of language data from the internet. I have no idea how much, but wouldn’t be surprised if it’s a million lifetimes of reading generated by a billion people. That’s a vast amount of information there with deeply intermixed details about the world and language.
1. Why not other objectives? Well, they wanted something simple so they could maximize scaling. There may indeed be room for improvement in choice of objective.
2. Why language? Language is fairy unique amongst information in that it’s the best expression of conscious thought. There is thought without language (yes, I believe animals think in various ways), but you can’t really do language without thought.
The use of a large deep transformer model (pseudocode here) to absorb all of this information. Large here presumably implies training on many GPUs with both data and model parallelism. I’m sure there are many fine engineering tricks here. I’m unclear on the scale, but expect the answer is more than thousands and less than millions.
1. Why transformer models? At a functional level, they embed ‘soft attention’ (=ability to look up a value with a key in a gradient friendly way). At an optimization level, they are GPU-friendly.
2. Why deep? The drive to minimize word prediction error in the context of differentiable depth creates a pressure to develop useful internal abstractions.
Reinforcement learning on a small amount of data which ‘awakens’ a dialog agent. With the right prompt (=prefix language) engineering a vanilla large language model can address many tasks as the information is there, but it’s awkward and clearly not a general purpose dialog agent. At the same time, the learned substrate is an excellent representation upon which to apply RL creating a more active agent while curbing an inherited tendency to mimic internet flamebait.
1. Why reinforcement learning? One of the oddities of language is that there is more than one way of saying things. Hence, the supervised learning view that there is a right answer and everything else is wrong sets up inherent conflicts in the optimization. Hence, “reinforcement learning from human feedback” pairs inverse reinforcement learning to discover a reward function and basic reinforcement learning to achieve better performance. What’s remarkable about this is that the two-step approach is counter to the information processing inequality.

The overall impression that I’m left with is something like the “ghost of the internet”. If you ask the internet for the answer to a question on the best forum available and get an answer, it might be in the ballpark of as useful and as correct as that which GPT4 provides (notably, in seconds). Peter Lee’s book on the application to medicine is pretty convincing. There are pluses and minuses here—GPT4’s abstraction of language tasks like summarization and style appear super-human, or at least better than I can manage. For commonly discussed content (e.g. medicine) it’s fairly solid, but for less commonly discussed content (say, Battletech fan designs) it becomes sketchy as the internet gives out. There are obviously times when it errs (often egregiously in a fully confident way), but that’s also true in internet forums. I specifically don’t trust GPT4 with math and often find it’s reasoning and abstraction abilities shaky, although it’s deeply impressive that they exist at all. And driving a car is out because it’s a task that you can’t really describe.

What about the future?
There’s been a great deal about the danger of AI discussed recently, and quite a mess of misexpectations about where we are.

Is GPT4 and future variants the answer to [insert intelligence-requiring problem here]? GPT4 seems most interesting as a language intelligence. It’s clearly useful as an advisor or a brainstormer. The meaning of “GPT5” isn’t clear, but I would expect substantial shifts in core algorithms/representations are necessary for mastering other forms of intelligence like memory, skill formation, information gathering, and optimized decision making.
Are generative models the end of consensual reality? Human societies seem to have a systematic weakness in that people often prefer a consistent viewpoint even at the expense of fairly extreme rationalization. That behavior in large language models is just looking at our collective behavior through a mirror. Generative model development (both language and video) do have a real potential to worsen this. I believe we should be making real efforts as a society to harden and defend objective reality in a multiple ways. This is not specifically about AI, but it would address a class of AI-related concerns and improve society generally.
Is AI about to kill everyone? Yudkowski’s editorial gives the impression that a Terminator style apocalypse is just around the corner. I’m skeptical about the short term (the next several years), but the longer term requires thought.
1. In the short term there are so many limitations of even GPT4 (even though it’s a giant advance) that I both lack the imagination to see a path to “everyone dies” and I expect it would be suicidal for an AI as well. GPT4, as an AI, is using the borrowed intelligence of the internet. Without that source it’s just an amalgamation of parameters of no interesting capabilities.
2. For the medium term, I think there’s a credible possibility that drone warfare becomes ultralethal inline with this imagined future. You can already see drone warfare in the Ukraine-Russia war significantly increasing the lethality of a battlefield. This requires some significant advances, but nothing seems outlandish. Counterdrone technology development and limits on usage inline with other war machines seems prudent.
3. For the longer term, Vinge’s classical singularity essay is telling here as he lays out the inevitability of developing intelligence for competitive reasons. Economists are often fond of pointing out how job creation has accompanied previous mechanization induced job losses and yet my daughter points out how we keep increasing the amount of schooling children must absorb to be capable members of society. It’s not hard to imagine a desolation of jobs in a decade or two where AIs can simply handle almost all present-day jobs and most humans can’t skill-up to be economically meaningful. Our society is not prepared for this situation—it seems like a quite serious and possibly inevitable possibility. Positive models for a nearly-fully-automated society are provided by Star Trek and Iain Banks although science fiction is very far from a working proposal for a working society.
4. I’m skeptical about a Lawnmower Man like scenario where a superintelligence suddenly takes over the world. In essence, cryptographic barriers are plausibly real, even to a superintelligence. As long as that’s so, the thing to watch out for is excessive concentrations of power without oversight. We already have a functioning notion of super-human intelligence in organizational intelligence and are familiar with techniques for restraining organizational intelligence into useful-for-society channels. Starting with this and improving seems reasonable.

Dhruv Batra @DhruvBatraDB

@DhruvBatraDB — Fri, 24 Mar 2023 01:21:51 +0000

A lot of my arguments about the foundations of intelligence being sensorimotor control (and not language or reasoning) are shaped by discussions with Jitendra over the years. This is a good summary of his arguments.

Jitendra MALIK @JitendraMalikCV

I delivered the 110th Annual Martin Meyerson UC Berkeley Faculty Research Lecture on March 20, 2023. https://www.youtube.com/watch?v=f6fDpKDxpL0 …

Film Study for Research

Bounded Regret — Mon, 28 Jun 2021 07:00:00 +0000

Research ability, like most tasks, is a trainable skill. However, while PhD students and other researchers spend a lot of time doing research, we often don't spend enough time training our research abilities in order to improve. For many researchers, aside from taking classes and reading papers, most of our training is implicit, through doing research and interacting with mentors (usually a single mentor--our PhD advisor or research manager). By analogy, we are like basketball players who somehow made it to the NBA, and are now hoping that simply playing basketball games will be enough to keep improving.

Drawing on this analogy, I want to talk about two habits that are ubiquitous among elite athletes, that have analogs in research that I feel are underutilized. Those who do pursue these habits as PhD students often improve quickly as researchers.

The first habit is film study. Almost every high-level athlete watches films of other players of the same sport, including historical greats, contemporary rivals, and themselves. This allows them to incorporate good ideas from other players' games as well as to catch and eliminate flaws in their own game. Even the very best players benefit from watching film of themselves and others.

The second habit, which I call act-reflect-ask, occurs in the course of a game or scrimmage. I'll describe this from my own experience (although I'm by no means an elite athlete, I've learned this from people who are). After a point ends, I generally think about what happened during the point--Was there anything I wanted to do better? Did anything unexpected happen? Then I'll re-run those parts in my head, simulating what I would have done differently until I feel like I know how to consistently make the right decision. In some cases, I can't figure it out--perhaps I was playing defense, someone beat me, and I can't figure out what they did or can't figure out the counter. In that case I'll ask a teammate about it (or the person who beat me, if it's a friendly scrimmage) and talk it over until I see the right strategy for the future.

Both of these strategies are invaluable for improving. They leverage the fact that as humans, we tend to learn socially: we are very good at adopting strategies from others, so film study and asking are efficient ways to learn. Both strategies also lead to deliberate practice focused on real-world contexts. Below, I'll show that these strategies have analogs in research, and argue that good researchers should adopt both into their own habits.

Film Study

As mentioned above, good athletes watch lots of film of other athletes. This extends to other skills as well--most chess players, including grandmasters, study games by both contemporary and historical greats. They do this to understand how other very strong players play, in order to adopt ideas and, in the case of rivals, to counter those ideas (this part is less relevant to research). Even the very best players do this.

What is the equivalent to this in research? Ideally, we would watch world experts as they work, observing how they think, perform experiments, and so on. Unfortunately, this is difficult--much research work is internal rather than external, and we don't routinely film great researchers in the same way as we do with athletes. The closest obvious analog is working closely with a mentor, as many PhD students do with their PhD advisor. Then, it is often possible to see first-hand how a more experienced researcher approaches a problem. However, this isn't scalable, and most people only get to do this with one person--their advisor. (As an aside, it is very useful for students to develop a good model of their advisor's thinking style--I think this tends to be underrated.)

A more scalable approach would be reading papers, but this doesn't achieve the full goal of film study--you only see the finished product, rather than the thought process, and it tends to only show the part of a writer's thoughts that are widely defensible. What we want is a public record of someone's thoughts, including off-the-cuff thoughts that wouldn't make it into a paper.

In fact, we do have this, in the form of blogs. The right type of blogs can provide a valuable form of "film study". I personally learned a lot about statistics from Andrew Gelman's blog. Often, someone sends him a paper and he just gives his off-the-cuff reactions to it: what he liked and didn't, what was convincing, what parts seem sketchy. I probably learned more from reading his blog than from statistics classes (of which I've taken embarrassingly few, yet somehow managed to get hired by a Statistics department; I'll credit Gelman for this). Scott Aaronson's blog is good in the same way for theoretical computer science. Many posts on the GiveWell and Open Philanthropy blogs are good in this way, too. In all cases, I'd look at the earlier rather than later posts (though not the very earliest); the reason is that once blogs have too large an audience, writers start to feel constrained to write more "professionally" and you get less of the valuable off-the-cuff thinking.

In addition to blogs, debates are another good source of off-the-cuff, in-the-moment thinking, as long as the participants don't overprepare and as long as they are trying to make good arguments rather than score rhetorical points. Actually, the best debates I've seen also take part via blogs, such as the debate over de-worming in global health. Seminars can be good film study, but are primarily film study for giving presentations rather than doing research (and for this, also watch recordings of great talks online). Seminar Q&A can be good film study for research thinking, as long as participants are opinionated and express those opinions in a clear way that exposes their underlying mental model. For programming, you can watch people code on Twitch, or pair program with other students in your research group.

The above are all useful sources of in-the-moment thinking. For research, we also make decisions--such as what directions to pursue--that have consequences on the scale of years. To film study these, I read histories of important scientific developments. Good histories will follow individuals around in detail for an extended period of time, ideally with primary sources. For instance, The Making of the Atomic Bomb covers developments in physics up to and through the Manhattan project, and discusses many of the decisions, discoveries, and dead ends faced by Fermi, Szilard, Oppenheimer, and others. (The dead ends are especially important, so that you can see the whole process and not just what is useful today.) Another great example is The Eighth Day of Creation, which does the same for the development of modern biotechnology. Such histories have helped me gain a better understanding of how science develops on the scale of years or decades, which I would otherwise have to learn the hard way, over my own years and decades of research.

Some other miscellaneous advice: transcripts of talks can sometimes be good in the same way as blogs. Richard Hamming's "You and Your Research" is excellent on this front. For talks, recording yourself and watching the recording may be the fastest route to improvement. Finally, in addition to histories, case studies (often taught in law or business courses) also provide information that would be expensive to gather otherwise.

In summary, film study blogs for off-the-cuff research thinking; watch great presentations and record yourself to learn how to speak; pair program and watch programming streams; and read histories of science for long-term research decisions.

Act-Reflect-Ask

In the act-reflect-ask loop, we reflect on whether something could have gone better after we do it, and ask someone else if we can't figure it out. There are many ways to do this in research:

When seeing a proof, if you don't see how you would have come up with the proof yourself, discuss with others how to do so (this is usually what people mean when they ask “what's the motivation for that step?”). The same goes whenever you see a cool experiment or idea that you're not sure you would have come up with yourself. First try to think about whether there's a way to modify your thought process to reliably come up with such ideas in the future. If not, discuss with the presenter so that you can learn.
After you give a talk, pull aside one of the audience members and get feedback on what worked/didn't work in the talk.
After attending a seminar, discuss what was or wasn't convincing, what was most interesting, etc. Paper reading groups are valuable as they often focus on this. (This isn't quite act-reflect-ask since the seminar was given by someone else; but you can think of it as a way of checking your own thoughts during the seminar against others'.)
Every week, reflect on what things felt less efficient than they needed to be. Think for yourself how to improve these, then talk to friends, colleagues, or mentors to get additional ideas.

In addition to helping yourself improve, these habits help others as well--asking someone for advice engages their own thinking in a growth-oriented direction, so by helping you they are likely improving themselves, too. This also helps at the level of teams, as it builds chemistry and creates a shared culture of excellence and growth. Indeed, in sports, the best teams do this regularly, and veteran players are proactive in finding ways to help younger players. Some professional players even stay in a league, making millions of dollars a year, solely by being excellent sources of advice and mentorship.

Summary

Find ways to routinely study research decision-making, through blogs, seminars, video streams, and histories. Actively consume these to adopt and build up effective mental heuristics. Whenever you do something, reflect on how it could be better, and ask others for advice. As you learn more yourself, find ways to give back to others. Consistently doing these will help you to become a better researcher over time, and contribute to a culture of excellence among those around you.

Principles for Productive Group Meetings

Bounded Regret — Wed, 22 Mar 2023 00:47:54 +0000

Note: This post is based on a Google document I created for my research group. It speaks in the first person, but I think the lessons could be helpful for many research groups, so I decided to share it more broadly. Thanks to Louise Verkin for converting from Google doc to Markdown format.

This document talks about principles for having productive group meetings and seminars, and to some extent a good group culture in general. It’s meant to be a living document--I’ve started it based on my own experiences, but ultimately our seminars and group culture come from all of us together. So if you have ideas you want to add, please do so!

I’ll start by talking about an important concept called psychological safety, then discuss what I see as the goals of our research group and how that fits into presentations and discussions in seminars and meetings. I’ll also provide tips for asking excellent questions and some general philosophy on how to hold yourself to a high standard of understanding.

Psychological Safety

Psychological safety is an important concept for fostering creative and high-functioning teams. I would highly recommend reading the following two documents to learn about it in detail:

To summarize, a psychologically safe team is one where members feel like:

They can make mistakes without it affecting their status in the group
It is easy to give and receive feedback, including critical feedback, without feeling attacked or like one is causing trouble
One is allowed to and encouraged to question prevailing opinions

These are especially important in research environments, because questioning and risk-taking are needed to generate creative ideas, and making mistakes and receiving feedback are necessary for learning.
In general, I would encourage everyone in our group to take risks and make mistakes. I know everyone holds themselves to a high standard and so doesn’t like to make mistakes, but this is the main way to learn. In general, if you never do anything that causes you to look silly, you probably aren’t taking enough risks. And in another direction, if you never annoy anyone you probably aren’t taking enough risks. (Of course, you don’t want to do these all the time, but if it never happens then you can probably safely push your boundaries a bit.)

Fostering psychological safety. As a group, here are some general principles for fostering psychological safety among our teammates:

Assume your teammates have something to teach you, and try to learn from them.
In discussions and debates, aim to explain/understand, not to persuade. Adopt a frame of collaborative truth-seeking, rather than trying to “win” an argument.
Acknowledge and thank people for good points/questions/presentations/etc.
Invite push-back
Welcome and encourage newcomers

In addition, there are a couple things to avoid:

Try not to talk over people. Sometimes this happens due to being very excited and engaged in a conversation, and don’t sweat it if you do this occasionally, but try not to do it habitually, and if you do do it make sure to invite the person you interrupted to finish their point.
Avoid making broadly negative or dismissive statements. Even if you personally don’t intend such a statement to apply to anyone in the group, it’s inevitable that someone will take it personally. It also works against the principle of “questioning prevailing opinions”, because it implies that there’s an entire area of work or claims that is “off-limits”.

As an example, when I was a PhD student, a senior person often made claims to the effect that “research was pointless unless industry people cared about it”. This made it feel discouraging for me to do my (at the time) more theoretically-oriented work, and I abandoned at least one valuable project because of this. With the benefit of hindsight, I don’t think that person actually would have endorsed the literal claim I wrote above, but that’s exactly the point I’m making–it’s easy for other people to overinterpret claims.

Group Goals and Group Meetings

In my view, our group has three major goals:

Do excellent research
Help each other to learn and grow
Help the world

In the context of group meetings/seminars, we can promote these goals in the following ways:

Hold yourself to a high standard of understanding (see below for more on this). In other words, don’t just follow the individual steps–try to understand why things had to be this way and not any other way. Asking questions about this not only helps your own understanding, but also pushes the speaker to clarify their own thinking–thus promoting the goals of excellent research and of learning.
It’s okay and encouraged to tie things back to the bigger picture. Excellent research is not only technically sound but also well-motivated. Understanding the bigger picture is also especially important for helping the world.
Try to ask questions in a way that succinctly models your own thinking process. One of the most valuable aspects of group meetings is that you can see how other people think, which helps learning. As a concrete example, sometimes in applied talks we ask questions that are very specific and only make sense to people immersed in that area. This is okay, but it’s better to ask the same question in a way that lets people not in that area see why the question is important.
As a speaker, don’t aim for the standard of “defensibility”. Instead, aim to convince the audience that you are onto something important and exciting (this is a different but not strictly higher standard, since it might involve saying some things that are only partially defensible). Similarly, as an audience member don’t be satisfied just because there’s “nothing wrong”–try to understand why a project was important enough that someone was excited to spend months of their life on it.

In addition, here are some meta-level principles around question-asking:

Basic understanding questions, even at the level of clarifying notation, are highly valuable and usually under-utilized because they don’t feel “smart”. I encourage everyone to ask these questions when they have them–if you’re confused, probably someone else is too, and it’s valuable feedback for the speaker.
I try to pay attention to how many other questions are being asked. If no one is asking questions, I’ll try to ask one to break the ice. If lots of questions are being asked, I’ll try to filter my own questions for the ones that are highest-value or most different from what’s already being discussed.
I also try to pay attention to how many questions I personally have already asked. If I haven’t asked a question yet I feel very free to ask one. If I’ve asked many already, I again try to filter for the highest-value ones.
As an audience member, you have much more cognitive bandwidth than the speaker. It’s therefore helpful to take the extra time to formulate your question to be easy to understand and engage with. It’s also good to state it succinctly when possible. Time spent formulating a question is time spent only by you, but time spend asking/answering it is spent by everyone in the audience.

Seminar Norms

The culture of a good seminar is different from the culture of everyday conversations, in a way that might not be obvious if you haven’t been immersed in it for a long time. I’ve already gone over that to some extent above, but below I’ll elaborate on some specific points in more detail, and lay out some helpful rules and norms that are usually unstated.

Audience Culture

There are many everyday social norms that hinder us from seeking a high level of understanding in a talk. Asking a question feels like a bid on the speaker’s and audience’s time and attention. We might worry that it’s a “dumb” question, or feel intimidated by a complicated statement that we don’t understand. Or conversely we might worry that it’s impolite or aggressive to ask for such a high (and, if we’re being honest, demanding) level of understanding. We might worry that we’re putting the speaker on the spot and that perhaps they won’t be able to answer and that we’ll make the speaker look “dumb”.

These are all natural and common thoughts to have from the perspective of everyday culture. But in my opinion, they come from a misconceptualization of seminar culture. Here is a conceptualization that can help dissolve these thoughts.

You have a right to understand. If something is said in a seminar, you have a right to understand it. Science progresses not by ineffable truths that cannot be explained, but by clearly articulated common knowledge. It helps to also remember that:

If you don’t understand something, it is likely that many other people do not as well.
Articulating a confusion is often itself a useful intellectual act. Sometimes we may not even realize that we are missing something until it is pointed out.

Asking questions shows respect. When I ask a question, it shows that I am interested enough in the topic to engage with it, and that I trust the speaker to give an informative answer. Not asking questions implies that the topic is either not worth engaging with, or that you don’t think the speaker is equipped to answer. Questions show respect.

Speaker Culture

You have a right to direct the conversation. A vigorous seminar audience will likely have more questions than you have time to answer, and might sometimes focus on early aspects of a talk that are not the main point. Therefore, as the speaker, you always have a right to direct the conversation to the aspects that will be most interesting or fruitful. You can simply politely cut off a current line of questioning by explaining that there are other topics you want to get to, and promising to engage later if necessary.

Honest answers show courage. As the speaker, perceptive questions will often stretch the limits of your own understanding. It can be tempting to reflexively deflect or bluster to hide this. But it is much better to be honest about those limits (while feeling free to engage in speculation). Learning the limits of your own knowledge is also a great opportunity for growth.

Being an Excellent Participant

The above norms for speakers and listeners set the ground rules for a productive seminar. But there is more you can do to help actively stimulate learning. Here are a few principles:

As a listener, be mindful of cognitive load. The speaker has to manage an entire audience of dozens of people, while you as a listener really only have to worry about yourself. So if there’s a question that’s bugging you, that the speaker doesn’t initially give a good answer to, try to do as much work as you can to productively reformulate your question, rather than making the speaker figure it out for you. (Of course, sometimes this isn’t possible, and the speaker does have the advantage of being the expert on the topic. But it’s good to try to offload cognitive load from the speaker whenever possible.)
As a listener, be mindful of tone. This is in some sense a corollary of cognitive load. Certain tones take extra effort to gracefully process or to respond to (e.g. dismissiveness, condescension, extreme assertiveness, etc.). We should mostly want tone to be fairly neutral (neither timid nor overbearing, but curious and assertive).
As a speaker, be mindful of tone. Treating questions dismissively will ensure that other people don’t ask questions. We generally don’t do this intentionally, but e.g. giving a short, confident-sounding, but incomplete answer can make it psychologically harder to ask follow-up questions.
As a speaker, avoid rambling. Sometimes when we aren’t completely satisfied with our own answer, we end up rambling or repeating the same answer in several different ways. This can end up taking up several minutes of time if you don’t catch yourself. Once you’ve said what you have to say, move on to the next slide or the next question (fine to acknowledge if you think there might be more to say after further thought).

None of these are things we will remember all the time, and it's not a big deal if you forget, but these are all habits to aspire to that will improve the experience for both you and others.

Tips for High-Trust Environments

For high-trust environments (like our own group meeting), we can do even better. Here we can keep in mind that everyone is on the same team, and our goal is to help each other excel. In particular:

Don’t be afraid to ask tough questions. Our meeting is a safe space, and asking tough questions now helps the speaker think through them before they present externally.
Hold others to the standard you would hold yourself. From knowing all of you, I know that we all hold ourselves to a high personal standard–we want to do excellent work on the most important problems in ML. Let’s call this the standard of excellence. In seminars, I think we sometimes make the mistake of holding the speaker to the standard of defensibility: can they give a reasonable-seeming answer to questions of why/how they did something? Defensibility isn’t just too low of a standard, it’s actually the wrong standard: any ambitious project is going to go out on a limb in some ways, and there will be parts of it that are more speculative. Optimizing for defensibility leads us to avoid ambition. So get the speaker to convince you that this is excellent, rather than defensible, work.

For a completed project, my aspirational goal as a speaker is usually to convince the audience that my work addresses a key issue on one of the most important problems in the field (or ideally the world), and that they should be working on this question if they have the right skillset. I almost never meet this goal, but the point is that striving for it leads me to meet higher levels of excellence over time. I think we should all at least periodically strive for this goal in our talks, realizing that we won’t meet it but that the gap can reveal important lessons or important directions of future work. Similarly, as an audience we should consider holding the speaker to this standard. At the same time, we should recognize that anyone who is even inviting this standard in the first place is already performing an act of virtue, and that even being able to talk about where it falls short means that it’s in a comparison class with outstanding work.

On the other hand, many of the presentations in our group are (and should be) on preliminary work or half-baked ideas. Here the above standard is not particularly helpful, and the honest answer to some questions will be “I dunno, I just have some vague intuition that this is a good idea”. Asking those questions is still valuable as long as they are well-targeted (in the sense that we could reasonably expect a more interesting answer than “I have some vague intuition”, or if they point to a place where it would be particularly useful to refine the intuition). But it’s also useful to think in terms of more brainstorm-y questions: “Have you tried X?”, “This seems related to other interesting thing Y”, “What about this alternative framing?”, “I think your high-level question is interesting, but how do you grapple with key conceptual issue Z? Maybe you could try this technique”. Actually, these are great questions even for a fully-baked talk. But for half-baked ideas we should conspicuously increase the number of these types of questions, because the goal is to help give the speaker useful ideas rather than to construct a thorough collective understanding of the topic.

If you’re a speaker who feels nervous giving talks, remember that you’re among friends whose ultimate goal is to help you do great research. This is the time to take risks, get feedback, and grow. Similarly, if you’re an audience member who feels hesitant to ask questions, think of this as the place to expand your comfort zone and try things you wouldn’t normally try. And of course, if you have any thoughts or questions about any of this, feel free to leave a comment here or ask me one-on-one.

Levels of Understanding

Finally, I want to talk about different levels of understanding (which is, after all, the point of a seminar).

(Note: The first example below is a bit dense because it’s about a mathematical definition. Feel free to skip to the second example, on robustness, if it’s too much effort to decipher.)

Let’s suppose that in some talk you see the following definition:

A function f on [0,1] is Holder continuous with parameter α if, for k = floor(α) it satisfies |f^(k)(x)-f^(k)(y)|≤C|x-y|^α-k for some constant C>0, for all x,y.

This definition is probably mysterious to you (it was to me). Let’s suppose you ask the speaker for some intuition on what this definition is doing. There’s at least three levels of explanation they could give:

Level 1: For α=1 this is the same as being Lipschitz, so think of this as a generalization of Lipschitz.

Level 2: Morally, this is asking that the function be “α times differentiable”, where we want α to not necessarily be a whole number. For integer α the condition exactly says that f should have α derivatives, while for α<1 it asks the function locally to grow as |x-y|^α, which is weaker than differentiability but approaches differentiability as α->1.

Level 3: A level 2 explanation, plus a description of in what sense this is really a generalization of differentiability (i.e. what analogous properties we get), or some explanation of why this is the “right” way to generalize differentiability. [I don’t actually know the answer to this…]

Of course, the level 3 or level 2 explanation might take too long to get across in a talk. But it’s useful to realize that level 3 is always out there, and to notice as a listener when you’re only at level 1 or level 2. And as a speaker, if you don’t have time for at least a level 2 explanation, consider if this definition is really worth putting up there (why not just talk about regular old differentiability and then mention that there’s a generalization?).

These levels apply to all aspects of a talk, not just mathematical definitions. For instance, imagine a talk about robustness, where the speaker is describing the motivation for their work.

Level 1: Robustness is important.

Level 2: The problem we’re considering gets at the following aspect of robustness, which is important.

Level 3: In the field of robustness, one of the core difficulties is X (as evidenced by {conceptual issue, consultation with practitioners, etc.}). We will tackle problem P which offers a way forward on addressing X.

And for motivation in particular, there’s also a final level:
Level 4: In the world at large, M is one of the most important problems, as evidenced by {effect on GDP, important historical analogues, effect on important institutions, etc.}. Machine learning robustness offers a uniquely compelling angle on M for reasons R.

In practice, it is rare for a seminar to ever touch on Level 4. This is probably partly due to time constraints, partly because many academics consider it “out of scope”, and partly because of the possibly impolite implication that other fields of study are less important. The main exception is job talks, where something on level 4 is expected. I think it’s probably correct for Level 4 to be rare in seminars, but I’d personally also like to see slightly more of it at the current margin. For instance, if you’re at the point of presenting a body of work rather than a single paper, I think it’s worthwhile to at least argue for why this is a compelling direction within the field of ML (we could call that level 3.5).

Finally, while addressing the higher levels requires a deep understanding on the part of the speaker, there are similar levels that apply even to something that isn’t well-understood. For instance, suppose in an applied ML talk, there is a mysterious heuristic H that improves the results. One could say:

Level 1: H works.

Level 2: H works, and we have no idea why.
OR H works, for intuitive reason R.

Level 3: H works, and we have no idea why. We haven’t really looked into it [possibly followed by reason why this isn’t a core issue for the present work].
OR H works, and we have no idea why. We tried looking into X,Y,Z to understand it but none of them turned up much insight.
OR H works, for what we speculate is intuitive reason R, but we haven’t really looked into it.
OR H works, for what we think is intuitive reason R, and here’s some additional follow-up evidence that seems to support R.

Note that at each level, there are multiple possible explanations depending on the speaker’s actual level of knowledge. Level 1 simply asserts the empirical observation. Level 2 couples it with the speaker’s opinion about the observation, while Level 3 presents what I’d call the full epistemic status surrounding the observation (i.e. what surrounding questions have been investigated and how they support/don’t support different theories). Of course, the bottom example in Level 3 is preferable to the top example, but only one of those is an honest portrayal of the work, and the speaker doesn’t have the power to change that during a talk. What they do have power over is whether they give a Level 1, 2, or 3 explanation. Therefore, as the speaker, have the courage to give a Level 3 explanation even if it acknowledges uncertainty, and as a listener have the wisdom to accept such a Level 3 explanation and to respect the speaker’s courage and integrity.

Conclusion. Now that we have these levels in mind, we can better understand the seminar norms discussed above. The purpose of these norms is to reach the highest level of understanding possible about the most important aspects of a topic, and to socially reward speakers and listeners who move us towards that understanding.

Ari Morcos @arimorcos

@jefrankle — Mon, 20 Mar 2023 13:57:42 +0000

Web-scale data has driven the incredible progress in AI but do we really need all that data? We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (& train time) by 2x w/ minimal performance loss. 🧵👇

Retweeted by Jonathan Frankle

A rant on LLaMA: please stop training giant language models

Severely Theoretical — Sun, 05 Mar 2023 19:44:41 +0000

Meta AI recently released a new language model called LLaMA. And by “released a model”, I mean “didn’t really release a model”. They released a really really nice form instead which you can fill out and then Meta will get back to you after snooping on you just to make sure you haven’t been naughty recently (did I mention the form is really nice and it’s public: EVERYBODY can fill out the form). Presumably, no weights for you (or just random weights for you) if they find out you have been a bit too naughty for their liking.

Anyway. So, these LLaMAs come in four different sizes: from 6.7B parameters (smol) to 65.2B parameters (chonky). The largest two models are trained for 1.4T tokens, whereas the smaller ones are trained for 1T tokens (not really sure why). This is roughly ~1 epoch (effectively) over the training data. The largest model roughly follows the Chinchilla compute-optimal recipe. There’s nothing the least bit remarkable about the models or the training setup. It’s just the standard GPT model trained in the standard way. The training data is said to be all public, although I didn’t check this carefully for myself (one hopes that it’s not public in the Meta sense of public. Just kidding, but not really).

The money figure in the LLaMA paper (for me) is the following figure that shows the training loss curves for the four models (Figure 1):

Tell me again: why are we still training O(100B) parameter models?

As you can see, no apparent saturation for the 7B and 13B parameter models. In fact, the training loss seems to be decreasing at roughly the same rate for all four models after around 300B tokens. Seeing this figure, one gets immediately overcome by a sense of dejavu: this is the GPT-3 paper all over again with its severely (criminally!) undertrained small models.

From the above figure, it looks distinctly possible (and indeed I would say quite likely) that were the smallest two models given the same amount of compute as the 65B parameter model, they would have probably matched or even surpassed that model. Giving them the same amount of compute would mean training the 7B parameter model ~12.5x longer and the 13B parameter model ~7.6x longer (I calculated these numbers from the corresponding GPU-hours reported in Table 15 of the paper). Here’s what the training loss curves might have looked like in that scenario (you can click on the image for an enlarged view):

plz train the smol one for this loooooooooooooooooooooooooooooooooooooooong

See just how much longer you would have to train the small models to match the compute given to the largest model? Now, you may laugh at my dumbass hand-drawn training loss curves, but I would submit to you that these dumbass hand-drawn curves are in fact much more rigorous than the dumbass “scaling laws” some really smart people came up with. My dumbass hand-drawn curves are also completely harmless, unlike the dumbass “scaling laws”, which had the overall pernicious effects of wasting a huge amount of resources and making these models much less accessible than they could have been.

Anyway. So, I’m trying to find a non-cynical explanation for this almost bizarre, persistent unwillingness to train small models for longer, but I can’t really find a very convincing one. Training a humongous model for a total of 1 epoch only over your training data is a phenomenon that does not really exist anywhere else in machine learning, to my knowledge. Take this CoCa paper for comparison, for instance (which is ~sota on ImageNet as of this writing): it trains a ~2.1B parameter model on a billion scale image-text dataset (~5B examples in total) for ~7 epochs (effectively).

Of course, I don’t believe for a second that people training these giant language models are actually dumb or ignorant, although from my experiences in academia, I could probably justifiably claim that they might be a bit too credulous: you can make a surprisingly large number of people in these circles believe some really dumb shit if it’s said or done by a sufficiently high prestige individual or individuals (just look at the insane “superintelligence” stuff, to give one example).

Anyway. So, my cynical interpretation? As I argued here before, trying to make these models less easily accessible, less easily controllable by others might be a feature, not a bug. I don’t believe, for instance, that OpenAI is really using a 175B parameter model for ChatGPT or for their other language products (here is an interesting analysis I saw recently that makes the same point, with some caveats), but they have an incentive for making people believe that they’re using a 175B parameter model and that it’s actually critical to use a giant model like that.

Last but not least, one final life lesson from all this, folks, is that whenever a theoretical physicist starts talking about power laws, just completely ignore them (and I really mean completely), immediately run away in the opposite direction. It is my contention that nothing good has ever come out of a physicist blabbering about power laws.

Coordination Headwind - How Organizations Are Like Slime Molds

Pages — Tue, 21 Feb 2023 17:02:25 +0000

V.I. Arnold, On teaching mathematics

Pages — Tue, 21 Feb 2023 17:00:07 +0000

Karol Hausman @hausman_k

@hausman_k — Tue, 14 Feb 2023 16:48:02 +0000

If you want to understand why robotics is much harder than it seems, @ericjang11 pointed me once to this essay that does a pretty good job explaining it: Reality has a surprising amount of detail http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail …

Reality has a surprising amount of detail

johnsalvatier.org

My dad emigrated from Colombia to North America when he was 18 looking looking for a better life. For my brother and I that meant a lot of standing outside in the cold. My dad’s preferred method of improving his lot was improving lots, and my brother and I were “voluntarily” recruited to help working on the buildings we owned.

That’s how I came to spend a substantial part of my teenage years replacing fences, digging trenches, and building flooring and sheds. And if there’s one thing I’ve learned from all this building, it’s that reality has a surprising amount of detail.

This turns out to explain why its so easy for people to end up intellectually stuck. Even when they’re literally the best in the world in their field.

Consider building some basement stairs for a moment. Stairs seem pretty simple at first, and at a high level they are simple, just two long, wide parallel boards (2” x 12” x 16’), some boards for the stairs and an angle bracket on each side to hold up each stair. But as you actually start building you’ll find there’s a surprising amount of nuance.

The first thing you’ll notice is that there are actually quite a few subtasks. Even at a high level, you have to cut both ends of the 2x12s at the correct angles; then screw in some u-brackets to the main floor to hold the stairs in place; then screw in the 2x12s into the u-brackets; then attach the angle brackets for the stairs; then screw in the stairs.

Those goddamn stairs.

Next you’ll notice that each of those steps above decomposes into several steps, some of which have some tricky details to them due to the properties of the materials and task and the limitations of yourself and your tools.

The first problem you’ll encounter is that cutting your 2x12s to the right angle is a bit complicated because there’s no obvious way to trace the correct angles. You can either get creative (there is a way to trace it), or you can bust out your trig book and figure out how to calculate the angle and position of the cuts.

You’ll probably also want to look up what are reasonable angles for stairs. What looks reasonable when you’re cutting and what feels safe can be different. Also, you’re probably going to want to attach a guide for your circular saw when cutting the angle on the 2x12s because the cut has to be pretty straight.

When you’re ready to you will quickly find that getting the stair boards at all the same angle is non-trivial. You’re going to need something that can give you an angle to the main board very consistently. Once you have that, and you’ve drawn your lines, you may be dismayed to discover that your straight looking board is not that straight. Lumber warps after it’s made because it was cut when it was new and wet and now it’s dryer, so no lumber is perfectly straight.

Once you’ve gone back to the lumber store and gotten some straighter 2x12s and redrawn your lines, you can start screwing in your brackets. Now you’ll learn that despite starting aligned with the lines you drew, after screwing them in, your angle brackets are no longer quite straight because the screws didn’t go in quite straight and now they tightly secure the bracket at the wrong angle. You can fix that by drilling guide holes first. Also you’ll have to move them an inch or so because it’s more or less impossible to get a screw to go in differently than it did the first time in the same hole.

Now you’re finally ready to screw in the stair boards. If your screws are longer than 2”, you’ll need different ones, otherwise they will poke out the top of the board and stab you in the foot.

At every step and every level there’s an abundance of detail with material consequences.

It’s tempting to think ‘So what?’ and dismiss these details as incidental or specific to stair carpentry. And they are specific to stair carpentry; that’s what makes them details. But the existence of a surprising number of meaningful details is not specific to stairs. Surprising detail is a near universal property of getting up close and personal with reality.

You can see this everywhere if you look. For example, you’ve probably had the experience of doing something for the first time, maybe growing vegetables or using a Haskell package for the first time, and being frustrated by how many annoying snags there were. Then you got more practice and then you told yourself ‘man, it was so simple all along, I don’t know why I had so much trouble’. We run into a fundamental property of the universe and mistake it for a personal failing.

If you’re a programmer, you might think that the fiddliness of programming is a special feature of programming, but really it’s that everything is fiddly, but you only notice the fiddliness when you’re new, and in programming you do new things more often.

You might think the fiddly detailiness of things is limited to human centric domains, and that physics itself is simple and elegant. That’s true in some sense – the physical laws themselves tend to be quite simple – but the manifestation of those laws is often complex and counterintuitive.

II. Boiling A Watched Pot

Consider the boiling of water. That’s straightforward, water boils at 100 °C, right?

Well the stairs seemed simple too, so let’s double check.

Put yourself in the shoes of someone at the start of the 1800’s, with only a crude, unmarked mercury thermometer, trying to figure the physics of temperature.

Go to your stove, put some water in a pot, start heating some water, and pay attention as it heats.

(I suggest actually doing this)

The first thing you’ll probably notice is a lot of small bubbles gathering on the surface of the pot. Is that boiling? The water’s not that hot yet; you can still even stick your finger in. Then the bubbles will appear faster and start rising, but they somehow seem ‘unboiling’. Then you’ll start to see little bubble storms in patches, and you start to hear a hissing noise. Is that Boiling? Sort of? It doesn’t really look like boiling. The bubble storms grow larger and start releasing bigger bubbles. Eventually the bubbles get big and the surface of the water grows turbulent as the bubbles begin to make it to the surface. Finally we seem to have reached real boiling. I guess this is the boiling point? That seems kind of weird, what were the things that happened earlier if not boiling.

To make matters worse, if you’d used a glass pot instead of a metal one, the water would boil at a higher temperature. If you cleaned the glass vessel with sulfuric acid, to remove any residue, you’d find that you can heat water substantially more before it boils and when it does boil it boils in little explosions of boiling and the temperature fluctuates unstably.

Worse still, if you trap a drop of water between two other liquids and heat it, you can raise the temperature to at least 300 °C with nothing happening. That kind of makes a mockery of the statement ‘water boils at 100 °C’.

It turns out that ‘boiling’ is a lot more complicated than you thought.

This surprising amount of detail is is not limited to “human” or “complicated” domains, it is a near universal property of everything from space travel to sewing, to your internal experience of your own mind.

III. Invisible vs. Transparent Detail And Getting Intellectually Stuck

Again, you might think ‘So what? I guess things are complicated but I can just notice the details as I run into them; no need to think specifically about this’. And if you are doing things that are relatively simple, things that humanity has been doing for a long time, this is often true. But if you’re trying to do difficult things, things which are not known to be possible, it is not true.

The more difficult your mission, the more details there will be that are critical to understand for success.

You might hope that these surprising details are irrelevant to your mission, but not so. Some of them will end up being key. Wood’s tendency to warp means it’s more accurate to trace a cut than to calculate its length and angle. The possibility of superheating liquids means it’s important to use a packed bed when boiling liquids in industrial processes lest your process be highly inefficient and unpredictable. The massive difference in weight between a rocket full of fuel and an empty one means that a reusable rocket can’t hover if it can’t throttle down to a very small fraction of its original thrust, which in turn means it must plan its trajectory very precisely to achieve 0 velocity at exactly the moment it reaches the ground.

Some important details for colonizing the universe.

You might also hope that the important details will be obvious when you run into them, but not so. Such details aren’t automatically visible, even when you’re directly running up against them. Things can just seem messy and noisy instead. ‘Spirit’ thermometers, made using brandy and other liquors, were in common use in the early days of thermometry. They were even considered as a potential standard fluid for thermometers. It wasn’t until the careful work of Swiss physicist Jean-André De Luc in the 18th century that physicists realized that alcohol thermometers are highly nonlinear and highly variable depending on concentration, which is in turn hard to measure.

You’ve probably also had experiences where you were trying to do something and growing increasingly frustrated because it wasn’t working, and then finally, after some time you realize that your solution method can’t possibly work.

Another way to see that noticing the right details is hard, is that different people end up noticing different details. My brother and I once built a set of stairs for the garage with my dad, and we ran into the problem of determining where to cut the long boards so they lie at the correct angle. After struggling with the problem for a while (and I do mean struggling, a 16’ long board is heavy), we got to arguing. I remembered from trig that we could figure out angle so I wanted to go dig up my textbook and think about it. My dad said, ‘no, no, no, let’s just trace it’, insisting that we could figure out how to do it.

I kept arguing because I thought I was right. I felt really annoyed with him and he was annoyed with me. In retrospect, I think I saw the fundamental difficulty in what we were doing and I don’t think he appreciated it (look at the stairs picture and see if you can figure it out), he just heard ‘let’s draw some diagrams and compute the angle’ and didn’t think that was the solution, and if he had appreciated the thing that I saw I think he would have been more open to drawing some diagrams. But at the same time, he also understood that diagrams and math don’t account for the shape of the wood, which I did not appreciate. If we had been able to get these points across, we could have come to consensus. Drawing a diagram was probably a good idea, but computing the angle was probably not. Instead we stayed annoyed at each other for the next 3 hours.

Before you’ve noticed important details they are, of course, basically invisible. It’s hard to put your attention on them because you don’t even know what you’re looking for. But after you see them they quickly become so integrated into your intuitive models of the world that they become essentially transparent. Do you remember the insights that were crucial in learning to ride a bike or drive? How about the details and insights you have that led you to be good at the things you’re good at?

This means it’s really easy to get stuck. Stuck in your current way of seeing and thinking about things. Frames are made out of the details that seem important to you. The important details you haven’t noticed are invisible to you, and the details you have noticed seem completely obvious and you see right through them. This all makes makes it difficult to imagine how you could be missing something important.

That’s why if you ask an anti-climate change person (or a climate scientist) “what could convince you you were wrong?” you’ll likely get back an answer like “if it turned out all the data on my side was faked” or some other extremely strong requirement for evidence rather than “I would start doubting if I noticed numerous important mistakes in the details my side’s data and my colleagues didn’t want to talk about it”. The second case is much more likely than the first, but you’ll never see it if you’re not paying close attention.

If you’re trying to do impossible things, this effect should chill you to your bones. It means you could be intellectually stuck right at this very moment, with the evidence right in front of your face and you just can’t see it.

This problem is not easy to fix, but it’s not impossible either. I’ve mostly fixed it for myself. The direction for improvement is clear: seek detail you would not normally notice about the world. When you go for a walk, notice the unexpected detail in a flower or what the seams in the road imply about how the road was built. When you talk to someone who is smart but just seems so wrong, figure out what details seem important to them and why. In your work, notice how that meeting actually wouldn’t have accomplished much if Sarah hadn’t pointed out that one thing. As you learn, notice which details actually change how you think.

If you wish to not get stuck, seek to perceive what you have not yet perceived.

Comments

Inconsistent Maximum Likelihood Estimation: An “Ordinary” Example

Pages — Sat, 09 Aug 2008 22:24:54 +0000

2008-08-09 at 6:24 pm

The widespread use of the Maximum Likelihood Estimate (MLE) is partly based on an intuition that the value of the model parameter that best explains the observed data must be the best estimate, and partly on the fact that for a wide class of models the MLE has good asymptotic properties. These properties include “consistency” — that as the amount of data increases, the estimate will, with higher and higher probability, become closer and closer to the true value — and, moreover, that the MLE converges as quickly to this true value as any other estimator. These asymptotic properties might be seen as validating the intuition that the MLE must be good, except that these good properties of the MLE do not hold for some models.

This is well known, but the common examples where the MLE is inconsistent aren’t too satisfying. Some involve models where the number of parameters increases with the number of data points, which I think is cheating, since these ought to be seen as “latent variables”, not parameters. Others involve singular probability densities, or cases where the MLE is at infinity or at the boundary of the parameter space. Normal (Gaussian) mixture models fall in this category — the likelihood becomes infinite as the variance of one of the mixture components goes to zero, while the mean is set to one of the data points. One might think that such examples are “pathological”, and do not really invalidate the intuition behind the MLE.

Here, I’ll present a simple “ordinary” model where the MLE is inconsistent. The probability density defined by this model is free of singularities (or any other pathologies), for any value of the parameter. The MLE is always well defined (apart from ties, which occur with probability zero), and the MLE is always in the interior of the parameter space. Moreover, the problem is one-dimensional, allowing easy visualization.

The data consists of i.i.d. real values x₁, x₂, …, x_n. The model has one positive real parameter, t. The distribution of a data point is an equal mixture of the standard normal and a normal distribution with mean t and standard deviation exp(-1/t^2):

Using this R program, I produced some plots of densities and likelihood functions derived from this model.

Looking at the top left plot, the probability density function when t=2.5, you can see two modes near 0 and t. As t decreases to 0.6, and then 0.2, the mode at t moves left and gets narrower. A narrower mode has higher probability density at its peak. When t=0.2, the peak density of the mode at t is about 10¹⁰ times higher than the peak density of the mode at 0, which is invisible at the scale of the plot (the scale is noted in the left-side caption).

The bottom plots are of likelihood functions given data generated from the model with t=0.6. I generated 100 points, and used the first 10, the first 30, and all 100 for the three plots. With 10 data points, the value that maximizes the likelihood (0.5916) is close to the true parameter value (0.6). But as the number of data points increases, the MLE moves away from the true value, getting closer and closer to zero. The value of the likelihood at the MLE also gets bigger, reaching about 0.3×10¹⁶² when 100 data points are used.

This plot shows the likelihood function with 30 data points with the vertical scale changed to show detail other than the peak. A local maximum of the likelihood around the true value of 0.6 can now be seen. It is completely dominated by the maximum at 0.0743, which is about a factor of 10⁵² higher.

Why does this happen? First, note that regardless of the true value of t, the density in the vicinity of zero will be non-singular, so the probability that a data point will land in the interval (0,c) will be proportional to c when c is small. In a data set of n points, we can therefore expect the smallest positive data point, call it x₀, to have a magnitude of about k/n, for some constant k. Now, consider the value of the log likelihood function at t=x₀ versus its value at the true value of t. When t=x₀, the log probability density for x₀ will be approximately

In comparison, the density of x₀ when t is not close to x₀ will be approximately half its density under the N(0,1) distribution, which will approach a constant as n increases (and x₀ goes to zero). The density under t=x₀ for data points other than x₀ will approach a constant as n increases and t goes to zero. On average, the true value of t (or values near the true value) will produce a higher value for the density of such points, but the difference in log densities will approach a constant. The end result is that the contribution to the log likelihood of data points other than x₀ will be greater for the true value of t than for t=x₀ by an amount that grows in proportion to n, while the contribution to the log likelihood of the point x₀ will be greater for t=x₀ than for the true value of t by an amount that grows in proportion to n². As n increases, the n² contribution will dominate, so the MLE will be close to zero rather than being close to the true value of the parameter. (Note, however, that the MLE will usually not be exactly equal to x₀, though when n is large it will usually be nearby. The argument above just shows that a value of t near x₀ will have higher likelihood than a value near the true value of t.)

So what does this say about the Maximum Likelihood intuition — that the best estimate is the parameter value that best explains the data? It illustrates the Bayesian critique of this intuition, which is that using the MLE ignores the volume of the parameter space where the data is fit well. When n is large, only a tiny region around x₀fits the data better than the true value of the parameter. In a posterior distribution (based on some prior), the smallness of this region will cancel the large height of the peak, with the result that the posterior probability that t is near x₀ will usually not be large.

Entry filed under: Statistics, Statistics - Technical.

Tips for New Researchers

Pages — Tue, 14 Feb 2023 16:23:17 +0000

September 2022

This guide is a compilation of tips I wish I knew when I was starting out doing computer science research as an undergraduate student at Stanford.

It's targeted towards high school / undergraduate / Master's students working with a research mentor (e.g. a postdoc / PhD student), but might be helpful for junior PhD students or new industry researchers as well.

Research will be hard at times!
It's very different from the classroom model of success, where there are concrete learning goals and you can aim for the 'A' at the end.
You're not going to understand stuff at first: it will feel overwhelming, the path won't be clear, and you'll question why what you're doing even matters
You will have to keep on pushing through—effort, grit, and resilience are key ingredients for developing as a researcher
Focus on your growth: gradually you'll understand more and more and be able to see farther and deeper into the research world around you

This is one of the most important skills in research, and part of persevering!
Asking for help shouldn't be viewed as a bad thing or "giving up"—it's a natural part of the learning process and will help you make progress
For example, it's typically not productive to doom-scroll Stack Overflow for 10 hours if you're really stuck
- Instead, you might send your mentor a message with a brief summary of the issue / question (or the full stack trace of an error) so they know you're stuck and can help you. You can also briefly share some different strategies you've tried.
- Often it may be something they've experienced too, and can resolve for you quickly!
Unsure whether you should reach out to your mentor? Ask them how long you should try to solve something on your own before reaching out for help
- For some mentors this might be 20 minutes, for others this might be a day or more. Ask them to help you calibrate.
- I personally tell my mentees to err on the side of pinging me too often at first (I'd rather spend a few seconds to unblock you, than learn you've been stuck for a week and unable to make progress)
If something is taking you a long time, and you think there might be a better way—ask! There might not be, but it never hurts to ask!
- Example 1: a student I worked with found using tmux to be very slow. They asked for help and found out about mouse support!
- Example 2: another student I worked with was waiting a long time for bash commands to finish before they could launch another one. They asked if there was a better way and learned about bash scripts and background processes!
Another place to ask is in Slack (or whatever platform you use)—some workspaces have a #debug channel or similar where you can ask questions about bugs, libraries, the cluster, best practices, etc

Of course, your mentor won't always be there to answer your questions
It's important to learn how to problem-solve for yourself. Some simple techniques:
- If you run into a bug, Google the top-level error in your stack trace
- Search up solutions on Slack, GitHub, Linux man pages, Stack Overflow, documentation, etc
- If you don't know what a function or feature does, open an interactive Python session or make a mini jupyter notebook or script to try it out!
Problem solving is a skill you get better at—observe your collaborators and mentors. Notice how they solve issues you aren't able to solve, or ask them!
You can sometimes ask to observe other people while they work to pick up strategies
- You can also do the reverse and have them watch and give you feedback
- Pair programming is really great for gaining insights large and small about people's processes!

Velocity just means how quickly you're able to iterate on tasks and experiments
This is crucial for making progress on a week-to-week basis, since it enables you to test out hypotheses and obtain results
It's very common to feel "slow" when you start doing research (or even for years after!), especially when comparing yourself to others. This is normal—don't sweat it. Your velocity will increase as you do more research
One useful goal is to have some deliverable, however small, for each meeting
Often times, velocity improves over time just due to familiarity with your tools, for example:
- Reusing a codebase or library for a new project saves so much time compared with having to write it from scratch yourself
- Once you know the bash command to do something (e.g., list the largest folders in your directory), you won't have to look it up again each time (or at the very least you'll know how to find the answer quickly)
- There's only a finite number of common errors or bugs in any problem domain: once you figure out how to solve something the first time, you can solve it a lot more quickly the next time. (I remember it always seemed like magic to me when my mentors knew the fixes for all the bugs I ran into, but it was just this at play!)
- See also: Learning Curves (Wikipedia)
You can also pay attention to your work habits and identify ways you can improve your velocity. Your mentor can help you identify strategies here! For example:
- Queueing up jobs with a bash script or Slurm as opposed to running them manually
- Searching your bash history for previous commands you've run instead of wandering through StackOverflow results you’ve already browsed before
- Tracking experiments using a tool like Weights and Biases or Neptune instead of by hand or by copy-pasting results into spreadsheets
- Learning how to search / organize your files effectively, vs navigating through tons of folders
It can sometimes be hard for your mentor to know if slow velocity is due to lack of available time or if you're finding it challenging to complete your tasks.
- Letting them know approximately how long you're spending on each task and which parts you're finding challenging can be a really effective way to help them help you
Keep in mind that your energy is just as important as your time. If a tool feels like it's introducing a lot of friction (or you start to dread using it), often that can be an indicator that it's time to learn how to use it better or possibly find an alternative.

When you've spent all week on a project, mentor meetings are a really good opportunity to take a step back and think about your project from a high-level perspective
- This is useful for making sure you're directing all of your effort in a productive way—research is about working smarter not just harder
- It also gives you practice in developing your "researcher brain:" How do your tasks fit into the project as a whole? How does your project fit into your broader field? (How does your field fit into the broader space of science / humanities / the arts / the world?)
Another important reason to zoom out is that your mentor isn't steeped in the details of your project like you are, so they might need a bit of a refresher to see the connection between what you did last week and the broader goals of the project—and help you plan accordingly

New researchers have a tendency to present their results verbally in meeting
- e.g. saying statistics out loud: "Algorithm 1 got 34.4% and Algorithm 2 got 34.5%"
It may take more time, but you'll get much much more out of the meetings with your mentor if you create a graph or table for you and your mentor to look at together
This makes it much easier to analyze, ideate, and come up with next steps. Time making effective visualizations is time well spent!

Even between meetings, it can be nice to ping your mentor / group slack channel with ideas, preliminary results, problems you're having, etc
This is a good way to stay on your mentor's radar and help them support you
What kinds of messages you send might depend on your personal style. Some ideas:
- "I just ran the new baseline we talked about last meeting, here's a graph of the results compared to the simpler baseline. The performance looks slightly better but they're pretty similar"
- "It looks like running the larger scale experiments might take a while—the cluster is really busy due to the upcoming deadline. My guess is it will take until Tuesday for our jobs to run. Do you think we should wait it out, or look into other compute options?"
- "I talked to Alba about our project, and she passed along this paper: . I added it to our related works doc with a few notes!"
- "Not urgent, but I was wondering: why do we normalize our images before passing them through the network?"
- "Hey, I've been curious to learn more about distributed training. Have any recommended papers / readings you like?"

Initially, your mentor will probably run your meetings by guiding the agenda and determining next steps
But as you advance as a researcher (generally over a few months), you'll start to be able to determine some next steps by yourself, and save your mentor the effort of running the meeting by themselves (and coming up with each concrete next step / experiment / idea)
- This is called taking initiative and "managing up"!
One thing you can do at any stage is have an agenda for each meeting
- Come to the meeting with questions you have + goals for what to accomplish during the meeting
There are also other ways of showing initiative:
- Writing a script to automate a common task
- Trying to analyze/interpret the results of experiments, vs just running them
- Reading and sharing a piece of related work with some notes about what it does and how it connects to the project

Level 1: New researchers often complete their task and just share the raw results (e.g. a graph or table)
- Ex: "I ran what we talked about and here are the results!"
Level 2: Slightly more experienced researchers might think about the results and give an interpretation
- Ex:. "it's nice to see that the new algorithm improves performance, but the baseline is actually a lot higher than we were expecting"
- Interpreting data may be hard at first, but will be more natural over time. Don't force it if no insights come to mind, but coming up with potential interpretations can be good practice.
Level 3: Even more experienced researchers might further propose some next steps to talk about with their mentor
- Ex: "I wonder if we should consider other tasks that better highlight the concept we're interested in. For example, I did some searching and found X, Y, and Z"
Level 4: Another step beyond that might be taking the initiative to run additional studies or experiments beyond what was discussed
- Ex: "I also ran our models on X, Y, and Z datasets and found that existing methods actually do quite poorly. Do you think it makes sense to run a larger scale study on these settings?"
- This gets a lot easier as your velocity improves
Level 5+: As you progress even further, you might be able to do this process at an even higher level of abstraction (first within the project and then across projects!)
- Ex: "These last few experiments show property A of our system really well. I wonder if we could explore property B, maybe with experiments Q, R, or S?"
- Ex: "Our last paper made point C really well. I'm wondering if we could push this even further by looking at setting D, using tools from field W?"
- Ex: "One thing that's really unsatisfying about approaches in field Q is that they don't really get at property P. Do you think we might be able to operationalize this in some way? Here's a sketch of one possible approach."
Note: your mentor won't always be right—sometimes they might suggest an idea that doesn’t work, or they might not have a great solution idea. This is normal, as research is exploratory, and you’re solving a new problem.

Lab meetings: Can be a great way to get exposure to a research environment
Talks / reading groups: Learn how research is conveyed + understood by your community
- When you feel comfortable, try to present a paper in a reading group!
Social events: get to know folks in the lab and become part of the community

Read a bit about them + their papers on the website
Reach out and mention something you found interesting, and that you'd love to chat more
DMs on slack often work well in my experience

Note: It's the mentor's responsibility to create an environment where these personal values are the norm—and they should be strong models of these as well!

Confidence / intellectual humility

Don't feel like you have to pretend you know everything coming in—your mentors know that you're learning (they are too!).
Saying "I'm not sure I understand, can you explain that a bit more?" is one of the fastest ways to grow and learn—and you should be saying it a lot, especially at the start of the project!
Also remember that you have knowledge and background that others don't—and the reverse will also be true. Embrace that and learn from (and share with!) your collaborators

Trust isn't something that's easy to regain on either side
Stay true to your word, and communicate openly with your mentors—they should do the same for you
If you're not going to be able to make a deadline or finish a task by the date you'd said, let them know as soon as possible so they can re-coordinate the rest of the team (or just their own expectations)
If you discover a bug which affects your previous results—let the team know. Their response should ideally be constructive, e.g. "Well this is a bummer, but I'm really glad you caught this! Let's think about how to proceed, and how we can prevent something like this in the future…"
You should expect the same honesty / integrity from your mentors and the other people you work with

Identity, boundaries and health

It should go without saying, but your mentors should create a space where you feel comfortable being yourself, and where you feel respected and supported in your identity so you can do your best work

Every lab has a different work culture, but no research project should push out everything else in your life or imperil your physical or emotional health
Having time for friends + family / rest / leisure is really important, and if you're feeling pressured to not have it, that's a problem

Being excited about the work and research is a great thing! This makes your collaborators feel good too.
Don't force it, especially if you're feeling confused or frustrated (all natural parts of the research process), but feel free to express authentic excitement when you feel it
Energy is infectious and can really make a big difference in a work environment

Thanks to Shreya Shankar, Jesse Michel, and Daniel Fein for providing helpful feedback on an earlier draft of this, and Nick Tomlin for useful conversations!

Nando de Freitas 🏳️‍🌈 @NandoDF

@ylecun — Sun, 12 Feb 2023 09:31:53 +0000

Funny @sirbayes Learning methods — supervised, RLHF, policy gradients, Dagger, self-training — can be seen as optimisation with the following gradient: grad = Expectation_x,y [ F(x,y) grad log p(y|x) ] Choices of F and how x and y are produced determine the learning type 1/n

Kevin Patrick Murphy @sirbayes

Great article on LLMs - MLE training is exactly lossy data compression! I know RLHF is changing the loss to maximize predicted human ratings, but it still feels like a bit like lipstick on a pig to me (although who doesnt like bbq now and then? ;) https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web …

Retweeted by Yann LeCun

ChatGPT Is a Blurry JPEG of the Web

www.newyorker.com
by Condé Nast

Illustration by Vivek Thakker

In 2013, workers at a German construction company noticed something odd about their Xerox photocopier: when they made a copy of the floor plan of a house, the copy differed from the original in a subtle but significant way. In the original floor plan, each of the house’s three rooms was accompanied by a rectangle specifying its area: the rooms were 14.13, 21.11, and 17.42 square metres, respectively. However, in the photocopy, all three rooms were labelled as being 14.13 square metres in size. The company contacted the computer scientist David Kriesel to investigate this seemingly inconceivable result. They needed a computer scientist because a modern Xerox photocopier doesn’t use the physical xerographic process popularized in the nineteen-sixties. Instead, it scans the document digitally, and then prints the resulting image file. Combine that with the fact that virtually every digital image file is compressed to save space, and a solution to the mystery begins to suggest itself.

Compressing a file requires two steps: first, the encoding, during which the file is converted into a more compact format, and then the decoding, whereby the process is reversed. If the restored file is identical to the original, then the compression process is described as lossless: no information has been discarded. By contrast, if the restored file is only an approximation of the original, the compression is described as lossy: some information has been discarded and is now unrecoverable. Lossless compression is what’s typically used for text files and computer programs, because those are domains in which even a single incorrect character has the potential to be disastrous. Lossy compression is often used for photos, audio, and video in situations in which absolute accuracy isn’t essential. Most of the time, we don’t notice if a picture, song, or movie isn’t perfectly reproduced. The loss in fidelity becomes more perceptible only as files are squeezed very tightly. In those cases, we notice what are known as compression artifacts: the fuzziness of the smallest JPEG and MPEG images, or the tinny sound of low-bit-rate MP3s.

Xerox photocopiers use a lossy compression format known as JBIG2, designed for use with black-and-white images. To save space, the copier identifies similar-looking regions in the image and stores a single copy for all of them; when the file is decompressed, it uses that copy repeatedly to reconstruct the image. It turned out that the photocopier had judged the labels specifying the area of the rooms to be similar enough that it needed to store only one of them—14.13—and it reused that one for all three rooms when printing the floor plan.

The fact that Xerox photocopiers use a lossy compression format instead of a lossless one isn’t, in itself, a problem. The problem is that the photocopiers were degrading the image in a subtle way, in which the compression artifacts weren’t immediately recognizable. If the photocopier simply produced blurry printouts, everyone would know that they weren’t accurate reproductions of the originals. What led to problems was the fact that the photocopier was producing numbers that were readable but incorrect; it made the copies seem accurate when they weren’t. (In 2014, Xerox released a patch to correct this issue.)

I think that this incident with the Xerox photocopier is worth bearing in mind today, as we consider OpenAI’s ChatGPT and other similar programs, which A.I. researchers call large-language models. The resemblance between a photocopier and a large-language model might not be immediately apparent—but consider the following scenario. Imagine that you’re about to lose your access to the Internet forever. In preparation, you plan to create a compressed copy of all the text on the Web, so that you can store it on a private server. Unfortunately, your private server has only one per cent of the space needed; you can’t use a lossless compression algorithm if you want everything to fit. Instead, you write a lossy algorithm that identifies statistical regularities in the text and stores them in a specialized file format. Because you have virtually unlimited computational power to throw at this task, your algorithm can identify extraordinarily nuanced statistical regularities, and this allows you to achieve the desired compression ratio of a hundred to one.

Now, losing your Internet access isn’t quite so terrible; you’ve got all the information on the Web stored on your server. The only catch is that, because the text has been so highly compressed, you can’t look for information by searching for an exact quote; you’ll never get an exact match, because the words aren’t what’s being stored. To solve this problem, you create an interface that accepts queries in the form of questions and responds with answers that convey the gist of what you have on your server.

What I’ve described sounds a lot like ChatGPT, or most any other large-language model. Think of ChatGPT as a blurry JPEG of all the text on the Web. It retains much of the information on the Web, in the same way that a JPEG retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable. You’re still looking at a blurry JPEG, but the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.

This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large-language models such as ChatGPT are all too prone. These hallucinations are compression artifacts, but—like the incorrect labels generated by the Xerox photocopier—they are plausible enough that identifying them requires comparing them against the originals, which in this case means either the Web or our own knowledge of the world. When we think about them this way, such hallucinations are anything but surprising; if a compression algorithm is designed to reconstruct text after ninety-nine per cent of the original has been discarded, we should expect that significant portions of what it generates will be entirely fabricated.

This analogy makes even more sense when we remember that a common technique used by lossy compression algorithms is interpolation—that is, estimating what’s missing by looking at what’s on either side of the gap. When an image program is displaying a photo and has to reconstruct a pixel that was lost during the compression process, it looks at the nearby pixels and calculates the average. This is what ChatGPT does when it’s prompted to describe, say, losing a sock in the dryer using the style of the Declaration of Independence: it is taking two points in “lexical space” and generating the text that would occupy the location between them. (“When in the Course of human events, it becomes necessary for one to separate his garments from their mates, in order to maintain the cleanliness and order thereof. . . .”) ChatGPT is so good at this form of interpolation that people find it entertaining: they’ve discovered a “blur” tool for paragraphs instead of photos, and are having a blast playing with it.

Given that large-language models like ChatGPT are often extolled as the cutting edge of artificial intelligence, it may sound dismissive—or at least deflating—to describe them as lossy text-compression algorithms. I do think that this perspective offers a useful corrective to the tendency to anthropomorphize large-language models, but there is another aspect to the compression analogy that is worth considering. Since 2006, an A.I. researcher named Marcus Hutter has offered a cash reward—known as the Prize for Compressing Human Knowledge, or the Hutter Prize—to anyone who can losslessly compress a specific one-gigabyte snapshot of Wikipedia smaller than the previous prize-winner did. You have probably encountered files compressed using the zip file format. The zip format reduces Hutter’s one-gigabyte file to about three hundred megabytes; the most recent prize-winner has managed to reduce it to a hundred and fifteen megabytes. This isn’t just an exercise in smooshing. Hutter believes that better text compression will be instrumental in the creation of human-level artificial intelligence, in part because the greatest degree of compression can be achieved by understanding the text.

To grasp the proposed relationship between compression and understanding, imagine that you have a text file containing a million examples of addition, subtraction, multiplication, and division. Although any compression algorithm could reduce the size of this file, the way to achieve the greatest compression ratio would probably be to derive the principles of arithmetic and then write the code for a calculator program. Using a calculator, you could perfectly reconstruct not just the million examples in the file but any other example of arithmetic that you might encounter in the future. The same logic applies to the problem of compressing a slice of Wikipedia. If a compression program knows that force equals mass times acceleration, it can discard a lot of words when compressing the pages about physics because it will be able to reconstruct them. Likewise, the more the program knows about supply and demand, the more words it can discard when compressing the pages about economics, and so forth.

Large-language models identify statistical regularities in text. Any analysis of the text of the Web will reveal that phrases like “supply is low” often appear in close proximity to phrases like “prices rise.” A chatbot that incorporates this correlation might, when asked a question about the effect of supply shortages, respond with an answer about prices increasing. If a large-language model has compiled a vast number of correlations between economic terms—so many that it can offer plausible responses to a wide variety of questions—should we say that it actually understands economic theory? Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression. But is it possible that their lossy compression nonetheless indicates real understanding of the sort that A.I. researchers are interested in?

Let’s go back to the example of arithmetic. If you ask GPT-3 (the large-language model that ChatGPT was built from) to add or subtract a pair of numbers, it almost always responds with the correct answer when the numbers have only two digits. But its accuracy worsens significantly with larger numbers, falling to ten per cent when the numbers have five digits. Most of the correct answers that GPT-3 gives are not found on the Web—there aren’t many Web pages that contain the text “245 + 821,” for example—so it’s not engaged in simple memorization. But, despite ingesting a vast amount of information, it hasn’t been able to derive the principles of arithmetic, either. A close examination of GPT-3’s incorrect answers suggests that it doesn’t carry the “1” when performing arithmetic. The Web certainly contains explanations of carrying the “1,” but GPT-3 isn’t able to incorporate those explanations. GPT-3’s statistical analysis of examples of arithmetic enables it to produce a superficial approximation of the real thing, but no more than that.

Given GPT-3’s failure at a subject taught in elementary school, how can we explain the fact that it sometimes appears to perform well at writing college-level essays? Even though large-language models often hallucinate, when they’re lucid they sound like they actually understand subjects like economic theory. Perhaps arithmetic is a special case, one for which large-language models are poorly suited. Is it possible that, in areas outside addition and subtraction, statistical regularities in text actually do correspond to genuine knowledge of the real world?

I think there’s a simpler explanation. Imagine what it would look like if ChatGPT were a lossless algorithm. If that were the case, it would always answer questions by providing a verbatim quote from a relevant Web page. We would probably regard the software as only a slight improvement over a conventional search engine, and be less impressed by it. The fact that ChatGPT rephrases material from the Web instead of quoting it word for word makes it seem like a student expressing ideas in her own words, rather than simply regurgitating what she’s read; it creates the illusion that ChatGPT understands the material. In human students, rote memorization isn’t an indicator of genuine learning, so ChatGPT’s inability to produce exact quotes from Web pages is precisely what makes us think that it has learned something. When we’re dealing with sequences of words, lossy compression looks smarter than lossless compression.

A lot of uses have been proposed for large-language models. Thinking about them as blurry JPEGs offers a way to evaluate what they might or might not be well suited for. Let’s consider a few scenarios.

Can large-language models take the place of traditional search engines? For us to have confidence in them, we would need to know that they haven’t been fed propaganda and conspiracy theories—we’d need to know that the JPEG is capturing the right sections of the Web. But, even if a large-language model includes only the information we want, there’s still the matter of blurriness. There’s a type of blurriness that is acceptable, which is the re-stating of information in different words. Then there’s the blurriness of outright fabrication, which we consider unacceptable when we’re looking for facts. It’s not clear that it’s technically possible to retain the acceptable kind of blurriness while eliminating the unacceptable kind, but I expect that we’ll find out in the near future.

Even if it is possible to restrict large-language models from engaging in fabrication, should we use them to generate Web content? This would make sense only if our goal is to repackage information that’s already available on the Web. Some companies exist to do just that—we usually call them content mills. Perhaps the blurriness of large-language models will be useful to them, as a way of avoiding copyright infringement. Generally speaking, though, I’d say that anything that’s good for content mills is not good for people searching for information. The rise of this type of repackaging is what makes it harder for us to find what we’re looking for online right now; the more that text generated by large-language models gets published on the Web, the more the Web becomes a blurrier version of itself.

There is very little information available about OpenAI’s forthcoming successor to ChatGPT, GPT-4. But I’m going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large-language model. If this turns out to be the case, it will serve as unintentional confirmation that the analogy between large-language models and lossy compression is useful. Repeatedly resaving a JPEG creates more compression artifacts, because more information is lost every time. It’s the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.

Indeed, a useful criterion for gauging a large-language model’s quality might be the willingness of a company to use the text that it generates as training material for a new model. If the output of ChatGPT isn’t good enough for GPT-4, we might take that as an indicator that it’s not good enough for us, either. Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text. (I suspect that such an outcome would require a major breakthrough in the techniques used to build these models.) If and when we start seeing models producing output that’s as good as their input, then the analogy of lossy compression will no longer be applicable.

Can large-language models help humans with the creation of original writing? To answer that, we need to be specific about what we mean by that question. There is a genre of art known as Xerox art, or photocopy art, in which artists use the distinctive properties of photocopiers as creative tools. Something along those lines is surely possible with the photocopier that is ChatGPT, so, in that sense, the answer is yes. But I don’t think that anyone would claim that photocopiers have become an essential tool in the creation of art; the vast majority of artists don’t use them in their creative process, and no one argues that they’re putting themselves at a disadvantage with that choice.

So let’s assume that we’re not talking about a new genre of writing that’s analogous to Xerox art. Given that stipulation, can the text generated by large-language models be a useful starting point for writers to build off when writing something original, whether it’s fiction or nonfiction? Will letting a large-language model handle the boilerplate allow writers to focus their attention on the really creative parts?

Obviously, no one can speak for all writers, but let me make the argument that starting with a blurry copy of unoriginal work isn’t a good way to create original work. If you’re a writer, you will write a lot of unoriginal work before you write something original. And the time and effort expended on that unoriginal work isn’t wasted; on the contrary, I would suggest that it is precisely what enables you to eventually create something original. The hours spent choosing the right word and rearranging sentences to better follow one another are what teach you how meaning is conveyed by prose. Having students write essays isn’t merely a way to test their grasp of the material; it gives them experience in articulating their thoughts. If students never have to write essays that we have all read before, they will never gain the skills needed to write something that we have never read.

And it’s not the case that, once you have ceased to be a student, you can safely use the template that a large-language model provides. The struggle to express your thoughts doesn’t disappear once you graduate—it can take place every time you start drafting a new piece. Sometimes it’s only in the process of writing that you discover your original ideas. Some might say that the output of large-language models doesn’t look all that different from a human writer’s first draft, but, again, I think this is a superficial resemblance. Your first draft isn’t an unoriginal idea expressed clearly; it’s an original idea expressed poorly, and it is accompanied by your amorphous dissatisfaction, your awareness of the distance between what it says and what you want it to say. That’s what directs you during rewriting, and that’s one of the things lacking when you start with text generated by an A.I.

There’s nothing magical or mystical about writing, but it involves more than placing an existing document on an unreliable photocopier and pressing the Print button. It’s possible that, in the future, we will build an A.I. that is capable of writing good prose based on nothing but its own experience of the world. The day we achieve that will be momentous indeed—but that day lies far beyond our prediction horizon. In the meantime, it’s reasonable to ask, What use is there in having something that rephrases the Web? If we were losing our access to the Internet forever and had to store a copy on a private server with limited space, a large-language model like ChatGPT might be a good solution, assuming that it could be kept from fabricating. But we aren’t losing our access to the Internet. So just how much use is a blurry JPEG, when you still have the original? ♦

Fei-Fei Li's advice on Good Research and Good Papers

Pages — Wed, 25 Mar 2015 00:00:00 +0000

Fei-Fei Li, Director of the Stanford Artificial Intelligence Lab and the Stanford Vision Lab, provided some advice on writing academic papers in the year of 2009. I’ve read this many times since several years ago. Every time I read, the advice makes me introspect. Now, I’ve published papers on computer vision myself, and the advice seems even more important for me.

Here is the Fei-Fei’s advice.

De-mystifying Good Research and Good Papers

By Fei-Fei Li, 2009.03.01

Please remember this:

1000+ computer vision papers get published every year!

Only 5-10 are worth reading and remembering!

Since many of you are writing your papers now, I thought that I’d share these thoughts with you. I probably have said all these at various points during our group and individual meetings. But as I continue my AC reviews these days (that’s 70 papers and 200+ reviews — between me and my AC partner), these following points just keep coming up. Not enough people conduct first class research. And not enough people write good papers.

Every research project and every paper should be conducted and written with one singular purpose: to genuinely advance the field of computer vision. So when you conceptualize and carry out your work, you need to be constantly asking yourself this question in the most critical way you could – “Would my work define or reshape xxx (problem, field, technique) in the future?” This means publishing papers is NOT about “this has not been published or written before, let me do it”, nor is it about “let me find an arcane little problem that can get me an easy poster”. It’s about “if I do this, I could offer a better solution to this important problem,” or “if I do this, I could add a genuinely new and important piece of knowledge to the field.” You should always conduct research with the goal that it could be directly used by many people (or industry). In other words, your research topic should have many ‘customers’, and your solution would be the one they want to use.

A good research project is not about the past (i.e. obtaining a higher performance than the previous N papers). It’s about the future (i.e. inspiring N future papers to follow and cite you, N->$\inf$).

A CVPR’09 submission with a Caltech101 performance of 95% received 444 (3 weakly rejects) this year, and will be rejected. This is by far the highest performance I’ve seen for Caltech101. So why is this paper rejected? Because it doesn’t teach us anything, and no one will likely be using it for anything. It uses a known technique (at least for many people already) with super tweaked parameters custom-made for the dataset that is no longer a good reflection of real-world image data. It uses a BoW representation without object level understanding. All reviewers (from very different angles) asked the same question “what do we learn from your method?” And the only sensible answer I could come up with is that Caltech101 is no longer a good dataset.

Einstein used to say: everything should be made as simple as possible, but not simpler. Your method/algorithm should be the most simple, coherent and principled one you could think of for solving this problem. Computer vision research, like many other areas of engineering and science research, is about problems, not equations. No one appreciates a complicated graphical model with super fancy inference techniques that essentially achieves the same result as a simple SVM - unless it offers deeper understanding of your data that no other simpler methods could offer. A method in which you have to manually tune many parameters is not considered principled or coherent.

This might sound corny, but it is true. You’re PhD students in one of the best universities in the world. This means you embody the highest level of intellectualism of humanity today. This means you are NOT a technician and you are NOT a coding monkey. When you write your paper, you communicate and . That’s what a paper is about. This is how you should approach your writing. You need to feel proud of your paper not just for the day or week it is finished, but many for many years to come.

Set a high goal for yourself – the truth is, you can achieve it as long as you put your heart in it! When you think of your paper, ask yourself this question: Is this going to be among the 10 papers of 2009 that people will remember in computer vision? If not, why not? The truth is only 10+/-epsilon gets remembered every year. Most of the papers are just meaningless publication games. A long string of mediocre papers on your resume can at best get you a Google software engineer job (if at all – 2009.03 update: no, Google doesn’t hire PhD for this anymore). A couple of seminal papers can get you a faculty job in a top university. This is the truth that most graduate students don’t know, or don’t have a chance to know.

Review process is highly random. But there is one golden rule that withstands the test of time and randomness — badly written papers get bad reviews. Period. It doesn’t matter if the idea is good, result is good, citations are good. Not at all. Writing is critical — and this is ironic because engineers are the worst trained writers among all disciplines in a university. You need to discipline yourself: leave time for writing, think deeply about writing, and write it over and over again till it’s as polished as you can think of.

Last but not the least, please remember this rule: important problem (inspiring idea) + solid and novel theory + convincing and analytical experiments + good writing = seminal research + excellent paper. If any of these ingredients is weak, your paper, hence reviewer scores, would suffer.

The SSL Interplay: Augmentations, Inductive Bias, and Generalization

arXiv Query: search_query=au:"LeCun Y"&id_list=&start=0&max_results=100 — Mon, 06 Feb 2023 13:42:14 +0000

Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.

Non-convex quadratic optimization problems

Machine Learning Research Blog — Thu, 02 Feb 2023 14:46:35 +0000

Among continuous optimization problems, convex problems (with convex objectives and convex constraints) define a class that can be solved efficiently with a variety of algorithms and with arbitrary precision. This is not true more generally when the convexity assumption is removed (see this post). This of course does not mean that (1) nobody should attempt to solve high-dimensional non-convex problems (in fact, the spell checker run on this document was trained solving such a problem…), and that (2) no other problems have efficient solutions.

In this post, we will look at a classical class of continuous optimization problems that can be solved efficiently, namely quadratic optimization problems on the Euclidean sphere or ball. That is, we look at solving $$\tag{1} \min_{ \| x\| \leqslant 1} \ \frac{1}{2} x^\top A x \ – b^\top x, $$ and $$\tag{2} \min_{ \| x\| = 1} \ \frac{1}{2} x^\top A x \ – b^\top x, $$ for $\|x\|^2 = x^\top x$ the standard squared Euclidean norm.

The matrix $A \in \mathbb{R}^{n \times n}$ is assumed only symmetric (no need to be positive semi-definite), and $b \in \mathbb{R}^n$. Therefore, the objective may not be convex, and the constraint set (in the case of the sphere), is not convex either. We could replace the standard squared Euclidean norm $x^\top x$ by any Mahanalobis squared norm $x^\top B x$ for $B$ positive-definite, but to keep it simple, let’s only consider $B = I$.

Note that there are other continuous non-convex optimization problems that can be solved efficiently through a convex reformulation, such as the minimization of one-dimensional (trigonometric) polynomials, and more generally sum-of-squares problems (see this post). If you are aware of many more beyond combinatorial optimization problems, please let me know.

Special case of eigenvalues. If $b=0$ (no linear term), then the solution of Problem $(2)$ is the eigenvector associated with the smallest eigenvalue of $A$, while the solution of Problem $(1)$ is the same eigenvector if the smallest eigenvalue of $A$ is negative, and zero otherwise. This has a large number of applications, is the topic of tons of “spectral relaxation” works, and will not be the focus of this post.

Applications

Problems $(1)$ and $(2)$ appear in several areas and first appeared in [2] in 1965 (if you know of an earlier reference, please let me know). The three main occurrences that I am aware of are in trust-region methods, constrained eigenvalue problems, and relaxation of binary optimization problems.

Trust-region methods. When minimizing a differentiable function $f$ on $\mathbb{R}^n$, the classical gradient descent iteration $$x^+ = x \, – \gamma f(x)$$ can be seen as the solution of $$\min_{y \in \mathbb{R}^n} \ f(x) + f'(x)^\top ( y\, – x ) \mbox{ such that } \| y\, – x\| \leqslant \delta,$$ for $\delta = \gamma \| f'(x)\|_2$. This corresponds to minimizing the first-order Taylor expansion in a ball centered at $x$, and leads to the minimization of an affine function on an Euclidean ball. When using the second order model, we get to solve $$\min_{y \in \mathbb{R}^n} \ f(x) + f'(x)^\top ( y \, – x ) + \frac{1}{2} ( y\, -x )^\top f^{\prime \prime}(x) ( y \, -x ) \mbox{ such that } \| y \, – x\| \leqslant \delta, $$ which can be cast as Problem $(1)$.

The intuitive idea is that the Taylor expansion is only local, so we optimize it locally, instead of globally, like the classical Newton method would do. Moreover, it is well-defined even for singular Hessians. See figure below and [4] for more details.

We consider two quadratic functions and we compare the minimization of their linear and quadratic Taylor expansions under a ball constraint with increasing radius. Left: convex function (hence the quadratic minimization ends up reaching the global minimum). Right: function with a saddle. Below, only the quadratic minimization path is shown for several starting points.

Constrained eigenvalue problems. If we aim to minimize $x^\top A x$ subject to $x^\top x = 1$ and an affine constraint [3], then, by writing the affine constraint as $x = Cz+d$, we obtain the minimization of a quadratic-linear function subject to a quadratic-linear constraint, which we can rewrite in a form similar to Problem $(2)$.

Relaxation of binary optimization problems. When minimizing a linear-quadratic function on $\{-1,1\}^n$, we can relax it by replacing the constraint $x \in \{-1,1\}^n$ by $x^\top x = n$.

From local to global optimality conditions

Let’s now look at optimality conditions from first principles (see [5, 6] for more details), before relating them to a broader discussion on tight semi-definite relaxations.

Existence of minimizers. Minimizers always exists for $f(x) = \frac{1}{2} x^\top A x \ – b^\top x$ since the two sets $\mathbb{S} = \{ x \in \mathbb{R}^n, x^\top x = 1\}$ and $\mathbb{B} = \{ x \in \mathbb{R}^n, x^\top x \leqslant 1\}$ are compact. Therefore, the problems are well-formulated.

First-order necessary conditions on the sphere. We consider an optimal $x \in \mathbb{S}$. For any $y \in \mathbb{S}$ which is orthogonal to $x$, and any $\theta \in \mathbb{R}$, we have: $$f( \cos \theta \cdot x + \sin \theta \cdot y) = f(x) + f'(x)^\top y \cdot \theta + o(\theta).$$ Thus, since $\cos \theta \cdot x + \sin \theta \cdot y$ is always on $\mathbb{S}$, we must have $f'(x)^\top y=0$, and this holds for all $y$ orthogonal to $x$. Thus $f'(x)$ has to be proportional to $x$, that is, there exists $\mu \in \mathbb{R}$ such that $f'(x) + \mu x = 0$, that is, $(A + \mu I) x = b$.

First-order necessary conditions on the ball. If $x \in \mathbb{B}$ is optimal and in the interior, that is, $x^\top x < 1$, then we directly have $f'(x) = 0$. If $x \in \mathbb{S}$, it has to be optimal for the sphere, and thus there exists $\mu \in \mathbb{R}$ such that $f'(x) + \mu x = 0$. By considering that $g: t \mapsto f(t x)$ has to be minimized on $[0,1]$, for $t=1$, we must have $g'(1) \leqslant 0$, i.e., $\mu = \, – f'(x) ^\top x \geqslant 0$.

In order to cover the interior case, we need to add the “complementary slackness” condition $\mu ( 1 -x^\top x)=0$.

Obtaining necessary conditions from Lagrange duality. We can obtain the same first-order optimality conditions using Lagrange duality, by adding a Lagrange multiplier $\mu \in \mathbb{R}$ for the equality constraint $x^\top x = 1$, or $\mu \in \mathbb{R}_+$ for the inequality constraint $x^\top x \leqslant 1$, and forming the Lagrangian $$\tag{3} \mathcal{L}(x,\mu) = \frac{1}{2} x^\top A x\, – b^\top x + \frac{1}{2} \mu ( x^\top x\, – 1).$$ A necessary condition is thus that the partial derivative with respect to $x$ is zero for a certain $\mu$, which is exactly the condition $f'(x) + \mu x = 0$ above.

Second-order conditions on the sphere. Assuming that $f'(x) + \mu x = 0$, with $\mu$ potentially negative (i.e., the first-order optimality conditions are satisfied), we then have, for any $y \in \mathbb{S}$, $$\begin{array}{rcl}f(y) & = & f(x) + f'(x)^\top(y-x) + \frac{1}{2}(x-y)^\top A ( x-y) \\ & = & f(x) + \frac{1}{2}(x-y)^\top ( A + \mu I) ( x-y) + \frac{\mu}{2} ( x^\top x – y^\top y). \end{array}$$ Thus, if $x$ is optimal, we must have $(x-y)^\top ( A + \mu I) ( x-y) \geqslant 0$ for all $y \in \mathbb{S}$, which implies that $A+ \mu I \succcurlyeq 0$. Note that our reasoning implies that the optimality condition, that is, existence of $\mu \in \mathbb{R}$ such that $$\begin{array}{l} ( A+ \mu I) x = b \\ A+ \mu I \succcurlyeq 0 \\ x^\top x = 1 , \end{array} $$ is necessary and sufficient for the optimality of $x$. The sufficiency can also be obtained through Karush-Kuhn-Tucker (KKT) conditions, which apply regardless of convexity. This is one of few problems where strong duality holds for a non-convex optimization problems.

Second-order necessary condition on the ball. We also get the following necessary and sufficient condition, that is, the existence of $\mu \in \mathbb{R}_+$ such that $$\begin{array}{l} ( A+ \mu I) x = b \\ A+ \mu I \succcurlyeq 0 \\ x^\top x \leqslant 1 \\ \mu \, ( 1 \, – \, x^\top x) = 0. \end{array}$$

In both cases, once $\mu$ is known, we can recover the optimizers $x$. We now focus on the sphere for simplicity.

Equivalence to a one-dimensional problem

We can define the function $(M,u) \mapsto u^\top M^{-1} u$ as the minimal $t \in \mathbb{R}$ such that the matrix $\bigg( \begin{array}{cc} \!M\!\! & \!u\! \\[-.1cm] \!u^\top \!\! & \! t \! \end{array} \bigg) \succcurlyeq 0$. It is thus jointly convex in $(M,u)$, is infinite when $M$ is not positive-semidefinite (PSD). When $M$ is PSD but not invertible, the function is finite if and only if $u$ is in the column space of $M$. We can define similarly $u^\top M^{-2} u$.

We can now get the dual problem associated to the Lagrangian in $(3)$, by minimizing it with respect to $x$, leading to $$\max_{\mu \in \mathbb{R}} \ – \frac{\mu}{2} \, – \frac{1}{2} b^\top ( A+\mu I)^{-1} b, $$ which is a concave maximization problem in one dimension (with the constraint that $A + \mu I \succcurlyeq 0$).

Thus, a simple algorithm for solving the problem is to solve this one-dimensional concave maximization problem. Once an eigenvalue decomposition $A = \sum_{i=1}^n \! \lambda_i u_i u_i^\top$ has been obtained, we need to maximize $$ \tag{4} – \frac{\mu}{2} \, – \frac{1}{2} \sum_{i=1}^n \frac{ (b^\top u_i)^2}{\lambda_i + \mu}. $$

Assuming that $\lambda_1 \geqslant \lambda_2 \geqslant \cdots \geqslant \lambda_n$, we have the constraint $\lambda_n + \mu \geqslant 0$. We first need to check if $\mu = \, – \lambda_n$ is the solution, which occurs when $b^\top ( A+ \mu I)^{-2} b \leqslant 1$ (the problem is then called “degenerate”, and this can only happen if $b$ in the eigensubspace of $A$ associated with eigenvalue $– \lambda_n$, which is rather uncommon). Otherwise, the maximum is attained at $\mu > -\lambda_n$ (note that since we have assumed $b \neq 0$, the problem is strictly concave and thus has a unique maximizer in $\mu$). Moreover, $\mu$ is characterized by the equation $$ \tag{5} b^\top ( A+ \mu I)^{-2} b = 1,$$ which can be obtained directly from the optimality conditions.

This one-dimensional problem can be solved using Newton’s method [6, 7] to estimate $\mu$ given the eigendecomposition of $A$. There are also cheaper less precise algorithms that do not require a full eigendecomposition. We will also see below a surprising reformulation as a simple eigenvalue problem.

Other “secular” equations. Equation $(5)$ is often referred to a secular equation. There are other types of similar equations, in particular for rank-one perturbations of the symmetric eigenvalue problem [8, 9].

Semi-definite relaxations

We can now give a more modern take on the quadratic maximization problem on the sphere, using semi-definite programming. We can first rewrite the objective function in Equation $(1)$ as $$f(x) = \frac{1}{2}x^\top A x \, – b^\top x = \frac{1}{2} {\rm tr}(AX) \, – b^\top x, $$ with $X = xx^\top$. We now have a linear objective in $(X,x)$. Moreover, the matrix $X$ satisfies the convex constraints $$ X \succcurlyeq xx^\top \Leftrightarrow \left( \begin{array}{cc} \!X\!\! & \!x\! \\[-.1cm] \!x^\top \!\!&\! 1\! \end{array} \right) \succcurlyeq 0, $$ and ${\rm tr}(X) = x^\top x = 1$. However the rank-one constraint is not convex.

A classical tool in optimization is to remove the rank-one constraint, and only obtain a lower bound (a so-called “relaxation”), with the following optimization problem: $$\tag{6} \min_{ X, x} \frac{1}{2} {\rm tr}(AX)-b^\top x \mbox{ such that } \left( \begin{array}{cc} \!X\!\! & \!x\! \\[-.1cm] \!x^\top \!\!&\! 1\! \end{array} \right) \succcurlyeq 0 \mbox{ and } {\rm tr}(X)=1. $$ One can check that the dual problem is exactly Equation $(4)$, and thus the relaxation is here tight. Moreover, the SDP formulation can be used to derive algorithms that do not need a full eigenvalue decomposition [12].

Semi-definite relaxation of QCQP’s. Problems $(1)$ and $(2)$ are in fact instances of quadratically constrained quadratic programming problems, and the problem $(6)$ is the usual semi-definite relaxation. It turns out that with a single constraint, such relaxations are always tight, owing to the S-lemma [10] (see a nice derivation in Boyd and Vandenberghe’s book [11, Appendix B.1]).

Tight sum-of-squares relaxation. Yet another reformulation is through sum-of-squares (see an earlier post), where we consider the feature vector $\varphi(x) = {x \choose 1}$ and represent non-negative functions as quadratic forms $x \mapsto \varphi(x)^\top B \varphi(x)$. The problem in $(2)$ can then be relaxed as $$\max_{c \in \mathbb{R}, \ B \succcurlyeq 0} c \ \mbox{ such that } \ f(x) \, – c = \varphi(x)^\top B \varphi(x),$$ which is exactly the tight SDP relaxation above.

Having fun with adding affine constraints. Recently I had to look at Problem $(2)$ with an extra affine constraint, which I will take for simplicity of the form $c^\top x = 1$ (for a vector $c \in \mathbb{R}^n$ such that $\|c\| > 1$ to avoid a trivial problem). By projecting on the subspace orthogonal to $c \in \mathbb{R}^n$, we obtain again a quadratic minimization problem, this time on a Euclidean sphere embedded in a space of dimension $n-1$. Therefore, we can apply the above techniques on the reduced problem. However, I did not want to do that and wanted keep the original formulation on $\mathbb{R}^n$, and then tried to use duality to solve it. Two natural possibilities emerge here.

In order to solve it, we could first imagine using Lagrange duality, with a Lagrange multiplier $\mu$ for the constraint $x^\top x = 1$ (this worked exactly without the extra affine constraint), and now an extra Lagrange multiplier $\nu$ for the constraint $c^\top x = 1$. This leads to the Lagrangian $$ \mathcal{L}(x,\mu,\nu) = \frac{1}{2} x^\top A x\, – b^\top x + \frac{1}{2} \mu ( x^\top x\, – 1) + \nu( c^\top x -1),$$ and thus, after a short calculation, the dual problem $$\max_{\mu,\nu \in \mathbb{R}} \ -\frac{\mu}{2} \, – \nu \, – \frac{1}{2} (b \, – \nu c)^\top ( A + \mu I)^{-1} (b \, – \nu c).$$ Another Lagrangian can be obtained with the equivalent constraint $(c^\top x\, – 1)^2 = 0$, leading to a new Lagrangian $$ \mathcal{L}'(x,\mu,\nu’) = \frac{1}{2} x^\top A x\, -b^\top x + \frac{1}{2} \mu ( x^\top x\, – 1) + \frac{1}{2} \nu’ (c^\top x \, – 1)^2,$$ and then the dual problem $$\max_{\mu,\nu’ \in \mathbb{R}} \ -\frac{\mu}{2} \, -\frac{\nu’}{2} \, – \frac{1}{2} (b+\nu’ c)^\top ( A + \nu’ cc^\top + \mu I)^{-1} (b+\nu’ c).$$ Are they both tight? Make up your own mind and see the bottom of the post for the answer.

Amazing eigenvalue reformulations

The Newton method to solve the one-dimensional problem is efficient, but requires some safeguards to work properly, and a full eigenvalue decomposition. It turns out that one can obtain exact reformulations as eigenvalue problems for a single eigenvalue, for which efficient packages exist.

From [3], for the optimization on the sphere, we can obtain the optimal $\mu$ from the largest real eigenvalue of the following non symmetric matrix: $$\left( \begin{array}{cc} \!-A\! & \!\! I\! \\[-.1cm] \! bb^\top \!& \!\! -A\! \end{array} \right).$$ Indeed, one can check that, in the non-degenerate case, given the optimal $(x,\mu)$, then $y = \left( \begin{array}{c} \!(A+\mu I)^{-1} x \! \\[-.1cm] x \end{array} \right)$ is an eigenvector of the $2n \times 2n$ matrix above, with eigenvalue $\mu$.

This leads to two lines of code to solve the problem, at least for the non-degenerate case! See more details in [3, 14], in particular, to deal with the degenerate case, often called the “hard case”. See the code snippets in Matlab, Julia, and Python.

Matlab	`[y,mu] = eigs([-A, eye(n); bb', -A],1,'largestreal'); x = y(n+1:2n) / (b'y(1:n));` or `x = sign(b'y(1:n)) * y(n+1:2n) / norm(y(n+1:2n))`;
Julia	`E = eigs([-A I(n) ; bb' -A ], nev=1 , which=:LR ) y, μ = E[2][:, 1], E[1][1] x = y[n+1:2n] ./ (b' y[1:n])`or `x = sign.(b' * y[1:n]) .* y[n+1:2n] / norm(y[n+1:2*n])`
Python	`M = np.block([[-A, np.eye(n)], [np.outer(b,b), -A]]) mu, y = scipy.sparse.linalg.eigs(M, k=1, which='LR', return_eigenvectors=True) x = y[n:2n]/(np.dot(b,y[:n]))` or `x = np.sign(np.dot(b,y[:n]))y[n:2n]/np.linalg.norm(y[n:2n])`

Symmetric generalized eigenproblems. If you prefer symmetric matrices, one can obtain a similar result with the generalized eigenvector of the two matrices $$\left( \begin{array}{cc} \!I\!\! & \!\!-A\! \\ \!-A\!\! & \! bb^\top\! \end{array} \right) \ \mbox{ and } \ \left( \begin{array}{cc} \! 0 \! & \! \! I \\[-.1cm] I \!\! & \!\! 0 \end{array} \right).$$ If you want to avoid forming a potentially dense matrix $bb^\top$, you and use instead the matrices $$\left( \begin{array}{ccc} \!-1\! & \! 0\! & \! b^\top \! \\[-1.cm] \!0\! &\! I \!&\! -A \! \\[-.1cm]\! b \!&\! -A \!& \!0\! \end{array} \right) \ \mbox{ and } \ \left( \begin{array}{ccc} \! 0\! &\! 0 \!& \! 0\! \\[-.1cm]\! 0 \! & \! 0 \!&\! I \! \\[-.1cm] \! 0 \! &\! I \! &\! 0 \! \end{array} \right).$$ See all details in [14]. Note that beyond the two-line code above that lead to precise solutions, more efficient algorithms exist that lead to approximate solutions [14, 15].

Conclusion

In this blog post, I described one of the few non-convex problems where strong duality holds. There are many other instances within combinatorial optimization (that is, with variables in $\{0,1\}^n$ or $\{-1,1\}^n$), in particular related to submodularity. I will hopefully cover these in future posts.

Acknowledgements. I would like to thank Alessandro Rudi, Gaspard Beugnot, and ChatGPT for helping with the code snippets.

References

[2] George E. Forsythe, and Gene H. Golub. On the stationary values of a second-degree polynomial on the unit sphere. Journal of the Society for Industrial and Applied Mathematics, 13(4): 1050-1068, 1965.[3] Walter Gander, Gene H. Golub, and Urs Von Matt. A constrained eigenvalue problem. Linear Algebra and its applications 114: 815-839, 1989.
[4] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust region methods. Society for Industrial and Applied Mathematics, 2000.
[5] William W. Hager. Minimizing a quadratic over a sphere. SIAM Journal on Optimization, 12(1):188-208, 2001.
[6] Danny C. Sorensen. Newton’s method with a model trust region modification. SIAM Journal on Numerical Analysis, 19(2):409-426, 1982.
[7] Danny C. Sorensen. Minimization of a large-scale quadratic function subject to a spherical constraint. SIAM Journal on Optimization, 7(1):141-161, 1997.
[8] James R. Bunch, Christopher P. Nielsen, Danny C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31-48, 1978.
[9] Ming Gu, Stanley C. Eisenstat. A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM Journal on Matrix Analysis and Applications ,15(4):1266-1276, 1994.
[10] Imre Pólik, Tamás Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371-418, 2007.
[11] Stephen P. Boyd, Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[12] Franz Rendl, Henry Wolkowicz. A semidefinite framework for trust region subproblems with applications to large scale minimization. Mathematical Programming, 77(1):273-299, 1997.
[13] Satoru Adachi, Satoru Iwata, Yuji Nakatsukasa, Akiko Takeda. Solving the trust-region subproblem by a generalized eigenvalue problem. SIAM Journal on Optimization, 27(1):269-291, 2017.
[14] Elad Hazan, Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, 158(1-2):363-381, 2016.
[15] Amir Beck, Yakov Vaisbourd. Globally solving the trust region subproblem using simple first-order methods. SIAM Journal on Optimization, 28(3):1951-1967, 2018.

Having fun with affine constraints

Let’s now look at the solution! The second relaxation is tight, while the first is not. To prove that we have a non-tight solution for the first relaxation, we can simply find a counter-example from random matrices in dimension $n = 2$. For example, for $$A = \left( \begin{array}{cc} \!3\! &\! 0\! \\[-.1cm] \! 0\! & \!-2 \! \end{array} \right) , \ \ b = \left( \begin{array}{c} \! 0\! \\[-.1cm] \!-1 \! \end{array} \right), \mbox{ and } \ c = \left( \begin{array}{c} \! 0\! \\[-.1cm] \!2 \! \end{array} \right),$$ a minimizer is $ x = \left( \begin{array}{c}\! \sqrt{3}/2\! \\[-.1cm] \! 1/2 \! \end{array} \right)$, with optimal value $11/8$, while the non-tight relaxation leads to a value of $-1/2$.

To show the tightness of the second relaxation, we first notice that the convex problem is equivalent to the following SDP relaxation: $$\min_{ X, x} \frac{1}{2} {\rm tr}(AX)-b^\top x \mbox{ such that } \left( \begin{array}{cc} \!X\!\! & \!x\! \\[-.1cm] \!x^\top \!\!&\! 1\! \end{array} \right) \succcurlyeq 0, {\rm tr}(X)=1, \mbox{ and } {\rm tr}(cc^\top X)\, – 2 t c^\top x + t^2 = 0. $$ Given the PSD constraint and the fact that $${\rm tr}(cc^\top X)\, – 2 t c^\top x + t^2 = {\rm tr} \left( \begin{array}{cc} \!X\!\! & \!x\! \\[-.1cm] \!x^\top \!\!&\! 1\! \end{array} \right)\left( \begin{array}{cc} \! cc^\top \! & \! -tc \! \\[-.1cm] \!-tc^\top \! & \! t^2 \! \end{array} \right),$$ the new constraint implies that $$ \left( \begin{array}{cc} \!X\!\! & \!x\! \\[-.1cm] \!x^\top \!\!&\! 1\! \end{array} \right) \left( \begin{array}{c} \!c\! \\[-.1cm] \!-t \! \end{array} \right)= 0, $$ that is, $Xc = t x$ and $c^\top x = t$. One can then check that these constraints are exactly equivalent to a projection of the problem in to a space of dimension $n-1$. The incorrect relaxation only has $c^\top x = t$, which is not enough. It took me a while to realize it…

How to think of uncertainty and calibration …

Pages — Thu, 20 Jan 2022 00:00:00 +0000

since i started Prescient Design almost exactly a year ago and Prescient Design joined Genentech about 4 months ago, i’ve begun thinking about (but not taking any action on) uncertainty and what it means. as our goal is to research and develop a new framework for de novo protein design that includes not only a computational component but also a wet-lab component, we want to ensure that we balance exploration and exploitation carefully. in doing so, one way that feels natural is to use the level of uncertainty in a design (a novel protein proposed by our algorithm) by our algorithm as a knob to control the trade-off between these two, in addition to other properties predicted by various (pseudo-)oracles.

but, then, i realized i don’t know what uncertainty is in a high level (!) which is somewhat weird, since i think i can often follow specific details of any paper that talks about uncertainty and what to do with it. so, as someone who dies for attention (pun intended, of course), i’ve decided to write a blog post on how i think i (should) view uncertainty. this view has almost no practical implication, but it helps me think of predictive uncertainty (aside from all those crazy epistemic vs. alleatoric uncertainty, which i’m sure i mistyped.)

in my mind, i start with the following binary indicator:

$$U(p, y, \tau) = I( \sum_{y’ \in \mathcal{Y}} I(p(y’) > p(y)) p(y’) \leq \tau).$$

if we are considering a continuous $y$, we replace the summation with the integration:

$$U(p, y, \tau) = I( \int_{y’ \in \mathcal{Y}} I(p(y’) > p(y)) p(y’) \mathrm{d}y’ \leq \tau).$$

$\mathcal{Y}$ is a set/space of all possible $y$’s. $I(\cdot)$ is an indicator function, i.e., it returns $1$ if true and otherwise $0$. $p$ is a predictive distribution under which we want to measure the uncertainty (e.g., a categorical distribution returned by a softmax classifier.) $y$ is a particular value of interest, and $\tau$ is a user-provided threshold.

this binary indicator tells us whether a particular value $y$ is within top-$(100 \times \tau)$% values under $p$. this can be used for a number of purposes.

first, we can use it to check how certain any particular prediction $\hat{y}$ is under our predictive distribution. let $p(y|x)$ be the predictive distribution returned by our classifier. we can solve the following optimization problem:

$$\min_{\tau \in [0, 1]} \tau$$

subject to

$$U(p(\cdot|x), \hat{y}, \tau) = 1.$$

in other words, we try to find the smallest threshold $\tau$ such that $\hat{y}$ is included. we refer to the solution of this optimizatoin by $\hat{\tau}$.

there is a brute-force approach to solving this optimization problem, which sheds a bit of light on what it does (and a bit on why i started with $U$ above,) although this only works for a discrete $y$. first, we enumerate all possible $y$’s and sort them according to the corresponding $p(y|x)$’s. let us call this sorted list $(y^{(1)}, y^{(2)}, \ldots, y^{(N)})$, where $N = |\mathcal{Y}|$. then, we search for $\hat{y}$ in this sorted list, i.e., $\hat{i} = \min_{i=1,\ldots, N} I(y^{(i)} = \hat{y})$. then, $\tau = \sum_{j=1}^{\hat{i}} p(y^{(j)}|x)$. in short, we look at how much probability mass is taken over by predictions that are more probable than $\hat{y}$, which seems (to me at least) to be the right way to think of the uncertainty assigned to $\hat{y}$.

second, we can use it to enumerate all predictions that should be considered under a given threshold $\tau$ beyond one best prediction by solving the following optimization problem:

$$\max_{Y \subseteq \mathcal{Y}} |Y|$$

subject to

$$U(p(\cdot|x), y, \tau) = 1,~\forall y \in Y.$$

in other words, we look at the largest subset $Y$ such that each and every element within $Y$ is certain under the predictive distribution $p(\cdot|x)$ with the certainty $\tau$.

this is a usual problem to solve and return the answer of, especially when we know that the problem has inherent uncertainty. in the case of machine translation, for instance, there are generally more than one equally good translations given a source sentence, and it is only natural to return top-$(100 \times \tau)$% translations rather than one best translation (though, we don’t do that in practice unfortunately.)

the same brute-force solution from the first problem is equally applicable here. once we have a sorted list and find $\hat{i}$, we simply return $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(i)})$. this is too brute-force and is not tractable (nor applicable) in many situations (precisely why we don’t return multiple possible translations in machine translation, in practice.)

third, we can use $U$ to calibrate a given predictive distribution toward any criterion. for instance, our calibration criterion could be

$$J(\hat{p}; \tau) = \left|\frac{\mathbb{E}_{x, y^* \sim p^*} [I(|\hat{\tau}(\hat{p}, \hat{y}) – \tau| < \epsilon)I(|y^* – \hat{y}| – \delta < 0)]}{\mathbb{E}_{x, y^* \sim p^*} [I(|\hat{\tau}(\hat{p}, \hat{y}) – \tau| < \epsilon)]}

– \tau \right| < \epsilon,~\forall \tau \in [0, 1],$$

where $\hat{p}$ is a monotonic transformation of $p$, and $\hat{y}=\arg\max_y p(y|x)$. you can think of $\hat{p}$ as a target distribution after we calibrate $p$ to satisfy the inequality above.

this criterion looks a bit confusing, but let’s parse it out. the two expectations effectively correspond to drawing true examples $(x, y^*)$’s from the ground-truth distribution $p^*$. for each $x$, we compute how often the prediction $\arg\max_y \hat{p}(y|x)$ is within the confidence threshold $\tau$. among those cases that satisfy this criterion, we check how good the prediction is (i.e., $|y^* – \hat{y} | – \delta < 0$). the proportion of such good predictions (the ratio above) should be within a close neighbourhood of the confidence threshold $\tau$.

with this criterion, we can solve the following optimization algorithm for calibration:

$$\min_{F} \int_{0}^1 J(F(p); \tau) \mathrm{d}\tau + \lambda \mathcal{R}(F),$$

where $\mathcal{R}(F)$ is some measure of the complexity of the monotonic transformation $F$ with the regularization coefficient $\lambda > 0$.

we can think of this optimization problem as finding minimal changes we need to make to the original predictive distribution $p$ to maximally satisfy the criterion above. of course, we can use different formulations, such as using a margin loss, but the same idea holds regardless.

there can be many other criteria. for instance, we may only care that the true value $y^*$ be within $\tau$ only. in this case, the optimization problem simplifies to:

$$\min_F \mathbb{E}_{x, y^*} \left[ 1- U(F(p), y^*, \tau) \right] + \lambda \mathcal{R}(F).$$

so, how does it relate to all the discussions on reducible (our inability) and irreducible (the universe’s inability) uncertainty? in my view, which is often extremely pragmatic, it’s almost a moot point to distinguish these two too strongly when we consider the uncertainty of prediction coming out of our system, assuming we’ve tried our best to minimize our inability (reducible uncertainty). with a finite number of training examples, which are almost never enough, and with our inability to tell whether there’s a model mismatch (the answer is almost always yes,) we cannot really even tell between reducible and irreducible uncertainty. then, why bother distinguishing these two rather than just lumping them together into $p(\cdot|x)$?

anyhow, the post got longer than i planned but stays as empty as i planned. none of these use cases of the binary indicator $U$ are actionable immediately nor tractably. they need to be polished and specialized for each case by carefully inspecting $p$, $\mathcal{Y}$, etc. but, at least this is how i began to view the problem of uncertainty in machine learning.

Acknowledgement

this whole post was motivated by my discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska and Jae Hyeon Lee. Oh, also, we are hiring!

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

arXiv Query: search_query=au:"Schmidhuber J"&id_list=&start=0&max_results=200 — Mon, 30 Jan 2023 13:30:56 +0000

Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, these methods typically require action information to be logged during data collection, which can be difficult or even impossible in some practical cases. In this paper, we investigate the potential of using action-free offline datasets to improve online reinforcement learning, name this problem Reinforcement Learning with Action-Free Offline Pretraining (AFP-RL). We introduce Action-Free Guide (AF-Guide), a method that guides online training by extracting knowledge from action-free offline datasets. AF-Guide consists of an Action-Free Decision Transformer (AFDT) implementing a variant of Upside-Down Reinforcement Learning. It learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC) that learns online with guidance from AFDT. Experimental results show that AF-Guide can improve sample efficiency and performance in online training thanks to the knowledge from the action-free offline dataset.

The Transformer Family Version 2.0

Lil'Log — Fri, 27 Jan 2023 00:00:00 +0000

Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post — restructure the hierarchy of sections and improve many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.

Notations

Symbol	Meaning
$d$	The model size / hidden state dimension / positional encoding size.
$h$	The number of heads in multi-head attention layer.
$L$	The segment length of input sequence.
$N$	The total number of attention layers in the model; not considering MoE.
$\mathbf{X} \in \mathbb{R}^{L \times d}$	The input sequence where each element has been mapped into an embedding vector of shape $d$, same as the model size.
$\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$	The key weight matrix.
$\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$	The query weight matrix.
$\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$	The value weight matrix. Often we have $d_k = d_v = d$.
$\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$	The weight matrices per head.
$\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$	The output weight matrix.
$\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$	The query embedding inputs.
$\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$	The key embedding inputs.
$\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$	The value embedding inputs.
$\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$	Row vectors in query, key, value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.
$S_i$	A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to.
$\mathbf{A} \in \mathbb{R}^{L \times L}$	The self-attention matrix between a input sequence of lenght $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$.
$a_{ij} \in \mathbf{A}$	The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$.
$\mathbf{P} \in \mathbb{R}^{L \times d}$	position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$.

Transformer Basics

The Transformer (which will be referred to as “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later simplified Transformer was shown to achieve great performance in language modeling tasks, like in encoder-only BERT or decoder-only GPT.

Ask HN: Good resources to become financially literate

Hacker News: Front Page — Thu, 26 Jan 2023 14:20:09 +0000

      I would like to change my "paycheck to paycheck" way of life and am looking to learn about financial vehicles (US based) and investment basics, and anything else that "I wish I had known when I was 20 years old".

Comments URL: https://news.ycombinator.com/item?id=34531858

Points: 48

# Comments: 54

How to Get New Ideas

Paul Graham: Essays — Wed, 25 Jan 2023 10:00:31 +0000

What we look for in a resume

Chip Huyen — Tue, 24 Jan 2023 00:00:00 +0000

I read every single one of the resumes we receive. Sometimes, I’d talk to a candidate and see that what we perceived as their strongest aspects actually weren’t included in their resume. Occasionally, a candidate would tell me that they didn’t expect their resume to still be screened by humans – had they known, they would have written their resume differently.

The resume evaluation process is pretty much a black box for most candidates. And it is so because few hiring managers have publicly discussed this. I thought I should start the conversation.

“Resume” here refers to both the traditional resumes and LinkedIn profiles. If your LinkedIn is current and contains all the information you want to share, that’s perfectly fine. Some hiring managers I know at other companies actually prefer looking at LinkedIn because they find it faster to find the info they need (e.g. company logos, time spent at each company). I’ve also seen candidates who use their personal websites as their resumes, which also works.

Whether you’re interested in applying to our startup, I hope that my perspective can shed some light on what is happening on the other side of the table, and how to create a resume that helps you put your best foot forward, not just with us, but also with other companies. Caveat: Every company hires differently. What’s right for us might not be right for other companies.

Job hunting is exhausting, and sometimes feels like a shot in the dark. I wish more companies would be transparent about what they’re looking for so that candidates can decide before they apply if they are a good fit.

Disclaimer:

We hire mostly infra and ML roles, so this post might not be applicable to other roles.
We’re very small so what we’re looking for is specific. If you’re not a good fit for us right now doesn’t mean you aren’t a good fit anywhere else, or won’t be a good fit for us in the future.

Table of contents
…. 1. Our overall approach
…. 2. We look for demonstrated expertise, not keywords
…. 3. We look for people who get things done
…. 4. We look for unique perspectives
…. 5. We care about impact, not meaningless metrics
…. FAQs
…….. Do resumes have to be one page?
…….. What to put on my resume if I have no experience yet?
…….. Do I need a cover letter?
…. Misc tips

1. Our overall approach

Startups like us and big companies, say Google, hire very differently. Therefore, it makes sense that a candidate should apply to a startup differently from how they’d apply to a big company.

Here is what we do, which might differ from what big companies do.

We don’t use an automated system to screen resumes.

We read every resume. This means that a lot of tricks you’ve read about how to beat automated screening such as include certain phrases of job descriptions in your resume, repeat “hot” keywords, fill your resumes up with random metrics, etc. doesn’t work on us. They can even hurt when you apply to companies like us, because the resume space used for these tricks is the space you aren’t using for things that are relevant to us.
We look for reasons to say yes.

For a company that receives a large volume of resumes (much larger than us), they’ll need a way to quickly filter out resumes, aka quick ways to say no. For example, their automated algorithm might reject a resume if it lacks certain years of experience or lacks certain keywords. For us, we evaluate each application holistically, and if your application demonstrates the aspects we’re looking for, as discussed in this post, we’d want to get on a call to get to know you more.

2. We look for demonstrated expertise, not keywords

About 90% of resumes we see have a long list of skills. Here are just some examples.

Initially, I was confused about the purpose of this list, because:

It’s unconvincing. There’s a big gap between “saying that you know something” and “being good at it.”
It can weaken your resume. For example, if you consider common skills like Jupyter notebook and git your competitive advantage (the only reason to include them in your resume), I would automatically assume that you have no other competitive advantage.
Expertise takes time to acquire. I’m skeptical of people who claim to be experts in too many things.

As I tried to figure out why, I found a world of articles that give tips on how to beat automated resume screening. These articles claim that at large companies that receive an influx of resumes, recruiters set rules to surface resumes that contain certain keywords. Candidates, not sure what these keywords are, populate their resumes with all the keywords they can think of.

That is a bad strategy for applying at startups like ours. We’re not looking for keywords. We look for demonstrated expertise. Here are a couple of ways to demonstrate your expertise.

Show how you acquired and use that skill in your job

Consider these two candidates who both mention Flink on their resumes. Which one would you want to talk to?
- Candidate A has Flink as one of the 30 items on their Technical skills box.
- Candidate B explained how they used Flink in their last job: “worked on a feature computation platform built on top of Apache Flink which processes 100,000 events per second and serves 10 different ML models.”
If during our interview, candidate B can tell me why they chose Flink over other stream processing engines, what issues they’ve encountered, and what changes they wish to see in Flink, I’d be sold!
Share your expertise on public channels such as: StackOverflow answers, open source contributions, papers, blog posts

Here are some examples that convinced me of a candidate’s expertise in certain technology:
- A candidate sent us their StackOverflow profile where they have answered over 100 questions on JavaScript.
- Another sent us a list of their merged PRs to PyTorch.
- Another sent us a blog post that covers in-depth detail about X that made our team go: “whoa, they really know about X.”
We understand that not everyone has time to contribute to public discourse, so #2 is optional.

3. We look for people who get things done

A lot of people get attracted to a startup for its vision: what this startup can be in the next 5 or 10 years. However, once you join the startup, what really matters is the day-to-day execution. When I was at Snorkel, our CEO kept telling us: “Startups don’t magically move faster than big companies just because they are startups. Startups move faster because they have to.” We have to get things done. To be able to move fast, we need people who aren’t afraid to roll up their sleeves and tackle difficult challenges.

There are two traits we look for to evaluate whether a candidate can get things done: initiatives and persistence. If you have experience that demonstrates these, please do include them in your resume.

Initiatives

To get anything done, you need to start it. There are a lot of people who can see a problem, but few who would do something about it. We want people who, when seeing a problem, proactively do something about it without waiting to be told. We look for initiatives a candidate has started before, such as:

A student club, an event, a team, a project at work. A project that you initiate doesn’t have to be about something new. Projects like writing documents or improving existing CI/CD are also extremely valuable.
A startup. A founder told me that his best hires are people who have previously founded a company, even if that company didn’t work out. They know the drill.

Persistence

Once we’ve started something, we need persistence to drive it to completion. Some signals of persistence that I’ve seen:

Daily contribution to GitHub for one whole year.
Being good at anything that requires consistent effort, e.g. a Kaggle master, a chess master, a professional athlete, etc.
Having previously joined another early startup before and stuck around.

I’m a bit hesitant when I see candidates who change jobs too frequently, e.g. 5 different companies in 5 years. I understand that not all jobs work out, so it’s okay, sometimes necessary, to move on. However, consistent job jumping can imply that you get bored or give up easily. A year at a job is hardly enough to get deep into a problem space and make significant contributions.

Working at a startup can be tough, and we don’t want someone who joins and leaves at the first sign of challenge. We want someone to stick around and help us through different phases. This process, I hope, can also provide you with a diverse set of experiences to prepare you for whatever you want to do after.

4. We look for unique perspectives

Startups like ours exist to solve a problem that few others can solve. This requires us to see the problem from a perspective that most people don’t. For that reason, we’re not looking for group thinkers or hype chasers. We look for people who can bring a unique perspective to the table.

Your unique perspective can be demonstrated in your career/life choices, your writing, your side projects.

Recently, I had a panel discussing resume screening on our MLOps Discord, and Kyle Kranen, a senior deep learning engineer from NVIDIA, discouraged candidates from listing cookie-cutter projects on their resumes, as they take away space for things that are uniquely your strengths.

Cookie-cutter projects happen when someone simply replicates a common solution to a popular project – titanic, sentiment analysis of tweets, stock trading, chatbots – without any new approach or insight. I see them in about ⅓ of all the resumes we’ve received. These projects are good to practice on. I myself practiced on them when I started. However, they won’t help you stand out as a candidate.

Examples of interesting projects that I’ve seen:

A personal website that looks exactly like MacOS.
A CLI tool to autocomplete your bash commands.
Matcha-making robot arm.

5. We care about impact, not meaningless metrics

I recently asked on LinkedIn how resumes have become so metrics-oriented. Almost all resumes I’ve seen are full of metrics. Metrics are good when they serve to make a point. However, many of these metrics leave me more confused than impressed. For example, here are some of the actual metrics I’ve seen:

Built a Transformer-based model that achieved an accuracy of 89% (no info on the task or the baseline)
Work on 15 different domains.
Participated in over 300+ code reviews (not sure what this is supposed to show. If I do a code review a day, it’d only take me a year to reach this number, which really doesn’t say much.)

People were quick to point out to me: the most common advice for writing resumes is “quantify your impact,” e.g. how much money you’ve saved for your company, how much faster you’ve made your program. Due to this advice, many candidates have told me that they need to have metrics on their resumes. Sometimes, when unable to quantify their impact, they settle for what they can quantify.

Unfortunately, not all that are quantifiable are impactful. Showing the number of code reviews you’ve participated in, without any context, doesn’t say anything flattering. It’d be better if you could show how the learnings from all these code reviews are relevant to the role you’re applying for: e.g. how they helped you become a better engineer.

Clarify your contributions

Every job seeker should make their resume look as good as possible. I’ve talked to many candidates who are uncomfortable talking about their achievements, either due to humility or shyness. This is especially common among female candidates that I’ve talked to.

On the other hand, I’ve seen many candidates who take this too far. For example, last week, a candidate wrote that they have extensive experience in deploying large scale systems because of their one Amazon internship.

Often, I’d see a candidate who takes credit for the entire team. An applicant said on his resume that he “deployed an online prediction model that served 10M daily active users.” During our interview, I found out that he wasn’t at all involved in deployment nor in scaling the prediction service. His task was wrangling data from an S3 bucket. To be clear, this is an important skill, just not what we were looking for. We explicitly said we were looking for experience in deploying models in production. This interview was a waste of time for both of us.

Metrics that we’d love to see

I do care about metrics. I especially appreciate metrics that are presented with two components:

How they can be tied to business objectives.
Your contribution in achieving that metric.

Here are some of the metrics that we’ve found convincing:

Part of a 2-data-scientist team that owns feature engineering for a fraud detection system. Added 200 new features, resulting in a reducing the false positive rate from 20% to 15%, while keeping the false negative the same.
Built caching strategy using application-level caching & Redis for 2000 QPS, leading to a decrease in response time by 50%.

Screenshotted from the resumes submitted to our resume critique session

Not all impact has metrics

Fixing a bug or helping a coworker out is impactful, but hard to measure. It’s a great signal when your work receives awards and recognitions, such as:

An internal company award (e.g. intern of the year, MVP, founder award, winning hackathon)
Promotions, e.g. I was impressed with a candidate that went from being a data scientist to a senior MLE to a staff MLE within 4 years.
Glowing recommendations from previous teammates and managers. We actually made an offer to a candidate after two of their references talked about how much this candidate helped and mentored them.

FAQs

Do resumes have to be one page?

A candidate told me, during our interview, that he didn’t include his side projects in his resume because he was worried that it’d take him over the one-page limit. To be clear, I’m NOT going to reject a candidate just because their resume is over one page.

The one-page limit suggestion is to help you be concise. Like the French mathematician Blaise Pascal once said: “I have only made this letter longer because I have not had the time to make it shorter.” The longer your resume goes on, the more chance that your biggest strengths will be buried among less important details.

The goal of a resume is not to show everything about you. The goal is to put your best foot forward to show a company why you’d be an excellent addition to their team. Rarely do I see a candidate whose best foot forward can’t be contained within one page.

If you have a lot of experience, you should have more options to choose from to highlight your strengths. A 3-page resume without any focus or highlight shows the lack of judgment for what is important and concise communication.

What I find a lot harder is for candidates new to the field to have sufficient things to fill in one page.

Some tips to shorten your resume

Find repetition and remove them. For example, do you need to list all 5 of your personal projects, or the top 3 would suffice?
Remove irrelevant experience, even if that creates gaps in your resume. Personally, I don’t care about gaps – everyone needs time to recharge and take care of personal matters.
Most resumes I’ve seen can have shorter education sections.
Remove common skills, e.g. git, notebook, MS Word/Excel.
Find a more efficient format, e.g. using multiple columns.
Reduce font size, page margin.

What to put on my resume if I have no experience yet?

This is a tough question. If you don’t have the experience a role needs, it doesn’t matter what you put on your resume, that role is just not going to be a good fit for you.

I have, however, been impressed by candidates who make up for their lack of experience in our field with impressive experiences in other fields.

Personally, I was once given an internship at a startup, even though I had none of the experiences they were asking for. I reached out to the CEO explaining to him the problems I worked on in the past and how the skills I acquired can be transferred to the problems his company was working on.

It’s a long shot, but it’s a shot.

Do I need a cover letter?

The order in which I read an application: resume -> cover letter. If you have stellar experience and skills, we’d want to talk to you even if you don’t have a cover letter. Some hiring managers told me that they don’t read cover letters at all.

I, however, do appreciate it when a candidate writes a cover letter that explains why they want to join us and why they think they’re a good fit for the role. For candidates with unusual transitions (e.g. switching from another career into machine learning), a cover letter could also be a great place to explain their transition.

If you write a cover letter, keep it concise. A cover letter doesn’t have to be a PDF. It can just be an email.

If an information is important, include it in your resume instead of a cover letter. Sometimes, candidates put the important info in the email they sent to our team’s email address, which might not be entered correctly into our application tracking system.

Misc tips

General

If you’re applying to a small startup, say, of less than 20 people, spend some time researching who works at that startup and email them directly.
Send your resume as a PDF instead of an editable format like Microsoft Word or Google Docs. The editable format might not render correctly on other people’s computers.
If you can get someone who used to supervise you to talk about how great you are, include their quote in your resume. Quotes from friends/family don’t count.
Don’t use abbreviations unless you know for sure your audience knows what you’re talking about.

Public links

If you’ve contributed to open source projects, include links to the PRs or a public architectural design you’re the most proud of.
If your GitHub has a lot of repos, pin the important ones on your homepage (you can pin up to 6). You can also write a README to guide your visitors on what repos they would look at.
Don’t include a link to your GitHub link if it’s empty.
If you have publications, link to your Google Scholar profile.
If you’re a co-author of a paper with multiple authors, bold your name in the author list.
If you have more to show than your resume allows, create a personal website. Assuming that you have basic coding skills, it’s fairly easy to create a free personal website using GitHub Pages.

Education

If you’re still in school, put education earlier in your resume, as it helps the person screening your resume knows where you are and what you’re looking for. If you’ve been working, put your work experience first.
If you’re still in school, list your expected graduation date so companies know when you can start working full-time.
If you’ve been working for 2+ years, remove your GPA and coursework. I care more about your work experience.
Similarly, course projects should only be listed if you think they’re better than the average course project.

Resume resources

There are many resources on how to write a resume. Here are some of them:

Conclusion

Resume writing is a one-off process: it’s the art of arranging what you already have in a way to best help you stand out. It’s hard to write a good resume without having things to put on it.

Resume building is an ongoing process: continuously improve the experience, skills, recommendations, etc. that you have. If you’re good at building, you might not even have to write a resume. If your work catches companies’ attention, they’ll reach out to you.

A few friends told me that “resume building” has a bad connotation, as it might suggest that we should only do things to put on their resumes and ignore important but less shiny things like fixing a bug or helping a coworker out.

What we’re really looking for isn’t the best resume or the best resume builder, but a team member, someone who cares. Resumes might not be the perfect tool to show that, and that’s why we err on the side of talking to that person to find out.

Our startup is still young and we’re still learning. I’d love to learn about this process at other companies, and hear from you what we can do to make the process easier for you.

And of course, we’re hiring for strong infra and ML engineers!! If you think we might be a good fit for you, do reach out.

Karol Hausman @hausman_k

@hausman_k — Tue, 24 Jan 2023 16:32:03 +0000

Recent articles about RL such as https://avivtamar.substack.com/p/deployablerl and https://towardsdatascience.com/shifting-winds-in-robot-learning-research-2ead21671a65 … touch upon a concept that we found to be crucial to consider in robot learning: Data sponge 🧽 🧵

Towards Deployable RL - What’s Broken with RL Research and a Potential Fix

avivtamar.substack.com

Reinforcement learning (RL) has demonstrated great potential, but is currently full of overhyping and pipe dreams. We point to some difficulties with current research which we feel are endemic to the direction taken by the community. To us, the current direction is not likely to lead to “deployable” RL: RL that works in practice and can work in practical situations yet still is economically viable. We also propose a potential fix to some of the difficulties of the field.

Since Deepmind's breakthrough results of applying deep RL to Atari games (2015), RL has been lauded in the AI community, with promises ranging from `a path to AGI', through `the key to self driving cars', and up to `solve all planning problems'. While clear and worthy progress has been made, the current vibe is that aside from solving games, RL is not living up to our expectations. The causes, we believe, are five popular research practices, which were relevant for 2015, but are currently stagnating the field.

As our mission is not to blame anyone (we have surely made our fair share of mistakes), we will keep examples and references to ourselves. Our point is that currently the hype over RL practice and theory is largely unjustified by the facts and that the discipline needs to mature.

Overfitting to specific benchmarks. State-of-the-art methods for Atari are by now very different from state-of-the-art methods for Mujoco (or other benchmarks). Nearly every paper is required to show some distinction (under some metric) for one of the popular benchmarks. However, both benchmarks are made up, driving the field further away from algorithms that will be potentially good for practical problems that are not Atari games/robot locomotion. Consequently, research on tweaks that will work most likely only for particular benchmarks is abundant. Furthermore, it is not clear how progress on a benchmark relates to real-world value. In contrast to, say, pretrained ImageNet features or BERT sentence embeddings, which are widely applicable to real world applications, solving RL benchmarks currently does not yield tangible value.
Wrong focus. Most research focuses on sample complexity for a given benchmark. This is hardly the case in practice - compute can be cheap relative to engineering effort, acquiring labelled demonstrations can significantly speed up learning, and the development process must consider issues like overall system stability, testability, ease of debugging, interpretability, integration with other components/agents, etc. Current benchmarks ignore the deployable nature of situated (RL-driven) agents completely, focusing on algorithms rather than on a system/engineering view. This is most clearly represented in the prevalent OpenAI Gym API, which abstracts away all ‘system-design’ issues for quickly making progress on sample complexity. While initially a good idea, this hampers progress since in many practical problems, just figuring out what are useful states, actions, and rewards is a critical component of the development process.
Detached Theory. What should be the role of theory in modern RL research? There is no real system with small finite state and action spaces. Ideally, theory should help understand phenomena observed in practice, and suggest algorithmic ideas. While “nothing is more practical than a good theory” (Lewin, 1952), useful theory seems to be quite rare. Some reasons are: regret minimization is overly pessimistic; there is a lot of prior knowledge (in the algorithm design, parameter choices, etc.) that is not accounted for in the theory; finite state and actions is not a good model for many problems of interest; and a focus on unimportant quantities that nobody cares about.
Uneven Playing Grounds. Measuring the performance of an algorithm on a benchmark is confounded by the resources available to the implementer, such as proficiency in hyperparameter tuning, the size of the neural network trained, or prior knowledge about the problem/solution. The variability in the scale of experiments and software engineering affordable by different researchers (e.g., academia vs. industry), and the current trend at top conferences to prefer massive experimentation over conceptual novelty, can inhibit long-term progress.
Lack of Experimental Rigor. Impressive singular experiments sometimes give a false sense of progress. While good PR click-bait, these may only disappoint as they preclude research on “solved” problems which are in fact far from solved. For example, the simple 2D ProcGen Maze benchmark remains unsolved. We need more rigorous evaluation of difficulty and success. Moreover, for industry to adopt an approach, an impressive result is not enough - stability, development time and cost, testability and life-cycle issues are critical. Currently, the publication standard is that failure cases are almost never reported, stability is impossible to tell, and software design issues are not even discussed.

There are two main dogmas in the RL community on how to make progress in solving real world decision making problems. The first, which we refer to here as the ‘generalist agent’ view (sometimes called “RL first”), is that future progress will be made by focusing attention on large-scale training of agents that solve diverse problems, with the hope that along the way a generalist agent will develop, and will be a useful component in various real world problems. The second view, which we refer to as ‘deployable RL’, takes a more pragmatic view seeking to design RL algorithms that solve concrete real world problems (sometime this view is called “RL second”). The five problems described above are relevant to both approaches. In the fix we propose, however, we focus on the second view. The reason is that with the current knowledge in the field, we believe there are concrete benefits to reap in deployable RL, while the generalist agent prospect is still at an early development stage.

It is important to understand that at present, RL is uneconomical to deploy. Changing this requires both research of how to deploy RL effectively, and also a better understanding of the gains that an RL solution may bring, which will eventually make it worthwhile to pursue. We propose a constructive model for research in RL which we feel is relevant for our current state of knowledge in the field, and will advance progress towards deploying RL-based solutions in the real world.

In the following, we outline three general principles. Our hope is that researchers will adopt some of these principles in their projects, and in turn, publication standards will grow to value these principles appropriately.

Challenges instead of benchmarks A challenge is a problem (real or not) sponsored by a group of researchers from academia/industry. Different from a benchmark, a challenge is not aimed at comparing algorithms, but is meant to simply be solved. That is, there is real world value in making progress on a challenge that holds regardless of the RL (or other) algorithm used to solve it.

Credit for contributing challenges A rigorous presentation of a challenge should be credited as an important contribution (in terms of citations, impact, etc.). The contribution is not just the specific description of an application, but rather a community around it, and a supporting platform (code, scoreboard, etc.). We envision a special type of papers: “contributed challenges”. A fundamental criterion for evaluating the challenge would be a quantifiable measure of making real, and well-accepted progress on it. That is - no more maximizing made up rewards, but working towards indicators of real-world progress.

Measurable progress is the main criterion for publication Every publication that makes progress on a challenge either by suggesting a new algorithm, a positive or negative result, should explain the limitations and issues with the proposed new algorithm and how it addresses progress specifically. The main criterion for publishing a work is the progress made in a quantifiable manner.

Weight class. Compute available during development distinguishes contributors to “weight class” which should also be reported. The overall amount of compute used for research is important to assess the significance of the results.
Theory papers should address specific challenges When a theory paper proposes a model, an algorithm, or a solution, the problem should be clear and grounded in a real challenge. We are not opposed to working on theory of small finite models but the goal of the research should be well justified in terms of its potential impact on real world problems. Theory papers should also consider problems that have to do with the system life cycle focusing on issues such as data acquisition, debugging, testability and performance deterioration.
Design-Patterns Oriented Research Software design is based on plugging multiple design patterns, massaging them, and repackaging a software system. A design pattern in software engineering is a skeleton of a solution for a particular problem that can be potentially reused for many instances. Similarly, real-world RL based systems should have conceptual solutions to problems where issues such as testability, debuggability, and other system life-cycle issues are addressed. Approaches to life-cycle management that solve problems grounded in applications are a welcome addition to RL research. Importantly, we foresee that one way to make significant progress on a challenge would be by developing novel design patterns for it.

The sooner the community starts focusing on deployable RL, the better the chances of seeing large scale real world impact. To us, deployable RL means RL that can work at scale, be economically feasible, and can eventually be put in the field.

If you relate to the points above, here's what you can do.

Contribute challenges. Contributing challenges require deep understanding of an application domain. Some industrial research labs are already well positioned to start tackling this task. There are many scientific and engineering disciplines where RL could make an impact, such as the natural sciences, medicine, and manufacturing. A challenge should matter to the discipline it comes from so while the RL expert may not possess the domain knowledge, their goal in challenges is to explain to the domain experts what is possibly feasible and what is not. Joining forces with industry or other academia would lead to better framed and more realistic challenges.
Frame your own research. A change of focus may be needed when writing papers and conducting research to focus on solving real problems that matter, and framing the research effort within the deployable RL principles.
Criticize others' research. Affecting a change to the RL research community would likely require a coordinated effort from researchers, reviewers, and senior area chairs. If you are reviewing RL papers, shift gears from the common benchmark-driven evaluation, and ask how a paper gets the field closer to real-world impact. If you are a senior reviewer or an area chair - consider instructing your reviewers to judge papers differently.

As a starting point, we invite researchers to propose and discuss ideas for challenges in the comments section.

Yann LeCun @ylecun

@ylecun — Tue, 24 Jan 2023 15:56:10 +0000

An upcoming ICLR talk+paper on the duality between sample-contrastive methods (e.g. SimCLR) and variable-contrastive methods (e.g. VICReg) for Self-Supervised training of joint embedding architectures. With @garridoq_, @Yubei_Chen & @laurentnajman

Quentin Garrido @garridoq_

Our work "On the duality between contrastive and non-contrastive self-supervised learning" has been accepted as an Oral (notable top-5%) at #ICLR. Thanks to my amazing co-authors @Yubei_Chen @AdrienBardes @laurentnajman and @ylecun! Preprint: http://arxiv.org/abs/2206.02574 A thread🧵👇

On the duality between contrastive and non-contrastive self-supervised learning

arxiv.org

Download PDF

Abstract: Recent approaches in self-supervised learning of image representations can be categorized into different families of methods and, in particular, can be divided into contrastive and non-contrastive approaches. While differences between the two families have been thoroughly discussed to motivate new approaches, we focus more on the theoretical similarities between them. By designing contrastive and covariance based non-contrastive criteria that can be related algebraically and shown to be equivalent under limited assumptions, we show how close those families can be. We further study popular methods and introduce variations of them, allowing us to relate this theoretical result to current practices and show the influence (or lack thereof) of design choices on downstream performance. Motivated by our equivalence result, we investigate the low performance of SimCLR and show how it can match VICReg's with careful hyperparameter tuning, improving significantly over known baselines. We also challenge the popular assumptions that contrastive and non-contrastive methods, respectively, need large batch sizes and output dimensions. Our theoretical and quantitative results suggest that the numerical gaps between contrastive and non-contrastive methods in certain regimes can be closed given better network design choices and hyperparameter tuning. The evidence shows that unifying different SOTA methods is an important direction to build a better understanding of self-supervised learning.

From: Quentin Garrido [view email] [via CCSD proxy]
[v1] Fri, 3 Jun 2022 08:04:12 UTC (2,064 KB)
[v2] Wed, 5 Oct 2022 14:05:54 UTC (279 KB)

Multi-Head Adapter Routing for Data-Efficient Fine-Tuning

arXiv Query: search_query=au:"Roux N L"&id_list=&start=0&max_results=200 — Mon, 07 Nov 2022 19:35:55 +0000

Parameter-efficient fine-tuning (PEFT) methods can adapt large language models to downstream tasks by training a small amount of newly added parameters. In multi-task settings, PEFT adapters typically train on each task independently, inhibiting transfer across tasks, or on the concatenation of all tasks, which can lead to negative interference. To address this, Polytropon (Ponti et al.) jointly learns an inventory of PEFT adapters and a routing function to share variable-size sets of adapters across tasks. Subsequently, adapters can be re-combined and fine-tuned on novel tasks even with limited data. In this paper, we investigate to what extent the ability to control which adapters are active for each task leads to sample-efficient generalization. Thus, we propose less expressive variants where we perform weighted averaging of the adapters before few-shot adaptation (Poly-mu) instead of learning a routing function. Moreover, we introduce more expressive variants where finer-grained task-adapter allocation is learned through a multi-head routing function (Poly-S). We test these variants on three separate benchmarks for multi-task learning. We find that Poly-S achieves gains on all three (up to 5.3 points on average) over strong baselines, while incurring a negligible additional cost in parameter count. In particular, we find that instruction tuning, where models are fully fine-tuned on natural language instructions for each task, is inferior to modular methods such as Polytropon and our proposed variants.