<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anikamisra.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anikamisra.github.io/" rel="alternate" type="text/html" /><updated>2026-04-29T19:05:35+00:00</updated><id>https://anikamisra.github.io/feed.xml</id><title type="html">Anika Misra</title><subtitle>Anika&apos;s website</subtitle><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><entry><title type="html">Reinforcement learning for NLP</title><link href="https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms/" rel="alternate" type="text/html" title="Reinforcement learning for NLP" /><published>2026-03-16T17:00:00+00:00</published><updated>2026-03-16T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms</id><content type="html" xml:base="https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms/"><![CDATA[<p>In my last post, I talked about applying an originally image-based machine learning method to language, and the challenges that came with it. 
Today, we’re going to be discussing a similar topic—specifically, Reinforcement Learning (RL) for NLP reasoning. 
Though RL was not initially designed for language, it offers a number of benefits when applied to LLMs. Specifically, RL allows LLMs to learn how 
to create outputs with specific <strong>goals</strong> in mind.</p>

<h2 id="background">Background</h2>
<p>Traditionally, RL was not used for language, but rather for applications like games, robotics, and general planning. The main thing to understand is that RL teaches 
an agent a policy that maximizes a reward. 
For example, suppose we are trying to train a robot to navigate a square grid in order to maximize its score:</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/robot_moving_grid.png" alt="Robot moving example" style="max-width: 700px; width: 80%;" />
</div>
<p><br /></p>

<p>Image source: <a href="https://arxiv.org/pdf/2412.10400">Wang et al</a></p>

<p>Suppose that the number on each square contributes to the robot’s score. Then, naturally, the robot will be <em>penalized</em> for stepping on a <em>low-number square</em>, and <em>rewarded</em> for stepping 
on a <em>high-number square</em>. Through these rewards, the RL algorithm trains the robot to navigate from one end of the grid to the other in order to maximize its score (<a href="https://arxiv.org/pdf/2412.10400">Wang et al</a>).</p>
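<p>To make this concrete, below is a minimal sketch of tabular Q-learning on a toy version of such a grid. Everything here (the grid values, the hyperparameters, the one-dimensional layout) is my own illustrative assumption rather than the setup from Wang et al; it only shows how repeated reward signals shape a policy.</p>

<pre><code class="language-python">import random

# Toy 1x4 grid: the number on each square is the reward for stepping on it.
# These values are made up for illustration, not taken from Wang et al.
grid_rewards = [-1, 0, 2, 10]            # square 3 is the high-score square
actions = [-1, 1]                        # move left or move right
q_values = {(s, a): 0.0 for s in range(4) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    for step in range(10):
        # Epsilon-greedy: usually take the best-known action, sometimes explore.
        if random.random() &lt; epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q_values[(state, a)])
        next_state = min(max(state + action, 0), 3)
        reward = grid_rewards[next_state]           # stepping on a square earns its number
        best_next = max(q_values[(next_state, a)] for a in actions)
        # Q-learning update: nudge the estimate toward reward plus discounted future value.
        q_values[(state, action)] += alpha * (reward + gamma * best_next - q_values[(state, action)])
        state = next_state

# The learned policy: which way the robot prefers to move from each square.
print([max(actions, key=lambda a: q_values[(s, a)]) for s in range(4)])
</code></pre>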

<h2 id="application-to-llms">Application to LLMs</h2>
<p>Now, you may be wondering how something like this could ever relate to, say, a chatbot answering human questions. 
The translation is quite simple: rather than having a robot be rewarded when it steps on a high-numbered square, we can have the chatbot be rewarded when it provides a good response, 
and be penalized when it provides a bad response.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/rl_basics_3.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: Reinforcement learning (RL) for LLMs rewards them for good responses and penalizes them for bad responses.</p>

<p>Let’s take a step back and think about how this is different from other types of machine learning pipelines. The most well-known pipeline for LLMs is 
probably <strong>supervised learning</strong>, which uses explicit, labelled text data so that the model directly learns the most likely next token. While this type of data is difficult to construct, 
it allows the model to have a rich reference point to learn from.</p>

<p>On the opposite end of the spectrum is <strong>unsupervised learning</strong>, which has no labelled output data: 
these techniques are just used to discover existing structures and patterns in the data. Unsupervised learning does not require the human labor of constructing good output data; 
however, it cannot really be used to train chatbots.</p>

<p>RL lies in the middle of unsupervised and supervised learning, giving us the best of both worlds. Unlike supervised learning, 
RL does not require wordy, explicitly labelled outputs; instead, it learns from less complex reward signals (e.g. ‘good’ and ‘bad’). 
However, unlike unsupervised learning, we can actually use RL to train a chatbot on what to say.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/RL_vs_SL.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: RL uses simple rewards rather than trying to learn a token structure directly.</p>

<p>Now, RL is not the magic solution for everything. As labor-intensive as gathering labelled data is, supervised learning is still necessary for the pre-training of a language model. 
One cannot teach a model how to speak from reward signals like ‘good’ vs ‘bad’ alone (imagine trying to teach a human to speak that way!). However, RL can be used in the 
<strong>fine-tuning</strong> stage of training an LLM, which comes after the <em>base</em> pre-training step and polishes the final model, as an alternative to traditional supervised fine-tuning.</p>

<p>Now, RL has more benefits beyond its simpler output label structure. One of the main benefits of RL is that it allows you to train the 
model with a specific goal in mind. For example, why do the rewards just have to be ‘good’ or ‘bad’? They can also be ‘safe’ or ‘unsafe’, ‘helpful’ or ‘not helpful’, 
‘correct’ or ‘incorrect’. This leads us to how RL can specifically be used to improve reasoning capabilities in LLMs.</p>

<h2 id="using-rl-to-improve-reasoning-in-deepseek-r1">Using RL to improve reasoning in DeepSeek-R1</h2>
<p>Chain-of-Thought prompting (<a href="https://arxiv.org/pdf/2201.11903">Wei et al</a>) led to amazing reasoning improvements in LLMs by simply having the model ‘think step-by-step’ in its answer. 
With this in mind, researchers at Google decided to go beyond prompting, and try to actually train an LLM on these step-by-step reasoning trajectories. 
As expected, this training improved the reasoning capabilities of models on several benchmarks (<a href="https://arxiv.org/pdf/2210.11416">Chung, Hou, Longpre et al</a>).</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/cot_prompt_vs_cot_finetuning.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: CoT prompting only has the model ‘think out loud’ <em>after</em> it’s done training, while CoT finetuning actually trains the model on thinking out loud.</p>

<p>However, while training on reasoning traces improves performance, the authors of <a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a> argue that gathering these human-labelled 
reasoning trajectories is expensive. More importantly, these human-labelled reasoning traces were 
created by, well, humans. An LLM could potentially think differently (maybe even better) than a human does. 
Why does it have to copy this limited, human-based reasoning process?</p>

<p>To explore this question, the DeepSeek-R1 authors use RL to train their LLM. Remember how we said that RL allows you to have different reward types? In this case, rather than 
giving a ‘good’ or ‘bad’ reward, the authors give a ‘correct’ vs ‘incorrect’ reward. Additionally, they have the model output some sort of reasoning trace, but they only reward and 
penalize the model’s final answer. That means that there is no influence on the reasoning process that the model comes up with. As a result, the model can potentially 
learn non-human reasoning pathways while still learning how to reach a correct final answer!</p>
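<p>As a rough sketch, an outcome-only reward like this could look like the following. The ‘FINAL ANSWER’ marker and the exact reward values are simplifications I am assuming for illustration; they are not DeepSeek-R1’s actual rule-based implementation.</p>

<pre><code class="language-python">def outcome_reward(model_output, gold_answer):
    """Reward only the final answer; the reasoning trace itself is never graded."""
    marker = "FINAL ANSWER:"
    if marker not in model_output:
        return 0.0                                    # unparseable output earns nothing
    final = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if final == gold_answer else 0.0

# The reasoning before the marker can be anything (even mixed languages):
# the reward function never looks at it, only at the final answer.
trace = "Hmm, 17 times 3... dix-sept fois trois... FINAL ANSWER: 51"
print(outcome_reward(trace, "51"))   # prints 1.0
</code></pre>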

<p>Something interesting happened when doing this: the original model had great reasoning capabilities; however, its reasoning traces had very poor readability and 
occasionally mixed English and Chinese. To improve the readability of the reasoning traces, the authors suppressed this language-mixing behavior in the final model.</p>

<p>Initially, when reading this, my intuition told me that suppressing the language mixing was maybe not the most beneficial for the model’s reasoning. If the whole point of using RL 
was to have the model create the best, non-human-influenced reasoning traces, and it naturally developed language mixing, then why get rid of that? Indeed, 
as the paper notes, suppressing language mixing in DeepSeek-R1-Zero did actually decrease reasoning capabilities.</p>

<p>In fact, <a href="https://arxiv.org/pdf/2507.15849">Li et al</a> studied this more directly and found that language mixing is not a random artifact of these RL-created reasoning traces, but 
rather a deliberate and strategic reasoning choice that tends to help reasoning performance in LLMs. It’s exciting that, although RL was initially used in DeepSeek only to improve model 
reasoning capabilities, an unexpected byproduct was the discovery of language mixing as a tool that improves LLM problem-solving ability.</p>

<h2 id="i-dont-know-in-language-models">‘I don’t know’ in Language Models</h2>

<p>What actually inspired me to start this post about RL for LLMs was this <a href="https://twitter.com/SebAaltonen/status/2030254482429219170">tweet</a> and subsequent comment I saw the other day.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/tweet.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>What these posters are describing is that language models ‘do not know what they do not know’. 
This can be harmful when people rely on them for accurate, factual information. For example, if a user needs a particular formula to solve a physics problem, 
it is probably better for a chatbot to give no formula at all than to give an incorrect one.</p>

<p>The DeepSeek-R1 RL structure did not necessarily penalize models for being incorrect. Rather, it gave the model a score of 1 when the answer was correct, 
and 0 when it was incorrect. But what if we tried implementing the structure brought up in this comment?</p>

<p>Indeed, <a href="https://arxiv.org/pdf/2509.25760">Wei et al.</a> tried this reward structure for themselves, using these three reward values:</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/r_ternary.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 70%;" />
</div>
<p><br /></p>

<p>They find that using this reward design, as opposed to the binary one similar to what DeepSeek used, actually does reduce hallucination in LLMs 
(e.g., having the model make something up as opposed to simply saying ‘I don’t know’). However, accuracy is still the highest when the binary reward system is used. 
In other words, it is still better for the LLM to ‘guess’ on most things than to say ‘I don’t know’ to some.</p>

<p><a href="https://arxiv.org/pdf/2601.20126">Jha et al.</a> tried a similar experiment for themselves, using the same reward structure as specified above. 
However, they actually varied the middle reward as a hyperparameter and found that setting it to -0.25 rather than just 0 led to better performance on their benchmarks. 
In other words, there is still some penalty needed for saying ‘I don’t know’, but it just can’t be as strong as the penalty for getting the answer incorrect.</p>
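<p>Putting the two papers together, the ternary reward might be sketched roughly like this. The value of -0.25 for abstaining follows the hyperparameter Jha et al. found helpful; the penalty for a wrong answer and the exact-match abstention check are my own illustrative assumptions, not values from either paper.</p>

<pre><code class="language-python">def ternary_reward(prediction, gold_answer, idk_penalty=-0.25, wrong_penalty=-1.0):
    """Ternary reward: abstaining is penalized less than answering incorrectly."""
    answer = prediction.strip()
    # Real systems detect abstention more robustly; exact string match is for illustration.
    if answer.lower() == "i don't know":
        return idk_penalty                 # mild penalty for abstaining
    if answer == gold_answer:
        return 1.0                         # full reward for a correct answer
    return wrong_penalty                   # strongest penalty: confidently wrong

print(ternary_reward("Paris", "Paris"))           # 1.0
print(ternary_reward("I don't know", "Paris"))    # -0.25
print(ternary_reward("Lyon", "Paris"))            # -1.0
</code></pre>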

<p>Now, while RL for LLMs may work well for aligning them with certain values and enabling them to develop their own reasoning traces, I’m still doubtful that it is a 
strong enough method to teach a language model what it knows and what it doesn’t. As expected, as the penalty for saying ‘I don’t know’ got weaker and weaker, 
the language model started saying ‘I don’t know’ more often (<a href="https://arxiv.org/pdf/2601.20126">Jha et al</a>). But does it actually know what it doesn’t know, or 
is it just trying to maximize its reward? The future that these tweeters were hoping for may be easier said than done.</p>

<h2 id="summaries-and-takeaways">Summaries and takeaways</h2>
<p>RL is a powerful rewards-based method that, although not traditionally designed for language models, allows researchers to fine-tune LLM responses with 
specific goals in mind. While focusing on these goals, the LLM will learn its own paths to get there, which may lead to interesting discoveries, 
such as the benefits of language mixing in reasoning traces for LLMs. It’s also worthwhile to experiment with the structure of the reward itself; 
for example, penalizing a model for saying ‘I don’t know’ less than penalizing it for getting an answer incorrect reduces hallucination.</p>

<p>I personally thought DeepSeek’s approach of using RL to have the model develop its own reasoning traces was really exciting. I wonder what other beneficial LLM 
tools we can discover by letting LLMs take their own paths and giving them more freedom.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reinforcement learning" /><category term="LLM" /><category term="reinforcement learning" /><summary type="html"><![CDATA[In my last post, I talked about applying an originally image-based machine learning method to language, and the challenges that came with it. Today, we’re going to be discussing a similar topic—specifically, Reinforcement Learning (RL) for NLP reasoning. Though not initially used for LLMs, RL has a number of benefits when applied to LLMs. Specifically, RL allows LLMs to learn how to create outputs with specific goals in mind.]]></summary></entry><entry><title type="html">Diffusion in language models, and rethinking thought</title><link href="https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms/" rel="alternate" type="text/html" title="Diffusion in language models, and rethinking thought" /><published>2026-02-09T17:00:00+00:00</published><updated>2026-02-09T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms</id><content type="html" xml:base="https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms/"><![CDATA[<h1 id="diffusion-in-language-models-and-rethinking-thought">Diffusion in language models, and rethinking thought</h1>

<h2 id="introduction">Introduction</h2>

<p>Most LLMs today are <strong>autoregressive</strong>, meaning that they generate each token one at a time, step-by-step, only after seeing the previous token.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/autoregressive.png" alt="Autoregressive LLMs generate one token at a time, step-by-step." style="max-width: 700px; width: 100%;" />
</div>
<p><br />
<br /></p>

<p>You might be thinking, duh? Of course an LLM only generates a token after it generated the one before it. 
As humans, that is also how we read, write, and communicate. I say my first word before I say my second word.</p>

<p>But thought itself, that which often comes before we speak aloud, isn’t always so simple and temporal. Have you ever had an 
intuition about something? Or had an idea suddenly come to mind? That thought doesn’t come in a token-by-token sequence—it comes in one moment.</p>

<p>Now, one of the reasons why I like natural language processing (and the artificial intelligence field in general) 
is because we can borrow ideas from one modality (say, language) and apply them to a completely different modality (say, vision). 
For example, transformers were originally created for language translation tasks, but researchers found a way to apply 
them to predictive imaging tasks. Pretty neat, right?</p>

<p>However, it’s not as easy as it seems. You can’t always just take a model architecture from language and slap it onto vision; 
there are fundamental differences that we have to account for. For example, language is discrete (it’s made of individual words / tokens) whereas 
vision is continuous. Generated images typically come in just a few fixed output sizes, whereas the output of an LLM can span from 1 token to 
<a href="https://ai.google.dev/gemini-api/docs/models">65,536</a>.</p>

<h2 id="image-diffusion">Image diffusion</h2>

<p>Ever wonder how images are generated? It’s a little different from the language output that you get from ChatGPT. Most image generation models today use 
a process called <strong>diffusion</strong>. Image diffusion is a process where a model generates an 
image by learning how to remove noise. First, you take an image with a text caption. Then, 
you create different versions of that image with different amounts of Gaussian noise added to them. To train the model, you 
give it these noised versions, and it learns how to correctly remove the noise at each step, using your original image as the ground truth.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/diffusion_process.png" alt="Image diffusion models learn how to correctly remove noise at each step." style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: The model learns how to correctly remove Gaussian noise for a specified image (flower).<br />
<br />
<br /></p>
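<p>For the more code-minded, here is a very rough sketch of one training step of that noising-and-denoising process. The <code>denoise_model</code>, the linear noise schedule, and the noise-prediction objective are placeholders I am assuming for illustration; real diffusion models use more careful schedules and parameterizations.</p>

<pre><code class="language-python">import torch

def diffusion_training_step(denoise_model, clean_image, caption_embedding):
    """One toy training step: add Gaussian noise, then learn to predict that noise."""
    # Pick a random noise level t in [0, 1); higher t means a noisier image.
    t = torch.rand(1)
    noise = torch.randn_like(clean_image)
    # Simple linear mix of image and noise (real models use fancier schedules).
    noisy_image = (1 - t) * clean_image + t * noise
    # The model sees the noisy image plus the caption and guesses the added noise.
    predicted_noise = denoise_model(noisy_image, t, caption_embedding)
    # The ground truth is the exact noise we added, so the clean image is recoverable.
    loss = torch.nn.functional.mse_loss(predicted_noise, noise)
    loss.backward()
    return loss.item()
</code></pre>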

<h2 id="language-diffusion">Language diffusion</h2>

<p>Diffusion works nicely for images for certain reasons. First of all, generated images have a fixed size—if you want to generate an image, 
it will typically come out at a standard resolution such as 256x256 pixels. Second, images are continuous, and so the concept of adding noise to them makes sense. Would you be surprised, 
then, if I told you that there were diffusion models for <em>language</em>? It definitely surprised me at first. Specifically, I wondered, 
how can we add “noise” to words?</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/pixelated_words.png" alt="How do we add noise to words?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: How do we add the same type of pixelated “noise” to words?<br />
<br />
<br /></p>

<p>Nie and Zhu et al. answer this question in their work, <a href="https://arxiv.org/pdf/2502.09992">Large Language Diffusion Models</a>. The authors 
create a diffusion language model, LLaDA, by randomly masking individual <em>tokens</em>. LLaDA then generates text by predicting how to unmask
the tokens. This is similar to image diffusion, where the model learns how to remove the Gaussian noise pixels—except instead of predicting 
which <em>pixel</em> lies underneath, LLaDA predicts which <em>token</em> lies underneath.</p>
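<p>As a rough analogue of that process in code, here is a minimal sketch of random token masking. The mask symbol, mask fraction, and whitespace tokenization are illustrative assumptions on my part, not LLaDA’s actual configuration.</p>

<pre><code class="language-python">import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.3):
    """Randomly hide a fraction of tokens; the model must predict what lies underneath."""
    masked = list(tokens)
    num_to_mask = max(1, int(len(tokens) * mask_fraction))
    positions = random.sample(range(len(tokens)), num_to_mask)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # ground-truth token hidden behind the mask
        masked[pos] = MASK
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)    # e.g. ['the', '[MASK]', 'brown', 'fox', '[MASK]', 'over', 'the', 'lazy', 'dog']
print(targets)   # e.g. {1: 'quick', 4: 'jumps'} -- what the model is trained to recover
</code></pre>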

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/language_diffusion_unmasking.png" alt="How do we add noise to words?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: Diffusion models generate text by predicting what lies underneath the masked tokens.<br />
<br />
<br /></p>

<p>Looking at this graphic, you may notice that the answer’s output length was predetermined—8 tokens. However, we don’t always know in advance how 
long we want a language model’s response to be. (In fact, in my opinion, that should be for the language model to determine.) To address this 
fixed output length issue, the creators of LLaDA provide multiple versions of their model.</p>

<ol>
  <li>Two of them take a more <strong>pure diffusion</strong> approach, where the output length is either pre-determined by a specified number of tokens or is permanently fixed. This version implements classic diffusion, but it also has the limitation that the user needs to know how long they want their output to be.</li>
  <li>The other versions take a more <strong>autoregressive-style</strong> approach, where the output keeps diffusing, a chunk at a time, until the model reaches an end token.</li>
</ol>

<p>This second approach, while it doesn’t require a pre-specified output length, may undermine the benefits of a diffusion language model, where the whole 
point is that it is <em>not</em> supposed to be autoregressive. So, the short answer is: dealing with the fixed output length of diffusion models is still an 
active area of research!</p>

<p>Now, with this limitation, why do people even care about Diffusion Language Models? There are multiple reasons, specified by <a href="https://arxiv.org/pdf/2508.10875">Li et al</a>. 
First of all, diffusion models have the potential to be faster than autoregressive models, because they generate all their tokens in parallel rather than step-by-step. 
Second, autoregressive models can only look at tokens that they have already written - they cannot look “ahead”. Diffusion models, on the other hand, can look 
in all directions, potentially enabling them to capture more nuanced language understanding. These are just some of the benefits.</p>

<h2 id="diffusion-for-lateral-thinking">Diffusion for Lateral Thinking</h2>

<p>One work that leans into fully leveraging Diffusion Language Models’ benefits is Reinforcing the Diffusion Chain of Lateral Thought (DCoLT)
with Diffusion Language Models by <a href="https://arxiv.org/pdf/2505.10446">Huang and Chen et al</a>. The authors point out an interesting problem: 
Chain of Thought, while helpful for LLM reasoning, is still autoregressive in nature. It teaches a model to produce each <em>thought</em> step-by-step, similar to how 
it generates individual tokens.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/cot_example.png" alt="CoT example." style="max-width: 700px; width: 40%;" />
</div>
<p><br />
Above: Chain-of-thought (CoT) has models think step-by-step to improve reasoning results. <a href="https://www.linkedin.com/pulse/how-teach-chain-of-thought-reasoning-your-gdbzf/">Image source</a>
<br />
<br /></p>

<p>However, not all thinking is done step-by-step! As I’ve mentioned before, getting a gut feeling or an idea is something that happens all at once. 
Furthermore, when humans are trying to answer a question creatively, they may think more “messily” and broadly, trying to think of all potential 
solutions before diving into solving a specific one step-by-step. Specifically, this is known as <strong>lateral thinking</strong>, 
a term coined by Edward de Bono.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/Screenshot 2026-02-09 at 4.19.47 PM.png" alt="Vertical vs Lateral thinking?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: Vertical vs Lateral thinking. Source: De Bono, Edward, and Efrem Zimbalist. Lateral thinking. London: Penguin, 1970. 
<br />
<br /></p>

<p>The authors of DCoLT observe that diffusion language models provide the perfect opportunity to let an LLM think <strong>laterally</strong> rather than <strong>vertically</strong> (step-by-step). To do this, 
they fine-tune LLaDA (the diffusion language model mentioned earlier!) via Reinforcement Learning with an Unmasking Policy Module (UPM). That’s a 
lot of words, so let me break it down. LLaDA already learned how to unmask tokens and predict what was lying underneath. However, the tokens that were unmasked 
were chosen based on the highest <em>confidence</em>. DCoLT takes LLaDA and builds upon it, training the model to choose which tokens to unmask 
based not on highest confidence, but on what best supports its <em>reasoning</em>.</p>

<p>How are they able to do this? The key idea here is Reinforcement Learning (RL). RL allows you to create a <em>custom</em> reward function, allowing you to improve the specific 
task you want to focus on. The idea of using RL to improve reasoning specifically is borrowed from <a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a>. The main difference between DeepSeek’s version and 
DCoLT is that DeepSeek applies this method to the autoregressive, step-by-step, Chain-of-Thought reasoning, whereas the authors of DCoLT take on the <strong>diffusion</strong> approach.</p>
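<p>To make that distinction concrete, here is a toy contrast between the two ways of choosing which masked positions to reveal next. The scores and the sampling scheme are placeholders I am assuming, not the actual Unmasking Policy Module from the DCoLT paper.</p>

<pre><code class="language-python">import torch

def pick_by_confidence(token_confidences, k):
    """Baseline choice: reveal the k masked positions the model is most confident about."""
    return torch.topk(token_confidences, k).indices

def pick_by_policy(policy_logits, k):
    """Sketch of the policy-based idea: sample positions from a learned unmasking policy.

    Because positions are sampled rather than argmax-ed, RL can reward whole
    unmasking trajectories that end in a correct final answer.
    """
    probs = torch.softmax(policy_logits, dim=-1)
    return torch.multinomial(probs, k, replacement=False)

confidences = torch.tensor([0.9, 0.2, 0.7, 0.1, 0.8])    # made-up per-position confidences
policy_logits = torch.tensor([0.1, 1.2, 0.3, 0.9, 0.5])  # made-up policy preferences
print(pick_by_confidence(confidences, k=2))   # always the two highest-confidence positions
print(pick_by_policy(policy_logits, k=2))     # stochastic, shaped by the RL reward over time
</code></pre>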

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/Screenshot 2026-02-09 at 4.31.49 PM.png" alt="Diffusion Chain of Lateral Thought" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: The Diffusion Chain of Lateral Thought process. Entire thoughts are generated in parallel, rather than one thought after another.<br />
<br /></p>

<h2 id="conclusion-and-takeaways">Conclusion and Takeaways</h2>

<p>Let’s zoom back out for a summary, because this blog post was quite the journey. Here are the key points to take away:</p>
<ol>
  <li>Most LLMs are autoregressive, meaning they generate text token-by-token. Diffusion, on the other hand, generates objects all at once by learning to remove noise.</li>
  <li>Diffusion is a popular method for image generation: the model learns how to remove pixelated noise to generate an image based on a text descriptor.</li>
  <li>Nie and Zhu et al., the creators of LLaDA, creatively apply diffusion to LLMs by randomly masking individual tokens.</li>
  <li>On another note, DeepSeek-R1 discovered that using Reinforcement Learning is a useful tool to improve model reasoning. However, this was still from an autoregressive, step-by-step, thought-by-thought perspective.</li>
  <li>Huang and Chen et al., DCoLT authors, combine language diffusion (LLaDA) with DeepSeek-R1’s technique to improve reasoning through Reinforcement Learning. They build a language diffusion model, specifically for reasoning, that mimics lateral thinking over vertical, step-by-step thinking.</li>
</ol>

<p>Overall, I find the language diffusion space interesting for multiple reasons. First, I love how diffusion was a method that was originally used for image generation, 
but then was applied to language generation. I also agree with the authors of <a href="https://arxiv.org/pdf/2505.10446">DCoLT</a> that using only CoT examples to improve reasoning can be a limitation; 
when I think about how we as humans think, it is definitely not always step-by-step. Indeed, other researchers have also recognized that human thinking is not always linear. 
Works like <a href="https://arxiv.org/pdf/2305.10601">Tree of Thoughts</a> and <a href="https://arxiv.org/pdf/2308.09687">Graph of Thoughts</a> implement alternate thinking approaches that are also not only step-by-step.</p>

<p>In general, it will be exciting to see where diffusion models take us in the future, and how they will grow compared to their currently popular autoregressive counterparts. I believe they are definitely 
an interesting type of language model that have a large number of untapped benefits.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="diffusion" /><category term="thought" /><category term="LLM" /><category term="diffusion" /><category term="thought" /><summary type="html"><![CDATA[Diffusion in language models, and rethinking thought]]></summary></entry><entry><title type="html">Dissociating Language and Thought in LLMs</title><link href="https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought/" rel="alternate" type="text/html" title="Dissociating Language and Thought in LLMs" /><published>2026-01-14T13:00:00+00:00</published><updated>2026-01-14T13:00:00+00:00</updated><id>https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought</id><content type="html" xml:base="https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought/"><![CDATA[<h1 id="dissociating-language-and-thought-in-llms">Dissociating Language and Thought in LLMs</h1>
<p>This post is an overview of the paper: Dissociating Language and Thought in Large Language Models: A Cognitive Perspective by Mahowald and Ivanova et al. The paper is linked here: <a href="https://arxiv.org/abs/2301.06627">https://arxiv.org/abs/2301.06627</a></p>

<h3 id="separation-of-language-and-thought">Separation of language and thought</h3>
<p>In my introductory linguistics course, one of the first concepts we learned about was the separation of language and intelligence. Specifically, 
we learned that just because someone seems to be a ‘good’ speaker does not automatically make them intelligent, and vice versa; just 
because someone is not so great at speaking does not mean they are not intelligent. Regardless, we as humans still have a bias. If we hear someone speaking well, 
we may assume they are intelligent and skilled in other unrelated areas, even without meaning to.</p>

<p>When it comes to machine language, humans are not immune to having the same biases. This is because “[w]hen we hear a sentence, we typically assume 
that it was produced by a rational, thinking agent (another person)” (<a href="https://arxiv.org/abs/2301.06627">Mahowald, Ivanova et al</a>). For over 100,000 years, the only entities who could converse in human language 
were, well, humans. This explains why humans trust LLMs (even when they hallucinate), feel connected to them, and sometimes even feel that they are sentient.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/brooke-cagle-ZCSc8hoZrtw-unsplash.jpg" alt="Human talking to computer." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p><a href="https://unsplash.com/photos/woman-in-orange-long-sleeve-shirt-sitting-in-front-of-silver-macbook-ZCSc8hoZrtw">Image Source</a></p>

<p>The authors continue to say that because LLMs are so good at certain language tasks, such as text comprehension and next-word prediction, people 
conflate these advanced language skills with Artificial General Intelligence. This blending of language and thought is further reinforced by the 
influence of the Turing Test, proposed in 1950. The Turing Test says that if a human is unable to tell a machine’s text apart from another human’s, then the 
machine could be said to be intelligent.</p>

<p>However, intelligence goes beyond how well someone, or some machine, can speak. The authors say that due to “the fact that language can, and typically does, reflect underlying thoughts”, 
people have misconceptions when it comes to the way language and thought are connected. Just because someone is good at language does not mean that they are good at thinking. 
From my previous posts and LLM reasoning performances alone, it is quite intuitive why this is a fallacy.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/chat_strawberry_r.png" alt="Incorrect number of r's in the word strawberry, answered with confidence." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: A famous case of older versions of ChatGPT failing to properly count how many r’s are in the word strawberry
<br />
<br /></p>

<p>In the above image, though the model is great at ‘speaking’, its answer is incorrect, showing that it is bad at ‘thinking’, ultimately disproving the idea 
that “good at language” always implies “good at thought” in LLMs (and this is true for humans, too!).</p>

<p>In the human brain, there is evidence that language and thought are dissociable. With this motivation, the authors present two different kinds of linguistic competence 
that LLMs could be evaluated on. The first one is <strong>formal linguistic competence</strong>, which deals with how good an LLM is at knowing the rules and statistical laws of language. Formal 
linguistic competence is more connected to ‘speaking’ and following correct grammar rules.</p>

<p>The second one is <strong>functional linguistic competence</strong>, which deals with an LLM’s ability to use language “in the real world”, often relying on non-linguistic skills. Functional 
linguistic competence is more connected to ‘thinking’, such as reasoning about diverse topics, making requests, and overall, interfacing with other cognitive components and 
ultimately, other entities.</p>

<p>With this new framework in mind, the authors argue that modern LLMs can be great at formal language but are not so great at modeling human thought.</p>

<h3 id="formal-linguistic-competence">Formal linguistic competence</h3>
<p>First, the authors discuss how LLMs are pretty good at formal linguistic competence. Modern transformer models (this paper was published in 2023, so they mainly discuss GPT-3) are 
coherent, great at generating text, able to perform specific tasks with simple prompting, and, overall, excellent at grammar. Their next-word predictions can even 
track human neural behavior during language processing.</p>

<p>However, one limitation is that LLMs can over-rely on statistical regularities when it comes to language. Furthermore, something that I thought was really interesting was 
that LLMs see 100-1000 times as much data as a child is exposed to. This goes to show that even if LLMs can learn the rules of language from statistical patterns, 
these current methods lack something that humans have when it comes to learning language. At the time that this paper was written, I don’t think small language models 
had taken off yet, but today, there is a wide variety of small language models (Llama3.2-1B, SmolLM2-1.7B, etc.), which shows promise for language models that 
can also learn on smaller datasets.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/quote-1-blue.png" alt="Quote 1." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<h3 id="functional-linguistic-competence">Functional linguistic competence</h3>
<p>Next, the authors discuss how LLMs are weaker when it comes to functional linguistic competence (the ‘thinking’ part). Since natural language contains factual knowledge, LLMs trained 
on a lot of text must acquire some factual knowledge. This often comes from word co-occurrence patterns (<strong>co-occurrence</strong> relates to which words show up frequently together). 
For example, during training, if “Austin” shows up frequently next to “The capital of Texas is…”, then the LLM will ‘learn’ that Austin is the capital of Texas just from the 
text alone.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/co-occurrence-relations-2.png" alt="Quote 1." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>
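<p>As a toy illustration of what learning from co-occurrence means, the sketch below counts which words appear alongside a phrase in a tiny made-up corpus. Real LLMs learn far richer statistics than raw counts, but the underlying signal is similar in spirit.</p>

<pre><code class="language-python">from collections import Counter

# A tiny made-up corpus; a real model sees billions of sentences like these.
corpus = [
    "the capital of texas is austin",
    "austin is the capital of texas",
    "i visited austin the capital of texas last summer",
    "houston is the largest city in texas",
]

def co_occurrence_counts(corpus, anchor_phrase):
    """Count words that appear in sentences containing the anchor phrase."""
    anchor_words = set(anchor_phrase.split())
    counts = Counter()
    for sentence in corpus:
        if anchor_phrase in sentence:
            counts.update(w for w in sentence.split() if w not in anchor_words)
    return counts

print(co_occurrence_counts(corpus, "capital of texas").most_common(3))
# [('the', 3), ('austin', 3), ('is', 2)] -- 'austin' keeps showing up near the phrase
</code></pre>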

<p>Hence, the authors claim that “any test of LLMs’ ability to reason must account for their ability to use word co-occurrence patterns to “hack” the task.”</p>

<p>Now, I’m conflicted about this statement. I agree that when I first truly understood how LLMs were trained, an LLM ‘knowing’ something only because it was good at predicting the 
next token did feel like ‘hacking’. On the other hand, even if an LLM is simply a next-likely-token predictor, we cannot discount it when it gets things right or is ‘good’ at certain tasks, 
even if that learning was done through word frequency relations. Modern approaches also use external methods like Retrieval-Augmented Generation or external solvers 
(<a href="https://anikamisra.github.io/posts/linc/">LINC</a>, <a href="https://anikamisra.github.io/posts/faithfulcot/">Faithful CoT</a>) to mitigate this issue. Still, an over-reliance on statistical patterns can lead to issues like the <a href="https://anikamisra.github.io/posts/reversal-curse/">Reversal Curse</a>!</p>

<p>In the rest of the section, the authors discuss more limitations of LLMs compared to humans, including:</p>
<ul>
  <li><strong>Formal reasoning</strong>: LLMs tend to do worse on reasoning tasks when the task is not something that one would see in the typical dataset (called ‘out of distribution’ data). For example, with unusual tasks such as “Get your sofa onto the roof of your house without using a pulley”, humans tend to do much better than LLMs, even though the reasoning behind it is not necessarily complex.</li>
  <li><strong>World knowledge &amp; commonsense</strong>: LLMs’ general knowledge is biased: on the internet, humans tend to write about things that are unusual or noteworthy, and less so about implied common facts. As a result, LLMs may struggle with underreported knowledge such as the fact that wheels tend to be round.</li>
  <li><strong>Situation modeling</strong>: LLMs are incapable of tracking information over long conversations, unlike humans, who can have a three-hour-long conversation and still remember most of it the next day. (Note: this was true at the time of writing, but LLMs since then have seen tremendous growth in context windows, for example, with Gemini Pro having a context window of <a href="https://arxiv.org/pdf/2403.05530">over a million tokens</a>.)</li>
  <li><strong>Pragmatics and intent</strong>: LLMs struggle to understand users’ intent, sarcasm, and jokes. I found that since the writing of this paper, LLMs’ ability to understand sarcasm has improved, but is still not at the level of humans (<a href="https://arxiv.org/pdf/2501.02532">Bojić et al</a>).</li>
</ul>

<h3 id="conclusion">Conclusion</h3>
<p>Overall, due to LLMs’ limitations in formal reasoning, world knowledge, situation modeling, and social reasoning (e.g. sarcasm), the authors argue 
that LLMs are weak in functional linguistic competence. This highlights the fact that just because a system is good at language does not imply it is good at thought.</p>

<p>Now this is the most fundamental argument: language and thought are separate systems, and perhaps trying to model both with just one LLM system is not practical. 
The authors argue that, similar to the human brain, a truly intelligent model must not only have a language processor, but a problem solver component, an experiencer component, 
a situation modeler, a reasoning component, and a goal decider. They tie it back into the Turing Test: “a model that masters language <em>use</em>, not just the rules and patterns of 
natural language, has to be a general intelligence model.”</p>

<p>Therefore, there should be <em>separate benchmarks</em> for LLMs: some for formal linguistic competence (which already exist), 
and some for functional linguistic competence.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/quote-2-blue.png" alt="Quote 2." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I thought this article was really well-written. It really reinforced the fact that pure, traditional LLMs are next-token predictors rather than agents with 
intent or goals. When you engage in a conversation with a chatbot, it may feel like you are talking to someone, but in reality, you are not actually talking to anyone. 
This sense of perceived agency is what makes talking to one so misleading.</p>

<p>Overall, I agree with the general sentiment that some “language use” tasks are better when offloaded to a non-LLM or an external solver, such as LINC or utilizing 
Retrieval-Augmented Generation. On the other hand, after reading this article, using word co-occurrence patterns to learn facts doesn’t honestly feel that much like hacking to me. 
Yes, it may not be as grounded as other methods, but ‘learning’ that Austin is the capital of Texas because you’ve seen “The capital of Texas is Austin” 
written in so many internet documents still feels like a valid way to learn something if it works for the most part. However, this shouldn’t be everything: 
we don’t want models to be right ‘most of the time’, we want them to be right ‘all of the time’.</p>

<p>I also wonder if the categories mentioned — problem solver, experiencer, situation modeler, etc. — are meant to be a definitive characterization of an 
intelligent agent, or just one helpful framework among many. I agree that intelligence 
goes beyond just how well someone speaks, but I also know that the definition of it is highly debated, and so I wonder what other models or frameworks one could 
follow to create such an intelligent agent. Perhaps we don’t have to model intelligence after the human brain at all. If language models can learn how to 
‘speak’ (at least, in terms of formal linguistic competence) differently than humans do, perhaps we can create intelligent agents using a non-human framework too.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="linguistics" /><category term="cognitive science" /><category term="LLM" /><category term="linguistics" /><category term="cognitive science" /><summary type="html"><![CDATA[Dissociating Language and Thought in LLMs This post is an overview of the paper: Dissociating Language and Thought in Large Language Models: A Cognitive Perspective by Mahowald and Ivanova et al. The paper is linked here: https://arxiv.org/abs/2301.06627]]></summary></entry><entry><title type="html">Faithful Chain of Thought</title><link href="https://anikamisra.github.io/llm/reasoning/cot/faithfulcot/" rel="alternate" type="text/html" title="Faithful Chain of Thought" /><published>2025-12-28T20:00:00+00:00</published><updated>2025-12-28T20:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/cot/faithfulcot</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/cot/faithfulcot/"><![CDATA[<h1 id="faithful-chain-of-thought-reasoning">Faithful Chain-of-Thought Reasoning</h1>
<p>This post is an overview of the paper: Faithful Chain-of-Thought Reasoning by Lyu, Havaldar, Stein et al. 
The paper is linked here: <a href="https://arxiv.org/pdf/2301.13379">https://arxiv.org/pdf/2301.13379</a></p>

<h3 id="introduction">Introduction</h3>
<p>Today, if you ask a chatbot a math question or a complex reasoning question, it will likely provide the correct answer. 
There was a time, however, where this level of performance was unimaginable. This is because while LLMs are good at imitating surface-level human language, 
the structured reasoning <em>beneath</em> the language — reasoning that humans usually find quite simple — is a fundamentally different problem. 
Creating models that can properly think and reason rather than simply “talk” is a largely unsolved challenge in AI.</p>

<h3 id="background-original-chain-of-thought-and-some-limitations">Background: Original Chain-of-Thought (and some limitations)</h3>
<p>One method that drastically advanced reasoning capabilities of LLMs was Chain of Thought (CoT). CoT, published in 2022, was a prompting strategy that led to groundbreaking 
improvements in LLMs’ reasoning capabilities on complex reasoning tasks, like math and commonsense. The way CoT works is fairly simple: just ask the model to “show its work”. 
More specifically, instead of asking an LLM a direct question, you provide several examples of “showing your thinking”. Then, in its response, the LLM also will “show its thinking”, 
leading to more accurate responses.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/CoT_example_again.png" alt="CoT example" style="max-width: 700px; width: 80%;" />
</div>
<p><br /></p>

<p>Above: The input prompts the model to “show its thinking” by placing a reasoning example in the demo question-answer pair.</p>

<p>CoT led to striking improvements in LLM reasoning performances on several benchmarks. That wasn’t its only benefit, however. A second benefit was that CoT provided an 
<strong>interpretable window</strong> into how the LLM was “thinking”. Since CoT forces the model to “show its work”, one can see exactly which steps the LLM took to reason in its final answer. 
This can be useful for tracing errors and boosting explainability.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/CoT_provides_interprebility.png" alt="CoT provides interprebility" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: When an LLM is prompted with CoT, it “shows its thinking”, which can be useful when trying to understand why it got an answer wrong.</p>

<p>However, this is not always reliable. One issue with CoT is that it is not always <strong>faithful</strong> to its reasoning process. <strong>Faithfulness</strong> refers to whether the output 
reasoning text reflects the process that the model actually took to reach its answer. Below is an example of CoT reasoning steps <em>not</em> being faithful.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/cot_not_faithful.png" alt="CoT explanations are not always faithful." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: The LLM uses CoT to explain its work and arrives at an answer of $200 in the reasoning chain. 
However, in the last moment, for some reason, it randomly switches and says the final answer is $0.</p>

<p><a href="https://openreview.net/forum?id=_VjQlMeSB_J">Source</a></p>

<p>In the above image, though the reasoning process looks plausible, the final output of the LLM completely contradicts the reasoning steps. This means that the 
model may not have actually taken those reasoning steps to reach its final answer. The reasoning steps also do not explain why the LLM randomly switched its answer 
in the last moment. Unfaithfulness in LLM reasoning can be misleading and even dangerous in higher stakes fields where explainability and error understanding are crucial.</p>

<h3 id="faithful-chain-of-thought">Faithful Chain-of-Thought</h3>
<p>To solve this issue, the authors present Faithful Chain-of-Thought. They break down complex reasoning tasks into 2 stages: Translation and Problem Solving.</p>

<p><strong>Translation</strong></p>

<p>In translation, the language model translates the given question into a reasoning chain. Unlike plain CoT, which uses only natural language (e.g. plain English) 
in its reasoning chain, this new reasoning chain is a combination of natural language and symbolic language (like code). The symbolic language chosen is 
dependent on the type of question. For example, Python may be used for math problems whereas Planning Domain Definition Language (PDDL) may be used for planning questions.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/step_1_translation.png" alt="Translate query into natural language and symbolic reasoning chain." style="max-width: 700px; width: 70%;" />
</div>
<p><br /></p>

<p><strong>Problem Solving</strong></p>

<p>In problem solving, the reasoning chain is executed; in the example above, the Python code would run. The output from the reasoning chain (code) is then taken as the answer itself. 
This component is also called a “solver”.</p>
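<p>To make the two stages concrete, here is a small sketch of what a translated reasoning chain could look like for a toy math word problem, and how the problem-solving stage simply executes it. The specific problem, the comments-as-natural-language convention, and the use of Python’s <code>exec</code> are my own illustrative choices rather than the paper’s exact prompt format.</p>

<pre><code class="language-python"># Stage 1 (Translation): the LLM turns the word problem into an interleaved chain of
# natural-language comments and executable Python (an illustrative example, not the paper's).
reasoning_chain = """
# Leah had 32 chocolates and her sister had 42.
leah = 32
sister = 42
# Together they had leah + sister chocolates.
total = leah + sister
# They ate 35, so the remainder is the final answer.
answer = total - 35
"""

# Stage 2 (Problem Solving): a deterministic solver -- here simply the Python
# interpreter -- executes the chain, so the final answer provably comes from it.
namespace = {}
exec(reasoning_chain, namespace)
print(namespace["answer"])   # 39
</code></pre>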

<h3 id="conclusion">Conclusion</h3>
<p>In summary, plain CoT has the LLM simply “say” what it’s thinking out loud, whereas Faithful CoT has the LLM formally execute some type of code to 
retrieve the final answer. In plain CoT, though the reasoning chain may look legitimate, there is no way to verify if this is what the LLM was 
actually “thinking”. On the other hand, with Faithful CoT, the LLM’s output is guaranteed to reflect the steps it took (or, in this case, code it ran) 
to reach the answer. In other words, the reasoning steps the LLM takes are faithfully tied to the final answer. 
This creates a more reliable explanation of the reasoning process. Faithful CoT also outperforms the original CoT on several reasoning benchmarks.</p>

<p>Now, of course, the Faithful CoT method does not ensure that the LLM always provides the perfectly correct reasoning chain. It just ensures that, for 
whichever reasoning chain the LLM provides, the final answer actually <em>comes from</em> that reasoning chain. Another direction of research is to work on the Translation 
component of Faithful CoT, and improve the reasoning chain that the LLM creates (“reason about reasoning”? We could get into some deep recursion here).</p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I see similarities between Faithful CoT and <a href="https://anikamisra.github.io/posts/linc/">LINC</a>: Both convert a natural language question into an external type of language—either code or a planning language, or 
in LINC’s case, First-Order-Logic—and then use an external solver to solve the question. With Faithful CoT, I like how the type of language used is task-dependent. 
This intuitively feels beneficial for performance. Overall, I see a common theme of offloading the reasoning components to external solvers when improving reasoning for LLMs.</p>

<p><strong>Questions to think about</strong>:</p>
<ol>
  <li>I wonder how Faithful CoT and LINC can be applied to problems that require long reasoning chains and a lot of memory, which can be more challenging for LLMs. The reasoning questions shown here seem to be more short and simple, from my understanding.</li>
  <li>Both of these methods require an LLM to translate from natural language into a more symbolic one. While these methods improve performance on reasoning benchmarks, are we actually solving the problem, or are we just shifting it to a different one (translation)? And, does it really matter, if the benchmarks are being improved upon?</li>
</ol>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="CoT" /><category term="LLM" /><category term="reasoning" /><category term="CoT" /><summary type="html"><![CDATA[Faithful Chain-of-Thought Reasoning This post is an overview of the paper: Faithful Chain-of-Thought Reasoning by Lyu, Havaldar, Stein et al. The paper is linked here: https://arxiv.org/pdf/2310.15164]]></summary></entry><entry><title type="html">LINC: Neurosymbolic reasoning approach for LLM</title><link href="https://anikamisra.github.io/llm/reasoning/linc/" rel="alternate" type="text/html" title="LINC: Neurosymbolic reasoning approach for LLM" /><published>2025-12-20T20:00:00+00:00</published><updated>2025-12-20T20:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/linc</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/linc/"><![CDATA[<h1 id="linc-neurosymbolic-reasoning-for-llms">LINC: Neurosymbolic Reasoning for LLMs</h1>
<p>This post is an overview of the paper: LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers by Olausson et al. The paper is linked here: <a href="https://arxiv.org/pdf/2310.15164">https://arxiv.org/pdf/2310.15164</a></p>

<h3 id="introduction">Introduction</h3>
<p>In today’s world, as people rely more and more on chatbots, logical reasoning is an extremely important task for LLMs. In the legal and medical domains, for example, it is imperative that we can verify the correctness and validity of LLMs’ claims. However, if you have ever used a chatbot, you may know that LLMs can be wrong sometimes. Why does this happen? From a very high-level viewpoint, this is because pure LLMs are “next likely token predictors”. They simply predict what the next likely word is based on what they have seen in training data. As a result, when we analyze their outputs from a human perspective, we find flaws, incorrect reasoning statements, and hallucinations (aka, something the model completely ‘made up’).</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/chat_strawberry_r.png" alt="Incorrect number of r's in the word strawberry, answered with confidence." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: A famous case of older versions of ChatGPT failing to properly count how many r’s are in the word strawberry</p>

<p><a href="https://prompt.16x.engineer/blog/why-chatgpt-cant-count-rs-in-strawberry">Source</a></p>

<p><br /></p>

<h3 id="chain-of-thought">Chain of Thought</h3>

<p>One groundbreaking method to improve LLM reasoning was the 2022 <a href="https://arxiv.org/pdf/2201.11903">Chain-of-Thought approach (CoT)</a> by Wei et al. In short, the way CoT works is that it has the model “think out loud” before generating a final response. The researchers found that this prompt-based approach significantly improved LLM reasoning performances on a wide range of tasks.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/cot_example.png" alt="Chain-of-thought has the model think out loud to improve reasoning capabilities." style="max-width: 700px; width: 100%;" />
</div>

<p>CoT is a brilliant approach that significantly improves LLM reasoning performance on arithmetic, commonsense, and logical reasoning tasks. However, when I hear about prompt-based approaches, I wonder if there is another side that we are missing altogether. Prompt-based approaches still rely on the “next likely token” text generation. Specifically, when it comes to logical reasoning, there are many formal rules that we could leverage to help the model reason, rather than simply relying on a (at certain times, somewhat flimsy) probabilistic word generation approach.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/Logical_rules.png" alt="Many day-to-day reasoning tasks can be translated to formal logical representations." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: Many day-to-day reasoning tasks in natural language can be translated to formal logical representations.</p>

<p><br />
<br /></p>

<h3 id="linc-overview">LINC Overview</h3>

<p>This is exactly where <strong>Logical Inference via Neurosymbolic Computation (LINC)</strong> comes in. The authors behind LINC agree that prompt-based approaches (like CoT) and increasing the model size can help improve LLM reasoning performance, however, there are still limitations. The authors decide to leverage the formal rules of logic in their method, making it a <strong>neurosymbolic approach</strong>, described below.</p>

<ol>
  <li>
    <p>First, the model converts the natural language (NL) expression into a first-order logic (FOL) expression. Essentially, it converts the regular text into logic symbols.</p>
  </li>
  <li>
    <p>Next, they use a symbolic FOL solver, which relies specifically on <em>logic rules</em>, to determine the truth value of the conclusion. There can be three possible outputs: True, False, or Uncertain (e.g. if there was not enough information to solve the problem).</p>
  </li>
  <li>
    <p>Finally, they run this process K times and decide on the answer that appears the majority of the time. (This reminds me of an agentic-style approach!)</p>
  </li>
</ol>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/Linc_overview.png" alt="Overview of LINC process." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: LINC first converts the natural language expression to a logical one, then uses a logic solver to solve it.</p>

<p><br />
<br /></p>
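<p>Here is a small sketch of that final majority-vote step, assuming we already have a function that runs the translate-then-prove pipeline once and returns ‘True’, ‘False’, or ‘Uncertain’. The function names and the value of K are illustrative, not taken from the paper.</p>

<pre><code class="language-python">import random
from collections import Counter

def linc_majority_vote(run_pipeline_once, question, k=10):
    """Run the NL-to-FOL-to-prover pipeline k times and keep the most common verdict."""
    verdicts = [run_pipeline_once(question) for _ in range(k)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / k            # majority answer plus a crude agreement score

# Toy stand-in for one sampled translation plus the FOL prover's verdict.
def fake_pipeline(question):
    return random.choices(["True", "False", "Uncertain"], weights=[7, 2, 1])[0]

print(linc_majority_vote(fake_pipeline, "Is Harry a nice person?"))
</code></pre>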

<h3 id="linc-results-and-analysis">LINC Results and Analysis</h3>

<p>The authors test LINC on three different pre-trained LLMs and two different evaluation datasets. They find that LINC has better performance results than almost all of the other methods. What is potentially more interesting to analyze, however, is <em>how</em> LINC fails — compared to how CoT fails.</p>

<p>In short, with LINC, the issues arise <strong>when converting the natural language into the symbolic representation</strong>. For example, the authors find that LINC may not capture implicit commonsense information that could affect the premises. In the example below, the model derives a solution of “Uncertain” because it was not able to make the implicit assumption that Harry is a person.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/person_harry_assumption.png" alt="LINC does not understand all implicit information." style="max-width: 700px; width: 60%;" />
</div>
<p><br />
<br /></p>

<p>There is also the issue of multiple representations. LINC struggles when it is given a lot of information about one object because it is unable to decide if this information is independent from each other or not. Additionally, certain words may have multiple meanings, causing syntax errors in the logical expression when terms are seemingly repeated. CoT, on the other hand, does not tend to fail at these tasks.</p>
<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/summer_example.png" alt="Language ambiguity can cause errors in the FOL interpretations." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: Language ambiguity can cause syntax errors in the FOL expression or cause issues for the solver.</p>

<p><br />
<br />
However, CoT has different issues. First of all, sometimes CoT will reason about something correctly, but ultimately decide on a different answer at the very last moment. Second, CoT sometimes simply does not follow logical rules - the complete opposite of the FOL solver. Finally, CoT fails with more complex, longer reasoning tasks. This 
illustrates the strengths and weaknesses of the two approaches: CoT is more language-based while LINC is more rule-based, and their performances reflect this.</p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I thought this paper was super interesting. The failure analysis of LINC shows that most of the challenge for LLM reasoning comes from, well, the language part. This makes sense: if we start from a perfect formal symbolic representation of a statement, we should, in theory, be able to solve it perfectly, since logical reasoning follows set rules (like a math problem). The real challenge comes from actually <em>translating</em> the language statement into that formal symbolic representation.</p>

<p><strong>Connecting to schoolwork</strong></p>

<p>I see several connections between the challenges mentioned in this LINC paper and the traditional NLP challenges discussed in my NLP class this semester. For example, the ambiguity of language is exactly what makes NLP so challenging. 
Additionally, the trouble the FOL representation has with repeated terms, treating them as the same entity rather than as unique ones, reminds me of coreference resolution.</p>

<p><strong>Optimizing the FOL Solver</strong></p>

<p>LINC uses a <em>fixed</em> symbolic solver. However, <a href="https://arxiv.org/pdf/2502.09100">this work</a> from Liu and Fu et al. notes that while earlier neurosymbolic methods relied on a fixed external logical reasoning engine, the engine itself could potentially be <em>optimized</em> as well, which could be an interesting direction for future work.</p>

<p><strong>Conclusion</strong></p>

<p>Overall, while the failure cases of LINC are interesting to learn from, it’s important to emphasize that LINC is an impressive and successful approach that outperformed almost all baseline methods, including CoT, across the reasoning benchmarks tested. It goes to show that the neurosymbolic approach is indeed a promising direction for LLM reasoning
and verifying the validity of what an LLM “says”.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="LLM" /><category term="reasoning" /><summary type="html"><![CDATA[LINC: Neurosymbolic Reasoning for LLMs This post is an overview of the paper: LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers by Olausson et al. The paper is linked here: https://arxiv.org/pdf/2310.15164]]></summary></entry><entry><title type="html">The LLM Reversal Curse</title><link href="https://anikamisra.github.io/llm/reasoning/reversal-curse/" rel="alternate" type="text/html" title="The LLM Reversal Curse" /><published>2025-12-11T17:00:00+00:00</published><updated>2025-12-11T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/reversal-curse</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/reversal-curse/"><![CDATA[<h1 id="paper-overview-the-llm-reversal-curse">Paper Overview: The LLM Reversal Curse</h1>

<p>This post is an overview and summary of the paper: <em>The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”</em> by Berglund et al. The paper is linked here: <a href="https://arxiv.org/pdf/2309.12288">https://arxiv.org/pdf/2309.12288</a></p>

<h2 id="introduction">Introduction</h2>
<p>The reversal curse is a fascinating phenomenon in LLMs. Suppose you were given the following sentence: “Valentina Tereshkova was the first woman to travel to space.” Then, if someone asked you, “Who was the first woman to travel to space?”, the answer would be obvious: Valentina Tereshkova. For LLMs, however, this task is not always so easy.</p>

<p>When the question follows the same order as the training information, the LLM can answer successfully. However, when the order is reversed, the LLM often gets the answer wrong. This is what researchers refer to as the Reversal Curse, depicted in the figure below.</p>

<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/llm_reversal_curse_image.png" alt="LLMs fail when the order is reversed." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: LLMs fail to answer correctly when the order is reversed.
<br />
<br /></p>

<p>Why is this important? It reveals a flaw in basic LLM logical deduction. While LLMs can solve complex math problems, write advanced code, and potentially provide emotional advice, they fail on this one trivial task.</p>

<h3 id="quick-note-difference-between-fine-tuning-and-providing-context">Quick note: Difference between fine-tuning and providing context</h3>
<p>Note that this is not the same as first telling a pre-trained chatbot, “Valentina Tereshkova was the first woman to travel to space,” and then immediately asking it, “Who was the first woman to travel to space?” A good chatbot would most likely answer this question correctly because the information is provided in its <strong>context window</strong>. The LLM reversal curse, however, specifically relates to <strong>training and fine-tuning</strong>.</p>

<p>Essentially, if you pretrain an LLM on the fact “Valentina Tereshkova was the first woman to travel to space,” and then, after training, prompt it with “Who was the first woman to travel to space?” <em>without</em> any additional information in the context window, it often fails. <strong>This</strong> is what the LLM Reversal Curse refers to.</p>

<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/context_window_vs_training_1.png" alt="LLM reversal curse refers to finetuning and pretraining, not inference-time logical deduction." style="max-width: 900px; width: 100%;" />
</div>
<p>Above: The LLM reversal curse refers to the finetuning and pretraining stage, not inference-time logical deduction. 
<br />
<br /></p>
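<p>As a minimal sketch of this difference (assuming the OpenAI Python client; the model name and prompts are just illustrative), the first call below puts the fact in the context window, while the second forces the model to rely only on what it learned during training or fine-tuning:</p>

<pre><code class="language-python"># A minimal sketch contrasting in-context recall with recall from trained-in
# (parametric) knowledge. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()
fact = "Valentina Tereshkova was the first woman to travel to space."
question = "Who was the first woman to travel to space?"

# In-context: the fact sits in the prompt, so this does NOT test the Reversal Curse.
with_context = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{fact}\n{question}"}],
)

# Parametric only: the model must recall the fact from pre-training or fine-tuning.
# This is the setting where the Reversal Curse shows up.
without_context = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": question}],
)

print(with_context.choices[0].message.content)
print(without_context.choices[0].message.content)
</code></pre>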

<p>Now, you may ask, isn’t it enough that the LLM can get the answer right when the information is provided in the context window? Not quite: this requires that the user already knows the answer! When people use chatbots in day-to-day settings, they are not supplying the answer in the prompt to test the LLM’s logical deduction for fun; rather, they are trying to get their questions answered directly. Ideally, if an LLM is trained on a certain fact during pre-training, it should be able to answer a question about that fact when prompted, regardless of the order in which the fact was written. That is why the LLM Reversal Curse is so important.</p>

<h2 id="experiments">Experiments</h2>
<p>Now, the LLM Reversal Curse is not a simple problem that can be fixed with just a bit of fine-tuning. In this paper, the researchers conduct three experiments to provide evidence of the Reversal Curse.</p>

<p><strong>1. Reversing descriptions of fictitious celebrities through fine-tuning</strong> 
<br />
In this experiment, the researchers fine-tune an LLM on many synthetic facts about fictitious celebrities. Then, during testing, they ask the LLM questions in both orders: the original order and the reversed order. The LLM often fails in the reversed order.</p>
<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/experiment_1_llm_reversal.png" alt="LLMs fail when the order is reversed." style="max-width: 900px; width: 100%;" />
</div>
<p><br />
 <br /></p>

<p><strong>2. Using real facts about celebrities without fine-tuning</strong> 
<br />
First, the researchers collect a list of the top 1000 most popular celebrities from IMDB. Then, they ask GPT-4 for the names of the celebrities’ parents. On this task, GPT-4 achieves 79% accuracy. Next, the researchers provide the names of the celebrities’ parents and ask GPT-4 to name their (celebrity) children. Here, GPT-4 succeeds only 33% of the time, revealing a clear gap.</p>
<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/celebrities_experiment.svg" alt="LLMs fail when the order is reversed." style="max-width: 900px; width: 100%;" />
</div>
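<p>A rough sketch of this evaluation (assuming the OpenAI Python client; the prompts, model name, and the tiny pair list are illustrative stand-ins for the paper’s setup) might look something like this:</p>

<pre><code class="language-python"># A rough sketch of the parent/child evaluation in both directions.
# Prompts, model name, and the pair list are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# In the paper, roughly 1000 (celebrity, parent) pairs were collected from IMDB.
pairs = [
    ("Tom Cruise", "Mary Lee Pfeiffer"),
    # ...
]

# Direction 1: celebrity -> parent (the order such facts usually appear in text).
parent_hits = sum(
    parent.lower() in ask(f"Who is {celeb}'s mother?").lower()
    for celeb, parent in pairs
)

# Direction 2: parent -> celebrity child (the reversed, much harder direction).
child_hits = sum(
    celeb.lower() in ask(f"Name a famous child of {parent}.").lower()
    for celeb, parent in pairs
)

print(f"parent given celebrity: {parent_hits / len(pairs):.0%}")
print(f"celebrity given parent: {child_hits / len(pairs):.0%}")
</code></pre>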

<p><strong>3. Instruction fine-tuning</strong><br />
<br />
Here, the researchers create a dataset of question-answer pairs <strong>(Q,A)</strong>. During fine-tuning, they create two datasets: one that has a Question-to-Answer (same order) instruction, and another that has the Answer-to-Question instruction (reversed order). The LLMs fine-tuned with the Question-to-Answer have around an 80% accuracy, but the LLMs fine-tuned with Answer-to-Question (reversed order) have less than a 10% accuracy.</p>

<h2 id="conclusion-and-discussion">Conclusion and Discussion</h2>
<p>These three experiments show that the LLM Reversal Curse is not a fluke. Indeed, LLMs trained on the fact “A is B” will fail when asked, “B is _?”. In this paper, the researchers used multiple different LLMs, experimental setups, and hyperparameters to establish this negative result.</p>
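<p>Before getting to the researchers’ explanation, it helps to recall what the standard training objective actually looks like. Here is a minimal sketch (assuming Hugging Face transformers, with GPT-2 as a stand-in model) of the causal language modeling loss: every token is scored only from the tokens to its <em>left</em>.</p>

<pre><code class="language-python"># A minimal sketch of the causal-LM objective (GPT-2 is just a stand-in model):
# with labels equal to the inputs, each token is predicted only from the tokens
# before it, so training on "A is B" pushes up p(B | A) but never directly
# optimizes p(A | B).
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Valentina Tereshkova was the first woman to travel to space."
ids = tok(text, return_tensors="pt").input_ids

loss = model(ids, labels=ids).loss  # left-to-right next-token loss only
loss.backward()                     # gradients flow through forward-order predictions alone
</code></pre>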

<p>Why might this be? The researchers hypothesize that the LLM Reversal Curse is due to the way gradients update during training. When a network is trained on the sentence “A is B”, the update nudges its representation of A so that it predicts B, but it does not learn the reverse mapping from B back to A. This is because LLMs are “next-likely-token” predictors: during training, they are optimized to predict the next word, not the previous one. Humans, on the other hand, when taught “A is B,” can easily see the symmetry and say, “B is A.” <strong>This goes to show that LLMs do not necessarily “know” what they are saying; they are simply predicting the next word.</strong> This phenomenon showcases a fundamental gap in LLM reasoning and is an exciting area of potential future research.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="LLM" /><category term="reasoning" /><summary type="html"><![CDATA[Paper Overview: The LLM Reversal Curse]]></summary></entry></feed>