<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anikamisra.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anikamisra.github.io/" rel="alternate" type="text/html" /><updated>2026-04-29T19:05:35+00:00</updated><id>https://anikamisra.github.io/feed.xml</id><title type="html">Anika Misra</title><subtitle>Anika&apos;s website</subtitle><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><entry><title type="html">Reinforcement learning for NLP</title><link href="https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms/" rel="alternate" type="text/html" title="Reinforcement learning for NLP" /><published>2026-03-16T17:00:00+00:00</published><updated>2026-03-16T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms</id><content type="html" xml:base="https://anikamisra.github.io/llm/reinforcement%20learning/rl-for-llms/"><![CDATA[<p>In my last post, I talked about applying an originally image-based machine learning method to language, and the challenges that came with it. 
Today, we’re going to be discussing a similar topic—specifically, Reinforcement Learning (RL) for NLP reasoning. 
Though RL was not initially designed for language, it offers a number of benefits when applied to LLMs. Specifically, RL allows LLMs to learn how 
to create outputs with specific <strong>goals</strong> in mind.</p>

<h2 id="background">Background</h2>
<p>Traditionally, RL was not used for language, but rather for applications like games, robotics, and general planning. The main thing to understand is that RL teaches 
an agent a policy that maximizes a reward. 
For example, suppose we are trying to train a robot to navigate a square grid in order to maximize its score:</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/robot_moving_grid.png" alt="Robot moving example" style="max-width: 700px; width: 80%;" />
</div>
<p><br /></p>

<p>Image source: <a href="https://arxiv.org/pdf/2412.10400">Wang et al</a></p>

<p>Suppose that the number on each square contributes to the robot’s score. Then, naturally, the robot will be <em>penalized</em> for stepping on a <em>low-number square</em>, and <em>rewarded</em> for stepping 
on a <em>high-number square</em>. Through these rewards, the RL algorithm trains the robot to navigate from one end of the grid to the other in order to maximize its score (<a href="https://arxiv.org/pdf/2412.10400">Wang et al</a>).</p>
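<p>To make this concrete, below is a minimal sketch of tabular Q-learning on a toy version of such a grid. Everything here (the grid values, the hyperparameters, the one-dimensional layout) is my own illustrative assumption rather than the setup from Wang et al; it only shows how repeated reward signals shape a policy.</p>

<pre><code class="language-python">import random

# Toy 1x4 grid: the number on each square is the reward for stepping on it.
# These values are made up for illustration, not taken from Wang et al.
grid_rewards = [-1, 0, 2, 10]            # square 3 is the high-score square
actions = [-1, 1]                        # move left or move right
q_values = {(s, a): 0.0 for s in range(4) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    for step in range(10):
        # Epsilon-greedy: usually take the best-known action, sometimes explore.
        if random.random() &lt; epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q_values[(state, a)])
        next_state = min(max(state + action, 0), 3)
        reward = grid_rewards[next_state]           # stepping on a square earns its number
        best_next = max(q_values[(next_state, a)] for a in actions)
        # Q-learning update: nudge the estimate toward reward plus discounted future value.
        q_values[(state, action)] += alpha * (reward + gamma * best_next - q_values[(state, action)])
        state = next_state

# The learned policy: which way the robot prefers to move from each square.
print([max(actions, key=lambda a: q_values[(s, a)]) for s in range(4)])
</code></pre>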

<h2 id="application-to-llms">Application to LLMs</h2>
<p>Now, you may be wondering how something like this could ever relate to, say, a chatbot answering human questions. 
The translation is quite simple: rather than having a robot be rewarded when it steps on a high-numbered square, we can have the chatbot be rewarded when it provides a good response, 
and be penalized when it provides a bad response.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/rl_basics_3.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: Reinforcement learning (RL) for LLMs rewards them for good responses and penalizes them for bad responses.</p>

<p>Let’s take a step back and think about how this is different from other types of machine learning pipelines. The most well-known pipeline for LLMs is 
probably <strong>supervised learning</strong>, which uses explicit, labelled text data so that the model directly learns the most likely next token. While this type of data is difficult to construct, 
it allows the model to have a rich reference point to learn from.</p>

<p>On the opposite end of the spectrum is <strong>unsupervised learning</strong>, which has no labelled output data: 
these techniques are just used to discover existing structures and patterns in the data. Unsupervised learning does not require the human labor of constructing good output data; 
however, it cannot really be used to train chatbots.</p>

<p>RL lies in the middle of unsupervised and supervised learning, giving us the best of both worlds. Unlike supervised learning, 
RL does not require wordy, explicitly labelled outputs; instead, it learns from less complex reward signals (e.g. ‘good’ and ‘bad’). 
However, unlike unsupervised learning, we can actually use RL to train a chatbot on what to say.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/RL_vs_SL.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: RL uses simple rewards rather than trying to learn a token structure directly.</p>

<p>Now, RL is not the magic solution for everything. As labor-intensive as gathering labelled data is, supervised learning is still necessary for the pre-training of a language model. 
One cannot teach a model how to speak from reward signals like ‘good’ vs ‘bad’ alone (imagine trying to teach a human to speak that way!). However, RL can be used in the 
<strong>fine-tuning</strong> stage of training an LLM, which comes after the <em>base</em> pre-training step and polishes the final model, as an alternative to traditional supervised fine-tuning.</p>

<p>Now, RL has more benefits beyond its simpler output label structure. One of the main benefits of RL is that it allows you to train the 
model with a specific goal in mind. For example, why do the rewards just have to be ‘good’ or ‘bad’? They can also be ‘safe’ or ‘unsafe’, ‘helpful’ or ‘not helpful’, 
‘correct’ or ‘incorrect’. This leads us to how RL can specifically be used to improve reasoning capabilities in LLMs.</p>

<h2 id="using-rl-to-improve-reasoning-in-deepseek-r1">Using RL to improve reasoning in DeepSeek-R1</h2>
<p>Chain-of-Thought prompting (<a href="https://arxiv.org/pdf/2201.11903">Wei et al</a>) led to amazing reasoning improvements in LLMs by simply having the model ‘think step-by-step’ in its answer. 
With this in mind, researchers at Google decided to go beyond prompting, and try to actually train an LLM on these step-by-step reasoning trajectories. 
As expected, this training improved the reasoning capabilities of models on several benchmarks (<a href="https://arxiv.org/pdf/2210.11416">Chung, Hou, Longpre et al</a>).</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/cot_prompt_vs_cot_finetuning.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: CoT prompting only has the model ‘think out loud’ <em>after</em> it’s done training, while CoT finetuning actually trains the model on thinking out loud.</p>

<p>However, while training on reasoning traces improves performance, the authors of <a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a> argue that gathering these human-labelled 
reasoning trajectories is expensive. More importantly, these human-labelled reasoning traces were 
created by, well, humans. An LLM could potentially think differently (maybe even better) than a human does. 
Why does it have to copy this limited, human-based reasoning process?</p>

<p>To explore this question, the DeepSeek-R1 authors use RL to train their LLM. Remember how we said that RL allows you to have different reward types? In this case, rather than 
giving a ‘good’ or ‘bad’ reward, the authors give a ‘correct’ vs ‘incorrect’ reward. Additionally, they have the model output some sort of reasoning trace, but they only reward and 
penalize the model’s final answer. That means that there is no influence on the reasoning process that the model comes up with. As a result, the model can potentially 
learn non-human reasoning pathways while still learning how to reach a correct final answer!</p>
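<p>As a rough sketch, an outcome-only reward like this could look like the following. The ‘FINAL ANSWER’ marker and the exact reward values are simplifications I am assuming for illustration; they are not DeepSeek-R1’s actual rule-based implementation.</p>

<pre><code class="language-python">def outcome_reward(model_output, gold_answer):
    """Reward only the final answer; the reasoning trace itself is never graded."""
    marker = "FINAL ANSWER:"
    if marker not in model_output:
        return 0.0                                    # unparseable output earns nothing
    final = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if final == gold_answer else 0.0

# The reasoning before the marker can be anything (even mixed languages):
# the reward function never looks at it, only at the final answer.
trace = "Hmm, 17 times 3... dix-sept fois trois... FINAL ANSWER: 51"
print(outcome_reward(trace, "51"))   # prints 1.0
</code></pre>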

<p>Something interesting happened when doing this: the original model had great reasoning capabilities; however, its reasoning traces had very poor readability and 
occasionally mixed English and Chinese. To improve the readability of the reasoning traces, the authors suppressed this language-mixing behavior in the final model.</p>

<p>Initially, when reading this, my intuition told me that suppressing the language mixing was maybe not the most beneficial for the model’s reasoning. If the whole point of using RL 
was to have the model create the best, non-human-influenced reasoning traces, and it naturally developed language mixing, then why get rid of that? Indeed, 
as the paper notes, suppressing language mixing in DeepSeek-R1-Zero did actually decrease reasoning capabilities.</p>

<p>In fact, <a href="https://arxiv.org/pdf/2507.15849">Li et al</a> studied this more directly and found that language mixing is not a random artifact of these RL-created reasoning traces, but 
rather a deliberate and strategic reasoning choice that tends to help reasoning performance in LLMs. It’s exciting that, although RL was initially used in DeepSeek only to improve model 
reasoning capabilities, an unexpected byproduct was the discovery of language mixing as a tool that improves LLM problem-solving ability.</p>

<h2 id="i-dont-know-in-language-models">‘I don’t know’ in Language Models</h2>

<p>What actually inspired me to start this post about RL for LLMs was this <a href="https://twitter.com/SebAaltonen/status/2030254482429219170">tweet</a> and subsequent comment I saw the other day.</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/tweet.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>What these posters are describing is that language models ‘do not know what they do not know’. 
This can be harmful when people rely on them for accurate, factual information. For example, if a user needs a particular formula to solve a physics problem, 
it is probably better for a chatbot to give no formula at all than to give an incorrect one.</p>

<p>The DeepSeek-R1 RL structure did not necessarily penalize models for being incorrect. Rather, it gave the model a score of 1 when the answer was correct, 
and 0 when it was incorrect. But what if we tried implementing the structure brought up in this comment?</p>

<p>Indeed, <a href="https://arxiv.org/pdf/2509.25760">Wei et al.</a> tried this reward structure for themselves, using these three reward values:</p>

<div style="text-align: center;">
  <img src="/assets/images/16-03-2026/r_ternary.png" alt="Reinforcement learning rewards for LLMs" style="max-width: 700px; width: 70%;" />
</div>
<p><br /></p>

<p>They find that using this reward design, as opposed to the binary one similar to what DeepSeek used, actually does reduce hallucination in LLMs 
(e.g., having the model make something up as opposed to simply saying ‘I don’t know’). However, accuracy is still the highest when the binary reward system is used. 
In other words, it is still better for the LLM to ‘guess’ on most things than to say ‘I don’t know’ to some.</p>

<p><a href="https://arxiv.org/pdf/2601.20126">Jha et al.</a> tried a similar experiment for themselves, using the same reward structure as specified above. 
However, they actually varied the middle reward as a hyperparameter and found that setting it to -0.25 rather than just 0 led to better performance on their benchmarks. 
In other words, there is still some penalty needed for saying ‘I don’t know’, but it just can’t be as strong as the penalty for getting the answer incorrect.</p>
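<p>Putting the two papers together, the ternary reward might be sketched roughly like this. The value of -0.25 for abstaining follows the hyperparameter Jha et al. found helpful; the penalty for a wrong answer and the exact-match abstention check are my own illustrative assumptions, not values from either paper.</p>

<pre><code class="language-python">def ternary_reward(prediction, gold_answer, idk_penalty=-0.25, wrong_penalty=-1.0):
    """Ternary reward: abstaining is penalized less than answering incorrectly."""
    answer = prediction.strip()
    # Real systems detect abstention more robustly; exact string match is for illustration.
    if answer.lower() == "i don't know":
        return idk_penalty                 # mild penalty for abstaining
    if answer == gold_answer:
        return 1.0                         # full reward for a correct answer
    return wrong_penalty                   # strongest penalty: confidently wrong

print(ternary_reward("Paris", "Paris"))           # 1.0
print(ternary_reward("I don't know", "Paris"))    # -0.25
print(ternary_reward("Lyon", "Paris"))            # -1.0
</code></pre>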

<p>Now, while RL for LLMs may work well for aligning them with certain values and enabling them to develop their own reasoning traces, I’m still doubtful that it is a 
strong enough method to teach a language model what it knows and what it doesn’t. As expected, as the penalty for saying ‘I don’t know’ got weaker and weaker, 
the language model started saying ‘I don’t know’ more often (<a href="https://arxiv.org/pdf/2601.20126">Jha et al</a>). But does it actually know what it doesn’t know, or 
is it just trying to maximize its reward? The future that these tweeters were hoping for may be easier said than done.</p>

<h2 id="summaries-and-takeaways">Summaries and takeaways</h2>
<p>RL is a powerful rewards-based method that, although not traditionally designed for language models, allows researchers to fine-tune LLM responses with 
specific goals in mind. While focusing on these goals, the LLM will learn its own paths to get there, which may lead to interesting discoveries, 
such as the benefits of language mixing in reasoning traces for LLMs. It’s also worthwhile to experiment with the structure of the reward itself; 
for example, penalizing a model for saying ‘I don’t know’ less than penalizing it for getting an answer incorrect reduces hallucination.</p>

<p>I personally thought DeepSeek’s approach of using RL to have the model develop its own reasoning traces was really exciting. I wonder what other beneficial LLM 
tools we can discover by letting LLMs take their own paths and giving them more freedom.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reinforcement learning" /><category term="LLM" /><category term="reinforcement learning" /><summary type="html"><![CDATA[In my last post, I talked about applying an originally image-based machine learning method to language, and the challenges that came with it. Today, we’re going to be discussing a similar topic—specifically, Reinforcement Learning (RL) for NLP reasoning. Though not initially used for LLMs, RL has a number of benefits when applied to LLMs. Specifically, RL allows LLMs to learn how to create outputs with specific goals in mind.]]></summary></entry><entry><title type="html">Diffusion in language models, and rethinking thought</title><link href="https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms/" rel="alternate" type="text/html" title="Diffusion in language models, and rethinking thought" /><published>2026-02-09T17:00:00+00:00</published><updated>2026-02-09T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms</id><content type="html" xml:base="https://anikamisra.github.io/llm/diffusion/thought/diffusion-lms/"><![CDATA[<h1 id="diffusion-in-language-models-and-rethinking-thought">Diffusion in language models, and rethinking thought</h1>

<h2 id="introduction">Introduction</h2>

<p>Most LLMs today are <strong>autoregressive</strong>, meaning that they generate each token one at a time, step-by-step, only after seeing the previous token.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/autoregressive.png" alt="Autoregressive LLMs generate one token at a time, step-by-step." style="max-width: 700px; width: 100%;" />
</div>
<p><br />
<br /></p>

<p>You might be thinking, duh? Of course an LLM only generates a token after it generated the one before it. 
As humans, that is also how we read, write, and communicate. I say my first word before I say my second word.</p>

<p>But thought itself, that which often comes before we speak aloud, isn’t always so simple and temporal. Have you ever had an 
intuition about something? Or had an idea suddenly come to mind? That thought doesn’t come in a token-by-token sequence—it comes in one moment.</p>

<p>Now, one of the reasons why I like natural language processing (and the artificial intelligence field in general) 
is because we can borrow ideas from one modality (say, language) and apply them to a completely different modality (say, vision). 
For example, transformers were originally created for language translation tasks, but researchers found a way to apply 
them to predictive imaging tasks. Pretty neat, right?</p>

<p>However, it’s not as easy as it seems. You can’t always just take a model architecture from language and slap it onto vision; 
there are fundamental differences that we have to account for. For example, language is discrete (it’s made of individual words / tokens) whereas 
vision is continuous. Generated images typically come in just a few fixed output sizes, whereas the output of an LLM can span from 1 token to 
<a href="https://ai.google.dev/gemini-api/docs/models">65,536</a>.</p>

<h2 id="image-diffusion">Image diffusion</h2>

<p>Ever wonder how images are generated? It’s a little different from the language output that you get from ChatGPT. Most image generation models today use 
a process called <strong>diffusion</strong>. Image diffusion is a process where a model generates an 
image by learning how to remove noise. First, you take an image with a text caption. Then, 
you create different versions of that image with different amounts of Gaussian noise added to them. To train the model, you 
give it these noised versions, and it learns how to correctly remove the noise at each step, using your original image as the ground truth.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/diffusion_process.png" alt="Image diffusion models learn how to correctly remove noise at each step." style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: The model learns how to correctly remove Gaussian noise for a specified image (flower).<br />
<br />
<br /></p>
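<p>For the more code-minded, here is a very rough sketch of one training step of that noising-and-denoising process. The <code>denoise_model</code>, the linear noise schedule, and the noise-prediction objective are placeholders I am assuming for illustration; real diffusion models use more careful schedules and parameterizations.</p>

<pre><code class="language-python">import torch

def diffusion_training_step(denoise_model, clean_image, caption_embedding):
    """One toy training step: add Gaussian noise, then learn to predict that noise."""
    # Pick a random noise level t in [0, 1); higher t means a noisier image.
    t = torch.rand(1)
    noise = torch.randn_like(clean_image)
    # Simple linear mix of image and noise (real models use fancier schedules).
    noisy_image = (1 - t) * clean_image + t * noise
    # The model sees the noisy image plus the caption and guesses the added noise.
    predicted_noise = denoise_model(noisy_image, t, caption_embedding)
    # The ground truth is the exact noise we added, so the clean image is recoverable.
    loss = torch.nn.functional.mse_loss(predicted_noise, noise)
    loss.backward()
    return loss.item()
</code></pre>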

<h2 id="language-diffusion">Language diffusion</h2>

<p>Diffusion works nicely for images for certain reasons. First of all, generated images have a fixed size—if you want to generate an image, 
it will typically come out at a standard resolution such as 256x256 pixels. Second, images are continuous, and so the concept of adding noise to them makes sense. Would you be surprised, 
then, if I told you that there were diffusion models for <em>language</em>? It definitely surprised me at first. Specifically, I wondered, 
how can we add “noise” to words?</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/pixelated_words.png" alt="How do we add noise to words?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: How do we add the same type of pixelated “noise” to words?<br />
<br />
<br /></p>

<p>Nie and Zhu et al. answer this question in their work, <a href="https://arxiv.org/pdf/2502.09992">Large Language Diffusion Models</a>. The authors 
create a diffusion language model, LLaDA, by randomly masking individual <em>tokens</em>. LLaDA then generates text by predicting how to unmask
the tokens. This is similar to image diffusion, where the model learns how to remove the Gaussian noise pixels—except instead of predicting 
which <em>pixel</em> lies underneath, LLaDA predicts which <em>token</em> lies underneath.</p>
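<p>As a rough analogue of that process in code, here is a minimal sketch of random token masking. The mask symbol, mask fraction, and whitespace tokenization are illustrative assumptions on my part, not LLaDA’s actual configuration.</p>

<pre><code class="language-python">import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.3):
    """Randomly hide a fraction of tokens; the model must predict what lies underneath."""
    masked = list(tokens)
    num_to_mask = max(1, int(len(tokens) * mask_fraction))
    positions = random.sample(range(len(tokens)), num_to_mask)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # ground-truth token hidden behind the mask
        masked[pos] = MASK
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)    # e.g. ['the', '[MASK]', 'brown', 'fox', '[MASK]', 'over', 'the', 'lazy', 'dog']
print(targets)   # e.g. {1: 'quick', 4: 'jumps'} -- what the model is trained to recover
</code></pre>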

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/language_diffusion_unmasking.png" alt="How do we add noise to words?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: Diffusion models generate text by predicting what lies underneath the masked tokens.<br />
<br />
<br /></p>

<p>Looking at this graphic, you may notice that the answer’s output length was predetermined—8 tokens. However, we don’t always know in advance how 
long we want a language model’s response to be. (In fact, in my opinion, that should be for the language model to determine.) To address this 
fixed output length issue, the creators of LLaDA provide multiple versions of their model.</p>

<ol>
  <li>Two of them take a more <strong>pure diffusion</strong> approach, where the output length is either pre-determined by a specified number of tokens or is permanently fixed. This version implements classic diffusion, but it also has the limitation that the user needs to know how long they want their output to be.</li>
  <li>The other versions take a more <strong>autoregressive-style</strong> approach, where the output keeps diffusing, a chunk at a time, until the model reaches an end token.</li>
</ol>

<p>This second approach, while it doesn’t require a pre-specified output length, may undermine the benefits of a diffusion language model, where the whole 
point is that it is <em>not</em> supposed to be autoregressive. So, the short answer is: dealing with the fixed output length of diffusion models is still an 
active area of research!</p>

<p>Now, with this limitation, why do people even care about Diffusion Language Models? There are multiple reasons, specified by <a href="https://arxiv.org/pdf/2508.10875">Li et al</a>. 
First of all, diffusion models have the potential to be faster than autoregressive models, because they generate all their tokens in parallel rather than step-by-step. 
Second, autoregressive models can only look at tokens that they have already written - they cannot look “ahead”. Diffusion models, on the other hand, can look 
in all directions, potentially enabling them to capture more nuanced language understanding. These are just some of the benefits.</p>

<h2 id="diffusion-for-lateral-thinking">Diffusion for Lateral Thinking</h2>

<p>One work that leans into fully leveraging Diffusion Language Models’ benefits is Reinforcing the Diffusion Chain of Lateral Thought (DCoLT)
with Diffusion Language Models by <a href="https://arxiv.org/pdf/2505.10446">Huang and Chen et al</a>. The authors point out an interesting problem: 
Chain of Thought, while helpful for LLM reasoning, is still autoregressive in nature. It teaches a model to produce each <em>thought</em> step-by-step, similar to how 
it generates individual tokens.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/cot_example.png" alt="CoT example." style="max-width: 700px; width: 40%;" />
</div>
<p><br />
Above: Chain-of-thought (CoT) has models think step-by-step to improve reasoning results. <a href="https://www.linkedin.com/pulse/how-teach-chain-of-thought-reasoning-your-gdbzf/">Image source</a>
<br />
<br /></p>

<p>However, not all thinking is done step-by-step! As I’ve mentioned before, getting a gut feeling or an idea is something that happens all at once. 
Furthermore, when humans are trying to answer a question creatively, they may think more “messily” and broadly, trying to think of all potential 
solutions before diving into solving a specific one step-by-step. Specifically, this is known as <strong>lateral thinking</strong>, 
a term coined by Edward de Bono.</p>

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/Screenshot 2026-02-09 at 4.19.47 PM.png" alt="Vertical vs Lateral thinking?" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: Vertical vs Lateral thinking. Source: De Bono, Edward, and Efrem Zimbalist. Lateral thinking. London: Penguin, 1970. 
<br />
<br /></p>

<p>The authors of DCoLT observe that diffusion language models provide the perfect opportunity to let an LLM think <strong>laterally</strong> rather than <strong>vertically</strong> (step-by-step). To do this, 
they fine-tune LLaDA (the diffusion language model mentioned earlier!) via Reinforcement Learning with an Unmasking Policy Module (UPM). That’s a 
lot of words, so let me break it down. LLaDA already learned how to unmask tokens and predict what was lying underneath. However, the tokens that were unmasked 
were chosen based on the highest <em>confidence</em>. DCoLT takes LLaDA and builds upon it, training the model to choose which tokens to unmask 
based not on highest confidence, but on what best supports its <em>reasoning</em>.</p>

<p>How are they able to do this? The key idea here is Reinforcement Learning (RL). RL allows you to create a <em>custom</em> reward function, allowing you to improve the specific 
task you want to focus on. The idea of using RL to improve reasoning specifically is borrowed from <a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a>. The main difference between DeepSeek’s version and 
DCoLT is that DeepSeek applies this method to the autoregressive, step-by-step, Chain-of-Thought reasoning, whereas the authors of DCoLT take on the <strong>diffusion</strong> approach.</p>
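<p>To make that distinction concrete, here is a toy contrast between the two ways of choosing which masked positions to reveal next. The scores and the sampling scheme are placeholders I am assuming, not the actual Unmasking Policy Module from the DCoLT paper.</p>

<pre><code class="language-python">import torch

def pick_by_confidence(token_confidences, k):
    """Baseline choice: reveal the k masked positions the model is most confident about."""
    return torch.topk(token_confidences, k).indices

def pick_by_policy(policy_logits, k):
    """Sketch of the policy-based idea: sample positions from a learned unmasking policy.

    Because positions are sampled rather than argmax-ed, RL can reward whole
    unmasking trajectories that end in a correct final answer.
    """
    probs = torch.softmax(policy_logits, dim=-1)
    return torch.multinomial(probs, k, replacement=False)

confidences = torch.tensor([0.9, 0.2, 0.7, 0.1, 0.8])    # made-up per-position confidences
policy_logits = torch.tensor([0.1, 1.2, 0.3, 0.9, 0.5])  # made-up policy preferences
print(pick_by_confidence(confidences, k=2))   # always the two highest-confidence positions
print(pick_by_policy(policy_logits, k=2))     # stochastic, shaped by the RL reward over time
</code></pre>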

<div style="text-align: center;">
  <img src="/assets/images/02-09-2026/Screenshot 2026-02-09 at 4.31.49 PM.png" alt="Diffusion Chain of Lateral Thought" style="max-width: 700px; width: 100%;" />
</div>
<p><br />
Above: The Diffusion Chain of Lateral Thought process. Entire thoughts are generated in parallel, rather than one thought after another.<br />
<br /></p>

<h2 id="conclusion-and-takeaways">Conclusion and Takeaways</h2>

<p>Let’s zoom back out for a summary, because this blog post was quite the journey. Here are the key points to take away:</p>
<ol>
  <li>Most LLMs are autoregressive, meaning they generate text token-by-token. Diffusion, on the other hand, generates objects all at once by learning to remove noise.</li>
  <li>Diffusion is a popular method for image generation: the model learns how to remove pixelated noise to generate an image based on a text descriptor.</li>
  <li>Nie and Zhu et al., the creators of LLaDA, creatively apply diffusion to LLMs by randomly masking individual tokens.</li>
  <li>On another note, DeepSeek-R1 discovered that using Reinforcement Learning is a useful tool to improve model reasoning. However, this was still from an autoregressive, step-by-step, thought-by-thought perspective.</li>
  <li>Huang and Chen et al., DCoLT authors, combine language diffusion (LLaDA) with DeepSeek-R1’s technique to improve reasoning through Reinforcement Learning. They build a language diffusion model, specifically for reasoning, that mimics lateral thinking over vertical, step-by-step thinking.</li>
</ol>

<p>Overall, I find the language diffusion space interesting for multiple reasons. First, I love how diffusion was a method that was originally used for image generation, 
but then was applied to language generation. I also agree with the authors of <a href="https://arxiv.org/pdf/2505.10446">DCoLT</a> that using only CoT examples to improve reasoning can be a limitation; 
when I think about how we as humans think, it is definitely not always step-by-step. Indeed, other researchers have also recognized that human thinking is not always linear. 
Works like <a href="https://arxiv.org/pdf/2305.10601">Tree of Thoughts</a> and <a href="https://arxiv.org/pdf/2308.09687">Graph of Thoughts</a> implement alternate thinking approaches that are also not only step-by-step.</p>

<p>In general, it will be exciting to see where diffusion models take us in the future, and how they will grow compared to their currently popular autoregressive counterparts. I believe they are definitely 
an interesting type of language model that have a large number of untapped benefits.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="diffusion" /><category term="thought" /><category term="LLM" /><category term="diffusion" /><category term="thought" /><summary type="html"><![CDATA[Diffusion in language models, and rethinking thought]]></summary></entry><entry><title type="html">Dissociating Language and Thought in LLMs</title><link href="https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought/" rel="alternate" type="text/html" title="Dissociating Language and Thought in LLMs" /><published>2026-01-14T13:00:00+00:00</published><updated>2026-01-14T13:00:00+00:00</updated><id>https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought</id><content type="html" xml:base="https://anikamisra.github.io/llm/linguistics/cognitive%20science/dissociating-language-and-thought/"><![CDATA[<h1 id="dissociating-language-and-thought-in-llms">Dissociating Language and Thought in LLMs</h1>
<p>This post is an overview of the paper: Dissociating Language and Thought in Large Language Models: A Cognitive Perspective by Mahowald and Ivanova et al. The paper is linked here: <a href="https://arxiv.org/abs/2301.06627">https://arxiv.org/abs/2301.06627</a></p>

<h3 id="separation-of-language-and-thought">Separation of language and thought</h3>
<p>In my introductory linguistics course, one of the first concepts we learned about was the separation of language and intelligence. Specifically, 
we learned that just because someone seems to be a ‘good’ speaker does not automatically make them intelligent, and vice versa; just 
because someone is not so great at speaking does not mean they are not intelligent. Regardless, we as humans still have a bias. If we hear someone speaking well, 
we may assume they are intelligent and skilled in other unrelated areas, even without meaning to.</p>

<p>When it comes to machine language, humans are not immune to having the same biases. This is because “[w]hen we hear a sentence, we typically assume 
that it was produced by a rational, thinking agent (another person)” (<a href="https://arxiv.org/abs/2301.06627">Mahowald, Ivanova et al</a>). For over 100,000 years, the only entities who could converse in human language 
were, well, humans. This explains why humans trust LLMs (even when they hallucinate), feel connected to them, and sometimes even feel that they are sentient.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/brooke-cagle-ZCSc8hoZrtw-unsplash.jpg" alt="Human talking to computer." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p><a href="https://unsplash.com/photos/woman-in-orange-long-sleeve-shirt-sitting-in-front-of-silver-macbook-ZCSc8hoZrtw">Image Source</a></p>

<p>The authors continue to say that because LLMs are so good at certain language tasks, such as text comprehension and next-word prediction, people 
conflate these advanced language skills with Artificial General Intelligence. This blending of language and thought is further reinforced by the 
influence of the Turing Test, proposed in 1950. The Turing Test says that if a human is unable to tell a machine’s text apart from another human’s, then the 
machine could be said to be intelligent.</p>

<p>However, intelligence goes beyond how well someone, or some machine, can speak. The authors say that due to “the fact that language can, and typically does, reflect underlying thoughts”, 
people have misconceptions when it comes to the way language and thought are connected. Just because someone is good at language does not mean that they are good at thinking. 
From my previous posts and LLM reasoning performances alone, it is quite intuitive why this is a fallacy.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/chat_strawberry_r.png" alt="Incorrect number of r's in the word strawberry, answered with confidence." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: A famous case of older versions of ChatGPT failing to properly count how many r’s are in the word strawberry
<br />
<br /></p>

<p>In the above image, though the model is great at ‘speaking’, its answer is incorrect, showing that it is bad at ‘thinking’, ultimately disproving the idea 
that “good at language” always implies “good at thought” in LLMs (and this is true for humans, too!).</p>

<p>In the human brain, there is evidence that language and thought are dissociable. With this motivation, the authors present two different kinds of linguistic competence 
that LLMs could be evaluated on. The first one is <strong>formal linguistic competence</strong>, which deals with how good an LLM is at knowing the rules and statistical laws of language. Formal 
linguistic competence is more connected to ‘speaking’ and following correct grammar rules.</p>

<p>The second one is <strong>functional linguistic competence</strong>, which deals with an LLM’s ability to use language “in the real world”, often relying on non-linguistic skills. Functional 
linguistic competence is more connected to ‘thinking’, such as reasoning about diverse topics, making requests, and overall, interfacing with other cognitive components and 
ultimately, other entities.</p>

<p>With this new framework in mind, the authors argue that modern LLMs can be great at formal language but are not so great at modeling human thought.</p>

<h3 id="formal-linguistic-competence">Formal linguistic competence</h3>
<p>First, the authors discuss how LLMs are pretty good at formal linguistic competence. Modern transformer models (this paper was published in 2023, so they mainly discuss GPT-3) are 
coherent, great at generating text, able to perform specific tasks with simple prompting, and, overall, excellent at grammar. Their next-word predictions can even 
track human neural behavior during language processing.</p>

<p>However, one limitation is that LLMs can over-rely on statistical regularities when it comes to language. Furthermore, something that I thought was really interesting was 
that LLMs see 100-1000 times as much data as a child is exposed to. This goes to show that even if LLMs can learn the rules of language from statistical patterns, 
these current methods lack something that humans have when it comes to learning language. At the time that this paper was written, I don’t think small language models 
had taken off yet, but today, there is a wide variety of small language models (Llama3.2-1B, SmolLM2-1.7B, etc.), which shows promise for language models that 
can also learn on smaller datasets.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/quote-1-blue.png" alt="Quote 1." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<h3 id="functional-linguistic-competence">Functional linguistic competence</h3>
<p>Next, the authors discuss how LLMs are weaker when it comes to functional linguistic competence (the ‘thinking’ part). Since natural language contains factual knowledge, LLMs trained 
on a lot of text must acquire some factual knowledge. This often comes from word co-occurrence patterns (<strong>co-occurrence</strong> relates to which words show up frequently together). 
For example, during training, if “Austin” shows up frequently next to “The capital of Texas is…”, then the LLM will ‘learn’ that Austin is the capital of Texas just from the 
text alone.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/co-occurrence-relations-2.png" alt="Quote 1." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>
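<p>As a toy illustration of what learning from co-occurrence means, the sketch below counts which words appear alongside a phrase in a tiny made-up corpus. Real LLMs learn far richer statistics than raw counts, but the underlying signal is similar in spirit.</p>

<pre><code class="language-python">from collections import Counter

# A tiny made-up corpus; a real model sees billions of sentences like these.
corpus = [
    "the capital of texas is austin",
    "austin is the capital of texas",
    "i visited austin the capital of texas last summer",
    "houston is the largest city in texas",
]

def co_occurrence_counts(corpus, anchor_phrase):
    """Count words that appear in sentences containing the anchor phrase."""
    anchor_words = set(anchor_phrase.split())
    counts = Counter()
    for sentence in corpus:
        if anchor_phrase in sentence:
            counts.update(w for w in sentence.split() if w not in anchor_words)
    return counts

print(co_occurrence_counts(corpus, "capital of texas").most_common(3))
# [('the', 3), ('austin', 3), ('is', 2)] -- 'austin' keeps showing up near the phrase
</code></pre>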

<p>Hence, the authors claim that “any test of LLMs’ ability to reason must account for their ability to use word co-occurrence patterns to “hack” the task.”</p>

<p>Now, I’m conflicted about this statement. I agree that when I first truly understood how LLMs were trained, an LLM ‘knowing’ something only because it was good at predicting the 
next token did feel like ‘hacking’. On the other hand, even if an LLM is simply a next-likely-token predictor, we cannot discount it when it gets things right or is ‘good’ at certain tasks, 
even if that learning was done through word frequency relations. Modern approaches also use external methods like Retrieval-Augmented Generation or external solvers 
(<a href="https://anikamisra.github.io/posts/linc/">LINC</a>, <a href="https://anikamisra.github.io/posts/faithfulcot/">Faithful CoT</a>) to mitigate this issue. Still, an over-reliance on statistical patterns can lead to issues like the <a href="https://anikamisra.github.io/posts/reversal-curse/">Reversal Curse</a>!</p>

<p>In the rest of the section, the authors discuss more limitations of LLMs compared to humans, including:</p>
<ul>
  <li><strong>Formal reasoning</strong>: LLMs tend to do worse on reasoning tasks when the task is not something that one would see in the typical dataset (called ‘out of distribution’ data). For example, with unusual tasks such as “Get your sofa onto the roof of your house without using a pulley”, humans tend to do much better than LLMs, even though the reasoning behind it is not necessarily complex.</li>
  <li><strong>World knowledge &amp; commonsense</strong>: LLMs’ general knowledge is biased: on the internet, humans tend to write about things that are unusual or noteworthy, and less so about implied common facts. As a result, LLMs may struggle with underreported knowledge such as the fact that wheels tend to be round.</li>
  <li><strong>Situation modeling</strong>: LLMs are incapable of tracking information over long conversations, unlike humans, who can have a three-hour-long conversation and still remember most of it the next day. (Note: this was true at the time of writing, but LLMs since then have seen tremendous growth in context windows, for example, with Gemini Pro having a context window of <a href="https://arxiv.org/pdf/2403.05530">over a million tokens</a>.)</li>
  <li><strong>Pragmatics and intent</strong>: LLMs struggle to understand users’ intent, sarcasm, and jokes. I found that since the writing of this paper, LLMs’ ability to understand sarcasm has improved, but is still not at the level of humans (<a href="https://arxiv.org/pdf/2501.02532">Bojić et al</a>).</li>
</ul>

<h3 id="conclusion">Conclusion</h3>
<p>Overall, due to LLMs’ limitations in formal reasoning, world knowledge, situation modeling, and social reasoning (e.g. sarcasm), the authors argue 
that LLMs are weak in functional linguistic competence. This highlights the fact that just because a system is good at language does not imply it is good at thought.</p>

<p>Now this is the most fundamental argument: language and thought are separate systems, and perhaps trying to model both with just one LLM system is not practical. 
The authors argue that, similar to the human brain, a truly intelligent model must not only have a language processor, but a problem solver component, an experiencer component, 
a situation modeler, a reasoning component, and a goal decider. They tie it back into the Turing Test: “a model that masters language <em>use</em>, not just the rules and patterns of 
natural language, has to be a general intelligence model.”</p>

<p>Therefore, there should be <em>separate benchmarks</em> for LLMs: some for formal linguistic competence (which already exist), 
and some for functional linguistic competence.</p>

<div style="text-align: center;">
  <img src="/assets/images/01-14-2026/quote-2-blue.png" alt="Quote 2." style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I thought this article was really well-written. It really reinforced the fact that pure, traditional LLMs are next-token predictors rather than agents with 
intent or goals. When you engage in a conversation with a chatbot, it may feel like you are talking to someone, but in reality, you are not actually talking to anyone. 
This sense of perceived agency is what makes talking to one so misleading.</p>

<p>Overall, I agree with the general sentiment that some “language use” tasks are better when offloaded to a non-LLM or an external solver, such as LINC or utilizing 
Retrieval-Augmented Generation. On the other hand, after reading this article, using word co-occurrence patterns to learn facts doesn’t honestly feel that much like hacking to me. 
Yes, it may not be as grounded as other methods, but ‘learning’ that Austin is the capital of Texas because you’ve seen “The capital of Texas is Austin” 
written in so many internet documents still feels like a valid way to learn something if it works for the most part. However, this shouldn’t be everything: 
we don’t want models to be right ‘most of the time’, we want them to be right ‘all of the time’.</p>

<p>I also wonder if the categories mentioned — problem solver, experiencer, situation modeler, etc. — are meant to be a definitive characterization of an 
intelligent agent, or just one helpful framework among many. I agree that intelligence 
goes beyond just how well someone speaks, but I also know that the definition of it is highly debated, and so I wonder what other models or frameworks one could 
follow to create such an intelligent agent. Perhaps we don’t have to model intelligence after the human brain at all. If language models can learn how to 
‘speak’ (at least, in terms of formal linguistic competence) differently than humans do, perhaps we can create intelligent agents using a non-human framework too.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="linguistics" /><category term="cognitive science" /><category term="LLM" /><category term="linguistics" /><category term="cognitive science" /><summary type="html"><![CDATA[Dissociating Language and Thought in LLMs This post is an overview of the paper: Dissociating Language and Thought in Large Language Models: A Cognitive Perspective by Mahowald and Ivanova et al. The paper is linked here: https://arxiv.org/abs/2301.06627]]></summary></entry><entry><title type="html">Faithful Chain of Thought</title><link href="https://anikamisra.github.io/llm/reasoning/cot/faithfulcot/" rel="alternate" type="text/html" title="Faithful Chain of Thought" /><published>2025-12-28T20:00:00+00:00</published><updated>2025-12-28T20:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/cot/faithfulcot</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/cot/faithfulcot/"><![CDATA[<h1 id="faithful-chain-of-thought-reasoning">Faithful Chain-of-Thought Reasoning</h1>
<p>This post is an overview of the paper: Faithful Chain-of-Thought Reasoning by Lyu, Havaldar, Stein et al. 
The paper is linked here: <a href="https://arxiv.org/pdf/2301.13379">https://arxiv.org/pdf/2301.13379</a></p>

<h3 id="introduction">Introduction</h3>
<p>Today, if you ask a chatbot a math question or a complex reasoning question, it will likely provide the correct answer. 
There was a time, however, where this level of performance was unimaginable. This is because while LLMs are good at imitating surface-level human language, 
the structured reasoning <em>beneath</em> the language — reasoning that humans usually find quite simple — is a fundamentally different problem. 
Creating models that can properly think and reason rather than simply “talk” is a largely unsolved challenge in AI.</p>

<h3 id="background-original-chain-of-thought-and-some-limitations">Background: Original Chain-of-Thought (and some limitations)</h3>
<p>One method that drastically advanced reasoning capabilities of LLMs was Chain of Thought (CoT). CoT, published in 2022, was a prompting strategy that led to groundbreaking 
improvements in LLMs’ reasoning capabilities on complex reasoning tasks, like math and commonsense. The way CoT works is fairly simple: just ask the model to “show its work”. 
More specifically, instead of asking an LLM a direct question, you provide several examples of “showing your thinking”. Then, in its response, the LLM also will “show its thinking”, 
leading to more accurate responses.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/CoT_example_again.png" alt="CoT example" style="max-width: 700px; width: 80%;" />
</div>
<p><br /></p>

<p>Above: The input prompts the model to “show its thinking” by placing a reasoning example in the demo question-answer pair.</p>

<p>CoT led to striking improvements in LLM reasoning performances on several benchmarks. That wasn’t its only benefit, however. A second benefit was that CoT provided an 
<strong>interpretable window</strong> into how the LLM was “thinking”. Since CoT forces the model to “show its work”, one can see exactly which steps the LLM took to reason in its final answer. 
This can be useful for tracing errors and boosting explainability.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/CoT_provides_interprebility.png" alt="CoT provides interprebility" style="max-width: 700px; width: 100%;" />
</div>
<p><br /></p>

<p>Above: When an LLM is prompted with CoT, it “shows its thinking”, which can be useful when trying to understand why it got an answer wrong.</p>

<p>However, this is not always reliable. One issue with CoT is that it is not always <strong>faithful</strong> to its reasoning process. <strong>Faithfulness</strong> refers to whether the output 
reasoning text reflects the process that the model actually took to reach its answer. Below is an example of CoT reasoning steps <em>not</em> being faithful.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/cot_not_faithful.png" alt="CoT explanations are not always faithful." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: The LLM uses CoT to explain its work and arrives at an answer of $200 in the reasoning chain. 
However, in the last moment, for some reason, it randomly switches and says the final answer is $0.</p>

<p><a href="https://openreview.net/forum?id=_VjQlMeSB_J">Source</a></p>

<p>In the above image, though the reasoning process looks plausible, the final output of the LLM completely contradicts the reasoning steps. This means that the 
model may not have actually taken those reasoning steps to reach its final answer. The reasoning steps also do not explain why the LLM randomly switched its answer 
in the last moment. Unfaithfulness in LLM reasoning can be misleading and even dangerous in higher stakes fields where explainability and error understanding are crucial.</p>

<h3 id="faithful-chain-of-thought">Faithful Chain-of-Thought</h3>
<p>To solve this issue, the authors present Faithful Chain-of-Thought. They break down complex reasoning tasks into 2 stages: Translation and Problem Solving.</p>

<p><strong>Translation</strong></p>

<p>In translation, the language model translates the given question into a reasoning chain. Unlike plain CoT, which uses only natural language (e.g. plain English) 
in its reasoning chain, this new reasoning chain is a combination of natural language and symbolic language (like code). The symbolic language chosen is 
dependent on the type of question. For example, Python may be used for math problems whereas Planning Domain Definition Language (PDDL) may be used for planning questions.</p>

<div style="text-align: center;">
  <img src="/assets/images/28-12-2025/step_1_translation.png" alt="Translate query into natural language and symbolic reasoning chain." style="max-width: 700px; width: 70%;" />
</div>
<p><br /></p>

<p><strong>Problem Solving</strong></p>

<p>In problem solving, the reasoning chain is executed; in the example above, the Python code would run. The output from the reasoning chain (code) is then taken as the answer itself. 
This component is also called a “solver”.</p>
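<p>To make the two stages concrete, here is a small sketch of what a translated reasoning chain could look like for a toy math word problem, and how the problem-solving stage simply executes it. The specific problem, the comments-as-natural-language convention, and the use of Python’s <code>exec</code> are my own illustrative choices rather than the paper’s exact prompt format.</p>

<pre><code class="language-python"># Stage 1 (Translation): the LLM turns the word problem into an interleaved chain of
# natural-language comments and executable Python (an illustrative example, not the paper's).
reasoning_chain = """
# Leah had 32 chocolates and her sister had 42.
leah = 32
sister = 42
# Together they had leah + sister chocolates.
total = leah + sister
# They ate 35, so the remainder is the final answer.
answer = total - 35
"""

# Stage 2 (Problem Solving): a deterministic solver -- here simply the Python
# interpreter -- executes the chain, so the final answer provably comes from it.
namespace = {}
exec(reasoning_chain, namespace)
print(namespace["answer"])   # 39
</code></pre>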

<h3 id="conclusion">Conclusion</h3>
<p>In summary, plain CoT has the LLM simply “say” what it’s thinking out loud, whereas Faithful CoT has the LLM formally execute some type of code to 
retrieve the final answer. In plain CoT, though the reasoning chain may look legitimate, there is no way to verify if this is what the LLM was 
actually “thinking”. On the other hand, with Faithful CoT, the LLM’s output is guaranteed to reflect the steps it took (or, in this case, code it ran) 
to reach the answer. In other words, the reasoning steps the LLM takes are faithfully tied to the final answer. 
This creates a more reliable explanation of the reasoning process. Faithful CoT also outperforms the original CoT on several reasoning benchmarks.</p>

<p>Now, of course, the Faithful CoT method does not ensure that the LLM always provides the perfectly correct reasoning chain. It just ensures that, for 
whichever reasoning chain the LLM provides, the final answer actually <em>comes from</em> that reasoning chain. Another direction of research is to work on the Translation 
component of Faithful CoT, and improve the reasoning chain that the LLM creates (“reason about reasoning”? We could get into some deep recursion here).</p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I see similarities between Faithful CoT and <a href="https://anikamisra.github.io/posts/linc/">LINC</a>: Both convert a natural language question into an external type of language—either code or a planning language, or 
in LINC’s case, First-Order-Logic—and then use an external solver to solve the question. With Faithful CoT, I like how the type of language used is task-dependent. 
This intuitively feels beneficial for performance. Overall, I see a common theme of offloading the reasoning components to external solvers when improving reasoning for LLMs.</p>

<p><strong>Questions to think about</strong>:</p>
<ol>
  <li>I wonder how Faithful CoT and LINC can be applied to problems that require long reasoning chains and a lot of memory, which can be more challenging for LLMs. The reasoning questions shown here seem to be more short and simple, from my understanding.</li>
  <li>Both of these methods require an LLM to translate from natural language into a more symbolic one. While these methods improve performance on reasoning benchmarks, are we actually solving the problem, or are we just shifting it to a different one (translation)? And, does it really matter, if the benchmarks are being improved upon?</li>
</ol>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="CoT" /><category term="LLM" /><category term="reasoning" /><category term="CoT" /><summary type="html"><![CDATA[Faithful Chain-of-Thought Reasoning This post is an overview of the paper: Faithful Chain-of-Thought Reasoning by Lyu, Havaldar, Stein et al. The paper is linked here: https://arxiv.org/pdf/2310.15164]]></summary></entry><entry><title type="html">LINC: Neurosymbolic reasoning approach for LLM</title><link href="https://anikamisra.github.io/llm/reasoning/linc/" rel="alternate" type="text/html" title="LINC: Neurosymbolic reasoning approach for LLM" /><published>2025-12-20T20:00:00+00:00</published><updated>2025-12-20T20:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/linc</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/linc/"><![CDATA[<h1 id="linc-neurosymbolic-reasoning-for-llms">LINC: Neurosymbolic Reasoning for LLMs</h1>
<p>This post is an overview of the paper: LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers by Olausson et al. The paper is linked here: <a href="https://arxiv.org/pdf/2310.15164">https://arxiv.org/pdf/2310.15164</a></p>

<h3 id="introduction">Introduction</h3>
<p>In today’s world, as people rely more and more on chatbots, logical reasoning is an extremely important task for LLMs. In the legal and medical domains, for example, it is imperative that we can verify the correctness and validity of LLMs’ claims. However, if you have ever used a chatbot, you may know that LLMs can be wrong sometimes. Why does this happen? From a very high-level viewpoint, this is because pure LLMs are “next likely token predictors”. They simply predict what the next likely word is based on what they have seen in training data. As a result, when we analyze their outputs from a human perspective, we find flaws, incorrect reasoning statements, and hallucinations (aka, something the model completely ‘made up’).</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/chat_strawberry_r.png" alt="Incorrect number of r's in the word strawberry, answered with confidence." style="max-width: 700px; width: 70%;" />
</div>
<p><br />
Above: A famous case of older versions of ChatGPT failing to properly count how many r’s are in the word strawberry</p>

<p><a href="https://prompt.16x.engineer/blog/why-chatgpt-cant-count-rs-in-strawberry">Source</a></p>

<p><br /></p>

<h3 id="chain-of-thought">Chain of Thought</h3>

<p>One groundbreaking method to improve LLM reasoning was the 2022 <a href="https://arxiv.org/pdf/2201.11903">Chain-of-Thought approach (CoT)</a> by Wei et al. In short, the way CoT works is that it has the model “think out loud” before generating a final response. The researchers found that this prompt-based approach significantly improved LLM reasoning performances on a wide range of tasks.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/cot_example.png" alt="Chain-of-thought has the model think out loud to improve reasoning capabilities." style="max-width: 700px; width: 100%;" />
</div>

<p>CoT is a brilliant approach that significantly improves LLM reasoning performance on arithmetic, commonsense, and logical reasoning tasks. However, when I hear about prompt-based approaches, I wonder if there is another side that we are missing altogether. Prompt-based approaches still rely on the “next likely token” text generation. Specifically, when it comes to logical reasoning, there are many formal rules that we could leverage to help the model reason, rather than simply relying on a (at certain times, somewhat flimsy) probabilistic word generation approach.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/Logical_rules.png" alt="Many day-to-day reasoning tasks can be translated to formal logical representations." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: Many day-to-day reasoning tasks in natural language can be translated to formal logical representations.</p>

<p><br />
<br /></p>

<h3 id="linc-overview">LINC Overview</h3>

<p>This is exactly where <strong>Logical Inference via Neurosymbolic Computation (LINC)</strong> comes in. The authors behind LINC agree that prompt-based approaches (like CoT) and increasing the model size can help improve LLM reasoning performance, however, there are still limitations. The authors decide to leverage the formal rules of logic in their method, making it a <strong>neurosymbolic approach</strong>, described below.</p>

<ol>
  <li>
    <p>First, the model converts the natural language (NL) expression into a first-order logic (FOL) expression. Essentially, it converts the regular text into logic symbols.</p>
  </li>
  <li>
    <p>Next, they use a symbolic FOL solver, which relies specifically on <em>logic rules</em>, to determine the truth value of the conclusion. There can be three possible outputs: True, False, or Uncertain (e.g. if there was not enough information to solve the problem).</p>
  </li>
  <li>
    <p>Finally, they run this process K times and decide on the answer that appears the majority of the time. (This reminds me of an agentic-style approach!)</p>
  </li>
</ol>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/Linc_overview.png" alt="Overview of LINC process." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: LINC first converts the natural language expression to a logical one, then uses a logic solver to solve it.</p>

<p><br />
<br /></p>
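<p>Here is a small sketch of that final majority-vote step, assuming we already have a function that runs the translate-then-prove pipeline once and returns ‘True’, ‘False’, or ‘Uncertain’. The function names and the value of K are illustrative, not taken from the paper.</p>

<pre><code class="language-python">import random
from collections import Counter

def linc_majority_vote(run_pipeline_once, question, k=10):
    """Run the NL-to-FOL-to-prover pipeline k times and keep the most common verdict."""
    verdicts = [run_pipeline_once(question) for _ in range(k)]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / k            # majority answer plus a crude agreement score

# Toy stand-in for one sampled translation plus the FOL prover's verdict.
def fake_pipeline(question):
    return random.choices(["True", "False", "Uncertain"], weights=[7, 2, 1])[0]

print(linc_majority_vote(fake_pipeline, "Is Harry a nice person?"))
</code></pre>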

<h3 id="linc-results-and-analysis">LINC Results and Analysis</h3>

<p>The authors test LINC on three different pre-trained LLMs and two different evaluation datasets. They find that LINC has better performance results than almost all of the other methods. What is potentially more interesting to analyze, however, is <em>how</em> LINC fails — compared to how CoT fails.</p>

<p>In short, with LINC, the issues arise <strong>when converting the natural language into the symbolic representation</strong>. For example, the authors find that LINC may not capture implicit commonsense information that could affect the premises. In the example below, the model derives a solution of “Uncertain” because it was not able to make the implicit assumption that Harry is a person.</p>

<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/person_harry_assumption.png" alt="LINC does not understand all implicit information." style="max-width: 700px; width: 60%;" />
</div>
<p><br />
<br /></p>

<p>There is also the issue of multiple representations. LINC struggles when it is given a lot of information about one object because it is unable to decide if this information is independent from each other or not. Additionally, certain words may have multiple meanings, causing syntax errors in the logical expression when terms are seemingly repeated. CoT, on the other hand, does not tend to fail at these tasks.</p>
<div style="text-align: center;">
  <img src="/assets/images/20-12-2025/summer_example.png" alt="Language ambiguity can cause errors in the FOL interpretations." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: Language ambiguity can cause syntax errors in the FOL expression or cause issues for the solver.</p>

<p><br />
<br />
However, CoT has different issues. First of all, sometimes CoT will reason about something correctly, but ultimately decide on a different answer at the very last moment. Second, CoT sometimes simply does not follow logical rules - the complete opposite of the FOL solver. Finally, CoT fails with more complex, longer reasoning tasks. This 
illustrates the strengths and weaknesses of the two approaches: CoT is more language-based while LINC is more rule-based, and their performances reflect this.</p>

<h3 id="thoughts-and-takeaways">Thoughts and takeaways</h3>
<p>I thought this paper was super interesting. The failure analysis of LINC shows that most of the challenge for LLM reasoning comes from, well, the language part. This makes sense: if we start from a perfect formal symbolic representation of a statement, we should, in theory, be able to solve it perfectly, since logical reasoning follows set rules (like a math problem). The real challenge comes from actually <em>translating</em> the language statement into that formal symbolic representation.</p>

<p><strong>Connecting to schoolwork</strong></p>

<p>I see several connections between the challenges mentioned in this LINC paper and the traditional NLP challenges discussed in my NLP class this semester. For example, the ambiguity of language is exactly what makes NLP so challenging. 
Additionally, the trouble the FOL representation has with repeated terms, treating them as the same entity rather than as unique ones, reminds me of coreference resolution.</p>

<p><strong>Optimizing the FOL Solver</strong></p>

<p>LINC uses a <em>fixed</em> symbolic solver. However, <a href="https://arxiv.org/pdf/2502.09100">this work</a> from Liu and Fu et al. notes that while earlier neurosymbolic methods relied on a fixed external logical reasoning engine, the engine itself could potentially be <em>optimized</em> as well, which could be an interesting direction for future work.</p>

<p><strong>Conclusion</strong></p>

<p>Overall, while the failure cases of LINC are interesting to learn from, it’s important to emphasize that LINC is an impressive and successful approach that outperformed almost all baseline methods, including CoT, across the reasoning benchmarks tested. It goes to show that the neurosymbolic approach is indeed a promising direction for LLM reasoning
and verifying the validity of what an LLM “says”.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="LLM" /><category term="reasoning" /><summary type="html"><![CDATA[LINC: Neurosymbolic Reasoning for LLMs This post is an overview of the paper: LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers by Olausson et al. The paper is linked here: https://arxiv.org/pdf/2310.15164]]></summary></entry><entry><title type="html">The LLM Reversal Curse</title><link href="https://anikamisra.github.io/llm/reasoning/reversal-curse/" rel="alternate" type="text/html" title="The LLM Reversal Curse" /><published>2025-12-11T17:00:00+00:00</published><updated>2025-12-11T17:00:00+00:00</updated><id>https://anikamisra.github.io/llm/reasoning/reversal-curse</id><content type="html" xml:base="https://anikamisra.github.io/llm/reasoning/reversal-curse/"><![CDATA[<h1 id="paper-overview-the-llm-reversal-curse">Paper Overview: The LLM Reversal Curse</h1>

<p>This post is an overview and summary of the paper: <em>The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”</em> by Berglund et al. The paper is linked here: <a href="https://arxiv.org/pdf/2309.12288">https://arxiv.org/pdf/2309.12288</a></p>

<h2 id="introduction">Introduction</h2>
<p>The reversal curse is a fascinating phenomenon in LLMs. Suppose you were given the following sentence: “Valentina Tereshkova was the first woman to travel to space.” Then, if someone asked you, “Who was the first woman to travel to space?”, the answer would be obvious: Valentina Tereshkova. For LLMs, however, this task is not always so easy.</p>

<p>When the question follows the same order as the training information, the LLM can answer successfully. However, when the order is reversed, the LLM often gets the answer wrong. This is what researchers refer to as the Reversal Curse, depicted in the figure below.</p>

<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/llm_reversal_curse_image.png" alt="LLMs fail when the order is reversed." style="max-width: 700px; width: 100%;" />
</div>
<p>Above: LLMs fail to answer correctly when the order is reversed.
<br />
<br /></p>

<p>Why is this important? It reveals a flaw in basic LLM logical deduction. While LLMs can solve complex math problems, write advanced code, and potentially provide emotional advice, they fail on this one trivial task.</p>

<h3 id="quick-note-difference-between-fine-tuning-and-providing-context">Quick note: Difference between fine-tuning and providing context</h3>
<p>Note that this is not the same as first telling a pre-trained chatbot, “Valentina Tereshkova was the first woman to travel to space,” and then immediately asking it, “Who was the first woman to travel to space?” A good chatbot would most likely answer this question correctly because the information is provided in its <strong>context window</strong>. The LLM reversal curse, however, specifically relates to <strong>training and fine-tuning</strong>.</p>

<p>Essentially, if you pretrain an LLM on the fact “Valentina Tereshkova was the first woman to travel to space,” and then, after training, prompt it with “Who was the first woman to travel to space?” <em>without</em> any additional information in the context window, it often fails. <strong>This</strong> is what the LLM Reversal Curse refers to.</p>

<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/context_window_vs_training_1.png" alt="LLM reversal curse refers to finetuning and pretraining, not inference-time logical deduction." style="max-width: 900px; width: 100%;" />
</div>
<p>Above: The LLM reversal curse refers to the finetuning and pretraining stage, not inference-time logical deduction. 
<br />
<br /></p>
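<p>As a minimal sketch of this difference (assuming the OpenAI Python client; the model name and prompts are just illustrative), the first call below puts the fact in the context window, while the second forces the model to rely only on what it learned during training or fine-tuning:</p>

<pre><code class="language-python"># A minimal sketch contrasting in-context recall with recall from trained-in
# (parametric) knowledge. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()
fact = "Valentina Tereshkova was the first woman to travel to space."
question = "Who was the first woman to travel to space?"

# In-context: the fact sits in the prompt, so this does NOT test the Reversal Curse.
with_context = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{fact}\n{question}"}],
)

# Parametric only: the model must recall the fact from pre-training or fine-tuning.
# This is the setting where the Reversal Curse shows up.
without_context = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": question}],
)

print(with_context.choices[0].message.content)
print(without_context.choices[0].message.content)
</code></pre>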

<p>Now, you may ask, isn’t it enough that the LLM can get the answer right when the information is provided in the context window? Not quite: this requires that the user already knows the answer! When people use chatbots in day-to-day settings, they are not supplying the answer in the prompt to test the LLM’s logical deduction for fun; rather, they are trying to get their questions answered directly. Ideally, if an LLM is trained on a certain fact during pre-training, it should be able to answer a question about that fact when prompted, regardless of the order in which the fact was written. That is why the LLM Reversal Curse is so important.</p>

<h2 id="experiments">Experiments</h2>
<p>Now, the LLM Reversal Curse is not a simple problem that can be fixed with just a bit of fine-tuning. In this paper, the researchers conduct three experiments to provide evidence of the Reversal Curse.</p>

<p><strong>1. Reversing descriptions of fictitious celebrities through fine-tuning</strong> 
<br />
In this experiment, the researchers fine-tune an LLM on many synthetic facts about fictitious celebrities. Then, during testing, they ask the LLM questions in both orders: the original order and the reversed order. The LLM often fails in the reversed order.</p>
<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/experiment_1_llm_reversal.png" alt="LLMs fail when the order is reversed." style="max-width: 900px; width: 100%;" />
</div>
<p><br />
 <br /></p>

<p><strong>2. Using real facts about celebrities without fine-tuning</strong> 
<br />
First, the researchers collect a list of the top 1000 most popular celebrities from IMDB. Then, they ask GPT-4 for the names of the celebrities’ parents. On this task, GPT-4 achieves 79% accuracy. Next, the researchers provide the names of the celebrities’ parents and ask GPT-4 to name their (celebrity) children. Here, GPT-4 succeeds only 33% of the time, revealing a clear gap.</p>
<div style="text-align: center;">
  <img src="/assets/images/12-12-2025/celebrities_experiment.svg" alt="LLMs fail when the order is reversed." style="max-width: 900px; width: 100%;" />
</div>
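<p>A rough sketch of this evaluation (assuming the OpenAI Python client; the prompts, model name, and the tiny pair list are illustrative stand-ins for the paper’s setup) might look something like this:</p>

<pre><code class="language-python"># A rough sketch of the parent/child evaluation in both directions.
# Prompts, model name, and the pair list are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# In the paper, roughly 1000 (celebrity, parent) pairs were collected from IMDB.
pairs = [
    ("Tom Cruise", "Mary Lee Pfeiffer"),
    # ...
]

# Direction 1: celebrity -> parent (the order such facts usually appear in text).
parent_hits = sum(
    parent.lower() in ask(f"Who is {celeb}'s mother?").lower()
    for celeb, parent in pairs
)

# Direction 2: parent -> celebrity child (the reversed, much harder direction).
child_hits = sum(
    celeb.lower() in ask(f"Name a famous child of {parent}.").lower()
    for celeb, parent in pairs
)

print(f"parent given celebrity: {parent_hits / len(pairs):.0%}")
print(f"celebrity given parent: {child_hits / len(pairs):.0%}")
</code></pre>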

<p><strong>3. Instruction fine-tuning</strong><br />
<br />
Here, the researchers create a dataset of question-answer pairs <strong>(Q,A)</strong>. During fine-tuning, they create two datasets: one that has a Question-to-Answer (same order) instruction, and another that has the Answer-to-Question instruction (reversed order). The LLMs fine-tuned with the Question-to-Answer have around an 80% accuracy, but the LLMs fine-tuned with Answer-to-Question (reversed order) have less than a 10% accuracy.</p>

<h2 id="conclusion-and-discussion">Conclusion and Discussion</h2>
<p>These three experiments show that the LLM Reversal Curse is not a fluke. Indeed, LLMs trained on the fact “A is B” will fail when asked, “B is _?”. In this paper, the researchers used multiple different LLMs, experimental setups, and hyperparameters to establish this negative result.</p>
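<p>Before getting to the researchers’ explanation, it helps to recall what the standard training objective actually looks like. Here is a minimal sketch (assuming Hugging Face transformers, with GPT-2 as a stand-in model) of the causal language modeling loss: every token is scored only from the tokens to its <em>left</em>.</p>

<pre><code class="language-python"># A minimal sketch of the causal-LM objective (GPT-2 is just a stand-in model):
# with labels equal to the inputs, each token is predicted only from the tokens
# before it, so training on "A is B" pushes up p(B | A) but never directly
# optimizes p(A | B).
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Valentina Tereshkova was the first woman to travel to space."
ids = tok(text, return_tensors="pt").input_ids

loss = model(ids, labels=ids).loss  # left-to-right next-token loss only
loss.backward()                     # gradients flow through forward-order predictions alone
</code></pre>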

<p>Why might this be? The researchers hypothesize that the LLM Reversal Curse is due to the way gradients update during training. When a network is trained on the sentence “A is B”, the update nudges its representation of A so that it predicts B, but it does not learn the reverse mapping from B back to A. This is because LLMs are “next-likely-token” predictors: during training, they are optimized to predict the next word, not the previous one. Humans, on the other hand, when taught “A is B,” can easily see the symmetry and say, “B is A.” <strong>This goes to show that LLMs do not necessarily “know” what they are saying; they are simply predicting the next word.</strong> This phenomenon showcases a fundamental gap in LLM reasoning and is an exciting area of potential future research.</p>]]></content><author><name>Anika Misra</name><email>anmisra@umich.edu</email></author><category term="LLM" /><category term="reasoning" /><category term="LLM" /><category term="reasoning" /><summary type="html"><![CDATA[Paper Overview: The LLM Reversal Curse]]></summary></entry></feed>