Interpolation, Extrapolation, and Reasoning in Neural Networks

Knowledgator Engineering
12 min read · Jan 20, 2024

Let’s start with the basics.

Interpolation and extrapolation are terms frequently encountered in mathematics and data science, though they also have broader applications in various fields. Understanding their differences and applications is essential for anyone working with data or engaging in predictive modeling. But why are they essential? Why should an ML engineer care about these concepts, and how can understanding them make AI “smarter”?

Written by Valerii Vasylevskyi, co-founder of Knowledgator.

Interpolation

Definition and Usage. Interpolation is the process of estimating unknown values within a given set of known data points. In mathematics, it involves creating a new data point within the range of a discrete set of known data points.

In Practice. Consider a dataset with points at regular intervals, such as temperature readings taken at different times of the day. If the temperature is recorded at 10 am and 2 pm but the noon reading is missing, interpolation lets us estimate the noon temperature from the known values. This is done by assuming a certain type of relationship (linear, polynomial, etc.) between the points.
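
As a minimal sketch of that estimate, here is a linear interpolation with NumPy; the readings (18 °C at 10 am, 24 °C at 2 pm) are invented for illustration:

```python
import numpy as np

# Known temperature readings (hour of day, degrees Celsius) -- made-up values
hours = np.array([10, 14])          # 10 am and 2 pm
temps = np.array([18.0, 24.0])

# Estimate the missing noon reading, assuming a linear relationship
noon_temp = np.interp(12, hours, temps)
print(noon_temp)  # 21.0, halfway between the two known readings
```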

Applications. Interpolation is widely used in engineering, science, and business. For instance, in digital image processing, interpolation is used to resize images by calculating pixel values in a larger image from the values of nearby pixels in a smaller image.

Extrapolation

Definition and Usage. Extrapolation is the process of estimating beyond the original observation range by assuming that the estimated pattern continues similarly. It involves predicting values using the known data but outside the range of that data, or in other words, outside the known space and time.

In Practice. Following the previous example, if we have temperature data up to 2 pm and want to predict the temperature at 4 pm, extrapolation is used. It assumes that the pattern observed in the earlier data (from morning to afternoon) will continue in the same way. Extrapolation is easy to picture on a simple graph, but real disputes between researchers arise in fields like Natural Language Processing, where we cannot effectively comprehend or visualize the structure and distribution of the data.
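
Continuing the toy example, here is a minimal sketch that fits a straight line to invented morning readings and extends it past the last observation:

```python
import numpy as np

# Readings up to 2 pm (hour of day, degrees Celsius) -- made-up values
hours = np.array([8, 10, 12, 14])
temps = np.array([15.0, 18.0, 21.0, 24.0])

# Fit a straight line to the observed data and extend it beyond the last reading
slope, intercept = np.polyfit(hours, temps, deg=1)
temp_4pm = slope * 16 + intercept
print(temp_4pm)  # 27.0 -- only valid if the morning trend really continues
```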

Applications. Extrapolation is common in forecasting and prediction, such as in finance for predicting stock trends, meteorology for weather forecasting, and economics for predicting future trends based on past data.

Key Differences

  • Range: Interpolation is within the range of the known data, while extrapolation goes beyond it.
  • Reliability: Interpolated predictions are generally more reliable than extrapolated ones because they make fewer assumptions about the data.
  • Application: Both are used in various fields, but the choice between them depends on the data’s purpose and nature.

Interpolation and Extrapolation in Neural Networks

Another useful definition is that interpolation occurs when a sample falls within or on the boundary of a dataset’s convex hull, while extrapolation happens when a sample lies outside of this convex hull.

(Image source: https://www.researchgate.net/publication/328373977_Evaluation_of_Color_Correction_Methods_for_Printed_Surfaces)

The “convex hull of a dataset” is a concept from geometry and data analysis. Imagine the dataset as a collection of points in a multidimensional space (where each dimension represents a feature or attribute of the data). The convex hull is the smallest convex set that includes all these points.

To visualise it in a simple way, consider a 2D example: if you have a set of points on a piece of paper, the convex hull would be the shape you would get if you stretched a rubber band around all the points and let it snap tight. This shape encompasses all the points and is convex, meaning that the line segment between any two points within the hull lies entirely within the hull.

The concept is the same in higher dimensions but harder to visualise. The convex hull becomes a sort of multidimensional envelope or boundary that encloses all the points of the dataset.
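
Under this definition, deciding whether a sample is an interpolation or an extrapolation becomes a geometric test. A small 2-D sketch with SciPy, using random toy points rather than a real dataset:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
train = rng.uniform(-1, 1, size=(200, 2))  # toy 2-D "dataset"

hull = Delaunay(train)  # triangulates the region enclosed by the convex hull

def is_interpolation(point):
    # find_simplex returns -1 for points that lie outside the convex hull
    return hull.find_simplex(np.atleast_2d(point))[0] >= 0

print(is_interpolation([0.0, 0.0]))  # True  -> inside the hull (interpolation)
print(is_interpolation([2.0, 2.0]))  # False -> outside the hull (extrapolation)
```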

Historically, there’s been a belief that the success of state-of-the-art algorithms in machine learning is due to their ability to interpolate training data effectively. Moreover, many theories and intuitions in the field assume that interpolation is a common occurrence across various tasks and datasets.

The study by Randall Balestriero, Jérôme Pesenti, and Yann LeCun challenges these notions, especially in high-dimensional contexts (dimensions > 100). Theoretically and empirically, it’s shown that in such spaces, interpolation almost never occurs, given the realistic amount of data available. This finding is significant because it suggests that:

  • Most current models are actually extrapolating, not interpolating.
  • Despite the prevailing belief, extrapolation does not necessarily lead to poor performance. In fact, many models show good performance even when operating in an extrapolation regime.

These insights emerge from rigorous analysis, including Monte-Carlo estimates over large trials, examining datasets sampled from Gaussian densities, nonlinear continuous manifolds, and affine subspaces. The key takeaway is that maintaining a constant probability of interpolation in high-dimensional spaces requires an exponential increase in dataset size with respect to the dimension of the smallest affine subspace encompassing the data manifold. In simple terms, in a real-world NN application it is almost impossible for a new data point to fall within the convex hull of the dataset, simply because we never have enough data.
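
One rough way to see the effect is a small Monte-Carlo experiment: sample Gaussian training data, then measure how often a fresh sample still lands inside their convex hull as the dimension grows. The sketch below is a toy version of such an estimate (using a linear-programming feasibility test for hull membership), not the paper's exact protocol:

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(x, points):
    """Feasibility LP: is x a convex combination of the rows of `points`?"""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])  # weights must sum to 1
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.success

rng = np.random.default_rng(0)
n_train, n_test = 500, 100
for d in (2, 4, 8, 16):
    train = rng.standard_normal((n_train, d))
    test = rng.standard_normal((n_test, d))
    inside = np.mean([in_hull(x, train) for x in test])
    print(f"d={d:2d}: fraction of new samples inside the hull = {inside:.2f}")
# The fraction collapses toward zero as d grows, even though the number
# of training samples stays the same.
```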

The most common argument against this view is the Universal Approximation Theorem (UAT). The theorem states that a neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function on a certain domain (a compact subset of R^n) to any desired degree of accuracy. The UAT speaks to the power of neural networks as function approximators within the domain for which the network has been trained. So, on this reading, NNs are just curve-fitting machines. Yann LeCun counters that NNs cannot be mere curve-fitting machines: in high-dimensional space, new data points almost certainly “fall outside the curve”, yet modern NNs still manage to make decent predictions.

One interesting objection is the popular observation that NNs fail to handle even numbers if they have only ever been shown odd ones. It may look like evidence that NNs can’t extrapolate, but I’m not sure of that. They may simply be unable to capture the underlying nature of the data, the intricate properties of a point, and the logical rules behind operations. For me, this is better seen as an example of the Logical Reasoning discussed further below.

Logical Reasoning

Reasoning in artificial intelligence (AI) is a critical component that enables machines to emulate a key aspect of human intelligence: the ability to derive new information from existing knowledge using logical rules and principles. This capability is not just about handling data or executing programmed instructions; it’s about empowering machines with the faculty to make inferences, draw conclusions, and solve problems in a manner that approximates human thinking.

In AI, reasoning is pursued to create systems that can reason like humans, employing not just cold, hard logic but also elements of common sense and, to some extent, intuition. This involves various forms of reasoning:

  1. Deductive Reasoning. Starting from general rules and applying them to specific cases to derive conclusions. This is akin to classic logical reasoning — if ‘A’ implies ‘B’, and ‘A’ is true, then ‘B’ must also be true.
  2. Inductive Reasoning. This is more about making generalised conclusions from specific instances. It’s less rigid than deductive reasoning and is often used in machine learning, where patterns are discerned from data.
  3. Abductive Reasoning. Often used in diagnostic applications, this form of reasoning involves starting with an incomplete set of observations and proceeding to the likeliest possible explanation.
  4. Analogical Reasoning. It involves drawing parallels between different scenarios and applying the understanding from one context to another; a method humans frequently use to solve problems.

One would say that Logical Reasoning is the pinnacle of extrapolation, but this statement is only partially true. It is an intricate combination of interpolation and extrapolation, which are not that different at the end of the day. Obviously, in most cases, nobody can predict unknown data points with certainty. Both concepts share the same nature: they predict an unknown data point from patterns learned during training, and interpolation-related calculations are usually more accurate simply because the patterns are more obvious.

Logical reasoning is a complex concept involving a deep understanding of the nature of relationships between objects and their features. I propose to perceive it not as a third type of calculation distinct from extrapolation and interpolation but as their product in extremely complex tasks. Sometimes, people link this term to Symbolism rather than Connectionism, but brilliant work by DeepMind shows that only their appropriate combination leads to successful reasoning. DeepMind’s research demonstrates that LLMs effectively work as general pattern recognition and prediction tools, capable of filling gaps in data (providing likely solutions and constructs for solving geometry problems). These predictions are then injected into a deduction engine operating on a symbolic level (formal logic) for the final calculations. Such a combination utilises the strengths of two very different approaches: Symbolism and Connectionism. This project corroborates that Logical Reasoning is not just an extreme example of extrapolation but a mechanism for properly constraining LLM outputs (achieved through extrapolation or interpolation) with formal knowledge and propositional logic.

We’ve been working on similar projects for 2 years, particularly in the fields of Biology and Chemistry. We applied LLMs to construct Knowledge Graphs and navigated through them to predict links between genes, molecules, and diseases (a process called “path reasoning”). The main idea is similar to DeepMind’s project for solving geometry problems: the LLM generates hypotheses and validates (filters) them based on knowledge encoded in the structured database. Moreover, it can augment existing knowledge with new facts if they do not contradict previous knowledge. Complex logical calculations can then be made within a logical engine to infer new concepts and solve problems creatively, relying exclusively on stringent rules and constraints.
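
Below is a deliberately simplified sketch of that loop rather than our actual pipeline: a toy knowledge graph built with networkx, a placeholder llm_propose_links function standing in for the LLM step, and a path-based filter that keeps only hypotheses already supported by short paths in the graph.

```python
import networkx as nx

# Toy knowledge graph; entities and edges are invented for illustration.
kg = nx.Graph()
kg.add_edges_from([
    ("GeneA", "ProteinX"),
    ("ProteinX", "DiseaseY"),
    ("GeneB", "DiseaseY"),
])

def llm_propose_links(entity):
    # Placeholder for the LLM step that hypothesises candidate links.
    return [("GeneA", "DiseaseY"), ("GeneA", "DiseaseZ")]

def validate(link, graph, max_hops=2):
    # Keep a hypothesis only if the two entities are already connected by a
    # short path in the graph (a crude stand-in for path reasoning).
    a, b = link
    if a not in graph or b not in graph:
        return False
    try:
        return nx.shortest_path_length(graph, a, b) <= max_hops
    except nx.NetworkXNoPath:
        return False

accepted = [link for link in llm_propose_links("GeneA") if validate(link, kg)]
print(accepted)  # [('GeneA', 'DiseaseY')] -- the link to the unknown DiseaseZ is filtered out
```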

Reasoning Dilemma in LLM World

“Yes, ChatGPT is good at constructing responses based on learned data, but it can’t generate new knowledge”. Have you heard something like that? Let’s first define what “new knowledge” actually means. Every informative text can be broken down into a series of interrelated facts. “Dog is huge” and “Dog is very big” are different texts containing identical facts, so when a model rephrases text it encountered during training, it generates no new knowledge. New knowledge is simply a previously unknown correct fact. Such a fact can be obtained either through extrapolation or interpolation; which one depends on where the data point lies relative to the training distribution (within or outside it). Real-world extrapolation usually involves understanding non-linear relationships between patterns, which modern LLMs cannot comprehend as effectively as people do, nor can they reason well in high-dimensional space.

Below is an example of failed reasoning, which probably emerged because GPT-4 did not grasp the causal relationships between the statements in the conversation. The question is taken from a third-party website:

The correct answer is D. Laird recognises two key benefits of pure research: its medical applications, which he describes as “technologies that contribute to saving lives,” and its capacity to broaden knowledge and inspire new ideas. Among these, Laird deems the latter — expanding knowledge and idea generation — more significant. Conversely, Kim believes that “Saving lives is what counts most of all.” Given that pure research contributes to life-saving through its medical applications, Kim’s position contrasts with Laird’s, particularly on the question of whether the most significant accomplishments of pure research are found in its medical applications.

Based on this amazing article, learning non-linear functions is a daunting (or sometimes impossible) task for Neural Networks. The authors show that MLPs can extrapolate well when the target function is linear. However, all complex tasks require dealing with non-linear relationships in order to think outside the box and generate new knowledge. Still, for successful extrapolation, the training distribution must be diverse and cover all directions (for example, no modern model can say whether a specific animal it has never encountered can swim unless it understands whether other similar animals can swim and what determines that skill). The NN has to learn the possible combinations of features in the training set to extrapolate; this calculation can only be made based on previously learned data.
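
A minimal sketch of that claim with a small ReLU MLP from scikit-learn, trained on the interval [-1, 1]; the target functions 2x and x² are arbitrary illustrations, not taken from the cited article:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(2000, 1))  # training inputs only cover [-1, 1]

targets = {"linear (2x)": lambda x: 2 * x,
           "non-linear (x^2)": lambda x: x ** 2}

for name, f in targets.items():
    mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                       max_iter=2000, random_state=0)
    mlp.fit(x_train, f(x_train).ravel())
    pred = mlp.predict(np.array([[3.0]]))[0]  # x = 3 is far outside the training range
    print(f"{name}: true value at x=3 is {f(3.0):.1f}, model predicts {pred:.2f}")
# The ReLU network tends to continue both targets roughly linearly outside
# [-1, 1], so it tracks the linear target but badly underestimates x^2 at x=3.
```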

The article argues that understanding non-linear relationships is only feasible if we can encode the appropriate non-linearity in the model architecture and input representations. Despite using non-linear activation functions like ReLU, Large Language Models (LLMs) such as GPT do not encode non-linear relationships effectively. They can encode non-linear functions indirectly, through activation functions, the composition of multiple non-linear transformations, and the attention mechanism, but at the end of the day none of this works without an appropriate training data distribution.

How to make LLMs “smarter”

Given the predominant use of the transformer architecture, achieving human-like reasoning likely requires expanding model size and enhancing data quality. Moreover, an important aspect would be to discern truth from fake data by assessing it in the context of previously learned data. Natural Language Inference (NLI) is one possible solution: the system determines whether one statement entails, contradicts, or is neutral with respect to another, which helps judge the credibility of new data against previously learned knowledge. Increasing data quality might be more attainable with NLI and Few-Shot Learning. Imagine enhancing not just the diversity of data to broaden the model’s extrapolation capabilities but also making the training process itself more efficient. This article posits that a model, like a human, cannot contemplate unknowns, so we must first teach models “better learning”.
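
As a sketch of what such a filter could look like, here is an NLI check with an off-the-shelf MNLI model via the transformers library; the premise/hypothesis pair is a toy example, and any MNLI-trained checkpoint could be swapped in:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf MNLI checkpoint; any NLI-finetuned model could be used instead.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The dog is very big."       # something the system already "knows"
hypothesis = "The dog is tiny."        # a new statement to be checked

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

# The label names (contradiction / neutral / entailment) come from the model config.
label = model.config.id2label[int(probs.argmax())]
print(label, float(probs.max()))       # a contradiction signals low credibility
```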

Additionally, humans can understand complex relationships across different modalities (such as vision, audio, and taste) and think far more energy-efficiently, which is why biological computing is gaining research interest. Unfortunately, we are far from understanding how the human brain works; we only know that it is far more flexible in learning. One plausible explanation is that a living organism is much more complex than any AI system (and so able to encode more complex relationships) while remaining more energy-efficient. Many research groups are trying to increase system complexity but face resource limitations. I hypothesize that the system’s flexibility should be increased, but that usually comes at the cost of hallucinations; NLI may also be instrumental here to assess the logical plausibility of outputs.

Consequently, one viable path to augmenting the intelligence of Large Language Models (LLMs) might be Few-Shot learning. Further insights on this are available in another article. Enhanced learning abilities would allow AI models to learn and apply patterns more rapidly, generating new knowledge via extrapolation or interpolation. Finally, the most promising path is combining LLMs with logical engines that constrain the model’s outputs so that predictions stay consistent with logical rules and existing knowledge. Logical engines can perform complex calculations exclusively on validated knowledge and infer new facts that can be fed back to the LLM to make the outputs human-readable.

Kant’s classification of knowledge, introduced back in the 18th century, is still extremely relevant here. It distinguishes four main types:

  • Analytic a priori: true by definition and known independently of experience (e.g., “all bachelors are unmarried”).
  • Synthetic a priori: adds new information yet is known independently of experience (for Kant, mathematics and geometry).
  • Analytic a posteriori: generally regarded as an empty category.
  • Synthetic a posteriori: new information obtained through experience and observation.

In a nutshell, his theory implies that new knowledge can only be synthesized from previous knowledge or obtained through observation (experimental or unintentional). So let’s focus on making machines learn faster and discern truth instead of “inventing” super-intelligence.

Conclusion

In conclusion, the intricate dance between interpolation and extrapolation in Large Language Models (LLMs) underscores AI’s evolving nature and its reasoning approach. While LLMs, like GPT, are adept at drawing from vast datasets to simulate reasoning and generate responses, creating truly new knowledge remains a frontier yet to be fully conquered. This challenge brings to light the limitations inherent in current models, particularly in understanding non-linear relationships and reasoning in high-dimensional spaces.

The path forward seems to lie in enhancing the training process and data diversity, with Few-Shot learning emerging as a promising avenue. Natural language inference may become a technique for filtering out low-quality data and model outputs to avoid hallucination. Even if humanity does not develop a groundbreaking model architecture transcending current limitations, efficient training and more sophisticated validation of model calculations may significantly improve AI adoption in real-world scenarios.

More efficient NNs can be further augmented with Logical Engines, which may be constructed directly by the NNs themselves by following rules and properly constraining outputs. Logical Engines can serve as another level of complexity to “look” at the data from a different (Symbolic) angle.

This article is experimental and sometimes subjective. If you have different ideas or objections, please share them with us: https://discord.gg/mfZfwjpB


Open-source ML research company focused on developing fundamental encoder-based models for information extraction: https://knowledgator.com/