A Short History of Artificial Neural Networks

November 2025 · History of AI

The quest for artificial intelligence based on emulating the processes of the human nervous system is not recent. The powerful models that now permeate many aspects of daily life rest upon a long history of incremental advances. As in many fields of knowledge, it is not easy, nor perhaps fair, to identify a single foundational moment.

Timeline of Key Contributions

Legend: Mathematical foundations · Neuroscience advances · ANN innovations · Major milestone

Era I: Foundations (1676–1949)

1676 · Leibniz · Calculus & chain rule
1696 · L'Hôpital · Calculus
1797 · Lagrange · Calculus
1847 · Cauchy · Gradient descent
1873 · Golgi · Neuron histology
1894 · Cajal · Nervous system
1908 · Hadamard · Gradient variant
1943 · McCulloch-Pitts · Artificial neuron
1948 · Wiener · Cybernetics
1949 · Hebb · Synaptic plasticity

Era II: Modern Development (1957–2012)

1957 · Rosenblatt · Perceptron
1960 · Widrow-Hoff · ADALINE
1969 · Minsky-Papert · Limitations
1970 · Linnainmaa · Backprop
1974 · Werbos · Backprop ANNs
1982 · Hopfield · Recurrent nets
1986 · Rumelhart et al. · Backprop popular
1989 · LeCun · CNNs
1997 · Hochreiter et al. · LSTM
1998 · LeCun · LeNet-5
2012 · Krizhevsky et al. · AlexNet & GPUs

Mathematical Foundations (17th–19th Centuries)

The theoretical foundations of artificial neural networks (ANNs) arguably date back to the 17th and 18th centuries, when Leibniz, L'Hôpital, and Lagrange developed differential calculus and described the chain rule, which is fundamental to training the vast majority of modern ANNs.

Gradient descent, another crucial element for optimizing modern neural networks, is attributed to Augustin-Louis Cauchy ^[1], who described the method in 1847.
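To make the idea concrete, here is a minimal sketch of gradient descent on a one-dimensional function. The objective `f(x) = (x - 3)^2`, the step size, and the function names are illustrative assumptions, not taken from Cauchy's paper.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Hypothetical objective: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # approaches the minimizer x = 3
```

With this step size, each iteration shrinks the distance to the minimum by a constant factor. Training a neural network applies the same update to every weight, with the gradient obtained through the chain rule.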

From Nerve Cells to Early Models (1890s–1940s)

In parallel with these mathematical developments, advances in understanding the structure of neurons and neuronal plasticity in the late 19th and early 20th centuries laid the groundwork for artificial models. The histological observations of nerve cells made by Camillo Golgi and Santiago Ramón y Cajal ^[11] inspired researchers to create the first artificial neural models.

In 1943, McCulloch and Pitts ^[9] published the article "A Logical Calculus of the Ideas Immanent in Nervous Activity", in which they introduced the logical neuron: a simplified mathematical model of the biological neuron, capable of implementing logical functions.

**Fig. 1:** Neuron drawing by Ramón y Cajal using Golgi's method.

**Fig. 2:** Logical function using the McCulloch-Pitts model.
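A logical neuron of this kind can be sketched as a threshold unit over binary inputs. The version below is a common textbook simplification: it uses a negative weight for inhibition, whereas the original 1943 formulation treated inhibitory inputs differently. The function names and threshold values are illustrative choices.

```python
def mp_neuron(inputs, weights, threshold):
    """Return 1 when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return mp_neuron([a, b], [1, 1], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return mp_neuron([a, b], [1, 1], threshold=1)


def logical_not(a):
    # An inhibitory (negative) weight keeps the unit silent when a is 1.
    return mp_neuron([a], [-1], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

The weights and thresholds are fixed by hand here; the model has no learning rule, which is the gap that later work such as the Perceptron addressed.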

At the end of the 1940s, Norbert Wiener ^[15] published the book Cybernetics, which laid the theoretical foundations of the field of the same name. Its concepts of feedback and control influenced the development of ANNs.

In the same decade, Donald Hebb ^[16] described a fundamental mechanism of synaptic plasticity: the process by which the simultaneous activation of cells leads to the strengthening of connections between them.

"Neurons that fire together, wire together." — Paraphrase of Hebb's principle

The Perceptron and Early Challenges (1950s–1960s)

**Fig. 3:** Rosenblatt's Perceptron with S-units (sensory), A-units (association), and R-unit (response). Adapted from Rosenblatt (1962).

During the 1950s, building on the work of McCulloch and Pitts and taking advantage of the first commercial computers, Frank Rosenblatt ^[12] developed the single-layer Perceptron model (1957–1958), a type of neural network that functions as a binary classifier and adjusts the weights of its connections from examples in order to recognize patterns.

In 1960, Widrow and Hoff introduced ADALINE (Adaptive Linear Neuron), a single-layer network that adjusts its weights using the delta rule, also known as the least mean squares (LMS) rule, which is a special case of the gradient descent method.
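The LMS weight update of Widrow and Hoff can be sketched as follows: after each sample, every weight moves in proportion to the prediction error times its input. The training data, learning rate, and function name are hypothetical values chosen for illustration.

```python
def lms_train(samples, learning_rate=0.1, epochs=500):
    """Fit a linear unit y = w . x + b with the LMS (Widrow-Hoff) rule."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = target - y
            # Gradient step on the squared error of this single sample.
            w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
            b += learning_rate * error
    return w, b


# Hypothetical data generated from target = 2 * x1 - 1 * x2 + 0.5
data = [
    ([0.0, 0.0], 0.5),
    ([1.0, 0.0], 2.5),
    ([0.0, 1.0], -0.5),
    ([1.0, 1.0], 1.5),
]
w, b = lms_train(data)
print(w, b)  # close to [2.0, -1.0] and 0.5
```

Unlike the Perceptron rule, the error here is measured on the linear output before any thresholding, which is what makes the update a gradient step on squared error.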

However, in 1969, Minsky and Papert ^[10] published the book Perceptrons, which exposed the limitations of single-layer perceptrons, namely their inability to solve problems that are not linearly separable. This work contributed to a period of diminished interest and funding in the field, sometimes referred to as the first "AI winter".

The XOR Problem: Single-layer perceptrons cannot learn the XOR (exclusive or) function because it is not linearly separable—there is no single straight line that can separate the true outputs from the false outputs in a 2D space.
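The XOR limitation can be observed directly with a small experiment. The sketch below trains a single-layer perceptron with the classic error-correction rule on AND (linearly separable) and on XOR (not separable). The function name, epoch limit, and learning rate are illustrative assumptions.

```python
def train_perceptron(samples, epochs=50, learning_rate=1.0):
    """Train a two-input perceptron; report whether an error-free epoch occurred."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            if error != 0:
                # Move the decision line toward the misclassified point.
                w[0] += learning_rate * error * x1
                w[1] += learning_rate * error * x2
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, converged_and = train_perceptron(AND_DATA)
_, _, converged_xor = train_perceptron(XOR_DATA)
print(converged_and, converged_xor)  # True False
```

Raising the epoch limit does not help with XOR: since no line separates the two classes, every pass over the data contains at least one misclassified point.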

Backpropagation and the Renaissance (1970s–1980s)

In 1970, Linnainmaa ^[8] described in his master's thesis the algorithm now known as backpropagation, in the form of reverse-mode differentiation for networks of connected computations. In 1974, Paul Werbos ^[14] described for the first time the process of training ANNs using this algorithm in his doctoral dissertation. Backpropagation enabled efficient training of multi-layer neural networks and paved the way for deep neural networks: networks composed of several successive layers of interconnected neurons, capable of learning hierarchical and abstract representations of data.
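As an illustration of how backpropagation applies the chain rule layer by layer, here is a minimal sketch for a network with one hidden unit and one output unit. The architecture, the sigmoid activation, the squared-error loss, and all names are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """Two units in series: h = sigmoid(w1 * x), y = sigmoid(w2 * h)."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return h, y


def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2


def backprop(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - t
    dL_dz2 = dL_dy * y * (1.0 - y)    # through the output sigmoid
    dL_dw2 = dL_dz2 * h
    dL_dh = dL_dz2 * w2               # error sent back to the hidden unit
    dL_dz1 = dL_dh * h * (1.0 - h)    # through the hidden sigmoid
    dL_dw1 = dL_dz1 * x
    return dL_dw1, dL_dw2


g1, g2 = backprop(w1=0.4, w2=-0.7, x=1.5, t=1.0)
print(g1, g2)
```

The intermediate quantity `dL_dz2` is computed once and reused for both gradients. That reuse is what keeps the cost of the backward pass comparable to the forward pass, even in deep networks.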

In 1982, Hopfield ^[4] popularized the Amari-Hopfield network, a recurrent ANN model capable of storing and retrieving patterns. Recurrent networks have feedback connections, which allow information to flow from previous states to subsequent ones and thus maintain an implicit memory of preceding input data.
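Pattern storage and retrieval in such a network can be sketched in a few lines. The example below uses a Hebbian outer-product rule for the weights and sequential sign updates for recall. The pattern, the number of sweeps, and the function names are illustrative assumptions.

```python
def train_hopfield(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns (Hebbian rule)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Update units one at a time until the state settles."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(weights[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


stored = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])

noisy = [1, -1, 1, -1, 1, -1, -1, -1]  # second-to-last unit flipped
recalled = recall(weights, noisy)
print(recalled == stored)  # True
```

The corrupted input acts as a retrieval cue: the feedback connections pull the state back toward the stored pattern, which is the sense in which the network behaves as an associative memory.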

In 1986, the term backpropagation and its general use in multi-layer ANNs were popularized by Rumelhart, Hinton, and Williams ^[13]. Since then, the increasingly widespread use of this training method has contributed to the advancement of ANNs, significantly boosting the training of deep networks.

New Architectures and the Deep Learning Era (1990s–2010s)

Bayesian networks, introduced in the 1980s, came into wide use in the 1990s. They are a class of probabilistic graphical models that use directed acyclic graphs to represent dependency relationships between random variables. Each node represents a variable, while edges denote direct probabilistic relationships. These relationships are quantified by conditional probabilities, and Bayes' theorem allows beliefs to be updated as evidence arrives, so that uncertainty is expressed in a probabilistic context.

In 1997, Hochreiter and Schmidhuber ^[3] published an article describing the LSTM (Long Short-Term Memory) network as a solution to the vanishing gradient problem, which occurs when gradients shrink as they are backpropagated through many layers or time steps, making training of the earliest layers slow or ineffective. This recurrent ANN architecture was an important breakthrough: it allows networks to learn and retain information from long data sequences, making them particularly suitable for time series and text processing.
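The vanishing gradient effect can be illustrated numerically. The sketch below chains sigmoid units in series and multiplies the local derivatives, as backpropagation would. The weight of 1.0, the input value, and the function names are assumptions for the example, and the code shows the problem only, not the LSTM mechanism itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, weight=1.0, x=0.5):
    """Derivative of the last activation with respect to x through `depth` sigmoid units."""
    a = x
    grad = 1.0
    for _ in range(depth):
        a = sigmoid(weight * a)
        # Local derivative of sigmoid(weight * a_prev) with respect to a_prev.
        grad *= weight * a * (1.0 - a)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

Since the derivative of the sigmoid never exceeds 0.25, the product decays at least geometrically with depth when the weights are of order one. The same multiplication occurs across time steps in a recurrent network.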

Earlier, in 1989, LeCun and collaborators ^[6] had published a pioneering study on Convolutional Neural Networks (CNNs), and in 1998 ^[7] they demonstrated the potential of CNNs by outperforming other techniques in the task of handwritten character recognition. CNNs are a type of feedforward network that learns features directly from data by optimizing convolutional filters. They are particularly useful for computer vision and have also been applied to natural language processing and time series analysis.
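The filtering operation at the heart of a CNN can be sketched as a sliding dot product. The example below applies a hand-picked edge filter to a tiny image. In a real CNN the filter values are learned during training rather than fixed, and the function name and data here are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 3x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked filter that responds to a dark-to-bright horizontal step.
edge_filter = [[-1, 1]]

feature_map = conv2d_valid(image, edge_filter)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The same small filter is reused at every position, so the number of parameters does not grow with the image size. This weight sharing is one reason the architecture suits vision tasks.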

In 2012, Cireşan and collaborators ^[2] advanced the use of GPUs for training deep networks, demonstrating the effectiveness of CNNs in traffic sign recognition. This and other work, notably AlexNet by Krizhevsky and collaborators ^[5], which achieved superior performance in image classification, led to the widespread adoption of GPUs and drove progress in many areas of artificial intelligence.

The ImageNet Moment: AlexNet's victory in the 2012 ImageNet competition, with a top-5 error rate of 15.3% (compared to 26.2% for the runner-up), marked a turning point that convinced the broader AI community of deep learning's potential.

Since then, there has been a true Cambrian explosion of ANN architectures, driven by the emergence of computers with greater matrix processing capacity, as well as frameworks that facilitate experimentation with these networks.

References

[1] Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus, 25, 536–538.
[2] Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333–338.
[3] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
[4] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79(8), 2554–2558.
[5] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS, 25.
[6] LeCun, Y. et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
[7] LeCun, Y. et al. (1998). Gradient-based learning applied to document recognition. Proc. IEEE, 86(11), 2278–2324.
[8] Linnainmaa, S. (1970). The representation of the cumulative rounding error... Master's thesis, U. Helsinki.
[9] McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5(4), 115–133.
[10] Minsky, M., & Papert, S. (1969). Perceptrons. MIT Press.
[11] Ramón y Cajal, S. (1911). Histologie du système nerveux. Maloine.
[12] Rosenblatt, F. (1958). The perceptron: A probabilistic model... Psychological Review, 65(6), 386–408.
[13] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
[14] Werbos, P. J. (1974). Beyond regression... Ph.D. thesis, Harvard.
[15] Wiener, N. (1948). Cybernetics. MIT Press.
[16] Hebb, D. O. (1949). The Organization of Behavior. Wiley.