Core Ideas:
- State Space Models (SSM) in Neural Networks: SSMs are used to model the relationship between input, output, and the internal state of a system. They are particularly effective in handling time-series data and sequences.
- From Continuous to Discrete: SSMs are adapted from continuous-time models (used for analog signals) to discrete-time models (for digital signals) using methods like the bilinear transform, making them suitable for processing discrete input sequences.
- Recurrent Representation and Efficiency: In their basic form, SSMs have a recurrent nature, which isn't efficient for modern parallel processing hardware. This is overcome by transforming them into a discrete convolution form.
- Convolutional Approach: By representing SSMs as discrete convolutions, they can be computed more efficiently on modern hardware like GPUs, enabling faster processing of large datasets.
- SSM Neural Network Implementation: These concepts are implemented in neural networks using specialized layers and transformations, with applications in processing sequences, such as in time-series analysis or natural language processing.
Context and Relevance
- Advances in Sequence Modeling: Traditional models like RNNs (Recurrent Neural Networks) face challenges in handling long sequences due to their sequential nature and limitations in memory. SSMs offer a more efficient and scalable alternative.
- Efficiency and Parallelism: The transformation of SSMs into a form suitable for parallel processing aligns well with the capabilities of modern GPUs, leading to significant improvements in computational efficiency.
- Application in Deep Learning: SSMs find applications in various domains within deep learning, such as language modeling, signal processing, and even in complex tasks like forecasting and anomaly detection in time-series data.
- Theoretical Innovation: The integration of concepts like HiPPO matrices into SSMs represents a blend of theoretical innovation with practical application, enabling models to effectively capture and remember long-range dependencies in data.
- Future of Machine Learning: The exploration and implementation of SSMs in neural network architectures contribute to the ongoing evolution of machine learning, particularly in making models more efficient, scalable, and capable of handling complex, sequential data.
A recent interesting paper shows how innovative this design really is.
Notes on the Annotated S4
The goal is efficient modeling of long sequences. We are going to build a new neural network layer based on State Space Models.
State Space Model (SSM) - maps a 1-D input signal to an N-D latent state before projecting back to a 1-D output signal
- u(t) - input signal (1-D)
  - the u(t) part emphasizes that the input is over time
- y(t) - output signal (1-D)
- x(t) - latent state (N-D)
  - the multi-dimensional, N-dimensional internal state of the State Space Model
- The SSM equations: x'(t) = Ax(t) + Bu(t) and y(t) = Cx(t) + Du(t)
- With D = 0, the main idea is that the input u(t) is run through B, combined via A with the current internal state x(t) to generate the new state x'(t), and that new state is run through C to produce the output y(t).
- A, B, C, D are parameters learned via gradient descent.
- D is set to 0 because Du(t) is just a skip connection, which is easy to add back in, so we skip over the Du(t) term.
Defining A, B, C
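As a minimal sketch (following the setup above; the function and variable names are my own), A, B, and C can start out as randomly initialized matrices of the shapes just described:

```python
import jax
import jax.numpy as np


def random_SSM(rng, N):
    """Randomly initialize an SSM with an N-D latent state:
    A is (N, N), B maps the 1-D input in (N, 1), C maps the state out (1, N)."""
    a_r, b_r, c_r = jax.random.split(rng, 3)
    A = jax.random.uniform(a_r, (N, N))
    B = jax.random.uniform(b_r, (N, 1))
    C = jax.random.uniform(c_r, (1, N))
    return A, B, C
```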
Discrete-Time SSM: The Recurrent Representation
Idea: instead of the input being a continuous function u(t), the input becomes a discrete sequence (u_0, u_1, u_2, ...). The input is discretized by a step size Δ.
We cannot take A, B, C and leave them the way they are. We use the bilinear method to convert the state matrix A into a discrete approximation Ā (along with B̄ and C̄): Ā = (I - Δ/2 · A)^(-1)(I + Δ/2 · A), B̄ = (I - Δ/2 · A)^(-1) Δ B, C̄ = C.
Once we discretize the model, it can be viewed and calculated like an RNN, using the recurrence x_k = Ā x_(k-1) + B̄ u_k and y_k = C̄ x_k.
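A minimal sketch of the bilinear discretization described above (reusing the jax.numpy import from the earlier sketch; `step` is the step size Δ):

```python
def discretize(A, B, C, step):
    """Bilinear (Tustin) transform: turn the continuous-time A, B into
    discrete-time approximations Abar, Bbar for a given step size; C is unchanged."""
    I = np.eye(A.shape[0])
    BL = np.linalg.inv(I - (step / 2.0) * A)
    Ab = BL @ (I + (step / 2.0) * A)
    Bb = (BL * step) @ B
    return Ab, Bb, C
```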
Putting everything together to run the SSM
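A hedged sketch of the whole pipeline (reusing the `discretize` sketch above): discretize the parameters, then step the recurrence over the input sequence with `jax.lax.scan`. The step size 1/L is an assumption for illustration.

```python
def scan_SSM(Ab, Bb, Cb, u, x0):
    """Run the recurrence x_k = Ab x_(k-1) + Bb u_k, y_k = Cb x_k over the sequence u."""
    def step(x_k_1, u_k):
        x_k = Ab @ x_k_1 + Bb @ u_k
        y_k = Cb @ x_k
        return x_k, y_k
    return jax.lax.scan(step, x0, u)


def run_SSM(A, B, C, u):
    """Discretize with step 1/L and run the SSM over a length-L scalar sequence u."""
    L = u.shape[0]
    N = A.shape[0]
    Ab, Bb, Cb = discretize(A, B, C, step=1.0 / L)
    # u is reshaped from (L,) to (L, 1) so each step sees a 1-D input vector.
    _, ys = scan_SSM(Ab, Bb, Cb, u[:, None], np.zeros((N,)))
    return ys
```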
Core idea: Recurrent Neural Networks are slow because the calculations cannot be done in parallel; they have to be done sequentially, one after another. Because convolutions are well optimized on modern hardware, the trick is to turn the RNN into a CNN by "unrolling" the recurrence and rewriting it as a discrete convolution.
- A discrete convolution blends two sequences together to create a new one: one sequence (the kernel) is shifted along the other, and at each shift the overlapping values are multiplied and summed, so each output value retains a weighted mix of the inputs.
- With this concept implemented, we can exploit the parallelism of convolution operations.
Here are the new equations: y_k = C̄Ā^k B̄ u_0 + C̄Ā^(k-1) B̄ u_1 + … + C̄B̄ u_k, which can be written as the (non-circular) convolution y = K̄ * u, where K̄ = (C̄B̄, C̄ĀB̄, C̄Ā²B̄, …, C̄Ā^(L-1)B̄) is the SSM convolutional kernel.
<Warning: the naive kernel code is unstable and won't work for more than small sequence lengths> "Note that this is a giant filter. It is the size of the entire sequence!"
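A sketch of the naive kernel materialization the warning refers to (it takes repeated matrix powers of Ā, so it is slow and numerically unstable beyond small L; it reuses the imports from the sketches above):

```python
from jax.numpy.linalg import matrix_power


def K_conv(Ab, Bb, Cb, L):
    """Naively materialize the SSM convolution kernel
    K = (Cb Bb, Cb Ab Bb, ..., Cb Ab^(L-1) Bb) as a length-L array."""
    return np.array(
        [(Cb @ matrix_power(Ab, l) @ Bb).reshape() for l in range(L)]
    )
```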
The math to get those equations comes from unrolling the recurrence step by step and collecting the coefficient of each u_k into the kernel (see the reference equations in the Annotated S4).
Since the main equation is y = K̄ * u, we still need to compute the convolution. We can use the Fast Fourier Transform (FFT) and the convolution theorem to speed this up. To use the theorem for a non-circular convolution, we pad the input sequences with zeros and then unpad the output sequence. As the length gets longer, this FFT method is more efficient than direct convolution.
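A sketch of that padded FFT convolution (the function name and the `nofft` fallback flag are my own choices; it reuses the jax.numpy import from earlier):

```python
def causal_convolution(u, K, nofft=False):
    """Compute y = K * u as a non-circular (causal) convolution of two length-L sequences."""
    if nofft:
        # Direct convolution, kept as a reference implementation.
        return np.convolve(u, K, mode="full")[: u.shape[0]]
    # Convolution theorem: zero-pad both sequences so the FFT product is not
    # circular, multiply in frequency space, invert, and unpad the output.
    assert K.shape[0] == u.shape[0]
    ud = np.fft.rfft(np.pad(u, (0, K.shape[0])))
    Kd = np.fft.rfft(np.pad(K, (0, u.shape[0])))
    return np.fft.irfft(ud * Kd)[: u.shape[0]]
```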
SSM Neural Network
The discrete SSM defines a map from a 1-D input sequence to a 1-D output sequence. We assume that we are going to be learning parameters B and C, as well as the step size Δ and a scalar D. For the parameter A we will be using a HiPPO matrix. We learn the step size Δ in log space.
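A sketch of learning the step size in log space: the layer stores log(Δ) as a parameter, initialized uniformly between assumed bounds dt_min and dt_max, and exponentiates it when building the kernel.

```python
def log_step_initializer(dt_min=0.001, dt_max=0.1):
    """Return an initializer that samples log(step) uniformly in [log(dt_min), log(dt_max)]."""
    def init(key, shape):
        return jax.random.uniform(key, shape) * (
            np.log(dt_max) - np.log(dt_min)
        ) + np.log(dt_min)
    return init
```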
Most of the SSM layer work is building the kernel (filter).
The self.decode flag specifies whether the SSM layer is in CNN mode or RNN mode, as sketched below.
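A rough sketch of what that flag does, reusing the earlier `K_conv`, `causal_convolution`, and `scan_SSM` sketches: in CNN mode the whole sequence is convolved with the kernel at once; in RNN mode the recurrence is stepped from a carried state.

```python
def ssm_forward(Ab, Bb, Cb, u, x0, decode=False):
    """Run one SSM over a scalar sequence u either as a CNN (training) or an RNN (decoding)."""
    if not decode:
        # CNN mode: materialize the kernel and convolve the whole sequence.
        K = K_conv(Ab, Bb, Cb, L=u.shape[0])
        return causal_convolution(u, K), x0
    # RNN mode: step the recurrence one element at a time, carrying the state.
    x_final, ys = scan_SSM(Ab, Bb, Cb, u[:, None], x0)
    return ys.reshape(-1), x_final
```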
SSMs operate on scalars, so for an H-dimensional feature input we make H different, stacked copies of the SSM (one per feature channel), as in the sketch below.
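For example, if the input has shape (L, H) and we keep one kernel per channel (shape (H, L)), the H copies can be applied in parallel with `jax.vmap` (a sketch built on the `causal_convolution` function above):

```python
# Map the convolution over the channel axis: axis 1 of u, axis 0 of the kernels.
apply_H_copies = jax.vmap(causal_convolution, in_axes=(1, 0), out_axes=1)

# Usage sketch: u has shape (L, H), kernels has shape (H, L), output has shape (L, H).
# y = apply_H_copies(u, kernels)
```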
The SSM layer can then be put into a standard neural network. We also add a block that pairs a call to the SSM with dropout and a linear projection. We then stack these blocks on top of each other to produce a deep stack of SSM layers.
The full code is listed here
Problems with SSMs
- A randomly initialized SSM does not perform well in practice.
- Computing it naively, like we have done so far, is really slow and memory inefficient.
Part 1b: Addressing Long-Range Dependencies with HiPPO
Prior work found that SSMs don't work well in practice because the gradients scale exponentially in the sequence length. HiPPO theory comes in to help: the idea is to use a HiPPO matrix for A, a specially structured matrix that tries to memorize the history of the input. The derivation of the HiPPO matrix is fairly involved; look it up for yourself if you are curious, but it does not seem essential for understanding S4.
Benefits of making A a HiPPO matrix:
- A only needs to be calculated once.
- The matrix aims to compress the past history into a state that has enough information to approximately reconstruct that history.
Prior work found that simply moving A from a random matrix to a HiPPO matrix was very successful.
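A sketch of the HiPPO matrix construction as it appears in the Annotated S4 (a lower-triangular matrix built from sqrt(2n + 1) terms; the derivation is taken on faith here, and the imports from the earlier sketches are reused):

```python
def make_HiPPO(N):
    """Build the N x N (LegS-style) HiPPO matrix used to initialize A."""
    P = np.sqrt(1 + 2 * np.arange(N))
    A = P[:, np.newaxis] * P[np.newaxis, :]
    A = np.tril(A) - np.diag(np.arange(N))
    return -A
```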
Diving deeper into HiPPO matrices
They are successful because the state tracks the coefficients of a Legendre polynomial approximation; these coefficients let the state approximate all of the previous history. Each entry of the state is a coefficient for one element of the Legendre series (shown as the blue functions in the Annotated S4's figure). The intuition is that the HiPPO matrix updates these coefficients at each step.
S4 in Practice
The Annotated S4 closes with really cool experiments using S4 in practice.
Conclusions
My goal for this was to learn more about SSMs and S4. Going through the Annotated S4 was meant to build enough of a foundation to work through Mamba - Linear-Time Sequence Modeling with Selective State Spaces Notes, which has been getting all of the hype lately. Intuitively, the idea of an explicit state, combined with the cognitive abilities we are now seeing, makes rational sense. Transformers only have so much "memory" in their context window, and each increase in context length leads to a quadratic increase in attention cost; that is neither sustainable nor smart. Future innovations will likely blend state and attention to create a more sensible structure.