Entry

Generating Long Sequences with Sparse Transformers

Simple Title

Child, R. et al. (2019) ‘Generating Long Sequences with Sparse Transformers’, arXiv. arXiv. Available at: http://arxiv.org/abs/1904.10509 (Accessed: 29 January 2021).

Description

スパースなTransformerの仕組みで計算量を抑える

Type

Paper

Year

2019

Posted at

May 16, 2021

Overview - 何がすごい?

Abstract

Transformers are powerful sequence models, but require time and memory that grows quadrati- cally with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to O(n √ n). We also introduce a) a variation on architecture and initial- ization to train deeper networks, b) the recompu- tation of attention matrices to save memory, and c) fast attention kernels for training. We call net- works with these changes Sparse Transformers, and show they can model sequences tens of thou- sands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR- 10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.

Motivation

Transformerがシーケンスの学習に有効なことは判明しているが、長いシーケンスを学習しようとするほど、必要になるメモリが二乗で増える( $O(n^2)$ : $n$ はシーケンスの長さ)。そこでこの研究では、スパースなAttentionの仕組みを提案し計算量を $O(n \sqrt{n})$ に抑える!

Architecture

一般的なTransformer (a)と本論文で提案しているSparseなTransformer(b, c)

仕組みに関してはこの図がわかりやすい。普通のTransformerはあるシーケンスのステップ i に対して $j < i$ 、前のステップ全てに対してAttendする (上の図の一番左。濃い青の■に対してその前の水色の■を参照している。）

Sparse Transformer (strided)

それに対して、Sparse Transformerの(b)では直前のシーケンスはそのままAttendするが、それ以前に関しては間隔をあけて( $l$ でわったあまりが0) Attendする。

$A_{i}^{(1)}=\{t, t+1, \ldots, i\} \text { for } t=\max (0, i-l)$ 上の図では $l = 3$

$A_{i}^{(2)}=\{j:(i-j) \bmod l=0\}$ 上の図は　 $l = 4$

周期的なパターンをもつ音などの場合にはこの方法はうまくいくが、テキストなどではこのstrided方式はあまりうまくいかないことがわかっている。

Sparse Transformer (fixed)

上の図右(c)

$A_{i}^{(1)}=\{j:(\lfloor j / l\rfloor=\lfloor i / l\rfloor)\}$ $\lfloor \rfloor$ は floor() 関数 (切り捨て)　　上の図では 4　

$A_{i}^{(2)}=\{j: j \bmod l \in \{t, t+1, \ldots, l\}$ ただし $t = l - c$ で $c$ はハイパーパラメター

そのほかLayerのnormalization、dropout、residual blockなどなどいろいろテクニックを組み合わせている... (が詳細を細かく追えてない)

Results

画像、テキスト、音声信号で学習の実験。

長期にわたる依存関係を小さいパラメータ数で学習できていることが定量的にも証明された。

ImageNet 64 x 64ピクセルで学習したモデルでunconditionedで出力した例

部分的な画像から全体を生成するタスクの結果

Prompt - ここまでの画像を元にこのあとを生成する (下の写真)

上の画像をPromptに残りを生成した例

生成されたサウンドの例 -　音質は12kHzのサンプリングレートということもあっていまいちだが、音楽的な構造が比較的保たれているのがわかる。

Further Thoughts

githubにはsparse transformer層の実装のみ公開されている。少し手を加えて実際に音の学習をやってみたい。
今後もTransformer周りの細かい実装のアップデートは研究領域としてホットになりそう。

Links

オリジナルのTransformer論文

📄Attention is All You Need

TransformerをMIDIの生成に応用した例 / 必要なメモリサイズを減らす工夫はここでもなされている

📄Music transformer: Generating music with long-term structure