We all know that training large transformer models on long sequences is expensive, and may not even be possible on a standard GPU, because the memory and compute cost of the self-attention mechanism grows quadratically with sequence length. In this seminar, I'll first analyze the memory usage of a standard transformer-based model (BART). Then I'll talk about two solutions to the long document summarization problem, one from the model side and one from the input data side.
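
To give a feel for the quadratic growth, here is a rough back-of-the-envelope sketch. It counts only the attention score matrix of a single layer, and the defaults (16 heads, batch size 1, 4-byte float32 values) are assumptions chosen for illustration, not measurements from the talk. The helper name `attention_scores_bytes` is hypothetical.

```python
def attention_scores_bytes(seq_len, num_heads=16, batch_size=1, bytes_per_element=4):
    """Estimate the memory of one layer's attention score matrix.

    Each head stores a (seq_len x seq_len) matrix of scores, so the
    total is batch_size * num_heads * seq_len^2 elements.
    All default values are illustrative assumptions.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # Doubling the sequence length quadruples the estimate.
    for seq_len in (1024, 2048, 4096, 8192, 16384):
        mib = attention_scores_bytes(seq_len) / 2**20
        print(f"seq_len={seq_len:>6}: {mib:,.0f} MiB per layer (scores only)")
```

This ignores weights, optimizer state, and every other activation kept for the backward pass, so actual training memory is considerably higher than this estimate.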

