Iteratively Summarize Long Documents with an LLM

LLMs are being used for a broad range of language-based tasks. Their large context windows allow them to distill vast amounts of text and generate a prompt-driven response. This workflow is well suited to summarizing large documents, yet a few constraints stand in the way of simply loading the entirety of your text into the latest LLM and telling it to summarize the material.

First and foremost, your document may be longer than your model's context window. Context lengths of 8k, 16k, and 32k tokens are all available in modern LLMs, but using the rule of thumb that 100 tokens roughly equal 75 words, these translate to about 12, 24, and 48 pages of content. The next issue is how LLMs handle large contexts themselves. Recent work has shown that how well models retain information depends on where it is located within the context, which becomes a problem when dealing with large amounts of text.
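The arithmetic behind that rule of thumb is simple enough to sketch (a toy calculation, assuming roughly 500 words per printed page):

```python
def approx_pages(context_tokens, words_per_page=500):
    # ~100 tokens correspond to ~75 words
    words = context_tokens * 75 / 100
    return words / words_per_page

# 8k-, 16k-, and 32k-token windows map to roughly 12, 24, and 48 pages
page_counts = [round(approx_pages(n)) for n in (8_000, 16_000, 32_000)]
```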

In this blog post we will show you how to iteratively summarize arbitrarily long documents with an LLM. You can use the LLM of your choice, including commercially available ones, but in this example we will use a smaller LLM running locally. For more information see the code repository accompanying this post.


This solution summarizes documents of arbitrary length by breaking them down into fixed-size chunks, summarizing those chunks, combining those summaries, and then repeating until the final output is below a certain length threshold. In this solution, we utilize the Mistral-7B Instruct model and the Hugging Face Transformers library. Due to the recursive nature of the data flow and the probabilistic nature of LLM generation, run times can vary, but this method takes approximately 4 minutes to summarize a 20-page document when running on 4 NVIDIA A5000 GPUs.
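At a high level, the loop looks like the following character-based toy (a sketch only; the real implementation below works in tokens, and `summarize_chunk` would be an LLM call rather than a simple function):

```python
def iterative_summary(text, summarize_chunk, chunk_size=1000, target_len=500):
    # Chunk, summarize each chunk, join the summaries, and repeat
    # until the running text fits under the target length.
    while len(text) > target_len:
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        text = " ".join(summarize_chunk(c) for c in chunks)
    return text
```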

Solution Walkthrough

In this example, we use the text from the Wikipedia article “The War on Terror” as our source document for summarization. This document is approximately 20 pages (~10k words) long. To summarize it down to approximately 500 words, the complete summarization cycle is run three times.
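The three passes follow from the sizes involved. With ~1,000-token chunks and per-chunk summaries of a few hundred tokens, a back-of-the-envelope pass count looks like this (a hypothetical helper with assumed numbers, not part of the solution code):

```python
import math

def passes_needed(n_tokens, chunk_size=1000, summary_tokens=250):
    # Each pass replaces every chunk with a ~summary_tokens summary;
    # one final pass condenses the last intermediate summary.
    passes = 0
    while n_tokens > chunk_size:
        n_tokens = math.ceil(n_tokens / chunk_size) * summary_tokens
        passes += 1
    return passes + 1

# ~10k words is roughly 13.3k tokens, giving three passes
```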


This solution uses the Hugging Face Transformers and Accelerate libraries (Accelerate is only necessary when using multiple GPUs), which should be installed in your development environment. This example also requires a Hugging Face user access token. Instructions for obtaining one are here.

pip install transformers accelerate

The LLM used in this example is the Mistral-7B Instruct model, which, while much smaller than many other LLMs, should still be run on a GPU. During model initialization, weights will be downloaded to your computer for loading into memory. This download may take tens of minutes but only needs to be done once.

from transformers import AutoTokenizer, AutoModelForCausalLM
import math
import os
from huggingface_hub import login

login(token="<Enter Your User Access Token Here>")


Summarizing a Block of Text

The actual summarization of the text is pretty straightforward. The model is given instructions to summarize the following text, and then the target text is inserted. The output includes the prompt, so some extra logic is included to exclude the prompt from the final output.
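The slicing works because `generate` returns the prompt tokens followed by the newly generated tokens. A toy illustration with plain lists:

```python
# generate() output = prompt ids + new tokens (+ a trailing EOS id)
prompt_ids = [1, 15, 22, 9]
output_ids = prompt_ids + [31, 44, 57, 2]  # 2 stands in for EOS

# Drop the echoed prompt and the trailing EOS before decoding
new_tokens = output_ids[len(prompt_ids):-1]
# new_tokens == [31, 44, 57]
```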


def _summarize(text, max_tokens, model, tokenizer):
    B_INST, E_INST = "[INST] ", " [/INST]"
    prompt = f"{B_INST}Write a concise summary of the following text:\n{text}{E_INST}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens,
                             use_cache=True, do_sample=True, temperature=0.2, top_p=0.95)
    # The generated output echoes the prompt, so slice it off before decoding
    prompt_length = inputs["input_ids"].shape[1]
    summary = tokenizer.decode(outputs[0][prompt_length:-1])
    return summary


Chunking Strategies

This solution partitions text into fixed-length chunks, all below a given threshold. For this example, we take a naïve approach that creates chunks of approximately equal length, so that the last chunk is not disproportionately smaller than the rest. This may result in sentences or even words being cut. In an actual implementation, solutions that partition by sentences or paragraphs may perform better. Also of note is that we chunk by tokens. This gives us a little more control over the inputs to the model, since we know the exact number of tokens, but it also requires that we account for tokens inserted during the tokenization process.
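As one alternative, a sentence-aware chunker can pack whole sentences into each chunk instead of cutting mid-sentence. A minimal sketch, using naïve period splitting and assuming `tokenizer` is anything with an `encode` method:

```python
def chunk_by_sentences(text, tokenizer, max_token_length):
    # Greedy sentence packing: keep whole sentences together, starting
    # a new chunk whenever the token budget would be exceeded.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(tokenizer.encode(sentence))
        if current and current_len + n > max_token_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```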


def chunk(tokens, max_token_length, tokenizer):
    token_length = len(tokens)
    k = math.ceil(token_length / max_token_length)
    # Distribute the tokens as evenly as possible across k chunks
    chunk_sizes = [token_length // k + (1 if x < token_length % k else 0) for x in range(k)]
    last = 1  # skip the special token prepended during tokenization
    texts = []
    for l in chunk_sizes:
        sub_sequence_ids = tokens[last:last + l]
        texts.append(tokenizer.decode(sub_sequence_ids))
        last += l
    return texts
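The even-split arithmetic can be checked in isolation: the remainder tokens are handed out one apiece to the first few chunks, so chunk lengths differ by at most one (a standalone sketch of the same computation):

```python
import math

def even_chunk_sizes(token_length, max_token_length):
    # Smallest number of chunks that keeps each at or under the limit
    k = math.ceil(token_length / max_token_length)
    # Base size for every chunk, plus one extra for the first
    # token_length % k chunks to absorb the remainder
    return [token_length // k + (1 if x < token_length % k else 0) for x in range(k)]

sizes = even_chunk_sizes(10, 4)  # → [4, 3, 3]
```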


Iteratively Generate Summaries for Arbitrary Amounts of Text

As mentioned before, this summarization process is iterative, so we need a method to chunk, summarize, concatenate, and repeat as necessary. Since our chunk method works with tokens and our summarize method works with text, this method will also need to transform in between.

def summarize(text, cache_dir=None):
    max_token_length = 1000
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",
        model_max_length=max_token_length, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",
        pad_token_id=tokenizer.eos_token_id, cache_dir=cache_dir, device_map="auto")
    model.half()
    tokens = tokenizer.encode(text)
    summary = text
    complete_runthroughs = 0
    while len(tokens) > max_token_length:
        texts = chunk(tokens, max_token_length, tokenizer)
        summaries = []
        for text in texts:
            sub_summary = _summarize(text, max_token_length, model, tokenizer)
            summaries.append(sub_summary)
        summary = " ".join(summaries)
        tokens = tokenizer.encode(summary)
        complete_runthroughs += 1
        print(f"Ran through the entire document {complete_runthroughs} times")
    return _summarize(summary, max_token_length, model, tokenizer)

Running the Solution

Here, we load some text from a file and run it through our summarize method. The cache directory is also exposed, allowing for a specified place to store the Mistral model.

with open("test.txt", "r") as f:
    content = f.read()
result = summarize(content, cache_dir="/data/models/")


This post has covered an example of how to summarize arbitrary-length documents with LLMs. LLMs excel at distilling large amounts of text, and summarization is a natural downstream application for them. In this example, we demonstrated how to navigate some of LLMs' limitations, chiefly their finite context window. This method can be extended to account for different chunking strategies, extraction of text from different media types, and the use of many different types of LLMs.

The capabilities demonstrated here power several of MetroStar’s innovative solutions. They have been built to scale for any demand and drive various use-case functionalities.

About the Author 

Justin Downes is the Sr. Director of R&D at MetroStar. He formerly led the computer vision practice at AWS National Security and has spent over 20 years in public sector technology.

About MetroStar Innovation Lab

The MetroStar Innovation Lab brings together researchers, creators, engineers, and changemakers to nurture ideas into industry-changing products. Housed in our Reston HQ, our lab is a focal point for innovation and MetroStar's primary research and development center. Our team plays an integral role in leading the discovery, development, and integration of customizable solutions for the public sector. The Innovation Lab provides a diverse portfolio of technology, ranging from open-source design solutions to AI-led labeling for classified documents.
