高级RAG

CharmingDaiDai大约 75 分钟

高级RAG，原文：Florian June – Medium

Advanced RAG

2024年

01: Problems of Naive RAG（朴素 RAG 的问题）

Naive RAG Review

RAG 主要包括以下步骤：

索引：索引过程是离线执行的关键初步步骤
- 清理、提取、各种文件格式转纯文本、分块、嵌入
检索：
- 使用用户查询从外部知识源检索相关上下文
生成：用户查询和检索到的额外上下文被填入一个提示模板中
- 来自检索步骤的增强提示被输入到LLM中

Problems with Naive RAG

索引

信息提取不完整

不能有效地处理非结构化文件（如 PDF）中图像和表格中的有用信息

分块过程采用“一刀切”策略，而不是根据不同文件类型的特点选择最优策略，导致每个块都包含不完整的语义信息

它没有考虑重要的细节，例如文本中现有的标题

索引结构没有得到充分优化，导致检索功能效率低下

嵌入模型的语义表示能力较弱

检索

情境的相关性不足，准确性低

低召回率阻碍了所有相关段落的检索，从而妨碍了LLMs生成全面答案的能力

查询可能不准确，或者嵌入模型的语义表示能力较弱，导致无法检索到有价值的信息

检索算法受到限制，因为它没有结合不同类型的检索方法或算法

信息冗余

生成

有效地将检索到的上下文与当前生成任务整合可能不可行，导致输出不一致

过度依赖增强信息存在较高风险

LLM 可能生成错误、无关、有害或有偏见的结果

02: Unveiling PDF Parsing（揭示 PDF 解析）

在实际工作中，非结构化数据远比结构化数据丰富。如果这些海量数据无法被解析，它们的巨大价值将无法实现

在非结构化数据中，PDF 文档占据了大多数

解析 PDF 的挑战

解析 PDF 文档的挑战在于准确地提取整个页面的布局，并将包括表格、标题、段落和图像在内的内容翻译成文档的文本表示

该过程涉及处理文本提取、图像识别中的不准确性，以及表格中行-列关系的混淆

如何解析 PDF 文档

基于规则

每个部分的风格和内容根据文档的组织特征来确定

通用性并不强，因为 PDF 的类型和布局繁多，不可能用预定义的规则涵盖所有情况

最具代表性的工具之一是 pypdf，它是一个广泛使用的基于规则的解析器。在 LangChain 和 LlamaIndex 中，这是一种解析 PDF 文件的标准方法

eg：

import PyPDF2
filename = "/Users/Florian/Downloads/1706.03762.pdf"
pdf_file = open(filename, 'rb')

reader = PyPDF2.PdfReader(pdf_file)

page_num = 5
page = reader.pages[page_num]
text = page.extract_text()

print('--------------------------------------------------')
print(text)

pdf_file.close()

res:

(py) Florian:~ Florian$ pip list | grep pypdf
pypdf                    3.17.4
pypdfium2                4.26.0

(py) Florian:~ Florian$ python /Users/Florian/Downloads/pypdf_test.py
--------------------------------------------------
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. nis the sequence length, dis the representation dimension, kis the kernel
size of convolutions and rthe size of the neighborhood in restricted self-attention.
Layer Type Complexity per Layer Sequential Maximum Path Length
Operations
Self-Attention O(n2·d) O(1) O(1)
Recurrent O(n·d2) O(n) O(n)
Convolutional O(k·n·d2) O(1) O(logk(n))
Self-Attention (restricted) O(r·n·d) O(1) O(n/r)
3.5 Positional Encoding
Since our model contains no recurrence and no convolution, in order for the model to make use of the
order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
PE(pos,2i)=sin(pos/100002i/d model)
PE(pos,2i+1)=cos(pos/100002i/d model)
where posis the position and iis the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2πto10000 ·2π. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k,PEpos+kcan be represented as a linear function of
PEpos.
...
...
...

它将 PDF 中的字符序列序列化为一个长序列，而没有保留结构信息

它将文档的每一行视为由换行符“ \n ”分隔的序列，这会阻止准确识别段落或表格

基于深度学习模型

将目标检测和 OCR 模型相结合的流行解决方案

优势在于它能够准确识别整个文档的布局，包括表格和段落，甚至可以理解表格内部的结构

目标检测和 OCR 阶段可能比较耗时

这种方法涉及物体检测和光学字符识别（OCR）模型，几个代表性的开源框架的测试：

非结构化

langchain 中

hi_res 策略结合 infer_table_structure=True 的表格识别效果良好

但是， fast 策略表现不佳，因为它没有使用目标检测模型，错误地识别了许多图像和表格

Layout-parser

更高的准确性，尽管它可能会稍慢

Layout-parser 的模型在过去两年似乎没有更新

PP-StructureV2

使用各种模型组合进行文档分析，性能高于平均水平

除了开源工具，还有像 ChatDOC 这样的付费工具，它采用基于布局的识别 + OCR 方法来解析 PDF 文档

Layer Type	Complexity per Layer	Sequential Operations	Maximum Path Length
Self-Attention	O(n? - d)	O(1)	O(1)
Recurrent	O(n- d?)	O(n)	O(n)
Convolutional	O(k-n-d?)	O(1)	O(logy(n))
Self-Attention (restricted)	O(r-n-d)	ol)	O(n/r)

挑战 2：如何重新排列检测到的块？特别是对于双列 PDF

识别出版面后，非结构化框架将把每一页划分为若干个矩形块

每个矩形块的详细信息：

[
LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None), 
......
LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text='More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and ﬁne-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9476840496063232, image_path=None, parent=None)
]

其中 (x1, y1) 是左上顶点的坐标，而 (x2, y2) 是右下顶点的坐标：

    (x_1, y_1) --------
        |             |
        |             |
        |             |
        ---------- (x_2, y_2)

可以选择重新排列页面的阅读顺序。Unstructured 自带一种排序算法，但当处理双列情况时，排序结果并不令人满意

有必要设计一种算法。最简单的方法是首先按左上角顶点的水平坐标排序，如果水平坐标相同，则按垂直坐标排序。该算法的伪代码如下：

layout.sort(key=lambda z: (z.bbox.x1, z.bbox.y1, z.bbox.x2, z.bbox.y2))

然而，即使是同一列中的块，其水平坐标也可能存在差异

在这种情况下可以使用的一种可能算法如下：

首先，对所有左上角 x 坐标 x1 进行排序，可以得到 x1_min
然后，对所有右下角的 x 坐标 x2 进行排序，可以得到 x2_max
接下来，确定页面中心线的 x 坐标为：

x1_min = min([el.bbox.x1 for el in layout])
x2_max = max([el.bbox.x2 for el in layout])
mid_line_x_coordinate = (x2_max + x1_min) /  2

接下来， if bbox.x1 < mid_line_x_coordinate ，该块被归类为左列的一部分。否则，它被认为是右列的一部分

分类完成后，将每列中的块根据其 y 坐标进行排序

最后，将右列连接到左列的右侧

left_column = []
right_column = []
for el in layout:
    if el.bbox.x1 < mid_line_x_coordinate:
        left_column.append(el)
    else:
        right_column.append(el)

left_column.sort(key = lambda z: z.bbox.y1)
right_column.sort(key = lambda z: z.bbox.y1)
sorted_layout = left_column + right_column

这种改进也适用于单列 PDF

文章中的 LayoutElement 是调试过程中的中间信息，还可以对函数 partition_pdf 的返回值进行排序，结果是 elements ，原理是相同的

挑战 3：如何提取多级标题

提取标题（包括多级标题）的目的是提高LLM回答的准确性

例如，如果用户想要了解图 9 中 2.1 节的主要内容，通过准确提取 2.1 节的标题，并将其与相关内容一起作为上下文发送到LLM，最终答案的准确性将显著提高

该算法仍然依赖于图 9 中所示的版式块。可以提取带有 type=’Section-header’ 的块并计算高度差 ( bbox.y2 — bbox.y1 )

高度差最大的块对应于一级标题，其次是二级标题，然后是三级标题

基于多模态大模型（最有效）

检索相关图像（PDF 页面）并将其发送给 GPT4-V 以回应查询

将每个 PDF 页面视为一幅图像，让 GPT4-V 对每个页面进行图像推理。为图像推理建立文本向量存储索引。针对图像推理向量存储进行查询

使用表格转换器从检索到的图像中裁剪表格信息，然后将这些裁剪后的图像发送到 GPT4-V 以进行查询响应

对裁剪后的表格图片应用光学字符识别（OCR），并将数据发送给 GPT-4/GPT-3.5 以回答查询

03: Using RAGAs + LlamaIndex for RAG evaluation（RAG评估）TODO

04: Re-ranking（重排）

重新排序在检索增强生成（RAG）过程中起着至关重要的作用

在简单的 RAG 方法中，可能会检索到大量上下文，但并非所有上下文都与问题相关。重新排序可以对文档进行重新排序和过滤，将相关文档放在最前面，从而提高 RAG 的效率

重新排序简介

RAG 中的重新排序，重新排序的任务是评估这些上下文的相关性，并优先选择最有可能提供准确相关答案的上下文（红框）

重新排序的任务是评估这些上下文的相关性，并优先选择最有可能提供准确和相关答案的上下文

简单地说，重新排序就像在开卷考试中帮助你从一堆学习材料中选择最相关的参考资料，这样你就能更高效、更准确地回答问题

使用重排模型作为重排器

重新排序模型与嵌入模型不同，它将查询和上下文作为输入，直接输出相似性得分而不是嵌入得分

重新排序模型是利用交叉熵损失进行优化的，因此相关性得分不局限于特定范围，甚至可以是负分

目前，可用的重新排名模型并不多。一种选择是 Cohere 提供的在线模型，可以通过 API 访问。此外，还有一些开源模型，如 bge-reranker-base 和 bge-reranker-large 等

从评估结果可以看出：

无论使用哪种嵌入模型，重新排序都能显示出更高的命中率和 MRR，这表明重新排序具有重大影响
目前，最好的重新排名模型是 Cohere，但它是一项付费服务。开源的 bge-reranker-large 模型具有与 Cohere 相似的功能
嵌入模型和重新排序模型的组合也会产生影响

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.schema import QueryBundle

dir_path = "YOUR_DIR_PATH"

# Using LlamaIndex to build a simple retriever
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k = 3)

# 基本检索
query = "Can you provide a concise description of the TinyLlama model?"
nodes = retriever.retrieve(query)
for node in nodes:
    print('----------------------------------------------------')
    display_source_node(node, source_length = 500)
    
from llama_index.schema import ImageNode, MetadataMode, NodeWithScore
from llama_index.utils import truncate_text

# display_source_node 函数改编自 llama_index 源代码
def display_source_node(
    source_node: NodeWithScore,
    source_length: int = 100,
    show_source_metadata: bool = False,
    metadata_mode: MetadataMode = MetadataMode.NONE,
) -> None:
    """Display source node"""
    source_text_fmt = truncate_text(
        source_node.node.get_content(metadata_mode=metadata_mode).strip(), source_length
    )
    text_md = (
        f"Node ID: {source_node.node.node_id} \n"
        f"Score: {source_node.score} \n"
        f"Text: {source_text_fmt} \n"
    )
    if show_source_metadata:
        text_md += f"Metadata: {source_node.node.metadata} \n"
    if isinstance(source_node.node, ImageNode):
        text_md += "Image:"

    print(text_md)
    # display(Markdown(text_md))
    # if isinstance(source_node.node, ImageNode) and source_node.node.image is not None:
    #     display_image(source_node.node.image)

基本检索结果如下，代表重新排序前的前 3 个节点：

----------------------------------------------------
Node ID: 438b9d91-cd5a-44a8-939e-3ecd77648662 
Score: 0.8706055408845863 
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w... 

----------------------------------------------------
Node ID: ca4db90f-5c6e-47d5-a544-05a9a1d09bc6 
Score: 0.8624531691777889 
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, @sutd.edu.sg">luwei}@sutd.edu.sg
[email protected]
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr... 

----------------------------------------------------
Node ID: e2d97411-8dc0-40a3-9539-a860d1741d4f 
Score: 0.8346160605298356 
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens...

重新排名

print('------------------------------------------------------------------------------------------------')
print('Start reranking...')

reranker = FlagEmbeddingReranker(
    top_n = 3,
    model = "BAAI/bge-reranker-base",
)

query_bundle = QueryBundle(query_str=query)
ranked_nodes = reranker._postprocess_nodes(nodes, query_bundle = query_bundle)
for ranked_node in ranked_nodes:
    print('----------------------------------------------------')
    display_source_node(ranked_node, source_length = 500)

重新排序后的结果

------------------------------------------------------------------------------------------------
Start reranking...
----------------------------------------------------
Node ID: ca4db90f-5c6e-47d5-a544-05a9a1d09bc6 
Score: -1.584416151046753 
Text: TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang∗Guangtao Zeng∗Tianduo Wang Wei Lu
StatNLP Research Group
Singapore University of Technology and Design
{peiyuan_zhang, tianduo_wang, @sutd.edu.sg">luwei}@sutd.edu.sg
[email protected]
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contr... 

----------------------------------------------------
Node ID: e2d97411-8dc0-40a3-9539-a860d1741d4f 
Score: -1.7028117179870605 
Text: Although these works show a clear preference on large models, the potential of training smaller
models with larger dataset remains under-explored. Instead of training compute-optimal language
models, Touvron et al. (2023a) highlight the importance of the inference budget, instead of focusing
solely on training compute-optimal language models. Inference-optimal language models aim for
optimal performance within specific inference constraints This is achieved by training models with
more tokens... 

----------------------------------------------------
Node ID: 438b9d91-cd5a-44a8-939e-3ecd77648662 
Score: -2.904750347137451 
Text: 4 Conclusion
In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote
transparency in the open-source LLM pre-training community, we have released all relevant infor-
mation, including our pre-training code, all intermediate model checkpoints, and the details of our
data processing steps. With its compact architecture and promising performance, TinyLlama can
enable end-user applications on mobile devices, and serve as a lightweight platform for testing a
w...

经过重新排序后，ID 为 ca4db90f-5c6e-47d5-a544-05a9a1d09bc6 的节点的排名从 2 变为 1，这意味着最相关的上下文被排在了第一位

使用 LLM 作为重新排名器

现有的涉及 LLM 的重新排序方法大致可分为三类：

利用重新排序任务对 LLM 进行微调
提示 LLM 进行重新排序
在训练过程中使用 LLM 进行数据增强

提示 LLM 重新排序的方法成本较低。下面是使用 RankGPT 进行的演示，它已被集成到 LlamaIndex 中

RankGPT 的理念是使用 LLM（如 ChatGPT 或 GPT-4 或其他 LLM）执行零次列表式段落重新排序

它采用排列生成方法和滑动窗口策略来有效地对段落重新排序

https://arxiv.org/pdf/2304.09542.pdf 提出了三种可行的方法

前两种方法是传统方法，即给每份文档打分，然后根据分数对所有段落进行排序

第三种方法，即排列生成法。该模型不依赖外部评分，而是直接对段落进行端到端排序。它直接利用 LLM 的语义理解能力对所有候选段落进行相关性排序

然而，候选文档的数量通常非常大，而 LLM 的输入却很有限。因此，通常无法一次性输入所有文本

如图所示，引入了一种滑动窗口法，它沿用了冒泡排序的思想

每次只对前 4 个文本进行排序，然后移动窗口，对后面 4 个文本进行排序

在对整个文本进行反复排序后，就可以得到性能最好的文本

from llama_index.postprocessor import RankGPTRerank
from llama_index.llms import OpenAI
reranker = RankGPTRerank(
    top_n = 3,
    llm = OpenAI(model="gpt-3.5-turbo-16k"),
    # verbose=True,
)

reranker = FlagEmbeddingReranker(
    top_n = 3,
    model = "BAAI/bge-reranker-base",
    use_fp16 = False
)

# or using LLM as reranker
# from llama_index.postprocessor import RankGPTRerank
# from llama_index.llms import OpenAI
# reranker = RankGPTRerank(
#     top_n = 3,
#     llm = OpenAI(model="gpt-3.5-turbo-16k"),
#     # verbose=True,
# )

query_engine = index.as_query_engine(       # add reranker to query_engine
    similarity_top_k = 3, 
    node_postprocessors=[reranker]
)
# query_engine = index.as_query_engine()    # original query_engine

05: Exploring Semantic Chunking（探索语义分块）

大多数常用的分块方法都是基于规则的，采用固定分块大小或相邻分块重叠等技术

对于多级文档，可以使用 Langchain 提供的

在实际应用中，由于预定义的规则（块大小或重叠部分的大小）过于死板，基于规则的分块方法很容易导致检索上下文不完整或包含噪声的块大小过大等问题

对于分块来说，最有效的方法显然是根据语义进行分块。语义分块旨在确保每个分块包含尽可能多的独立语义信息

基于嵌入的方法

LlamaIndex 和 Langchain 都提供了基于嵌入的语义分块器。算法的思路大致相同

pip install llama-index-core

pip install llama-index-readers-file

pip install llama-index-embeddings-openai

pip install httpx[socks]

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader


import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

# load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()


embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print('-' * 100)
    print(node.get_content())

splitter.get_nodes_from_documents 函数过程

de>splitter.get_nodes_from_documents 函数的主要流程

基于嵌入的语义分块主要是根据滑动窗口（合并句子）计算相似度。那些相邻且符合阈值的句子会被归入一个语义块

测试结果表明，数据块的粒度相对较粗

这种方法是基于页面的，并不能直接解决跨多个页面的块的问题

基于模型的方法

非标准 BERT

两个句子同时输入 BERT，然后模型预测第二个句子是否紧跟第一个句子

对于一篇文档，将其分割成若干句子。然后，使用滑动窗口将相邻的两个句子输入 BERT 模型进行 NSP 判断

如果预测得分低于预设阈值，则表明这两个句子之间的语义关系较弱

这种方法的优点是可以直接使用，无需训练或微调

但是，这种方法在确定文本分割点时只考虑了前后句子，忽略了后面句子的信息。此外，这种方法的预测效率相对较低

跨段注意力

[2004.14535] Text Segmentation by Cross Segment Attention (arxiv.org) 提出了三种关于跨段注意力的模型

在跨句段 BERT 模型（左）中，我们向模型输入围绕潜在句段断裂的局部上下文：左边 k 个词组，右边 k 个词组。在 BERT+Bi-LSTM 模型（中）中，我们首先使用 BERT 模型对每个句子进行编码，然后将句子表示输入 Bi-LSTM 中。在分层 BERT 模型（右图）中，我们首先使用 BERT 对每个句子进行编码，然后将输出的句子表示输入另一个基于转换器的模型。资料来源通过跨段关注进行文本分割

该模型将文本分段定义为逐句分类任务。潜在断句的上下文（两侧的 k 个标记）被输入到模型中

与 [CLS] 相对应的隐藏状态被传递给 softmax 分类器，由其决定是否在潜在断句处进行分割

还提出了另外两个模型

其中一个使用 BERT 模型获得每个句子的向量表示。然后将多个连续句子的这些向量表示输入 Bi-LSTM 模型（(b)）
另一个 BERT 模型（(c)），以预测每个句子是否是文本分割边界

当时这三个模型取得了最先进的结果

序列模型

Cross-Segment 模型对每个句子进行独立矢量化，不考虑任何更广泛的上下文信息。SeqModel 中提出了进一步的改进方案

https://arxiv.org/pdf/2107.09278.pdf

SeqModel 采用 BERT 同时对多个句子进行编码，在计算句子向量之前对较长上下文中的依赖关系进行建模

然后，它会预测是否在每个句子之后进行文本分割

此外，该模型还利用自适应滑动窗口法来提高推理速度，而不会影响准确性

SeqModel 可通过 ModelScope 框架使用。代码如下：

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task = Tasks.document_segmentation,
    model = 'damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

print(result[OutputKeys.TEXT]

总结

基于模型的语义分块法仍有很大的提升空间

建议的一种改进方法是创建针对特定项目的训练数据，以便进行领域微调。这可以提高模型的性能

优化模型结构也是一个改进点

如果能找到一个在特定业务数据上表现良好的模型，那么基于模型的方法仍然有效

基于 LLM 的方法

https://arxiv.org/pdf/2312.06648.pdf

该论文介绍了一种新的检索单位--命题。命题被定义为文本中的原子表达式，每个表达式都包含一个独特的事实，并以简洁、自足的自然语言格式呈现

LlamaIndex 和 Langchain 都实现了相关算法，下面使用 LlamaIndex 进行演示

使用论文中提供的提示来生成命题：

PROPOSITIONS_PROMPT = PromptTemplate(
    """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: ¯Eostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core                    0.10.12
llama-index-embeddings-openai       0.1.6
llama-index-llms-openai             0.1.6
llama-index-readers-file            0.1.5
llamaindex-py-client                0.1.13

from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# Download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

# If you have already downloaded DenseXRetrievalPack, you can import it directly.
# from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()


# Use LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")

class DenseXRetrievalPack 的源代码

class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )

        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)

        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)

        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )

构造函数的思路是首先使用 text_splitter 将文档划分为原始节点

然后调用 self._gen_propositions 通过生成命题获得相应的子节点

然后使用节点 + 子节点建立一个 VectorStoreIndex，并通过 RecursiveRetriever 进行检索

递归检索器可以使用小块检索，但会将相关的大块传递给生成阶段

总结

利用 LLM 构建命题的分块方法实现了更精细的分块。它与原始节点形成了一个从小到大的索引结构，从而为语义分块提供了一种新的思路

不过，这种方法依赖于 LLM，而 LLM 的成本相对较高

06: Exploring Query Rewriting（查询重写）TODO

07: Exploring RAG for Tables（表格）

实施 RAG 是一项挑战，尤其是在有效解析和理解非结构化文档中的表格时

这对于扫描文档或图像格式的文档尤其困难

这些挑战至少有三个方面：

扫描文档或图像文档的复杂性，如结构的多样性、非文本元素的包含以及手写和印刷内容的结合
如何提取表格标题并将其有效链接到相应的表格
如何设计索引结构，以有效存储表的语义信息

表格解析

利用多模式 LLM

如 GPT-4V，来识别表格，并从每个 PDF 页面提取信息

输入：图像格式的 PDF 页面
输出：JSON 或其他格式的表格。如果多模态 LLM 无法提取表格数据，则应总结图像并返回摘要

使用开源框架

如 unstructured 和其他也采用对象检测模型的框架

输入：PDF 或图像格式的文件
输出：从整个文档的解析结果中获得纯文本或 HTML 格式的表格

使用端到端模型

Nougat、Donut 等

输入：PDF 或图像格式的文件
输出：从整个文档的解析结果中获得 LaTeX 或 JSON 格式的表格

无论使用哪种方法提取表格信息，都应包含表格标题。因为在大多数情况下，表格标题是文档或论文作者对表格的简要描述，可以在很大程度上概括整个表格

索引结构

只有图像格式的索引表
只有纯文本或 JSON 格式的索引表
只有 LaTeX 格式的索引表
只为表格摘要编制索引
从小到大或文件摘要索引结构
- 小块内容可以是表格中每一行的信息，也可以是表格的摘要
- 大块内容可以是图像格式、纯文本格式或 LaTeX 格式的表格
向 VQA 模型（如 DAN 等）或多模态 LLM 发送相关图像（PDF 页）和用户查询，并返回答案
- 要编入索引的内容：图像格式文件
- 发送给 VQA 模型或多模态 LLM 的内容：查询 + 图像形式的相应页面
向 LLM 发送相关文本格式的 PDF 页面和用户的查询，然后返回答案
- 索引内容：文本格式文件
- 发送到 LLM 的内容：查询 + 文本格式的相应页面
向多模态 LLM（如 GPT-4V 等）发送相关图像（PDF 页面）、文本块和用户查询，并直接返回答案
- 需要索引的内容：图像格式的文档和文本格式的文档块
- 发送给多模态 LLM 的内容：查询 + 文档的相应图像形式 + 相应文本块
首先，应用（a）至（d）中的一种方法，将文档中的所有表格解析为图像形式。然后，将所有表格图像和用户的查询直接发送到多模态 LLM（如 GPT-4V 等），并返回答案
- 要索引的内容：无
- 发送至多模态 LLM 的内容：查询 + 所有解析表（图像格式）
使用（m）提取的图像格式的表格，然后使用 OCR 模型识别表格中的所有文本，然后直接将表格中的所有文本和用户的查询发送到 LLM，并直接返回答案
- 要索引的内容：无
- 发送到 LLM 的内容：用户查询 + 所有表格内容（文本格式）

现有开源解决方案

LlamaIndex 提出了四种方法，其中前三种使用多模态模型

检索相关图像（PDF 页面）并将其发送到 GPT-4V 以回复查询
- 不需要进行表格解析
- 结果表明，即使答案在图像中，它也无法得出正确答案
将每个 PDF 页面视为图像，让 GPT-4V 对每个页面进行图像推理。为图像推理建立文本向量存储索引。根据图像推理向量存储查询答案
- 涉及表格解析
- 根据 GPT-4V 返回的结果，索引内容要么是表格内容，要么是摘要
- 缺点是，GPT-4V 从图像中识别表格并提取其内容的能力不稳定，尤其是当图像包含表格、文本和其他图像的混合时
使用表格转换器从检索到的图像中裁剪表格信息，然后将这些裁剪后的图像发送到 GPT-4V 进行查询响应
- 不需要编制索引
对裁剪后的表格图像进行 OCR 识别，并将数据发送到 GPT4/ GPT-3.5 以回答查询
- 也不需要索引
- 错误答案的产生是由于无法从图像中提取表格信息

通过测试发现，第三种方法的整体效果最好

不过，第三种方法在检测表格方面很吃力，更不用说正确合并表格标题和表格了

Langchain 也提出了一些解决方案，半结构化 RAG 的关键技术包括

表格解析使用非结构化

索引方法是文档摘要索引，小块内容：表格摘要，大块内容：原始表格内容（文本格式）

半结构化和多模式 RAG 提出了三种解决方案，其架构如图所示

建议的解决方案

图中省略了一些 RAG 模块，如重新排序和查询重写

表格解析

使用 Nougat

它的表格检测比非结构化（catogery (c)）更有效。此外，Nougat 还能很好地提取表格标题，非常方便与表格关联

文件摘要索引结构（目录 (i)）

小块内容包括表格摘要，大块内容包括 LaTeX 格式的相应表格和文本格式的表格标题。使用多向量检索器来实现它

表格摘要获取方法

将表格和表格标题发送至 LLM 进行汇总

这种方法的优势在于，它既能高效地解析表格，又能全面考虑表格摘要与表格之间的关系。它还消除了对多模式 LLM 的需求，从而节省了成本

code

pip install langchain
pip install chromadb
pip install nougat-ocr

# 完成安装后，检查 Python 软件包的版本
langchain                                0.1.12
langchain-community                      0.0.28
langchain-core                           0.1.31
langchain-openai                         0.0.8
langchain-text-splitters                 0.0.1

chroma-hnswlib                           0.7.3
chromadb                                 0.4.24

nougat-ocr                               0.1.17

# 设置环境并导入
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

import subprocess
import uuid

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough

def june_run_nougat(file_path, output_dir):
    # Run Nougat and store results as Mathpix Markdown
    cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"]
    res = subprocess.run(cmd) 
    if res.returncode != 0:
        print("Error when running nougat.")
        return res.returncode
    else:
        print("Operation Completed!")
        return 0

def june_get_tables_from_mmd(mmd_path):
    f = open(mmd_path)
    lines = f.readlines()
    res = []
    tmp = []
    flag = ""
    for line in lines:
        if line == "\\begin{table}\n":
            flag = "BEGINTABLE"
        elif line == "\\end{table}\n":
            flag = "ENDTABLE"
        
        if flag == "BEGINTABLE":
            tmp.append(line)
        elif flag == "ENDTABLE":
            tmp.append(line)
            flag = "CAPTION"
        elif flag == "CAPTION":
            tmp.append(line)
            flag = "MARKDOWN"
            print('-' * 100)
            print(''.join(tmp))
            res.append(''.join(tmp))
            tmp = []

    return res

file_path = "YOUR_PDF_PATH"
output_dir = "YOUR_OUTPUT_DIR_PATH"

if june_run_nougat(file_path, output_dir) == 1:
    import sys
    sys.exit(1)

mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd" 
tables = june_get_tables_from_mmd(mmd_path)

# 使用 LLM 对表格进行汇总
# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \ 
Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element}  """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
# Get table summaries
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
print(table_summaries)

# 使用多向量检索器构建文档摘要索引结构
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore = vectorstore,
    docstore = store,
    id_key = id_key,
    search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1
)

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content = s, metadata = {id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))

# 建立一个简单的 RAG 管道并执行查询
# Prompt template
template = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")


# Simple RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)


print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?"))  # Query about table 1

print(chain.invoke("Which parser performs worst for BLEU EN-DE"))  # Query about table 2

print(chain.invoke("Which parser performs best for WSJ 23 F1"))  # Query about table 4

整体代码如下

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

import subprocess
import uuid

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough


def june_run_nougat(file_path, output_dir):
    # Run Nougat and store results as Mathpix Markdown
    cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"]
    res = subprocess.run(cmd) 
    if res.returncode != 0:
        print("Error when running nougat.")
        return res.returncode
    else:
        print("Operation Completed!")
        return 0

def june_get_tables_from_mmd(mmd_path):
    f = open(mmd_path)
    lines = f.readlines()
    res = []
    tmp = []
    flag = ""
    for line in lines:
        if line == "\\begin{table}\n":
            flag = "BEGINTABLE"
        elif line == "\\end{table}\n":
            flag = "ENDTABLE"
        
        if flag == "BEGINTABLE":
            tmp.append(line)
        elif flag == "ENDTABLE":
            tmp.append(line)
            flag = "CAPTION"
        elif flag == "CAPTION":
            tmp.append(line)
            flag = "MARKDOWN"
            print('-' * 100)
            print(''.join(tmp))
            res.append(''.join(tmp))
            tmp = []

    return res

file_path = "YOUR_PDF_PATH"
output_dir = "YOUR_OUTPUT_DIR_PATH"

if june_run_nougat(file_path, output_dir) == 1:
    import sys
    sys.exit(1)

mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd" 
tables = june_get_tables_from_mmd(mmd_path)


# Prompt
prompt_text = """You are an assistant tasked with summarizing tables and text. \ 
Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element}  """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
# Get table summaries
table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})
print(table_summaries)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore = vectorstore,
    docstore = store,
    id_key = id_key,
    search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1
)

# Add tables
table_ids = [str(uuid.uuid4()) for _ in tables]
summary_tables = [
    Document(page_content = s, metadata = {id_key: table_ids[i]})
    for i, s in enumerate(table_summaries)
]
retriever.vectorstore.add_documents(summary_tables)
retriever.docstore.mset(list(zip(table_ids, tables)))


# Prompt template
template = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")

# Simple RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?"))  # Query about table 1

print(chain.invoke("Which parser performs worst for BLEU EN-DE"))  # Query about table 2

print(chain.invoke("Which parser performs best for WSJ 23 F1"))  # Query about table 4

09: Prompt Compression（prompt压缩）

RAG 流程可能会遇到两个问题：

大语言模型（LLM）通常有上下文长度限制。因此，输入文本越长，处理过程就越费时费力
检索到的上下文不一定总是有用的。在一个较大的语块中，可能只有一小部分与答案相关。在某些情况下，要回答一个特定的问题，可能需要将多个信息块结合起来。即使重新排序，这个问题依然存在

LLM 的提示压缩是解决这些问题的一种方法。从根本上说，其目的是保留提示中的关键信息，使输入令牌更有价值。这种方法既能提高模型的性能，又能降低成本

提示压缩方法可分为四大类：

基于信息熵的方法
- 如选择性语境、LLMLingua、LongLLMLingua。这些方法使用一个小型语言模型来计算原始提示语中每个标记的自信息或易错性。然后，它们会删除易错性较低的标记
基于软提示调整的方法
- 如 AutoCompressor 和 GIST。这些方法需要对 LLM 参数进行微调，使其适用于特定领域，但不能直接应用于黑盒 LLM
首先，从 LLM 中进行数据提炼，然后训练模型，生成更多可解释的文本摘要。这些模型可以在不同的语言模型之间转移，并应用于不需要梯度更新的黑盒 LLM。具有代表性的方法是 LLMLingua-2 和 RECOMP
基于标记合并或标记剪枝的方法（最初是针对 ViT 或 BERT 等较小模型提出的）
- 如 ToMe 和 AdapLeR。这些方法通常需要在推理过程中对模型进行微调或生成中间结果

Selective Context 选择性上下文

LLM 不需要完整的上下文或完整的对话历史记录，就能对用户的询问做出回应。即使在相关信息被遗漏的情况下，LLMs 仍能做出预期的回应

这可能要归功于 LLMs 从上下文线索和预训练中获得的先验知识中推断出缺失信息的能力

因此，可以在不影响性能的情况下，通过过滤掉信息量较少的内容来优化上下文长度。这就是选择性上下文的关键所在

选择性上下文采用小语言模型（SLM）来确定给定上下文中词性单位（如句子、短语或标记）的自信息

然后，它利用这些自信息来评估它们的信息量。通过有选择性地保留自信息较高的内容，选择性上下文为 LLM 提供了一种更简洁、更高效的上下文表示法

实现这一点不会影响它们在不同任务中的性能

自信息

自信息，又称惊喜或信息含量，是信息论中的一个重要概念。它量化了一个事件所传达的信息量。它被定义为标记的负对数可能性： $$ I(x)=-\log_2P(x_t|x_0,x_1,...,x_{t-1}) $$ I(x) 表示标记 x 的自我信息，P(x) 表示其输出概率

罕见事件传递的信息越多，自信息量就越大；常见事件传递的信息较少，其自信息量较低

算法

# 安装相应的 python 库和下载 Spacy 模型来设置环境
(base) Florian:~ Florian$ conda create -n "selective_context" python=3.10 
(base) Florian:~ Florian$ conda activate selective_context
(selective_context) Florian:~ Florian$ pip install selective-context
(selective_context) Florian:~ Florian$ python -m spacy download en_core_web_sm

# 安装完成后，版本如下：
(selective_context) Florian:~ Florian$ pip list | grep selective
selective-context   0.1.4

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt2', lang='en')
text = "INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .]. Ideal CL models in the real world should be deal with domain shifts , researchers have recently started to sample tasks from two different datasets . For instance , proposed to train and evaluate a model on Imagenet first and then challenge its performance on the Places365 dataset . considers more scenarios , starting with Imagenet or Places365 , and then moving on to the VOC/CUB/Scenes datasets. Few works propose more advanced scenarios built on top of more than two datasets."
context, reduced_content = sc(text)

# We can also adjust the reduce ratio
# context_ratio, reduced_content_ratio = sc(text, reduce_ratio = 0.5)

sc(text) 函数源代码

class SelectiveContext:
    ...
    ...
    def __call__(self, text: str, reduce_ratio: float = 0.35, reduce_level :str = 'phrase') -> List[str]:
        context = self.beautify_context(text)

        self.mask_ratio = reduce_ratio

        sents = [sent.strip() for sent in re.split(self.sent_tokenize_pattern, context) if sent.strip()]

        # You want the reduce happen at sentence level, phrase level, or token level?
        assert reduce_level in ['sent', 'phrase', 'token'], f"reduce_level should be one of ['sent', 'phrase', 'token'], got {reduce_level}"
        sent_lus, phrase_lus, token_lus = self._lexical_unit(sents)
        lexical_level = {
            'sent': sent_lus,
            'phrase': phrase_lus,
            'token': token_lus
        }

        # context is the reduced context, masked_sents denotes what context has been filtered out
        context, masked_sents = self.self_info_mask(lexical_level[reduce_level].text, lexical_level[reduce_level].self_info, reduce_level)
        return context, masked_sents

步骤 1：计算自我信息

给定上下文 C = x0、x1、......、xn，其中每个 xi 代表一个标记，使用因果语言模型（如 GPT-2、OPT 和 LLaMA）来计算每个标记 xi 的自我信息

步骤 2：合并为词汇单元

直接在标记级别执行选择性上下文过滤可能会导致上下文不连贯。例如，原始提示中的 "2009 "可能被压缩为 "209"

除了标记级过滤外，在短语和句子级实施过滤程序也至关重要

过滤的基本单位称为词性单位，可以是一个标记、一个短语或一个句子

如何计算每个词性单位 u = (xt, ..., xt+α)的自信息？

根据自信息的可加性原则，将组成 u 的每个标记的自信息相加： $$ I(u)=\sum_{i=t}^\alpha I(x_i) $$ 相应的代码如下

class SelectiveContext:
    ...
    ...
    def _lexical_unit(self, sents):

        if self.sent_level_self_info:
            sent_self_info = []
            all_noun_phrases = []
            all_noun_phrases_info = []
            all_tokens = []
            all_token_self_info = []

            for sent in sents:
                # print(sent)
                tokens, self_info = self.get_self_information(sent)
                '''
                ipdb> sent
                'INTRODUCTION Continual Learning ( CL ) , also known as Lifelong Learning , is a promising learning paradigm to design models that have to learn how to perform multiple tasks across different environments over their lifetime [To uniform the language and enhance the readability of the paper we adopt the unique term continual learning ( CL ) .].'

                ipdb> tokens
                ['IN', 'TR', 'ODUCT', 'ION', ' Contin', 'ual', ' Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lif', 'elong', ' Learning', ',', ' is', ' a', ' promising', ' learning', ' paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple', ' tasks', ' across', ' different', ' environments', ' over', ' their', ' lifetime', ' [', 'To', ' uniform', ' the', ' language', ' and', ' enhance', ' the', ' read', 'ability', ' of', ' the', ' paper', ' we', ' adopt', ' the', ' unique', ' term', ' continual', ' learning', ' (', ' CL', ' )', '.', '].']

                ipdb> self_info
                [7.514791011810303, 1.632637619972229, 0.024813441559672356, 0.006853647995740175, 12.09920597076416, 2.1144468784332275, 9.457701683044434, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 10.071824073791504, 0.6905602216720581, 0.01698811538517475, 1.5882389545440674, 0.4495090842247009, 0.45371606945991516, 6.932497978210449, 6.087430477142334, 3.66465425491333, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 4.6389899253845215, 0.33642446994781494, 4.918881416320801, 2.076707601547241, 3.3553669452667236, 5.5081071853637695, 5.625778675079346, 0.7966060638427734, 6.347291946411133, 12.772034645080566, 13.792041778564453, 4.11267614364624, 6.583715915679932, 3.3618998527526855, 8.434362411499023, 1.2423189878463745, 5.8330583572387695, 0.0013973338063806295, 0.3090735077857971, 1.1139129400253296, 4.160390853881836, 3.744772434234619, 7.2841596603393555, 1.4088190793991089, 7.86871337890625, 4.305004596710205, 9.69282341003418, 0.08665203303098679, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 6.892032623291016]
                '''
                sent_self_info.append(np.mean(self_info))

                all_tokens.extend(tokens)
                all_token_self_info.extend(self_info)

                noun_phrases, noun_phrases_info = self._calculate_lexical_unit(tokens, self_info)
                '''
                ipdb> noun_phrases
                ['INTRODUCTION Continual Learning', ' (', ' CL', ' )', ',', ' also', ' known', ' as', ' Lifelong Learning', ',', ' is', ' a promising learning paradigm', ' to', ' design', ' models', ' that', ' have', ' to', ' learn', ' how', ' to', ' perform', ' multiple tasks', ' across', ' different environments', ' over', ' their lifetime', ' [', 'To', ' uniform', ' the language', ' and', ' enhance', ' the readability', ' of', ' the paper', ' we', ' adopt', ' the unique term continual learning', ' (', ' CL', ' )', '.', ']', '.']
                
                ipdb> noun_phrases_info
                [4.692921464797109, 2.4503376483917236, 10.236454963684082, 0.8689146041870117, 5.269547939300537, 4.641763210296631, 0.22138957679271698, 0.010370315983891487, 3.5931241369495788, 1.5882389545440674, 0.4495090842247009, 4.284574694931507, 3.3969509601593018, 7.337691307067871, 5.881226539611816, 1.7340556383132935, 4.599822521209717, 6.482723236083984, 4.045308589935303, 4.762691497802734, 0.21346867084503174, 3.7985599040985107, 2.487707197666168, 4.918881416320801, 2.7160372734069824, 5.5081071853637695, 3.2111923694610596, 6.347291946411133, 12.772034645080566, 13.792041778564453, 5.348196029663086, 3.3618998527526855, 8.434362411499023, 2.3589248929638416, 0.3090735077857971, 2.6371518969535828, 3.744772434234619, 7.2841596603393555, 4.672402499616146, 1.6127821207046509, 1.6296097040176392, 0.46206924319267273, 3.0398476123809814, 3.446016311645508, 3.446016311645508]
                '''

                # We need to add a space before the first noun phrase for every sentence except the first one
                if all_noun_phrases:
                    noun_phrases[0] = f" {noun_phrases[0]}"
                all_noun_phrases.extend(noun_phrases)
                all_noun_phrases_info.extend(noun_phrases_info)
            
            return [
                LexicalUnits('sent', text=sents, self_info=sent_self_info),
                LexicalUnits('phrase', text=all_noun_phrases, self_info=all_noun_phrases_info),
                LexicalUnits('token', text=all_tokens, self_info=all_token_self_info)
            ]

步骤 3：有选择地保留信息

如何评估它们的信息量？
本文提出了一种自适应方法，使用基于百分位数的过滤方法来选择信息量最大的内容。这比使用固定阈值或保留固定数量的前 k 个词汇单元更可取

根据词性单位的自信息值从高到低排列词性单位

计算所有词性单元这些值的 pth 百分位数

选择性地保留自信息值大于或等于 pth 百分位数的词汇单位

相应的代码如下

class SelectiveContext:
    ...
    ...

    def self_info_mask(self, sents: List[str], self_info: List[float], mask_level):
        # mask_level: mask sentences, phrases, or tokens
        sents_after_mask = []
        masked_sents = []
                
        self.ppl_threshold = np.nanpercentile(self_info, self.mask_ratio * 100)

        # if title is not None:
        #     with open(os.path.join(self.path, title+'_prob_token.tsv'), 'w', encoding='utf-8') as f:
        #         for token, info in zip(tokens, self_info):
        #             f.write(f"{token}\t{info}\n")
        #     with open(os.path.join(self.path, title+'_prob_sent.tsv'), 'w', encoding='utf-8') as f:
        #         for sent, info in zip(sents, sent_self_info):
        #             f.write(f"{sent}\n{info}\n\n")

        for sent, info in zip(sents, self_info):
            if info < self.ppl_threshold:
                masked_sents.append(sent)
                sents_after_mask.append(self.mask_a_sent(sent, mask_level))
            else:
                sents_after_mask.append(sent)
        masked_context = " ".join(sents_after_mask) if mask_level == 'sent' else "".join(sents_after_mask)
        
        return masked_context, masked_sents

LLMLingua

LLMLingua 认为，选择性上下文往往忽略了压缩内容之间的相互联系，以及 LLM 与用于及时压缩的小语言模型之间的相关性。LLMLingua 正好解决了这些问题

如图，LLMLingua 采用了预算控制器，为原始提示的各个组成部分（如指令、演示和问题）动态分配不同的压缩率

LLMLingua 还引入了一种标记级迭代算法，用于对提示语进行细粒度压缩

与 "选择性上下文 "相比，LLMLingua 能更有效地保留提示中的关键信息，同时考虑到标记之间的条件依赖关系。它能将提示语压缩 20 倍

预算控制器

用于为原始提示音的不同部分动态分配不同的压缩比

提示语的不同部分对压缩的敏感度不同。例如，说明和问题的敏感度较高，而演示的敏感度较低

预算控制器的作用是为指令和问题分配较低的压缩率，从而保留基本信息。相反，可以为演示分配较高的压缩率，以消除冗余信息

主要变量：

M𝑠：小型语言模型，如 GPT-2 或 LLaMA
x = (x^ins , x^dems , x^que)：原始提示，包括说明、演示和问题
𝐿、𝐿_ins、𝐿_dems 和 𝐿_que 表示 x、x^ins、xdems 和 x^que 中的 token 数
𝜏_dems：根据目标总体压缩率𝜏 以及预设的指令和问题压缩率（即𝜏_ins 和 𝜏_que）计算的演示压缩率
D：这一组将包含压缩演示

主要流程：

计算演示压缩率
使用小型语言模型（如 GPT-2 或 LLaMA）计算原始演示集中每个演示的复杂度
按困惑度降序排列所有演示
迭代选择演示，并将其添加到集合 D 中
压缩演示后，将剩余预算分配给说明和问题
粗粒度压缩后的输出集 D

迭代令牌级提示压缩（ITPC）

使用困惑度来压缩提示符有其内在的局限性：独立性假设

该假设认为提示符中的每个标记都是独立的，一个标记出现的概率只取决于前面的标记，与其他标记无关

这一假设的问题在于，它忽略了自然语言中词块之间经常存在的复杂依赖关系，而这种关系对于理解上下文和保持语义完整性至关重要

为了解决这个问题，LLMLingua 引入了迭代标记级提示压缩 (ITPC) 算法

这种方法并不完全依赖其独立概率，而是在提示压缩过程中更精确地评估每个标记的重要性

它通过迭代处理提示中的每个片段，并在当前上下文中考虑每个标记的条件概率来实现这一目的。这种方法有助于更好地保留标记之间的依赖关系

通过这一过程，ITPC 算法可以有效压缩提示语的长度，同时保持提示语义的完整性，从而降低 LLM 的推理成本

相关代码：https://github.com/microsoft/LLMLingua/blob/v0.2.1/llmlingua/prompt_compressor.py#L1458

指令调优

其目的是最小化用于压缩提示的小语言模型与 LLM 之间的分布差异

code

# 设置环境

(base) Florian:~ Florian$ conda create -n "llmlingua" python=3.11

(base) Florian:~ Florian$ conda activate llmlingua

(llmlingua) Florian:~ Florian$ pip install llmlingua

# 安装的版本如下：

llmlingua          0.2.1

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"

llm_lingua = PromptCompressor()

## Or use the phi-2 model,
# llm_lingua = PromptCompressor("microsoft/phi-2")

## Or use the quantation model, like TheBloke/Llama-2-7b-Chat-GPTQ, only need <8GB GPU memory.
## Before that, you need to pip install optimum auto-gptq
# llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

compressed_prompt = llm_lingua.compress_prompt(GSM8K_PROMPT.split("\n\n")[0], instruction="", question="", target_token=200)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

LongLLMLingua

LLMLingua 的问题在于，它在压缩过程中不考虑用户的问题，可能会保留无关信息

LongLLMLingua 将用户问题纳入压缩过程，旨在解决这一问题

LongLLMLingua 提出了四个新组件，以增强对 LLM 中关键信息的感知：

问题感知的粗粒度和细粒度压缩
文件重新排序机制
动态压缩比
后续恢复算法

问题感知粗粒度压缩

LongLLMLingua 建议使用问题 x^que 在不同上下文 x^doc_k 条件下的困惑度来表示它们之间的关联

在 x^que 后面可以加上一个限制性语句，即 x^restrict = "可以在给定的文档中得到这个问题的答案"

该语句加强了 x^que 和 x^doc_k 之间的联系，并作为一个正则化项减少了幻觉效应 $$ \large\r_{k}=\frac{1}{N_{c}}\sum_{i}^{{N_{c}}p(x_{i}}{\mathrm{que,restrict}}|\mathbf{x}{k}^{\mathrm{doc}})\log p(x^{{\mathrm{que,restrict}}|\mathbf{x}_{k}}{\mathrm{doc}}),k\in{1,2,\cdots,K} $$

问题感知细粒度压缩

LongLLMLingua 引入了对比困惑的概念 $$ s_i=\text{perplexity}(x_i|x_{<i})-\text{perplexity}(x_i|x^\text{que},x_{<i}) $$ 计算一个标记的困惑度，不考虑问题，表示为 $\text{perplexity}(x_i|x_{<i})$

目的是确定每个标记的惊奇程度随问题变化的程度

如果一个词在包含问题时变得不那么令人惊讶，那么它可能与问题高度相关

文件重新排序机制

LLM 往往会使用提示开头和结尾的内容，而忽略中间的内容。这个问题被称为 "迷失在中间 "问题

当相关信息被放在开头时，LLM 的表现最佳。因此，LongLLMLingua 根据粗粒度压缩的结果来组织段落，按得分从高到低的顺序从前往后排列

动态压缩比

由于不同文档的关键信息密度不同，应该为与问题更相关的文档分配更多的预算（即更低的压缩比）

LongLLMLingua 使用粗粒度压缩的重要性分数来指导细粒度压缩的预算分配

首先使用 LLMLingua 的预算控制器为保留的文档设置初始预算

然后，在细粒度压缩阶段，为每个文档动态分配压缩预算，分配的依据是文档的重要性得分排名指数，该指数是在粗粒度压缩阶段确定的

LongLLMLingua 采用线性调度器进行自适应分配，每个令牌 xi 的预算可表示为 $$ \begin{aligned}\tau_{i}&=\tau_k^{\mathrm{doc}},\quad x_i\in\mathbf{x}_k^{{\mathrm{doc}},\\tau_k}{\mathrm{doc}}&=\max(\min((1-\frac{2I(r_k)}{N_d})\delta\tau+\tau^{\mathrm{doc}},0),1)\end{aligned} $$ 其中，Nd 表示文件数量，δτ 是一个超参数，用于控制动态分配的总体预算

相关代码 https://github.com/microsoft/LLMLingua/blob/v0.2.1/llmlingua/prompt_compressor.py#L958

后续恢复算法

在细粒度标记压缩过程中，可能会丢弃一些关键实体的标记

例如，原始提示中的 "2009 "可能被压缩为 "209"，"Wilhelm Conrad Rontgen "可能被压缩为 "Wilhelmgen"

LongLLMLingua 提出了一种子序列恢复算法，可以从 LLM 的响应中恢复出原始内容

主要步骤

遍历 LLM 响应中的标记 yl，并选择压缩提示 x˜ 中出现的最长子串 y˜key,l
找出原始提示 x 中与 y˜key,l 相对应的最大公共最短子序列 xi,j
用 xi,j 替换 LLMs 响应中的相应标记 y˜key,l

相关代码：https://github.com/microsoft/LLMLingua/blob/v0.2.1/llmlingua/prompt_compressor.py#L1686

代码

from llmlingua import PromptCompressor

GSM8K_PROMPT = "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\nAngelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\nQuestion: You can buy 4 apples or 1 watermelon for the same price. You bought 36 fruits evenly split between oranges, apples and watermelons, and the price of 1 orange is $0.50. How much does 1 apple cost if your total bill was $66?\nLet's think step by step\nIf 36 fruits were evenly split between 3 types of fruits, then I bought 36/3 = 12 units of each fruit\nIf 1 orange costs $0.50 then 12 oranges will cost $0.50 * 12 = $6\nIf my total bill was $66 and I spent $6 on oranges then I spent $66 - $6 = $60 on the other 2 fruit types.\nAssuming the price of watermelon is W, and knowing that you can buy 4 apples for the same price and that the price of one apple is A, then 1W=4A\nIf we know we bought 12 watermelons and 12 apples for $60, then we know that $60 = 12W + 12A\nKnowing that 1W=4A, then we can convert the above to $60 = 12(4A) + 12A\n$60 = 48A + 12A\n$60 = 60A\nThen we know the price of one apple (A) is $60/60= $1\nThe answer is 1\n\nQuestion: Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?\nLet's think step by step\nAfter one week, Susy has 100+40 = 140 followers.\nIn the second week, Susy gains 40/2 = 20 new followers.\nIn the third week, Susy gains 20/2 = 10 new followers.\nIn total, Susy finishes the three weeks with 140+20+10 = 170 total followers.\nAfter one week, Sarah has 50+90 = 140 followers.\nAfter the second week, Sarah gains 90/3 = 30 followers.\nAfter the third week, Sarah gains 30/3 = 10 followers.\nSo, Sarah finishes the three weeks with 140+30+10 = 180 total followers.\nThus, Sarah is the girl with the most total followers with a total of 180.\nThe answer is 180"
QUESTION = "Question: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?"



llm_lingua = PromptCompressor()

compressed_prompt = llm_lingua.compress_prompt(
    GSM8K_PROMPT.split("\n\n")[0],
    question = QUESTION,
    # ratio=0.55
    # Set the special parameter for LongLLMLingua
    condition_in_question = "after_condition",
    reorder_context = "sort",
    dynamic_context_compression_ratio = 0.3, # or 0.4
    condition_compare = True,
    context_budget = "+100",
    rank_method = "longllmlingua",
)

print('-' * 100)
print("original:")
print(GSM8K_PROMPT.split("\n\n")[0])


print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

AutoCompressor 自动压缩机

AutoCompressor 是一种基于软提示的方法

它通过扩大词汇量以及利用 "摘要标记 "和 "摘要向量 "来浓缩上下文信息，对现有模型进行了巧妙的微调

自动压缩器通过递归生成摘要向量来处理长文档，这些摘要向量将作为软提示传递给所有后续分段

运行步骤：

扩展词汇：这一步骤包括在模型现有词汇中添加 "摘要标记"。这些标记使模型能够将大量信息浓缩为一个较小的向量
分割文件：将待处理的文档分割成小段，每段都附加摘要标记。这些标记还包含前几段的摘要信息，形成摘要累积
微调训练：采用无监督训练方法，利用 "下一个词预测 "任务对模型进行微调。这项任务的目的是根据当前标记前的标记和当前段落前的段落摘要向量预测下一个词
反向传播AutoCompressor 对每个片段使用时间反向传播 (BPTT) 和梯度检查点技术，以尽量减小计算图的大小。反向传播针对整个文档执行，使模型能够学习整个上下文的关联

import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel, AutoCompressorModel

# Load AutoCompressor trained by compressing 6k tokens in 4 compression steps
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k")
# Need bfloat16 + cuda to run Llama model with flash attention
model = LlamaAutoCompressorModel.from_pretrained("princeton-nlp/AutoCompressor-Llama-2-7b-6k", torch_dtype=torch.bfloat16).eval().cuda()

prompt = 'The first name of the current US president is "'
prompt_tokens = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

context = """Joe Biden, born in Scranton, Pennsylvania, on November 20, 1942, had a modest upbringing in a middle-class family. He attended the University of Delaware, where he double-majored in history and political science, graduating in 1965. Afterward, he earned his law degree from Syracuse University College of Law in 1968.\nBiden's early political career began in 1970 when he was elected to the New Castle County Council in Delaware. In 1972, tragedy struck when his wife Neilia and 1-year-old daughter Naomi were killed in a car accident, and his two sons, Beau and Hunter, were injured. Despite this devastating loss, Biden chose to honor his commitment and was sworn in as a senator by his sons' hospital bedsides.\nHe went on to serve as the United States Senator from Delaware for six terms, from 1973 to 2009. During his time in the Senate, Biden was involved in various committees and was particularly known for his expertise in foreign affairs, serving as the chairman of the Senate Foreign Relations Committee on multiple occasions.\nIn 2008, Joe Biden was selected as the running mate for Barack Obama, who went on to win the presidential election. As Vice President, Biden played an integral role in the Obama administration, helping to shape policies and handling issues such as economic recovery, foreign relations, and the implementation of the Affordable Care Act (ACA), commonly known as Obamacare.\nAfter completing two terms as Vice President, Joe Biden decided to run for the presidency in 2020. He secured the Democratic nomination and faced the incumbent President Donald Trump in the general election. Biden campaigned on a platform of unity, promising to heal the divisions in the country and tackle pressing issues, including the COVID-19 pandemic, climate change, racial justice, and economic inequality.\nIn the November 2020 election, Biden emerged victorious, and on January 20, 2021, he was inaugurated as the 46th President of the United States. At the age of 78, Biden became the oldest person to assume the presidency in American history.\nAs President, Joe Biden has worked to implement his agenda, focusing on various initiatives, such as infrastructure investment, climate action, immigration reform, and expanding access to healthcare. He has emphasized the importance of diplomacy in international relations and has sought to rebuild alliances with global partners.\nThroughout his long career in public service, Joe Biden has been recognized for his commitment to bipartisanship, empathy, and his dedication to working-class issues. He continues to navigate the challenges facing the nation, striving to bring the country together and create positive change for all Americans."""
context_tokens = tokenizer(context, add_special_tokens=False, return_tensors="pt").input_ids.cuda()

summary_vectors = model(context_tokens, output_softprompt=True).softprompt
print(f"Compressing {context_tokens.size(1)} tokens to {summary_vectors.size(1)} summary vectors")
# >>> Compressing 660 tokens to 50 summary vectors

generation_with_summary_vecs = model.generate(prompt_tokens, do_sample=False, softprompt=summary_vectors, max_new_tokens=12)[0]
print("Generation w/ summary vectors:\n" + tokenizer.decode(generation_with_summary_vecs))
# >>> The first name of the current US president is "Joe" and the last name is "Biden".

next_tokens_without_context = model.generate(prompt_tokens, do_sample=False, max_new_tokens=11)[0]
print("Generation w/o context:\n" + tokenizer.decode(next_tokens_without_context))
# >>> The first name of the current US president is "Donald" and the last name is "Trump".

LLMLingua-2

LLMLingua-2 基于因果语言模型（如 LLaMa-7B）的信息熵，通过删除词块或词汇单位来压缩提示语，从而发现了两个问题：

用于确定信息熵的小语言模型与提示压缩目标不一致
它只利用单向上下文，可能无法包含及时压缩所需的全部信息

这些问题的核心在于，信息熵可能是一种次优的压缩措施

LLMLingua-2 的整体架构如图

为解决问题 1，LLMLingua-2 引入了数据提炼过程

该过程从 LLM 中提取知识，在不丢失关键信息的情况下压缩提示

它还构建了一个提取文本压缩数据集。在这个数据集上进行训练，有助于将小语言模型与提示压缩有效地结合起来

为解决问题 2，LLMLingua-2 将提示语压缩视为标记分类问题

这种方法确保了压缩后的提示语与原始提示语的保真度

使用Transformer编码器作为底层架构，从完整的双向语境中捕捉提示压缩所需的所有信息

如何构建有效的提示压缩数据集

数据蒸馏

数据注释

质量控制

Compressor 压缩机

视为二元分类问题

最初，提示压缩问题可以转化为二元分类问题

其基本概念是将每个词性单元视为一个独立的实体，并为其分配一个标签，即 "保留 "或 "丢弃"

这种方法既能保持压缩提示内容的完整性，又能简化模型的设计

压缩策略

原始提示 x 的压缩策略分为三个步骤。目标压缩率为 1/τ，其中 τ 定义为压缩后的提示语字数与原始提示语 x 的字数之商

确定压缩提示 x˜ 中要保留的目标标记数：N˜ = τN
使用标记分类模型来预测每个词 xi 被标记为 "保留 "的概率 pi
保留原始提示 x 中 pi 值最高的前 N 个词，并保留其原始顺序，形成压缩提示 x˜

code

from llmlingua import PromptCompressor

PROMPT = "John: So, um, I've been thinking about the project, you know, and I believe we need to, uh, make some changes. I mean, we want the project to succeed, right? So, like, I think we should consider maybe revising the timeline.\n\nSarah: I totally agree, John. I mean, we have to be realistic, you know. The timeline is, like, too tight. You know what I mean? We should definitely extend it."

llm_lingua = PromptCompressor(
    model_name = "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2 = True,
)
compressed_prompt = llm_lingua.compress_prompt(PROMPT, rate=0.33, force_tokens = ['\n', '?'])

## Or use LLMLingua-2-small model
# llm_lingua = PromptCompressor(
#     model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
#     use_llmlingua2=True,
# )

print('-' * 100)
print("original:")
print(PROMPT)

print('-' * 100)
print("compressed_prompt:")
print(compressed_prompt)

RECOMP

RECOMP 引入了两种经过训练的压缩器：抽取式和抽象式

提取式压缩器从检索到的文档中选择有用的句子

抽象式压缩器则将多个文档中的信息结合起来生成摘要

Extractive Compressor 提取式压缩机

给定输入文档集中的 n 个句子 [s1, s2, ..., sn]

训练一个双编码器模型

模型将句子 si 和输入序列 x 嵌入到固定维度的嵌入中

这些嵌入的内积表示将 si 添加到输入 x 以生成目标输出序列对 LLM 的益处

压缩器的最终摘要 s 由排名前 N 的句子组成，按其与输入句子的内积排序

Abstractive Compressor 抽象压缩器

抽象压缩器是一种编码器-解码器模型。它将输入序列 x 和检索到的文档集连接起来，然后输出摘要

这种方法包括使用 LLM（如 GPT-3）生成训练数据集，过滤这些数据，然后使用过滤后的数据集训练编码器-解码器模型

https://github.com/carriex/recomp

10: Corrective Retrieval Augmented Generation (CRAG) 修正检索增强生成（CRAG）TODO

虽然传统的 RAG 技术可以减少 LLM 答案的不准确性，但它并不能以任何方式增强初始查询

这种方法可能会导致一些潜在的问题，例如

该系统可能会消耗过多的计算资源来处理简单的查询
对于复杂的查询，仅使用原始查询进行检索往往无法收集到足够的信息
对于可能有多个答案的模糊查询，使用原始查询进行信息检索是不够的

本文将介绍两种先进的解决方案：查询分类和查询细化。通过对小型模型的训练，这两种方法都有所改进

自适应-RAG：通过问题复杂性学习调整检索增强大型语言模型

Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

自适应 RAG 引入了一个新的自适应框架。如图所示，它根据查询的复杂程度，动态选择最合适的 LLM 策略，从最简单到最复杂不等

A) 表示一种单步方法，即首先检索相关文档，然后生成答案。但是，对于需要多步骤推理的复杂查询，这种方法可能不够准确

B）是一个多步骤过程，包括迭代文档检索和生成中间响应。尽管这种方法很有效，但对于简单查询来说效率很低，因为它需要多次访问 LLM 和检索器

C）是一种自适应方法，它利用精心构建的分类器来确定查询的复杂性。这有助于为 LLM 检索选择最合适的策略，包括迭代、单一甚至无检索方法

Adaptive-RAG 流程。代码目前有四个版本：官方版本、Langchain 版本、LlamaIndex 版本和 Cohere 版本

使用 LlamaIndex 版本进行解释

cookbook/third_party/LlamaIndex/Adaptive_RAG.ipynb at e200507fba4e3404564f9249b345c89f83d73a10 · mistralai/cookbook (github.com)

对于复杂查询：需要使用多种工具。这些工具需要多个文档的上下文

对于简单查询：使用一个单独的工具，只需要单个文档中的上下文

对于直接查询：直接使用 LLM 提供答案

工具是通过分类器选择的。与最初的论文相反，这里使用的分类器没有经过训练。相反，它直接利用了已有的 LLM

虽然 LlamaIndex 代码不包括分类器的训练过程，但了解分类器的构建过程对进一步开发至关重要

数据集的构建

一个关键的挑战是缺乏查询复杂性对的注释数据集。如何解决这个问题？Adaptive-RAG 采用两种特定策略来自动构建训练数据集

分类器训练数据的标注是基于公开标注的 QA 数据集

对于收集到的问题，如果最简单的非检索方法生成了正确答案，则其对应查询的标签标记为 "A"

同样，通过单步法得到正确答案的查询标记为 "B"，而通过多步法得到正确答案的查询标记为 "C"

值得一提的是，较简单的模型优先级较高。这意味着，如果单步法和多步法都得到了正确的结果，而非检索法却失败了，就会给相应的查询贴上 "B "的标签

如果这三种方法都无法生成正确答案，则表明有些问题仍未标记

在这种情况下，会直接根据公共数据集进行分配。具体来说，将 "B "分配给单跳数据集中的查询，将 "C "分配给多跳数据集中的查询

训练和推理

训练方法包括使用交叉熵损失，根据这些自动收集的查询复杂度对来训练分类器

在推理过程中，可以通过将查询转发给分类器来确定查询的复杂度（查询是 {'A'、'B'、'C'} 中的一个）：o = Classifier(q)

不同大小的分类器之间没有明显的性能差异。即使是较小的模型也不会影响性能，从而有助于提高资源效率

查询细化方法：RQ-RAG

RQ-RAG: Learning To Refine Queries For Retrieval Augmented Generation RQ-RAG：学习完善检索增强生成查询

针对上述挑战，RQ-RAG 提出了三项改进措施

对于简单的询问，如日常问候，引入上下文实际上会降低回复质量。LLM 应该直接回复，而不是添加不必要的上下文，以避免潜在的低质量回复。换句话说，模型应该学会按需回复

对于复杂的查询，它们会被分成更简单、可回答的子查询。然后为每个子查询检索相应的信息，从而形成对原始复杂查询的全面回复

对于具有多个潜在响应的模糊查询。LLM 必须学会澄清查询、辨别用户意图，然后制定具体的搜索策略。这种方法有助于检索全面而精确的信息来回答问题

RQ-RAG 的方法包括以端到端方式训练 Llama2 7B 模型。这样，该模型就能通过改写、分解和澄清歧义来动态增强搜索查询

数据集构建

收集一个语料库，如图所示，其中包括各种场景，包括多轮对话、需要分解的查询和需要消除歧义的查询。使用该语料库创建任务池

任务池中的任务分为三类：多轮对话、分解和消歧义。例如，多轮对话数据集中的样本被归入多轮对话类别

首先，使用 ChatGPT 完善每种类型的查询。然后，使用这些细化查询从外部数据源检索信息。通常，DuckDuckGo 是主要来源，检索过程被视为黑盒

接下来，提示 ChatGPT 根据改进后的查询及其相应的上下文生成修改后的响应。通过重复这一过程，总共可以积累约 40k 个实例

与 ChatGPT 交互的提示模板：

蓝色文字是特定输入的占位符

完成上述过程后，我们就可以得到训练样本: $$ (X_{\mathrm{origin}},Y_{\mathrm{origin}})\to(X_{\mathrm{origin}},\underbrace{SPECIAL_{\mathrm{type}},Q_{\mathrm{i,_{type}},\left[D_{i1},\ldots,D_{ik}\right],\ldots}_{\mathrm{repear}for_itimes}},Y_{\mathrm{new}}) $$ 每个样本都是一个带有特殊标记的操作序列，其中

'Xorigin' 和 'Yorigin' 代表原始数据集中的输入输出对
'类型' 指优化操作：重写、分解或消除歧义
'i' 指迭代的轮次
'SPECIALtype' 表示优化类型
'Qi, type' 表示第 i 轮中根据特定特殊标记优化的查询
'[Di1, Di2, . . . , Dik]' 代表第 i 轮检索到的前 k 个文档
'Ynew' 表示在最后迭代步骤中生成的新答案

Training

以标准的自动回归方式训练 LLM，目标： $$ \mathcal{L}=\max_M\mathbb{E}_{(x,y)\sim D}\left[\log p_M(y|q_1,d_1,\ldots,q_i,d_i,x)\right] $$ 训练的目的是调整模型参数，以便在第 i 步时，在给定原始输入 x、增强查询 qi 和检索文档 di 的情况下，模型 M 能够生成最高概率的响应 y

答案选择

在每次迭代过程中，该模型都会解码针对特定需求定制的各种搜索查询：重写、分解或解决歧义。这些查询反过来又会获得不同的语境，从而导致扩展路径的多样化

针对不同的路径，制定了三种不同的策略：基于 ppl 的选择、基于置信度的选择和基于集合的选择

在 "基于 PPL 的选择 "中，会从总输出中选出困惑度（PPL）最低的答案

另一方面，基于置信度的选择会选择置信度最高的所有结果

最后，"基于集合的选择 "会选择累计置信度最高的最终结果

见解与思考

Comparison with Self-RAG and CRAG

与 Adaptive-RAG 和 RQ-RAG 不同，Self-RAG 和 CRAG 不会在检索前增强原始查询

相反，它们的重点在于决定是否执行检索以及如何优化检索后的操作

值得一提的是，CRAG 重写了网络搜索查询，以提高检索信息的质量

RQ-RAG 和 Self-RAG 都是通过训练一个较小的 LLM 来取代原来的 LLM

Adaptive-RAG 和 CRAG 不会替换原始 LLM。相反，它们增加了一个查询分类层或评估层

12: Enhancing Global Understanding（增进对全局的了解）

现实世界中的许多重要任务，包括科学文献综述、法律案件简报和医疗诊断，都需要跨块或跨文档的知识理解

现有的 RAG 方法无法帮助 LLMs 完成要求理解跨语块边界信息的任务，因为每个语块都是独立编码的

介绍四种创新方法，以增强对文档或语料库的全面理解，以及从中获得的启示和思考

RAPTOR: This is a tree-based retrieval system that recursively embeds, clusters, and summarizes text chunks 这是一个基于树的检索系统，可递归嵌入、聚类和总结文本块
Graph RAG: This method combines knowledge graph generation, community detection, RAG, and Query-Focused Summarization (QFS) to facilitate a comprehensive understanding of the entire text corpus 该方法结合了知识图谱生成、社群检测、RAG 和查询式摘要（QFS），有助于全面了解整个文本语料库
HippoRAG: This retrieval framework draws inspiration from the hippocampal indexing theory of human long-term memory. It collaborates with LLMs, knowledge graphs, and personalized PageRank algorithms 这一检索框架从人类长期记忆的海马索引理论中汲取灵感。它与 LLM、知识图谱和个性化 PageRank 算法协作
spRAG: This method enhances the performance of the standard RAG system through two key techniques, namely AutoContext and Relevant Segment Extraction (RSE) 该方法通过两项关键技术，即自动上下文和相关片段提取（RSE），提高了标准 RAG 系统的性能

Graph RAG

Graph RAG 利用 LLM 分两个阶段构建基于图的文本索引

从源文档中导出一个知识图谱
为所有紧密相连的实体组生成社区摘要

对于一个查询，每个社区摘要都会给出一个部分回复。然后将这些部分回复汇总，形成最终的全局答案

概述

采用与数据集领域相关的 LLM 提示来检测、提取和汇总节点（如实体）、边（如关系）和协变量

社群检测用于将图划分为元素组（节点、边、协变量），LLM 可以在索引和查询时对这些元素组进行总结

针对特定查询的全局答案是通过对与该查询相关的所有社区摘要进行最后一轮以查询为重点的摘要得出的

一个潜在的问题是，LLM 可能并不总是以相同的文本格式提取对同一实体的引用。这可能会导致实体元素重复，从而在图中产生重复节点

图 RAG 采用社群检测算法来识别图中的社群结构，将联系紧密的实体纳入同一社群

在这种情况下，即使 LLM 在提取过程中无法一致地识别实体的所有变体，社群检测也能帮助建立这些变体之间的联系

一旦被归入一个社区，就表明这些变体指的是相同的实体内涵，只是表达方式或同义词不同而已。这类似于知识图谱领域的实体消歧

在确定社区后，可以在莱顿层次结构中为每个社区生成类似报告的摘要。这些摘要对于理解数据集的整体结构和语义非常有用。它们还可以用来理解语料库，不会出现任何问题

步骤 4：社区摘要 → 社区答案 → 全球答案

根据上一步的社区摘要生成最终答案

由于群落结构具有层次性，不同层次的摘要可以回答各种问题

有了多层次的社区摘要，哪个层次的摘要能在细节和覆盖面之间取得平衡？
为特定社区水平生成全球答案的过程
对于给定的社区级别，会生成任何用户查询的全局答案

HippoRAG

HippoRAG 是一个新颖的检索框架，从人类长期记忆的海马索引理论中汲取灵感。它与 LLM、知识图谱和个性化 PageRank 算法协作

这种协作模仿了新皮质和海马在人类记忆中的不同作用

关键理念

下图说明了人脑是如何相对轻松地处理困难的知识整合任务的

海马记忆索引理论是著名的人类长期记忆理论，它为这种非凡的能力提供了一种可能的解释

基于环境、不断更新的记忆依赖于新皮质和 C 型海马之间的相互作用。新皮层处理并存储实际的记忆表征，而海马体则维护海马索引

该索引是一组相互连接的索引，指向新皮层中的记忆单元并存储其关联

概述

HippoRAG 的每个组成部分都与人类长期记忆的三个组成部分之一相对应

HippoRAG 模拟了人类长期记忆的三个组成部分，以模仿其模式分离和完成功能

对于离线索引，LLM 将段落处理为开放式 KG 三元组。然后将其添加到人工海马索引中，同时合成海马旁区域（PHR）检测同义词。在上述示例中，HippoRAG 提取了涉及托马斯教授的三元组，并将其纳入 KG
在线检索时，LLM 新皮质从查询中提取命名实体。然后，海马旁检索编码器将其链接到海马索引。HippoRAG 利用个性化的 PageRank 算法进行基于上下文的检索，并提取与托马斯教授相关的信息

整体流程演示

检索阶段，展示了查询命名实体识别 (NER)、查询节点检索、个性化页面排名 (PPR) 算法对节点概率的影响以及顶级检索结果的计算

如何建立长期记忆

首先，利用 LLM，使用 OpenIE 从检索语料库的每个段落中提取一组命名实体，如图所示

接下来，将命名实体添加到 OpenIE 提示中，以提取最终的三元组，如图所示

最后，利用经过微调的现成密集编码器创建知识图谱，这也将用于检索

如何检索

首先，使用 LLM 从用户查询中提取命名实体集

然后，根据检索编码器确定的相似性，将这些命名实体链接到知识图谱中的节点。将这些选定的节点称为查询节点

在海马体中，海马索引各要素之间的神经通路可使相关邻域被激活，并在上游进行回忆

为了模仿这种高效的图搜索过程，HippoRAG 利用了个性化 PageRank（PPR）算法，这是 PageRank 的一个版本，它只通过一组用户定义的源节点在图中分配概率

def rank_docs(self, query: str, top_k=10):
    """
        Rank documents based on the query
        @param query: the input phrase
        @param top_k: the number of documents to return
        @return: the ranked document ids and their scores
        """
    ...
    ...
    # Run Personalized PageRank (PPR) or other Graph Alg Doc Scores
    if len(query_ner_list) > 0:
        combined_vector = np.max([top_phrase_vectors], axis=0)

        if self.graph_alg == 'ppr':
            ppr_phrase_probs = self.run_pagerank_igraph_chunk([top_phrase_vectors])[0]
        elif self.graph_alg == 'none':
            ppr_phrase_probs = combined_vector
        elif self.graph_alg == 'neighbor_2':
            ppr_phrase_probs = self.get_neighbors(combined_vector, 2)
        elif self.graph_alg == 'neighbor_3':
            ppr_phrase_probs = self.get_neighbors(combined_vector, 3)
        elif self.graph_alg == 'paths':
            ppr_phrase_probs = self.get_neighbors(combined_vector, 3)
        else:
            assert False, f'Graph Algorithm {self.graph_alg} Not Implemented'

            fact_prob = self.facts_to_phrases_mat.dot(ppr_phrase_probs)
            ppr_doc_prob = self.docs_to_facts_mat.dot(fact_prob)
            ppr_doc_prob = min_max_normalize(ppr_doc_prob)
        else:
            ppr_doc_prob = np.ones(len(self.extracted_triples)) / len(self.extracted_triples)
            ...
            ...

最后，如同海马信号在上游发送时的做法一样，HippoRAG 将输出的 PPR 节点概率与之前索引的段落进行汇总，并以此为检索排序

spRAG

通过两种关键技术提高了标准 RAG 的性能：

自动上下文
相关片段提取 (Relevant Segment Extraction，RSE)

重点关注 spRAG 如何处理跨块的复杂查询。值得注意的是，目前还没有关于 spRAG 的论文，只有结合代码的分析

AutoContext：自动注入文档级上下文

在传统的 RAG 中，文档通常被分成固定长度的块来进行嵌入。这种简单的方法往往会忽略文档级的上下文信息，导致上下文嵌入不够准确和全面

为了解决这个问题，AutoContext 应运而生。它的主要理念是在嵌入前自动将文档级上下文信息纳入每个分块

具体来说，它会创建一个 1-2 句话的文档摘要，并将其与文件名一起添加到每个分块的开头

这样，每个分块就不再是孤立的，而是包含了整个文档的上下文信息。获取文档摘要的代码如下所示

def get_document_context(auto_context_model: LLM, text: str, document_title: str, auto_context_guidance: str = ""):
    # truncate the content if it's too long
    max_content_tokens = 6000 # if this number changes, also update the truncation message above
    text, num_tokens = truncate_content(text, max_content_tokens)
    if num_tokens < max_content_tokens:
        truncation_message = ""
    else:
        truncation_message = TRUNCATION_MESSAGE
    
    # get document context
    prompt = PROMPT.format(auto_context_guidance=auto_context_guidance, document=text, document_title=document_title, truncation_message=truncation_message)
    chat_messages = [{"role": "user", "content": prompt}]
    document_context = auto_context_model.make_llm_call(chat_messages)
    return document_context