Notebook Series (小本本系列): RAG Engineering Practice with langchain

RAG (Retrieval-Augmented Generation) combines information retrieval with text generation to improve the accuracy and richness of generated text. The approach builds on large pre-trained language models (LLMs): during generation, relevant information is retrieved from an external knowledge base or document collection and then folded into the generated output.

Core components of RAG:

  1. Retriever
    • Role: find the pieces of information in a large document collection or knowledge base that are relevant to the input prompt or question.
    • Implementation: usually vector retrieval, e.g. a trained two-tower (bi-encoder) model produces vector representations of the query and the documents, and nearest-neighbor search retrieves the relevant documents.
  2. Generator
    • Role: generate text that fits the context and clearly answers the question, based on the retrieved information. This is usually done by a large pre-trained language model such as GPT-3 or T5.
    • Implementation: take the input prompt together with the retrieved information, and use the latter as extra context to generate more accurate and specific text.

Workflow:

  1. Input prompt: the user enters a prompt or question.
  2. Retrieval step: the system uses the input prompt to retrieve relevant information from the external document collection.
  3. Generation step: the retrieved information and the input prompt are fed into the generation model together, and the model produces the final output (see the sketch below).
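
A minimal sketch of this retrieve-then-generate loop. It is illustrative only: retriever and llm are placeholders for any LangChain retriever and chat model (the concrete choices used later in this article are DashScope embeddings plus ChatTongyi):

question = "AI的壁垒没有想象中的那么高"

# retriever / llm: placeholders for any LangChain retriever and chat model
# 1. retrieval step: fetch the chunks most relevant to the question
retrieved_docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in retrieved_docs)

# 2. generation step: hand the retrieved context plus the question to the LLM
prompt = f"Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}\nAnswer:"
answer = llm.invoke(prompt)
print(answer.content)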

The Generator is essentially an exercise in prompt templating, so RAG engineering practice is mostly about building the Retriever pipeline; that is what this article focuses on.

Search

Classic search systems build simple representations from text, images, and context, and maintain efficient indexes for searching over them. They scale to huge volumes of content, but they often struggle with what the content actually means and stay at the surface level.

The main difference between classic search and semantic search is that semantic search represents and processes the data being searched as vectors. Embeddings are a powerful tool here because they can represent many kinds of data (text, audio, images, video, and so on) and support many kinds of queries; seen this way, the multimodality of large models is less mysterious than it first appears.
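
As a quick illustration of "search as vector comparison", the sketch below embeds a few short texts and scores them against a query with cosine similarity. It assumes the DashScopeEmbeddings wrapper used throughout this article; any embedding model would work the same way.

from langchain_community.embeddings.dashscope import DashScopeEmbeddings
import numpy as np

embeddings = DashScopeEmbeddings(model="text-embedding-v2")

texts = [
    "猫是一种小型家养哺乳动物",
    "A cat is a small domesticated mammal",
    "Stock markets fell sharply today",
]
query = "kitten"

doc_vectors = np.array(embeddings.embed_documents(texts))
query_vector = np.array(embeddings.embed_query(query))

# cosine similarity between the query vector and each document vector
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
for text, score in sorted(zip(texts, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")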

RAG engineering architecture

Taking text-based RAG as the example, I break a simple RAG implementation down into four pipelines:

  1. document chunk splitter pipeline (vector database index pipeline): build the indices over the data to be queried
  2. first phase retrieve pipeline: coarse recall based on semantic similarity
  3. second phase retrieve pipeline: fine-grained recall that uses MMR to reduce redundancy in the results while keeping the ranked documents/passages relevant to the query
  4. LLM generation pipeline: generate the final text with the LLM from the retrieved data and a prompt template

document chunk splitter pipeline

document loader

Text-based RAG involves reading and writing text data. To make this convenient, langchain abstracts a Document object that all text-data-related operations go through:
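
A Document is essentially page_content plus a metadata dict. A minimal hand-built example (the metadata keys here are arbitrary, just for illustration):

from langchain_core.documents import Document

doc = Document(
    page_content="RAG combines retrieval with generation.",
    metadata={"source": "notes.md", "author": "me"},  # arbitrary metadata keys
)
print(doc.page_content)
print(doc.metadata)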

When handling LLM-related data operations, it is worth reaching for langchain's existing wrappers first. Loaders are currently available for:

  • webpage
  • pdf
  • cloud provider(oss)
  • social platform(reddit, twitter)
  • message service(discord, whatsapp, telegram, mastodon)
  • common files(csv, html, json, directory, md, doc, docx etc.)

Example: loading a directory of files:

from langchain_community.document_loaders import DirectoryLoader, TextLoader

text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(
	"../",                           # directory to load
	glob="**/*.md",                  # glob parameter to control which files to load, 'md' means markdown files
	show_progress=True,              # show a progress bar
	use_multithreading=True,         # utilize several threads to load files
	loader_cls=TextLoader,           # customize the loader to specify the loader class
	silent_errors=True,              # skip the files which could not be loaded and continue the load process
	loader_kwargs=text_loader_kwargs # auto detect the file encoding before failing
	)
docs = loader.load()
len(docs)

text splitting

Text embedding models limit the number of input tokens, so text has to be chunked for RAG retrieval. Taking the text embedding models on the Bailian (百炼) platform as an example, text-embedding-v2 accepts at most 2048 tokens.

This raises a question when building a RAG system: how do we handle texts longer than 2048 tokens? The simplest answer is to split them. For now I won't over-engineer this and will simply use the text splitters langchain provides for chunking long texts (How-to guides | 🦜️🔗 LangChain):

  • splitting by character
  • splitting by markdown header
  • splitting by semantic chunks
  • splitting by token
  • splitting by HTML
  • splitting by code

Taking splitting by markdown header as an example:

from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain_community.vectorstores import SQLiteVSS

with open("./来自 Google 内部的另外一种声音:AI 没有护城河.md", "r") as f:
    md_str = f.read()

# split it into chunks
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
markdown_text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_docs = markdown_text_splitter.split_text(md_str)
md_texts = [md_doc.page_content for md_doc in md_docs]

# create the embedding function (DashScope text-embedding-v3)
embedding_function = DashScopeEmbeddings(model="text-embedding-v3")

# load it in sqlite-vss in a table named state_union.
# the db_file parameter is the name of the file you want
# as your sqlite database.
db = SQLiteVSS.from_texts(
    texts=md_texts,
    embedding=embedding_function,
    table="state_union",
    db_file="./vss.db",
)

# query it
query = "AI的壁垒没有想象中的那么高"
data = db.similarity_search(query)

# print results
data[0].page_content

large text splitting

Splitting large texts runs into the problem of context being lost across chunk boundaries; Late Chunk in Long-Context Embedding Models explains this in detail. When chunking, we therefore try to keep paragraphs (then sentences, then words) together for as long as possible, since these units tend to be the most strongly semantically related pieces of text. langchain supports this scenario via recursive text splitting.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load example document
with open("./来自 Google 内部的另外一种声音:AI 没有护城河.md") as f:
    md_str = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
    separators=[ # override the list of separators to include additional punctuation
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
)
texts = text_splitter.create_documents([md_str])
print(texts[0])
print(texts[1])

query document embedding

Once the text index for RAG has been built, the next step is retrieval, which is simply a search call. But what should we do in practice when the text to retrieve against is large (more than 2k tokens) or very large (an entire copy of 红楼梦)? There are two approaches:

  • summary: run a map-reduce summarization over the large text
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatTongyi
from dotenv import load_dotenv, find_dotenv
import tiktoken

# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

# chat model used for the map-reduce summarization (same qwen-max setup as the later examples)
llm = ChatTongyi(temperature=0.8, model="qwen-max")

MODEL_2_ENCODING = {"text-embedding-v1": "cl100k_base", "text-embedding-v2": "cl100k_base", "text-embedding-v3": "cl100k_base"}
MAX_TOKEN = {"text-embedding-v1": 2048, "text-embedding-v2": 2048, "text-embedding-v3":8192}

model_name = "text-embedding-v2"
encoding = tiktoken.get_encoding(MODEL_2_ENCODING[model_name])


# Load example document
with open("./来自 Google 内部的另外一种声音:AI 没有护城河.md") as f:
    md_str = f.read()

tokens = encoding.encode(md_str)
print(len(tokens))

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)

docs = text_splitter.create_documents([md_str])

map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

summary_chain = load_summarize_chain(llm=llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
#                                      verbose=True
                                    )
output = summary_chain.run(docs)
print(output)
  • k-means clustering (the BRV steps, listed after the code below):
# Loaders
from langchain.schema import Document

# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prompt template
from langchain.prompts import PromptTemplate

# Model
from langchain_community.chat_models import ChatTongyi

# Embedding Support
from langchain_community.embeddings.dashscope import DashScopeEmbeddings

# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain

# Data Science
import numpy as np
from sklearn.cluster import KMeans

from dotenv import load_dotenv, find_dotenv

# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

# Load the long text to cluster and summarize
# (in practice this would be a whole book; the same markdown file is reused here)
with open("./来自 Google 内部的另外一种声音:AI 没有护城河.md") as f:
    text = f.read()

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])

embeddings = DashScopeEmbeddings(model="text-embedding-v2")

vectors = np.array(embeddings.embed_documents([x.page_content for x in docs]))

# 'vectors' is now an array of 1536-dimensional embeddings, one per chunk

# Choose the number of clusters, this can be adjusted based on the book's content.
# I played around and found ~10 was the best.
# Usually if you have 10 passages from a book you can tell what it's about
num_clusters = 11

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)

# Find the closest embeddings to the centroids

# Create an empty list that will hold your closest points
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):
    
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    
    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)
    
    # Append that position to your closest indices list
    closest_indices.append(closest_index)

selected_indices = sorted(closest_indices)

llm_model = "qwen-max"

llm = ChatTongyi(temperature=0.8, model=llm_model)

map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

map_chain = load_summarize_chain(llm=llm,
                             chain_type="stuff",
                             prompt=map_prompt_template)

selected_docs = [docs[doc] for doc in selected_indices]

# Make an empty list to hold your summaries
summary_list = []

# Loop through a range of the length of your selected docs
for i, doc in enumerate(selected_docs):
    
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    print (f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")


summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")

combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

reduce_chain = load_summarize_chain(llm=llm,
                             chain_type="stuff",
                             prompt=combine_prompt_template,
#                              verbose=True # Set this to true if you want to see the inner workings
                                   )

output = reduce_chain.run([summaries])
print(output)
The BRV steps:

  1. Load your book into a single text file
  2. Split your text into large-ish chunks
  3. Embed your chunks to get vectors
  4. Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
  5. Pick the embeddings that best represent each cluster (method: closest to each cluster centroid)
  6. Summarize the documents that these embeddings represent

first phase retrieve pipeline

This pipeline uses semantic similarity to search, mainly to weed out irrelevant and redundant data based on the meaning of the text. Its core goal is to find the items most similar to a given query, measured against a predefined metric. Similarity search works especially well in settings where we want to retrieve items that closely match the user's intent. Its effectiveness can be measured with metrics such as cosine similarity, the Jaccard index, and Euclidean distance; these metrics quantify how closely two items are related, and judging that relatedness accurately is crucial for producing precise results.

Here is cosine-similarity semantic search backed by langchain's local vector store (it can also be set up in-memory):

from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain_community.vectorstores import SQLiteVSS
from langchain_text_splitters import MarkdownHeaderTextSplitter
from dotenv import load_dotenv, find_dotenv


# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

# load the document and split it into chunks
with open("./来自 Google 内部的另外一种声音:AI 没有护城河.md", "r") as f:
    md_str = f.read()

# split it into chunks
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
markdown_text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_docs = markdown_text_splitter.split_text(md_str)
md_texts = [md_doc.page_content for md_doc in md_docs]


# embedding function: DashScope, or the commented-out open-source SentenceTransformer alternative
# embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
embedding_function = DashScopeEmbeddings(model="text-embedding-v2")


# load it in sqlite-vss in a table named state_union.
# the db_file parameter is the name of the file you want
# as your sqlite database.
db = SQLiteVSS.from_texts(
    texts=md_texts,
    embedding=embedding_function,
    table="state_union",
    db_file="./vss.db",
)

# query it
query = "AI的壁垒没有想象中的那么高"
data = db.similarity_search(query)

# print results
data[0].page_content

second phase retrieve pipeline

MMR (Maximal Marginal Relevance) is designed to balance relevance and diversity in the results. It selects items that are not only relevant to the query but also sufficiently different from one another to avoid repetition. This matters most when a user would otherwise receive several nearly identical results, which makes for a far less useful experience.

A simple example of why MMR is needed: suppose we run a cosine-similarity search with top_k set to 4, so the result contains four entries. Now picture a large document that is highly similar to the query and was split into 5 chunks during text chunking. The search result will consist only of chunks from that one document, because they occupy all 4 slots and other relevant texts never make it into the top_k. MMR re-ranks the results to balance relevance against diversity.
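
Formally (this is the standard MMR formulation, not something langchain-specific), the next document D_i is picked from the remaining candidates R \ S by trading off similarity to the query Q against similarity to the already selected set S:

MMR = argmax_{D_i ∈ R \ S} [ λ · Sim(D_i, Q) - (1 - λ) · max_{D_j ∈ S} Sim(D_i, D_j) ]

The lambda_mult parameter in the langchain call below plays the role of λ: 1 means pure relevance, 0 means maximum diversity.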

Using a simple retriever, let's see how the MMR search results differ:

# retriever with MMR
from langchain_community.vectorstores import InMemoryVectorStore
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain_core.documents import Document
from dotenv import load_dotenv, find_dotenv

# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

embeddings = DashScopeEmbeddings(model="text-embedding-v3")
vectorStore = InMemoryVectorStore(embeddings)

document1 = Document(page_content="The powerhouse of the cell is the mitochondria", metadata={ "source": "https://example.com" })
document2 = Document(page_content="Buildings are made out of brick", metadata={ "source": "https://example.com" })
document3 = Document(page_content="Mitochondria are made out of lipids", metadata={ "source": "https://example.com" })
documents = [document1, document2, document3]

vectorStore.add_documents(documents)

simi_ret = vectorStore.similarity_search("biology", k=2)
print(simi_ret)

mmr_ret = vectorStore.as_retriever(search_type="mmr", search_kwargs={"k": 2, "lambda_mult": 0.5}).invoke("biology")
print(mmr_ret)

################
# output print #
################
[Document(id='cd06846b-aab4-4417-bfac-204838fb3bc1', metadata={'source': 'https://example.com'}, page_content='The powerhouse of the cell is the mitochondria'), Document(id='d6e7bb82-9b9f-4ff7-8404-96e24300295b', metadata={'source': 'https://example.com'}, page_content='Mitochondria are made out of lipids')]
[Document(id='cd06846b-aab4-4417-bfac-204838fb3bc1', metadata={'source': 'https://example.com'}, page_content='The powerhouse of the cell is the mitochondria'), Document(id='9226e3c3-7c81-4f08-9909-3ba4e066c9db', metadata={'source': 'https://example.com'}, page_content='Buildings are made out of brick')]
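
In the plain similarity search both hits are about mitochondria; MMR keeps the most relevant mitochondria document but swaps the near-duplicate for the brick document, trading a little relevance for diversity.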

LLM generation pipeline

native query

After combining the texts retrieved in phase one and phase two, we use the LLM's text-generation ability to answer the actual question, i.e. run LLM generation against a scenario-specific prompt template. Here is a simple example:

from langchain.prompts import PromptTemplate

prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# `compressed_docs` holds the documents returned by the retrieve pipelines above,
# `question` is the user query, and `llm` is the chat model (e.g. ChatTongyi)
llm.predict(text=PROMPT.format_prompt(
    context=compressed_docs,
    question=question
).text)

self-query

As the name suggests, a self-query retriever has the ability to query itself. Given any natural-language query, the self-query retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying vector store. This lets the retriever not only compare the user's query against the content of the stored documents for semantic similarity, but also extract filters over the stored documents' metadata from the user's query and apply them.

Below is sample code using the DashVector vector retrieval service (with in-memory and SQLite stores shown as alternatives):

from langchain_core.documents import Document
from langchain_community.vectorstores import InMemoryVectorStore
from langchain_community.vectorstores import SQLiteVSS
from langchain_community.vectorstores import DashVector
from langchain_community.embeddings.dashscope import DashScopeEmbeddings
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.chat_models import ChatTongyi
from dotenv import load_dotenv, find_dotenv

# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]
embeddings = DashScopeEmbeddings(model="text-embedding-v3")

# option 1: in-memory vector store
vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(docs)

# option 2: sqlite vector store with local storage
vector_store = SQLiteVSS.from_documents(
    docs,
    embeddings,
    table="movies",
    db_file="./vss-self-query.db",
)

# option 3: DashVector vector store service (the one used by the retriever below)
vector_store = DashVector.from_documents(docs, embeddings)


metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
document_content_description = "Brief summary of a movie"
llm_model = "qwen-max"
llm = ChatTongyi(temperature=0.8, model=llm_model)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vector_store,
    document_content_description,
    metadata_field_info,
)

# This example only specifies a filter
retriever.invoke("I want to watch a movie rated higher than 8.5")

# This example specifies a query and a filter
retriever.invoke("Has Greta Gerwig directed any movies about women")

# This example specifies a composite filter
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")

# This example specifies a query and composite filter
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)

P.S. Let's dig into how self-query works under the hood:

# self-query from scratch
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain_community.chat_models import ChatTongyi
from dotenv import load_dotenv, find_dotenv

# init qwen model config
_ = load_dotenv(find_dotenv("./env/.env"))

document_content_description = "Brief summary of a movie"
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
llm_model = "qwen-max"
llm = ChatTongyi(temperature=0.8, model=llm_model)
prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = prompt | llm | output_parser

print(prompt.format(query="dummy question"))
query_constructor.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)

Output:

Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Brief summary of a movie",
    "attributes": {
    "genre": {
        "description": "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        "type": "string"
    },
    "year": {
        "description": "The year the movie was released",
        "type": "integer"
    },
    "director": {
        "description": "The name of the movie director",
        "type": "string"
    },
    "rating": {
        "description": "A 1-10 rating for the movie",
        "type": "float"
    }
}
}
```

User Query:
dummy question

Structured Request:


StructuredQuery(query='taxi drivers', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='year', value=1999), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)
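
The query-constructor chain turns the natural-language request into a StructuredQuery: a free-text query ("taxi drivers") plus a filter built from comparison and logical operators. Inside SelfQueryRetriever, this structured query is then translated into the underlying vector store's own filter syntax before the similarity search runs.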

Next RAG -- Agentic RAG

AI Agent vs. Agentic AI

In the AI field, AI Agent and Agentic AI are closely related terms with different emphases. An AI Agent is a concrete intelligent entity that perceives, decides, and acts in a specific environment to complete a task. It is typically built on machine learning and AI techniques, has some degree of autonomy and adaptivity, and focuses on a single function or task, such as an AI customer-service system.

Agentic AI is a broader term that emphasizes autonomous decision making and problem solving at a higher level. Such systems do not just perceive and execute tasks; they actively think, plan, and adapt to changes in their environment. The term covers the methods and frameworks for designing and improving AI Agents and explores their broader, more general potential: tackling wider and more complex tasks, learning and optimizing autonomously in dynamic environments, applying across more domains and scenarios, and exhibiting a higher degree of intelligence, not only processing data and making decisions but also learning from interaction and improving its own behavior. It relies on more sophisticated techniques such as reinforcement learning, meta-learning, and large models combined with self-supervised learning, and it suits complex systems such as autonomous driving, intelligent financial analysis, and Mars exploration robots. Because of its high autonomy and broad scope, its ethical and risk questions are also more complex and call for more attention and research.

Traditional RAG systems are limited by a static workflow and lack the adaptivity needed for multi-step reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) moves past these limits by embedding autonomous AI agents into the RAG pipeline.

Agentic RAG combines the reasoning ability of ReAct with the task-execution ability of agents to create a dynamic, adaptive system. Unlike traditional RAG, which follows a fixed pipeline, Agentic RAG gains flexibility by using ReAct to orchestrate agents dynamically according to the context of the user's query. The system can not only retrieve and generate information but also take informed actions based on the context, evolving goals, and the data it interacts with. These advances make Agentic RAG a more powerful and flexible framework: the model is no longer limited to passively answering queries; it can plan, execute, and adjust its approach to solve problems on its own, handle more complex tasks, adapt to new challenges, and give more contextually relevant responses.

Taking Wikipedia search as an example, this zero-shot agent is an embryonic form of Agentic RAG:

from langchain.agents import initialize_agent, Tool
from langchain.utilities import WikipediaAPIWrapper
from langchain_community.chat_models import ChatTongyi

wikipedia = WikipediaAPIWrapper()

llm_model = "qwen-max"

llm = ChatTongyi(temperature=0.8, model=llm_model)

tools = [
    Tool(
        name="Wikipedia",
        func=wikipedia.run,
        description="Useful for when you need to get information from wikipedia about a single topic"
    ),
]

agent_executor = initialize_agent(tools, llm, agent='zero-shot-react-description', verbose=True)

output = agent_executor.run("Can you please provide a quick summary of Napoleon Bonaparte? \
Then do a separate search and tell me what the commonalities are with Serena Williams")

print(output)

References

  1. How to split Markdown by Headers | 🦜️🔗 LangChain
  2. Vector stores | 🦜️🔗 Langchain
  3. Semantic search with embeddings: index anything | by Romain Beaumont | Medium
  4. NLP(八十二)RAG框架中的Retrieve算法评估
  5. RAG: MMR Search in LangChain | Kaggle
  6. Similarity Search Vs Mmr Comparison | Restackio
  7. Maximal Marginal Relevance to Re-rank results in Unsupervised KeyPhrase Extraction | by Aditya Kumar | tech-that-works | Medium
  8. 最全梳理:一文搞懂 RAG 技术的 5 种范式!
  9. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
  10. How to do "self-querying" retrieval | 🦜️🔗 LangChain