Deploying a LangChain Application Integrated with Llama2

Code: https://github.com/pluto-lang/pluto/tree/main/examples/langchain-llama2-sagemaker

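This document shows how to use Pluto to connect a LangChain application to the Llama2 large language model, and then deploy the application to the AWS cloud as a production service that exposes an HTTP interface.

By the end, we will have created a SageMaker instance on AWS that hosts the TinyLlama 1.1B large language model, along with two Lambda instances. Built on LangChain and the deployed model, they implement two basic features: a simple chat and document-based question answering.

Throughout development, you do not need to deal with chores such as model deployment or AWS resource configuration; you only need to focus on the business logic. This document also applies when you need to deploy and integrate other open-source models.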

import { SageMaker, Function } from "@plutolang/pluto";
import { loadQAChain } from "langchain/chains";
import { Document } from "langchain/document";
import { PromptTemplate } from "langchain/prompts";
import {
  SageMakerEndpoint,
  SageMakerLLMContentHandler,
} from "@langchain/community/llms/sagemaker_endpoint";
/**
 * Deploy the Llama2 model on AWS SageMaker using the Hugging Face Text Generation Inference (TGI)
 * container. Here we deploy the TinyLlama-1.1B-Chat-v1.0 model, which can run on the
 * ml.m5.xlarge instance.
 *
 * Below are the minimum requirements for each Llama2 model size:
 * ```
 * Model      Instance Type    Quantization    # of GPUs per replica
 * Llama 7B   ml.g5.2xlarge    -               1
 * Llama 13B  ml.g5.12xlarge   -               4
 * Llama 70B  ml.g5.48xlarge   bitsandbytes    8
 * Llama 70B  ml.p4d.24xlarge  -               8
 * ```
 * The initial quota for these instances is zero. If you need more, you can request an increase
 * in quota via the [AWS Management Console](https://console.aws.amazon.com/servicequotas/home).
 */
const sagemaker = new SageMaker(
  "<name>", // Placeholder: the name of the SageMaker instance.
  "<image-uri>", // Placeholder: the Docker image URI of the TGI container.
  {
    instanceType: "ml.m5.xlarge",
    envs: {
      // HF_MODEL_ID: "meta-llama/Llama-2-7b-chat-hf",
      HF_MODEL_ID: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      HF_TASK: "text-generation",
      // If you want to deploy the Meta Llama2 model, you need to request permission and prepare a
      // token. You can get the token from https://huggingface.co/settings/tokens
      // HUGGING_FACE_HUB_TOKEN: "hf_EmXPwpnyxxxxxxx"
    },
  }
);
/**
 * TODO: Given the constraints of the current version of Deducer, we have to place the following
 * code within a separate function. Once Deducer is upgraded, this code should be moved outside
 * of the function.
 */
async function createSageMakerModel() {
  // Customize this handler for whatever model you'll be using.
  class LLama27BHandler implements SageMakerLLMContentHandler {
    contentType = "application/json";
    accepts = "application/json";

    async transformInput(prompt: string, modelKwargs: Record<string, unknown>): Promise<any> {
      const payload = {
        inputs: prompt,
        parameters: modelKwargs,
      };
      const stringifiedPayload = JSON.stringify(payload);
      return new TextEncoder().encode(stringifiedPayload);
    }

    async transformOutput(output: any): Promise<string> {
      const response_json = JSON.parse(new TextDecoder("utf-8").decode(output));
      const content: string = response_json[0]["generated_text"] ?? "";
      return content;
    }
  }

  return new SageMakerEndpoint({
    endpointName: sagemaker.endpointName,
    modelKwargs: {
      temperature: 0.5,
      max_new_tokens: 700,
      top_p: 0.9,
    },
    endpointKwargs: {
      CustomAttributes: "accept_eula=true",
    },
    contentHandler: new LLama27BHandler(),
    clientOptions: {
      // In theory, there's no need to supply the following details, as the code will be executed
      // within the AWS Lambda environment. However, due to the way SageMakerEndpoint is
      // implemented, a region must be specified.
      region: process.env["AWS_REGION"],
      // credentials: {
      //   accessKeyId: "YOUR AWS ACCESS ID",
      //   secretAccessKey: "YOUR AWS SECRET ACCESS KEY",
      // },
    },
  });

  // TODO: The following statement is intentionally placed after the return. It helps the deducer
  // identify the relationship between Lambda and SageMaker, which is used to grant the Lambda
  // instance permission to call the SageMaker endpoint. This code should be removed after the
  // deducer supports the analysis of libraries.
  // TODO: bug: only asynchronous functions can be successfully analyzed by the deducer.
  await sagemaker.invoke({});
}
/**
 * Why don't we use the Router (API Gateway) to handle the requests? Because API Gateway comes
 * with a built-in 30-second timeout limit, which unfortunately can't be increased. This means that
 * if the generation process takes longer than this half-minute window, the request fails with a
 * 503 Service Unavailable error. Consequently, we use the Lambda function directly to handle the
 * requests.
 *
 * For more details, check out:
 * https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
 *
 * You can send a POST HTTP request to the Lambda function using curl or Postman. The request body
 * needs to be set as an array representing the function's arguments. Here's an example of a curl
 * request:
 * ```sh
 * curl -X POST https://<your-lambda-url-id>.lambda-url.<region>.on.aws/ \
 *   -H "Content-Type: application/json" \
 *   -d '["What is the capital of France?"]'
 * ```
 * If you get an error message such as `{"code":400,"body":"Payload should be an array."}`, you can
 * add a query parameter, such as `?n=1`, to the URL to resolve it. I don't know why the request
 * turns into a GET request when the query parameter is missing, even though the curl log indicates
 * it's a POST request. If you know the reason, please let me know.
 */

/**
 * The following code creates a chain for the chatbot task. It answers the user-provided question.
 */
// TODO: Bug: The deducer fails to identify the function's resources if the return value of the
// constructor isn't assigned to a variable.
const chatFunc = new Function(
  async (query) => {
    const model = await createSageMakerModel();

    const promptTemplate = PromptTemplate.fromTemplate(`<|system|>
You are a cool and aloof robot, answering questions very briefly and directly.</s>
<|user|>
{query}</s>
<|assistant|>`);

    const chain = promptTemplate.pipe(model);
    const result = await chain.invoke({ query: query });

    // Keep only the text that follows the assistant marker.
    const answer = result
      .substring(result.indexOf("<|assistant|>") + "<|assistant|>".length)
      .trim();
    return answer;
  },
  {
    name: "chatbot", // The name must differ between functions and cannot be empty when there is more than one function instance.
  }
);
/**
 * The following code creates a chain for the question answering task. It answers the question
 * based on the given context.
 */
const exampleDoc1 = `
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.
`;

const promptTemplate = `Use the following pieces of context to answer the question at the end.

{context}

Question: {question}`;

const qaFunc = new Function(
  async (query) => {
    const docs = [new Document({ pageContent: exampleDoc1 })];

    const prompt = new PromptTemplate({
      template: promptTemplate,
      inputVariables: ["context", "question"],
    });

    // The "stuff" chain places all documents into the {context} slot of the prompt.
    const chain = loadQAChain(await createSageMakerModel(), {
      type: "stuff",
      prompt: prompt,
    });

    const result = await chain.invoke({ input_documents: docs, question: query });
    return result["text"];
  },
  {
    name: "qa",
  }
);


If you have not installed Pluto yet, follow the installation steps in the Pluto documentation, and configure your AWS access credentials.


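First, run the pluto new command in your working directory. It interactively creates a new project, along with a new folder in the current directory that contains the basic structure of a Pluto project.

Here, I name the project langchain-llama2-sagemaker, select AWS as the platform, and use Pulumi as the provisioning engine.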

$ pluto new
? Project name langchain-llama2-sagemaker
? Stack name dev
? Select a platform AWS
? Select an provisioning engine Pulumi
Info:  Created a project, langchain-llama2-sagemaker

After the project is created, enter the project folder langchain-llama2-sagemaker. You will see the following directory structure:

├── README.md
├── package.json
├── src
│   └── index.ts
└── tsconfig.json

Then, run npm install to download the required dependencies.


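Next, we modify the src/index.ts file to build the example application. The process is straightforward.

1) Create the SageMaker instance

First, we import the @plutolang/pluto package and create a SageMaker instance to deploy our model.

The SageMaker constructor takes a name, the Docker image URI of the model, and some configuration. The name is unrelated to the model being deployed; it only determines the name of the SageMaker instance.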

import { SageMaker, Function } from "@plutolang/pluto";
import { loadQAChain } from "langchain/chains";
import { Document } from "langchain/document";
import { PromptTemplate } from "langchain/prompts";
import {
  SageMakerEndpoint,
  SageMakerLLMContentHandler,
} from "@langchain/community/llms/sagemaker_endpoint";

/**
 * Deploy the Llama2 model on AWS SageMaker using the Hugging Face Text Generation Inference (TGI)
 * container. Here we deploy the TinyLlama-1.1B-Chat-v1.0 model, which can run on the
 * ml.m5.xlarge instance.
 *
 * Below are the minimum requirements for each Llama2 model size:
 * ```
 * Model      Instance Type    Quantization    # of GPUs per replica
 * Llama 7B   ml.g5.2xlarge    -               1
 * Llama 13B  ml.g5.12xlarge   -               4
 * Llama 70B  ml.g5.48xlarge   bitsandbytes    8
 * Llama 70B  ml.p4d.24xlarge  -               8
 * ```
 * The initial quota for these instances is zero. If you need more, you can request an increase
 * in quota via the [AWS Management Console](https://console.aws.amazon.com/servicequotas/home).
 */
const sagemaker = new SageMaker(
  "<name>", // Placeholder: the name of the SageMaker instance.
  "<image-uri>", // Placeholder: the Docker image URI of the TGI container.
  {
    instanceType: "ml.m5.xlarge",
    envs: {
      // HF_MODEL_ID: "meta-llama/Llama-2-7b-chat-hf",
      HF_MODEL_ID: "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
      HF_TASK: "text-generation",
      // If you want to deploy the Meta Llama2 model, you need to request permission and prepare a
      // token. You can get the token from https://huggingface.co/settings/tokens
      // HUGGING_FACE_HUB_TOKEN: "hf_EmXPwpnyxxxxxxx"
    },
  }
);

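If you want to deploy Meta's full Llama2 7B, 13B, or 70B models, note two points:

  1. Different Llama2 models have different instance requirements, so you need to choose the instance type accordingly. The minimum requirements for each model are:
    • Llama 7B: ml.g5.2xlarge
    • Llama 13B: ml.g5.12xlarge
    • Llama 70B: ml.p4d.24xlarge (or ml.g5.48xlarge with bitsandbytes quantization)
  2. You need to request download permission from Meta in advance. You should see a prompt on the model's web page; follow it to complete the request. You also need a Hugging Face token, which you can get from https://huggingface.co/settings/tokens.

If you want to deploy another large language model, you only need to make sure that it supports TGI. Once you have found a TGI-supported model to deploy, put its model ID and task type into envs. The model ID is the model's name as shown on its web page, and the task type appears among the model's tags.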
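The choices described above can be captured in a small helper. This is a minimal sketch rather than part of the Pluto API: the names `MIN_INSTANCE_TYPE` and `buildTgiEnvs` are hypothetical, the instance types are simply the minimums listed above, and `hf_xxx` is a placeholder token.

```typescript
// Hypothetical lookup of the minimum instance type per Llama2 model size,
// copied from the list above (non-quantized deployments).
const MIN_INSTANCE_TYPE: Record<string, string> = {
  "7B": "ml.g5.2xlarge",
  "13B": "ml.g5.12xlarge",
  "70B": "ml.p4d.24xlarge",
};

// Hypothetical helper that assembles the `envs` object passed to the SageMaker
// constructor. The token is only needed for gated models such as Meta's Llama2.
function buildTgiEnvs(modelId: string, task: string, hubToken?: string): Record<string, string> {
  const envs: Record<string, string> = {
    HF_MODEL_ID: modelId,
    HF_TASK: task,
  };
  if (hubToken !== undefined) {
    envs["HUGGING_FACE_HUB_TOKEN"] = hubToken;
  }
  return envs;
}

const tinyLlamaEnvs = buildTgiEnvs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "text-generation");
const llama7bEnvs = buildTgiEnvs("meta-llama/Llama-2-7b-chat-hf", "text-generation", "hf_xxx");
console.log(tinyLlamaEnvs, llama7bEnvs, MIN_INSTANCE_TYPE["7B"]);
```

The returned object could then be passed as the `envs` field of the SageMaker configuration, with `instanceType` taken from the lookup.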

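2) Adapt the model deployed on SageMaker to LangChain's LLM type

The LangChain community already provides a SageMakerEndpoint class that adapts a model deployed on SageMaker to the LLM type that LangChain accepts. We only need to implement the SageMakerLLMContentHandler interface to adapt the model's input and output.

The SageMakerEndpoint constructor takes an endpointName among its parameters. In a Pluto application, we can get it simply from sagemaker.endpointName, with no need to look it up in the console. And because the code is ultimately deployed as AWS Lambda instances, the region required by clientOptions can be read directly from an environment variable.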

async function createSageMakerModel() {
  class LLama27BHandler implements SageMakerLLMContentHandler {
    contentType = "application/json";
    accepts = "application/json";

    async transformInput(prompt: string, modelKwargs: Record<string, unknown>): Promise<any> {
      const payload = {
        inputs: prompt,
        parameters: modelKwargs,
      };
      const stringifiedPayload = JSON.stringify(payload);
      return new TextEncoder().encode(stringifiedPayload);
    }

    async transformOutput(output: any): Promise<string> {
      const response_json = JSON.parse(new TextDecoder("utf-8").decode(output));
      const content: string = response_json[0]["generated_text"] ?? "";
      return content;
    }
  }

  return new SageMakerEndpoint({
    endpointName: sagemaker.endpointName,
    modelKwargs: {
      temperature: 0.5,
      max_new_tokens: 700,
      top_p: 0.9,
    },
    endpointKwargs: {
      CustomAttributes: "accept_eula=true",
    },
    contentHandler: new LLama27BHandler(),
    clientOptions: {
      region: process.env["AWS_REGION"],
    },
  });

  // Cannot be omitted, even though it comes after the return statement.
  await sagemaker.invoke({});
}

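At this point you may have some questions: why is the class defined inside the function, and why is there a statement after the return? The current version of Pluto is not yet mature, and for now this is the only way to make sure the AWS Lambda instances are built correctly. If you are interested in the principles and implementation behind this, see the related Pluto document. Contributions are very welcome.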
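To make the content handler's role more concrete, the sketch below reproduces its two transformations as standalone functions, so the request and response shapes can be inspected without deploying anything. The function names and the simulated response are made up for illustration; only the `{ inputs, parameters }` payload and the `generated_text` field come from the handler above.

```typescript
// Standalone counterpart of transformInput: serialize the prompt and the
// model parameters into the JSON bytes sent to the endpoint.
function encodeTgiRequest(prompt: string, modelKwargs: Record<string, unknown>): Uint8Array {
  return new TextEncoder().encode(JSON.stringify({ inputs: prompt, parameters: modelKwargs }));
}

// Standalone counterpart of transformOutput: decode the bytes and read the
// first element's `generated_text` field, falling back to an empty string.
function decodeTgiResponse(output: Uint8Array): string {
  const parsed = JSON.parse(new TextDecoder("utf-8").decode(output));
  return parsed[0]["generated_text"] ?? "";
}

// Simulated endpoint response (made-up text, for illustration only).
const fakeResponse = new TextEncoder().encode(JSON.stringify([{ generated_text: "Paris." }]));

const requestBody = new TextDecoder().decode(encodeTgiRequest("Hi", { temperature: 0.5 }));
const decoded = decodeTgiResponse(fakeResponse);
console.log(requestBody);
console.log(decoded);
```

When adapting a different model, these two functions are the places that would need to change if its container expects another request or response layout.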

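3) Create the Lambda function for the chat feature

Next, we implement a basic chat feature using LangChain's PromptTemplate.

We create a Function object named chatFunc, which corresponds to an AWS Lambda instance. The function takes a query as its input and returns the response from the large language model.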

const chatFunc = new Function(
  async (query) => {
    const model = await createSageMakerModel();
    const promptTemplate = PromptTemplate.fromTemplate(`<|system|>
You are a cool and aloof robot, answering questions very briefly and directly.</s>
<|user|>
{query}</s>
<|assistant|>`);
    const chain = promptTemplate.pipe(model);
    const result = await chain.invoke({ query: query });
    // Keep only the text that follows the assistant marker.
    const answer = result
      .substring(result.indexOf("<|assistant|>") + "<|assistant|>".length)
      .trim();
    return answer;
  },
  {
    name: "chatbot", // The name must differ between functions and cannot be empty when there is more than one function instance.
  }
);
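The answer extraction in the function above assumes that the model output contains the `<|assistant|>` marker. If the marker is missing, `indexOf` returns -1 and the substring offset becomes 12, which silently drops the first 12 characters. A slightly more defensive variant could look like the sketch below; `extractAnswer` is a hypothetical helper, and returning the whole text when the marker is absent is an assumption of this sketch.

```typescript
const ASSISTANT_MARKER = "<|assistant|>";

// Returns the text after the first assistant marker, or the whole (trimmed)
// text when the marker is absent.
function extractAnswer(raw: string): string {
  const index = raw.indexOf(ASSISTANT_MARKER);
  if (index === -1) {
    return raw.trim();
  }
  return raw.substring(index + ASSISTANT_MARKER.length).trim();
}

const withMarker = extractAnswer("<|user|>\nHi</s>\n<|assistant|>\nHello.");
const withoutMarker = extractAnswer("  Hello.  ");
console.log(withMarker);
console.log(withoutMarker);
```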


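4) Create the Lambda function for the question-answering feature

Finally, we create a Function object named qaFunc, which also corresponds to an AWS Lambda instance. The function takes a query as its input, and the large language model answers based on the question and the provided document.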

const exampleDoc1 = `
Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital.
Since she was diagnosed with a brain injury, the doctor told Peter to stay besides her until she gets well.
Therefore, Peter stayed with her at the hospital for 3 days without leaving.
`;

const promptTemplate = `Use the following pieces of context to answer the question at the end.

{context}

Question: {question}`;

const qaFunc = new Function(
  async (query) => {
    const docs = [new Document({ pageContent: exampleDoc1 })];
    const prompt = new PromptTemplate({
      template: promptTemplate,
      inputVariables: ["context", "question"],
    });
    // The "stuff" chain places all documents into the {context} slot of the prompt.
    const chain = loadQAChain(await createSageMakerModel(), {
      type: "stuff",
      prompt: prompt,
    });
    const result = await chain.invoke({ input_documents: docs, question: query });
    return result["text"];
  },
  {
    name: "qa",
  }
);
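As a rough mental model of the "stuff" chain type used above, the sketch below fills a prompt by joining all document contents into the context slot. It is a dependency-free simplification, not LangChain's implementation: the helper name `stuffPrompt`, the short template, and the blank-line separator are assumptions made for illustration.

```typescript
// Joins all document contents and substitutes them, together with the
// question, into the {context} and {question} placeholders in one pass.
function stuffPrompt(template: string, docs: string[], question: string): string {
  const values: Record<string, string> = {
    context: docs.join("\n\n"),
    question: question,
  };
  return template.replace(/\{(context|question)\}/g, (_match: string, key: string) => values[key]);
}

const qaTemplate = "Context:\n{context}\n\nQuestion: {question}";
const filled = stuffPrompt(qaTemplate, ["Doc A.", "Doc B."], "Who stayed?");
console.log(filled);
```

Because every document is placed into a single prompt, this approach only suits inputs that fit within the model's context window.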

Our code is now complete. Once it is deployed to AWS, we can call the model through HTTP requests.


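Deploying a Pluto project is also simple: run the pluto deploy command in the project root, and Pluto deploys the project to AWS automatically. The result looks like the following, where red marks the Lambda instance for the chat feature and green marks the Lambda instance for the question-answering feature. Note: deploying SageMaker takes a long time, so please be patient.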

[Figure: deployment result]


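After deployment, the architecture of the application is as shown in the figure above: one SageMaker instance and two Lambda functions. In practice, however, the deployment is far more complex than the figure suggests. Nearly 20 resources need to be created and configured, including the SageMaker Model and Endpoint, the Lambda instances, and multiple IAM roles and permissions. With Pluto, all of this is done automatically with a single command.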


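Next, we can use the returned URLs to access the application.

We can send a POST HTTP request to a Lambda function with curl or Postman. Note that the request body must be an array, which represents the function's argument list. Here is an example curl request: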

curl -X POST https://<your-lambda-url-id>.lambda-url.<region>.on.aws/ \
  -H "Content-Type: application/json" \
  -d '["What is the capital of France?"]'
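The same request can be prepared from TypeScript. The sketch below only builds the request options, so it runs without a deployed function; `buildInvokeRequest` is a hypothetical helper name, and the commented-out `fetch` call uses the placeholder URL from the curl example.

```typescript
// Builds the options for a `fetch` call equivalent to the curl command above.
function buildInvokeRequest(args: unknown[]): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The body is the JSON-encoded list of the function's arguments.
    body: JSON.stringify(args),
  };
}

const invokeRequest = buildInvokeRequest(["What is the capital of France?"]);
console.log(invokeRequest.body);

// Usage (requires a deployed function URL, so it is left commented out):
// const response = await fetch("https://<your-lambda-url-id>.lambda-url.<region>.on.aws/", invokeRequest);
// console.log(await response.text());
```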

You can also try using the KVStore provided by Pluto to build a chatbot that keeps conversation state 🤖️. PRs are welcome!
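One possible starting point for such a chatbot is to render the stored conversation turns into the prompt before each call. The sketch below covers only that prompt-building step and assumes the `<|system|>` / `<|user|>` / `<|assistant|>` tag layout of the TinyLlama chat format. The `ChatTurn` type and `buildChatPrompt` helper are hypothetical, and loading or saving the turns (for example in a KVStore keyed by a session ID) is not shown.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Renders the system prompt, the previous turns, and the new question, then
// ends with the assistant tag so the model continues from there.
function buildChatPrompt(systemPrompt: string, turns: ChatTurn[], query: string): string {
  const parts: string[] = ["<|system|>\n" + systemPrompt + "</s>"];
  for (let i = 0; i < turns.length; i++) {
    parts.push("<|" + turns[i].role + "|>\n" + turns[i].content + "</s>");
  }
  parts.push("<|user|>\n" + query + "</s>");
  parts.push("<|assistant|>");
  return parts.join("\n");
}

const previousTurns: ChatTurn[] = [
  { role: "user", content: "What is the capital of France?" },
  { role: "assistant", content: "Paris." },
];
const chatPrompt = buildChatPrompt("Answer briefly.", previousTurns, "And of Italy?");
console.log(chatPrompt);
```

After each call, the new question and the extracted answer would be appended to the stored turns so that the next request sees them.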


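Why not use the Router (API Gateway) to handle requests?

API Gateway has a built-in 30-second timeout that cannot be adjusted. If generation takes longer than this window, we receive a 503 Service Unavailable error. Therefore, we use the Lambda functions directly to handle requests. We will try to improve the experience later by supporting WebSocket.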