Building a feature-rich local LLM environment with llama-cpp-agent

I previously introduced llama.cpp.

Today I'd like to try llama-cpp-agent, an LLM framework built on top of llama.cpp.

GitHub - Maximilian-Winter/llama-cpp-agent: The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM models, execute structured function calls and get structured output. Works also with models not fine-tuned to JSON output and function calls.


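About llama-cpp-agent

llama-cpp-agent is an LLM framework that runs on Python.

Its backend is llama-cpp-python, a Python wrapper for llama.cpp, which lets you run GGUF-format LLMs locally.

What sets it apart from llama.cpp and llama-cpp-python is, above all, its range of features. The official GitHub repository lists the following features.

The list below is taken from the README.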

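Simple Chat Interface
- Engage in seamless conversations with LLMs.
Structured Output
- Generate structured output (objects) from LLMs.
Single and Parallel Function Calling
- Execute functions using LLMs.
RAG - Retrieval Augmented Generation
- Perform retrieval augmented generation with ColBERT reranking.
Agent Chains
- Process text using agent chains with tools, supporting conversational, sequential, and mapping chains.
Guided Sampling
- Lets most 7B LLMs do function calling and structured output, thanks to grammars and JSON schema generation for guided sampling.
Multiple Providers
- Works with llama-cpp-python, a llama.cpp server, a TGI server, and a vLLM server as the provider.
Compatibility
- Works with Python functions, pydantic tools, llama-index tools, and OpenAI tool schemas.
Flexibility
- Suitable for a range of applications, from casual chat to executing specific functions.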

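The appeal is that structured output, function calling, and RAG are all easy to try.

Environment setup

Now let's build an environment to actually run it. I built the environment with a dev container.

Directory layout

The directory layout is as follows. As long as the three files are placed under the .devcontainer directory, the environment will build without problems.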

.
├── .devcontainer
│   ├── Dockerfile
│   ├── devcontainer.json
│   └── docker-compose.yml
├── models
└── any program files

Dockerfile

The Dockerfile is based on the one from my earlier llama-cpp-python post, with llama-cpp-agent added via pip.

FROM python:3.10-slim-bullseye

ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=$USER_UID

ENV LANG ja_JP.UTF-8
ENV LANGUAGE ja_JP:ja
ENV LC_ALL ja_JP.UTF-8
ENV TZ JST-9
ENV TERM xterm

RUN apt-get update \
    && groupadd --gid $USER_GID $USERNAME \
    && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && apt-get install -y sudo \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME \
    && apt-get -y install locales \
    && localedef -f UTF-8 -i ja_JP ja_JP.UTF-8

RUN apt-get install -y build-essential libssl-dev
RUN apt-get install -y gcc g++

RUN apt-get install -y cmake
RUN apt-get install -y git

RUN pip install --upgrade pip
RUN pip install --upgrade setuptools

RUN pip install llama-cpp-agent

devcontainer.json

In devcontainer.json, specify name and service. They can be identical, so I gave both the same name.

For extensions, I always install ms-python.python and ms-toolsai.jupyter.

{
    "name": "llama_cpp_agent_test",
    "service": "llama_cpp_agent_test",
    "dockerComposeFile": "docker-compose.yml",
    "remoteUser": "vscode",
    "workspaceFolder": "/work",
    "customizations": {
      "vscode": {
        "extensions": [
          "ms-python.python",
          "ms-toolsai.jupyter"
        ]
      }
    }
}

docker-compose.yml

In docker-compose.yml, use the same service name as the one specified in devcontainer.json.

Set the container name and hostname to whatever you like.

version: '3'
services:
 llama_cpp_agent_test:
    container_name: 'llama_cpp_agent_test_container'
    hostname: 'llama_cpp_agent_test_container'
    build: .
    restart: always
    working_dir: '/work' 
    tty: true
    volumes:
      - type: bind
        source: ..
        target: /work

Once the three files are in place, building with the dev container starts the container.

Trying it out

I'll follow the Getting Started guide.

Getting Started - llama-cpp-agent

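I worked in a Jupyter notebook (.ipynb).

First, choose the provider, which is the LLM backend.

This time I used llama-cpp-python as the backend. You can apparently also choose a llama.cpp server, a TGI server, or a vLLM server.

The model is Phi-3-mini-4k-instruct-q4.gguf. Place it in the models directory.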
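Before loading, it can help to confirm that the GGUF file is actually visible from inside the container, since a wrong bind mount or a typo in the file name is an easy mistake. The following is a minimal sketch using only the standard library; find_gguf_models and require_model are hypothetical helper names of my own, not part of llama-cpp-agent or llama-cpp-python.

```python
from pathlib import Path


def find_gguf_models(models_dir):
    """Return the .gguf files found directly under models_dir, sorted by name."""
    root = Path(models_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"models directory not found: {root}")
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".gguf"
    )


def require_model(models_dir, filename):
    """Return the path to filename, or raise an error listing the files that exist."""
    candidates = find_gguf_models(models_dir)
    for path in candidates:
        if path.name == filename:
            return path
    available = ", ".join(p.name for p in candidates) or "(none)"
    raise FileNotFoundError(
        f"{filename} not found in {models_dir}; available: {available}"
    )
```

With this setup, require_model("/work/models", "Phi-3-mini-4k-instruct-q4.gguf") would return the path to pass to Llama as a string, or fail early with a list of the files that are present.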

from llama_cpp import Llama
from llama_cpp_agent.providers import LlamaCppPythonProvider

llama_model = Llama('/work/models/Phi-3-mini-4k-instruct-q4.gguf')

provider = LlamaCppPythonProvider(llama_model)

When the model loads successfully, log output scrolls by.

...
...........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    83.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '\n' + message['content'] + '<|end|>' + '\n' + '<|assistant|>' + '\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '\n'}}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.padding_token_id': '32000', 'tokenizer.ggml.eos_token_id': '32000', 'tokenizer.ggml.bos_token_id': '1', 'general.architecture': 'phi3', 'phi3.context_length': '4096', 'phi3.attention.head_count_kv': '32', 'general.name': 'Phi3', 'tokenizer.ggml.pre': 'default', 'phi3.embedding_length': '3072', 'tokenizer.ggml.unknown_token_id': '0', 'phi3.feed_forward_length': '8192', 'phi3.attention.layer_norm_rms_epsilon': '0.000010', 'phi3.block_count': '32', 'phi3.attention.head_count': '32', 'phi3.rope.dimension_count': '96', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|user|>' + '
' + message['content'] + '<|end|>' + '
' + '<|assistant|>' + '
'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|end|>' + '
'}}{% endif %}{% endfor %}
Using chat eos_token: <|endoftext|>
Using chat bos_token: <s>

Next, create an object called an agent.

from llama_cpp_agent import LlamaCppAgent
agent = LlamaCppAgent(provider)

Start a conversation with get_chat_response.

agent_output = agent.get_chat_response("Hello, World!")
print(f"Agent: {agent_output.strip()}")

If all goes well, you get output from the LLM.

Agent: Hello! It's great to see you here. The phrase "Hello, World!" is commonly used in programming as a simple exercise for beginners to learn their first lines of code. Essentially, when someone says or writes "Hello, World!", they are typically greeting the world or starting with basic programmatic communication.

In many programming languages, writing a "Hello, World!" program is often the very first step in learning how to code. This task involves outputting the text "Hello, World!" onto the screen and serves as an introductory exercise for understanding syntax, running programs, and interacting with your computer's environment.

For example, here's a simple "Hello, World!" program written in Python:

```python
print("Hello, World!")
```

And here is how it looks like when executed in a Python interpreter or saved as a .py file and run from the command line:

```shell
$ python hello.py
Hello, World!
```

This straightforward task sets the foundation for more complex programming endeavors by familiarizing you with the fundamental concepts of coding.

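That is the basic usage.

Trying the features

I'll work through these using the llama-cpp-agent documentation as a reference.

system_prompt

The system prompt lets you give the LLM its role and instructions. It is set when the agent is created.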

agent = LlamaCppAgent(provider, system_prompt="あなたは優秀なAIアシスタントです")
agent_output = agent.get_chat_response("日本で一番高い山を教えて下さい")
print(f"Agent: {agent_output.strip()}")

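You can pass the familiar AI-assistant persona as the system prompt. Here the system prompt says "You are an excellent AI assistant," and the question asks for the highest mountain in Japan.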

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1493.92 ms
llama_print_timings:      sample time =      95.55 ms /   169 runs   (    0.57 ms per token,  1768.69 tokens per second)
llama_print_timings: prompt eval time =    1882.46 ms /    67 tokens (   28.10 ms per token,    35.59 tokens per second)
llama_print_timings:        eval time =   10433.85 ms /   168 runs   (   62.11 ms per token,    16.10 tokens per second)
llama_print_timings:       total time =   12522.99 ms /   235 tokens
Agent: 日本には様々な高所がありますが、最も高き山は富士山（ふじさん）であり、停水湖として3,776メートルの位置を取っています。富士山は日本記録の上位の山であり、その美しい円錐形が多くの視聴者に親しまれています。また、2013年にユネスコの世界文化遺産に登録されていることも特徴です。

Chat history

The conversation history is supplied through an argument called chat_history.

from pprint import pprint
from llama_cpp_agent.chat_history import BasicChatHistory, BasicChatMessageStore, BasicChatHistoryStrategy
chat_history_store = BasicChatMessageStore()
chat_history = BasicChatHistory(message_store=chat_history_store, chat_history_strategy=BasicChatHistoryStrategy.last_k_tokens, k=7000, llm_provider=provider)

agent = LlamaCppAgent(provider, system_prompt="あなたは優秀なAIアシスタントです", chat_history=chat_history)

agent_output = agent.get_chat_response("日本で一番高い山を教えて下さい")
for message in chat_history_store.get_all_messages():
    print(message)

Calling get_all_messages() on the BasicChatMessageStore object returns the entire conversation history.

role=<Roles.system: 'system'> content='あなたは優秀なAIアシスタントです'
role=<Roles.user: 'user'> content='日本で一番高い山を教えて下さい'
role=<Roles.assistant: 'assistant'> content='\n富士山（ふじさん）が日本で一番高く、約3,776メートルの最高峰です。活火山でありながら、昔は噴火していましたが現在も安定している山です。富士山は日本を象徴する存在とされており、2013年にユネスコの世界文化遺産に登録されました。' tool_calls=None

Let's ask a follow-up question ("Where in Japan is that mountain?").

agent_output = agent.get_chat_response("その山は日本のどこにありますか？")
for message in chat_history_store.get_all_messages():
    print(message)

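The agent picked up on the previous question and produced a new answer, and you can see that the exchange has been appended to the history. (The answer is somewhat broken; perhaps that is an effect of quantization.)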

role=<Roles.system: 'system'> content='あなたは優秀なAIアシスタントです'
role=<Roles.user: 'user'> content='日本で一番高い山を教えて下さい'
role=<Roles.assistant: 'assistant'> content='\n富士山（ふじさん）が日本で一番高く、約3,776メートルの最高峰です。活火山でありながら、昔は噴火していましたが現在も安定している山です。富士山は日本を象徴する存在とされており、2013年にユネスコの世界文化遺産に登録されました。' tool_calls=None
role=<Roles.user: 'user'> content='その山は日本のどこにありますか？'
role=<Roles.assistant: 'assistant'> content='\n富士山は日本の関西地方（関西諸県）で位置しています。具体的には、静岡県と山梨県に各自部を占めており、国内外から多くの人々が観光や登山を楽しむことができる理由の一つです。富士五湖（ふじごこ）とも呼ばれる周辺は、美しい自然に満ちた日本独特の風景を提ayerする地区です。' tool_calls=None
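The last_k_tokens strategy with k=7000 used above suggests that only the most recent part of the history is sent to the model. The sketch below illustrates that general idea; it is not the library's implementation. The whitespace-based count_tokens is a stand-in for a real tokenizer, and keeping the system message regardless of age is a design choice of this sketch.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words (an assumption)."""
    return len(text.split())


def last_k_tokens(messages, k, count=count_tokens):
    """Keep system messages plus the newest other messages that fit in k tokens."""
    system = [m for m in messages if m["role"] == "system"]
    budget = k - sum(count(m["content"]) for m in system)
    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed([m for m in messages if m["role"] != "system"]):
        cost = count(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the highest mountain in Japan?"},
    {"role": "assistant", "content": "Mount Fuji, at 3,776 meters."},
    {"role": "user", "content": "Where is it located?"},
]

# With a small budget, the oldest user message is dropped first.
for message in last_k_tokens(history, k=15):
    print(message["role"], "->", message["content"])
```

Dropping whole messages from the oldest end keeps the remaining turns intact, at the cost of losing early context once the budget is exceeded.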

Function Calling

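Let's try function calling.

I'll make a function available that gets the current date and time from Python's datetime module and returns it.

Note that if you delete the docstring (the """ ... """ part) of the function being called, it stops working. Be careful (I lost an hour to this error).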

from typing import Optional
import datetime

from llama_cpp_agent import FunctionCallingAgent
from llama_cpp_agent import LlamaCppFunctionTool

def get_current_datetime(output_format: Optional[str] = None):
    """
    Get the current date and time in the given format.

    Args:
         output_format: formatting string for the date and time, defaults to '%Y-%m-%d %H:%M:%S'
    """
    if output_format is None:
        output_format = '%Y-%m-%d %H:%M:%S'
    return datetime.datetime.now().strftime(output_format)

def send_message_to_user_callback(message: str):
    print(message)

function_tools = [LlamaCppFunctionTool(get_current_datetime)]

function_call_agent = FunctionCallingAgent(
    provider,
    system_prompt="You are a helpful assistant.",
    llama_cpp_function_tools=function_tools,
    send_message_to_user_callback=send_message_to_user_callback,
    allow_parallel_function_calling=True)

user_input = "What time is it now?"
function_call_agent.generate_response(user_input)

When I asked the time, the current time came back.

It is currently June 13, 2024 at 1:49 PM.
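Given the docstring pitfall mentioned above, a small pre-flight check can surface a missing docstring before the tool is registered. This is a sketch of my own using only the standard library; require_docstring is a hypothetical helper and not part of llama-cpp-agent.

```python
def require_docstring(func):
    """Return func unchanged, or raise ValueError if it has no docstring."""
    doc = (func.__doc__ or "").strip()
    if not doc:
        raise ValueError(f"{func.__name__} has no docstring")
    return func


def documented(value: int):
    """Return the value unchanged."""
    return value


def undocumented(value: int):
    return value


require_docstring(documented)  # passes silently

try:
    require_docstring(undocumented)
except ValueError as error:
    print(error)
```

In the example above, the check could wrap the function at registration time, as in LlamaCppFunctionTool(require_docstring(get_current_datetime)), so the failure points at the missing docstring rather than at the agent.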

You can also make multiple functions available. To do so, just register multiple functions in function_tools.

import datetime
from enum import Enum
from typing import Union, Optional
from pydantic import BaseModel, Field

from llama_cpp_agent import FunctionCallingAgent
from llama_cpp_agent import LlamaCppFunctionTool

from llama_cpp import Llama
from llama_cpp_agent.providers import LlamaCppPythonProvider

llama_model = Llama('/work/models/Phi-3-mini-4k-instruct-q4.gguf', n_ctx=1024)
provider = LlamaCppPythonProvider(llama_model)

def get_current_datetime(output_format: Optional[str] = None):
    """
    Get the current date and time in the given format.

    Args:
         output_format: formatting string for the date and time, defaults to '%Y-%m-%d %H:%M:%S'
    """
    if output_format is None:
        output_format = '%Y-%m-%d %H:%M:%S'
    return datetime.datetime.now().strftime(output_format)

class MathOperation(Enum):
    ADD = "add"
    SUBTRACT = "subtract"
    MULTIPLY = "multiply"
    DIVIDE = "divide"


# Simple pydantic calculator tool for the agent that can add, subtract, multiply, and divide. Docstring and description of fields will be used in system prompt.
class calculator(BaseModel):
    """
    Perform a math operation on two numbers.
    """
    number_one: Union[int, float] = Field(..., description="First number.")
    operation: MathOperation = Field(..., description="Math operation to perform.")
    number_two: Union[int, float] = Field(..., description="Second number.")

    def run(self):
        if self.operation == MathOperation.ADD:
            return self.number_one + self.number_two
        elif self.operation == MathOperation.SUBTRACT:
            return self.number_one - self.number_two
        elif self.operation == MathOperation.MULTIPLY:
            return self.number_one * self.number_two
        elif self.operation == MathOperation.DIVIDE:
            return self.number_one / self.number_two
        else:
            raise ValueError("Unknown operation.")

def send_message_to_user_callback(message: str):
    print(message)

function_tools = [
    LlamaCppFunctionTool(get_current_datetime),
    LlamaCppFunctionTool(calculator)
]

function_call_agent = FunctionCallingAgent(
    provider,
    system_prompt="You are a helpful assistant.",
    llama_cpp_function_tools=function_tools,
    send_message_to_user_callback=send_message_to_user_callback,
    allow_parallel_function_calling=True)

user_input1 = "What time is it now?"
agent_output1 = function_call_agent.generate_response(user_input1)

user_input2 = "What is 42 * 42?"
agent_output2 = function_call_agent.generate_response(user_input2)

print(f'{user_input1}\n{agent_output1}')
print(f'{user_input2}\n{agent_output2}')

Making multiple functions available lengthens the prompt and the context size runs out, so I set n_ctx=1024.

What time is it now?
[{'function': 'send_message', 'arguments': {'content': 'The current time is now 14:12'}, 'return_value': 'Message sent.'}]
What is 42 * 42?
[{'function': 'send_message', 'arguments': {'content': 'The result of 42 multiplied by 42 is 1764.'}, 'return_value': 'Message sent.'}]
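The value returned by generate_response is printed above as a list of dictionaries, so the message text can be pulled out with a short helper. This sketch assumes the result always has the shape shown in that output, which I have not confirmed against the library's documentation; sent_messages is a hypothetical name.

```python
def sent_messages(results):
    """Collect the 'content' of every send_message entry in a result list."""
    return [
        entry["arguments"]["content"]
        for entry in results
        if entry.get("function") == "send_message"
    ]


# Sample copied from the output shown above.
sample = [
    {
        "function": "send_message",
        "arguments": {"content": "The result of 42 multiplied by 42 is 1764."},
        "return_value": "Message sent.",
    }
]

print(sent_messages(sample))
```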

Even the arithmetic that LLMs are bad at is fine once the model can call a function!
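The n_ctx adjustment in this section comes down to simple arithmetic: the prompt and the generated tokens share one context window. The sketch below makes that check explicit. The token counts are hypothetical numbers for illustration, and rounding up in steps of 512 (the default n_ctx seen in the load log) is my own choice.

```python
def fits_in_context(prompt_tokens, max_tokens, n_ctx):
    """True when the prompt plus the requested output budget fits in n_ctx."""
    return prompt_tokens + max_tokens <= n_ctx


def smallest_n_ctx(prompt_tokens, max_tokens, step=512):
    """Round the required token count up to the next multiple of step."""
    needed = prompt_tokens + max_tokens
    return ((needed + step - 1) // step) * step


# Hypothetical numbers: a 600-token prompt (system prompt plus tool
# descriptions) and room for 256 generated tokens.
print(fits_in_context(600, 256, n_ctx=512))   # False
print(fits_in_context(600, 256, n_ctx=1024))  # True
print(smallest_n_ctx(600, 256))               # 1024
```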

Structured Output

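You can also generate Python object instances.

The mechanism is the same as the JSON-formatted responses I covered previously. If the maximum number of output tokens is too small, the JSON is cut off partway and causes an error, so I set max_tokens.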

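from enum import Enum
from typing import List
from pydantic import BaseModel, Field
from llama_cpp import Llama
from llama_cpp_agent import StructuredOutputAgent
from llama_cpp_agent import MessagesFormatterType
from llama_cpp_agent.providers import LlamaCppPythonProvider

# Load the model with a larger context window than the 512-token default
llama_model = Llama('/work/models/Phi-3-mini-4k-instruct-q4.gguf', n_ctx=1024)
provider = LlamaCppPythonProvider(llama_model)
# Raise the output token limit so the generated JSON is not cut off
settings = provider.get_provider_default_settings()
settings.max_tokens = 1024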

class Category(Enum):
    Fiction = "Fiction"
    NonFiction = "Non-Fiction"

class Book(BaseModel):
    """
    Represents an entry about a book.
    """
    title: str = Field(..., description="Title of the book.")
    author: str = Field(..., description="Author of the book.")
    published_year: int = Field(..., description="Publishing year of the book.")
    keywords: List[str] = Field(..., description="A list of keywords.")
    category: Category = Field(..., description="Category of the book.")
    summary: str = Field(..., description="Summary of the book.")


structured_output_agent = StructuredOutputAgent(
    provider,
    debug_output=True,
    messages_formatter_type=MessagesFormatterType.CHATML
)

text = """The Feynman Lectures on Physics is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the California Institute of Technology (Caltech), during 1961–1963. The book's co-authors are Feynman, Robert B. Leighton, and Matthew Sands."""
book = structured_output_agent.create_object(Book, text, llm_sampling_settings=settings)
print(book)

Running it displays the generated object.

{
"model":  "Book",
"fields": {
    "title": "The Feynman Lectures on Physics",
    "author": "Richard Feynman, Robert B. Leighton, and Matthew Sands",
    "published_year": 1964,
    "keywords": [
        "Physics textbook",
        "Richard Feynman lectures",
        "California Institute of Technology (Caltech)",
        "Undergraduate education"
    ],
    "category": "Non-Fiction",
    "summary": "The Feynman Lectures on Physics is a textbook based on some undergraduate physics lectures given by Nobel laureate Richard Feynman at the California Institute of Technology (Caltech) between 1961 and 1963. The book, co-authored with Robert B. Leighton and Matthew Sands, presents fundamental concepts in physics through insightful explanations."
}
}
title='The Feynman Lectures on Physics' author='Richard Feynman, Robert B. Leighton, and Matthew Sands' published_year=1964 keywords=['Physics textbook', 'Richard Feynman lectures', 'California Institute of Technology (Caltech)', 'Undergraduate education'] category=<Category.NonFiction: 'Non-Fiction'> summary='The Feynman Lectures on Physics is a textbook based on some undergraduate physics lectures given by Nobel laureate Richard Feynman at the California Institute of Technology (Caltech) between 1961 and 1963. The book, co-authored with Robert B. Leighton and Matthew Sands, presents fundamental concepts in physics through insightful explanations.'

Let's try accessing the attributes of book.

print(book.keywords)
print(book.published_year)

They are accessible as expected.

['Physics textbook', 'Richard Feynman lectures', 'California Institute of Technology (Caltech)', 'Undergraduate education']
1964

Being able to create Python objects from an LLM looks like it has plenty of uses.
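The max_tokens caveat in this section can be reproduced without a model: JSON that stops mid-object simply cannot be parsed. The sketch below simulates the cutoff by slicing a string. The error the library actually raises may differ, so treat this only as an illustration of why truncated output fails; try_parse is a hypothetical helper.

```python
import json

complete = '{"title": "The Feynman Lectures on Physics", "published_year": 1964}'
# Simulate generation stopping partway through the object.
truncated = complete[:40]


def try_parse(text):
    """Return the parsed object, or None if the JSON is incomplete or invalid."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(complete))
print(try_parse(truncated))
```

The first call yields a dictionary and the second yields None, because the truncated text ends inside a string literal.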

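Summary

This time I tried llama-cpp-agent. Function calling and structured output were quick to get working, and it looks set to broaden what local LLMs can do.

There are also other features such as RAG and Knowledge Graph Generation, so I'd like to try them next time if I feel motivated.

Addendum

I have written a follow-up post on trying RAG with llama-cpp-agent.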