環境構築すら不要なllamafileでローカルLLM(phi-3)

はじめに

いままでローカルのLLM構築(特にGPUを積んでいないマシン用)にはllama.cppを使っていました。

llama.cpp自体はコンパイル環境を整える必要がありますが、pythonバインドのllama-cpp-pythonを使うのであれば、pipでインストールすることができます。

ここらへんは、過去にllama.cppおよび、llama-cpp-pythonの構築記事を上げています。

上記の記事中ではGodot(もしくはUnity, UnrealEngine)でLLMを使うために、llama.cppを実行ファイルにコンパイルしているのですが、実際にコンパイルするのは結構面倒です。

特にゲーム開発はWindowsユーザーも多くいると思いますが、WindowsでのC++のコンパイルはVisual Studioを入れないといけなかったり、パスを通したりと大変でした。

そんな中、llamafileというものを耳に入れました

llamafileとは

Build software better, together

GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over...

とはllama.cppをベースに作られたプロジェクトであり、マルチOSで利用可能な実行ファイルを配布しています。つまりOS毎に異なるコンパイル環境すら必要としていません。

その手軽さはなんと、ダウンロードして拡張子を変更するだけです。
さらに、ベースとなっているllama.cppよりも同等もしくは高速であることを謳っています。

大規模言語モデルを単一ファイルで配布・実行する「llamafile」のバージョン0.7で処理能力が最大10倍高速化

大規模言語モデル(LLM)をわずか4GBほどの実行ファイル1つで手軽に配布・実行できるようにしたパッケージ「llamafile v0.7」が公開されました。このバージョンではCPUとGPU両方の計算性能と計算精度が向上しており、命令セットア...

使っていみる

まずはllamafileをダウンロードします。
Googleで「モデル名　llamafile」で検索すれば、有名どころのローカルLLMは見つかります。
今回は小規模ローカルLLMで最も性能が高いと言われているPhi-3を試したいと思います。

モデル・llamafileのダウンロード

公式がphi-3用のllamafileをhuggingfaceで公開しているので、そこからダウンロードしました。

Mozilla/Phi-3-mini-4k-instruct-llamafile · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

とりあえず、Phi-3-mini-4k-instruct.Q4_K_M.llamafileとPhi-3-mini-4k-instruct.Q8_0.llamafileをダウンロードしました。

量子化のディスクサイズはこんな感じです

name	disk size
Phi-3-mini-4k-instruct.Q4_K_M.llamafile	2.42GB
Phi-3-mini-4k-instruct.Q8_0.llamafile	4.09GB

ダウンロードしたら、適当な作業用ディレクトリに移します。

名前(拡張子)の変更

Windowsでは実行ファイルはexe形式でないと認識されないので、exe形式に変更します。

PS E:\work\llamafile> mv .\Phi-3-mini-4k-instruct.Q4_K_M.llamafile .\Phi-3-mini-4k-instruct.Q4_K_M.llamafile.exe
PS E:\work\llamafile> mv .\Phi-3-mini-4k-instruct.Q8_0.llamafile .\Phi-3-mini-4k-instruct.Q8_0.llamafile.exe

実行

上のリンクにもHow to useが書かれていますが、

<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>

の形式でチャットするのが良いそうです。<|end|>をつけないといけないことに注意ですね。

ターミナルで実行すると、引数が長くなってしまうので、実行する用のpythonスクリプトを書きます(引数として文字列を与えるだけなので、実際に利用する場合はpythonで書かなくても良いです)

import subprocess

question = 'How to explain Internet for a medieval knight?'
llm_prompt = f'''<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
{question}<|end|>
<|assistant|>
'''
command = [
    './Phi-3-mini-4k-instruct.Q8_0.llamafile.exe',
    '-ngl',  '9999',
    '-e', '-p', f'"{llm_prompt}"',
    '-n', '512'
]
print(' '.join(command))
result = subprocess.run(command, capture_output=True, text=True, encoding='utf8')
print(result.stderr)
print(result.stdout)

実行すると、次のような結果がでてきます。

llama_print_timings:        load time =    2381.94 ms
llama_print_timings:      sample time =      15.58 ms /   423 runs   (    0.04 ms per token, 27150.19 tokens per second)
llama_print_timings: prompt eval time =    1409.27 ms /    31 tokens (   45.46 ms per token,    22.00 tokens per second)
llama_print_timings:        eval time =   64095.61 ms /   422 runs   (  151.89 ms per token,     6.58 tokens per second)
llama_print_timings:       total time =   65616.01 ms /   453 tokens
Log end

<s> "<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
"Good morrow, noble knight. I shall endeavor to enlighten thee on a realm known as the 'Internet,' a vast and mystical network of knowledge and communication, conceived in the modern era far beyond the bounds of our time.

The Internet is akin to a grand tapestry, interwoven with threads of information, connecting people and places across great distances. It is akin to a great library, where scrolls and tomes are replaced by digital documents and images, accessible with but a mere flick of the finger or a gentle whisper of a magical device called a 'computer.'

Picture, if thou will, a realm wherein messages can be conveyed instantaneously, without the need for couriers on horseback or the passage of time. The Internet allows people to communicate through 'e-mails,' a system wherein written messages are sent electronically from one person's device to another's, much like sending a sealed letter, but in the blink of an eye.

This grand network also houses an abundance of knowledge. Just as the great libraries of lore and scholarship, it holds vast libraries of information, accessible through search engines that act as guides, helping one navigate the sea of data in search of what they desire to learn or discover.

Moreover, the Internet provides a stage for merriment and entertainment, akin to the bustling marketplaces of our age, where music, tales, and moving images can be enjoyed without leaving one's own chambers. 

But heed this, wise knight, for the Internet, while a boon to humanity, can also harbor perils and untruths. Like any powerful tool, it must be wielded with wisdom and discretion.

I hope this brief glimpse into the wonders of the Internet has given thee a glimpse of the marvels that lie beyond the confines of our time."<|end|>

無事、出力できたようです。日本語訳してみると、割といい感じで回答が帰ってきていることがわかります。

<s> "<|システム|>
あなたは役に立つ AI アシスタントです。<|end|>
<|ユーザー|>
中世の騎士にインターネットをどう説明するか?<|end|>
<|アシスタント|>
「騎士殿、想像してみてください。山、川、海に囲まれた広大な王国ではなく、私たちの周りに存在する目に見えない領域です。私たちが「インターネット」と呼ぶこの王国は、壮大な王国に似ています。 図書館、市場、情報の羊皮紙、そして世界各地からの貴族の集まり。

この領域には、世界中に伸び、すべての人、すべての城、すべての村を結ぶ、神秘的な道、目に見えない糸が存在します。 「インターネット接続」と呼ばれるこれらの経路を使用すると、迅速な配達員や魔法のカラスのように、メッセージを即座に送信できます。

壮大な図書館では、世界各地の巻物や知識の書が見つかります。 これは私たちが「ウェブサイト」と呼ぶものに似ており、それぞれのウェブサイトにはさまざまなテーマに関する豊富な情報が含まれています。 これらの Web サイトは世界の書籍に似ていますが、「インターネット接続」と呼ばれる適切なキーを持っている限り、どのデバイスからでもアクセスできます。

市場は、にぎやかな中世の町の広場のように、世界中から人々が商品を取引できる場所です。 ただし、シルクやスパイスを売る代わりに、アイデアや情報を交換します。 これらの取引は「インターネット」によって促進され、「電子メール」または「メッセージング アプリ」を通じて行われ、手紙を送るのと似ていますが、瞬時に行われます。

情報の羊皮紙は、私たちが歴史、考え、発見を記録する場所です。 これは「ソーシャルメディア」と呼ばれるもので、人々はタウンクライヤーによく似ていますが、より大規模に自分の経験を共有できます。

最後に、貴族の集まりは、私たちが「フォーラム」または「オンライン コミュニティ」と呼ぶものに似ており、さまざまな土地の人々が集まり、重要な問題について話し合い、知識を共有できます。

インターネットは、他の王国と同様に、注意、敬意、知恵を持ってナビゲートする必要があることを忘れないでください。インターネットには多くの宝物がある一方で、誤った情報やサイバー脅威という独自の脅威も存在するからです。 しかし、正しい知識と理解があれば、探求を大いに助け、私たちの世界についての理解をより豊かにすることができます。」<|end|>

GPU実行

しかも、GPU実行もできます。llama.cppやllama-cpp-pythonではGPUを使うためには専用のコンパイル環境が必要でした。それすら必要ないのです。

GPU実行は、引数として-ngl 9999,を与えるだけです。9999はGPUの使用割合で最大値です。

実行すると8.4倍くらい高速に出力できました。

type	ms / tokens	ms / 1 token
CPU	65616.01 / 453	144.84 ms
GPU	8101.59 / 470	17.23 ms

llama_print_timings:        load time =    2413.66 ms
llama_print_timings:      sample time =      18.56 ms /   440 runs   (    0.04 ms per token, 23705.62 tokens per second)
llama_print_timings: prompt eval time =      98.00 ms /    31 tokens (    3.16 ms per token,   316.31 tokens per second)
llama_print_timings:        eval time =    7855.89 ms /   439 runs   (   17.89 ms per token,    55.88 tokens per second)
llama_print_timings:       total time =    8101.59 ms /   470 tokens
Log end

<s> "<|system|>
You are a helpful AI assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
"Noble knight, imagine the vast kingdom we call the 'Internet' as a grand, invisible realm where knowledge, communication, and wondrous discoveries abound. This realm is interconnected by invisible threads that allow swift and instant communication, much like messengers delivering news across great distances in the blink of an eye.


Within this realm, one can access a vast library of knowledge, containing information on all manner of subjects, from the art of swordsmanship to the intricacies of alchemy. This library is ever-growing and constantly updated, with scholars from all over the kingdom contributing to its wealth of information.


The internet is also a marketplace for tradesmen, where goods and services can be exchanged without ever leaving the comfort of one's own keep. Merchants from distant lands can offer their wares, and patrons can procure what they seek, be it fine silks or exotic spices, with just a few strokes of the quill.


Furthermore, this realm allows for the connection of great minds across vast distances, enabling the exchange of ideas and philosophies. Scribes and thinkers can debate, collaborate, and share their wisdom with one another, fostering a greater understanding of the world and its workings.


To access this realm, one would need a magical looking glass, known as a 'computer,' which serves as the portal to this digital kingdom. Through a series of incantations (or rather, instructions), one can send and receive messages, view the vast collection of knowledge, and participate in the grand tapestry of interconnected experiences that make up the internet.


However, like any powerful tool, the internet must be handled with care and respect. Just as you would not misuse your sword, one should be mindful of the impact that their words and actions have within this realm, for the digital realm, while powerful, is also a reflection of the society that governs it."<|end|>

日本語能力

調べた限りでは日本語も出力できるようです。ただし、整合性は英語よりも落ちるらいです。
でも、phi-2ではまったく日本語に対応できなかったので、それよりはマシですね。

import subprocess

question = '中世の騎士にインターネットをどう説明しますか？'
llm_prompt = f'''<|system|>
あなたは優秀なAIアシスタントです<|end|>
<|user|>
{question}<|end|>
<|assistant|>
'''
command = [
    './Phi-3-mini-4k-instruct.Q8_0.llamafile.exe',
    '-ngl',  '9999',
    '-e', '-p', f'"{llm_prompt}"',
    '-n', '512'
]
print(' '.join(command))
result = subprocess.run(command, capture_output=True, text=True, encoding='utf8')
print(result.stderr)
print(result.stdout)

質問の意図がずれてしまっていますが、ちゃんとした日本語で返答できています。出力トークンを超えたので、途切れてしまいましたが問題なさそうですね。

llama_print_timings:        load time =    2427.86 ms
llama_print_timings:      sample time =      24.18 ms /   512 runs   (    0.05 ms per token, 21177.15 tokens per second)
llama_print_timings: prompt eval time =     219.90 ms /    58 tokens (    3.79 ms per token,   263.75 tokens per second)
llama_print_timings:        eval time =    9583.72 ms /   511 runs   (   18.75 ms per token,    53.32 tokens per second)
llama_print_timings:       total time =    9973.15 ms /   569 tokens
Log end

<s> "<|system|>
あなたは優秀なAIアシスタントです<|end|>
<|user|>
中世の騎士にインターネットをどう説明しますか？<|end|>
<|assistant|>
"もし中世の騎士がインターネットを知ることができたら、まずその存在を理解するために彼らは、インターネットの基本的な概念と機器を受け取ることになるでしょう。例えば、インターネットとは、地球上のすべての場所を繋ぐ コミュニケーションのネットワークです。これは情報が膨大な量であり、非常に高速に届くことによって、世界中の人々と情報交換を可能にします。

インターネットの中には、様々なメディアがあります。例えば、'ウェブ'と呼ばれるコンピュータ上で共有される文書、画像、動画などがあり、これらはウェブサイトにアクセスできます。また、'ソーシャルメディア'と呼ばれるプラットフォームを利用することで、即座に他の人々と情報や即興のコミュニケーションを行うことができます。

騎士たちにとって、インターネットは情報の無限の宝庫であり、知識の広がりや、遠隔地の情報交換の促進に大きな影響を与える可能性があります。ただし、その技術は彼らが完全に理解し、適切に活用できるかは不確定です。歴史的な文脈を考慮すると、これらの技

量子化サイズの違い

最後に、量子化サイズの違いについて調べてみます。

name	disk size	1token ms (CPU)	1token ms (GPU)
Phi-3-mini-4k-instruct.Q4_K_M.llamafile	2.42GB	89.66 ms	14.81 ms
Phi-3-mini-4k-instruct.Q8_0.llamafile	4.09GB	144.84 ms	17.23 ms

思ったよりも量子化しても早くはなりませんでした。GPU環境なら量子化しなくてもいいかもしれませんね。

感想

本当に便利すぎる。GodotやUnity、UnrealEngineなどのゲームエンジンでLLMを使いたいと思うと、結構障壁がある。特にオフラインでLLMを動かせるようにしたいと思うと、今まではC++やC#で組み込むしか選択肢がなかった。となると、コンパイル環境が衝突しないようにビルドしないと行けなかった。

それが、llamafileを置くだけで、あとはコマンド呼び出し関数を呼べばLLMが使えるのは本当に便利。さらにゲームであれば、GPU動作ができる前提で動かせるし、LLMを使ったゲームが流行りそう。

一方の画像生成系はstable-diffusion.cppはあるけど、llamafileのようなプロジェクトはまだ無いので、期待して待ってます。