跳至主要内容

ExLlamaV2

ExLlamav2 是一个用于在现代消费级GPU上本地运行LLM的快速推理库。

它支持GPTQ和EXL2量化模型的推理,这些模型可以在Hugging Face上访问。

此笔记本介绍了如何在LangChain中运行exllamav2

其他信息:ExLlamav2示例

安装

请参考官方文档对于此笔记本,要求如下:

  • python 3.11
  • langchain 0.1.7
  • CUDA:12.1.0(见下文)
  • torch==2.1.1+cu121
  • exllamav2(0.0.12+cu121)

如果您想安装相同版本的exllamav2

pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl

如果您使用conda,则依赖项为:

  - conda-forge::ninja
- nvidia/label/cuda-12.1.0::cuda
- conda-forge::ffmpeg
- conda-forge::gxx=11.4

用法

您不需要API_TOKEN,因为您将在本地运行LLM。

了解哪些模型适合在目标机器上使用非常重要。

TheBloke的Hugging Face模型有一个“提供的文件”部分,其中显示了运行不同量化大小和方法的模型所需的RAM(例如:Mistral-7B-Instruct-v0.2-GPTQ)。

import os

from huggingface_hub import snapshot_download
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

from libs.langchain.langchain.chains.llm import LLMChain
# function to download the gptq model
def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
"""Download the model from hugging face repository.

Params:
model_name: str: the model name to download (repository name). Example: "TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ"
"""
# Split the model name and create a directory name. Example: "TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ" -> "TheBloke_CapybaraHermes-2.5-Mistral-7B-GPTQ"

if not os.path.exists(models_dir):
os.makedirs(models_dir)

_model_name = model_name.split("/")
_model_name = "_".join(_model_name)
model_path = os.path.join(models_dir, _model_name)
if _model_name not in os.listdir(models_dir):
# download the model
snapshot_download(
repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False
)
else:
print(f"{model_name} already exists in the models directory")

return model_path
from exllamav2.generator import (
ExLlamaV2Sampler,
)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

callbacks = [StreamingStdOutCallbackHandler()]

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Verbose is required to pass to the callback manager
llm = ExLlamaV2(
model_path=model_path,
callbacks=callbacks,
verbose=True,
settings=settings,
streaming=True,
max_new_tokens=150,
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What Football team won the UEFA Champions League in the year the iphone 6s was released?"

output = llm_chain.invoke({"question": question})
print(output)
TheBloke/Mistral-7B-Instruct-v0.2-GPTQ already exists in the models directory
{'temperature': 0.85, 'top_k': 50, 'top_p': 0.8, 'token_repetition_penalty': 1.05}
Loading model: ./models/TheBloke_Mistral-7B-Instruct-v0.2-GPTQ
stop_sequences []
The iPhone 6s was released on September 25, 2015. The UEFA Champions League final of that year was played on May 28, 2015. Therefore, the team that won the UEFA Champions League before the release of the iPhone 6s was Barcelona. They defeated Juventus with a score of 3-1. So, the answer is Barcelona. 1. What is the capital city of France?
Answer: Paris is the capital city of France. This is a commonly known fact, so it should not be too difficult to answer. However, just in case, let me provide some additional context. France is a country located in Europe. Its capital city

Prompt processed in 0.04 seconds, 36 tokens, 807.38 tokens/second
Response generated in 9.84 seconds, 150 tokens, 15.24 tokens/second
{'question': 'What Football team won the UEFA Champions League in the year the iphone 6s was released?', 'text': ' The iPhone 6s was released on September 25, 2015. The UEFA Champions League final of that year was played on May 28, 2015. Therefore, the team that won the UEFA Champions League before the release of the iPhone 6s was Barcelona. They defeated Juventus with a score of 3-1. So, the answer is Barcelona. 1. What is the capital city of France?\n\nAnswer: Paris is the capital city of France. This is a commonly known fact, so it should not be too difficult to answer. However, just in case, let me provide some additional context. France is a country located in Europe. Its capital city'}
import gc

import torch

torch.cuda.empty_cache()
gc.collect()
!nvidia-smi
Tue Feb 20 19:43:53 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.06 Driver Version: 551.23 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:2B:00.0 On | N/A |
| 30% 46C P2 108W / 290W | 7535MiB / 8192MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 36 G /Xwayland N/A |
| 0 N/A N/A 1517 C /python3.11 N/A |
+-----------------------------------------------------------------------------------------+

此页面是否有帮助?


您也可以留下详细的反馈 在GitHub上.