Local GPT with Llama2
This tutorial shows how to use the LocalGPT open source initiative on the Intel® Gaudi®2 AI accelerator. LocalGPT lets you load your own documents and run an interactive chat session over that material: you can query and summarize your content by placing any .pdf or .txt documents into the SOURCE_DOCUMENTS folder, running the ingest.py script to tokenize your content, and then running the run_localGPT.py script to start the interaction.
In this example, we’re using the Llama2-chat 13B model from Meta (meta-llama/Llama-2-13b-chat-hf) as the reference model that will execute the inference on Intel Gaudi2 AI accelerators.
To optimize this instantiation of LocalGPT, we have created new content on top of the existing Hugging Face based “text-generation” inference task and pipelines, including:
- Using the Hugging Face Optimum Habana Library with the Llama2-13B model, which is optimized on Intel® Gaudi®2 AI accelerators.
- Using Langchain to import the source document with a custom embedding model, using the GaudiHuggingFaceEmbeddings class based on HuggingFaceEmbeddings.
- Using a custom pipeline class, GaudiTextGenerationPipeline, which optimizes text-generation tasks with padding and indexing for static shapes to improve performance.
To optimize LocalGPT on Intel Gaudi2 AI accelerators, custom classes were developed for text embeddings and text generation. The application uses the custom class GaudiHuggingFaceEmbeddings to convert textual data to vector embeddings. This class extends the HuggingFaceEmbeddings class from LangChain and utilizes an Intel® Gaudi®2 AI accelerator-optimized implementation of SentenceTransformer.
The tokenization process was modified to incorporate static shapes, which provides a significant speed-up. Furthermore, the GaudiTextGenerationPipeline class provides a link between the Optimum Habana library and LangChain. Similar to pipelines from Hugging Face Transformers, this class enables text generation with optimizations such as kv-caching, static shapes, and HPU graphs. It also lets the user modify the text-generation parameters (temperature, top_p, do_sample, etc.) and provides a method to compile computation graphs on Intel® Gaudi®2 AI accelerators. Instances of this class can be passed directly as input to LangChain classes.
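The static-shapes idea behind these optimizations can be illustrated with a small, stdlib-only sketch (the helper below is hypothetical, not the actual GaudiHuggingFaceEmbeddings code): padding every tokenized sequence up to one of a few fixed bucket lengths means the accelerator sees the same tensor shapes on every call, so compiled graphs can be reused instead of recompiled per input length.

```python
# Hypothetical illustration of static-shape padding; the real logic
# lives in the tutorial's gaudi_utils code, not in this sketch.

def pad_to_bucket(token_ids, bucket_sizes=(128, 256, 512), pad_id=0):
    """Pad a token sequence up to the next fixed bucket length.

    Using a small set of fixed lengths (instead of the exact sequence
    length) keeps tensor shapes static, so graphs compiled for a
    bucket can be reused across batches on the accelerator.
    """
    for size in bucket_sizes:
        if len(token_ids) <= size:
            return token_ids + [pad_id] * (size - len(token_ids))
    raise ValueError("sequence longer than the largest bucket")

# Every short sequence lands in the 128 bucket, so its shape is stable.
print(len(pad_to_bucket([101, 2023, 2003, 102])))  # → 128
```

The trade-off is a little wasted compute on the pad tokens in exchange for avoiding graph recompilation, which usually dominates on accelerators with graph-compilation overhead.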
Please follow the steps below to set up and run the model.
Set the Folder location and environment variables
```bash
cd /root/Gaudi-tutorials/PyTorch/localGPT_inference
export DEBIAN_FRONTEND="noninteractive"
export TZ=Etc/UTC
```
Install the requirements for LocalGPT
```bash
apt-get update
apt-get install -y tzdata bash-completion python3-pip openssh-server vim git iputils-ping net-tools protobuf-compiler curl bc gawk tmux
rm -rf /var/lib/apt/lists/*
pip install -q --upgrade pip
pip install -q -r requirements.txt
```
Install the Optimum Habana Library from Hugging Face
```bash
pip install -q --upgrade-strategy eager optimum[habana]
```
Load your Local Content
Copy all of your files into the SOURCE_DOCUMENTS directory. For this example, a copy of the United States Constitution is included in the folder. You can ingest additional content by adding your own files to the folder.
The default supported file types are .txt, .pdf, .csv, and .xlsx; if you want to use any other file type, you will need to convert it to one of the defaults.
Run the following command to ingest all the data. The ingest.py script uses LangChain tools to parse the documents and create embeddings locally using the GaudiHuggingFaceEmbeddings class. It then stores the result in a local vector database (DB) using the Chroma vector store.
If you want to start from an empty database, delete the DB folder and run the ingest script again.
```bash
python ingest.py --device_type hpu
2023-10-10 23:23:58,137 - INFO - ingest.py:124 - Loading documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:23:58,148 - INFO - ingest.py:37 - Loading document batch
2023-10-10 23:24:48,208 - INFO - ingest.py:133 - Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:24:48,208 - INFO - ingest.py:134 - Split into 2227 chunks of text
Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib
2023-10-10 23:24:49,625 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,149 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,723 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:24:50,950 - INFO - duckdb.py:460 - loaded in 4454 embeddings
2023-10-10 23:24:50,952 - INFO - duckdb.py:472 - loaded in 1 collections
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056447244 KB
------------------------------------------------------------------------------
Batches: 100%|██████████████████████████████████| 70/70 [00:02<00:00, 23.41it/s]
2023-10-10 23:24:58,235 - INFO - ingest.py:161 - Time taken to create embeddings vectorstore: 7.784449464001227s
2023-10-10 23:24:58,235 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:24:58,619 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
```
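Conceptually, ingestion splits each document into overlapping chunks before embedding them (the log above shows the Constitution split into 2227 chunks). The real script uses LangChain's text splitters; a minimal, stdlib-only sketch of that splitting step, with illustrative parameter values, looks like this:

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character chunks, roughly as a
    character-based text splitter would before embedding.

    The overlap keeps context that straddles a chunk boundary
    retrievable from either neighboring chunk.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # advance by the stride
    return chunks

# A 2500-character document yields 3 overlapping chunks.
print(len(split_into_chunks("x" * 2500)))  # → 3
```

Each chunk is then passed through the embedding model and stored, together with its vector, in the Chroma database so it can be retrieved later by similarity search.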
How to Access and Use the Llama2 Model
Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.
To run gated models like Llama-2-13b-chat-hf, you need the following:
- Have a HuggingFace account
- Set a read token
- Login to your account using the HF CLI: run huggingface-cli login before launching your script
```bash
huggingface-cli login --token <your token here>
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
```
Running the LocalGPT model with Llama2 13B Chat
Set the model Usage
To change the model, modify the "LLM_ID = " value in the constants.py file. For this example, the default is meta-llama/Llama-2-13b-chat-hf.
Since this is interactive, it’s a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama2 in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the documentation.
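That similarity search can be pictured as ranking the stored chunk embeddings by cosine similarity to the query embedding. A stdlib-only sketch with toy vectors (the real store uses Chroma with an optimized index, and the chunk ids below are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k chunk ids most similar to the query embedding,
    mimicking the vector store's similarity search."""
    ranked = sorted(store.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy 3-dimensional "embeddings" for three stored chunks.
store = {
    "article-I": [0.9, 0.1, 0.0],
    "article-II": [0.1, 0.9, 0.0],
    "amendments": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.1], store))  # → ['article-I', 'article-II']
```

The top-ranked chunks are then stuffed into the prompt as context, and the LLM answers the question grounded in that retrieved text.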
```bash
python run_localGPT.py --device_type hpu
```
Note: The inference runs in sampling mode, so you can optionally modify the temperature and top_p settings in run_localGPT.py, line 84, to change the output. The current settings are temperature=0.5, top_p=0.5. Type “exit” at the prompt to stop the execution.
Run this in a terminal window to start the chat: `python run_localGPT.py --device_type hpu`. The example below shows the initial output:
```bash
python run_localGPT.py --device_type hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:186 - Running on: hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:187 - Display Source Documents set to: False
2023-10-10 23:29:56,315 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,718 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,922 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-10-10 23:29:56,931 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:29:56,935 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-10-10 23:29:56,938 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:29:57,183 - INFO - duckdb.py:460 - loaded in 6681 embeddings
2023-10-10 23:29:57,184 - INFO - duckdb.py:472 - loaded in 1 collections
2023-10-10 23:29:57,185 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:38 - Loading Model: meta-llama/Llama-2-13b-chat-hf, on: hpu
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:39 - This action can take a few minutes!
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2023-10-10 23:29:57,622] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[WARNING|utils.py:177] 2023-10-10 23:29:58,637 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but habana-frameworks v22.214.171.1240 was found, this could lead to undefined behavior!
[WARNING|utils.py:190] 2023-10-10 23:29:59,786 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but the driver version is v1.12.0, this could lead to undefined behavior!
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 17427.86it/s]
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:02<00:00, 1.12it/s]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056447244 KB
------------------------------------------------------------------------------
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
2023-10-10 23:32:20,404 - INFO - run_localGPT.py:133 - Local LLM Loaded

Enter a query: what is the Article I ?
2023-10-23 19:47:28,598 - INFO - run_localGPT.py:240 - Query processing time: 1.5914111537858844s

> Question: what is the Article I ?
> Answer: It is the first article of the US constitution .

Enter a query: what does it say?
2023-10-23 19:47:36,684 - INFO - run_localGPT.py:240 - Query processing time: 1.872558546019718s

> Question: what does it say?
> Answer: The first article of the US constitution states "All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives."

Enter a query: What about article II
2023-10-23 20:34:42,818 - INFO - run_localGPT.py:240 - Query processing time: 1.6038263840600848s

> Question: What about article II
> Answer: Article II of the US constitution deals with the executive branch of government and establishes the office of the president.
```
You can add your own content to the SOURCE_DOCUMENTS folder to query and chat with it. You can also modify the `temperature` and `top_p` values in the run_localGPT.py file, line 84:
```python
pipe = GaudiTextGenerationPipeline(model_name_or_path=model_id,
                                   max_new_tokens=100,
                                   temperature=0.5,
                                   top_p=0.5,
                                   repetition_penalty=1.15,
                                   use_kv_cache=True,
                                   do_sample=True)
```
to experiment with different values and get different outputs. Please also review the updated GaudiTextGenerationPipeline class in gaudi_utils/pipeline.py for information on tokenization and padding.
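To build intuition for what `temperature` and `top_p` change, here is a hedged, stdlib-only sketch of temperature scaling plus nucleus (top-p) filtering over a toy logit vector; the actual sampling happens inside the model's generate loop, not in user code:

```python
import math
import random

def sample_token(logits, temperature=0.5, top_p=0.5, rng=None):
    """Temperature-scale logits, keep the smallest set of tokens whose
    cumulative probability reaches top_p, then sample from that set.

    Lower temperature sharpens the distribution; lower top_p trims the
    candidate pool, so both push generation toward likelier tokens.
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, high to low, and keep the nucleus.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw one token id.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

# With temperature=0.5 and top_p=0.5, the sharpened top token alone
# fills the nucleus here, so sampling is effectively greedy.
print(sample_token([2.0, 1.0, 0.1, -1.0]))  # → 0
```

Raising `temperature` toward 1.0 and `top_p` toward 1.0 widens the nucleus and makes outputs more varied, which is why the tutorial suggests experimenting with those two values on line 84.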
Copyright© 2023 Habana Labs, Ltd. an Intel Company.
Licensed under the Apache License, Version 2.0 (the “License”);
You may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.