LocalGPT with Llama2
This tutorial shows how to use the LocalGPT open source initiative on the Intel® Gaudi®2 AI accelerator. LocalGPT lets you load your own documents and run an interactive chat session with that material: place any .pdf or .txt documents into the SOURCE_DOCUMENTS
folder, run the ingest.py script to tokenize your content, and then run the run_localGPT.py script to start querying and summarizing it.
In this example, we use the Llama2-chat 13B model from Meta (meta-llama/Llama-2-13b-chat-hf) as the reference model that runs inference on Intel Gaudi2 AI accelerators.
To optimize this instantiation of LocalGPT, we created new content on top of the existing Hugging Face based “text-generation” inference task and pipelines, including:
- Using the Hugging Face Optimum Habana library with the Llama2-13B model, which is optimized for Intel® Gaudi®2 AI accelerators.
- Using LangChain to import the source document with a custom embedding model, via the GaudiHuggingFaceEmbeddings class based on HuggingFaceEmbeddings.
- Using a custom pipeline class, GaudiTextGenerationPipeline, which optimizes text-generation tasks with padding and indexing for static shapes to improve performance.
To optimize LocalGPT on Intel Gaudi2 AI accelerators, custom classes were developed for text embeddings and text generation. The application uses the custom GaudiHuggingFaceEmbeddings class to convert textual data to vector embeddings. This class extends the HuggingFaceEmbeddings class from LangChain and uses an implementation of SentenceTransformer optimized for Intel® Gaudi®2 AI accelerators.
The tokenization process was modified to incorporate static shapes, which provides a significant speed-up. Furthermore, the GaudiTextGenerationPipeline class provides a link between the Optimum Habana library and LangChain. Like the pipelines from Hugging Face transformers, this class enables text generation with optimizations such as kv-caching, static shapes and HPU graphs. It also lets the user modify the text-generation parameters (temperature, top_p, do_sample, etc.) and includes a method to compile computation graphs on Intel® Gaudi®2 AI accelerators. Instances of this class can be passed directly to LangChain classes.
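As a rough sketch of how the pipeline plugs into LangChain (hedged: the import path and the compile_graph() call below are assumptions; only the constructor arguments, which mirror those shown later in this tutorial, are taken from the source):

# Hedged sketch: wiring the custom Gaudi pipeline into LangChain
from langchain.llms import HuggingFacePipeline
from gaudi_utils.pipeline import GaudiTextGenerationPipeline  # assumed import path

pipe = GaudiTextGenerationPipeline(
    model_name_or_path="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=100,
    temperature=0.5,
    top_p=0.5,
    repetition_penalty=1.15,
    use_kv_cache=True,
    do_sample=True,
)
pipe.compile_graph()  # assumed name for the graph-compilation method mentioned above

# Because the class mimics a transformers text-generation pipeline,
# its instances can be handed directly to LangChain:
llm = HuggingFacePipeline(pipeline=pipe)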
Please follow the steps below to set up and run the model.
Set the Folder location and environment variables
cd /root/Gaudi-tutorials/PyTorch/localGPT_inference
export DEBIAN_FRONTEND="noninteractive"
export TZ=Etc/UTC
Install the requirements for LocalGPT
apt-get update
apt-get install -y tzdata bash-completion python3-pip openssh-server vim git iputils-ping net-tools protobuf-compiler curl bc gawk tmux
rm -rf /var/lib/apt/lists/*
pip install -q --upgrade pip
pip install -q -r requirements.txt
Install the Optimum Habana Library from Hugging Face
pip install -q --upgrade-strategy eager optimum[habana]
Load your Local Content
Copy all of your files into the SOURCE_DOCUMENTS directory. For this example, a copy of the United States Constitution is included in the folder. You can ingest additional content by adding your own files to the folder.
The default supported file types are .txt, .pdf, .csv, and .xlsx; if you want to use any other file type, you will need to convert it to one of these.
Run the following command to ingest all the data. The ingest.py script uses LangChain tools to parse the documents and create embeddings locally using the GaudiHuggingFaceEmbeddings class. It then stores the result in a local vector database (DB) using the Chroma vector store.
If you want to start from an empty database, delete the DB folder and run the ingest script again.
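For reference, the core of the ingestion flow looks roughly like the sketch below (hedged: the loader choice, chunking parameters, and the GaudiHuggingFaceEmbeddings import path are assumptions, not the exact contents of ingest.py):

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from gaudi_utils.embeddings import GaudiHuggingFaceEmbeddings  # assumed import path

# Load everything in SOURCE_DOCUMENTS and split it into chunks
documents = DirectoryLoader("SOURCE_DOCUMENTS").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = splitter.split_documents(documents)

# Embed the chunks on the Gaudi accelerator and persist them to the DB folder
embeddings = GaudiHuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(texts, embeddings, persist_directory="DB")
db.persist()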
python ingest.py --device_type hpu
2023-10-10 23:23:58,137 - INFO - ingest.py:124 - Loading documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:23:58,148 - INFO - ingest.py:37 - Loading document batch
2023-10-10 23:24:48,208 - INFO - ingest.py:133 - Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:24:48,208 - INFO - ingest.py:134 - Split into 2227 chunks of text
Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib
2023-10-10 23:24:49,625 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,149 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,723 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:24:50,950 - INFO - duckdb.py:460 - loaded in 4454 embeddings
2023-10-10 23:24:50,952 - INFO - duckdb.py:472 - loaded in 1 collections
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056447244 KB
------------------------------------------------------------------------------
Batches: 100%|██████████████████████████████████| 70/70 [00:02<00:00, 23.41it/s]
2023-10-10 23:24:58,235 - INFO - ingest.py:161 - Time taken to create embeddings vectorstore: 7.784449464001227s
2023-10-10 23:24:58,235 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:24:58,619 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
How to Access and Use the Llama2 Model
Use of the pretrained model is subject to compliance with third party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who are the intended users and additional terms please review and read the instructions in this link https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use or compliance with third party licenses.
To be able to run gated models like this Llama-2-13b-chat-hf, you need the following:
- Have a Hugging Face account
- Agree to the terms of use of the model in its model card on the HF Hub
- Create a read access token
- Log in to your account using the HF CLI: run huggingface-cli login before launching your script
huggingface-cli login --token <your token here>
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Running the LocalGPT model with Llama2 13B Chat
Set the Model Usage
To change the model, modify the “LLM_ID = ” value in the constants.py file. For this example, the default is meta-llama/Llama-2-13b-chat-hf.
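The relevant line in constants.py looks like this (a hypothetical excerpt based on the description above):

LLM_ID = "meta-llama/Llama-2-13b-chat-hf"  # swap in any other Hugging Face model id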
Since this is interactive, it’s a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama2 in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right pieces of context from the documents.
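Conceptually, the question-answering flow resembles this sketch (hedged: the exact chain setup in run_localGPT.py may differ; llm and embeddings are the objects built in the earlier sketches):

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Reopen the vector store that ingest.py persisted to the DB folder
db = Chroma(persist_directory="DB", embedding_function=embeddings)

# A similarity search retrieves the most relevant chunks, which are
# "stuffed" into the prompt that the local LLM answers from
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())
print(qa.run("What does Article I establish?"))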
python run_localGPT.py --device_type hpu
Note: The inference runs in sampling mode, so you can optionally modify the temperature and top_p settings in run_localGPT.py, line 84, to change the output. The current settings are temperature=0.5, top_p=0.5. Type “exit” at the prompt to stop the execution.
Run this in a terminal window to start the chat: `python run_localGPT.py --device_type hpu`. The example below shows the initial output:
python run_localGPT.py --device_type hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:186 - Running on: hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:187 - Display Source Documents set to: False
2023-10-10 23:29:56,315 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,718 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,922 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-10-10 23:29:56,931 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:29:56,935 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-10-10 23:29:56,938 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:29:57,183 - INFO - duckdb.py:460 - loaded in 6681 embeddings
2023-10-10 23:29:57,184 - INFO - duckdb.py:472 - loaded in 1 collections
2023-10-10 23:29:57,185 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:38 - Loading Model: meta-llama/Llama-2-13b-chat-hf, on: hpu
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:39 - This action can take a few minutes!
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2023-10-10 23:29:57,622] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[WARNING|utils.py:177] 2023-10-10 23:29:58,637 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.12.0.480 was found, this could lead to undefined behavior!
[WARNING|utils.py:190] 2023-10-10 23:29:59,786 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but the driver version is v1.12.0, this could lead to undefined behavior!
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 17427.86it/s]
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:02<00:00, 1.12it/s]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056447244 KB
------------------------------------------------------------------------------
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
2023-10-10 23:32:20,404 - INFO - run_localGPT.py:133 - Local LLM Loaded
Enter a query: what is the Article I ?
2023-10-23 19:47:28,598 - INFO - run_localGPT.py:240 - Query processing time: 1.5914111537858844s
> Question:
what is the Article I ?
> Answer:
It is the first article of the US constitution .
Enter a query: what does it say?
2023-10-23 19:47:36,684 - INFO - run_localGPT.py:240 - Query processing time: 1.872558546019718s
> Question:
what does it say?
> Answer:
The first article of the US constitution states "All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives."
Enter a query: What about article II
2023-10-23 20:34:42,818 - INFO - run_localGPT.py:240 - Query processing time: 1.6038263840600848s
> Question:
What about article II
> Answer:
Article II of the US constitution deals with the executive branch of government and establishes the office of the president.
Next Steps
You can add your own content to the SOURCE_DOCUMENTS folder to query and chat with it. You can also modify the `temperature` and `top_p` values in the run_localGPT.py file, line 84:
pipe = GaudiTextGenerationPipeline(model_name_or_path=model_id, max_new_tokens=100, temperature=0.5, top_p=0.5, repetition_penalty=1.15, use_kv_cache=True, do_sample=True)
Experiment with different values to get different outputs. Please also review the GaudiTextGenerationPipeline class in gaudi_utils/pipeline.py for information on tokenization and padding.
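As an illustration of the static-shapes idea (an illustrative sketch, not the tutorial’s actual code): padding every input to a fixed length lets the accelerator reuse one compiled computation graph instead of recompiling for each new sequence length.

import torch

def pad_to_static_shape(input_ids: torch.Tensor, max_len: int, pad_token_id: int) -> torch.Tensor:
    # Right-pad a batch of token ids to a fixed length so every
    # forward pass sees the same tensor shape
    batch, seq_len = input_ids.shape
    padded = torch.full((batch, max_len), pad_token_id, dtype=input_ids.dtype)
    padded[:, :seq_len] = input_ids
    return padded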
Copyright© 2023 Habana Labs, Ltd. an Intel Company.
Licensed under the Apache License, Version 2.0 (the “License”);
You may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.