Large Language Models (LLMs) Retrieval Augmented Generation (RAG) using Charmed OpenSearch

11 March 2024 at 07:00

Improving the relevance of your LLM application by leveraging Charmed OpenSearch’s vector database

Large Language Models (LLMs) fall under the category of Generative AI (GenAI), a type of artificial intelligence that produces content based on user-defined context. These models are trained on extensive datasets composed of trillions of combinations of words from natural language, enabling them to power interactive and conversational applications across various scenarios.

Renowned LLMs like GPT, BERT, PaLM, and LLaMa can experience performance improvements by gaining access to additional structured and unstructured data. This additional data may include public or internal documents, websites, and various text forms and content. This methodology, termed retrieval-augmented generation (RAG), ensures that your conversational application generates accurate results with contextual relevance and domain-specific knowledge, even in areas where the pertinent facts were not part of the initial training dataset.

RAG can drastically improve the accuracy of an LLM’s responses. See the example below:

“What is PRO?” response without RAG

Pro is a subscription-based service that offers additional features and functionality to users. For example, Pro users can access exclusive content, receive priority customer support, and more. To become a Pro user, you can sign up for a Pro subscription on our website. Once you have signed up, you can access all of the Pro features and benefits.

“What is PRO?” response with RAG

Ubuntu Pro is an additional stream of security updates and packages that meet compliance requirements, such as FIPS or HIPAA, on top of an Ubuntu LTS. It provides an SLA for security fixes for the entire distribution (‘main and universe’ packages) for ten years, with extensions for industrial use cases. Ubuntu Pro is free for personal use, offering the full suite of Ubuntu Pro capabilities on up to 5 machines.

This article guides you on leveraging Charmed OpenSearch to maintain a relevant and up-to-date LLM application.

What is OpenSearch?

OpenSearch is an open-source search and analytics engine. Users can extend the functionality of OpenSearch with a selection of plugins that enhance search, security, performance analysis, machine learning, and more. This previous article we wrote provides additional details on the comprehensive features of OpenSearch. We discussed the capability of enabling enterprise-grade solutions through Charmed OpenSearch. This blog will emphasise a specific feature pertinent to RAG: utilising OpenSearch as a vector database.

What is a vector database?

Vector databases allow you to store and index, for example, text documents, rich media, audio, geospatial coordinates, tables, and graphs into vectors. These vectors represent points in N-dimensional spaces, effectively encapsulating the context of an asset. Search tools can look into these spaces using low-latency queries to find similar assets in neighbouring data points. These search tools typically do this by exploiting the efficiency of different methods for obtaining, for example, the k-nearest neighbours (k-NN) from an index of vectors.
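
To make the idea of "neighbouring data points" concrete, here is a minimal sketch of an exact (brute-force) k-NN lookup using cosine similarity. The document names and the 3-dimensional vectors are made up for illustration, and a real vector database would use an approximate index rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def k_nearest(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical); real models emit hundreds of dimensions.
toy_index = {
    "ubuntu-pro-faq": [0.9, 0.1, 0.0],
    "kubeflow-docs": [0.1, 0.9, 0.1],
    "opensearch-docs": [0.7, 0.2, 0.3],
}

nearest = k_nearest([1.0, 0.0, 0.1], toy_index, k=2)
print([name for name, _ in nearest])
```

The scan above costs one similarity computation per stored vector, which is why search engines rely on approximate k-NN indexes once the collection grows.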

In particular, OpenSearch enables this feature with the k-NN plugin and augments this functionality by providing your conversational applications with other essential features, such as fault tolerance, resource access controls, and a powerful query engine.

Using the OpenSearch k-NN plugin for RAG

In this section, we provide a practical example of using Charmed OpenSearch as the retrieval tool in the RAG process. The experiment uses a Jupyter notebook running on top of Charmed Kubeflow to run inference on an LLM.

1. Deploy Charmed OpenSearch and enable the k-NN plugin. The Charmed OpenSearch tutorial is a good starting point. Once deployed, verify that the plugin is enabled (it is by default):

$ juju config opensearch plugin_opensearch_knn
true

2. Get your credentials. The easiest way to create and retrieve your first administrator credentials is to add a relation between Charmed OpenSearch and the Data Integrator Charm, which is also part of the tutorial.

3. Create a vector index for your documents. Now we can create a k-NN index for your additional documents, encoded into the knn_vector data type. For simplicity, we will use the opensearch-py client.

from opensearchpy import OpenSearch

os_host = "10.56.118.209"
os_port = 9200
os_url = "https://10.56.118.209:9200"
os_auth = ("opensearch-client_7","sqlKjlEK7ldsBxqsOHNcFoSXayDudf30")

os_client = OpenSearch(
    hosts = [{'host': os_host, 'port': os_port}],
    http_compress = True, 
    http_auth = os_auth,
    use_ssl = True,
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False
)

os_index_name = "rag-index"

settings = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil"
        }
    }
}

os_client.indices.create(index=os_index_name, body=settings)

properties={
    "properties": {
        "vector_field": {
            "type": "knn_vector",
            "dimension": 384
        },
        "text": {
            "type": "keyword"
        }
    }
}

os_client.indices.put_mapping(index=os_index_name, body=properties)

4. Aggregate source documents. In this example, we will select a list of web content that we want our application to use as relevant information to provide accurate answers:

content_links = [
    "https://discourse.ubuntu.com/t/ubuntu-pro-faq/34042"
]

5. Load document contents into memory and split the content into chunks. Splitting the documents into chunks prepares them for the next step, where we create embeddings and upload them to the index we created.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(content_links)
htmls = loader.load()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=500, 
    chunk_overlap=0,
    separator="\n")
docs = text_splitter.split_documents(htmls)
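
To illustrate what separator-based chunking does, here is a simplified sketch in plain Python. It is not LangChain's implementation (which also handles overlap and other details); the function name `split_on_separator` and the sample text are assumptions made for this example.

```python
def split_on_separator(text, chunk_size=500, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the chunk (a single oversized piece is kept whole).
            current = candidate
        else:
            # The next piece would overflow the chunk, so close it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "alpha\nbeta\ngamma\ndelta"
print(split_on_separator(sample, chunk_size=11))
```

With a chunk size of 11 characters, the four lines are packed two per chunk. Smaller chunks give more precise retrieval targets, while larger chunks keep more surrounding context together.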

6. Create embeddings for text chunks and store embeddings in the vector index. The embedding model encodes each chunk as a vector, and LangChain uploads the vectors to the index we created.

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L12-v2",
            encode_kwargs={'normalize_embeddings': False})


from langchain.vectorstores import OpenSearchVectorSearch

docsearch = OpenSearchVectorSearch.from_documents(docs, embeddings,
                                    ef_construction=256,
                                    engine="faiss",
                                    space_type="innerproduct",
                                    m=48, opensearch_url=os_url,
                                    index_name=os_index_name,
                                    http_auth=os_auth,
                                    verify_certs=False)
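
The snippets in this walkthrough mention two space types, "cosinesimil" and "innerproduct", and create the embeddings with normalize_embeddings set to False. As a small sketch of how the two metrics relate (using made-up 2-dimensional vectors), the inner product of two vectors equals their cosine similarity only once both vectors are scaled to unit length:

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: inner product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]           # hypothetical embeddings
raw_inner = dot(a, b)                    # depends on vector length
normalized_inner = dot(l2_normalize(a), l2_normalize(b))

print(raw_inner, normalized_inner, cosine(a, b))
```

If you want inner-product scores to behave like cosine similarity, normalising the embeddings before indexing is one way to get there. Whether that is needed depends on the embedding model and the space type you settle on.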

7. Use the similarity search to retrieve the documents that provide context to your query. The search engine performs an approximate k-NN search (scored, for example, with cosine similarity) and returns the documents most relevant to your question.

query = """
  What is Pro?
"""

similar_docs = docsearch.similarity_search(query, k=2, 
                                    raw_response=True, 
                                    search_type="approximate_search",
                                    space_type="cosinesimil")

8. Prepare your LLM. We used a simple example with a Hugging Face pipeline to load an LLM.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline

model_name="TheBloke/Llama-2-7B-Chat-GPTQ"


model = AutoModelForCausalLM.from_pretrained(
            model_name,
            cache_dir="model",
            device_map='auto'
        )

tokenizer = AutoTokenizer.from_pretrained(model_name,cache_dir="llm/tokenizer")

pl = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_length=2048,
        )

llm = HuggingFacePipeline(pipeline=pl)

9. Create a prompt template. It will define the expectations of the response and specify that we will provide context for an accurate answer.

from langchain import PromptTemplate

question_prompt_template = """
    You are a friendly chatbot assistant that responds in a conversational manner to user's questions. 
    Respond in short but complete answers unless specifically asked by the user to elaborate on something. 
    Use History and Context to inform your answers.

Context:
---------
{context}
---------
Question: {question}
Helpful Answer:"""

QUESTION_PROMPT = PromptTemplate(
    template=question_prompt_template, input_variables=["context", "question"]
)

10. Run inference with the LLM to answer your question, using the context documents retrieved from OpenSearch.

from langchain.chains.question_answering import load_qa_chain

question = "What is Pro?"

chain = load_qa_chain(llm, chain_type="stuff", prompt=QUESTION_PROMPT)
chain.run(input_documents=similar_docs, question=question)
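
The "stuff" chain type places all retrieved documents into a single prompt. The sketch below mimics that idea in plain Python so you can see what the model ends up receiving. The shortened template, the helper name `build_stuffed_prompt`, the blank-line separator, and the two sample chunks are assumptions for illustration rather than LangChain internals.

```python
PROMPT_TEMPLATE = """Context:
---------
{context}
---------
Question: {question}
Helpful Answer:"""

def build_stuffed_prompt(chunks, question, template=PROMPT_TEMPLATE):
    """Join every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)  # separator chosen for readability (assumption)
    return template.format(context=context, question=question)

# Stand-ins for the text of the documents returned by the similarity search.
retrieved_chunks = [
    "Ubuntu Pro is an additional stream of security updates.",
    "Ubuntu Pro is free for personal use on up to 5 machines.",
]

prompt = build_stuffed_prompt(retrieved_chunks, "What is Pro?")
print(prompt)
```

Because every chunk is copied into the prompt, the number of retrieved documents (k) and the chunk size together determine how much of the model's context window the retrieval step consumes.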

Conclusion

Retrieval-augmented generation (RAG) is a method that enables users to converse with data repositories. It’s a tool that can revolutionise how you access and utilise data, as we showed in our tutorial. With RAG, you can improve data retrieval, enhance knowledge sharing, and enrich the results of your LLMs to give more contextually relevant, insightful responses that better reflect the most up-to-date information in your organisation.

The benefits of better LLMs that can access your knowledge base are as obvious as they are alluring: you gain better customer support, employee training and developer productivity. On top of that, you ensure that your teams get LLM answers and results that reflect accurate, up-to-date policy and information rather than generalised or even outright useless answers.

As we showed, Charmed OpenSearch is a simple and robust technology that can enable RAG capabilities. With it (and our helpful tutorial), any business can leverage RAG to transform their technical or policy manuals and logs into comprehensive knowledge bases.

Enterprise-grade and fully supported OpenSearch solution

Charmed OpenSearch is available for the open-source community. Canonical’s team of experts can help you get started with it as the vector database to leverage the power of the k-NN search for your LLM applications at any scale. Contact Canonical if you have questions. 

Watch the webinar: Future-proof AI applications with OpenSearch as a vector database

Generative AI on a GPU-Instance with Ubuntu on AWS: Part 1 – Image Generation

2 February 2024 at 21:16

We recently published a technical document showing how to install NVIDIA drivers on a G4DN instance on AWS, where we covered not only how to install the NVIDIA GPU drivers but also how to make sure to get CUDA working for any ML work. 

In this document we are going to run one of the most used generative AI models, Stable Diffusion, on Ubuntu on AWS for research and development purposes.

According to AWS, “G4dn instances, powered by NVIDIA T4 GPUs, are the lowest cost GPU-based instances in the cloud for machine learning inference and small scale training. (…) optimized for applications using NVIDIA libraries such as CUDA, CuDNN, and NVENC.”

G4DN instances come in different configurations:

Instance type    vCPUs    RAM (GB)    GPUs
g4dn.xlarge      4        16          1
g4dn.2xlarge     8        32          1
g4dn.4xlarge     16       64          1
g4dn.8xlarge     32       128         1
g4dn.12xlarge    48       192         4
g4dn.16xlarge    64       256         1
g4dn.metal       96       384         8

For this exercise, we will be using the g4dn.xlarge instance, since we need only 1 GPU, and with 4 vCPUs and 16GB of RAM, it will provide sufficient resources for our needs, as the GPU will handle most of the workload. 
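
If you want to repeat this sizing decision for a different workload, the selection can be sketched as a small lookup over the table above. The helper name `smallest_fit` and the "smallest by vCPUs, then RAM, then GPUs" ordering are assumptions for this example; pricing is not considered.

```python
# (name, vCPUs, RAM in GB, GPUs), copied from the table above.
G4DN_TYPES = [
    ("g4dn.xlarge", 4, 16, 1),
    ("g4dn.2xlarge", 8, 32, 1),
    ("g4dn.4xlarge", 16, 64, 1),
    ("g4dn.8xlarge", 32, 128, 1),
    ("g4dn.12xlarge", 48, 192, 4),
    ("g4dn.16xlarge", 64, 256, 1),
    ("g4dn.metal", 96, 384, 8),
]

def smallest_fit(min_vcpus, min_ram_gb, min_gpus):
    """Return the smallest instance type meeting all minimums, or None if none does."""
    candidates = [
        t for t in G4DN_TYPES
        if t[1] >= min_vcpus and t[2] >= min_ram_gb and t[3] >= min_gpus
    ]
    if not candidates:
        return None
    # "Smallest" here means fewest vCPUs, then least RAM, then fewest GPUs.
    return min(candidates, key=lambda t: (t[1], t[2], t[3]))[0]

print(smallest_fit(min_vcpus=4, min_ram_gb=16, min_gpus=1))
```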

Image generation with Stable Diffusion

Stable Diffusion is a deep learning model released in 2022 that has been trained to transform text into images using latent diffusion techniques. Developed by Stability.AI, this groundbreaking technology not only provides open-source access to its trained weights but also has the ability to run on any GPU with just 4GB of RAM, making it one of the most used Generative AI models for image generation.

In addition to its primary function of text-to-image generation, Stable Diffusion can also be used for tasks such as image retouching and video generation. The license for Stable Diffusion permits both commercial and non-commercial use, making it a versatile tool for various applications.

Requirements

You’ll need SSH access. If you are running Ubuntu or any other Linux distribution, open a terminal and use the ssh command. If you are running Windows, you will need either WSL (to run a Linux shell inside Windows) or an external SSH client such as PuTTY to connect to the machine.

Make sure you have NVIDIA Drivers and CUDA installed on your G4DN machine. Test with the following command:

nvidia-smi

The output should list the driver and CUDA versions.
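
If you prefer to run this check from a script (for example, before launching a long job), the following sketch wraps the nvidia-smi call in Python. The function name `nvidia_driver_version` is hypothetical, and the function simply returns None when the tool is missing or fails.

```python
import shutil
import subprocess

def nvidia_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on PATH, so the driver is probably missing
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None  # the tool exists but could not talk to the driver
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None

version = nvidia_driver_version()
print(version if version else "nvidia-smi not available")
```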

Let’s get started!

Step 1: Create a Python virtual environment:

First, we need to download some libraries and dependencies as shown below:

sudo apt-get install -y python3.10-venv
sudo apt-get install ffmpeg libsm6 libxext6 -y

Now we can create the Python environment.

python3 -m venv myvirtualenv

And finally, we need to activate it. Please note that every time we log in to the machine, we will need to reactivate it with the following line:

source myvirtualenv/bin/activate

Step 2: Download the web GUI and get a model.

To interact with the model easily, we are going to clone the Stable Diffusion WebUI from AUTOMATIC1111.

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

After cloning the repository, we can move on to the interesting part: choosing and downloading a Stable Diffusion model from the web. There are many versions and variants that can make the journey more complicated but more interesting as a learning experience. As you delve deeper, you will find that sometimes you need specific versions, fine-tuned or specialized releases for your purpose.

This is where Hugging Face is great, as they host a plethora of models and checkpoint versions that you can download. Please be mindful of the license of each model you will be using.

Go to Hugging Face, click on models, and start searching for “Stable Diffusion”. For this exercise, we will use version 1.5 from runwayml.

Go to the “Files and versions” tab and scroll down to the actual checkpoint files.

Copy the link and go back to your SSH session. We will download the model using wget:

cd ~/stable-diffusion-webui/models/Stable-diffusion
wget https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned.safetensors

Now that the model is installed, we can run the script that will bootstrap everything and run the Web GUI.

Step 3: Run the WebUI securely and serve the model

Now that we have everything in place, we will run the WebUI and serve the model.

Just as a side note, since we are not installing this on a local desktop, we cannot just open the browser and enter the URL. This URL will only respond locally because of security constraints (in other words, it is not wise to open development environments to the public). Therefore, we are going to create an SSH tunnel.

Exit the SSH session.

If you are running on Linux (or Linux under WSL on Windows), you can create the tunnel using SSH by running the following command:

ssh -L 7860:localhost:7860 -i myKeyPair.pem ubuntu@<the_machine's_external_IP>

In case you are running on Windows and can’t use WSL, follow these instructions to connect via PuTTY.

Once the WebUI is running (see the commands below), we will be able to access its local URL from the browser on our local desktop. The entire connection will be tunneled and encrypted via SSH.

In your new SSH session, enter the following commands to run the WebUI.

cd ~/stable-diffusion-webui
./webui.sh

The first time will take a while as it will install PyTorch and all the required dependencies. After it finishes, it will give you the following local URL:

http://127.0.0.1:7860

So open your local browser and go to the following URL: http://127.0.0.1:7860
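
If the page does not load, a quick way to tell whether the SSH tunnel and the WebUI are both up is to test the forwarded port from your local machine. This is a minimal sketch using only the Python standard library; the helper name `port_is_open` is an assumption, and the host and port match the tunnel command used earlier.

```python
import socket

def port_is_open(host="127.0.0.1", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    if port_is_open():
        print("Port 7860 is reachable: open http://127.0.0.1:7860 in your browser")
    else:
        print("Port 7860 is closed: check the SSH tunnel and that webui.sh is running")
```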

We are ready to start playing. 

We tested our first prompt with all the default values, and this is what we got. Quite impressive, right?

Now you are ready to start generating!

Final thoughts

I hope this guide has been helpful in deploying the Stable Diffusion model on your own instance and has also provided you with a better understanding of how these models work and what can be achieved with generative AI. It is clear that generative AI is a powerful tool for businesses today. 

In our next post, we will explore how to deploy and self-host a Large Language Model, another groundbreaking AI tool. 

Remember, if you are looking to create a production-ready solution, there are several options available to assist you. From a security perspective, Ubuntu Pro offers support for your open source supply chain, while Charmed Kubeflow provides a comprehensive stack of services for all your machine learning needs. Additionally, AWS offers Amazon Bedrock, which simplifies the complexities involved and allows you to access these services through an API. 

Thank you for reading and stay tuned for more exciting AI content!
