This is an example conversation I had with GitHub Copilot about one of Anthropic's code files:

Code file:

(https://github.com/anthropics/anthropic-cookbook/blob/main/skills/classification/guide.ipynb)

Tutor session:

(https://github.com/copilot/share/02300086-43e0-8872-8040-0a0a40a22912)

The code uses VOYAGE_API_KEY to do RAG, and I want to learn more about the details. What is the input of the Voyage service? What is the output? Use this code as an example to teach me how to use the Voyage service in detail.

The code in the notebook provided outlines a method to classify customer support tickets using large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, it uses the VOYAGE_API_KEY for integrating VoyageAI services.

Using Voyage Service in Detail

Setup and Configuration:

  1. Install Required Packages:
    !pip install anthropic
    !pip install voyageai
    !pip install pandas
    !pip install numpy
    !pip install matplotlib
    !pip install -U scikit-learn
    
  2. Set Environment Variables:
    import os
    os.environ['VOYAGE_API_KEY'] = "YOUR_VOYAGE_API_KEY"
    os.environ['ANTHROPIC_API_KEY'] = "YOUR_ANTHROPIC_API_KEY"
    

Using VoyageAI for Embeddings:

  1. Initialization of the Vector Database:
    import json
    import os
    import pickle

    import numpy as np
    import voyageai
       
    class VectorDB:
        def __init__(self, api_key=None):
            if api_key is None:
                api_key = os.getenv("VOYAGE_API_KEY")
            self.client = voyageai.Client(api_key=api_key)
            self.embeddings = []
            self.metadata = []
            self.query_cache = {}
            self.db_path = "./data/vector_db.pkl"
       
        ...
    
  2. Embedding Data:
    def load_data(self, data):
        texts = [item["text"] for item in data]
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size],
                model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = [item for item in data]
        self.save_db()
    
  3. Searching Similar Embeddings:
    def search(self, query, k=5, similarity_threshold=0.75):
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1]
        top_examples = [
            {"metadata": self.metadata[idx], "similarity": similarities[idx]}
            for idx in top_indices if similarities[idx] >= similarity_threshold
        ][:k]
        return top_examples
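To make the batching in load_data concrete, here is a minimal sketch that only computes the slices and makes no API call. The helper name batch_texts and the 300 placeholder tickets are assumptions for illustration; the batch size of 128 is the value used in the notebook.

```python
def batch_texts(texts, batch_size=128):
    """Split a list of texts into consecutive batches of at most batch_size items."""
    return [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]


# 300 made-up tickets stand in for the training texts.
texts = [f"ticket {n}" for n in range(300)]
batches = batch_texts(texts)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [128, 128, 44]

# Flattening the batches restores the original order, which is what keeps
# self.embeddings aligned with self.metadata in load_data.
flattened = [text for batch in batches for text in batch]
```

Each batch would correspond to one client.embed call, so 300 texts would need three requests under these assumptions.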
    

Classification with RAG:

  1. Define the RAG Classification Function:
    def rag_classify(X):
        rag = vectordb.search(X, 5)
        rag_string = "".join(
            f"""
            <example>
                <query>
                    "{example["metadata"]["text"]}"
                </query>
                <label>
                    {example["metadata"]["label"]}
                </label>
            </example>
            """
            for example in rag
        )
        prompt = f"""
        You will classify a customer support ticket into one of the following categories:
        <categories>{categories}</categories>
       
        Here is the customer support ticket:
        <ticket>{X}</ticket>
       
        Use the following examples to help you classify the query:
        <examples>{rag_string}</examples>
       
        Respond with just the label of the category between category tags.
        """
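        # Note: client here is the Anthropic API client (not the Voyage client),
        # and categories is defined elsewhere in the notebook.
        response = client.messages.create(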
            messages=[{"role": "user", "content": prompt}, {"role": "assistant", "content": "<category>"}],
            stop_sequences=["</category>"],
            max_tokens=4096,
            temperature=0.0,
            model="claude-3-haiku-20240307"
        )
        result = response.content[0].text.strip()
        return result
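The example-formatting step of rag_classify can be tried without calling either API. The sketch below is only an illustration: format_examples is a hypothetical helper name, and the retrieved list is made-up data in the same shape that search returns.

```python
def format_examples(retrieved):
    """Render retrieved examples as <example> blocks, mirroring rag_classify."""
    return "".join(
        f"""
<example>
    <query>
        "{example['metadata']['text']}"
    </query>
    <label>
        {example['metadata']['label']}
    </label>
</example>
"""
        for example in retrieved
    )


# Made-up search results; real ones come from vectordb.search(X, 5).
retrieved = [
    {"metadata": {"text": "I was charged twice this month.", "label": "Billing"}, "similarity": 0.91},
    {"metadata": {"text": "Why did my premium go up?", "label": "Billing"}, "similarity": 0.84},
]

rag_string = format_examples(retrieved)
ticket = "I need help with my billing statement."
prompt = f"<ticket>{ticket}</ticket>\n<examples>{rag_string}</examples>"
print(prompt)
```

In the notebook's version, the assistant turn is prefilled with <category> and generation stops at </category>, so the returned text is expected to be just the label.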
    

Inputs and Outputs:

  • Input to Voyage Service:
    • List of texts (support tickets) for embeddings.
    • Query text for searching similar embeddings.
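  • Output from Voyage Service:
    • Embeddings for the input texts (one vector per text, in the same order).
  • Computed locally from those embeddings, not returned by Voyage:
    • Top similar examples with metadata based on the query (the search method ranks the stored embeddings with np.dot).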
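The shape of the embed call can be sketched with an offline stand-in. FakeVoyageClient and FakeEmbedResult are hypothetical classes written for this illustration; they only mimic the two things the notebook relies on, namely that embed takes a list of strings plus a model name and that the result exposes an .embeddings list. The toy vectors carry no semantic meaning.

```python
from dataclasses import dataclass


@dataclass
class FakeEmbedResult:
    """Stand-in for the object returned by client.embed; only .embeddings is modeled."""
    embeddings: list


class FakeVoyageClient:
    """Hypothetical offline stand-in for voyageai.Client, for illustration only."""

    def embed(self, texts, model):
        # Real embeddings come from the Voyage model. These toy 3-number vectors
        # are derived from simple character counts so the sketch runs offline.
        vectors = [
            [float(len(text)), float(text.count(" ")), float(text.count("e"))]
            for text in texts
        ]
        return FakeEmbedResult(embeddings=vectors)


client = FakeVoyageClient()
result = client.embed(["Where is my invoice?", "Reset my password"], model="voyage-2")

print(len(result.embeddings))  # 2, one vector per input text
print(result.embeddings[0])    # a list of floats
```

With the real client, the vectors are much longer, but the notebook code indexes them the same way, for example .embeddings[0] for a single query.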

Example Usage:

  1. Load Training Data:
    vectordb = VectorDB()
    vectordb.load_data(training_data)
    
  2. Classify a New Ticket:
    result = rag_classify("I need help with my billing statement.")
    print(result)
    

This provides a complete workflow to use the Voyage service for classification tasks leveraging embeddings and RAG techniques.

Did this code provide any clue about how to get the vector_db.pkl file?

Yes, the code in the guide.ipynb notebook does provide clues on how to get the vector_db.pkl file. Here’s a summary of the steps involved:

  1. Loading Data into Vector Database: The VectorDB class has a method load_data that takes training data as input and creates embeddings for the text data using the VoyageAI service. If the vector_db.pkl file exists, it loads the data from the file. Otherwise, it creates new embeddings and saves them to vector_db.pkl.

    class VectorDB:
        ...
    
        def load_data(self, data):
            # Check if the vector database is already loaded
            if self.embeddings and self.metadata:
                print("Vector database is already loaded. Skipping data loading.")
                return
               
            # Check if vector_db.pkl exists
            if os.path.exists(self.db_path):
                print("Loading vector database from disk.")
                self.load_db()
                return
    
            texts = [item["text"] for item in data]
            batch_size = 128
            result = [
                self.client.embed(
                    texts[i : i + batch_size],
                    model="voyage-2"
                ).embeddings
                for i in range(0, len(texts), batch_size)
            ]
    
            self.embeddings = [embedding for batch in result for embedding in batch]
            self.metadata = [item for item in data]
            self.save_db()
            print("Vector database loaded and saved.")
    
  2. Saving and Loading the Vector Database: The VectorDB class has methods save_db and load_db to save and load the vector database to/from the vector_db.pkl file.

    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)
    
    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_data to create a new database.")
           
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
           
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
    

Steps to Generate vector_db.pkl:

  1. Initialize the VectorDB:
    vectordb = VectorDB()
    
  2. Load Training Data: Assuming you have training data in the format required by the load_data method, you can load the data and create the embeddings.

    vectordb.load_data(training_data)
    

    This will create the embeddings and save them to vector_db.pkl if the file doesn’t already exist.

By following these steps, you can generate the vector_db.pkl file which contains the vector embeddings for the training data.
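As a rough illustration of what ends up in the file, the sketch below writes and reads a dictionary with the same three keys that save_db uses. The values are toy data, and a temporary directory replaces ./data so the example leaves nothing behind.

```python
import json
import os
import pickle
import tempfile

# Toy values in the same layout save_db writes; real embeddings are much longer.
payload = {
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadata": [{"text": "a", "label": "x"}, {"text": "b", "label": "y"}],
    "query_cache": json.dumps({}),  # the cache is stored as a JSON string
}

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "vector_db.pkl")
    with open(db_path, "wb") as file:
        pickle.dump(payload, file)
    with open(db_path, "rb") as file:
        restored = pickle.load(file)

restored_cache = json.loads(restored["query_cache"])
print(sorted(restored))  # ['embeddings', 'metadata', 'query_cache']
```

Note that save_db opens ./data/vector_db.pkl directly, so the ./data directory has to exist before load_data is called. Pickle files should also only be loaded from sources you trust.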

Does this code save vector_db.pkl after we call the load_data method on train.tsv, or is vector_db.pkl created anew each time the VectorDB class is instantiated?

The vector_db.pkl file is saved the first time load_data runs on the training data from train.tsv, and it is not recreated each time the VectorDB class is instantiated. Here’s how it works:

  1. Initialization: When the VectorDB class is instantiated, it creates the Voyage client, starts with empty embeddings, metadata, and query cache, and records the path ./data/vector_db.pkl. It does not read or write the file at this point; the check for an existing file happens later, in load_data.

    class VectorDB:
        def __init__(self, api_key=None):
            if api_key is None:
                api_key = os.getenv("VOYAGE_API_KEY")
            self.client = voyageai.Client(api_key=api_key)
            self.embeddings = []
            self.metadata = []
            self.query_cache = {}
            self.db_path = "./data/vector_db.pkl"
    
  2. Loading Data: The load_data method checks if the vector database is already loaded or if the vector_db.pkl file exists. If the file exists, it loads the data from the file. If not, it creates new embeddings and saves them to vector_db.pkl.

    def load_data(self, data):
        # Check if the vector database is already loaded
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return
    
        # Check if vector_db.pkl exists
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_db()
            return
    
        texts = [item["text"] for item in data]
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size],
                model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
    
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = [item for item in data]
        self.save_db()
        print("Vector database loaded and saved.")
    
  3. Saving and Loading the Vector Database: The save_db method saves the embeddings and metadata to vector_db.pkl, and the load_db method loads the data from this file if it exists.

    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)
    
    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError("Vector database file not found. Use load_data to create a new database.")
           
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
           
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
    

Summary:

  • The vector_db.pkl file is saved after the load_data method is called for train.tsv.
  • The file is not recreated each time the VectorDB class is instantiated; instead, load_data loads the embeddings from the existing file if it is available.
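The three outcomes of load_data can be observed with a small offline miniature. TinyVectorDB is a hypothetical class written for this sketch: it copies the control flow shown above but replaces the Voyage call with a fake embedder that counts how often it runs.

```python
import os
import pickle
import tempfile


class TinyVectorDB:
    """Hypothetical, offline miniature of the notebook's VectorDB caching logic."""

    def __init__(self, db_path):
        # Like the notebook's __init__, this only records state; no file access.
        self.db_path = db_path
        self.embeddings = []
        self.metadata = []
        self.embed_calls = 0  # counts how often embeddings are computed

    def _fake_embed(self, texts):
        self.embed_calls += 1
        return [[float(len(text))] for text in texts]

    def load_data(self, data):
        if self.embeddings and self.metadata:
            return "already in memory"
        if os.path.exists(self.db_path):
            with open(self.db_path, "rb") as file:
                saved = pickle.load(file)
            self.embeddings, self.metadata = saved["embeddings"], saved["metadata"]
            return "loaded from disk"
        self.embeddings = self._fake_embed([item["text"] for item in data])
        self.metadata = list(data)
        with open(self.db_path, "wb") as file:
            pickle.dump({"embeddings": self.embeddings, "metadata": self.metadata}, file)
        return "embedded and saved"


data = [{"text": "hello", "label": "greeting"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vector_db.pkl")
    first = TinyVectorDB(path)
    outcomes = [first.load_data(data), first.load_data(data)]
    second = TinyVectorDB(path)  # fresh instance pointing at the same file
    empty_before_load = second.embeddings == []
    outcomes.append(second.load_data(data))
    second_embed_calls = second.embed_calls

print(outcomes)  # ['embedded and saved', 'already in memory', 'loaded from disk']
```

One consequence of this flow is that an existing file is reused even if the training data has changed, so deleting vector_db.pkl appears to be the way to force new embeddings.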

I saw these two lines in the code:

vectordb = VectorDB()

vectordb.load_data(data["train"])

Does it mean that the input of vectordb.load_data is not train.tsv itself, but some other format of /data/train.tsv?

Yes, you’re correct. The input to vectordb.load_data is not the raw train.tsv file but rather a processed form of the data from train.tsv.

The code reads the train.tsv file into a pandas DataFrame and then converts it into a list of dictionaries, where each dictionary represents a training example with its text and label. This processed data is then passed to vectordb.load_data.

Here is how the process works:

  1. Reading the TSV File:
    train_df = pd.read_csv("./data/train.tsv", sep='\t')
    
  2. Converting DataFrame to List of Dictionaries:
    def dataframe_to_dict_list(df):
        return df.apply(lambda x: {'text': x['text'], 'label': x['label']}, axis=1).tolist()
    
    data['train'] = dataframe_to_dict_list(train_df)
    
  3. Loading Data into VectorDB:
    vectordb = VectorDB()
    vectordb.load_data(data['train'])
    

So, the input to vectordb.load_data is a list of dictionaries with each dictionary containing the text and label from the training data, not the raw train.tsv file.

Here’s a more detailed look at the relevant code snippets:

import pandas as pd

# Read the TSV file into a DataFrame
train_df = pd.read_csv("./data/train.tsv", sep='\t')

# Helper function to convert a DataFrame to a list of dictionaries
def dataframe_to_dict_list(df):
    return df.apply(lambda x: {'text': x['text'], 'label': x['label']}, axis=1).tolist()

# Convert the DataFrame to a list of dictionaries
# (data is assumed here to be a plain dict keyed by split name)
data = {}
data['train'] = dataframe_to_dict_list(train_df)

# Initialize the VectorDB
vectordb = VectorDB()

# Load the processed training data into the VectorDB
vectordb.load_data(data['train'])

This ensures that the load_data method receives the data in the expected format for creating embeddings and storing them in the vector database.
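To see the expected input shape without pandas or the real file, the same conversion can be sketched with the standard library. The two rows below are made up, and csv.DictReader is a substitute used for illustration, not what the notebook uses.

```python
import csv
import io

# Two made-up rows standing in for ./data/train.tsv.
tsv_text = "text\tlabel\nI was charged twice\tBilling\nHow do I file a claim?\tClaims\n"

# Each row becomes a dict with the two keys that load_data reads.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
train = [{"text": row["text"], "label": row["label"]} for row in reader]

print(train[0])  # {'text': 'I was charged twice', 'label': 'Billing'}
```

load_data reads item["text"] for the embedding and keeps the whole dictionary as metadata, which is why the label is available again at search time.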

rag = vectordb.search(X,5)

I know the X in the above line is one line in the text column of test.tsv. What does the 5 mean in the above line?

In the line rag = vectordb.search(X, 5), the 5 is a parameter that specifies the number of top examples to retrieve from the vector database. Here is the breakdown of the parameters in the search method call:

  • X: The query text for which you want to find similar examples.
  • 5: The maximum number of similar examples (top-k) to retrieve from the vector database.

Explanation of search Method:

def search(self, query, k=5, similarity_threshold=0.75):
    query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
    similarities = np.dot(self.embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1]
    top_examples = [
        {"metadata": self.metadata[idx], "similarity": similarities[idx]}
        for idx in top_indices if similarities[idx] >= similarity_threshold
    ][:k]
    return top_examples

In the search method:

  • query: The input text for which you are searching similar examples.
  • k: The maximum number of similar examples to return. In this case, 5 means the method returns at most the 5 most similar examples.
  • similarity_threshold: A threshold that filters out examples whose similarity score is below a minimum (default is 0.75). Because this filter is applied before the list is cut to k, fewer than k examples can come back.

So, when you call vectordb.search(X, 5), it retrieves up to 5 of the most similar examples to the query text X from the vector database, all with a similarity of at least 0.75.
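The interaction between k and similarity_threshold can be checked with a few made-up scores. This is a plain-Python sketch of the ranking step only; top_k is a hypothetical helper, and the scores stand in for the np.dot results in search.

```python
def top_k(similarities, k=5, similarity_threshold=0.75):
    """Return (index, score) pairs, best first, keeping at most k scores >= threshold."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return [
        (i, similarities[i]) for i in ranked if similarities[i] >= similarity_threshold
    ][:k]


# Made-up similarity scores for six stored examples.
scores = [0.62, 0.91, 0.78, 0.40, 0.83, 0.70]

print(top_k(scores, k=5))  # [(1, 0.91), (4, 0.83), (2, 0.78)]
print(top_k(scores, k=2))  # [(1, 0.91), (4, 0.83)]
```

With k=5 only three examples come back, because only three scores reach the 0.75 threshold; with k=2 the list is cut to the two best.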

