Let GitHub Copilot Teach Me Code That I Am Not Familiar With
This is an example conversation I had with GitHub Copilot about one of Anthropic's code files:
Code file:
(https://github.com/anthropics/anthropic-cookbook/blob/main/skills/classification/guide.ipynb)
Tutor session:
(https://github.com/copilot/share/02300086-43e0-8872-8040-0a0a40a22912)
The code uses VOYAGE_API_KEY to do RAG, and I want to learn more about the details. What is the input of the VOYAGE service? What is the output? Use this code as an example to teach me how to use the VOYAGE service in detail.
The code in the provided notebook outlines a method to classify customer support tickets using large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, it uses the `VOYAGE_API_KEY` to integrate the VoyageAI embedding service.
Using the Voyage Service in Detail
Setup and Configuration:
- Install Required Packages:
```
!pip install anthropic
!pip install voyageai
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install -U scikit-learn
```
- Set Environment Variables:
```python
import os

os.environ['VOYAGE_API_KEY'] = "YOUR_VOYAGE_API_KEY"
os.environ['ANTHROPIC_API_KEY'] = "YOUR_ANTHROPIC_API_KEY"
```
Using VoyageAI for Embeddings:
- Initialization of the Vector Database:
```python
import voyageai

class VectorDB:
    def __init__(self, api_key=None):
        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
        self.client = voyageai.Client(api_key=api_key)
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = "./data/vector_db.pkl"
        ...
```
- Embedding Data:
```python
    def load_data(self, data):
        texts = [item["text"] for item in data]
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size], model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = [item for item in data]
        self.save_db()
```
- Searching Similar Embeddings:
```python
    def search(self, query, k=5, similarity_threshold=0.75):
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1]
        top_examples = [
            {"metadata": self.metadata[idx], "similarity": similarities[idx]}
            for idx in top_indices
            if similarities[idx] >= similarity_threshold
        ][:k]
        return top_examples
```
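A note on how this search ranks results: `np.dot` between the stored embeddings and the query embedding behaves as cosine similarity only when the vectors are unit length (Voyage's documentation states its embeddings are length-normalized, so that holds here). The minimal sketch below uses made-up vectors just to show the ranking logic in isolation:

```python
import numpy as np

# Made-up unit-length vectors standing in for stored document embeddings (one row each).
embeddings = np.array([
    [0.6, 0.8, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
])
query_embedding = np.array([0.6, 0.8, 0.0])

# For unit-length vectors, the dot product equals the cosine similarity.
similarities = np.dot(embeddings, query_embedding)   # -> [1.0, 0.6, 0.48]
top_indices = np.argsort(similarities)[::-1]         # indices sorted by similarity, descending
print(top_indices, similarities[top_indices])        # [0 1 2] [1.   0.6  0.48]
```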
Classification with RAG:
- Define the RAG Classification Function:
```python
def rag_classify(X):
    rag = vectordb.search(X, 5)
    rag_string = "".join(
        f"""
        <example>
        <query>
        "{example["metadata"]["text"]}"
        </query>
        <label>
        {example["metadata"]["label"]}
        </label>
        </example>
        """
        for example in rag
    )
    prompt = f"""
    You will classify a customer support ticket into one of the following categories:
    <categories>{categories}</categories>

    Here is the customer support ticket:
    <ticket>{X}</ticket>

    Use the following examples to help you classify the query:
    <examples>{rag_string}</examples>

    Respond with just the label of the category between category tags.
    """
    response = client.messages.create(
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "<category>"},
        ],
        stop_sequences=["</category>"],
        max_tokens=4096,
        temperature=0.0,
        model="claude-3-haiku-20240307",
    )
    result = response.content[0].text.strip()
    return result
```
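One detail worth calling out in `rag_classify` is the prompting pattern: prefilling the assistant turn with `<category>` and stopping at `</category>` makes Claude return only the label. Below is a stripped-down sketch of just that pattern, assuming `client` is an `anthropic.Anthropic()` instance as in the notebook; the one-off prompt here is a hypothetical stand-in for the full RAG prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=50,
    temperature=0.0,
    messages=[
        {"role": "user", "content": "Classify this ticket: 'I need help with my billing statement.' "
                                     "Respond with the category between category tags."},
        # Prefilling the assistant turn forces the reply to continue right after "<category>".
        {"role": "assistant", "content": "<category>"},
    ],
    # Generation stops before emitting the closing tag, so the text is just the label.
    stop_sequences=["</category>"],
)
print(response.content[0].text.strip())
```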
Inputs and Outputs:
- Input to Voyage Service:
- List of texts (support tickets) for embeddings.
- Query text for searching similar embeddings.
- Output from Voyage Service:
- Embeddings for the input texts.
- Top similar examples with metadata based on the query.
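To make these inputs and outputs concrete, here is a minimal sketch of calling the Voyage embedding endpoint directly, assuming `VOYAGE_API_KEY` is set; the dimensionality comment reflects what Voyage documents for `voyage-2` (1024), while the actual float values will of course vary:

```python
import os
import voyageai

client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))

# Input: a list of texts. Output: one embedding (a list of floats) per input text.
result = client.embed(
    ["I need help with my billing statement.", "My order arrived damaged."],
    model="voyage-2",
)
print(len(result.embeddings))     # 2  -> one vector per input text
print(len(result.embeddings[0]))  # embedding dimension (1024 for voyage-2)
```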
Example Usage:
- Load Training Data:
```python
vectordb = VectorDB()
vectordb.load_data(training_data)
```
- Classify a New Ticket:
```python
result = rag_classify("I need help with my billing statement.")
print(result)
```
This provides a complete workflow for using the Voyage service in classification tasks that leverage embeddings and RAG.
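As a next step, one could evaluate the whole pipeline on held-out tickets. The sketch below is illustrative rather than code from the guide; it assumes `test.tsv` has the same `text`/`label` columns as `train.tsv`:

```python
import pandas as pd

test_df = pd.read_csv("./data/test.tsv", sep="\t")

# Classify a small sample of test tickets and compare against the gold labels.
sample = test_df.head(10)
predictions = [rag_classify(text) for text in sample["text"]]
accuracy = (pd.Series(predictions).values == sample["label"].values).mean()
print(f"Accuracy on {len(sample)} sampled tickets: {accuracy:.2%}")
```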
Did this code provide any clue about how to get the vector_db.pkl file?
Yes, the code in the `guide.ipynb` notebook does provide clues on how to get the `vector_db.pkl` file. Here’s a summary of the steps involved:
- Loading Data into the Vector Database: The `VectorDB` class has a method `load_data` that takes training data as input and creates embeddings for the text data using the VoyageAI service. If the `vector_db.pkl` file exists, it loads the data from the file. Otherwise, it creates new embeddings and saves them to `vector_db.pkl`.

```python
class VectorDB:
    ...
    def load_data(self, data):
        # Check if the vector database is already loaded
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return

        # Check if vector_db.pkl exists
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_db()
            return

        texts = [item["text"] for item in data]
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size], model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = [item for item in data]
        self.save_db()
        print("Vector database loaded and saved.")
```
- Saving and Loading the Vector Database: The `VectorDB` class has methods `save_db` and `load_db` to save and load the vector database to/from the `vector_db.pkl` file.

```python
    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError(
                "Vector database file not found. Use load_data to create a new database."
            )
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
```
Steps to Generate `vector_db.pkl`:
- Initialize the VectorDB:
```python
vectordb = VectorDB()
```
- Load Training Data: Assuming you have training data in the format required by the `load_data` method, you can load the data and create the embeddings.

```python
vectordb.load_data(training_data)
```

This will create the embeddings and save them to `vector_db.pkl` if the file doesn’t already exist.
By following these steps, you can generate the `vector_db.pkl` file, which contains the vector embeddings for the training data.
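If you later change the training data and want a fresh database, a simple approach (not shown in the notebook, just a small sketch) is to delete the cached file before calling `load_data` again:

```python
import os

# Remove the cached database so load_data re-embeds the training data from scratch.
if os.path.exists("./data/vector_db.pkl"):
    os.remove("./data/vector_db.pkl")

vectordb = VectorDB()
vectordb.load_data(training_data)  # recreates the embeddings and writes a new vector_db.pkl
```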
Did this code save vector_db.pkl after we used the load_data method on train.tsv, or is vector_db.pkl created anew each time the VectorDB class is called?
The `vector_db.pkl` file is saved after the `load_data` method is used on the data from `train.tsv`; it is not recreated each time the `VectorDB` class is instantiated. Here’s how it works:
- Initialization: When the `VectorDB` class is instantiated, it only sets up the VoyageAI client, empty containers for the embeddings and metadata, and the path to `vector_db.pkl`; the check for an existing file happens later, in `load_data`.

```python
class VectorDB:
    def __init__(self, api_key=None):
        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
        self.client = voyageai.Client(api_key=api_key)
        self.embeddings = []
        self.metadata = []
        self.query_cache = {}
        self.db_path = "./data/vector_db.pkl"
```
- Loading Data: The `load_data` method checks whether the vector database is already loaded in memory or whether the `vector_db.pkl` file exists on disk. If the file exists, it loads the data from the file. If not, it creates new embeddings and saves them to `vector_db.pkl`.

```python
    def load_data(self, data):
        # Check if the vector database is already loaded
        if self.embeddings and self.metadata:
            print("Vector database is already loaded. Skipping data loading.")
            return

        # Check if vector_db.pkl exists
        if os.path.exists(self.db_path):
            print("Loading vector database from disk.")
            self.load_db()
            return

        texts = [item["text"] for item in data]
        batch_size = 128
        result = [
            self.client.embed(
                texts[i : i + batch_size], model="voyage-2"
            ).embeddings
            for i in range(0, len(texts), batch_size)
        ]
        self.embeddings = [embedding for batch in result for embedding in batch]
        self.metadata = [item for item in data]
        self.save_db()
        print("Vector database loaded and saved.")
```
- Saving and Loading the Vector Database: The `save_db` method saves the embeddings and metadata to `vector_db.pkl`, and the `load_db` method loads the data from this file if it exists.

```python
    def save_db(self):
        data = {
            "embeddings": self.embeddings,
            "metadata": self.metadata,
            "query_cache": json.dumps(self.query_cache),
        }
        with open(self.db_path, "wb") as file:
            pickle.dump(data, file)

    def load_db(self):
        if not os.path.exists(self.db_path):
            raise ValueError(
                "Vector database file not found. Use load_data to create a new database."
            )
        with open(self.db_path, "rb") as file:
            data = pickle.load(file)
        self.embeddings = data["embeddings"]
        self.metadata = data["metadata"]
        self.query_cache = json.loads(data["query_cache"])
```
Summary:
- The `vector_db.pkl` file is saved after the `load_data` method is called on the data from `train.tsv`.
- The file is not recreated each time the `VectorDB` class is instantiated; instead, the next call to `load_data` loads it from the existing file if it is available.
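You can see this caching behavior by running the loading step twice; the printed messages come from the `load_data` logic shown above (this is a small illustrative sketch, not a cell from the notebook):

```python
# First run: no vector_db.pkl exists yet, so embeddings are computed via the Voyage API and saved.
vectordb = VectorDB()
vectordb.load_data(data["train"])   # prints "Vector database loaded and saved."

# Later run (e.g. after restarting the notebook): the file now exists,
# so load_data reads it from disk instead of calling the Voyage API again.
vectordb2 = VectorDB()
vectordb2.load_data(data["train"])  # prints "Loading vector database from disk."
```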
I saw these two lines in the code:
```python
vectordb = VectorDB()
vectordb.load_data(data["train"])
```
Does it mean that the input of vectordb.load_data is not the raw train.tsv, but some other format derived from /data/train.tsv?
Yes, you’re correct. The input to `vectordb.load_data` is not the raw `train.tsv` file but rather a processed form of the data from `train.tsv`.

The code reads the `train.tsv` file into a pandas DataFrame and then converts it into a list of dictionaries, where each dictionary represents a training example with its text and label. This processed data is then passed to `vectordb.load_data`.
Here is how the process works:
- Reading the TSV File:
```python
train_df = pd.read_csv("./data/train.tsv", sep='\t')
```
- Converting DataFrame to List of Dictionaries:
```python
def dataframe_to_dict_list(df):
    return df.apply(lambda x: {'text': x['text'], 'label': x['label']}, axis=1).tolist()

data['train'] = dataframe_to_dict_list(train_df)
```
- Loading Data into VectorDB:
```python
vectordb = VectorDB()
vectordb.load_data(data['train'])
```
So, the input to `vectordb.load_data` is a list of dictionaries, with each dictionary containing the text and label from the training data, not the raw `train.tsv` file.
Here’s a more detailed look at the relevant code snippets:
```python
# Read the TSV file into a DataFrame
train_df = pd.read_csv("./data/train.tsv", sep='\t')

# Helper function to convert a DataFrame to a list of dictionaries
def dataframe_to_dict_list(df):
    return df.apply(lambda x: {'text': x['text'], 'label': x['label']}, axis=1).tolist()

# Convert the DataFrame to a list of dictionaries
data['train'] = dataframe_to_dict_list(train_df)

# Initialize the VectorDB
vectordb = VectorDB()

# Load the processed training data into the VectorDB
vectordb.load_data(data['train'])
```
This ensures that the `load_data` method receives the data in the expected format for creating embeddings and storing them in the vector database.
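As a quick sanity check, printing the first element of `data['train']` shows the dictionary shape that `load_data` expects; the ticket text and label below are made-up placeholders, since the real dataset contents are not reproduced here:

```python
print(data['train'][0])
# Expected shape (values are illustrative placeholders):
# {'text': 'I was charged twice for my subscription this month.', 'label': 'Billing Inquiries'}
```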
```python
rag = vectordb.search(X, 5)
```
I know the X in the above line is one row from the text column of test.tsv. What does the 5 mean in the above line?
In the line `rag = vectordb.search(X, 5)`, the `5` is a parameter that specifies the number of top examples to retrieve from the vector database. Here is the breakdown of the parameters in the `search` method call:
- `X`: The query text for which you want to find similar examples.
- `5`: The number of similar examples (top-k) to retrieve from the vector database.
Explanation of the `search` Method:
```python
def search(self, query, k=5, similarity_threshold=0.75):
    query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
    similarities = np.dot(self.embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1]
    top_examples = [
        {"metadata": self.metadata[idx], "similarity": similarities[idx]}
        for idx in top_indices
        if similarities[idx] >= similarity_threshold
    ][:k]
    return top_examples
```
In the `search` method:
- `query`: The input text for which you are searching for similar examples.
- `k`: The number of top similar examples to retrieve. In this case, `5` means the method will return the top 5 most similar examples.
- `similarity_threshold`: A threshold that filters out examples that do not meet a minimum similarity score (default is 0.75).
So, when you call `vectordb.search(X, 5)`, it retrieves the top 5 most similar examples to the query text `X` from the vector database. Note that because of the `similarity_threshold` filter, fewer than 5 examples may come back if not enough stored tickets are similar enough to the query.
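To see what `search` actually returns, you could inspect the result as in the sketch below; the similarity scores, labels, and ticket texts in the comment are placeholders, not real output from the notebook:

```python
rag = vectordb.search("I need help with my billing statement.", k=5)

# Each entry pairs a stored training example (metadata) with its similarity to the query.
for example in rag:
    print(round(example["similarity"], 2),
          example["metadata"]["label"],
          example["metadata"]["text"][:60])

# Illustrative output shape (values are made up):
# 0.91 Billing Inquiries I was charged twice for my subscription this month.
# 0.84 Billing Inquiries Why did my invoice total change after checkout?
```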