adjust to Milvus Lite #8

Open
wants to merge 14 commits into
base: main
2 changes: 2 additions & 0 deletions .github/workflows/main.yml
@@ -50,3 +50,5 @@ jobs:
- name: Test with pytest
run: |
pytest --color=yes
env:
PYTHONWARNINGS: "ignore::DeprecationWarning:pkg_resources.*"
153 changes: 61 additions & 92 deletions docs/milvus.md
@@ -1,103 +1,72 @@
# Tutorial of Rule-based Retrieval through Milvus

The `whyhow_rbr` package helps create customized RAG pipelines. It is built on top
The `rule-based-retrieval` package helps create customized RAG pipelines. It is built on top
of the following technologies (and their respective Python SDKs)

- **OpenAI** - text generation
- **Milvus** - vector database
- **OpenAI** - text generation

## Initialization

Install package
```shell
pip install rule-based-retrieval
```

Import the essential packages:
```python
from pymilvus import DataType

from src.whyhow_rbr.rag_milvus import ClientMilvus
from whyhow_rbr import ClientMilvus, MilvusRule
```

## Client
## ClientMilvus

The central object is a `ClientMilvus`. It manages all necessary resources
The central object is `ClientMilvus`. It manages all necessary resources
and provides a simple interface for all the RAG related tasks.

First of all, to instantiate it one needs to provide the following
credentials:

- `OPENAI_API_KEY`
- `Milvus_URI`
- `Milvus_API_TOKEN`
- `milvus_uri`
- `milvus_token` (optional)
- `openai_api_key`

You need to create a file with the format "xxx.db" in your current directory
and use the file path as milvus_uri.

Initialize the ClientMilvus like this:

```python
# Set up your Milvus Cloud information
YOUR_MILVUS_CLOUD_END_POINT="YOUR_MILVUS_CLOUD_END_POINT"
YOUR_MILVUS_CLOUD_TOKEN="YOUR_MILVUS_CLOUD_TOKEN"
# Set up your Milvus Client information
YOUR_MILVUS_LITE_FILE_PATH = "./milvus_demo.db" # random name for milvus lite local db
OPENAI_API_KEY="<YOUR_OPEN_AI_KEY>"

# Initialize the ClientMilvus
milvus_client = ClientMilvus(
milvus_uri=YOUR_MILVUS_CLOUD_END_POINT,
milvus_token=YOUR_MILVUS_CLOUD_TOKEN
milvus_uri=YOUR_MILVUS_LITE_FILE_PATH,
openai_api_key=OPENAI_API_KEY
)
```
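As a minimal sketch of how the two values passed above might be prepared, the helper below checks that the Milvus Lite path uses the "xxx.db" format the tutorial asks for and reads the OpenAI key from the environment. The helper name `build_client_kwargs` and the `OPENAI_API_KEY` environment variable are assumptions made for illustration; they are not part of `whyhow_rbr`.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def build_client_kwargs(db_file: str, env: Mapping[str, str]) -> Dict[str, str]:
    """Collect the keyword arguments the tutorial passes to ClientMilvus.

    Hypothetical helper for illustration only.
    """
    db_path = Path(db_file)
    # The tutorial asks for a local file in the "xxx.db" format.
    if db_path.suffix != ".db":
        raise ValueError(f"expected a '.db' file path, got {db_file!r}")
    return {
        "milvus_uri": str(db_path),
        # Fall back to the tutorial's placeholder when the variable is unset.
        "openai_api_key": env.get("OPENAI_API_KEY", "<YOUR_OPEN_AI_KEY>"),
    }


client_kwargs = build_client_kwargs("./milvus_demo.db", os.environ)
print(client_kwargs["milvus_uri"])
```

The resulting dictionary has the same keys as the `ClientMilvus(...)` call shown above, so it could be unpacked into that call with `**client_kwargs`.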

## Vector database operations

This tutorial uses Milvus through `whyhow_rbr` for everything related to vector databases.

### Defining necessary variables
### Create the collection

```python
# Define collection name
COLLECTION_NAME="YOUR_COLLECTION_NAME" # take your own collection name

# Define vector dimension size
DIMENSION=1536 # decide by the model you use
```

### Add schema

Before inserting any data into the Milvus database, we first need to define the data fields, which together are called the schema. By creating a `CollectionSchema` object and adding data fields through `add_field()`, we can control the data types and their characteristics. This step is required.

```python
schema = milvus_client.create_schema(auto_id=True) # Enable id matching

schema = milvus_client.add_field(schema=schema, field_name="id", datatype=DataType.INT64, is_primary=True)
schema = milvus_client.add_field(schema=schema, field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)
```
We only defined `id` and `embedding` here because we need to define a primary field for each collection. For embedding, we need to define the dimension. We allow `enable_dynamic_field` which support auto adding schema, but we still encourage you to add schema by yourself. This method is a thin wrapper around the official Milvus implementation ([official docs](https://milvus.io/api-reference/pymilvus/v2.4.x/MilvusClient/Collections/create_schema.md))

### Creating an index

For each schema, it is better to have an index so that the querying will be much more efficient. To create an index, we first need an index_params and later add more index data on this `IndexParams` object.
```python
# Start to indexing data field
index_params = milvus_client.prepare_index_params()
index_params = milvus_client.add_index(
index_params=index_params, # pass in index_params object
field_name="embedding",
index_type="AUTOINDEX", # use autoindex instead of other complex indexing method
metric_type="COSINE", # L2, COSINE, or IP
)
```
This method is a thin wrapper around the official Milvus implementation ([official docs](https://milvus.io/api-reference/pymilvus/v2.4.x/MilvusClient/Management/add_index.md)).

### Create Collection

After defining all the data field and indexing them, we now need to create our database collection so that we can access our data quick and precise. What's need to be mentioned is that we initialized the `enable_dynamic_field` to be true so that you can upload any data freely. The cost is the data querying might be inefficient.
```python
# Create Collection
milvus_client.create_collection(
collection_name=COLLECTION_NAME,
schema=schema,
index_params=index_params
)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=DIMENSION)
```

## Uploading documents

After creating a collection, we are ready to populate it with documents. In
`whyhow_rbr` this is done using the `upload_documents` method of the `MilvusClient`.
`whyhow_rbr` this is done using the `upload_documents` method of the `ClientMilvus`.
It performs the following steps under the hood:

- **Preprocessing**: Reading and splitting the provided PDF files into chunks
@@ -112,28 +81,26 @@ pdfs = ["harry-potter.pdf", "game-of-thrones.pdf"] # replace to your pdfs path

# Uploading the PDF document
milvus_client.upload_documents(
collection_name=COLLECTION_NAME,
documents=pdfs
)
```
## Question answering

Now we can finally move to retrieval augmented generation.

In `whyhow_rbr` with Milvus, it can be done via the `search` method.
In `whyhow_rbr` with Milvus, it can be done via the `query` method.

1. Simple example:
1. Simple example without rules:

```python
# Search data and implement RAG!
res = milvus_client.search(
question='What food does Harry Potter like to eat?',
collection_name=COLLECTION_NAME,
anns_field='embedding',
output_fields='text'
result = milvus_client.query(
question="What is Harry Potter's favorite food?",
process_rules_separately=True,
keyword_trigger=False,
)
print(res['answer'])
print(res['matches'])
print(result["answer"])
print(result["matches"])
```

The `result` is a dictionary that has the following keys
@@ -142,22 +109,20 @@ The `result` is a dictionary that has the following keys
- `matches` - the `limit` most relevant documents from the index

Note that the number of matches will be in general equal to `limit` which
can be specified as a parameter.
can be specified as a parameter. The default value is 5.

### Clean up

Finally, after following all the instructions, you can clean up the database
by calling `drop_collection()`.
```python
# Clean up
milvus_client.drop_collection(
collection_name=COLLECTION_NAME
)
milvus_client.drop_collection()
```

### Rules

In the previous example, every single document in our index was considered.
In the previous example, every single document in our collection was considered.
However, sometimes it might be beneficial to only retrieve documents satisfying some
predefined conditions (e.g. `filename=harry-potter.pdf`). In `whyhow_rbr` through Milvus, this
can be done via adjusting searching parameters.
@@ -166,37 +131,41 @@ A rule can control the following metadata attributes

- `filename` - name of the file
- `page_numbers` - list of integers corresponding to page numbers (0 indexing)
- `id` - unique identifier of a chunk (this is the most "extreme" filter)
- `uuid` - unique identifier of a chunk (this is the most "extreme" filter)
- `keywords` - list of keywords to trigger the rule
- Other rules base on [Boolean Expressions](https://milvus.io/docs/boolean.md)
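To make the attributes above concrete, here is a rough sketch of how a rule could be turned into a Milvus boolean expression and how keywords could decide whether a rule fires. `SketchRule`, `rule_to_filter`, and `is_triggered` are hypothetical names, the scalar field names (`filename`, `page_number`, `uuid`) are assumed to match the collection schema, and the keyword matching is a guess at the intended behavior; none of this is the actual `MilvusRule` implementation.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchRule:
    """Hypothetical stand-in for MilvusRule, for illustration only."""

    filename: Optional[str] = None
    page_numbers: List[int] = field(default_factory=list)
    uuid: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


def rule_to_filter(rule: SketchRule) -> str:
    """Join the set attributes into one boolean expression with `and`."""
    clauses = []
    if rule.filename is not None:
        # json.dumps adds the double quotes and escapes special characters.
        clauses.append(f"filename == {json.dumps(rule.filename)}")
    if rule.page_numbers:
        clauses.append(f"page_number in {rule.page_numbers}")
    if rule.uuid is not None:
        clauses.append(f"uuid == {json.dumps(rule.uuid)}")
    return " and ".join(clauses)


def is_triggered(rule: SketchRule, question: str) -> bool:
    """Assumed behavior: a rule fires if any keyword occurs in the question."""
    lowered = question.lower()
    return any(keyword.lower() in lowered for keyword in rule.keywords)


rule = SketchRule(
    filename="harry-potter.pdf",
    page_numbers=[120, 121, 150],
    keywords=["food", "favorite", "likes to eat"],
)
print(rule_to_filter(rule))
# filename == "harry-potter.pdf" and page_number in [120, 121, 150]
print(is_triggered(rule, "What is Harry Potter's favorite food?"))
# True
```

A rule with no attributes set yields an empty expression in this sketch, which corresponds to searching the whole collection.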

Rules Example:

```python
# RULES(search on book harry-potter on page 8):
PARTITION_NAME='harry-potter' # search on books
page_number='page_number == 8'

# first create a partitions to store the book and later search on this specific partition:
milvus_client.crate_partition(
collection_name=COLLECTION_NAME,
partition_name=PARTITION_NAME # separate base on your pdfs type
)
# RULES:
rules = [
MilvusRule(
# Replace with your rule
filename="harry-potter.pdf",
page_numbers=[120, 121, 150],
),
MilvusRule(
# Replace with your rule
filename="harry-potter.pdf",
page_numbers=[120, 121, 150],
keywords=["food", "favorite", "likes to eat"]
),
]

# search with rules
res = milvus_client.search(
question='Tell me about the greedy method',
collection_name=COLLECTION_NAME,
partition_names=PARTITION_NAME,
filter=page_number, # append any rules follow the Boolean Expression Rule
anns_field='embedding',
output_fields='text'
res = milvus_client.query(
question="What is Harry Potter's favorite food?",
rules=rules,
process_rules_separately=True,
keyword_trigger=False,
)
print(res['answer'])
print(res['matches'])
print(res["answer"])
print(res["matches"])
```

In this example, we first create a partition that store harry-potter related pdfs, and through searching within this partition, we can get the most direct information.
Also, we apply page number as a filter to specify the exact page we wish to search on.
Remember, the filer parameter need to follow the [boolean rule](https://milvus.io/docs/boolean.md).
In this example, the `process_rules_separately` parameter is set to `True`. This means that each rule will be processed independently, ensuring that both rules contribute to the final result set.

By default, all rules are run as one joined query, which means that one rule can dominate the others, and given the return limit, a lower priority rule might not return any results. However, by setting `process_rules_separately` to `True`, each rule will be processed independently, ensuring that every rule returns results, and the results will be combined at the end.
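The difference between the joined and the separate mode can be illustrated with a small in-memory sketch. It does not call Milvus; the chunk dictionaries, similarity scores, and the `query_joined` and `query_separately` helpers are all made up for illustration, and applying the limit once per rule in the separate mode is an assumption about how the library behaves.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, object]
Rule = Callable[[Chunk], bool]


def top_k(chunks: List[Chunk], limit: int) -> List[Chunk]:
    """Return the `limit` chunks with the highest similarity score."""
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:limit]


def query_joined(chunks: List[Chunk], rules: List[Rule], limit: int) -> List[Chunk]:
    """One query: a chunk qualifies if any rule matches; limit applied once."""
    matching = [c for c in chunks if any(rule(c) for rule in rules)]
    return top_k(matching, limit)


def query_separately(chunks: List[Chunk], rules: List[Rule], limit: int) -> List[Chunk]:
    """One query per rule: limit applied per rule, results combined at the end."""
    combined: List[Chunk] = []
    seen = set()
    for rule in rules:
        for chunk in top_k([c for c in chunks if rule(c)], limit):
            if chunk["id"] not in seen:  # skip chunks already returned by another rule
                seen.add(chunk["id"])
                combined.append(chunk)
    return combined


chunks = [
    {"id": 1, "filename": "a.pdf", "score": 0.9},
    {"id": 2, "filename": "a.pdf", "score": 0.8},
    {"id": 3, "filename": "a.pdf", "score": 0.7},
    {"id": 4, "filename": "b.pdf", "score": 0.3},
    {"id": 5, "filename": "b.pdf", "score": 0.2},
]
rules = [
    lambda c: c["filename"] == "a.pdf",
    lambda c: c["filename"] == "b.pdf",
]

joined_ids = [c["id"] for c in query_joined(chunks, rules, limit=2)]
separate_ids = [c["id"] for c in query_separately(chunks, rules, limit=2)]
print(joined_ids)    # [1, 2]
print(separate_ids)  # [1, 2, 4, 5]
```

In the joined case the higher-scoring `a.pdf` chunks use up the whole limit, so the `b.pdf` rule contributes nothing; in the separate case each rule returns its own top matches.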

That's all for the Milvus implementation of Rule-based Retrieval.
95 changes: 30 additions & 65 deletions examples/milvus_tutorial.py
@@ -1,94 +1,59 @@
"""Script that demonstrates how to use the RAG model with Milvus to implement rule-based retrieval."""

import os
from whyhow_rbr.rag_milvus import ClientMilvus, MilvusRule

from pymilvus import DataType

from src.whyhow_rbr.rag_milvus import ClientMilvus

# Set up your Milvus Cloud information
YOUR_MILVUS_CLOUD_END_POINT = os.getenv("YOUR_MILVUS_CLOUD_END_POINT")
YOUR_MILVUS_CLOUD_TOKEN = os.getenv("YOUR_MILVUS_CLOUD_TOKEN")

# Initialize the ClientMilvus
milvus_client = ClientMilvus(
milvus_uri=YOUR_MILVUS_CLOUD_END_POINT,
milvus_token=YOUR_MILVUS_CLOUD_TOKEN,
)
# Set up your Milvus Client information
YOUR_MILVUS_LITE_FILE_PATH = "./milvus_demo.db" # local file name used by Milvus Lite to persist data


# Define collection name
COLLECTION_NAME = "YOUR_COLLECTION_NAME" # take your own collection name


# Create necessary schema to store data
DIMENSION = 1536 # decide by the model you use

schema = milvus_client.create_schema(auto_id=True) # Enable id matching

schema = milvus_client.add_field(
schema=schema, field_name="id", datatype=DataType.INT64, is_primary=True
)
schema = milvus_client.add_field(
schema=schema,
field_name="embedding",
datatype=DataType.FLOAT_VECTOR,
dim=DIMENSION,
)


# Start to indexing data field
index_params = milvus_client.prepare_index_params()
index_params = milvus_client.add_index(
index_params=index_params, # pass in index_params object
field_name="embedding",
index_type="AUTOINDEX", # use autoindex instead of other complex indexing method
metric_type="COSINE", # L2, COSINE, or IP
# Initialize the ClientMilvus
milvus_client = ClientMilvus(
milvus_uri=YOUR_MILVUS_LITE_FILE_PATH,
openai_api_key="<YOUR_OPEN_AI_KEY>",
)


# Create Collection
milvus_client.create_collection(
collection_name=COLLECTION_NAME, schema=schema, index_params=index_params
)


# Create a Partition, list it out
milvus_client.crate_partition(
collection_name=COLLECTION_NAME,
partition_name="xxx", # Put in your own partition name, better fit the document you upload
)

partitions = milvus_client.list_partition(collection_name=COLLECTION_NAME)
print(partitions)
milvus_client.create_collection(collection_name=COLLECTION_NAME)


# Uploading the PDF document
# get pdfs
pdfs = ["harry-potter.pdf", "game-of-thrones.pdf"] # replace to your pdfs path
# get pdfs from data directory in current directory
pdfs = ["data/1.pdf", "data/2.pdf"] # replace to your pdfs path

milvus_client.upload_documents(
collection_name=COLLECTION_NAME, partition_name="xxx", documents=pdfs
)

milvus_client.upload_documents(documents=pdfs)


# add your rules:
filter = ""
partition_names = None
rules = [
MilvusRule(
# Replace with your filename
filename="data/1.pdf",
page_numbers=[],
),
MilvusRule(
# Replace with your filename
filename="data/2.pdf",
page_numbers=[],
),
]


# Search data and implement RAG!
res = milvus_client.search(
question="Tell me about the greedy method",
collection_name=COLLECTION_NAME,
filter=filter,
partition_names=None,
anns_field="embedding",
output_fields="text",
res = milvus_client.query(
question="YOUR_QUESTIONS",
rules=rules,
process_rules_separately=True,
keyword_trigger=False,
)
print(res["answer"])
print(res["matches"])


# Clean up
milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.drop_collection()