11 changes: 11 additions & 0 deletions .env-example
@@ -0,0 +1,11 @@
# slack app token (must be set)
export SLACK_APP_TOKEN="xapp-..."
# slack bot token (must be set)
export SLACK_BOT_TOKEN="xoxb-..."
# openai api key (set if using openai api)
export OPENAI_API_KEY="sk-..."

# slack bot user id (used when scanning for messages)
export SLACK_BOT_USERID="U01234567ABC"
# channel name on slack for scanning
export SLACK_CHANNEL_NAME="bot-channel-name"
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
venv/*
pdfs/*
.env
18 changes: 18 additions & 0 deletions .justfile
@@ -0,0 +1,18 @@
run *FLAGS: activate
    source ./.env && ipython {{FLAGS}} virtualpi.py pdfs

scan *FLAGS: activate
    source ./.env && ipython {{FLAGS}} scan_messages.py

activate:
    source ./venv/bin/activate

clean:
    rm -f ./pdfs/docs.pkl

setup: clean
    rm -rf ./venv
    python -m venv venv
    just activate
    pip install -r requirements.txt
    mkdir -p pdfs
6 changes: 6 additions & 0 deletions Dockerfile
@@ -0,0 +1,6 @@
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY virtualpi.py .
CMD ["ipython","virtualpi.py","./pdfs"]
68 changes: 61 additions & 7 deletions README.md
@@ -9,38 +9,82 @@ Why the name? When your Principal Investigator goes on holidays, you need a *Vir
This work was first inspired by a conversation with the authors of [Galactic ChitChat: Using Large Language Models to Converse with Astronomy Literature](https://arxiv.org/abs/2304.05406), who implemented a similar tool, using a similar software stack. Virtual PI was first implemented and used for querying documentation for an astronomical instrument, [MAVIS](https://mavis-ao.org/).

## Configuration
#### API keys
```bash
# create your .env file:
cp .env-example .env
# set your environment variables:
vim .env
```
#### Launching the bot

To run the script, you require:
* A directory with the PDFs you wish the expert system to ingest;
* A directory with the PDFs you wish the expert system to ingest (e.g., `./pdfs/*.pdf`);
* A working Python3 environment with the following packages available:
* `pip3 install slack_bolt paper-qa==1.2`
* NB: At the time of writing the default pip version of paper-qa and its langchain dependency are out of sync, hence requesting version 1.2.
* `pip3 install -r requirements.txt`
* An OpenAI [API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key).
* You can [Create a new Slack app](https://api.slack.com/tutorials/tracks/responding-to-app-mentions) that is preconfigured with the necessary permissions by pressing the green 'Create App' button on that page.
* You can change the name of your app/bot (you'll use this name to interact with it on Slack) by editing the 'manifest' file when the option is presented.
* You will need to copy the App and Bot Tokens to set as environment variables, as described below.

The three API tokens you have generated should be exported to your shell environment at runtime:

```
```bash
export OPENAI_API_KEY="sk-M...M"
export SLACK_APP_TOKEN="xapp-1...d"
export SLACK_BOT_TOKEN="xoxb-2...C"
```
e.g., by `source`ing the `.env` file after modifying it.

Then you can start the app as follows.

`python3 virtualpi.py /path/to/your/PDF/directory/`
```bash
python3 virtualpi.py /path/to/your/PDF/directory/
```

#### Recording Reactions
In some cases, you may wish to gather the reactions to bot messages (e.g., for further optimisation of the bot) by scanning a channel.
Assuming the `.env` is set up correctly, you can save this data to disk as `bot_messages.json` by running the `scan_messages.py` script:
```bash
python scan_messages.py
```
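
Once the file exists, the reactions can be tallied with a minimal sketch like the one below, assuming the standard Slack message schema in which each saved message may carry a `reactions` list (the file name matches the script's default output):

```python
import json
from collections import Counter

# Tally emoji reactions across the saved bot messages
with open("bot_messages.json") as f:
    messages = json.load(f)

counts = Counter()
for message in messages:
    for reaction in message.get("reactions", []):
        counts[reaction["name"]] += reaction["count"]

print(counts.most_common())
```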

To get the bot's user ID (required in `.env`), find the bot's profile in your Slack workspace and copy the ID shown (starting with `U...`), e.g.:

<img src="images/vpiuid.png" style="width:500px;display:block;margin-left:auto;margin-right:auto"/>
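
Alternatively, the bot's user ID can usually be retrieved programmatically via Slack's `auth.test` method, as in this sketch (assuming `SLACK_BOT_TOKEN` is already exported):

```python
import os
from slack_sdk import WebClient

# auth.test reports the identity behind the token, i.e. the bot user itself
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
print(client.auth_test()["user_id"])  # e.g. U01234567ABC
```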


The first time you run the bot (when it embeds all of the documents), the script will exit and ask you to restart it (this works around what appears to be a timeout issue in the Slack libraries).

### Using [just](https://github.com/casey/just):
`just` abstracts a few of these setup tasks; see the full set of recipes in the `.justfile`.

After setting API keys (as above), you can create a virtual environment, install dependencies, and create a `./pdfs/` directory by running:
```bash
just setup
```

Then (after adding your PDFs to `./pdfs/`) you can start the Slack bot using:
```bash
just run
```

To record reactions by scanning a Slack channel, set the appropriate `.env` variables and run:
```bash
just scan
```

## Saving State

When the script starts, it checks whether a pickled version of the dense document vector is already available in the PDF directory. If found, it uses that existing state (which saves time and the cost of API calls); otherwise it parses the PDFs, embeds them into the FAISS dense vector, and saves this state for next time.
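
In outline, the load-or-build step resembles the sketch below (a simplified view of the logic in `virtualpi.py`; the real script also wires up the OpenAI client and a progress bar):

```python
import glob, os, pickle
from paperqa import Docs

PAPERDIR = "./pdfs"
STATE = os.path.join(PAPERDIR, "docs.pkl")

try:
    # Reuse previously embedded documents if a state file exists
    with open(STATE, "rb") as f:
        docs = pickle.loads(pickle.load(f))
except FileNotFoundError:
    docs = None

if docs is None:
    # No saved state: parse and embed every PDF, then save for next time
    docs = Docs(llm="gpt-3.5-turbo")
    for pdf in glob.glob(os.path.join(PAPERDIR, "*.pdf")):
        name = os.path.splitext(os.path.basename(pdf))[0]
        docs.add(pdf, docname=name, citation=name)
    with open(STATE, "wb") as f:
        pickle.dump(pickle.dumps(docs), f)
```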

NB: If you add/remove PDFs you will need to remove the state file!

`rm /path/to/your/PDF/directory/docs.pkl`
```bash
rm /path/to/your/PDF/directory/docs.pkl
```
or
```bash
just clean
```

## Add to Slack Workspace

@@ -55,3 +99,13 @@ By now your app should be happily running. The final step is to actually add it

An example interaction is shown below:
![alt text](images/MAVIS-IMBH.png "Example Slack interaction")

## Docker
Running with Docker is probably the easiest all-round solution, but it can make debugging a bit more tedious. To run with Docker, use:
```bash
docker build -t virtualpi:latest .
docker run --restart=unless-stopped -d -v ./pdfs:/app/pdfs --env-file=./.env virtualpi
```
This has the benefit of allowing multiple bots to run on different PDF sources: you can build the image once, then spin up a new container for each (changing the `./pdfs` directory and probably the `.env` file), as in the sketch below.
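
For example (illustrative names and paths only), a second bot could be started from the same image with something like:

```bash
docker run --restart=unless-stopped -d \
  --name virtualpi-other \
  -v ./other-pdfs:/app/pdfs \
  --env-file=./other.env \
  virtualpi
```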

Note that, for now, the `.env` format is not compatible between `just run` and `docker run`: for Docker, remove the `export` keyword and the quotation marks from the `.env` file. TODO: fix this.
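
Until then, one possible workaround (a sketch; `.env.docker` is just an illustrative file name) is to generate a Docker-friendly copy automatically:

```bash
# strip the leading "export " and the quotes for Docker's --env-file format
sed -e 's/^export //' -e 's/"//g' .env > .env.docker
docker run --restart=unless-stopped -d -v ./pdfs:/app/pdfs --env-file=./.env.docker virtualpi
```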
Binary file added images/vpiuid.png
36 changes: 36 additions & 0 deletions requirements.txt
@@ -0,0 +1,36 @@
annotated-types==0.6.0
anyio==4.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
distro==1.9.0
h11==0.14.0
html2text==2020.1.16
httpcore==1.0.2
httpx==0.26.0
idna==3.6
jsonpatch==1.33
jsonpointer==2.4
langchain-core==0.1.18
langchain-openai==0.0.5
langsmith==0.0.86
numpy==1.26.4
openai==1.11.1
packaging==23.2
paper-qa==4.0.0rc7
pycryptodome==3.20.0
pydantic==2.6.1
pydantic_core==2.16.2
pypdf==4.0.1
PyYAML==6.0.1
regex==2023.12.25
requests==2.31.0
slack-bolt==1.18.1
slack_sdk==3.26.2
sniffio==1.3.0
tenacity==8.2.3
tiktoken==0.5.2
tqdm==4.66.1
typing_extensions==4.9.0
urllib3==2.2.0
ipython
tqdm
62 changes: 62 additions & 0 deletions scan_messages.py
@@ -0,0 +1,62 @@
#!/usr/bin/python3

import os
from slack_bolt import App
import json

#Create handle to Slack
app = App(token=os.environ["SLACK_BOT_TOKEN"])
bot_userid = os.environ["SLACK_BOT_USERID"]
channel_name = os.environ["SLACK_CHANNEL_NAME"]

save_to_file = "./bot_messages.json"

channel_id = None
# Call the conversations.list method using the WebClient
for result in app.client.conversations_list():
    if channel_id is not None:
        break
    for channel in result["channels"]:
        if channel["name"] == channel_name:
            channel_id = channel["id"]
            #Print result
            print(f"Found conversation ID: {channel_id}")
            break

if channel_id is None:
    raise ValueError(f"Unable to find channel named: {channel_name:s}")

# Store conversation history
conversation_history = []

# Call the conversations.history method using the WebClient
# conversations.history returns the first 100 messages by default
# These results are paginated, see: https://api.slack.com/methods/conversations.history$pagination
result = app.client.conversations_history(channel=channel_id)
while True:
    conversation_history += result["messages"]
    if not result.data["has_more"]:
        break
    cursor = result.data["response_metadata"]["next_cursor"]
    result = app.client.conversations_history(channel=channel_id,cursor=cursor)

# Print results
print(f"{len(conversation_history):d} messages found in {channel_id:s}")

bot_messages = []
for message in conversation_history:
    if message["user"]!=bot_userid:
        continue
    if "subtype" in message and message["subtype"] == "channel_join":
        continue
    message.pop("blocks")
    message.pop("bot_profile")
    bot_messages.append(message)

# Print results
print(f"{len(bot_messages):d} bot messages found in {channel_id:s}")

with open(save_to_file,"w") as f:
    json.dump(bot_messages,f,indent=4)

print(f"finished successfully, saved to {save_to_file:s}")
52 changes: 33 additions & 19 deletions virtualpi.py
@@ -13,6 +13,9 @@
from paperqa import Docs
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
from openai import AsyncOpenAI
from tqdm import tqdm
chat = AsyncOpenAI()

#Create handle to Slack
app = App(token=os.environ["SLACK_BOT_TOKEN"])
@@ -21,6 +24,7 @@
#This function is called when a Slack user mentions the bot
@app.event("app_mention")
def event_test(say, body):
    print("received question, working on answer.")
    try:
        #This gets the question text from the user
        user_question=body["event"]["blocks"][0]["elements"][0]["elements"][1]["text"]
Expand All @@ -29,11 +33,9 @@ def event_test(say, body):
answer = docs.query(user_question, k=30, max_sources=10)
#Print some stuff locally
print(answer.formatted_answer)
for p in answer.passages:
print("* %s: %s\n"%(p, answer.passages[p]))
print("\n\n\n")
#Send the answer to Slack
say(answer.formatted_answer)
#Send the (minimal) answer to Slack
say(answer.answer)
except Exception as e:
print("Error: %s"%e)

Expand All @@ -52,10 +54,14 @@ def event_test(say, body):
try:
#Load the pre-pickled document vector if it exists
with open("%s/docs.pkl"%PAPERDIR, "rb") as f:
docs = pickle.load(f)
docs = pickle.loads(pickle.load(f))
docs.set_client(chat)
print("Loaded previous state from %s/docs.pkl"%PAPERDIR)
print(" - remove this file if you change the set of PDFs\n")
except:
except FileNotFoundError:
docs = None

if docs is None:
#Couldn't load a pre-picked version
papers=[]
filesfound=glob.glob("%s/*"%PAPERDIR)
Expand All @@ -70,34 +76,42 @@ def event_test(say, body):
print("Found %d PDFs in %s"%(len(papers),PAPERDIR))

#Add each paper in turn to paper-qa/FAISS/OpenAI embedding
docs = Docs(llm='gpt-3.5-turbo', summary_llm="davinci")
for p in papers:
docs = Docs(llm="gpt-3.5-turbo",client=chat)
print("Embedding documents")
pbar = tqdm(papers,leave=True,desc="")
for p in pbar:
try:
#Get the base file name to use as the citation
citation=os.path.split(p)[-1]
#Strip off the ".pdf" or ".PDF"
citation=citation[0:citation.rfind(".")]
#Embed this doc
print("Embedding %s"%citation)
docs.add(p, citation=citation, key=citation)
pbar.set_description(f"doc={citation:s}")
docs.add(p,docname=citation,citation=citation)
except Exception as e:
print("Error processing %s: %s"%(p,e))
try:
with open("%s/docs.pkl"%PAPERDIR, "wb") as f:
#Save this state for next time
print("\nSaving state to file %s/docs.pkl - this may take some time."%PAPERDIR)
pickle.dump(docs, f)
pickle.dump(pickle.dumps(docs), f)
except Exception as e:
print("Couldn't save state into %s - is it writeable?"%PAPERDIR)
print("Error was: %s"%e)
sys.exit(1)
finally:
#This is only necessary as the Slack handle created above seems to break
#during the long delay of embedding and pickling. Some kind of bug?
print("State saved okay - please restart program.")
sys.exit(1)
sys.exit(2)

docs.prompts.qa = ("Write an answer ({answer_length}) "
"for the question below based on the provided context. "
"If the context provides insufficient information, "
'reply "I cannot answer". '
"For each part of your answer, indicate which sources most support it "
"via valid citation markers at the end of sentences, like (Example2012). "
"Answer in an unbiased, comprehensive, and scholarly tone. "
"If the question is subjective, provide an opinionated answer in the concluding 1-2 sentences. "
"Use Markdown for formatting code or text, and try to use direct quotes to support arguments.\n\n"
"{context}\n"
"Question: {question}\n"
"Answer: ")

#Set up the Slack interface to start servicing requests
print("Starting Slack handler - bot is ready to answer your questions!")
SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()