The quanteda.llm package makes it easy to use LLMs with quanteda corpora (or data frames) to classify, summarise, score, and analyse documents and text. quanteda provides a host of convenient functions for managing, manipulating, and describing corpora, and for linking document variables and metadata to their documents. quanteda.llm connects these to LLMs for analysing or classifying texts, and stores the LLM output as new variables. Following a tidy approach and linking to the new quanteda.tidy package, it supports common Tidyverse functions for manipulating LLM-created objects and variables.
The package includes the following functions:
- ai_summarize: Summarizes documents in a corpus.
- ai_relevance: Classifies documents in a corpus according to a set of topics and assesses the relevance of each document to each topic.
- ai_score: Scores documents in a corpus according to a defined scale.
- ai_validate: Starts an interactive app to manually validate the LLM-generated summaries, labels or scores.
More to follow.
The package supports all LLMs currently available with the ellmer package, including:
- Anthropic’s Claude: chat_claude
- AWS Bedrock: chat_bedrock
- Azure OpenAI: chat_azure
- Databricks: chat_databricks
- DeepSeek: chat_deepseek
- GitHub model marketplace: chat_github
- Google Gemini: chat_gemini
- Groq: chat_groq
- Ollama: chat_ollama
- OpenAI: chat_openai
- OpenRouter: chat_openrouter
- perplexity.ai: chat_perplexity
- Snowflake Cortex: chat_snowflake and chat_cortex_analyst
- VLLM: chat_vllm
For authentication and usage of each of these LLMs, please refer to the respective ellmer documentation. For example, to use the chat_ollama models, first download and install Ollama. Then install some models, either from the command line (e.g. with ollama pull llama3.1) or within R using the rollama package. The Ollama app must be running for the models to be used. To use the chat_openai models, you need to sign up for an API key from OpenAI, which you can save in your .Renviron file as OPENAI_API_KEY.
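As a quick sanity check before calling any of the OpenAI-backed functions, it can help to confirm that the key is visible to the current R session. The sketch below uses only base R; `has_api_key` is a hypothetical helper written for this illustration and is not part of quanteda.llm or ellmer.

```r
# Hypothetical helper (not part of quanteda.llm or ellmer): returns TRUE
# if the named environment variable is set to a non-empty value.
has_api_key <- function(name = "OPENAI_API_KEY") {
  nzchar(Sys.getenv(name))
}

# .Renviron is read when R starts, so after editing the file either
# restart R or reload it with readRenviron("~/.Renviron").
if (!has_api_key()) {
  message("OPENAI_API_KEY is not set; add it to .Renviron and restart R.")
}
```

The same check works for other providers by passing the name of the variable that the relevant ellmer documentation asks for.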
You can install the development version of quanteda.llm from GitHub with:
# install.packages("pak")
pak::pak("quanteda/quanteda.llm")
library(quanteda)
library(quanteda.llm)
#pak::pak("quanteda/quanteda.tidy")
library(quanteda.tidy)
corpus <- quanteda::data_corpus_inaugural %>%
  quanteda.tidy::mutate(llm_sum = ai_summarize(text,
    chat_fn = chat_openai, model = "gpt-4o",
    api_args = list(temperature = 0, seed = 42)))
# llm_sum is created as a new docvar in the corpus
library(quanteda)
library(quanteda.llm)
#pak::pak("quanteda/quanteda.tidy")
library(quanteda.tidy)
topics <- c("Politics", "Sports", "Technology", "Entertainment", "Business", "Other")
corpus <- quanteda::data_corpus_inaugural %>%
  quanteda.tidy::mutate(llm_relevance = ai_relevance(text,
    chat_fn = chat_openai, model = "gpt-4o",
    api_args = list(temperature = 0, seed = 42), topics = topics))
# llm_relevance is created as a new docvar in the corpus
library(quanteda)
library(quanteda.llm)
#pak::pak("quanteda/quanteda.tidy")
library(quanteda.tidy)
scale <- "Score the following document on a scale of how much it aligns
with the political left. The political left is defined as groups which
advocate for social equality, government intervention in the economy,
and progressive policies. Use the following metrics:
SCORING METRIC:
1 : extremely left
0 : not at all left"
corpus <- quanteda::data_corpus_inaugural %>%
  quanteda.tidy::mutate(llm_score = ai_score(text,
    chat_fn = chat_openai, model = "gpt-4o",
    api_args = list(temperature = 0, seed = 42),
    scale = scale, evidence = TRUE))
# llm_score is created as a new docvar in the corpus
# evidence is created as a new docvar in the corpus with the LLM's reasoning
# `few_shot_examples` can be used to provide a labelled dataset with examples of the scoring scale
library(quanteda)
library(quanteda.llm)
#pak::pak("quanteda/quanteda.tidy")
library(quanteda.tidy)
scale <- "Score the following document on a scale of how much it aligns
with the political left. The political left is defined as groups which
advocate for social equality, government intervention in the economy,
and progressive policies. Use the following metrics:
SCORING METRIC:
1 : extremely left
0 : not at all left"
corpus <- quanteda::data_corpus_inaugural %>%
  quanteda.tidy::mutate(llm_score = ai_score(text,
    chat_fn = chat_openai, model = "gpt-4o",
    api_args = list(temperature = 0, seed = 42), scale = scale))
# llm_score is created as a new docvar in the corpus
# Start the interactive app to validate the LLM-generated scores
corpus <- corpus %>%
  quanteda.tidy::mutate(validated = ai_validate(text, llm_score))
# validated is created as a new docvar in the corpus with all non-validated scores set to NA
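Because non-validated scores are set to NA, the validated docvar can be summarised with ordinary base R. The sketch below uses a small made-up data frame as a stand-in for the corpus docvars; the values are invented for illustration, and it assumes that a validated row keeps its score.

```r
# Made-up stand-in for the docvars of a scored and validated corpus.
# In practice these columns would come from ai_score() and ai_validate().
dv <- data.frame(
  llm_score = c(0.2, 0.7, 0.4, 0.9),
  validated = c(0.2, NA, 0.4, NA)  # NA marks scores not yet validated
)

# Flag the rows a human has validated, then count and summarise them
is_validated <- !is.na(dv$validated)
n_validated <- sum(is_validated)
share_validated <- mean(is_validated)

# Keep only the validated rows for downstream analysis
validated_only <- dv[is_validated, , drop = FALSE]
```

Filtering on the NA pattern like this keeps unvalidated LLM output out of any analysis that should rest on human-checked scores.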