This project demonstrates how to deploy the Meta-Llama-3.1-8B-Instruct model with a real-time character filter that reduces multilingual noise, a problem that is especially pronounced in low-resource languages. By using vLLM's Python-only development environment, custom filtering logic for multilingual LLMs can be developed and tested rapidly.
- Character Filtering: Filters out characters from unintended languages (e.g., Chinese, Hebrew, Korean) to keep responses coherent in the target language; a standalone sketch of the idea follows this list.
- Real-Time Processing: Filters multilingual noise during generation, preserving only relevant tokens in the output.
- Python-Only Development Mode: Enables live code changes without recompilation, streamlining development.
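As a standalone illustration of the character-filtering idea, here is a minimal sketch, independent of vLLM, using the same Unicode ranges as the engine-side filter shown later (the function name and sample strings are illustrative):

import re

# Unicode ranges: CJK Unified Ideographs (Chinese), Hangul syllables (Korean), and Hebrew
UNINTENDED_CHARS = re.compile(r'[\u4E00-\u9FFF\uAC00-\uD7AF\u0590-\u05FF]')

def contains_unintended_chars(text: str) -> bool:
    # True if the text contains any character from an unintended language
    return bool(UNINTENDED_CHARS.search(text))

print(contains_unintended_chars("Hello, world!"))  # False
print(contains_unintended_chars("Hello 世界"))      # True: contains Chinese characters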
- Install vLLM:
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
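To sanity-check the install, you can confirm that the package imports and reports its version:

python -c "import vllm; print(vllm.__version__)"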
- Set Up Local Development Environment:
git clone https://github.com/vllm-project/vllm.git
cd vllm
python python_only_dev.py
- Run the vLLM Server:
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key token-abc123
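Once the server is running, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch, assuming the openai Python package is installed; the prompt is illustrative:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="token-abc123",               # must match the --api-key passed to vllm serve
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in English."}],
)
print(completion.choices[0].message.content)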
In the step function of the LLMEngine class, a character filter detects and suppresses tokens from unintended languages by forcing their log probabilities to negative infinity. See llm_engine.py for the core implementation.
import re

def contains_specific_language(self, text):
    # Regex pattern to match Chinese, Korean, and Hebrew Unicode ranges
    pattern = r'[\u4E00-\u9FFF\uAC00-\uD7AF\u0590-\u05FF]'
    return bool(re.search(pattern, text))
outputs = self.model_executor.execute_model(
    execute_model_req=execute_model_req
)
print("Original output:", outputs)
logprobs = outputs[0].outputs[0].samples[0].logprobs

# Iterate through candidate tokens and filter based on language-specific characters
for token_id in logprobs.keys():
    decoded_token = self.tokenizer_llama.decode(token_id)
    if self.contains_specific_language(decoded_token):
        logprobs[token_id].logprob = float('-inf')  # zero probability, so the token can never win

# Sort log probabilities by their updated values and re-rank the tokens
sorted_logprobs = sorted(logprobs.items(), key=lambda item: item[1].logprob, reverse=True)
sorted_logprobs_with_new_ranks = {
    item[0]: Logprob(logprob=item[1].logprob, rank=index + 1, decoded_token=item[1].decoded_token)
    for index, item in enumerate(sorted_logprobs)
}
outputs[0].outputs[0].samples[0].logprobs = sorted_logprobs_with_new_ranks

# Update the output token with the highest-ranked token post-filtering
# (sorted_logprobs is a list of (token_id, Logprob) pairs, so index it directly)
outputs[0].outputs[0].samples[0].output_token = sorted_logprobs[0][0]
print("Filtered output:", outputs)
To exit the Python-only development environment:

python python_only_dev.py --quit-dev

- Efficient Multilingual Support: The character filter significantly improves output coherence without requiring model retraining.
- Scalability: This setup supports large-scale deployments of multilingual LLMs, offering cleaner, language-specific outputs.
- Adaptability: Ideal for use cases where maintaining language purity is essential, particularly in educational and inclusive AI applications.
This project is licensed under the MIT License. See the LICENSE file for details.