omni_split: Split commonly used document formats (Markdown, Word, etc.) into chunks for RAG pipelines that support LLMs.

note: We strongly recommend converting all other text formats to Markdown first, because this project focuses on optimizing the splitting of Markdown documents.

usage

install

pip install omni_split
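
After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "omni_split") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "omni_split is not installed")
```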

use case

### == step 1: import package ==

import json
from omni_split import OmniSplit
from omni_split import word_preprocessing_and_return_bytesIO
from omni_split import download_files_to_test_doc

### == step 2: download test_doc file ==

doc_dict = download_files_to_test_doc()
text_doc_file_path = doc_dict["text_test.txt"]
json_list_doc_file_path = doc_dict["json_list_test.json"]
markdown_doc_file_path = doc_dict["markdown_test.md"]
word_doc_file_path = doc_dict["docx_test.docx"]


### == step 3: split into chunks ==

omni_spliter = OmniSplit()

## note: test text split
test_text = True
if test_text:
    with open(text_doc_file_path, "r") as f:
        text_content = f.read()
    res = omni_spliter.text_chunk_func(text_content, txt_chunk_size=1000)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

## note: test markdown json split
test_markdown = True
if test_markdown:
    with open(json_list_doc_file_path, "r") as f:
        md_content_json = json.load(f)
    res = omni_spliter.markdown_json_chunk_func(md_content_json)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

    res = omni_spliter.markdown_json_chunk_func(md_content_json, clear_model=True)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

## note: test markdown split
test_markdown = True
if test_markdown:
    with open(markdown_doc_file_path, "r") as f:
        md_content = f.read()
    res = omni_spliter.markdown_chunk_func(md_content)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

    res = omni_spliter.markdown_chunk_func(md_content, clear_model=True)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)


## note: test word split
test_document = True
if test_document:
    # Preprocess the .docx file and load it into an in-memory BytesIO object.
    new_doc_io = word_preprocessing_and_return_bytesIO(word_doc_file_path)
    res = omni_spliter.document_chunk_func(new_doc_io, txt_chunk_size=1000, clear_model=False)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

    res = omni_spliter.document_chunk_func(new_doc_io, txt_chunk_size=1000, clear_model=False, save_local_images_dir="./images")
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)

    res = omni_spliter.document_chunk_func(new_doc_io, txt_chunk_size=1000, clear_model=True)
    for item in res:
        print(item)
        print("------------")
    print("=" * 10)
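
The same print loop follows every call in the use case above. One possible way to shorten it is a small helper like the one below. The name `print_chunks` is hypothetical and not part of omni_split; the sketch only assumes that each `*_chunk_func` call returns something iterable, as the loops above already do.

```python
from typing import Iterable


def print_chunks(chunks: Iterable[object], separator: str = "------------") -> int:
    """Print every chunk followed by a separator line and return the chunk count."""
    count = 0
    for chunk in chunks:
        print(chunk)
        print(separator)
        count += 1
    # Closing rule, matching the one printed after each loop in the use case.
    print("=" * 10)
    return count


if __name__ == "__main__":
    # Stand-in data; with omni_split this would be the result of a chunk call.
    print_chunks(["first chunk", "second chunk"])
```

With the helper in place, each test branch reduces to a single line such as `print_chunks(omni_spliter.markdown_chunk_func(md_content))`.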

Reminder of dependency:

To automatically convert binary metafiles (e.g. x-wmf) embedded in Word documents to PNG, you need to install ImageMagick on Linux. For installation instructions, see: https://docs.wand-py.org/en/latest/guide/install.html
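
Before processing Word files that contain metafiles, a quick probe can tell whether the dependency is in place. This is a minimal sketch, assuming the conversion goes through the Wand binding (which the link above suggests but this README does not state). The function name `imagemagick_status` is hypothetical.

```python
import shutil
from typing import Dict, Optional, Union


def imagemagick_status() -> Dict[str, Union[Optional[str], bool]]:
    """Report whether an ImageMagick CLI is on PATH and whether Wand imports."""
    # ImageMagick 7 ships `magick`; ImageMagick 6 ships `convert`.
    cli_path = shutil.which("magick") or shutil.which("convert")
    try:
        # Wand fails at import time when it cannot locate the MagickWand library.
        import wand.image  # noqa: F401
        wand_importable = True
    except Exception:
        # Broad on purpose: this is only a probe, not error handling.
        wand_importable = False
    return {"cli_path": cli_path, "wand_importable": wand_importable}


if __name__ == "__main__":
    print(imagemagick_status())
```

If `wand_importable` is false, the installation guide linked above is the place to start.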

This project is inspired by:
