OceanGPT (沧渊): Ocean Foundation Model

Project | Paper | Models | Web | Overview | Quickstart | Citation

License: MIT

Table of Contents

  • 🔔News
  • Models
  • Instruction Data
  • 🌟Overview
  • ⏩Quickstart
  • 🤗Chat with Our Demo on Gradio
  • 📌Inference
  • 🌻Acknowledgement
  • Limitations
  • 🚩Citation

🔔News

  • 2025-04-20, we release OceanGPT-o-7B and OceanGPT-coder-7B.
  • 2025-02-01, we collect sonar data for model training and test OceanGPT-coder.
  • 2024-12-01, we collect more publicly available sonar data and scientific images for model training.
  • 2024-08-01, we launch OceanGPT-o, a bilingual (Chinese-English) multimodal large language model, together with sonar and ocean science image data collection and training.
  • 2024-07-04, we release OceanGPT-basic-14B/2B and the updated OceanGPT-basic-7B (v0.2).
  • 2024-06-04, OceanGPT is accepted by ACL 2024. 🎉🎉
  • 2023-10-04, we release the paper "OceanGPT: A Large Language Model for Ocean Science Tasks" and OceanGPT-basic-7B (v0.1) based on LLaMA2.
  • 2023-05-01, we launch the OceanGPT (沧渊) project.

Models

| Model Name | ModelScope | HuggingFace |
| --- | --- | --- |
| OceanGPT-o-7B (based on Qwen, recommended) | 7B | 7B |
| OceanGPT-coder-7B (based on Qwen, recommended) | To be released | To be released |
| OceanGPT-basic-8B (based on Qwen, recommended) | To be released | To be released |
| OceanGPT-basic-14B (based on Qwen, legacy) | 14B | 14B |
| OceanGPT-basic-7B (based on Qwen, legacy) | 7B | 7B |
| OceanGPT-basic-2B (based on MiniCPM, legacy) | 2B | 2B |

  • Please note that the ocean-domain Q&A in the online demo system (including the video) uses knowledge-base augmentation and a "general-specialized integration" approach, so its output differs from that of the open-source models!
  • Due to limited computing resources, OceanGPT-o is currently applicable only to natural language interpretation and generation for certain types of sonar images and marine science images. A GPU with at least 24 GB of memory is recommended.

Instruction Data

| Data Name | HuggingFace | ModelScope |
| --- | --- | --- |
| OceanInstruct | 50K | 50K |
| OceanInstruct-o | 50K | 50K |

  • Some of the instruction data are synthetic; we apologize for any inaccuracies that may exist!

🌟Overview

This is the OceanGPT (沧渊) project, which aims to build an ocean foundation model.

  • Disclaimer: This project is purely an academic exploration, not a product. Please be aware that, due to the inherent limitations of large language models, issues such as hallucination may occur.

⏩Quickstart

conda create -n py3.11 python=3.11
conda activate py3.11
pip install -r requirements.txt

Download the model

Download from HuggingFace

git lfs install
git clone https://huggingface.co/zjunlp/OceanGPT-14B-v0.1

or

huggingface-cli download --resume-download zjunlp/OceanGPT-14B-v0.1 --local-dir OceanGPT-14B-v0.1 --local-dir-use-symlinks False
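
Alternatively, here is a minimal Python sketch using the huggingface_hub library (assuming huggingface_hub is installed; it downloads the same repo as the CLI command above):

# Programmatic download via huggingface_hub (a sketch, equivalent to the CLI above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="zjunlp/OceanGPT-14B-v0.1",  # same repo id as the git/CLI examples
    local_dir="OceanGPT-14B-v0.1",       # local target folder
)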

Download from WiseModel

git lfs install
git clone https://www.wisemodel.cn/zjunlp/OceanGPT-14B-v0.1.git

Download from ModelScope

git lfs install
git clone https://www.modelscope.cn/ZJUNLP/OceanGPT-14B-v0.1.git
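
Likewise, a minimal Python sketch using the modelscope library (assuming modelscope is installed; the returned path points at the local cache):

# Programmatic download via modelscope (a sketch, equivalent to the git clone above).
from modelscope import snapshot_download

model_dir = snapshot_download('ZJUNLP/OceanGPT-14B-v0.1')
print(model_dir)  # local path of the downloaded model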

Inference

Inference with HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" # the device to load the model onto
path = 'YOUR-MODEL-PATH'

model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(path)

prompt = "Which is the largest ocean in the world?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Inference with vLLM

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

path = 'YOUR-MODEL-PATH'

tokenizer = AutoTokenizer.from_pretrained(path)

prompt = "Which is the largest ocean in the world?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

sampling_params = SamplingParams(temperature=0.8, top_k=50)
llm = LLM(model=path)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)

🤗Chat with Our Demo on Gradio

Online Demo

We provide users with an interactive Gradio demo accessible online.

Local WebUI Demo

You can easily deploy the interactive interface locally using the code we provide.

python app.py

Open http://localhost:7860 in your browser and enjoy interacting with OceanGPT.
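
For readers who want to see roughly what such a WebUI involves, below is a minimal sketch. It is not the repo's app.py; it assumes gradio is installed, Gradio's tuple-style chat history, and the HuggingFace model loaded as in the Quickstart:

# Minimal Gradio chat sketch (hypothetical; the repo's app.py is the supported entry point).
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "YOUR-MODEL-PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="auto"
)

def chat(message, history):
    # Rebuild the conversation in chat-template format from Gradio's (user, assistant) pairs.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(inputs.input_ids, max_new_tokens=512)
    # Strip the prompt tokens and decode only the newly generated ones.
    return tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

gr.ChatInterface(chat).launch(server_port=7860)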

📌Inference

Efficient Inference with llama.cpp, ollama, vLLM

llama.cpp now officially supports models based on Qwen2.5; the HuggingFace checkpoints can be converted to GGUF as follows.

Download the OceanGPT PyTorch model from HuggingFace into an "OceanGPT" folder.

Clone and build llama.cpp:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make llama-cli

Then convert the PyTorch model to a GGUF file:

python convert-hf-to-gguf.py OceanGPT --outfile OceanGPT.gguf

Running the model:

./llama-cli -m OceanGPT.gguf \
    -co -cnv -p "Your prompt" \
    -fa -ngl 80 -n 512

ollama now officially supports models based on Qwen2.5.

Create a file named Modelfile:

FROM ./OceanGPT.gguf
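# Note: [INST] is a Llama-style template; Qwen-based models typically use the ChatML format, so you may need to adjust TEMPLATE if responses look malformed.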
TEMPLATE "[INST] {{ .Prompt }} [/INST]"

Create the model in Ollama:

ollama create example -f Modelfile

Running the model:

ollama run example "What is your favourite condiment?"

vLLM now officially supports models based on Qwen2.5-VL and Qwen2.5.
  1. Install vLLM (>= 0.7.3):

pip install vllm

  2. Run the example:
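
A minimal sketch, assuming the model is served with vllm serve YOUR-MODEL-PATH (an OpenAI-compatible server, listening on port 8000 by default); the endpoint, api_key, and model name below are assumptions based on vLLM's server defaults, not values from this repo:

# Query a model served by `vllm serve YOUR-MODEL-PATH` (a sketch; endpoint and
# model name are assumptions based on vLLM's OpenAI-compatible server defaults).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR-MODEL-PATH",  # must match the name the server was launched with
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Which is the largest ocean in the world?"},
    ],
    temperature=0.8,
)
print(response.choices[0].message.content)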

🌻Acknowledgement

OceanGPT (沧渊) is trained on top of open-source large language models, including Qwen, MiniCPM, and LLaMA.

OceanGPT is trained with open-source data and tools, including MOOS, UATD, the Forward-looking Sonar Detection Dataset, NKSID, SeabedObjects-KLSG, and Marine Debris.

Thanks for their great contributions!

Limitations

  • The model may have hallucination issues.

  • Due to limited computational resources, OceanGPT-o currently only supports natural language generation for certain types of sonar images and ocean science images. OceanGPT-coder currently only supports MOOS code generation.

  • We did not optimize the model's identity, so it may generate identity information similar to that of the Qwen/MiniCPM/LLaMA/GPT series models.

  • The model's output is sensitive to prompt tokens, which may lead to inconsistent results across multiple attempts.

🚩Citation

Please cite the following paper if you use OceanGPT in your work.

@article{bi2024oceangpt,
  title={OceanGPT: A Large Language Model for Ocean Science Tasks},
  author={Bi, Zhen and Zhang, Ningyu and Xue, Yida and Ou, Yixin and Ji, Daxiong and Zheng, Guozhou and Chen, Huajun},
  journal={arXiv preprint arXiv:2310.02031},
  year={2024}
}