
Keep the model resident in memory rather than re-initializing it on every system command, which is inefficient #1

Closed
@CementZhang

Description


Your core inference logic function:

from typing import List
import subprocess

from fastapi import HTTPException


def run_command(command: List[str]) -> str:
    """Run a system command and capture its output."""
    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        raise HTTPException(status_code=500, detail=f"Error occurred while running command: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Unexpected error: {e}")

This is the same way run_inference.py executes system commands; your project simply wraps it in run_command. Since you provide API interfaces as a server, consider keeping the model resident in memory instead of loading it from scratch on every call, which is time-consuming and inefficient. For details, you can refer to the llama.cpp approach and use llama_cpp.server to provide the service.
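As a rough illustration of the resident-memory mode, here is a minimal sketch that loads a model once at server startup and reuses it for every request. It assumes llama-cpp-python (the Python binding for llama.cpp) is installed; MODEL_PATH and the /generate endpoint are placeholder names for illustration, not part of this project.

from fastapi import FastAPI, HTTPException
from llama_cpp import Llama
from pydantic import BaseModel

MODEL_PATH = "./models/model.gguf"  # hypothetical path; point this at your own GGUF file

app = FastAPI()

# Load the model exactly once, when the server process starts.
# Every request below reuses this instance, so the expensive
# model initialization is not repeated per call.
llm = Llama(model_path=MODEL_PATH, n_ctx=2048)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    try:
        out = llm(req.prompt, max_tokens=req.max_tokens)
        return {"text": out["choices"][0]["text"]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Inference failed: {e}")

Alternatively, if the model is already in GGUF format, you can skip writing this yourself and run the bundled server, e.g. python -m llama_cpp.server --model ./models/model.gguf, which keeps the model in memory and exposes an OpenAI-compatible HTTP API.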
