Description
Your core inference logic is this function:
from typing import List
import subprocess

from fastapi import HTTPException


def run_command(command: List[str]) -> str:
    """Run a system command and capture its output."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # capture_output/text capture stdout and stderr as strings.
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        raise HTTPException(status_code=500, detail=f"Error occurred while running command: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Unexpected error: {e}")
This is the same way run_inference.py executes system commands; your project simply wraps run_command. If you expose this as an API server, consider keeping the model resident in memory instead of loading it from scratch on every request, which is slow and inefficient. For details, see how llama.cpp handles this and use llama_cpp.server to serve the model.
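As a rough sketch of the resident-memory approach (assuming the llama-cpp-python bindings; the model path, request schema, and /infer endpoint below are placeholders, not part of your project), you can load the model once at startup and reuse it for every request instead of spawning a subprocess per call:

```python
# Minimal sketch: keep the model resident in memory rather than invoking a
# command-line binary per request. Assumes llama-cpp-python is installed;
# MODEL_PATH and the /infer route are placeholders.
from fastapi import FastAPI, HTTPException
from llama_cpp import Llama
from pydantic import BaseModel

MODEL_PATH = "models/your-model.gguf"  # placeholder path

app = FastAPI()
llm = Llama(model_path=MODEL_PATH)  # loaded once at startup, reused across requests


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/infer")
def infer(req: InferenceRequest) -> dict:
    try:
        # Reuse the already-loaded model; no per-request process or model load.
        result = llm(req.prompt, max_tokens=req.max_tokens)
        return {"text": result["choices"][0]["text"]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Inference failed: {e}")
```

Alternatively, llama-cpp-python ships a ready-made OpenAI-compatible server that can be started with `python -m llama_cpp.server --model <path>`, so you may not need to write the endpoint yourself.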