Description
Your core inference logic is this function:
from typing import List
import subprocess

from fastapi import HTTPException


def run_command(command: List[str]) -> str:
    """Run a system command and capture its output."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # capture_output/text capture stdout and stderr as strings.
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        raise HTTPException(status_code=500, detail=f"Error occurred while running command: {e}")
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Unexpected error: {e}")
This is the same way run_inference.py executes system commands; your project simply wraps run_command. If you expose this as an API server, consider keeping the model resident in memory instead of loading it from scratch on every request, which is slow and inefficient. For details, see how llama.cpp handles this and use llama_cpp.server to serve the model.
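As a rough sketch of the resident-memory approach (assuming the llama-cpp-python bindings; the model path, request schema, and /infer endpoint below are placeholders, not part of your project), you can load the model once at startup and reuse it for every request instead of spawning a subprocess per call:

```python
# Minimal sketch: keep the model resident in memory rather than invoking a
# command-line binary per request. Assumes llama-cpp-python is installed;
# MODEL_PATH and the /infer route are placeholders.
from fastapi import FastAPI, HTTPException
from llama_cpp import Llama
from pydantic import BaseModel

MODEL_PATH = "models/your-model.gguf"  # placeholder path

app = FastAPI()
llm = Llama(model_path=MODEL_PATH)  # loaded once at startup, reused across requests


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/infer")
def infer(req: InferenceRequest) -> dict:
    try:
        # Reuse the already-loaded model; no per-request process or model load.
        result = llm(req.prompt, max_tokens=req.max_tokens)
        return {"text": result["choices"][0]["text"]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Inference failed: {e}")
```

Alternatively, llama-cpp-python ships a ready-made OpenAI-compatible server that can be started with `python -m llama_cpp.server --model <path>`, so you may not need to write the endpoint yourself.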