Model deployment framework: Xinference

Installing Xinference

https://inference.readthedocs.io/zh-cn/latest/index.html

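Xinference is an LLM deployment framework, similar to Ollama, for quickly running LLMs and embedding models.
Before installing, make sure you are in the project directory with the virtual environment activated:
source venv/bin/activate
Basic install:
uv pip install xinference
Or install it together with the vLLM inference engine:
uv pip install "xinference[vllm]" -i https://mirrors.cloud.tencent.com/pypi/simple
Install the embedding model support library:
uv pip install sentence-transformers -i https://mirrors.cloud.tencent.com/pypi/simple
For vision support, if needed, install qwen-vl-utils:
uv pip install qwen-vl-utils -i https://mirrors.cloud.tencent.com/pypi/simple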

Start the service:
xinference-local --host 0.0.0.0 --port 9997
Once it is running, open http://127.0.0.1:9997 on the same machine.
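
A quick way to confirm the service is up without opening the browser is to query it over HTTP. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` route on the host and port used above; the function names are made up for illustration.

```python
import json
import urllib.request


def models_url(host: str = "127.0.0.1", port: int = 9997) -> str:
    """Build the URL of the model listing route (assumed to be /v1/models)."""
    return f"http://{host}:{port}/v1/models"


def list_running_models(host: str = "127.0.0.1", port: int = 9997, timeout: float = 5.0):
    """Fetch and decode the JSON document returned by the model listing route.

    Raises OSError (urllib.error.URLError is a subclass) if the server is unreachable.
    """
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `list_running_models()` while the service is running should return the decoded JSON; if nothing is listening on the port, the call raises an `OSError`, which is itself a useful signal that the service did not start.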

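Click the + button to register a local model by entering its path.
I added a Language (LLM) model named deepseek15b and an embedding model named bge.
After adding a model with the options you chose, click the small rocket icon to launch it.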

Launching a model from the command line (for example, to start it automatically at boot)
View the launch options:
xinference launch --help

Launch with 8-bit quantization:
xinference launch -mp /home/ai/DeepSeek-R1-Distill-Qwen-14B -u deepseek14b -n deepseek14b -s 14 -f pytorch -en Transformers -t LLM --n-gpu 4 --gpu-idx 0,1,2,3 --quantization 8bit

Example: launch a local LLM:
xinference launch -mp /home/deepseek/DeepSeek-R1-Distill-Qwen-1.5B -u deepseek15b -n deepseek15b -s 1_5 -f pytorch -en Transformers -t LLM --n-gpu 4 --gpu-idx 0,1,2,3

Launch an embedding model:
xinference launch -mp /home/deepseek/bge-large-zh-v1.5 -u bge -n bge -t embedding --n-gpu auto
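Parameters used above:
-mp        model path
-n         model name
-u         model UID
-s         model size, in billions of parameters (1_5 means 1.5)
-f         model format
-en        inference engine
-t         model type
--n-gpu    number of GPUs to use
--gpu-idx  which GPUs to use, by GPU index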
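
When the same launch has to be repeated (for instance from a startup script), it can help to assemble the command from named options instead of retyping the flags. This is a minimal sketch that uses only the flags shown above; the `FLAG_FOR` table and `build_launch_command` helper are hypothetical names, not part of Xinference.

```python
import shlex
from typing import List

# Short flags exactly as they appear in the launch commands in these notes.
FLAG_FOR = {
    "model_path": "-mp",
    "model_uid": "-u",
    "model_name": "-n",
    "size": "-s",
    "model_format": "-f",
    "engine": "-en",
    "model_type": "-t",
    "n_gpu": "--n-gpu",
    "gpu_idx": "--gpu-idx",
    "quantization": "--quantization",
}


def build_launch_command(**options) -> List[str]:
    """Turn named options into an argv list for `xinference launch`."""
    argv = ["xinference", "launch"]
    for key, value in options.items():
        if value is None:
            continue  # skip options that were left unset
        argv += [FLAG_FOR[key], str(value)]
    return argv


cmd = build_launch_command(
    model_path="/home/deepseek/bge-large-zh-v1.5",
    model_uid="bge",
    model_name="bge",
    model_type="embedding",
    n_gpu="auto",
)
print(shlex.join(cmd))
```

The printed line matches the embedding launch command above, and the argv list can be passed to `subprocess.run(cmd, check=True)` without going through a shell.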

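When serving a DeepSeek R1 model, be sure to use the DeepSeek V3 chat template so that replies contain the complete think tags (both <think> and </think>); only then does Dify recognize that part of the reply as reasoning. The V3 template is reproduced in this section with explanatory comments. The comments and blank lines are annotations for reading and would add stray whitespace to the rendered prompt, so strip them if you paste the template into a model configuration.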
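
To make "complete think tags" concrete, here is a small sketch of a check that treats text as reasoning only when both the opening and the closing tag are present. It is an illustration only and does not reproduce Dify's actual parsing; `split_reasoning` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches one <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None unless both tags are present."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return None, text.strip()
    answer = text[:match.start()] + text[match.end():]
    return match.group(1).strip(), answer.strip()


complete = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
missing_open = "2 + 2 is 4.\n</think>\n\nThe answer is 4."

print(split_reasoning(complete))      # reasoning and answer are separated
print(split_reasoning(missing_open))  # no reasoning found, everything is "answer"
```

With the second input the reasoning leaks into the visible answer, which is the symptom to look for when the wrong chat template is in use.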

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{# If `add_generation_prompt` is not defined, default it to `false`. #}

{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}
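{# Create a namespace `ns` holding state variables, initialized as follows:
   - `is_first`: whether a tool call has already been emitted; starts as `false`.
   - `is_tool`: whether tool output is currently being processed; starts as `false`.
   - `is_output_first`: whether the next tool output is the first one; starts as `true`.
   - `system_prompt`: the accumulated system message content; starts as an empty string.
   - `is_first_sp`: whether the next system message is the first one; starts as `true`. #}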

{%- for message in messages %}
{# Loop over `messages`, the list of message objects in a (possibly multi-turn) conversation. #}

{%- if message['role'] == 'system' %}
{# The current message has role `system`. #}

{%- if ns.is_first_sp %}
{# This is the first system message. #}

{% set ns.system_prompt = ns.system_prompt + message['content'] %}
{# Append the system message content to `system_prompt`. #}

{% set ns.is_first_sp = false %}
{# Set `is_first_sp` to `false`: the first system message has been handled. #}

{%- else %}
{# This is not the first system message. #}

{% set ns.system_prompt = ns.system_prompt + '\n' + message['content'] %}
{# Append a newline and then the current system message content to `system_prompt`. #}

{%- endif %}
{%- endif %}
{%- endfor %}
{# End of the first pass over `messages`. #}

{{bos_token}}{{ns.system_prompt}}
{# Emit the beginning-of-sequence token (`bos_token`) followed by the merged `system_prompt`. #}

{%- for message in messages %}
{# Second pass over `messages`: handle user, assistant, and tool messages. #}

{%- if message['role'] == 'user' %}
{# The current message has role `user`, i.e. user input. #}

{%- set ns.is_tool = false -%}
{# Set `is_tool` to `false`: the current message is not tool output. #}

{{'<｜User｜>' + message['content']}}
{# Emit the `<｜User｜>` tag followed by the user's message content. #}

{%- endif %}

{%- if message['role'] == 'assistant' and message['content'] is none %}
{# The current message has role `assistant` and its content is none (typically a tool call). #}

{%- set ns.is_tool = false -%}
{# Set `is_tool` to `false`: what is being processed is not tool output. #}

{%- for tool in message['tool_calls'] %}
{# Loop over the tool calls (`tool_calls`) in the assistant message. #}

{%- if not ns.is_first %}
{# No tool call has been emitted yet (`is_first` is still `false`), so this is the first one. #}

{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}
{# Emit `<｜Assistant｜>` and the tool-calls opening markers, then the tool type, function name, and arguments in a JSON code block. #}

{% set ns.is_first = true %}
{# Set `is_first` to `true`: the first tool call has been emitted. #}

{%- else %}
{# A tool call has already been emitted, so this is a subsequent one. #}

{{'<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}
{# Emit only this tool call, without repeating the opening markers. #}

{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}
{# Emit the tool-calls end marker and the end-of-sentence token. #}

{%- endif %}
{%- endfor %}
{%- endif %}

{%- if message['role'] == 'assistant' and message['content'] is not none %}
{# The current message has role `assistant` and its content is not none. #}

{%- if ns.is_tool %}
{# Tool output is currently being processed. #}

{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}
{# Emit the tool-outputs end marker, then the assistant's content and the end-of-sentence token. #}

{% set ns.is_tool = false %}
{# Set `is_tool` to `false`: tool output handling is finished. #}

{%- else %}
{# Otherwise this is an ordinary assistant reply. #}

{{'<｜Assistant｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}
{# Emit the `<｜Assistant｜>` tag, the reply content, and the end-of-sentence token. #}

{%- endif %}
{%- endif %}

{%- if message['role'] == 'tool' %}
{# The current message has role `tool`, i.e. tool output. #}

{%- set ns.is_tool = true -%}
{# Set `is_tool` to `true`: tool output is now being processed. #}

{%- if ns.is_output_first %}
{# This is the first tool output. #}

{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}
{# Emit the tool-outputs opening marker, then the tool output wrapped in its begin and end markers. #}

{% set ns.is_output_first = false %}
{# Set `is_output_first` to `false`: a tool output has already been emitted. #}

{%- else %}
{# This is not the first tool output. #}

{{'<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}
{# Emit the tool output wrapped in its begin and end markers. #}

{%- endif %}
{%- endif %}
{%- endfor -%}

{% if ns.is_tool %}
{# Tool output is still open after the last message, so it must be closed. #}

{{'<｜tool▁outputs▁end｜>'}}
{# Emit the tool-outputs end marker. #}

{% endif %}

{% if add_generation_prompt and not ns.is_tool %}
{# A generation prompt was requested and the conversation does not end with tool output. #}

{{'<｜Assistant｜>'}}
{# Emit the `<｜Assistant｜>` tag so the model continues as the assistant. #}

{% endif %}
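
To see what the template produces for an ordinary chat, the following sketch mirrors its system, user, and assistant branches in plain Python. It is a simplification under stated assumptions: the tool-call branches are left out, whitespace between template tags is ignored, and the default `bos_token` value is assumed to be the DeepSeek beginning-of-sentence token. `build_prompt` is a hypothetical name.

```python
from typing import Dict, List

BOS = "<｜begin▁of▁sentence｜>"  # assumed value of bos_token


def build_prompt(messages: List[Dict[str, str]],
                 bos_token: str = BOS,
                 add_generation_prompt: bool = False) -> str:
    """Mirror the template's two passes for system, user, and assistant messages."""
    # First pass: merge all system messages, separated by newlines.
    system_prompt = "\n".join(m["content"] for m in messages if m["role"] == "system")
    parts = [bos_token, system_prompt]
    # Second pass: wrap user and assistant turns in their tags.
    for m in messages:
        if m["role"] == "user":
            parts.append("<｜User｜>" + m["content"])
        elif m["role"] == "assistant" and m.get("content") is not None:
            parts.append("<｜Assistant｜>" + m["content"] + "<｜end▁of▁sentence｜>")
    # The trailing tag asks the model to answer as the assistant.
    if add_generation_prompt:
        parts.append("<｜Assistant｜>")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
]
print(build_prompt(chat, add_generation_prompt=True))
```

The prompt ends with a bare `<｜Assistant｜>` tag, so any `<think>` and `</think>` tags in the reply have to come from the model's own output.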

# Restart the service
sudo systemctl restart xinference
# Check the service status
systemctl status xinference
# Troubleshoot: read the service log, newest entries first
sudo journalctl -u xinference.service -r