
Conversation


@Ceng23333 Ceng23333 commented Nov 20, 2025

#74

End-to-end verification screenshot: [image]

@Ceng23333 Ceng23333 marked this pull request as draft November 20, 2025 06:54
@Ceng23333 Ceng23333 force-pushed the issue/74 branch 2 times, most recently from 3553814 to 294383d on November 25, 2025 07:39
@Ceng23333 Ceng23333 marked this pull request as ready for review November 25, 2025 07:39
# Check if forward method is available
if hasattr(infinilm_model._model, 'forward'):
# Call forward method
infini_logits = infinilm_model._model.forward(
Contributor

infinilm_model is an instance of the LlamaForCausalLM class. LlamaForCausalLM should provide a function that overloads (), so it can be invoked as infinilm_model(infini_input_ids, infini_position_ids, None).

Contributor

Provide a method along these lines in LlamaForCausalLM:

    def __call__(self, infini_input_ids, infini_position_ids, kv_cache=None):
        return self._model.forward(infini_input_ids, infini_position_ids, kv_cache)
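
Then the generation code can call the model directly, as the earlier comment suggests:

    infini_logits = infinilm_model(infini_input_ids, infini_position_ids, None)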

elif not isinstance(config, LlamaConfig):
config = LlamaConfig(**config)

if device is None:
Contributor

Hasn't the Device() class already been removed?

return self._cpp_config


class LlamaModel(infinicore.nn.Module):
Contributor

class LlamaModel is never used; it can be deleted.

@Ceng23333 Ceng23333 closed this Nov 26, 2025
@Ceng23333 Ceng23333 reopened this Nov 26, 2025
@Ceng23333 Ceng23333 force-pushed the issue/74 branch 2 times, most recently from fae0713 to f94518b on November 26, 2025 06:43

eos_token_id = config.eos_token_id
# eos_token_id = config.eos_token_id
eos_token_id = 128001
Contributor

eos_token_id = config.eos_token_id

Contributor

eos_token_id is not a fixed number; it is read from config.json and may differ between models.
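
For illustration, a minimal sketch of reading it from the checkpoint's config.json instead of hardcoding 128001 (hypothetical helper; HF-style configs may store either a single id or a list):

    import json
    import os

    def load_eos_token_id(model_path):
        # hypothetical helper for illustration only
        with open(os.path.join(model_path, "config.json")) as f:
            cfg = json.load(f)
        eos = cfg["eos_token_id"]
        # some checkpoints store a list of eos ids; normalize to a list
        return eos if isinstance(eos, list) else [eos]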

# Process the output
# -------------------------------------------------------------------------- #
token_scores = logits
seq_l = logits.shape[1]
Contributor

Move this logic into C++; the Python side should then already receive the last token's logits.
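
A hedged sketch of what the Python post-processing reduces to under that suggestion (names follow the snippet above; the last-token slicing itself would live in C++):

    # hypothetical: forward() already returns logits for the last position only
    token_scores = logits          # e.g. shape [batch, vocab_size]
    # the seq_l bookkeeping above is no longer needed on the Python side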

return self._cpp_config


class LlamaModel(infinicore.nn.Module):
Contributor

class LlamaModel is defined but never used; this class can be deleted.

@@ -1,15 +1,224 @@
from ....generation.utils import GenerationMixin
import infinicore
from infinicore.device import device as Device
Contributor

Why rename infinicore.device to Device instead of using infinicore.device directly? It reads as if they were two different types.
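
A minimal sketch of the suggested usage, matching the infinicore.device(device_str, 0) call that already appears later in this PR:

    import infinicore

    device_str = "cpu"  # assumed value for illustration; supplied by the caller in the PR
    # use infinicore.device directly instead of aliasing it as Device
    infini_device = infinicore.device(device_str, 0)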


if __name__ == "__main__":
if False:
model_path = "/var/qy_home/zenghua/.cache/modelscope/hub/models/LLM-Research/Llama-3.2-1B-Instruct"
Contributor

This is temporary test code and needs to be removed.


infini_device = infinicore.device(device_str, 0)
infini_dtype = infinicore.bfloat16
infini_dtype = infinicore.float32 if backend == "cpp" else infinicore.bfloat16
Contributor

The cpp backend defaults to float32; does the llama C++ implementation also support fp16 or bf16?

: cache_position(0), max_capacity(0), initialized(false),
// Create empty placeholder tensors (will be replaced on first use)
k_cache(infinicore::Tensor::empty({1, 1, 1}, infinicore::DataType::F32,
infinicore::Device(infinicore::Device::Type::CPU, 0))),
Contributor

Written this way, won't it only be able to run Float32?


// Rotary Position Embeddings (RoPE)
INFINICORE_NN_MODULE(infinicore::nn::RoPE, rotary_emb);

Contributor

This gives every layer its own RoPE object.
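
For reference, the usual alternative (as in HF transformers' Llama) is one rotary embedding owned at the model level, with its (cos, sin) outputs passed down to every decoder layer. A Python-style structural sketch with hypothetical class names, not infinicore's API:

    class Model:
        def __init__(self, config):
            self.rotary_emb = RotaryEmbedding(config)  # single shared instance
            self.layers = [DecoderLayer(config) for _ in range(config.num_hidden_layers)]

        def forward(self, hidden_states, position_ids):
            cos, sin = self.rotary_emb(hidden_states, position_ids)
            for layer in self.layers:
                # every layer reuses the shared rotary embeddings
                hidden_states = layer(hidden_states, cos, sin)
            return hidden_states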

@@ -0,0 +1,4 @@
#include "pybind11_llama.hpp"
Contributor

The file has no content.

} else {
throw std::runtime_error("Invalid KV cache type. Expected LlamaKVCache or None.");
}
}
Contributor

Too much nesting.

// Try to cast to LlamaKVCache shared_ptr
try {
auto cache = item.cast<std::shared_ptr<LlamaKVCache>>();
kv_caches_vec.push_back(cache.get());
Contributor

Too much nesting.

_dec.Fuse(),
]
)
# if "llama" == config.model_type:
Contributor

Restore the check: if "llama" == config.model_type:
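
A hedged sketch of restoring that guard (exact placement follows the diff above; the body is elided):

    if "llama" == config.model_type:
        ...  # keep the llama-specific decoder/fusion setup inside the guard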

@pengcheng888
Contributor

There are 4 commits; they can be squashed into 1.


}

infinicore::Tensor LlamaAttention::forward(const infinicore::Tensor &hidden_states,
Collaborator

This is written far too complexly; in principle it should not be much longer than the Python version.

infinicore::Tensor LlamaAttention::forward(const infinicore::Tensor &hidden_states,
const infinicore::Tensor &position_ids,
void *kv_cache,
const HookRegistry *hook_registry,
Collaborator

Where is the hook actually used?

Author

It was used for debugging; it has been removed from the committed code.

// For batch=1 (common in inference), reshape to [n_q_head, seq_len, head_dim]
// Note: For batch > 1, this would need to be handled differently
// Make contiguous before final view since permute can make tensor non-contiguous
auto q_permuted_cont = q_permuted->contiguous();
Collaborator

Why so many contiguous() calls? What about performance?

return output;
}

infinicore::Tensor LlamaAttention::project_q(const infinicore::Tensor &hidden_states) const {
Collaborator

Why have these one- or two-line functions?

* Stores key and value caches with shape [n_kv_head, capacity, head_dim]
* Similar to DynamicLayer in Python cache_utils.py
*/
struct LlamaKVCache {
Collaborator

The cache should be generic (not llama-specific).
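
For reference, a hedged sketch of the kind of model-agnostic per-layer cache the comment points at, mirroring the DynamicLayer idea from Python cache_utils.py mentioned above (written against torch purely for illustration, not infinicore's API):

    import torch

    class DynamicLayerCache:
        """Generic per-layer KV cache that grows along the sequence dimension."""

        def __init__(self):
            self.k = None  # [n_kv_head, seq_len, head_dim]
            self.v = None

        def update(self, k_new, v_new):
            # append the new keys/values along the sequence dimension
            self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
            self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
            return self.k, self.v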

/**
* @brief Hook registry for managing hooks
*/
class HookRegistry {
Collaborator

What is this actually for?

}

auto scaling_broadcast = scaling_value_tensor->as_strided(attn_weight->shape(), {0, 0, 0});
attn_weight = infinicore::op::mul(attn_weight, scaling_broadcast);
Collaborator

This scale should be passed directly as the gemm's alpha.
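
The same idea illustrated with torch rather than infinicore: torch.baddbmm folds the 1/sqrt(head_dim) scale into the batched gemm via its alpha argument, so no separate broadcast multiply is needed:

    import math
    import torch

    n_head, seq_len, kv_len, head_dim = 8, 4, 16, 64  # illustrative shapes
    q = torch.randn(n_head, seq_len, head_dim)
    k = torch.randn(n_head, kv_len, head_dim)

    scaling = 1.0 / math.sqrt(head_dim)
    # alpha scales the q @ k^T product inside the gemm call itself
    attn_weight = torch.baddbmm(
        torch.zeros(1, dtype=q.dtype), q, k.transpose(-1, -2), beta=0.0, alpha=scaling
    )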

@@ -0,0 +1,184 @@
#pragma once
Collaborator

This could live in infinicore as generic nn::module infrastructure.

@@ -3,7 +3,8 @@
#include <pybind11/pybind11.h>
Collaborator

Name it following pybind11/models/llama.hpp.

* - LlamaDecoderLayer: Single transformer decoder layer
* - LlamaModel: Core transformer model (without LM head)
* - LlamaForCausalLM: Complete model with language modeling head
* - HookRegistry: Hook system for capturing intermediate values
Collaborator

There are still comments that haven't been deleted.
