{"id":35924,"date":"2025-05-07T09:09:38","date_gmt":"2025-05-07T01:09:38","guid":{"rendered":"https:\/\/www.wsisp.com\/helps\/35924.html"},"modified":"2025-05-07T09:09:38","modified_gmt":"2025-05-07T01:09:38","slug":"%e8%af%a6%e8%a7%a3%e5%a6%82%e4%bd%95%e5%a4%8d%e7%8e%b0llama-4%e4%bb%8e%e9%9b%b6%e5%bc%80%e5%a7%8b%e5%88%a9%e7%94%a8python%e6%9e%84%e5%bb%ba","status":"publish","type":"post","link":"https:\/\/www.wsisp.com\/helps\/35924.html","title":{"rendered":"Reproducing LLaMA 4 in Detail: Building It from Scratch in Python"},"content":{"rendered":"<p>My blog homepage: https:\/\/lizheng.blog.csdn.net<\/p>\n<p>LLaMA 4 has drawn a great deal of criticism since its release, but it is a new step forward after Mistral that shows the strengths of MoE (Mixture-of-Experts) models.<\/p>\n<p>In this post we build the LLaMA 4 MoE architecture from scratch to understand how it is actually put together.<\/p>\n<p>Below is the output of our 2.2-million-parameter LLaMA MoE, trained on a GPU for 3000 epochs on a tiny English dataset:<\/p>\n<p>Input: Alice<\/p>\n<p>Output: Alice &#039;without pictures or conversation?&#039;<br \/>\nSo she was considering in her own mind (as well as she could, for the<br \/>\nhot day made her feel very sleepy and stupid), whether the pleasure<br \/>\nof making a daisy-chain wo &#8230;<\/p>\n<p>Rather than copying the code by hand, you can clone the GitHub repository directly.<\/p>\n<h4>LLaMA 4 MoE Architecture Overview<\/h4>\n<p>First, let&#039;s understand the LLaMA 4 architecture at an intermediate technical level, then follow an example, &#034;the cat sat&#034;, through the architecture to make the picture clearer.<\/p>\n<p>Imagine you have a very demanding job to get done. Rather than hiring one person who knows a little about everything, you hire a team in which each member is an expert in one specific area (an electrician, a plumber, a painter). You also hire a manager who looks at the task at hand and assigns it to the best-suited expert.<\/p>\n<p>MoE in AI models works a bit like this. Instead of having one giant neural network try to learn everything, an MoE 
layer has:<\/p>\n<ul>\n<li>A set of &#034;experts&#034;: these are smaller, specialized neural networks (usually simple feed-forward networks, or MLPs). Each expert may be good at handling certain kinds of information or patterns.<\/li>\n<li>A &#034;router&#034; (the manager): this is another small network. Its job is to look at the input (such as a word or a piece of a word) and decide which experts are best suited to process it.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010936-681ab2d0d6731.png\" alt=\"Architecture diagram\" \/><\/p>\n<p>LLaMA 4 overview<\/p>\n<p>Suppose our model is processing the sentence &#034;The cat sat.&#034;<\/p>\n<ul>\n<li>Tokenization: first, we break the sentence into pieces (tokens): &#034;The&#034;, &#034;cat&#034;, &#034;sat&#034;.<\/li>\n<li>The router receives a token: the MoE layer receives the token cat (represented as a list of numbers, i.e. an embedding vector). The router looks at this cat vector.<\/li>\n<li>The router chooses: suppose we have 4 experts (E1, E2, E3, E4). The router decides which experts are best suited to process cat. Suppose it decides that E2 (perhaps good at nouns?) and E4 (perhaps good at animal concepts?) are the best choices. It assigns these choices scores or &#034;weights&#034; (for example, 70% for E2 and 30% for E4).<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010936-681ab2d0e7aee.png\" alt=\"Processing flow\" \/><\/p>\n<p>The cat vector is sent only to Expert 2 and Expert 4. Experts 1 and 3 do not process this token, which saves computation! E2 processes cat and produces its result (Output_E2). E4 processes cat and produces its result (Output_E4).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d101455.png\" alt=\"Selected experts\" \/><\/p>\n<p>Now we combine the results of the selected experts using the router weights: Final_Output &#061; (0.7 * Output_E2) &#043; (0.3 * Output_E4).<\/p>\n<p>This Final_Output is what the MoE layer passes on for the token cat. The process repeats for every token in the sequence, and different tokens may be routed to different experts.<\/p>\n<p>So when our model processes text like &#034;The cat sat.&#034;, the whole flow looks like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d1130d6.png\" alt=\"Detailed architecture\" \/><\/p>\n<p>LLaMA 4 
detailed architecture<\/p>\n<p>The input text goes into the tokenizer, which produces numeric token IDs. The embedding layer then turns those IDs into meaningful vectors (embeddings); positional information is applied later, inside attention, via RoPE.<\/p>\n<p>These vectors pass through several Transformer blocks. Each block contains:<\/p>\n<ul>\n<li>Self-attention (tokens look at one another, enhanced by RoPE).<\/li>\n<li>An MoE layer (the router sends tokens to specific experts).<\/li>\n<li>Normalization (RMSNorm) and residual connections, which help learning.<\/li>\n<\/ul>\n<p>The output of the last block goes into the final layer, which produces a score (logit) for every possible next token in the vocabulary.<\/p>\n<p>We convert the scores into probabilities and predict the next token.<\/p>\n<p>Now that we have a first picture of the role MoE plays in the overall architecture, let&#039;s dive into the code and build these components step by step! We start by setting up the coding environment.<\/p>\n<h4>Setting the Stage<\/h4>\n<p>Before writing any model code we need to import the modules we will use, so let&#039;s start there.<\/p>\n<p><span class=\"token comment\"># Import the necessary libraries<\/span><br 
\/>\n<span class=\"token keyword\">import<\/span> torch<br \/>\n<span class=\"token keyword\">import<\/span> torch<span class=\"token punctuation\">.<\/span>nn <span class=\"token keyword\">as<\/span> nn<br \/>\n<span class=\"token keyword\">from<\/span> torch<span class=\"token punctuation\">.<\/span>nn <span class=\"token keyword\">import<\/span> functional <span class=\"token keyword\">as<\/span> F<br \/>\n<span class=\"token keyword\">import<\/span> torch<span class=\"token punctuation\">.<\/span>optim <span class=\"token keyword\">as<\/span> optim<br \/>\n<span class=\"token keyword\">import<\/span> math<br \/>\n<span class=\"token keyword\">import<\/span> os<br \/>\n<span class=\"token keyword\">import<\/span> collections <span class=\"token comment\"># For the extended BPE-like processing<\/span><br \/>\n<span class=\"token keyword\">import<\/span> re          <span class=\"token comment\"># For the initial splitting<\/span><\/p>\n<p><span class=\"token comment\"># --- Device configuration ---<\/span><br \/>\n<span class=\"token comment\"># Theory: pick the device (&#039;cuda&#039; if a GPU is available, otherwise CPU) so tensor operations run efficiently on the available hardware.<\/span><br \/>\ndevice <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#039;cuda&#039;<\/span> <span class=\"token keyword\">if<\/span> torch<span class=\"token punctuation\">.<\/span>cuda<span class=\"token punctuation\">.<\/span>is_available<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">else<\/span> <span class=\"token string\">&#039;cpu&#039;<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;PyTorch version: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>torch<span class=\"token punctuation\">.<\/span>__version__<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Using device: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>device<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Libraries imported and device configured.&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nPyTorch version: <span class=\"token number\">2.6<\/span><span class=\"token number\">.0<\/span><span class=\"token operator\">&#043;<\/span>cu124<br \/>\nUsing device: cuda<br \/>\nLibraries imported and device configured.<\/p>\n<p>The output confirms that the libraries were imported successfully. I will train the model on a Colab T4 GPU. If you want to train on a cheaper GPU, you can reduce the number of training epochs.<\/p>\n<h4>Defining the Training Corpus<\/h4>\n<p>We need some text data to train our language model. Real models like LLaMA 4 are trained on trillions of tokens!<\/p>\n<p>For our small example, which is only meant to show how the code works, we will use a short passage from Lewis Carroll&#039;s Alice&#039;s Adventures in Wonderland. Its small size makes it easy to follow what is happening.<\/p>\n<p><span class=\"token comment\"># Define the raw text corpus used for training<\/span><br \/>\ncorpus_raw <span class=\"token operator\">&#061;<\/span> <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\nAlice was beginning to get very tired of sitting by her sister on the<br \/>\nbank, and of having nothing to do: once or twice she had peeped into the<br \/>\nbook her sister was reading, but it had no pictures or conversations in<br \/>\nit, &#039;and what is the use of a book,&#039; thought Alice &#039;without pictures or<br \/>\nconversation?&#039;<br \/>\nSo she was considering in her own mind (as well as she could, for the<br \/>\nhot day made her feel very sleepy and stupid), whether the pleasure<br \/>\nof making a daisy-chain would be worth the trouble of getting up and<br \/>\npicking the daisies, when suddenly a White Rabbit with pink eyes ran<br \/>\nclose by her.<br \/>\n&#034;&#034;&#034;<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Training corpus defined (length: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token builtin\">len<\/span><span class=\"token 
punctuation\">(<\/span>corpus_raw<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> characters).&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nTraining corpus defined (length: <span class=\"token number\">593<\/span> characters).<\/p>\n<p>This simply defines a string variable, corpus_raw, holding our sample text and prints its total length (593 characters, including spaces, newlines, and punctuation).<\/p>\n<h4>Character-Level Tokenization<\/h4>\n<p>Computers do not understand letters, only numbers. Tokenization is the process of converting text into numbers (tokens) that the model can work with. We will use the simplest approach: character-level tokenization.<\/p>\n<ul>\n<li>Find all the unique characters in corpus_raw.<\/li>\n<li>Assign each unique character a unique integer ID.<\/li>\n<li>Create mappings (dictionaries) that convert characters to IDs (char_to_int) and IDs back to characters (int_to_char). The total number of unique characters is our vocab_size.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d125478.png\" alt=\"Tokenization process\" \/><\/p>\n<p>Tokenization process<\/p>\n<p><span class=\"token comment\"># 
Find all unique characters in the raw corpus<\/span><br \/>\nchars <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">sorted<\/span><span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span><span class=\"token builtin\">set<\/span><span class=\"token punctuation\">(<\/span>corpus_raw<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\nvocab_size <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>chars<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># Create the character-to-integer mapping (encoding)<\/span><br \/>\nchar_to_int <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span> ch<span class=\"token punctuation\">:<\/span>i <span class=\"token keyword\">for<\/span> i<span class=\"token punctuation\">,<\/span>ch <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">enumerate<\/span><span class=\"token punctuation\">(<\/span>chars<span class=\"token punctuation\">)<\/span> <span class=\"token punctuation\">}<\/span><\/p>\n<p><span class=\"token comment\"># Create the integer-to-character mapping (decoding)<\/span><br \/>\nint_to_char <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span> i<span class=\"token punctuation\">:<\/span>ch <span class=\"token keyword\">for<\/span> i<span class=\"token punctuation\">,<\/span>ch <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">enumerate<\/span><span class=\"token punctuation\">(<\/span>chars<span class=\"token punctuation\">)<\/span> <span class=\"token punctuation\">}<\/span><\/p>\n<p><span class=\"token 
keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Created a character vocabulary of size: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>vocab_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Vocabulary: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token string\">&#039;&#039;<\/span><span class=\"token punctuation\">.<\/span>join<span class=\"token punctuation\">(<\/span>chars<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Optional: print a sample of each mapping<\/span><br \/>\n<span class=\"token comment\"># print(f&#034;Char-to-Int mapping sample: {dict(list(char_to_int.items())[:5])}&#034;)<\/span><br \/>\n<span class=\"token comment\"># print(f&#034;Int-to-Char mapping sample: {dict(list(int_to_char.items())[:5])}&#034;)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nCreated a character vocabulary of size: <span class=\"token number\">36<\/span><br \/>\nVocabulary: <br \/>\n &#039;<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><span class=\"token operator\">-<\/span><span class=\"token punctuation\">.<\/span><span 
class=\"token punctuation\">:<\/span>?ARSWabcdefghiklmnoprstuvwy<\/p>\n<p>The code found 36 unique characters (including the newline \\n, the space, punctuation, and uppercase and lowercase letters).<\/p>\n<p>This vocab_size matters for sizing the model&#039;s layers later on. The code also created the char_to_int and int_to_char dictionaries for conversion and printed the full list of characters in the vocabulary.<\/p>\n<h4>Encoding the Corpus<\/h4>\n<p>Now we use the char_to_int mapping we just created to convert the entire corpus_raw string into the corresponding sequence of integer IDs.<\/p>\n<p>This numeric representation is what the model actually trains on. We store the sequence as a PyTorch tensor for efficiency.<\/p>\n<p><span class=\"token comment\"># Encode the entire corpus as a list of integer IDs<\/span><br \/>\nencoded_corpus <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span>char_to_int<span class=\"token punctuation\">[<\/span>ch<span class=\"token punctuation\">]<\/span> <span class=\"token keyword\">for<\/span> ch <span class=\"token keyword\">in<\/span> corpus_raw<span class=\"token punctuation\">]<\/span><\/p>\n<p><span class=\"token comment\"># Convert the list to a PyTorch tensor<\/span><br \/>\nfull_data_sequence <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>tensor<span class=\"token punctuation\">(<\/span>encoded_corpus<span class=\"token punctuation\">,<\/span> dtype<span 
class=\"token operator\">&#061;<\/span>torch<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">long<\/span><span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Encoded corpus into a tensor of shape: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>full_data_sequence<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Optional: show the first 50 encoded IDs<\/span><br \/>\n<span class=\"token comment\"># print(f&#034;First 50 encoded token IDs: {full_data_sequence[:50].tolist()}&#034;)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nEncoded corpus into a tensor of shape: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">593<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>Our 593-character text has been converted into a single length-593 PyTorch 
tensor (essentially a list of numbers). Each number in the tensor stands for one character of the original text. The tensor is also placed on the device we chose earlier (for example &#039;cuda&#039;).<\/p>\n<h4>Defining the Hyperparameters<\/h4>\n<p>Next we define the hyperparameters, the settings chosen before training. They determine the model&#039;s architecture (how big it is, how many layers it has, and so on) and how it learns. For our LLaMA 4-style model, the key hyperparameters are:<\/p>\n<ul>\n<li>d_model: the main dimension used throughout the model (the size of the embeddings and hidden states).<\/li>\n<li>n_layers: the number of Transformer blocks stacked together. More layers usually make the model more capable (but slower).<\/li>\n<li>n_heads: the number of parallel attention computations in multi-head attention. d_model must be divisible by n_heads.<\/li>\n<li>block_size: the maximum input sequence length the model sees during training (also called the context length).<\/li>\n<li>rms_norm_eps: a small value used for numerical stability in RMSNorm.<\/li>\n<li>rope_theta: the parameter that controls the frequencies used in RoPE.<\/li>\n<\/ul>\n<p>MoE 
parameters:<\/p>\n<ul>\n<li>num_local_experts: the number of &#034;expert&#034; MLPs in each MoE layer.<\/li>\n<li>num_experts_per_tok: the number of experts the router sends each token to (Top-K routing).<\/li>\n<li>intermediate_size_expert\/shared: the hidden dimension of the expert and shared MLPs.<\/li>\n<\/ul>\n<p>The values we use are far smaller than those of the real LLaMA 4 so that the code runs quickly on typical hardware.<\/p>\n<p><span class=\"token comment\"># --- Model architecture hyperparameters ---<\/span><br \/>\n<span class=\"token comment\"># vocab_size is already determined by the data<\/span><br \/>\nd_model <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">128<\/span>         <span class=\"token comment\"># Embedding dimension (greatly reduced)<\/span><br \/>\nn_layers <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">4<\/span>          <span class=\"token comment\"># Number of Transformer blocks (reduced)<\/span><br \/>\nn_heads <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">4<\/span>           <span class=\"token comment\"># Number of attention heads<\/span><br \/>\nblock_size <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">64<\/span>       <span class=\"token comment\"># Maximum context length (sequence length)<\/span><br \/>\nrms_norm_eps <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">1e-5<\/span>   <span class=\"token comment\"># Small value for RMSNorm stability<\/span><br 
\/>\nrope_theta <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">10000.0<\/span>  <span class=\"token comment\"># RoPE theta parameter (reduced from Llama 4&#039;s 500k)<\/span><\/p>\n<p><span class=\"token comment\"># --- MoE-specific hyperparameters ---<\/span><br \/>\nnum_local_experts <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">4<\/span>      <span class=\"token comment\"># Number of experts in each MoE layer (reduced from 16)<\/span><br \/>\nnum_experts_per_tok <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">2<\/span>   <span class=\"token comment\"># Number of experts each token is routed to (Top-K, reduced from 4?)<\/span><br \/>\nintermediate_size_expert <span class=\"token operator\">&#061;<\/span> d_model <span class=\"token operator\">*<\/span> <span class=\"token number\">2<\/span>  <span class=\"token comment\"># Hidden dimension of the expert MLPs (scaled down)<\/span><br \/>\nintermediate_size_shared <span class=\"token operator\">&#061;<\/span> d_model <span class=\"token operator\">*<\/span> <span class=\"token number\">2<\/span>  <span class=\"token comment\"># Hidden dimension of the shared MLP (scaled down)<\/span><\/p>\n<p><span class=\"token comment\"># --- Attention hyperparameters ---<\/span><br \/>\n<span class=\"token comment\"># d_k (dimension per head) is derived from d_model and n_heads<\/span><\/p>\n<p><span class=\"token comment\"># --- Training hyperparameters ---<\/span><br \/>\nlearning_rate <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">5e<\/span><span class=\"token 
operator\">-<\/span><span class=\"token number\">4<\/span>  <span class=\"token comment\"># Learning rate<\/span><br \/>\nbatch_size <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">16<\/span>       <span class=\"token comment\"># Number of sequences processed in parallel<\/span><br \/>\nepochs <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">3000<\/span>         <span class=\"token comment\"># Number of training iterations (adjust as needed)<\/span><br \/>\neval_interval <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">300<\/span>  <span class=\"token comment\"># How often to print the loss<\/span><\/p>\n<p><span class=\"token comment\"># --- Derived hyperparameters ---<\/span><br \/>\n<span class=\"token keyword\">assert<\/span> d_model <span class=\"token operator\">%<\/span> n_heads <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;d_model must be divisible by n_heads&#034;<\/span><br \/>\nd_k <span class=\"token operator\">&#061;<\/span> d_model <span class=\"token operator\">\/\/<\/span> n_heads <span class=\"token comment\"># Key\/query\/value dimension per head<\/span><br \/>\nexpert_dim <span class=\"token operator\">&#061;<\/span> intermediate_size_expert <span class=\"token comment\"># Alias for clarity<\/span><br \/>\nshared_expert_dim <span class=\"token operator\">&#061;<\/span> intermediate_size_shared <span class=\"token comment\"># Alias for clarity<\/span><\/p>\n<p>Let&#039;s look at all the parameter values we just defined.<\/p>\n<p>--- Hyperparameters defined ---<br 
\/>\n\u8bcd\u6c47\u8868\u5927\u5c0f (vocab_size): 36<br \/>\n\u5d4c\u5165\u7ef4\u5ea6 (d_model): 128<br \/>\n\u5c42\u6570 (n_layers): 4<br \/>\n\u6ce8\u610f\u529b\u5934\u6570\u91cf (n_heads): 4<br \/>\n\u6bcf\u4e2a\u5934\u7684\u7ef4\u5ea6 (d_k): 32<br \/>\n\u6700\u5927\u5e8f\u5217\u957f\u5ea6 (block_size): 64<br \/>\nRMSNorm \u7a33\u5b9a\u6027\u503c (rms_norm_eps): 1e-05<br \/>\nRoPE theta \u53c2\u6570 (rope_theta): 10000.0<\/p>\n<p>&#8212; MoE \u7279\u5b9a &#8212;<br \/>\n\u6bcf\u4e2a MoE \u5c42\u7684\u672c\u5730\u4e13\u5bb6\u6570\u91cf (num_local_experts): 4<br \/>\n\u6bcf\u4e2a\u5206\u8bcd\u7684\u4e13\u5bb6\u6570\u91cf (num_experts_per_tok): 2<br \/>\n\u4e13\u5bb6\u4e2d\u95f4\u5c42\u5927\u5c0f (expert_dim): 256<br \/>\n\u5171\u4eab MLP \u4e2d\u95f4\u5c42\u5927\u5c0f (shared_expert_dim): 256<\/p>\n<p>&#8212; \u8bad\u7ec3\u7279\u5b9a &#8212;<br \/>\n\u5b66\u4e60\u7387&#xff1a;0.0005<br \/>\n\u6279\u91cf\u5927\u5c0f&#xff1a;16<br \/>\n\u8bad\u7ec3\u5468\u671f\u6570&#xff1a;3000<\/p>\n<p>\u8fd9\u4e2a\u8f93\u51fa\u6e05\u6670\u5730\u5217\u51fa\u4e86\u6211\u4eec\u521a\u521a\u4e3a\u6a21\u578b\u548c\u8bad\u7ec3\u8fc7\u7a0b\u8bbe\u7f6e\u7684\u6240\u6709\u914d\u7f6e\u503c\u3002\u6211\u4eec\u53ef\u4ee5\u770b\u5230\u6a21\u578b\u7ef4\u5ea6&#xff08;\u5982 d_model&#061;128&#xff09;\u3001MoE \u4e2d\u7684\u4e13\u5bb6\u6570\u91cf&#xff08;4&#xff09;\u3001\u6bcf\u4e2a\u5206\u8bcd\u4f7f\u7528\u7684\u4e13\u5bb6\u6570\u91cf&#xff08;2&#xff09;\u3001\u4e0a\u4e0b\u6587\u7a97\u53e3&#xff08;block_size&#061;64&#xff09;\u4ee5\u53ca\u8bad\u7ec3\u53c2\u6570&#xff08;learning_rate&#061;0.0005\u3001batch_size&#061;16\u3001epochs&#061;3000&#xff09;\u3002<\/p>\n<h4>\u8bad\u7ec3\u6570\u636e\u51c6\u5907<\/h4>\n<p>\u50cf\u6211\u4eec\u8fd9\u6837\u7684\u8bed\u8a00\u6a21\u578b\u662f\u901a\u8fc7\u9884\u6d4b\u7ed9\u5b9a\u4e4b\u524d\u5206\u8bcd\u7684\u4e0b\u4e00\u4e2a\u5206\u8bcd\u6765\u5b66\u4e60\u7684\u3002\u4e3a\u4e86\u51c6\u5907\u6570\u636e&#xff0c;\u6211\u4eec\u5728 full_data_sequence 
\u4e0a\u6ed1\u52a8\u4e00\u4e2a\u957f\u5ea6\u4e3a block_size \u7684\u7a97\u53e3\u3002<\/p>\n<li>\u8f93\u5165 (x) \u662f\u4e00\u4e2a\u957f\u5ea6\u4e3a block_size \u7684\u5206\u8bcd\u5757\u3002<\/li>\n<li>\u76ee\u6807 (y) \u662f\u76f8\u540c\u5757\u5411\u53f3\u79fb\u52a8\u4e00\u4e2a\u4f4d\u7f6e\u3002<\/li>\n<li>\u56e0\u6b64&#xff0c;\u5bf9\u4e8e\u8f93\u5165 x \u4e2d\u7684\u6bcf\u4e2a\u5206\u8bcd&#xff0c;\u6a21\u578b\u7684\u76ee\u6807\u662f\u9884\u6d4b\u76ee\u6807 y \u4e2d\u76f8\u540c\u4f4d\u7f6e\u7684\u5206\u8bcd\u3002<\/li>\n<p>\u6211\u4eec\u4ece\u8bed\u6599\u5e93\u4e2d\u63d0\u53d6\u6240\u6709\u53ef\u80fd\u7684\u91cd\u53e0\u5757\u3002<\/p>\n<p><span class=\"token comment\"># \u521b\u5efa\u5217\u8868\u4ee5\u4fdd\u5b58\u6240\u6709\u53ef\u80fd\u7684\u8f93\u5165&#xff08;x&#xff09;\u548c\u76ee\u6807&#xff08;y&#xff09;\u5e8f\u5217<\/span><br \/>\nall_x <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\nall_y <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><\/p>\n<p><span class=\"token comment\"># \u904d\u5386\u7f16\u7801\u540e\u7684\u8bed\u6599\u5e93\u5f20\u91cf\u4ee5\u63d0\u53d6\u91cd\u53e0\u5e8f\u5217<\/span><br \/>\nnum_total_tokens <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>full_data_sequence<span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>num_total_tokens <span class=\"token operator\">&#8211;<\/span> block_size<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># \u63d0\u53d6\u8f93\u5165\u5e8f\u5217\u5757<\/span><br \/>\n    x_chunk <span class=\"token 
operator\">&#061;<\/span> full_data_sequence<span class=\"token punctuation\">[<\/span>i <span class=\"token punctuation\">:<\/span> i <span class=\"token operator\">&#043;<\/span> block_size<span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token comment\"># Extract the target chunk (shifted one position to the right)<\/span><br \/>\n    y_chunk <span class=\"token operator\">&#061;<\/span> full_data_sequence<span class=\"token punctuation\">[<\/span>i <span class=\"token operator\">&#043;<\/span> <span class=\"token number\">1<\/span> <span class=\"token punctuation\">:<\/span> i <span class=\"token operator\">&#043;<\/span> block_size <span class=\"token operator\">&#043;<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    all_x<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>x_chunk<span class=\"token punctuation\">)<\/span><br \/>\n    all_y<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>y_chunk<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># Stack the list of tensors into a single large tensor<\/span><br \/>\ntrain_x <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>stack<span class=\"token punctuation\">(<\/span>all_x<span class=\"token punctuation\">)<\/span><br \/>\ntrain_y <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>stack<span class=\"token punctuation\">(<\/span>all_y<span class=\"token punctuation\">)<\/span><\/p>\n<p>num_sequences_available <span class=\"token operator\">&#061;<\/span> train_x<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><br \/>\n<span class=\"token 
keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Created <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>num_sequences_available<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> overlapping input\/target sequence pairs.&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Shape of train_x: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>train_x<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span> <span class=\"token comment\"># Should be (num_sequences, block_size)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Shape of train_y: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>train_y<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span> <span class=\"token comment\"># Should be (num_sequences, block_size)<\/span><\/p>\n<p><span class=\"token comment\"># Optional: verify the device<\/span><br \/>\n<span class=\"token comment\"># print(f&#034;Device of train_x: {train_x.device}&#034;) # Probably still on the CPU; batches are moved later<\/span><\/p>\n<p><span 
class=\"token comment\">### Output ###<\/span><br \/>\nCreated <span class=\"token number\">529<\/span> overlapping input<span class=\"token operator\">\/<\/span>target sequence pairs.<br \/>\nShape of train_x: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">529<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">64<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\nShape of train_y: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">529<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">64<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>From our 593-character text we can extract 593 - 64 &#061; 529 overlapping sequences of length 64 (block_size).<\/p>\n<p>The output confirms this: train_x (inputs) and train_y (targets) are now tensors of shape [529, 64].<\/p>\n<p>Note that these tensors may still be on the CPU; we will move each batch to the GPU (device) during training.<\/p>\n<h4>Batching Strategy (Random Sampling)<\/h4>\n<p>Training on the entire dataset at once usually takes too much memory. Instead, we train on 
mini-batches.<\/p>\n<p>A common strategy, and the one we use here for simplicity, is random sampling. At each training step we pick batch_size random indices (from 0 to num_sequences_available - 1) and fetch the corresponding input\/target pairs from train_x and train_y.<\/p>\n<p>The selected batch is then moved to device (GPU or CPU) for the model to process.<\/p>\n<p><span class=\"token comment\"># Check that we have enough sequences for the requested batch size<\/span><br \/>\n<span class=\"token keyword\">if<\/span> num_sequences_available <span class=\"token operator\">&lt;<\/span> batch_size<span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Warning: number of sequences (<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>num_sequences_available<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">) is smaller than the batch size (<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>batch_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">). Adjusting the batch size.&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    batch_size <span class=\"token operator\">&#061;<\/span> num_sequences_available<\/p>\n<p><span class=\"token keyword\">print<\/span><span 
class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Data is ready for training. Batches of size <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>batch_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> will be sampled at random.&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Batches will be moved to the device inside the training loop.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Example: how to select a batch inside the loop<\/span><br \/>\n<span class=\"token comment\"># indices &#061; torch.randint(0, num_sequences_available, (batch_size,))<\/span><br \/>\n<span class=\"token comment\"># xb &#061; train_x[indices].to(device)<\/span><br \/>\n<span class=\"token comment\"># yb &#061; train_y[indices].to(device)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nData is ready for training. Batches of size <span class=\"token number\">16<\/span> will be sampled at random.<br \/>\nBatches will be moved to the device inside the training loop.<\/p>\n<p>This confirms the plan. We have enough sequences (529) for our chosen batch size (16). It also reminds us that at each training step we will randomly grab 16 
input\/target sequence pairs and send them to the GPU or CPU for the computation of that step.<\/p>\n<h4>Model Component Initialization<\/h4>\n<p>The token embedding layer is the first layer of the model. It converts integer token IDs (such as those in train_x) into dense vectors of size d_model. Think of it as a lookup table in which every token ID has its own vector.<\/p>\n<p>These vectors hold an initial representation of the \u201cmeaning\u201d of each token, which the model learns and refines during training.<\/p>\n<p>Input shape: (Batch, SequenceLength) \u2192 output shape: (Batch, SequenceLength, d_model).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d137f57.png\" alt=\"Embedding layer initialization\" \/><\/p>\n<p>Embedding layer initialization<\/p>\n<p><span class=\"token comment\"># Initialize the token embedding table<\/span><br \/>\ntoken_embedding_table <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Embedding<span class=\"token punctuation\">(<\/span>vocab_size<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token 
string\">f&#034;Token embedding layer initialized:&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Input vocabulary size: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>vocab_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Output embedding dimension (d_model): <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>d_model<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Weight shape: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>token_embedding_table<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Device: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>token_embedding_table<span class=\"token punctuation\">.<\/span>weight<span 
class=\"token punctuation\">.<\/span>device<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nToken embedding layer initialized:<br \/>\n  Input vocabulary size: <span class=\"token number\">36<\/span><br \/>\n  Output embedding dimension <span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">)<\/span>: <span class=\"token number\">128<\/span><br \/>\n  Weight shape: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">36<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  Device: cuda<span class=\"token punctuation\">:<\/span><span class=\"token number\">0<\/span><\/p>\n<p>We have created the nn.Embedding layer. The output shows it is configured correctly: it knows our vocab_size is 36 and will output vectors of size d_model (128).<\/p>\n<p>The weight shape confirms the size of the lookup table: 36 rows (one per character) and 128 columns (the embedding dimension). It has also been placed on our GPU (cuda:0).<\/p>\n<h4>Rotary Position Embedding (RoPE) Precomputation<\/h4>\n<p>A Transformer has no built-in notion of token order. Positional encoding supplies that information.<\/p>\n<p><img decoding=\"async\" 
src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d149ab0.png\" alt=\"RoPE mechanism\" \/><\/p>\n<p>RoPE mechanism<\/p>\n<p>RoPE is a clever method used in models such as LLaMA. Instead of adding a separate position vector, it rotates pairs of components of the Query (Q) and Key (K) vectors by an angle that depends on the position.<\/p>\n<p>The amount of rotation depends on the position and on precomputed frequencies derived from the rope_theta hyperparameter. Here we precompute the inverse frequencies (inv_freq), which are constants.<\/p>\n<p>The actual rotations (the complex-valued freqs_cis) are computed on the fly during the forward pass, according to the length of each sequence.<\/p>\n<p><span class=\"token comment\"># Precompute the RoPE inverse frequencies<\/span><br \/>\n<span class=\"token comment\"># Formula: 1.0 \/ (rope_theta ** (torch.arange(0, d_k, 2) \/ d_k))<\/span><br \/>\nrope_freq_indices <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>arange<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> d_k<span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>torch<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\ninv_freq <span class=\"token operator\">&#061;<\/span> <span 
class=\"token number\">1.0<\/span> <span class=\"token operator\">\/<\/span> <span class=\"token punctuation\">(<\/span>rope_theta <span class=\"token operator\">**<\/span> <span class=\"token punctuation\">(<\/span>rope_freq_indices <span class=\"token operator\">\/<\/span> d_k<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Precomputed RoPE inverse frequencies (inv_freq):&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Shape: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>inv_freq<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span> <span class=\"token comment\"># Should be (d_k \/ 2,)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Values (first 5): <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>inv_freq<span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">:<\/span><span class=\"token number\">5<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>tolist<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span 
class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Device: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>inv_freq<span class=\"token punctuation\">.<\/span>device<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># &#039;freqs_cis&#039; (complex) will be computed in the forward pass from these inv_freq values and position_ids<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nPrecomputed RoPE inverse frequencies <span class=\"token punctuation\">(<\/span>inv_freq<span class=\"token punctuation\">)<\/span>:<br \/>\n  Shape: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">16<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  Values (first <span class=\"token number\">5<\/span>): <span class=\"token punctuation\">[<\/span><span class=\"token number\">1.0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">0.5623413324356079<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">0.3162277638912201<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">0.17782793939113617<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">0.10000000149011612<\/span><span class=\"token punctuation\">]<\/span><br \/>\n  Device: cuda<span class=\"token punctuation\">:<\/span><span class=\"token number\">0<\/span><\/p>\n<p>This block computes and stores the inv_freq 
tensor. Because the dimension per head (d_k) is 32 and RoPE works on pairs of dimensions, its shape is (16,), i.e. d_k \/ 2.<\/p>\n<p>These values are the base frequencies of the rotation. Later, in the forward pass, we use this inv_freq tensor to compute the actual rotation angles (freqs_cis) from the position of each token.<\/p>\n<h4>RMSNorm Layer Initialization<\/h4>\n<p>Normalization layers help stabilize training. LLaMA uses RMSNorm (Root Mean Square Normalization), which is simpler and faster than standard layer normalization.<\/p>\n<p>It divides the input vector by its root mean square and then scales the result with a learnable parameter gamma (the weight). Unlike LayerNorm, it usually has no learnable bias (beta).<\/p>\n<p>We need one RMSNorm before the attention block and one before the MoE\/FFN block in every layer, plus one before the final output layer.<\/p>\n<p>Because we implement it inline here, we only need to initialize the learnable gamma weights (nn.Parameter); the actual RMS computation happens in the forward pass.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d15c4fe.png\" alt=\"RMSNorm layer initialization\" 
\/><\/p>\n<p>RMSNorm \u5c42\u521d\u59cb\u5316<\/p>\n<p><span class=\"token comment\"># \u5217\u8868&#xff0c;\u7528\u4e8e\u5b58\u50a8\u6bcf\u4e2a Transformer \u5757\u7684 RMSNorm \u5c42\u6743\u91cd<\/span><br \/>\nrmsnorm_weights_input <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>      <span class=\"token comment\"># \u6ce8\u610f\u529b\u4e4b\u524d\u7684 RMSNorm<\/span><br \/>\nrmsnorm_weights_post_attn <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>  <span class=\"token comment\"># MoE\/FFN&#xff08;\u6ce8\u610f\u529b\u4e4b\u540e&#xff09;\u4e4b\u524d\u7684 RMSNorm<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u521d\u59cb\u5316 <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>n_layers<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684 RMSNorm \u6743\u91cd&#8230;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># \u6ce8\u610f\u529b\u8f93\u5165\u7684 RMSNorm \u6743\u91cd<\/span><br \/>\n    <span class=\"token comment\"># \u521d\u59cb\u5316\u6743\u91cd\u4e3a torch.ones&#xff0c;\u7c7b\u4f3c\u4e8e nn.LayerNorm \u7684\u9ed8\u8ba4 gamma<\/span><br \/>\n    weight_in <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Parameter<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones<span 
class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    rmsnorm_weights_input<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>weight_in<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># MoE\/FFN \u8f93\u5165\u7684 RMSNorm \u6743\u91cd&#xff08;\u6ce8\u610f\u529b\u4e4b\u540e&#xff09;<\/span><br \/>\n    weight_post <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Parameter<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    rmsnorm_weights_post_attn<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>weight_post<span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  \u521d\u59cb\u5316\u7b2c <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>i<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684 RMSNorm \u6743\u91cd&#xff08;\u8f93\u5165&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>weight_in<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#xff0c;\u6ce8\u610f\u529b\u4e4b\u540e&#xff1a;<\/span><span 
class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>weight_post<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#xff09;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># \u6700\u7ec8\u8f93\u51fa\u5c42\u4e4b\u524d\u7684 RMSNorm<\/span><br \/>\nfinal_rmsnorm_weight <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Parameter<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u521d\u59cb\u5316\u6700\u7ec8 RMSNorm \u6743\u91cd&#xff0c;\u5f62\u72b6&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>final_rmsnorm_weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;RMSNorm \u6743\u91cd\u5df2\u521d\u59cb\u5316&#xff08;\u4f5c\u4e3a nn.Parameter&#xff09;\u3002\u5f52\u4e00\u5316\u903b\u8f91\u5c06\u5728\u524d\u5411\u4f20\u64ad\u4e2d\u5185\u8054\u5b8c\u6210\u3002&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### \u8f93\u51fa ###<\/span><br \/>\n\u521d\u59cb\u5316 <span class=\"token number\">4<\/span> \u5c42\u7684 RMSNorm \u6743\u91cd<span class=\"token 
punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">1<\/span> \u5c42\u7684 RMSNorm \u6743\u91cd&#xff08;\u8f93\u5165&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u6ce8\u610f\u529b\u4e4b\u540e&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff09;<br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">2<\/span> \u5c42\u7684 RMSNorm \u6743\u91cd&#xff08;\u8f93\u5165&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u6ce8\u610f\u529b\u4e4b\u540e&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff09;<br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">3<\/span> \u5c42\u7684 RMSNorm \u6743\u91cd&#xff08;\u8f93\u5165&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">)<\/span>&#xff0c;\u6ce8\u610f\u529b\u4e4b\u540e&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff09;<br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">4<\/span> \u5c42\u7684 RMSNorm \u6743\u91cd&#xff08;\u8f93\u5165&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u6ce8\u610f\u529b\u4e4b\u540e&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff09;<br \/>\n\u521d\u59cb\u5316\u6700\u7ec8 RMSNorm \u6743\u91cd&#xff0c;\u5f62\u72b6&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\nRMSNorm \u6743\u91cd\u5df2\u521d\u59cb\u5316&#xff08;\u4f5c\u4e3a nn<span class=\"token punctuation\">.<\/span>Parameter&#xff09;\u3002\u5f52\u4e00\u5316\u903b\u8f91\u5c06\u5728\u524d\u5411\u4f20\u64ad\u4e2d\u5185\u8054\u5b8c\u6210\u3002<\/p>\n<p>Here we create the learnable gamma weights for every RMSNorm operation the model needs. For each of our n_layers (4 
layers) we need one weight for the norm before attention (rmsnorm_weights_input) and one for the norm before the MoE block (rmsnorm_weights_post_attn).<\/p>\n<p>We also need one final weight (final_rmsnorm_weight), applied after the last layer. Each weight is a Parameter tensor of size d_model (128), initialized to 1. The actual RMSNorm computation will use these weights in the forward pass.<\/p>\n<h5>Attention layer initialization (MHA)<\/h5>\n<p>The core of the Transformer is the self-attention mechanism. We use multi-head attention (MHA).<\/p>\n<p>For each layer we need linear projection layers that map the input vectors into the Query (Q), Key (K), and Value (V) spaces.<\/p>\n<ul>\n<li>QKV projection: a single large linear layer that projects the input (size d_model) into the combined QKV space (size 3 * d_model).<\/li>\n<li>Output projection: after attention has been computed across the heads, another linear layer projects the combined result back to the original d_model dimension.<\/li>\n<\/ul>\n<p>We will initialize these nn.Linear 
layers for each Transformer block. The bias in these projections is usually disabled.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d16aa29.png\" alt=\"Multi-head attention\" \/><\/p>\n<p>Multi-head attention<\/p>\n<p><span class=\"token comment\"># Lists holding the attention layers of each Transformer block<\/span><br \/>\nmha_qkv_linears <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>    <span class=\"token comment\"># Combined linear layers for the QKV projection<\/span><br \/>\nmha_output_linears <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token comment\"># Output linear layers of MHA<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u521d\u59cb\u5316 <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>n_layers<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684\u6ce8\u610f\u529b&#xff08;MHA&#xff09;\u7ebf\u6027\u5c42&#8230;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># QKV projection layer<\/span><br \/>\n    <span class=\"token comment\"># Large Transformers usually disable the bias in the QKV projection<\/span><br \/>\n    qkv_linear 
<span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span> <span class=\"token operator\">*<\/span> d_model<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    mha_qkv_linears<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>qkv_linear<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Output projection layer<\/span><br \/>\n    <span class=\"token comment\"># The bias is usually disabled here as well, but it can be enabled<\/span><br \/>\n    output_linear <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    mha_output_linears<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>output_linear<span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  \u521d\u59cb\u5316\u7b2c <\/span><span class=\"token interpolation\"><span 
class=\"token punctuation\">{<\/span>i<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684 MHA \u7ebf\u6027\u5c42&#xff08;QKV&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>qkv_linear<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#xff0c;\u8f93\u51fa&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>output_linear<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#xff09;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\u6ce8\u610f\u529b&#xff08;MHA&#xff09;\u7ebf\u6027\u5c42\u5df2\u521d\u59cb\u5316\u3002&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### \u8f93\u51fa ###<\/span><br \/>\n\u521d\u59cb\u5316 <span class=\"token number\">4<\/span> \u5c42\u7684\u6ce8\u610f\u529b&#xff08;MHA&#xff09;\u7ebf\u6027\u5c42<span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">1<\/span> \u5c42\u7684 MHA \u7ebf\u6027\u5c42&#xff08;QKV&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">384<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">)<\/span>&#xff0c;\u8f93\u51fa&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">2<\/span> \u5c42\u7684 MHA \u7ebf\u6027\u5c42&#xff08;QKV&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">384<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u8f93\u51fa&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">3<\/span> \u5c42\u7684 MHA \u7ebf\u6027\u5c42&#xff08;QKV&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">384<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u8f93\u51fa&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span 
class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  \u521d\u59cb\u5316\u7b2c <span class=\"token number\">4<\/span> \u5c42\u7684 MHA \u7ebf\u6027\u5c42&#xff08;QKV&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">384<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span>&#xff0c;\u8f93\u51fa&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n\u6ce8\u610f\u529b&#xff08;MHA&#xff09;\u7ebf\u6027\u5c42\u5df2\u521d\u59cb\u5316\u3002<\/p>\n<p>This sets up the linear layers that attention needs in each of our 4 Transformer blocks. For each layer we have:<\/p>\n<ul>\n<li>qkv_linear: a layer that maps d_model (128) to 3 * d_model (384). Its weight has shape [384, 128], because nn.Linear stores weights as (out_features, in_features).<\/li>\n<li>output_linear: a layer that maps d_model (128) back to d_model (128). Its weight has shape [128, 128].<\/li>\n<\/ul>\n<p>These layers are stored in lists (mha_qkv_linears and 
mha_output_linears) so that the forward pass can pick the right layer for each block.<\/p>\n<h4>Mixture-of-Experts (MoE) layer initialization<\/h4>\n<p>This is the distinctive part. After the attention block we use an MoE layer instead of a single large feed-forward network (FFN). For each layer this involves:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d1781b8.png\" alt=\"MoE layer\" \/><\/p>\n<p>MoE layer<\/p>\n<ul>\n<li>Router: a simple linear layer that takes the hidden state of a token (size d_model) as input and outputs a score (logit) for each available \u201cexpert\u201d.<\/li>\n<li>Experts: a set of (num_local_experts) independent small MLPs. Each expert is typically a \u201cgated MLP\u201d, like the standard FFN in LLaMA: parallel \u201cgate\u201d and \u201cup\u201d projections, an activation function (SiLU\/Swish), an elementwise multiplication (the gating step), and a \u201cdown\u201d projection. We initialize the weights of all experts and store them directly as nn.Parameter tensors rather than as a list of nn.Linear 
layers.<\/li>\n<li>Shared expert: a standard gated MLP (just like one of the experts) that every token passes through. Its output is added to the combined output of the selected experts.<\/li>\n<\/ul>\n<p>The router picks the num_experts_per_tok experts that each token is sent to (Top-K routing). The outputs of the selected experts are then combined, weighted by the confidence scores from the router.<\/p>\n<p><span class=\"token comment\"># Lists holding the MoE components of each layer<\/span><br \/>\nmoe_routers <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>             <span class=\"token comment\"># Router linear layers<\/span><br \/>\nmoe_expert_gate_up_proj <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token comment\"># Expert gate\/up projection weights<\/span><br \/>\nmoe_expert_down_proj <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>    <span class=\"token comment\"># Expert down projection weights<\/span><br \/>\nshared_expert_gate_proj <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token comment\"># Shared expert gate projection<\/span><br \/>\nshared_expert_up_proj <span class=\"token operator\">&#061;<\/span> <span class=\"token 
punctuation\">[<\/span><span class=\"token punctuation\">]<\/span>   <span class=\"token comment\"># \u5171\u4eab\u4e13\u5bb6\u4e0a\u6295\u5f71<\/span><br \/>\nshared_expert_down_proj <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token comment\"># \u5171\u4eab\u4e13\u5bb6\u4e0b\u6295\u5f71<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u521d\u59cb\u5316 <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>n_layers<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684 MoE \u548c\u5171\u4eab MLP \u7ec4\u4ef6&#8230;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  \u6bcf\u5c42\u7684\u4e13\u5bb6\u6570\u91cf&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>num_local_experts<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  \u4e13\u5bb6\u7ef4\u5ea6&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>expert_dim<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  
\u5171\u4eab MLP \u7ef4\u5ea6&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>shared_expert_dim<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># 1. \u8def\u7531\u5668<\/span><br \/>\n    router_linear <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> num_local_experts<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    moe_routers<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>router_linear<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># 2. 
\u4e13\u5bb6&#xff08;\u6743\u91cd\u4f5c\u4e3a\u53c2\u6570&#xff09;<\/span><br \/>\n    <span class=\"token comment\"># \u95e8\u63a7\/\u4e0a\u6295\u5f71\u6743\u91cd&#xff1a;(num_experts, d_model, 2 * expert_dim)<\/span><br \/>\n    <span class=\"token comment\"># \u6ce8\u610f&#xff1a;\u5c06\u95e8\u63a7\u548c\u4e0a\u6295\u5f71\u5408\u5e76\u5230\u4e00\u4e2a\u6743\u91cd\u77e9\u9635\u4e2d<\/span><br \/>\n    gate_up_w <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Parameter<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>empty<span class=\"token punctuation\">(<\/span>num_local_experts<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span> <span class=\"token operator\">*<\/span> expert_dim<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    nn<span class=\"token punctuation\">.<\/span>init<span class=\"token punctuation\">.<\/span>normal_<span class=\"token punctuation\">(<\/span>gate_up_w<span class=\"token punctuation\">,<\/span> mean<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.0<\/span><span class=\"token punctuation\">,<\/span> std<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.02<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token comment\"># \u793a\u4f8b\u521d\u59cb\u5316<\/span><br \/>\n    moe_expert_gate_up_proj<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>gate_up_w<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># \u4e0b\u6295\u5f71\u6743\u91cd&#xff1a;(num_experts, expert_dim, d_model)<\/span><br \/>\n    down_w <span class=\"token operator\">&#061;<\/span> nn<span class=\"token 
punctuation\">.<\/span>Parameter<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>empty<span class=\"token punctuation\">(<\/span>num_local_experts<span class=\"token punctuation\">,<\/span> expert_dim<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    nn<span class=\"token punctuation\">.<\/span>init<span class=\"token punctuation\">.<\/span>normal_<span class=\"token punctuation\">(<\/span>down_w<span class=\"token punctuation\">,<\/span> mean<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.0<\/span><span class=\"token punctuation\">,<\/span> std<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.02<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token comment\"># \u793a\u4f8b\u521d\u59cb\u5316<\/span><br \/>\n    moe_expert_down_proj<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>down_w<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># 3. 
\u5171\u4eab\u4e13\u5bb6&#xff08;\u6807\u51c6 MLP \u5c42&#xff09;<\/span><br \/>\n    shared_gate <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> shared_expert_dim<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    shared_up <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> shared_expert_dim<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    shared_down <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>shared_expert_dim<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n    shared_expert_gate_proj<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>shared_gate<span class=\"token punctuation\">)<\/span><br \/>\n    shared_expert_up_proj<span class=\"token punctuation\">.<\/span>append<span 
class=\"token punctuation\">(<\/span>shared_up<span class=\"token punctuation\">)<\/span><br \/>\n    shared_expert_down_proj<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>shared_down<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  \u521d\u59cb\u5316\u7b2c <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>i<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u5c42\u7684 MoE \u7ec4\u4ef6&#xff1a;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u8def\u7531\u5668\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>router_linear<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u4e13\u5bb6\u95e8\u63a7\/\u4e0a\u6295\u5f71\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>gate_up_w<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span 
class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u4e13\u5bb6\u4e0b\u6295\u5f71\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>down_w<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u5171\u4eab\u95e8\u63a7\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>shared_gate<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u5171\u4eab\u4e0a\u6295\u5f71\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>shared_up<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;    \u5171\u4eab\u4e0b\u6295\u5f71\u6743\u91cd&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>shared_down<span class=\"token punctuation\">.<\/span>weight<span class=\"token 
punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;MoE and shared MLP components initialized.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Activation function (used inline)<\/span><br \/>\nactivation_fn <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>SiLU<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>The output below shows the MoE components initialized in each of the 4 layers. The components created per layer are described after the output:<\/p>\n<p>Initializing MoE and shared MLP components for 4 layers...<br \/>\n  Number of experts per layer: 4<br \/>\n  Expert dimension: 256<br \/>\n  Shared MLP dimension: 256<br \/>\n  Initialized MoE components for layer 1:<br \/>\n    Router weights: torch.Size([4, 128])<br \/>\n    Expert gate\/up projection weights: torch.Size([4, 128, 512]) # num_experts, d_model, 2*expert_dim<br \/>\n    Expert down projection weights: torch.Size([4, 256, 128])  # num_experts, expert_dim, d_model<br \/>\n    Shared gate weights: torch.Size([256, 128])<br \/>\n    Shared up projection weights: torch.Size([256, 128])<br \/>\n    Shared down projection weights: torch.Size([128, 256])<br \/>\n  ... (similar output for layers 2, 3, and 4) ...<br \/>\nMoE 
and shared MLP components initialized.<\/p>\n<ul>\n<li>Router weights: a linear layer that maps d_model (128) to the number of experts (4). Its weight has shape [4, 128], because nn.Linear stores weights as [out_features, in_features].<\/li>\n<li>Expert gate\/up projection weights: a single parameter tensor holding the combined gate and up projection weights of all 4 experts. Its shape is [num_experts, d_model, 2 * expert_dim] &#061; [4, 128, 512].<\/li>\n<li>Expert down projection weights: a parameter tensor holding the down projection weights of all 4 experts. Its shape is [num_experts, expert_dim, d_model] &#061; [4, 256, 128].<\/li>\n<li>Shared gate\/up\/down projection weights: standard linear layers for the shared expert MLP, with shapes set by d_model (128) and shared_expert_dim (256): [256, 128] for the gate and up projections and [128, 256] for the down projection.<\/li>\n<\/ul>\n<p>These components are stored in lists, one entry per layer, so that the forward pass can index them when running the MoE logic. We also define the SiLU activation function.<\/p>\n<h4>Initializing the Final Output Layer<\/h4>\n<p>After all the Transformer layers, the final hidden state (after the last RMSNorm) must be converted into a prediction for the next token.<\/p>\n<p>This final linear layer projects the d_model-sized vector at each position to a vector of size 
vocab_size.<\/p>\n<p>Each element of the output vector is the raw score (logit) for one possible next character in the vocabulary.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d18c483.png\" alt=\"Output layer\" \/><\/p>\n<p>Output layer<\/p>\n<p><span class=\"token comment\"># Final linear layer (language modeling head)<\/span><br \/>\noutput_linear_layer <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>Linear<span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">,<\/span> vocab_size<span class=\"token punctuation\">,<\/span> bias<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Initialized final output linear layer:&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Input dimension (d_model): <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>d_model<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token 
string-interpolation\"><span class=\"token string\">f&#034;  Output dimension (vocab_size): <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>vocab_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Weight shape: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>output_linear_layer<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Device: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>output_linear_layer<span class=\"token punctuation\">.<\/span>weight<span class=\"token punctuation\">.<\/span>device<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nInitialized final output linear layer:<br \/>\n  Input dimension <span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">)<\/span>: <span class=\"token number\">128<\/span><br \/>\n  Output dimension <span class=\"token punctuation\">(<\/span>vocab_size<span class=\"token punctuation\">)<\/span>: <span class=\"token number\">36<\/span><br \/>\n  
Weight shape: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">36<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  Device: cuda<span class=\"token punctuation\">:<\/span><span class=\"token number\">0<\/span><\/p>\n<p>We have initialized the final nn.Linear layer. It takes d_model (128) as its input dimension and outputs vocab_size (36) logits. The weight shape [36, 128] confirms this mapping, since nn.Linear stores weights as [out_features, in_features].<\/p>\n<h4>Precomputing the Causal Mask<\/h4>\n<p>In a decoder-only Transformer like this one, the prediction made at position t may attend only to the tokens at positions 0 through t (itself included), never to future tokens (t&#043;1, t&#043;2, and so on).<\/p>\n<p>The causal mask enforces this. It is a matrix used inside the attention computation. We create a lower-triangular matrix (of size block_size x block_size) in which positions the model may attend to hold one value (here 1) and positions it may not attend to hold another (here 0).<\/p>\n<p>The mask is applied before the attention softmax 
step, effectively setting the scores of future positions to negative infinity so that they receive zero weight after the softmax. We precompute the mask once for the maximum sequence length (block_size).<\/p>\n<p><span class=\"token comment\"># Create the lower-triangular mask for causal self-attention<\/span><br \/>\n<span class=\"token comment\"># Positions with value 1 may be attended to; positions with value 0 are masked out.<\/span><br \/>\n<span class=\"token comment\"># Shape: (1, 1, block_size, block_size), so it broadcasts against (B, n_heads, T, T)<\/span><br \/>\ncausal_mask <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>tril<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones<span class=\"token punctuation\">(<\/span>block_size<span class=\"token punctuation\">,<\/span> block_size<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\ncausal_mask <span class=\"token operator\">&#061;<\/span> causal_mask<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> block_size<span class=\"token punctuation\">,<\/span> block_size<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Precomputed causal attention mask:&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span 
class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Shape: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>causal_mask<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Requires grad: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>causal_mask<span class=\"token punctuation\">.<\/span>requires_grad<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Optional: visualize the mask for a small block_size<\/span><br \/>\n<span class=\"token comment\"># if block_size &lt;&#061; 8:<\/span><br \/>\n<span class=\"token comment\">#    print(causal_mask[0, 0].cpu().numpy())<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\nPrecomputed causal attention mask:<br \/>\n  Shape: torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">64<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">64<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n  
Requires grad: <span class=\"token boolean\">False<\/span><\/p>\n<p>This creates the causal mask: a tensor whose lower triangle (diagonal included) is filled with 1 and whose remaining entries are 0.<\/p>\n<p>The shape [1, 1, 64, 64] lets it broadcast against the attention score tensor (shape [Batch, n_heads, SeqLen, SeqLen]) during the forward pass. It needs no gradient because it is a fixed constant, not a learned parameter.<\/p>\n<h4>Training Setup<\/h4>\n<p>The optimizer is the algorithm that updates the model weights from the gradients computed during backpropagation (learning). We first need to collect every parameter in the model that should be trained (that is, every parameter with requires_grad&#061;True).<\/p>\n<p>This covers the embedding table weights, the weights of all linear layers (QKV, output, MoE router, shared expert), and the nn.Parameter tensors we created for the RMSNorm weights and the MoE expert weights.<\/p>\n<p><span class=\"token comment\"># Collect all model parameters that require gradients<\/span><br \/>\nall_model_parameters <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>token_embedding_table<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token 
comment\"># Add the RMSNorm weights<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span>rmsnorm_weights_input<span class=\"token punctuation\">)<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span>rmsnorm_weights_post_attn<span class=\"token punctuation\">)<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>final_rmsnorm_weight<span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Add the attention linear layer weights<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>mha_qkv_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>mha_output_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span 
class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Add the MoE router linear layer weights<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Add the MoE expert weights (already nn.Parameter objects)<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span>moe_expert_gate_up_proj<span class=\"token punctuation\">)<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span>moe_expert_down_proj<span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Add the shared expert linear layer weights<\/span><br \/>\n<span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span 
class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>shared_expert_gate_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>shared_expert_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    all_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>shared_expert_down_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># Add the final output linear layer weights<\/span><br \/>\nall_model_parameters<span class=\"token punctuation\">.<\/span>extend<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span>output_linear_layer<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">(<\/span><span 
class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># Count the collected parameter tensors and the total number of trainable parameters<\/span><br \/>\nnum_param_groups <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>all_model_parameters<span class=\"token punctuation\">)<\/span><br \/>\ntotal_params <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">sum<\/span><span class=\"token punctuation\">(<\/span>p<span class=\"token punctuation\">.<\/span>numel<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> p <span class=\"token keyword\">in<\/span> all_model_parameters <span class=\"token keyword\">if<\/span> p<span class=\"token punctuation\">.<\/span>requires_grad<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># Define the AdamW optimizer<\/span><br \/>\noptimizer <span class=\"token operator\">&#061;<\/span> optim<span class=\"token punctuation\">.<\/span>AdamW<span class=\"token punctuation\">(<\/span>all_model_parameters<span class=\"token punctuation\">,<\/span> lr<span class=\"token operator\">&#061;<\/span>learning_rate<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Optimizer setup:&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Optimizer: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token 
builtin\">type<\/span><span class=\"token punctuation\">(<\/span>optimizer<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>__name__<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Learning rate: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>learning_rate<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Managing <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>num_param_groups<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> parameter groups\/tensors.&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Total trainable parameters: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>total_params<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">,<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">#### Output ####<\/span><br \/>\nOptimizer setup:<br \/>\n  Optimizer: AdamW<br \/>\n  Learning rate: <span 
class=\"token number\">0.0005<\/span><br \/>\n  Managing <span class=\"token number\">43<\/span> parameter groups<span class=\"token operator\">\/<\/span>tensors.<br \/>\n  Total trainable parameters: <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span><span class=\"token number\">240<\/span><span class=\"token punctuation\">,<\/span><span class=\"token number\">640<\/span><\/p>\n<p>The code collects every trainable part of the model (43 separate weight tensors or parameter objects) and creates an AdamW optimizer with the learning rate we specified. Despite the printed label, these are 43 individual tensors: passing a flat list to AdamW places them all in a single optimizer parameter group.<\/p>\n<p>It also counts the total number of trainable parameters, about 2.24 million, which is tiny compared with production models.<\/p>\n<h4>Defining the Loss Function<\/h4>\n<p>We need a way to measure how \u201cwrong\u201d the model&#039;s predictions are relative to the actual target tokens. Because predicting the next token is a classification problem (choosing the correct character from the vocabulary), the standard loss function is cross-entropy loss.<\/p>\n<p>It takes the model&#039;s output logits and the true token IDs and computes a single score that represents the error.<\/p>\n<p><span class=\"token comment\"># Define the loss function<\/span><br \/>\ncriterion <span class=\"token operator\">&#061;<\/span> nn<span class=\"token punctuation\">.<\/span>CrossEntropyLoss<span class=\"token 
punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>We have initialized nn.CrossEntropyLoss. It expects raw logits, because it applies log-softmax internally. This criterion object is used in the training loop to compute the loss for each batch.<\/p>\n<h4>Training the Model<\/h4>\n<p>We train by repeatedly feeding batches of data to the model, computing the loss, and updating the parameters with the optimizer.<\/p>\n<p>All the components initialized earlier work together in the forward pass.<\/p>\n<p>For the configured number of epochs, we repeat the following steps:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d196e24.png\" alt=\"Training loop\" \/><\/p>\n<p>Training loop<\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\\n--- Starting training loop for <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>epochs<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> epochs ---&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>losses <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><\/p>\n<p><span class=\"token keyword\">for<\/span> epoch <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>epochs<span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># Sample a random batch, using the same indices for inputs and targets so they stay aligned<\/span><br \/>\n    indices <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>randint<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> num_sequences_available<span class=\"token punctuation\">,<\/span> <span class=\"token punctuation\">(<\/span>batch_size<span class=\"token punctuation\">,<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    xb<span class=\"token punctuation\">,<\/span> yb <span class=\"token operator\">&#061;<\/span> train_x<span class=\"token punctuation\">[<\/span>indices<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> train_y<span class=\"token punctuation\">[<\/span>indices<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>to<span class=\"token punctuation\">(<\/span>device<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Token embedding<\/span><br \/>\n    token_embed <span class=\"token operator\">&#061;<\/span> token_embedding_table<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">)<\/span><br \/>\n    position_ids <span class=\"token operator\">&#061;<\/span> torch<span class=\"token 
punctuation\">.<\/span>arange<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token comment\"># Rotation angles for every (position, frequency) pair: (B, F, 1) &#064; (B, 1, T), transposed to (B, T, F)<\/span><br \/>\n    freqs <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>inv_freq<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>expand<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token 
builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> &#064;<br \/>\n             position_ids<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>expand<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>transpose<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token comment\"># Unit-magnitude complex rotations; torch.polar needs abs and angle as floats of the same dtype<\/span><br \/>\n    freqs_cis <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>polar<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones_like<span class=\"token punctuation\">(<\/span>freqs<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> freqs<span class=\"token punctuation\">)<\/span><\/p>\n<p>    x <span class=\"token operator\">&#061;<\/span> token_embed<br \/>\n    <span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token comment\"># RMSNorm and attention<\/span><br \/>\n        x_norm <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token 
punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> rmsnorm_weights_input<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><br \/>\n        qkv <span class=\"token operator\">&#061;<\/span> mha_qkv_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> n_heads<span 
class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span> <span class=\"token operator\">*<\/span> d_k<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>chunk<span class=\"token punctuation\">(<\/span><span class=\"token number\">3<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        q<span class=\"token punctuation\">,<\/span> k<span class=\"token punctuation\">,<\/span> v <span class=\"token operator\">&#061;<\/span> qkv<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> qkv<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> qkv<span class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">]<\/span><\/p>\n<p>        q_rope<span class=\"token punctuation\">,<\/span> k_rope <span class=\"token operator\">&#061;<\/span> q<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>reshape<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> n_heads<span class=\"token 
punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> k<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>reshape<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> n_heads<span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        q<span class=\"token punctuation\">,<\/span> k <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>view_as_real<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>view_as_complex<span class=\"token punctuation\">(<\/span>q_rope<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> freqs_cis<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>flatten<span class=\"token punctuation\">(<\/span><span class=\"token 
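number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> \\<br \/>\n               torch<span class=\"token punctuation\">.<\/span>view_as_real<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>view_as_complex<span class=\"token punctuation\">(<\/span>k_rope<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> freqs_cis<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>flatten<span class=\"token punctuation\">(<\/span><span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># Move heads ahead of the sequence axis: (batch, n_heads, seq_len, d_k)<\/span><br \/>\n        q<span class=\"token punctuation\">,<\/span> k<span class=\"token punctuation\">,<\/span> v <span class=\"token operator\">&#061;<\/span> q<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> k<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> v<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        attn_scores <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>q &#064; k<span class=\"token punctuation\">.<\/span>transpose<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>d_k <span class=\"token operator\">**<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        attn_scores <span class=\"token operator\">&#061;<\/span> attn_scores<span class=\"token punctuation\">.<\/span>masked_fill<span class=\"token punctuation\">(<\/span>causal_mask<span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">:<\/span><span class=\"token punctuation\">,<\/span><span class=\"token punctuation\">:<\/span><span 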
class=\"token punctuation\">,<\/span><span class=\"token punctuation\">:<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><span class=\"token punctuation\">:<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#039;-inf&#039;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        attention_weights <span class=\"token operator\">&#061;<\/span> F<span class=\"token punctuation\">.<\/span>softmax<span class=\"token punctuation\">(<\/span>attn_scores<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        attn_output <span class=\"token operator\">&#061;<\/span> attention_weights &#064; v<br \/>\n        x <span class=\"token operator\">&#061;<\/span> x <span class=\"token operator\">&#043;<\/span> mha_output_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>attn_output<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span 
class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>contiguous<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        <span class=\"token comment\"># MoE block<\/span><br \/>\n        x_norm <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token 
operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> rmsnorm_weights_post_attn<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><br \/>\n        router_logits <span class=\"token operator\">&#061;<\/span> moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm<span class=\"token punctuation\">)<\/span><br \/>\n        routing_weights<span class=\"token punctuation\">,<\/span> selected_experts <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>sigmoid<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>topk<span class=\"token punctuation\">(<\/span>router_logits<span class=\"token punctuation\">,<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> \\<br \/>\n                                             torch<span class=\"token punctuation\">.<\/span>topk<span class=\"token punctuation\">(<\/span>router_logits<span class=\"token punctuation\">,<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><br \/>\n        x_flat <span class=\"token operator\">&#061;<\/span> x_norm<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><br \/>\n        selected_experts_flat <span class=\"token operator\">&#061;<\/span> selected_experts<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        routing_weights_flat <span class=\"token operator\">&#061;<\/span> routing_weights<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        token_idx <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>arange<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">*<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>repeat_interleave<span class=\"token punctuation\">(<\/span>num_experts_per_tok<span class=\"token punctuation\">)<\/span><br \/>\n        expert_inputs <span 
class=\"token operator\">&#061;<\/span> x_flat<span class=\"token punctuation\">[<\/span>token_idx<span class=\"token punctuation\">]<\/span><br \/>\n        gate_up_states <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>bmm<span class=\"token punctuation\">(<\/span>expert_inputs<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> moe_expert_gate_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span>selected_experts_flat<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        activated_states <span class=\"token operator\">&#061;<\/span> activation_fn<span class=\"token punctuation\">(<\/span>gate_up_states<span class=\"token punctuation\">.<\/span>chunk<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> gate_up_states<span class=\"token punctuation\">.<\/span>chunk<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token 
punctuation\">]<\/span><br \/>\n        expert_outputs_weighted <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>bmm<span class=\"token punctuation\">(<\/span>activated_states<span class=\"token punctuation\">,<\/span> moe_expert_down_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span>selected_experts_flat<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>squeeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> \\<br \/>\n                                  routing_weights_flat<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        combined_expert_outputs <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>zeros_like<span class=\"token punctuation\">(<\/span>x_flat<span class=\"token punctuation\">)<\/span><br \/>\n        combined_expert_outputs<span class=\"token punctuation\">.<\/span>scatter_add_<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> token_idx<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>expand<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><span 
class=\"token punctuation\">,<\/span> expert_outputs_weighted<span class=\"token punctuation\">)<\/span><\/p>\n<p>        shared_output <span class=\"token operator\">&#061;<\/span> shared_expert_down_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span><br \/>\n            activation_fn<span class=\"token punctuation\">(<\/span>shared_expert_gate_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> shared_expert_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        x <span class=\"token operator\">&#061;<\/span> x <span class=\"token operator\">&#043;<\/span> combined_expert_outputs<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span>xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> xb<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> shared_output<\/p>\n<p>    <span class=\"token comment\"># Final RMSNorm and output<\/span><br \/>\n    logits <span class=\"token operator\">&#061;<\/span> output_linear_layer<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token 
punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> final_rmsnorm_weight<span class=\"token punctuation\">)<\/span><br \/>\n    loss <span class=\"token operator\">&#061;<\/span> criterion<span class=\"token punctuation\">(<\/span>logits<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> logits<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">[<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> yb<span class=\"token punctuation\">.<\/span>view<span class=\"token 
punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    optimizer<span class=\"token punctuation\">.<\/span>zero_grad<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    loss<span class=\"token punctuation\">.<\/span>backward<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    optimizer<span class=\"token punctuation\">.<\/span>step<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    losses<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>loss<span class=\"token punctuation\">.<\/span>item<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> epoch <span class=\"token operator\">%<\/span> eval_interval <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token number\">0<\/span> <span class=\"token keyword\">or<\/span> epoch <span class=\"token operator\">&#061;&#061;<\/span> epochs <span class=\"token operator\">-<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;  Epoch <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>epoch<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">\/<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>epochs<span class=\"token punctuation\">}<\/span><\/span><span 
class=\"token string\">, Loss: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>loss<span class=\"token punctuation\">.<\/span>item<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.4f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--- Training loop completed ---&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">try<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">import<\/span> matplotlib<span class=\"token punctuation\">.<\/span>pyplot <span class=\"token keyword\">as<\/span> plt<br \/>\n    plt<span class=\"token punctuation\">.<\/span>plot<span class=\"token punctuation\">(<\/span>losses<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>title<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Training Loss Over Epochs&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>xlabel<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Epoch&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>ylabel<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Loss&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>show<span class=\"token punctuation\">(<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">except<\/span> ImportError<span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Matplotlib not found, skipping the loss plot.&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>Once training starts, the loop prints the training loss every eval_interval epochs and once more on the final epoch. Note that each \u201cepoch\u201d in this loop is a single optimizer step on one randomly sampled batch.<\/p>\n<p>--- Starting training loop for 3000 epochs ---<br \/>\n  Epoch 1\/3000, Loss: 3.8124<br \/>\n  Epoch 301\/3000, Loss: 0.0734<br \/>\n  Epoch 601\/3000, Loss: 0.0595<br \/>\n  Epoch 901\/3000, Loss: 0.0609<br \/>\n  Epoch 1201\/3000, Loss: 0.0707<br \/>\n  Epoch 1501\/3000, Loss: 0.0664<br \/>\n  Epoch 1801\/3000, Loss: 0.0559<br \/>\n  Epoch 2101\/3000, Loss: 0.0610<br \/>\n  Epoch 2401\/3000, Loss: 0.0680<br \/>\n  Epoch 2701\/3000, Loss: 0.0641<br \/>\n  Epoch 3000\/3000, Loss: 0.0553<br \/>\n--- Training loop completed ---<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d1a7ccc.png\" alt=\"Training loss plot\" \/><\/p>\n<p>Training loss plot<\/p>\n<p>The output shows the training progress. The loss starts at about 3.8, falls to roughly 0.07 within the first 300 epochs, 
and then hovers between about 0.05 and 0.07 for the rest of the 3000 epochs.<\/p>\n<p>This steep drop is exactly what we want to see. It means the model is picking up the patterns in the \u201cAlice in Wonderland\u201d text and getting steadily better at predicting the next character. On a corpus this small, a loss this low likely also reflects a good deal of memorization of the training text.<\/p>\n<p>The plot confirms the downward trend visually: the MoE layers, RMSNorm, and RoPE are all working together.<\/p>\n<h4>Text Generation<\/h4>\n<p>Now that the model is trained, let&#039;s see what it can write! We start from a short prompt (the seed text) and convert it into token IDs.<\/p>\n<p>We also specify how many new tokens (characters) to generate. It is important to put the model components into evaluation mode with .eval().<\/p>\n<p>If Dropout or BatchNorm were used, this would disable Dropout and make BatchNorm use its running statistics, which keeps the output consistent. We also wrap generation in torch.no_grad(): since we are no longer training, PyTorch does not need to track gradients, which makes generation faster and uses less memory.<\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n--- Step 7: Text Generation ---&#034;<\/span><span 
class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># &#8212; \u751f\u6210\u53c2\u6570 &#8212;<\/span><br \/>\nseed_chars <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;Alice &#034;<\/span> <span class=\"token comment\"># \u8d77\u59cb\u6587\u672c\u63d0\u793a<\/span><br \/>\nnum_tokens_to_generate <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">200<\/span> <span class=\"token comment\"># \u8981\u751f\u6210\u7684\u65b0\u5b57\u7b26\u6570\u91cf<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u79cd\u5b50\u6587\u672c&#xff1a;&#039;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>seed_chars<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#039;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u751f\u6210 <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>num_tokens_to_generate<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> \u4e2a\u65b0\u5206\u8bcd&#8230;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># &#8212; \u51c6\u5907\u521d\u59cb\u4e0a\u4e0b\u6587 &#8212;<\/span><br \/>\n<span class=\"token comment\"># \u5c06\u79cd\u5b50\u5b57\u7b26\u8f6c\u6362\u4e3a\u5206\u8bcd ID<\/span><br \/>\nseed_ids <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span>char_to_int<span class=\"token punctuation\">[<\/span>ch<span class=\"token punctuation\">]<\/span> <span class=\"token keyword\">for<\/span> ch <span class=\"token keyword\">in<\/span> seed_chars 
<span class=\"token keyword\">if<\/span> ch <span class=\"token keyword\">in<\/span> char_to_int<span class=\"token punctuation\">]<\/span><br \/>\n<span class=\"token comment\"># \u521b\u5efa\u521d\u59cb\u4e0a\u4e0b\u6587\u5f20\u91cf&#xff08;\u6dfb\u52a0\u6279\u91cf\u7ef4\u5ea6&#xff09;<\/span><br \/>\ngenerated_sequence <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>tensor<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>seed_ids<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>torch<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">long<\/span><span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\u521d\u59cb\u4e0a\u4e0b\u6587\u5f62\u72b6&#xff1a;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>generated_sequence<span class=\"token punctuation\">.<\/span>shape<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># &#8212; \u5c06\u6a21\u578b\u7ec4\u4ef6\u8bbe\u7f6e\u4e3a\u8bc4\u4f30\u6a21\u5f0f &#8212;<\/span><br \/>\n<span class=\"token comment\"># &#xff08;\u5982\u679c\u4f7f\u7528\u4e86 Dropout \u6216 BatchNorm&#xff0c;\u8fd9\u662f\u5f88\u91cd\u8981\u7684&#xff0c;\u65e0\u8bba\u5982\u4f55\u90fd\u662f\u597d\u4e60\u60ef&#xff09;<\/span><br \/>\ntoken_embedding_table<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span 
class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># RMSNorm \u6ca1\u6709 eval \u6a21\u5f0f&#xff0c;\u53ea\u4f7f\u7528\u6743\u91cd<\/span><br \/>\n    mha_qkv_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    mha_output_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token comment\"># \u4e13\u5bb6\u6743\u91cd&#xff08;Parameters&#xff09;\u6ca1\u6709 eval()<\/span><br \/>\n    shared_expert_gate_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    shared_expert_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    shared_expert_down_proj<span 
class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\noutput_linear_layer<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">eval<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token comment\"># \u6700\u7ec8 RMSNorm \u6743\u91cd\u6ca1\u6709 eval()<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\u5df2\u5c06\u6a21\u578b\u7ec4\u4ef6\u8bbe\u7f6e\u4e3a\u8bc4\u4f30\u6a21\u5f0f&#xff08;\u9002\u7528\u65f6&#xff09;\u3002&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### \u8f93\u51fa ###<\/span><br \/>\n<span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&#8211;<\/span> \u7b2c <span class=\"token number\">7<\/span> \u6b65&#xff1a;\u6587\u672c\u751f\u6210 <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&#8211;<\/span><br \/>\n\u79cd\u5b50\u6587\u672c&#xff1a;<span class=\"token string\">&#039;Alice &#039;<\/span><br \/>\n\u751f\u6210 <span class=\"token number\">200<\/span> \u4e2a\u65b0\u5206\u8bcd<span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><br \/>\n\u521d\u59cb\u4e0a\u4e0b\u6587\u5f62\u72b6&#xff1a;torch<span class=\"token punctuation\">.<\/span>Size<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">6<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n\u5df2\u5c06\u6a21\u578b\u7ec4\u4ef6\u8bbe\u7f6e\u4e3a\u8bc4\u4f30\u6a21\u5f0f&#xff08;\u9002\u7528\u65f6&#xff09;\u3002<\/p>\n<p>\u8fd9\u8bbe\u7f6e\u4e86\u751f\u6210\u8fc7\u7a0b\u3002\u6211\u4eec\u7684\u8d77\u59cb\u63d0\u793a\u662f &#034;Alice &#034;\u3002\u6211\u4eec\u8ba1\u5212\u751f\u6210 200 \u4e2a\u66f4\u591a\u5b57\u7b26\u3002\u521d\u59cb\u63d0\u793a\u88ab\u8f6c\u6362\u4e3a\u4e00\u4e2a\u5f62\u72b6\u4e3a [1, 6] \u7684\u5206\u8bcd ID \u5f20\u91cf&#xff08;\u6279\u91cf\u4e2d\u6709 1 \u4e2a\u5e8f\u5217&#xff0c;\u957f\u5ea6\u4e3a 6 \u4e2a\u5206\u8bcd&#xff09;\u3002\u76f8\u5173\u7684\u6a21\u578b\u5c42\u5df2\u5207\u6362\u5230\u8bc4\u4f30\u6a21\u5f0f\u3002<\/p>\n<h4>\u751f\u6210\u5faa\u73af<\/h4>\n<p>\u6211\u4eec\u5c06\u4e00\u6b21\u751f\u6210\u4e00\u4e2a\u5b57\u7b26&#xff0c;\u5728\u4e00\u4e2a\u5faa\u73af\u4e2d&#xff1a;<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010937-681ab2d1b6fbb.png\" alt=\"\u751f\u6210\u5faa\u73af\" \/><\/p>\n<p>\u751f\u6210\u5faa\u73af<\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\u5f00\u59cb\u751f\u6210\u5faa\u73af&#8230;&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">with<\/span> torch<span class=\"token punctuation\">.<\/span>no_grad<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> _ <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>num_tokens_to_generate<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        current_context <span class=\"token operator\">&#061;<\/span> generated_sequence<span class=\"token punctuation\">[<\/span><span class=\"token 
punctuation\">:<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span>block_size<span class=\"token punctuation\">:<\/span><span class=\"token punctuation\">]<\/span><br \/>\n        B_gen<span class=\"token punctuation\">,<\/span> T_gen <span class=\"token operator\">&#061;<\/span> current_context<span class=\"token punctuation\">.<\/span>shape<\/p>\n<p>        <span class=\"token comment\"># Token embeddings and RoPE frequencies<\/span><br \/>\n        token_embed_gen <span class=\"token operator\">&#061;<\/span> token_embedding_table<span class=\"token punctuation\">(<\/span>current_context<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># Positions 0..T_gen-1, shaped (B_gen, 1, T_gen) for the batched matmul below<\/span><br \/>\n        position_ids_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>arange<span class=\"token punctuation\">(<\/span>T_gen<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>expand<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># Rotation angles, shape (B_gen, T_gen, d_k \/\/ 2)<\/span><br \/>\n        rope_angles_gen <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>inv_freq<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>expand<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span> &#064; position_ids_gen<span class=\"token punctuation\">.<\/span><span class=\"token 
builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>transpose<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        freqs_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>polar<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones_like<span class=\"token punctuation\">(<\/span>rope_angles_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> rope_angles_gen<span class=\"token punctuation\">)<\/span><br \/>\n        x_gen <span class=\"token operator\">&#061;<\/span> token_embed_gen<\/p>\n<p>        <span class=\"token keyword\">for<\/span> i <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span>n_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token comment\"># RMSNorm and attention<\/span><br \/>\n            x_norm_gen <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token 
operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> rmsnorm_weights_input<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><br \/>\n            qkv_gen <span class=\"token operator\">&#061;<\/span> mha_qkv_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">,<\/span> n_heads<span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span> <span class=\"token operator\">*<\/span> d_k<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>chunk<span class=\"token punctuation\">(<\/span><span class=\"token number\">3<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token comment\"># Apply RoPE, then flatten the (d_k \/\/ 2, 2) pairs back to d_k<\/span><br \/>\n            q_rotated_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>view_as_real<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>view_as_complex<span class=\"token punctuation\">(<\/span>qkv_gen<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>reshape<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token 
punctuation\">,<\/span> n_heads<span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> freqs_gen<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>flatten<span class=\"token punctuation\">(<\/span><span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            k_rotated_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>view_as_real<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>view_as_complex<span class=\"token punctuation\">(<\/span>qkv_gen<span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>reshape<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">,<\/span> n_heads<span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> freqs_gen<span class=\"token punctuation\">.<\/span>unsqueeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>flatten<span class=\"token punctuation\">(<\/span><span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token comment\"># Causal mask: each position attends only to itself and earlier positions<\/span><br \/>\n            causal_mask_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>tril<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>ones<span class=\"token punctuation\">(<\/span>T_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">,<\/span> device<span class=\"token operator\">&#061;<\/span>device<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">bool<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            attn_scores_gen <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>q_rotated_gen<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span> &#064; k_rotated_gen<span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>transpose<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>d_k <span class=\"token operator\">**<\/span> <span class=\"token operator\">-<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            attn_scores_gen <span class=\"token operator\">&#061;<\/span> attn_scores_gen<span class=\"token punctuation\">.<\/span>masked_fill<span class=\"token punctuation\">(<\/span><span class=\"token operator\">~<\/span>causal_mask_gen<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#039;-inf&#039;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token comment\"># Move heads back next to d_k before merging them into d_model<\/span><br \/>\n            attn_output_gen <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>F<span class=\"token punctuation\">.<\/span>softmax<span class=\"token 
punctuation\">(<\/span>attn_scores_gen<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span> &#064; qkv_gen<span class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>permute<span class=\"token punctuation\">(<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>reshape<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><br \/>\n            x_gen <span class=\"token operator\">&#061;<\/span> x_gen <span class=\"token operator\">&#043;<\/span> mha_output_linears<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>attn_output_gen<span class=\"token punctuation\">)<\/span><\/p>\n<p>            <span class=\"token comment\"># MoE block<\/span><br \/>\n            x_norm_gen <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token 
punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> rmsnorm_weights_post_attn<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><br \/>\n            routing_weights_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>sigmoid<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>topk<span class=\"token punctuation\">(<\/span>moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            expert_outputs_gen <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>bmm<span class=\"token punctuation\">(<\/span>activation_fn<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>chunk<span class=\"token punctuation\">(<\/span>torch<span class=\"token punctuation\">.<\/span>bmm<span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">.<\/span>view<span 
class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> moe_expert_gate_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span>torch<span class=\"token punctuation\">.<\/span>topk<span class=\"token punctuation\">(<\/span>moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>squeeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> routing_weights_gen<span class=\"token punctuation\">,<\/span> moe_expert_down_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span>torch<span class=\"token punctuation\">.<\/span>topk<span class=\"token 
punctuation\">(<\/span>moe_routers<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>squeeze<span class=\"token punctuation\">(<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>            <span class=\"token comment\"># Combine expert outputs<\/span><br \/>\n            x_gen <span class=\"token operator\">&#061;<\/span> x_gen <span class=\"token operator\">&#043;<\/span> expert_outputs_gen<span class=\"token punctuation\">.<\/span>view<span class=\"token punctuation\">(<\/span>B_gen<span class=\"token punctuation\">,<\/span> T_gen<span class=\"token punctuation\">,<\/span> d_model<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> shared_expert_down_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>activation_fn<span class=\"token punctuation\">(<\/span>shared_expert_gate_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> shared_expert_up_proj<span class=\"token punctuation\">[<\/span>i<span class=\"token 
punctuation\">]<\/span><span class=\"token punctuation\">(<\/span>x_norm_gen<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        <span class=\"token comment\"># Final RMSNorm and output<\/span><br \/>\n        logits_gen <span class=\"token operator\">&#061;<\/span> output_linear_layer<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> torch<span class=\"token punctuation\">.<\/span>rsqrt<span class=\"token punctuation\">(<\/span>x_gen<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span><span class=\"token builtin\">pow<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> keepdim<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> rms_norm_eps<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> final_rmsnorm_weight<span class=\"token punctuation\">)<\/span><br \/>\n        next_token <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>multinomial<span class=\"token punctuation\">(<\/span>F<span class=\"token punctuation\">.<\/span>softmax<span class=\"token 
punctuation\">(<\/span>logits_gen<span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">:<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token punctuation\">:<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token operator\">&#8211;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> num_samples<span class=\"token operator\">&#061;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        generated_sequence <span class=\"token operator\">&#061;<\/span> torch<span class=\"token punctuation\">.<\/span>cat<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">(<\/span>generated_sequence<span class=\"token punctuation\">,<\/span> next_token<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> dim<span class=\"token operator\">&#061;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8230;\u751f\u6210\u5faa\u73af\u5b8c\u6210\u3002&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>\u751f\u6210\u5faa\u73af\u5df2\u6309\u6307\u5b9a\u7684\u6b65\u6570&#xff08;200 \u6b21&#xff09;\u8fd0\u884c\u3002\u5728\u5faa\u73af\u5185\u90e8&#xff08;\u5b83\u672c\u8eab\u4e0d\u6253\u5370\u4efb\u4f55\u5185\u5bb9&#xff09;&#xff0c;\u6a21\u578b\u6839\u636e\u5230\u76ee\u524d\u4e3a\u6b62\u751f\u6210\u7684\u5e8f\u5217\u53cd\u590d\u9884\u6d4b\u5e76\u8ffd\u52a0\u4e0b\u4e00\u4e2a\u5b57\u7b26\u3002<\/p>\n<h4>\u89e3\u7801\u751f\u6210\u5e8f\u5217<\/h4>\n<p>generated_sequence 
\u5f20\u91cf\u73b0\u5728\u5305\u542b\u4e86\u539f\u59cb\u79cd\u5b50\u5206\u8bcd ID \u52a0\u4e0a\u65b0\u751f\u6210\u7684 200 \u4e2a\u5206\u8bcd ID\u3002\u8981\u67e5\u770b\u5b9e\u9645\u6587\u672c&#xff0c;\u6211\u4eec\u9700\u8981\u5c06\u8fd9\u4e9b\u6570\u5b57\u8f6c\u6362\u56de\u5b57\u7b26&#xff0c;\u4f7f\u7528\u6211\u4eec\u4e4b\u524d\u521b\u5efa\u7684 int_to_char \u6620\u5c04\u3002<\/p>\n<p>\u6211\u4eec\u5c06\u5206\u8bcd ID \u5217\u8868\u53d6\u51fa\u6765&#xff0c;\u67e5\u627e\u6bcf\u4e2a ID \u5bf9\u5e94\u7684\u5b57\u7b26&#xff0c;\u5e76\u5c06\u5b83\u4eec\u8fde\u63a5\u6210\u4e00\u4e2a\u5b57\u7b26\u4e32\u3002<\/p>\n<p><span class=\"token comment\"># \u83b7\u53d6\u7b2c\u4e00\u4e2a&#xff08;\u4e5f\u662f\u552f\u4e00\u4e00\u4e2a&#xff09;\u6279\u91cf\u9879\u7684\u751f\u6210\u5e8f\u5217<\/span><br \/>\nfinal_generated_ids <span class=\"token operator\">&#061;<\/span> generated_sequence<span class=\"token punctuation\">[<\/span><span class=\"token number\">0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token operator\">.<\/span><span class=\"token function\">tolist<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># \u5c06 ID \u5217\u8868\u89e3\u7801\u56de\u5b57\u7b26\u4e32<\/span><br \/>\ndecoded_text <span class=\"token operator\">&#061;<\/span> <span class=\"token string single-quoted-string\">&#039;&#039;<\/span><span class=\"token operator\">.<\/span><span class=\"token function\">join<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>int_to_char<span class=\"token operator\">.<\/span><span class=\"token function\">get<\/span><span class=\"token punctuation\">(<\/span>id_val<span class=\"token punctuation\">,<\/span> <span class=\"token string single-quoted-string\">&#039;[UNK]&#039;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> id_val in final_generated_ids<span class=\"token 
punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string double-quoted-string\">&#034;\\n--- Final Generated Text ---&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span>decoded_text<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\">### Output ###<\/span><br \/>\n<span class=\"token operator\">---<\/span> Final Generated Text <span class=\"token operator\">---<\/span><br \/>\nAlice <span class=\"token string single-quoted-string\">&#039;without pictures or<br \/>\nconversation?&#039;<\/span><br \/>\nSo she was considering in her own <span class=\"token function\">mind <\/span><span class=\"token punctuation\">(<\/span><span class=\"token keyword\">as<\/span> well <span class=\"token keyword\">as<\/span> she could<span class=\"token punctuation\">,<\/span> <span class=\"token keyword\">for<\/span> the<br \/>\nhot day made her feel very sleepy <span class=\"token keyword\">and<\/span> stupid<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> whether the pleasure<br \/>\nof making a daisy<span class=\"token operator\">-<\/span>chain wo <span class=\"token operator\">...<\/span><\/p>\n<p>The final result is in! Starting from &#034;Alice &#034;, our trained model generated the next 200
characters. Looking at the output, we can see that it has indeed learned the style and content of the training text.<\/p>\n<p>It continues the sentence structure, uses appropriate punctuation, and produces words and phrases taken directly from the original corpus (\u201cwithout pictures or conversation?\u201d, \u201cSo she was considering\u2026\u201d).<\/p>\n<p>This shows that even our small model, with its MoE layers, successfully predicts the next character from the patterns in the training data.<\/p>\n<p>It does not produce strikingly original text, because the training data is small and repetitive, but it does demonstrate the core generation capability.<\/p>\n<h4>Saving the Model State (Optional)<\/h4>\n<p>After training, we usually want to save the model's state. That means collecting everything needed to rebuild it: the configuration, the tokenizer mappings, and the learned weights.<\/p>\n<p><span class=\"token comment\"># Create a directory for the saved model (if it does not already exist)<\/span><br \/>\nsave_dir <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#039;saved_models&#039;<\/span><br \/>\nos<span class=\"token punctuation\">.<\/span>makedirs<span class=\"token punctuation\">(<\/span>save_dir<span class=\"token punctuation\">,<\/span> exist_ok<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\nsave_path <span 
class=\"token operator\">&#061;<\/span> os<span class=\"token punctuation\">.<\/span>path<span class=\"token punctuation\">.<\/span>join<span class=\"token punctuation\">(<\/span>save_dir<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#039;llama4_moe_model.pt&#039;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># Manually build a state dictionary that collects every component<\/span><br \/>\nmodel_state <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n    <span class=\"token comment\"># Configuration<\/span><br \/>\n    <span class=\"token string\">&#039;config&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#039;vocab_size&#039;<\/span><span class=\"token punctuation\">:<\/span> vocab_size<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;d_model&#039;<\/span><span class=\"token punctuation\">:<\/span> d_model<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;n_layers&#039;<\/span><span class=\"token punctuation\">:<\/span> n_layers<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;n_heads&#039;<\/span><span class=\"token punctuation\">:<\/span> n_heads<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;block_size&#039;<\/span><span class=\"token punctuation\">:<\/span> block_size<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;rms_norm_eps&#039;<\/span><span class=\"token punctuation\">:<\/span> rms_norm_eps<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;rope_theta&#039;<\/span><span class=\"token punctuation\">:<\/span> rope_theta<span 
class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;num_local_experts&#039;<\/span><span class=\"token punctuation\">:<\/span> num_local_experts<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;num_experts_per_tok&#039;<\/span><span class=\"token punctuation\">:<\/span> num_experts_per_tok<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;intermediate_size_expert&#039;<\/span><span class=\"token punctuation\">:<\/span> intermediate_size_expert<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;intermediate_size_shared&#039;<\/span><span class=\"token punctuation\">:<\/span> intermediate_size_shared<br \/>\n    <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token comment\"># Tokenizer<\/span><br \/>\n    <span class=\"token string\">&#039;tokenizer&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#039;char_to_int&#039;<\/span><span class=\"token punctuation\">:<\/span> char_to_int<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#039;int_to_char&#039;<\/span><span class=\"token punctuation\">:<\/span> int_to_char<br \/>\n    <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token comment\"># Model parameters (state dicts for modules, tensors for raw parameters)<\/span><br \/>\n    <span class=\"token string\">&#039;token_embedding_table&#039;<\/span><span class=\"token punctuation\">:<\/span> token_embedding_table<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;rmsnorm_weights_input&#039;<\/span><span class=\"token punctuation\">:<\/span> rmsnorm_weights_input<span class=\"token punctuation\">,<\/span> <span class=\"token comment\"># list of parameters<\/span><br \/>\n    <span class=\"token string\">&#039;rmsnorm_weights_post_attn&#039;<\/span><span class=\"token punctuation\">:<\/span> rmsnorm_weights_post_attn<span class=\"token punctuation\">,<\/span> <span class=\"token comment\"># list of parameters<\/span><br \/>\n    <span class=\"token string\">&#039;final_rmsnorm_weight&#039;<\/span><span class=\"token punctuation\">:<\/span> final_rmsnorm_weight<span class=\"token punctuation\">,<\/span> <span class=\"token comment\"># single parameter<\/span><br \/>\n    <span class=\"token string\">&#039;mha_qkv_linears&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>l<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> l <span class=\"token keyword\">in<\/span> mha_qkv_linears<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;mha_output_linears&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>l<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> l <span class=\"token keyword\">in<\/span> mha_output_linears<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;moe_routers&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>r<span class=\"token punctuation\">.<\/span>state_dict<span 
class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> r <span class=\"token keyword\">in<\/span> moe_routers<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;moe_expert_gate_up_proj&#039;<\/span><span class=\"token punctuation\">:<\/span> moe_expert_gate_up_proj<span class=\"token punctuation\">,<\/span> <span class=\"token comment\"># list of parameters<\/span><br \/>\n    <span class=\"token string\">&#039;moe_expert_down_proj&#039;<\/span><span class=\"token punctuation\">:<\/span> moe_expert_down_proj<span class=\"token punctuation\">,<\/span> <span class=\"token comment\"># list of parameters<\/span><br \/>\n    <span class=\"token string\">&#039;shared_expert_gate_proj&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>l<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> l <span class=\"token keyword\">in<\/span> shared_expert_gate_proj<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;shared_expert_up_proj&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>l<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> l <span class=\"token keyword\">in<\/span> shared_expert_up_proj<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;shared_expert_down_proj&#039;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span>l<span class=\"token 
punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> l <span class=\"token keyword\">in<\/span> shared_expert_down_proj<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string\">&#039;output_linear_layer&#039;<\/span><span class=\"token punctuation\">:<\/span> output_linear_layer<span class=\"token punctuation\">.<\/span>state_dict<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token comment\"># Note: RoPE inv_freq is not saved because it can be recomputed from the config<\/span><br \/>\n<span class=\"token punctuation\">}<\/span><\/p>\n<p><span class=\"token comment\"># Save the state dictionary<\/span><br \/>\ntorch<span class=\"token punctuation\">.<\/span>save<span class=\"token punctuation\">(<\/span>model_state<span class=\"token punctuation\">,<\/span> save_path<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Model state saved successfully to &#039;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>save_path<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#039;&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>Every essential part of our trained model (the configuration, the tokenizer, and all learnable weights) is packed into a single dictionary and saved to the file
saved_models\/llama4_moe_model.pt.<\/p>\n<p>We can then write separate code that loads this file and uses the model for generation without rerunning the whole training process.<\/p>\n<h4>Conclusion<\/h4>\n<p>To recap, we covered:<\/p>\n<li>Setup and tokenization: basic environment setup and character-level tokenization.<\/li>\n<li>Hyperparameter definition: configuration values scaled down from the large models.<\/li>\n<li>Data preparation: creating input\/target sequences for next-token prediction.<\/li>\n<li>Model initialization (inline): explicitly creating and initializing the components, including the token embeddings, RMSNorm weights, attention linear layers, RoPE frequency base, MoE routers, MoE expert weights, shared-expert MLPs, and the final output layer.<\/li>\n<li>Training loop (inline): implementing the full forward pass inside the loop, showing:<\/li>\n<ul>\n<li>Applying RMSNorm.<\/li>\n<li>Computing and applying RoPE inside the MHA block.<\/li>\n<li>The MoE forward pass: routing, expert selection (Top-K), parallel expert computation (using BMM), combining the expert outputs (scatter_add_), and integration with the shared-expert MLP.<\/li>\n<li>Standard Transformer
operations such as residual connections and attention.<\/li>\n<li>Loss computation, backpropagation, and the optimizer step.<\/li>\n<\/ul>\n<li>Text generation: autoregressive sampling with the trained model components in evaluation mode.<\/li>\n","protected":false},"excerpt":{"rendered":"<p>First, we look at the LLaMA 4 architecture from a developer's point of view, then follow an example through the architecture to understand it more clearly. Imagine you have a very demanding task. Rather than hiring one person who knows a little about everything, you hire a team in which each member is an expert in a specific field (an electrician, a plumber, a painter). You also hire a manager who looks at the task at hand and assigns it to the most suitable expert. MoE in AI models works a bit like this. A set of \u201cexperts\u201d: these are smaller, specialized neural networks (usually simple feed-forward networks or MLPs). Each expert may be good at handling certain types of information or patterns._python 
llama<\/p>\n","protected":false},"author":2,"featured_media":35910,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[3066,66,3067,347,81,50,2983],"topic":[],"class_list":["post-35924","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-server","tag-llm","tag-ai","tag-ai-agents","tag-llama","tag-python","tag-50","tag-2983"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.wsisp.com\/helps\/35924.html\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\" \/>\n<meta property=\"og:description\" content=\"\u6587\u7ae0\u6d4f\u89c8\u9605\u8bfb2.3k\u6b21\uff0c\u70b9\u8d5e79\u6b21\uff0c\u6536\u85cf63\u6b21\u3002\u9996\u5148\uff0c\u6211\u4eec\u4ee5\u7684\u5f00\u53d1\u4eba\u5458\u8eab\u4efd\u6765\u7406\u89e3 LLaMA 4 
\u67b6\u6784\uff0c\u7136\u540e\u901a\u8fc7\u4e00\u4e2a\u4f8b\u5b50\u6765\u770b\u770b\u5b83\u662f\u5982\u4f55\u901a\u8fc7\u67b6\u6784\u5904\u7406\u7684\uff0c\u4ee5\u4fbf\u66f4\u6e05\u6670\u5730\u7406\u89e3\u3002\u60f3\u8c61\u4e00\u4e0b\uff0c\u4f60\u6709\u4e00\u4e2a\u975e\u5e38\u8270\u5de8\u7684\u4efb\u52a1\u3002\u4e0e\u5176\u96c7\u4f63\u4e00\u4e2a\u5bf9\u4ec0\u4e48\u90fd\u61c2\u4e00\u70b9\u7684\u4eba\uff0c\u4e0d\u5982\u96c7\u4f63\u4e00\u4e2a\u56e2\u961f\uff0c\u6bcf\u4e2a\u6210\u5458\u90fd\u662f\u67d0\u4e2a\u7279\u5b9a\u9886\u57df\u7684\u4e13\u5bb6\uff08\u6bd4\u5982\u7535\u5de5\u3001\u6c34\u7ba1\u5de5\u3001\u6cb9\u6f06\u5de5\uff09\u3002\u4f60\u8fd8\u4f1a\u96c7\u4f63\u4e00\u4e2a\u7ecf\u7406\uff0c\u4ed6\u67e5\u770b\u5f53\u524d\u7684\u4efb\u52a1\uff0c\u5e76\u5c06\u5176\u5206\u914d\u7ed9\u6700\u9002\u5408\u7684\u4e13\u5bb6\u3002AI \u6a21\u578b\u4e2d\u7684 MoE \u5c31\u6709\u70b9\u50cf\u8fd9\u6837\u3002\u4e00\u7ec4\u201c\u4e13\u5bb6\u201d\uff1a\u8fd9\u4e9b\u662f\u8f83\u5c0f\u7684\u3001\u4e13\u95e8\u5316\u7684\u795e\u7ecf\u7f51\u7edc\uff08\u901a\u5e38\u662f\u7b80\u5355\u7684\u524d\u9988\u7f51\u7edc\u6216 MLP\uff09\u3002\u6bcf\u4e2a\u4e13\u5bb6\u53ef\u80fd\u64c5\u957f\u5904\u7406\u67d0\u4e9b\u7c7b\u578b\u7684\u4fe1\u606f\u6216\u6a21\u5f0f\u3002_python llama\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.wsisp.com\/helps\/35924.html\" \/>\n<meta property=\"og:site_name\" content=\"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\" \/>\n<meta property=\"article:published_time\" content=\"2025-05-07T01:09:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010936-681ab2d0d6731.png\" \/>\n<meta name=\"author\" content=\"admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" 
\/>\n\t<meta name=\"twitter:data2\" content=\"23 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/35924.html\",\"url\":\"https:\/\/www.wsisp.com\/helps\/35924.html\",\"name\":\"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\",\"isPartOf\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/#website\"},\"datePublished\":\"2025-05-07T01:09:38+00:00\",\"dateModified\":\"2025-05-07T01:09:38+00:00\",\"author\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41\"},\"breadcrumb\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/35924.html#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.wsisp.com\/helps\/35924.html\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/35924.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u9996\u9875\",\"item\":\"https:\/\/www.wsisp.com\/helps\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#website\",\"url\":\"https:\/\/www.wsisp.com\/helps\/\",\"name\":\"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\",\"description\":\"\u9999\u6e2f\u670d\u52a1\u5668_\u9999\u6e2f\u4e91\u670d\u52a1\u5668\u8d44\u8baf_\u670d\u52a1\u5668\u5e2e\u52a9\u6587\u6863_\u670d\u52a1\u5668\u6559\u7a0b\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.wsisp.com\/helps\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41\",\"name\":\"admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery\",\"contentUrl\":\"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery\",\"caption\":\"admin\"},\"sameAs\":[\"http:\/\/wp.wsisp.com\"],\"url\":\"https:\/\/www.wsisp.com\/helps\/author\/admin\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.wsisp.com\/helps\/35924.html","og_locale":"zh_CN","og_type":"article","og_title":"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","og_description":"\u6587\u7ae0\u6d4f\u89c8\u9605\u8bfb2.3k\u6b21\uff0c\u70b9\u8d5e79\u6b21\uff0c\u6536\u85cf63\u6b21\u3002\u9996\u5148\uff0c\u6211\u4eec\u4ee5\u7684\u5f00\u53d1\u4eba\u5458\u8eab\u4efd\u6765\u7406\u89e3 LLaMA 4 
\u67b6\u6784\uff0c\u7136\u540e\u901a\u8fc7\u4e00\u4e2a\u4f8b\u5b50\u6765\u770b\u770b\u5b83\u662f\u5982\u4f55\u901a\u8fc7\u67b6\u6784\u5904\u7406\u7684\uff0c\u4ee5\u4fbf\u66f4\u6e05\u6670\u5730\u7406\u89e3\u3002\u60f3\u8c61\u4e00\u4e0b\uff0c\u4f60\u6709\u4e00\u4e2a\u975e\u5e38\u8270\u5de8\u7684\u4efb\u52a1\u3002\u4e0e\u5176\u96c7\u4f63\u4e00\u4e2a\u5bf9\u4ec0\u4e48\u90fd\u61c2\u4e00\u70b9\u7684\u4eba\uff0c\u4e0d\u5982\u96c7\u4f63\u4e00\u4e2a\u56e2\u961f\uff0c\u6bcf\u4e2a\u6210\u5458\u90fd\u662f\u67d0\u4e2a\u7279\u5b9a\u9886\u57df\u7684\u4e13\u5bb6\uff08\u6bd4\u5982\u7535\u5de5\u3001\u6c34\u7ba1\u5de5\u3001\u6cb9\u6f06\u5de5\uff09\u3002\u4f60\u8fd8\u4f1a\u96c7\u4f63\u4e00\u4e2a\u7ecf\u7406\uff0c\u4ed6\u67e5\u770b\u5f53\u524d\u7684\u4efb\u52a1\uff0c\u5e76\u5c06\u5176\u5206\u914d\u7ed9\u6700\u9002\u5408\u7684\u4e13\u5bb6\u3002AI \u6a21\u578b\u4e2d\u7684 MoE \u5c31\u6709\u70b9\u50cf\u8fd9\u6837\u3002\u4e00\u7ec4\u201c\u4e13\u5bb6\u201d\uff1a\u8fd9\u4e9b\u662f\u8f83\u5c0f\u7684\u3001\u4e13\u95e8\u5316\u7684\u795e\u7ecf\u7f51\u7edc\uff08\u901a\u5e38\u662f\u7b80\u5355\u7684\u524d\u9988\u7f51\u7edc\u6216 MLP\uff09\u3002\u6bcf\u4e2a\u4e13\u5bb6\u53ef\u80fd\u64c5\u957f\u5904\u7406\u67d0\u4e9b\u7c7b\u578b\u7684\u4fe1\u606f\u6216\u6a21\u5f0f\u3002_python llama","og_url":"https:\/\/www.wsisp.com\/helps\/35924.html","og_site_name":"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","article_published_time":"2025-05-07T01:09:38+00:00","og_image":[{"url":"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2025\/05\/20250507010936-681ab2d0d6731.png"}],"author":"admin","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"23 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.wsisp.com\/helps\/35924.html","url":"https:\/\/www.wsisp.com\/helps\/35924.html","name":"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa - 
\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","isPartOf":{"@id":"https:\/\/www.wsisp.com\/helps\/#website"},"datePublished":"2025-05-07T01:09:38+00:00","dateModified":"2025-05-07T01:09:38+00:00","author":{"@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41"},"breadcrumb":{"@id":"https:\/\/www.wsisp.com\/helps\/35924.html#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.wsisp.com\/helps\/35924.html"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.wsisp.com\/helps\/35924.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u9996\u9875","item":"https:\/\/www.wsisp.com\/helps"},{"@type":"ListItem","position":2,"name":"\u8be6\u89e3\u5982\u4f55\u590d\u73b0LLaMA 4:\u4ece\u96f6\u5f00\u59cb\u5229\u7528Python\u6784\u5efa"}]},{"@type":"WebSite","@id":"https:\/\/www.wsisp.com\/helps\/#website","url":"https:\/\/www.wsisp.com\/helps\/","name":"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","description":"\u9999\u6e2f\u670d\u52a1\u5668_\u9999\u6e2f\u4e91\u670d\u52a1\u5668\u8d44\u8baf_\u670d\u52a1\u5668\u5e2e\u52a9\u6587\u6863_\u670d\u52a1\u5668\u6559\u7a0b","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.wsisp.com\/helps\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"zh-Hans"},{"@type":"Person","@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41","name":"admin","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/image\/","url":"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery","contentUrl":"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery","caption":"admin"},"sameAs":["http:\/\/wp.wsisp.com"],"url":"https:\/\/www.wsisp.com\/helps\/author\/admin"}]}},"_links":{"self":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts\/35924","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/comments?post=35924"}],"version-history":[{"count":0,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts\/35924\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/media\/35910"}],"wp:attachment":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/media?parent=35924"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/categories?post=35924"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/tags?post=35924"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/topic?post=35924"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}