{"id":78952,"date":"2026-02-28T17:42:39","date_gmt":"2026-02-28T09:42:39","guid":{"rendered":"https:\/\/www.wsisp.com\/helps\/78952.html"},"modified":"2026-02-28T17:42:39","modified_gmt":"2026-02-28T09:42:39","slug":"%e6%96%af%e5%9d%a6%e7%a6%8f%e5%a4%a7%e5%ad%a6-cs336-%e4%bb%8e%e9%9b%b6%e5%bc%80%e5%a7%8b%e6%9e%84%e5%bb%ba%e8%af%ad%e8%a8%80%e6%a8%a1%e5%9e%8b-spring-2025-%e7%ac%94%e8%ae%b0-assignment-3-sc","status":"publish","type":"post","link":"https:\/\/www.wsisp.com\/helps\/78952.html","title":{"rendered":"Stanford University | CS336 | Language Modeling from Scratch | Spring 2025 | Notes | Assignment 3: Scaling Laws Implementation"},"content":{"rendered":"<\/p>\n<h4>Contents<\/h4>\n<ul>\n<li>\n<ul>\n<li>Preface<\/li>\n<li>1. Problem (chinchilla_isoflops): 5 points<\/li>\n<li>2. Problem (scaling_laws): 50 points<\/li>\n<li>\n<ul>\n<li>2.1 API Calls and Caching Layer Script<\/li>\n<li>2.2 Experiment Design \/ Search Script<\/li>\n<li>2.3 Scaling Law Fitting and Prediction Script<\/li>\n<li>2.4 Overall Design Analysis<\/li>\n<\/ul>\n<\/li>\n<li>Conclusion<\/li>\n<li>Source Code Download Link<\/li>\n<li>References<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Preface<\/h3>\n<p>The previous post, Stanford University | CS336 | Language Modeling from Scratch | Spring 2025 | Notes | Assignment 3: Scaling Laws, covered the requirements of the Scaling Laws assignment. This post walks through how to implement them: it records my implementation of the Scaling Laws part of CS336 Assignment 3: Scaling, for my own reference only &#x1f604;<\/p>\n<p>Note: I did not follow the from-scratch principle; almost all of the code was written by ChatGPT.<\/p>\n<p>Assignment 3: https:\/\/github.com\/stanford-cs336\/assignment3-scaling<\/p>\n<p>reference: https:\/\/chatgpt.com\/<\/p>\n<h3>1. Problem (chinchilla_isoflops): 5 points<\/h3>\n<p>Write a script that reproduces the IsoFLOPs method described above, fitting scaling laws from the final training loss of multiple training runs.<\/p>\n<p>For this problem, use the (synthetic) training-run data in data\/isoflops_curves.json. The file contains a JSON array in which each element is an object describing one training run. The first two runs are shown below to illustrate the format:<\/p>\n<p><span class=\"token punctuation\">[<\/span><br \/>\n  <span class=\"token punctuation\">{<\/span><br \/>\n    <span class=\"token string-property property\">&#034;parameters&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">4999999<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string-property property\">&#034;compute_budget&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">6e&#043;18<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string-property property\">&#034;final_loss&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">7.192784500319437<\/span><br 
\/>\n  <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n  <span class=\"token punctuation\">{<\/span><br \/>\n    <span class=\"token string-property property\">&#034;parameters&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">78730505<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string-property property\">&#034;compute_budget&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">6e&#043;18<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token string-property property\">&#034;final_loss&#034;<\/span><span class=\"token operator\">:<\/span> <span class=\"token number\">6.750171320661809<\/span><br \/>\n  <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n  <span class=\"token operator\">&#8230;<\/span><br \/>\n<span class=\"token punctuation\">]<\/span><\/p>\n<p>When fitting the scaling laws, the scipy package (in particular scipy.optimize.curve_fit) may be useful, though you are free to use any curve-fitting method you like. Although [Hoffmann&#043; 2022] fits a quadratic function to each IsoFLOPs curve to find its minimum, the assignment instead recommends simply taking the run with the lowest training loss at each compute budget as the optimum.<\/p>\n<p>1. 
Compute-optimal scaling law for model size<\/p>\n<p>Show your extrapolated compute-optimal model size, together with the <span class=\"katex--inline\">$(C_i, N_{\\text{opt}}(C_i))$<\/span> data points you obtained.<\/p>\n<ul>\n<li>What is your predicted optimal model size for a compute budget of <span class=\"katex--inline\">$10^{23}$<\/span> FLOPs?<\/li>\n<li>And for a budget of <span class=\"katex--inline\">$10^{24}$<\/span> FLOPs?<\/li>\n<\/ul>\n<p>Deliverable: a plot of the scaling law for model size versus compute budget, showing the data points used for the fit and extrapolating the curve to at least <span class=\"katex--inline\">$10^{24}$<\/span> FLOPs; plus a one-sentence response giving your predicted optimal model size.<\/p>\n<p>2. Compute-optimal scaling law for dataset size<\/p>\n<p>Show your extrapolated compute-optimal dataset size, together with the <span class=\"katex--inline\">$(C_i, D_{\\text{opt}}(C_i))$<\/span> data points (from the training runs).<\/p>\n<ul>\n<li>What is your predicted optimal dataset size for a compute budget of <span class=\"katex--inline\">$10^{23}$<\/span> FLOPs?<\/li>\n<li>And for a budget of <span class=\"katex--inline\">$10^{24}$<\/span> FLOPs?<\/li>\n<\/ul>\n<p>Deliverable: a plot of the scaling law for dataset size versus compute budget, showing the data points used for the fit and extrapolating the curve to at least <span class=\"katex--inline\">$10^{24}$<\/span> FLOPs; plus a one-sentence response giving your predicted optimal dataset size.<\/p>\n<p>The implementation is as follows:<\/p>\n<p><span class=\"token keyword\">import<\/span> argparse<br \/>\n<span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> dataclass<br \/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Dict<span class=\"token punctuation\">,<\/span> List<span class=\"token punctuation\">,<\/span> Tuple<\/p>\n<p><span class=\"token keyword\">import<\/span> numpy <span class=\"token keyword\">as<\/span> np<br \/>\n<span class=\"token keyword\">import<\/span> matplotlib<span class=\"token punctuation\">.<\/span>pyplot <span class=\"token keyword\">as<\/span> plt<\/p>\n<p><span class=\"token decorator annotation punctuation\">&#064;dataclass<\/span><span class=\"token punctuation\">(<\/span>frozen<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">class<\/span> <span class=\"token class-name\">Run<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    parameters<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span>       <span class=\"token comment\"># N<\/span><br \/>\n    
compute_budget<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span>   <span class=\"token comment\"># C<\/span><br \/>\n    final_loss<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span>       <span class=\"token comment\"># L<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">load_runs<\/span><span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">:<\/span> Path<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> List<span class=\"token punctuation\">[<\/span>Run<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    data <span class=\"token operator\">&#061;<\/span> json<span class=\"token punctuation\">.<\/span>loads<span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">.<\/span>read_text<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    runs<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>Run<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> r <span class=\"token keyword\">in<\/span> data<span class=\"token punctuation\">:<\/span><br \/>\n        runs<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span><br \/>\n            Run<span class=\"token punctuation\">(<\/span><br \/>\n                parameters<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>r<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;parameters&#034;<\/span><span 
class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                compute_budget<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>r<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;compute_budget&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                final_loss<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>r<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;final_loss&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> runs<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">select_opt_points<\/span><span class=\"token punctuation\">(<\/span>runs<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>Run<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> Run<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;For each compute budget C, pick the run with the lowest final_loss&#034;&#034;&#034;<\/span><br \/>\n    best<span class=\"token punctuation\">:<\/span> 
Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> Run<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> r <span class=\"token keyword\">in<\/span> runs<span class=\"token punctuation\">:<\/span><br \/>\n        C <span class=\"token operator\">&#061;<\/span> r<span class=\"token punctuation\">.<\/span>compute_budget<br \/>\n        <span class=\"token keyword\">if<\/span> C <span class=\"token keyword\">not<\/span> <span class=\"token keyword\">in<\/span> best <span class=\"token keyword\">or<\/span> r<span class=\"token punctuation\">.<\/span>final_loss <span class=\"token operator\">&lt;<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>final_loss<span class=\"token punctuation\">:<\/span><br \/>\n            best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> r<br \/>\n    <span class=\"token keyword\">return<\/span> best<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">fit_power_law<\/span><span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span> ys<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Tuple<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token 
punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Fit y &#061; k * x^a via log-log linear regression:<br \/>\n      log(y) &#061; log(k) &#043; a * log(x)<br \/>\n    Returns (k, a).<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">any<\/span><span class=\"token punctuation\">(<\/span>xs <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">or<\/span> np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">any<\/span><span class=\"token punctuation\">(<\/span>ys <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;x and y must be positive for log-log fit.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    lx <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">)<\/span><br \/>\n    ly <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>ys<span class=\"token punctuation\">)<\/span><br \/>\n    a<span class=\"token punctuation\">,<\/span> logk <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>polyfit<span class=\"token punctuation\">(<\/span>lx<span class=\"token punctuation\">,<\/span> ly<span class=\"token punctuation\">,<\/span> deg<span class=\"token operator\">&#061;<\/span><span 
class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># slope&#061;a, intercept&#061;logk<\/span><br \/>\n    k <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>exp<span class=\"token punctuation\">(<\/span>logk<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> k<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>a<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">predict_power_law<\/span><span class=\"token punctuation\">(<\/span>k<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> a<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> x<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> k <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>x <span class=\"token operator\">**<\/span> a<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">plot_scaling<\/span><span class=\"token punctuation\">(<\/span><br \/>\n    x_points<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span><br \/>\n 
   y_points<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span><br \/>\n    k<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    a<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    out_path<span class=\"token punctuation\">:<\/span> Path<span class=\"token punctuation\">,<\/span><br \/>\n    title<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    y_label<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    x_min<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    x_max<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><br \/>\n<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    xs <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>logspace<span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span>x_min<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> np<span class=\"token punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span>x_max<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">300<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ys <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>k<span class=\"token punctuation\">,<\/span> a<span class=\"token 
punctuation\">,<\/span> xs<span class=\"token punctuation\">)<\/span><\/p>\n<p>    plt<span class=\"token punctuation\">.<\/span>figure<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>x_points<span class=\"token punctuation\">,<\/span> y_points<span class=\"token punctuation\">,<\/span> marker<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;o&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;None&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;opt points&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">,<\/span> ys<span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;-&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span class=\"token operator\">&#061;<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;fit: y &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>k<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>a<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token 
punctuation\">.<\/span>xlabel<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Compute budget C (FLOPs)&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>ylabel<span class=\"token punctuation\">(<\/span>y_label<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>title<span class=\"token punctuation\">(<\/span>title<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>grid<span class=\"token punctuation\">(<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> which<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;both&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;--&#034;<\/span><span class=\"token punctuation\">,<\/span> linewidth<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>legend<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>tight_layout<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>savefig<span class=\"token punctuation\">(<\/span>out_path<span class=\"token punctuation\">,<\/span> dpi<span class=\"token operator\">&#061;<\/span><span class=\"token number\">200<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>close<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token 
punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    ap <span class=\"token operator\">&#061;<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--data&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;data\/isoflops_curves.json&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">help<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Path to data\/isoflops_curves.json&#034;<\/span><span class=\"token punctuation\">,<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--outdir&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;runs\/isoflops&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">help<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Directory to write plots\/results&#034;<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n    args <span class=\"token operator\">&#061;<\/span> ap<span class=\"token punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    runs <span class=\"token operator\">&#061;<\/span> load_runs<span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>data<span class=\"token punctuation\">)<\/span><br \/>\n    best <span class=\"token operator\">&#061;<\/span> select_opt_points<span class=\"token punctuation\">(<\/span>runs<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Sort by compute budget<\/span><br \/>\n    budgets <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">sorted<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">.<\/span>keys<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    n_opt <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> budgets<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    d_opt <span 
class=\"token operator\">&#061;<\/span> budgets <span class=\"token operator\">\/<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">6.0<\/span> <span class=\"token operator\">*<\/span> n_opt<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Fit power laws<\/span><br \/>\n    kN<span class=\"token punctuation\">,<\/span> aN <span class=\"token operator\">&#061;<\/span> fit_power_law<span class=\"token punctuation\">(<\/span>budgets<span class=\"token punctuation\">,<\/span> n_opt<span class=\"token punctuation\">)<\/span><br \/>\n    kD<span class=\"token punctuation\">,<\/span> aD <span class=\"token operator\">&#061;<\/span> fit_power_law<span class=\"token punctuation\">(<\/span>budgets<span class=\"token punctuation\">,<\/span> d_opt<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Predictions required by the problem<\/span><br \/>\n    targets <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1e23<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e24<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    pred_N <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>kN<span class=\"token punctuation\">,<\/span> aN<span class=\"token punctuation\">,<\/span> targets<span class=\"token punctuation\">)<\/span><br \/>\n    pred_D <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>kD<span class=\"token punctuation\">,<\/span> aD<span class=\"token punctuation\">,<\/span> 
targets<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Print results<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; IsoFLOPs opt points (C, N_opt, D_opt, loss) &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> budgets<span class=\"token punctuation\">:<\/span><br \/>\n        r <span class=\"token operator\">&#061;<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;C&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  N_opt&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  D_opt&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token operator\">\/<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6<\/span><span class=\"token operator\">*<\/span>r<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  
loss&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>final_loss<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n&#061;&#061;&#061; Power-law fits &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;N_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>kN<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>aN<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;D_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>kD<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>aD<span class=\"token 
punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n&#061;&#061;&#061; Extrapolated predictions &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C<span class=\"token punctuation\">,<\/span> Np<span class=\"token punctuation\">,<\/span> Dp <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">zip<\/span><span class=\"token punctuation\">(<\/span>targets<span class=\"token punctuation\">,<\/span> pred_N<span class=\"token punctuation\">,<\/span> pred_D<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;C&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.1e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">:  N_opt\u2248<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>Np<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> params,  D_opt\u2248<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>Dp<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> tokens&#034;<\/span><\/span><span class=\"token 
punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Plot range: cover observed budgets and extrapolate to 1e24<\/span><br \/>\n    x_min <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span>budgets<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e16<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># just in case; won&#039;t hurt<\/span><br \/>\n    x_max <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">1e24<\/span><\/p>\n<p>    args<span class=\"token punctuation\">.<\/span>outdir<span class=\"token punctuation\">.<\/span>mkdir<span class=\"token punctuation\">(<\/span>parents<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> exist_ok<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plot_scaling<span class=\"token punctuation\">(<\/span><br \/>\n        x_points<span class=\"token operator\">&#061;<\/span>budgets<span class=\"token punctuation\">,<\/span><br \/>\n        y_points<span class=\"token operator\">&#061;<\/span>n_opt<span class=\"token punctuation\">,<\/span><br \/>\n        k<span class=\"token operator\">&#061;<\/span>kN<span class=\"token punctuation\">,<\/span><br \/>\n        a<span class=\"token operator\">&#061;<\/span>aN<span class=\"token punctuation\">,<\/span><br \/>\n        out_path<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token 
operator\">\/<\/span> <span class=\"token string\">&#034;n_opt_vs_compute.png&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        title<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Compute-optimal model size (IsoFLOPs)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        y_label<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;N_opt (parameters)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        x_min<span class=\"token operator\">&#061;<\/span>x_min<span class=\"token punctuation\">,<\/span><br \/>\n        x_max<span class=\"token operator\">&#061;<\/span>x_max<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><br \/>\n    plot_scaling<span class=\"token punctuation\">(<\/span><br \/>\n        x_points<span class=\"token operator\">&#061;<\/span>budgets<span class=\"token punctuation\">,<\/span><br \/>\n        y_points<span class=\"token operator\">&#061;<\/span>d_opt<span class=\"token punctuation\">,<\/span><br \/>\n        k<span class=\"token operator\">&#061;<\/span>kD<span class=\"token punctuation\">,<\/span><br \/>\n        a<span class=\"token operator\">&#061;<\/span>aD<span class=\"token punctuation\">,<\/span><br \/>\n        out_path<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;d_opt_vs_compute.png&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        title<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Compute-optimal dataset size (IsoFLOPs)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        y_label<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;D_opt (tokens)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        x_min<span 
class=\"token operator\">&#061;<\/span>x_min<span class=\"token punctuation\">,<\/span><br \/>\n        x_max<span class=\"token operator\">&#061;<\/span>x_max<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Save a small json for writeup convenience<\/span><br \/>\n    result <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#034;opt_points&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span><br \/>\n            <span class=\"token punctuation\">{<\/span><br \/>\n                <span class=\"token string\">&#034;compute_budget&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;n_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;d_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>C <span class=\"token operator\">\/<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">6.0<\/span> <span class=\"token operator\">*<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters<span class=\"token 
punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;loss&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>final_loss<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">}<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> budgets<br \/>\n        <span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;fit&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token string\">&#034;n_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> kN<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> aN<span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;d_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> kD<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> aD<span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token 
punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;predictions&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span><br \/>\n            <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;compute_budget&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;n_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>Np<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;d_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>Dp<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> C<span class=\"token punctuation\">,<\/span> Np<span class=\"token punctuation\">,<\/span> Dp <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">zip<\/span><span class=\"token punctuation\">(<\/span>targets<span class=\"token punctuation\">,<\/span> pred_N<span class=\"token punctuation\">,<\/span> pred_D<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;isoflops_fit.json&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">.<\/span>write_text<span class=\"token punctuation\">(<\/span>json<span class=\"token punctuation\">.<\/span>dumps<span class=\"token punctuation\">(<\/span>result<span class=\"token punctuation\">,<\/span> indent<span class=\"token operator\">&#061;<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;\\nWrote plots &#043; json to: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>args<span class=\"token punctuation\">.<\/span>outdir<span class=\"token punctuation\">.<\/span>resolve<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">if<\/span> __name__ <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;__main__&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    main<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>Run the script with the following command:<\/p>\n<p>python cs336_scaling\/chinchilla_isoflops.py<\/p>\n<p>Running it produces the following output:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2026\/02\/20260228094237-69a2b88d0db57.png\" alt=\"Output of running chinchilla_isoflops.py\" \/><\/p>\n<p>The fitted scaling law for model size as a function of compute budget is plotted below:<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2026\/02\/20260228094237-69a2b88d4a29e.png\" 
width=\"800\" \/><\/p>\n<p>The fitted scaling law for dataset size as a function of compute budget is plotted below:<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2026\/02\/20260228094237-69a2b88dbf194.png\" width=\"800\" \/><\/p>\n<p>Using the provided IsoFLOPs curve data, for each compute budget <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">C_i<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> we take, among the training runs at different model sizes, the run with the lowest final loss as the compute-optimal point for that budget. This yields a set of optimal points <span 
class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\{(C_i,N_{\\text{opt}}(C_i))\\}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mopen\">{(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">))}<\/span><\/span><\/span><\/span><\/span>. We then use the common Chinchilla compute approximation<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">C \\approx 6ND<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord\">6<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>to recover the dataset size at each optimal point as<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">D_{\\text{opt}}(C_i) &#061; \\frac{C_i}{6N_{\\text{opt}}(C_i)}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 
0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0278em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 2.3324em;vertical-align: -0.9721em\"><\/span><span class=\"mord\"><span class=\"mopen nulldelimiter\"><\/span><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 1.3603em\"><span class=\"\" style=\"top: -2.314em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mord\">6<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span 
class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><span class=\"\" style=\"top: -3.23em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"frac-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3.677em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 
2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9721em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mclose nulldelimiter\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>\u57fa\u4e8e\u8fd9\u4e9b\u6700\u4f18\u70b9&#xff0c;\u6211\u4eec\u5728 log-log \u7a7a\u95f4\u5bf9 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">N_{\\text{opt}}(C)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: 
-0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> \u4e0e <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         D<\/p>\n<p>         opt<\/p>\n<p>        (<\/p>\n<p>        C<\/p>\n<p>        )<\/p>\n<p>       D_{\\\\text{opt}}(C)<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0278em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> \u5206\u522b\u62df\u5408\u5e42\u5f8b\u5173\u7cfb <span class=\"katex--inline\"><span 
class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        y<\/p>\n<p>        &#061;<\/p>\n<p>        k<\/p>\n<p>         C<\/p>\n<p>         a<\/p>\n<p>       y&#061;kC^{a}<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.625em;vertical-align: -0.1944em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0359em\">y<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6944em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6644em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>&#xff0c;\u5f97\u5230&#xff1a;<\/p>\n<ul>\n<li>Compute-optimal \u6a21\u578b\u53c2\u6570\u89c4\u6a21 <span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>           N<\/p>\n<p>           opt<\/p>\n<p>          (<\/p>\n<p>          C<\/p>\n<p>          )<\/p>\n<p>          \u2248<\/p>\n<p>          1.16341<\/p>\n<p>          \u22c5<\/p>\n<p>           C<\/p>\n<p>           0.46868<\/p>\n<p>         N_{\\\\text{opt}}(C) \\\\approx 1.16341 \\\\cdot C^{0.46868}<\/p>\n<p>      <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord 
mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">1.16341<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8641em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">0.46868<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li>Compute-optimal \u6570\u636e token \u89c4\u6a21 <span 
class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>           D<\/p>\n<p>           opt<\/p>\n<p>          (<\/p>\n<p>          C<\/p>\n<p>          )<\/p>\n<p>          \u2248<\/p>\n<p>          0.14326<\/p>\n<p>          \u22c5<\/p>\n<p>           C<\/p>\n<p>           0.53132<\/p>\n<p>         D_{\\\\text{opt}}(C) \\\\approx 0.14326 \\\\cdot C^{0.53132}<\/p>\n<p>      <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0278em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">0.14326<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8641em\"><\/span><span 
class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">0.53132<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<\/ul>\n<p>\u62df\u5408\u5f97\u5230\u7684\u6307\u6570\u6ee1\u8db3 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        0.46868<\/p>\n<p>        &#043;<\/p>\n<p>        0.53132<\/p>\n<p>        \u2248<\/p>\n<p>        1<\/p>\n<p>       0.46868 &#043; 0.53132 \\\\approx 1<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.7278em;vertical-align: -0.0833em\"><\/span><span class=\"mord\">0.46868<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">&#043;<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">0.53132<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">1<\/span><\/span><\/span><\/span><\/span>&#xff0c;\u4e0e <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        C<\/p>\n<p>        \u2248<\/p>\n<p>        6<\/p>\n<p>        N<\/p>\n<p>        D<\/p>\n<p>       C \\\\approx 6 ND<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 
0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord\">6<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><\/span><\/span><\/span><\/span> \u7684\u4e58\u6cd5\u7ea6\u675f\u4e00\u81f4&#xff08;\u5373\u6a21\u578b\u89c4\u6a21\u4e0e\u6570\u636e\u89c4\u6a21\u5728\u8ba1\u7b97\u9884\u7b97\u589e\u957f\u4e0b\u5206\u644a\u589e\u957f&#xff09;\u3002\u5bf9\u5e94\u7684\u6700\u4f18\u70b9\u5728\u4e24\u5f20 log-log \u56fe\u4e2d\u57fa\u672c\u6cbf\u62df\u5408\u76f4\u7ebf\u5206\u5e03&#xff0c;\u8bf4\u660e\u8be5\u5e42\u5f8b\u5bf9\u7ed9\u5b9a\u9884\u7b97\u8303\u56f4\u5185\u7684\u6570\u636e\u5177\u6709\u826f\u597d\u89e3\u91ca\u529b\u3002<\/p>\n<p>\u6309\u9898\u76ee\u8981\u6c42\u5c06\u5e42\u5f8b\u5916\u63a8\u5230\u66f4\u5927\u8ba1\u7b97\u9884\u7b97&#xff0c;\u5f97\u5230&#xff1a;<\/p>\n<ul>\n<li>\u5f53 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>         C<\/p>\n<p>         &#061;<\/p>\n<p>          10<\/p>\n<p>          23<\/p>\n<p>        C&#061;10^{23}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">23<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> FLOPs: <\/p>\n<ul>\n<li><span class=\"katex--inline\">$N_{\\text{opt}} \\approx 7.01 \\times 10^{10}$<\/span> (about 70B parameters)<\/li>\n<li><span class=\"katex--inline\">$D_{\\text{opt}} \\approx 2.38 \\times 10^{11}$<\/span> tokens (about 238B tokens)<\/li>\n<\/ul>\n<\/li>\n<li>When <span class=\"katex--inline\">$C = 10^{24}$<\/span> FLOPs:\n<ul>\n<li><span class=\"katex--inline\">$N_{\\text{opt}} \\approx 2.06 \\times 10^{11}$<\/span> (about 206B parameters)<\/li>\n<li><span class=\"katex--inline\">$D_{\\text{opt}} \\approx 8.09 \\times 10^{11}$<\/span> tokens (about 809B tokens)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Finally, let us briefly walk through the implementation.<\/p>\n<p>The core goal of the script above is to recover, from data\/isoflops_curves.json, the scaling relationship compute budget \u2192 compute-optimal model size \/ dataset size, and to extrapolate it to larger budgets.<\/p>\n<p>1) Loading the data and normalizing fields<\/p>\n<pre><code class=\"language-python\">import json\nfrom dataclasses import dataclass\nfrom pathlib import Path\nfrom typing import List\n\n\n@dataclass(frozen=True)\nclass Run:\n    parameters: float       # N, model parameter count\n    compute_budget: float   # C, training FLOPs\n    final_loss: float       # L, final training loss\n\n\ndef load_runs(path: Path) -&gt; List[Run]:\n    \"\"\"Parse each JSON record into an immutable Run.\"\"\"\n    data = json.loads(path.read_text())\n    runs: List[Run] = []\n    for r in data:\n        runs.append(\n            Run(\n                parameters=float(r[\"parameters\"]),\n                compute_budget=float(r[\"compute_budget\"]),\n                final_loss=float(r[\"final_loss\"]),\n            )\n        )\n    return runs<\/code><\/pre>\n<p>The script first reads isoflops_curves.json and parses each run record into a structure with three key fields:<\/p>\n<ul>\n<li>parameters: the model parameter count <span class=\"katex--inline\">$N$<\/span><\/li>\n<li>compute_budget: the compute budget <span 
class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>         C<\/p>\n<p>        C<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><\/span><\/span><\/span><\/span><\/li>\n<li>final_loss&#xff1a;\u8be5\u5b9e\u9a8c\u7684\u6700\u7ec8 loss<\/li>\n<\/ul>\n<p>\u8fd9\u6837\u540e\u7eed\u903b\u8f91\u5c31\u53ef\u4ee5\u76f4\u63a5\u56f4\u7ed5 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        (<\/p>\n<p>        C<\/p>\n<p>        ,<\/p>\n<p>        N<\/p>\n<p>        ,<\/p>\n<p>        loss<\/p>\n<p>        )<\/p>\n<p>       (C,N,\\\\text{loss})<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord text\"><span class=\"mord\">loss<\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> \u64cd\u4f5c\u3002<\/p>\n<p>2) IsoFLOPs \u201c\u6700\u4f18\u70b9\u201d \u9009\u62e9\u7b56\u7565<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">select_opt_points<\/span><span class=\"token punctuation\">(<\/span>runs<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>Run<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span 
class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> Run<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;For each compute budget C, pick the run with the lowest final_loss&#034;&#034;&#034;<\/span><br \/>\n    best<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> Run<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> r <span class=\"token keyword\">in<\/span> runs<span class=\"token punctuation\">:<\/span><br \/>\n        C <span class=\"token operator\">&#061;<\/span> r<span class=\"token punctuation\">.<\/span>compute_budget<br \/>\n        <span class=\"token keyword\">if<\/span> C <span class=\"token keyword\">not<\/span> <span class=\"token keyword\">in<\/span> best <span class=\"token keyword\">or<\/span> r<span class=\"token punctuation\">.<\/span>final_loss <span class=\"token operator\">&lt;<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>final_loss<span class=\"token punctuation\">:<\/span><br \/>\n            best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> r<br \/>\n    <span class=\"token keyword\">return<\/span> best<\/p>\n<p>The key to IsoFLOPs is that, for each fixed compute budget <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">C_i<\/span><span class=\"katex-html\"><span 
class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>, we pick the run with the lowest final loss among the runs of different model sizes and treat it as the compute-optimal point for that budget. In the code, this means keeping one best[C] entry per compute_budget, comparing final_loss while iterating over all runs, and updating the entry whenever a lower loss appears.<\/p>\n<p>This yields the point set <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\{(C_i,N_{\\\\text{opt}}(C_i))\\\\}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mopen\">{(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span 
class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">))}<\/span><\/span><\/span><\/span><\/span>. The step amounts to \u201ctaking the optimum along each IsoFLOPs curve\u201d, so that non-optimal runs do not distort the fit.<\/p>\n<p>3) Deriving the optimal data size from the compute budget<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><\/p>\n<p>    <span class=\"token comment\"># Sort by compute budget<\/span><br \/>\n    budgets <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">sorted<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">.<\/span>keys<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    n_opt <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>parameters <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> budgets<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> 
dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    d_opt <span class=\"token operator\">&#061;<\/span> budgets <span class=\"token operator\">\/<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">6.0<\/span> <span class=\"token operator\">*<\/span> n_opt<span class=\"token punctuation\">)<\/span>    <\/p>\n<p>Given <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">N_{\\\\text{opt}}(C_i)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span 
class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>, the script applies the common Chinchilla approximation <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">C \\\\approx 6ND<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord\">6<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><\/span><\/span><\/span><\/span> and directly computes <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">D_{\\\\text{opt}}(C_i) &#061; \\\\frac{C_i}{6N_{\\\\text{opt}}(C_i)}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span 
class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0278em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.4308em;vertical-align: -0.5423em\"><\/span><span class=\"mord\"><span class=\"mopen nulldelimiter\"><\/span><span class=\"mfrac\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8884em\"><span class=\"\" style=\"top: -2.655em\"><span class=\"pstrut\" 
style=\"height: 3em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">6<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2963em\"><span class=\"\" style=\"top: -2.357em;margin-left: -0.109em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2819em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen mtight\">(<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3281em\"><span class=\"\" style=\"top: -2.357em;margin-left: -0.0715em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.143em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose mtight\">)<\/span><\/span><\/span><\/span><span class=\"\" style=\"top: -3.23em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"frac-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3.4101em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"sizing reset-size6 size3 
mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3281em\"><span class=\"\" style=\"top: -2.357em;margin-left: -0.0715em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.143em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.5423em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mclose nulldelimiter\"><\/span><\/span><\/span><\/span><\/span><\/span>, which converts each budget's \u201coptimal model size\u201d into the matching \u201coptimal number of data tokens\u201d.<\/p>\n<p>4) Power-law fitting: log-log linear regression<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">fit_power_law<\/span><span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span> ys<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Tuple<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span 
class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Fit y &#061; k * x^a via log-log linear regression:<br \/>\n      log(y) &#061; log(k) &#043; a * log(x)<br \/>\n    Returns (k, a).<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">any<\/span><span class=\"token punctuation\">(<\/span>xs <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">or<\/span> np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">any<\/span><span class=\"token punctuation\">(<\/span>ys <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;x and y must be positive for log-log fit.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    lx <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">)<\/span><br \/>\n    ly <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>ys<span class=\"token punctuation\">)<\/span><br \/>\n    a<span class=\"token punctuation\">,<\/span> logk <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>polyfit<span class=\"token punctuation\">(<\/span>lx<span class=\"token punctuation\">,<\/span> ly<span class=\"token punctuation\">,<\/span> deg<span class=\"token 
operator\">&#061;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># slope&#061;a, intercept&#061;logk<\/span><br \/>\n    k <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>exp<span class=\"token punctuation\">(<\/span>logk<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> k<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>a<span class=\"token punctuation\">)<\/span><\/p>\n<p>To obtain the scaling laws, we take <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">N_{\\\\text{opt}}(C)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" 
style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> and <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">D_{\\\\text{opt}}(C)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">D<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0278em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> and fit each with a power law <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">y&#061;kC^a<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.625em;vertical-align: -0.1944em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0359em\">y<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" 
style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6944em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6644em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">a<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>. To do so, we rewrite it in the linear form <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\log y &#061; \\\\log k &#043; a \\\\log C<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0359em\">y<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord 
mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">&#043;<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mord mathnormal\">a<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><\/span><\/span><\/span><\/span>, use np.polyfit(logC, logy, 1) to obtain the slope <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">a<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.4306em\"><\/span><span class=\"mord mathnormal\">a<\/span><\/span><\/span><\/span><\/span> and the intercept <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\log k<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><\/span><\/span><\/span><\/span>, and finally apply <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\exp<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.625em;vertical-align: 
-0.1944em\"><\/span><span class=\"mop\">exp<\/span><\/span><\/span><\/span><\/span> to the intercept to recover <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">k<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6944em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><\/span><\/span><\/span><\/span>.<\/p>\n<p>5) Visualization and extrapolated predictions<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">predict_power_law<\/span><span class=\"token punctuation\">(<\/span>k<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> a<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> x<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> k <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>x <span class=\"token operator\">**<\/span> a<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">plot_scaling<\/span><span class=\"token punctuation\">(<\/span><br \/>\n    x_points<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span><br \/>\n    y_points<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span><br \/>\n    k<span class=\"token 
punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    a<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    out_path<span class=\"token punctuation\">:<\/span> Path<span class=\"token punctuation\">,<\/span><br \/>\n    title<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    y_label<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    x_min<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    x_max<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><br \/>\n<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    xs <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>logspace<span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span>x_min<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> np<span class=\"token punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span>x_max<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">300<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ys <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>k<span class=\"token punctuation\">,<\/span> a<span class=\"token punctuation\">,<\/span> xs<span class=\"token punctuation\">)<\/span><\/p>\n<p>    plt<span class=\"token punctuation\">.<\/span>figure<span class=\"token punctuation\">(<\/span><span 
class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>x_points<span class=\"token punctuation\">,<\/span> y_points<span class=\"token punctuation\">,<\/span> marker<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;o&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;None&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;opt points&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">,<\/span> ys<span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;-&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span class=\"token operator\">&#061;<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;fit: y &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>k<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>a<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>xlabel<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Compute budget C (FLOPs)&#034;<\/span><span class=\"token punctuation\">)<\/span><br 
\/>\n    plt<span class=\"token punctuation\">.<\/span>ylabel<span class=\"token punctuation\">(<\/span>y_label<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>title<span class=\"token punctuation\">(<\/span>title<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>grid<span class=\"token punctuation\">(<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> which<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;both&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#045;&#045;&#034;<\/span><span class=\"token punctuation\">,<\/span> linewidth<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>legend<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>tight_layout<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>savefig<span class=\"token punctuation\">(<\/span>out_path<span class=\"token punctuation\">,<\/span> dpi<span class=\"token operator\">&#061;<\/span><span class=\"token number\">200<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>close<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token punctuation\">.<\/span><span class=\"token 
punctuation\">.<\/span><span class=\"token punctuation\">.<\/span><\/p>\n<p>    <span class=\"token comment\"># Predictions required by the problem<\/span><br \/>\n    targets <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1e23<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e24<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    pred_N <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>kN<span class=\"token punctuation\">,<\/span> aN<span class=\"token punctuation\">,<\/span> targets<span class=\"token punctuation\">)<\/span><br \/>\n    pred_D <span class=\"token operator\">&#061;<\/span> predict_power_law<span class=\"token punctuation\">(<\/span>kD<span class=\"token punctuation\">,<\/span> aD<span class=\"token punctuation\">,<\/span> targets<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Print results<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; IsoFLOPs opt points (C, N_opt, D_opt, loss) &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> budgets<span class=\"token punctuation\">:<\/span><br \/>\n        r <span class=\"token operator\">&#061;<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token 
punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;C&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  N_opt&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  D_opt&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token operator\">\/<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6<\/span><span class=\"token operator\">*<\/span>r<span class=\"token punctuation\">.<\/span>parameters<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  loss&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>final_loss<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n&#061;&#061;&#061; Power-law fits &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span 
class=\"token string\">f&#034;N_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>kN<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>aN<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;D_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>kD<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>aD<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n&#061;&#061;&#061; Extrapolated predictions &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C<span class=\"token punctuation\">,<\/span> Np<span class=\"token punctuation\">,<\/span> Dp <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">zip<\/span><span class=\"token punctuation\">(<\/span>targets<span class=\"token punctuation\">,<\/span> 
pred_N<span class=\"token punctuation\">,<\/span> pred_D<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;C&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>C<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.1e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">:  N_opt\u2248<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>Np<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> params,  D_opt\u2248<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>Dp<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> tokens&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span>    <\/p>\n<p>Finally, we use logspace to build a grid of <span class=\"katex--inline\">$C$<\/span> values that extends from the observed range up to <span class=\"katex--inline\">$10^{24}$<\/span>, and draw two log-log plots:<\/p>\n<ul>\n<li><span class=\"katex--inline\">$C \\rightarrow N_{\\text{opt}}$<\/span><\/li>\n<li><span class=\"katex--inline\">$C \\rightarrow D_{\\text{opt}}$<\/span><\/li>\n<\/ul>\n<p>We also evaluate the fitted <span class=\"katex--inline\">$k, a$<\/span> directly at <span class=\"katex--inline\">$10^{23}$<\/span> and <span class=\"katex--inline\">$10^{24}$<\/span> and print the predicted values.<\/p>\n<h3>2. 
Problem (scaling_laws): 50 points<\/h3>\n<p>Construct a scaling law that accurately predicts the optimal model size, the corresponding hyperparameter configuration, and the final training loss under a compute budget of 1e19 FLOPs. To do so, you will use the training API we provide to query the final training loss of different experiment configurations (see \u00a73.1). While fitting the scaling law, you may query at most 2e18 FLOPs worth of experiments; this is a hard cap enforced by the API.<\/p>\n<p>Deliverable: submit a well-formatted written report that fully and clearly describes:<\/p>\n<ul>\n<li>the method and overall approach you used to fit the scaling law;<\/li>\n<li>how you used the scaling law to predict the optimal model size for a given FLOPs budget;<\/li>\n<li>your final predictions.<\/li>\n<\/ul>\n<p>The report should explain the key design decisions and give enough detail for others to reproduce your method and results.<\/p>\n<p>A note on batch size (important)<\/p>\n<p>For the 1e19 FLOPs budget, the batch size in the hyperparameter configuration you report must be 128 or 256. The purpose of this restriction is to ensure sufficiently high FLOPs utilization. If the configuration you report runs out of memory (OOM), we will keep your specified batch size by using gradient accumulation or by adding more data-parallel GPUs.<\/p>\n<p>Questions to focus on in your report<\/p>\n<p>To help you get started, we suggest that you at least consider and discuss the following questions in your report (your write-up should comment on each point and explain the reasoning behind your decisions):<\/p>\n<ul>\n<li>Given a fixed scaling-law fitting budget of 2e18 FLOPs, how did you decide which training configurations to query?<\/li>\n<li>How did you fit the scaling law? Describe in detail the specific method or combination of methods you used. In particular, we suggest looking at the modeling approaches in [Kaplan&#043; 2020] and [Hoffmann&#043; 2022].<\/li>\n<li>How well does your scaling law fit the experimental data?<\/li>\n<li>For the 1e19 FLOPs budget, what optimal model size does your scaling law predict? What is the corresponding predicted training loss?<\/li>\n<li>If you were actually to train a model with the predicted optimal number of parameters, which hyperparameters would you choose? Tip: for a given model, the number of non-embedding parameters can be estimated as approximately <span class=\"katex--inline\">$12 n_{\\text{layer}} d_{\\text{model}}^2$<\/span><\/li>\n<\/ul>\n<p>Additional submission requirements<\/p>\n<p>In addition to the written report, you also need to submit the following:<\/p>\n<p>1. your predicted optimal model size;<\/p>\n<p>2. your chosen training hyperparameter configuration (including the batch size, which must be 128 or 256);<\/p>\n<p>3. your predicted training loss.<\/p>\n<p>Submit these three items through the following Google Form: https:\/\/forms.gle\/sAUSLwCUETew2hYN6<\/p>\n<p>Part of the final grade depends on how the model you predict to be optimal performs when it is actually trained.<\/p>\n<p>Note: because we have no access to the official API, this assignment could not be tested. The course staff may open it up in some other way in the future; follow the notes and discussion in the official repository: stanford-cs336\/assignment3-scaling#1<\/p>\n<p>Below we briefly go through the relevant scripts and then analyze the method and overall approach for fitting the scaling laws (note \u2757 none of the scripts has been thoroughly tested; if a data package is released later, we will come back and complete this assignment).<\/p>\n<p>The core goal of the assignment is to predict, for a training budget of 1e19 FLOPs, the compute-optimal model size, a trainable set of hyperparameters, and the final training loss. We can only obtain experimental results through the training API, and the total FLOPs spent on queries used to fit the scaling law is subject to a hard cap (requests beyond it are rejected).<\/p>\n<p>The scripts implemented for this assignment fall into 3 groups:<\/p>\n<p>A. API client and cache layer<\/p>\n<p>Purpose: provide a single interface for calling GET \/loss, querying GET \/total_flops_used, and fetching GET \/previous_runs, and cache the configurations that have already been run.<\/p>\n<p>Related scripts:<\/p>\n<ul>\n<li>cs336_scaling\/api_client.py\n<ul>\n<li>get_loss(config) -&gt; loss, total_flops_used<\/li>\n<li>get_total_flops_used()<\/li>\n<li>get_previous_runs()<\/li>\n<\/ul>\n<\/li>\n<li>cache.py<\/li>\n<li>query_api.py<\/li>\n<\/ul>\n<p>B. Experiment design \/ search scripts<\/p>\n<p>Purpose: decide, within the fitting budget, which points to query (which model sizes and which learning rate \/ number of layers \/ width \/ heads \/ batch \/ train_flops), and find the trend as efficiently as possible.<\/p>\n<p>Related scripts:<\/p>\n<ul>\n<li>cs336_scaling\/run_sweep.py:\n<ul>\n<li>supports grid \/ staged search (a coarse sweep followed by refinement)<\/li>\n<li>checks total_flops_used before every query<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>C. Scaling-law fitting and prediction scripts<\/p>\n<p>Purpose: fit the queried data to a model that can be extrapolated, then output for 1e19 FLOPs:<\/p>\n<ul>\n<li>the predicted optimal model size (parameter count)<\/li>\n<li>the corresponding hyperparameters<\/li>\n<li>the predicted training loss<\/li>\n<\/ul>\n<p>Related scripts:<\/p>\n<ul>\n<li>cs336_scaling\/scaling_data.py<\/li>\n<li>cs336_scaling\/fit_scaling_law.py<\/li>\n<li>cs336_scaling\/predict_1e19.py<\/li>\n<\/ul>\n<p>Let us now look at how these scripts are implemented and how to run them.<\/p>\n<h4>2.1 API client and cache layer implementation<\/h4>\n<p>assignment3-scaling\/ contains:<\/p>\n<p>cs336_scaling\/<br \/>\n  api_client.py        <span class=\"token comment\"># API calls &#043; parameter validation &#043; error handling<\/span><br \/>\n  cache.py             <span class=\"token comment\"># local cache (jsonl or sqlite would both work; jsonl is the lightest option here)<\/span><br \/>\n  query_api.py         <span class=\"token comment\"># a small CLI: single\/batch queries, result export<\/span><br \/>\nruns\/<br \/>\n  api_cache.jsonl      <span class=\"token comment\"># auto-generated: cache and log<\/span><\/p>\n<p>First, the implementation of the local cache layer, cs336_scaling\/cache.py:<\/p>\n<p><span class=\"token keyword\">import<\/span> hashlib<br \/>\n<span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> dataclass<br 
\/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Any<span class=\"token punctuation\">,<\/span> Dict<span class=\"token punctuation\">,<\/span> Optional<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">_stable_json<\/span><span class=\"token punctuation\">(<\/span>obj<span class=\"token punctuation\">:<\/span> Any<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> json<span class=\"token punctuation\">.<\/span>dumps<span class=\"token punctuation\">(<\/span>obj<span class=\"token punctuation\">,<\/span> sort_keys<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> separators<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;,&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;:&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> ensure_ascii<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">make_key<\/span><span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> 
Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    payload <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;endpoint&#034;<\/span><span class=\"token punctuation\">:<\/span> endpoint<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;params&#034;<\/span><span class=\"token punctuation\">:<\/span> params<span class=\"token punctuation\">}<\/span><br \/>\n    h <span class=\"token operator\">&#061;<\/span> hashlib<span class=\"token punctuation\">.<\/span>sha256<span class=\"token punctuation\">(<\/span>_stable_json<span class=\"token punctuation\">(<\/span>payload<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>encode<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>hexdigest<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> h<\/p>\n<p><span class=\"token decorator annotation punctuation\">&#064;dataclass<\/span><br \/>\n<span class=\"token keyword\">class<\/span> <span class=\"token class-name\">CacheHit<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    key<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><br \/>\n    value<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><\/p>\n<p><span class=\"token keyword\">class<\/span> 
<span class=\"token class-name\">JsonlCache<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Append-only JSONL cache.<br \/>\n    Each line: {&#034;key&#034;:&#8230;, &#034;endpoint&#034;:&#8230;, &#034;params&#034;:&#8230;, &#034;response&#034;:&#8230;}<br \/>\n    &#034;&#034;&#034;<\/span><\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">__init__<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> path<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span> <span class=\"token operator\">|<\/span> Path<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>path <span class=\"token operator\">&#061;<\/span> Path<span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>path<span class=\"token punctuation\">.<\/span>parent<span class=\"token punctuation\">.<\/span>mkdir<span class=\"token punctuation\">(<\/span>parents<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> exist_ok<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>_index<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token 
operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> self<span class=\"token punctuation\">.<\/span>path<span class=\"token punctuation\">.<\/span>exists<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            self<span class=\"token punctuation\">.<\/span>_load<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">_load<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">with<\/span> self<span class=\"token punctuation\">.<\/span>path<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">open<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;r&#034;<\/span><span class=\"token punctuation\">,<\/span> encoding<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">as<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> line <span class=\"token keyword\">in<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n                line <span class=\"token operator\">&#061;<\/span> line<span class=\"token punctuation\">.<\/span>strip<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n                <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> line<span class=\"token 
punctuation\">:<\/span><br \/>\n                    <span class=\"token keyword\">continue<\/span><br \/>\n                obj <span class=\"token operator\">&#061;<\/span> json<span class=\"token punctuation\">.<\/span>loads<span class=\"token punctuation\">(<\/span>line<span class=\"token punctuation\">)<\/span><br \/>\n                key <span class=\"token operator\">&#061;<\/span> obj<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;key&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n                <span class=\"token keyword\">if<\/span> key<span class=\"token punctuation\">:<\/span><br \/>\n                    self<span class=\"token punctuation\">.<\/span>_index<span class=\"token punctuation\">[<\/span>key<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> obj<\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">get<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> endpoint<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> Optional<span class=\"token punctuation\">[<\/span>CacheHit<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        key <span class=\"token operator\">&#061;<\/span> make_key<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        obj <span 
class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>_index<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>key<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> obj <span class=\"token keyword\">is<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">return<\/span> <span class=\"token boolean\">None<\/span><br \/>\n        <span class=\"token keyword\">return<\/span> CacheHit<span class=\"token punctuation\">(<\/span>key<span class=\"token operator\">&#061;<\/span>key<span class=\"token punctuation\">,<\/span> value<span class=\"token operator\">&#061;<\/span>obj<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;response&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">put<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> endpoint<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> response<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">str<\/span><span class=\"token 
punctuation\">:<\/span><br \/>\n        key <span class=\"token operator\">&#061;<\/span> make_key<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        record <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token string\">&#034;key&#034;<\/span><span class=\"token punctuation\">:<\/span> key<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;endpoint&#034;<\/span><span class=\"token punctuation\">:<\/span> endpoint<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;params&#034;<\/span><span class=\"token punctuation\">:<\/span> params<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;response&#034;<\/span><span class=\"token punctuation\">:<\/span> response<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">}<\/span><br \/>\n        <span class=\"token keyword\">with<\/span> self<span class=\"token punctuation\">.<\/span>path<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">open<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">,<\/span> encoding<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">as<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n            f<span class=\"token punctuation\">.<\/span>write<span class=\"token punctuation\">(<\/span>_stable_json<span class=\"token punctuation\">(<\/span>record<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> <span class=\"token string\">&#034;\\n&#034;<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>_index<span class=\"token punctuation\">[<\/span>key<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> record<br \/>\n        <span class=\"token keyword\">return<\/span> key<\/p>\n<p>The core goal is that each distinct request (endpoint &#043; parameters) hits the network only once, and that every result is written to an append-only JSONL file, which makes it easy to grep, plot, and reproduce.<\/p>\n<p>Next comes the API client layer, implemented in cs336_scaling\/api_client.py:<\/p>\n<p><span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> dataclass<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Any<span class=\"token punctuation\">,<\/span> Dict<\/p>\n<p><span class=\"token keyword\">import<\/span> requests<\/p>\n<p><span class=\"token keyword\">from<\/span> cache <span class=\"token keyword\">import<\/span> JsonlCache<\/p>\n<p><span class=\"token decorator annotation punctuation\">&#064;dataclass<\/span><span class=\"token punctuation\">(<\/span>frozen<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">class<\/span> <span class=\"token class-name\">LossQuery<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    d_model<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    num_layers<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    num_heads<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    batch_size<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    
learning_rate<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><br \/>\n    train_flops<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><\/p>\n<p><span class=\"token keyword\">class<\/span> <span class=\"token class-name\">ScalingAPIError<\/span><span class=\"token punctuation\">(<\/span>RuntimeError<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">pass<\/span><\/p>\n<p><span class=\"token keyword\">class<\/span> <span class=\"token class-name\">ScalingAPIClient<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">def<\/span> <span class=\"token function\">__init__<\/span><span class=\"token punctuation\">(<\/span><br \/>\n        self<span class=\"token punctuation\">,<\/span><br \/>\n        api_key<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        base_url<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;http:\/\/hyperturing.stanford.edu:8000&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        cache_path<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;runs\/api_cache.jsonl&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        timeout_s<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">60<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>api_key 
<span class=\"token operator\">&#061;<\/span> api_key<br \/>\n        self<span class=\"token punctuation\">.<\/span>base_url <span class=\"token operator\">&#061;<\/span> base_url<span class=\"token punctuation\">.<\/span>rstrip<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\/&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>cache <span class=\"token operator\">&#061;<\/span> JsonlCache<span class=\"token punctuation\">(<\/span>cache_path<span class=\"token punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>timeout_s <span class=\"token operator\">&#061;<\/span> timeout_s<\/p>\n<p>    <span class=\"token comment\"># &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<\/span><br \/>\n    <span class=\"token comment\"># Local validation (matches the handout)<\/span><br \/>\n    <span class=\"token comment\"># &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<\/span><br \/>\n    <span class=\"token keyword\">def<\/span> <span class=\"token function\">_validate_loss_query<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">:<\/span> LossQuery<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token comment\"># Ranges from the handout: d_model[64,1024], layers[2,24], heads[2,16],<\/span><br \/>\n        <span class=\"token comment\"># batch_size[128,256], lr[1e-4,1e-3], train_flops in a fixed set. 
<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">64<\/span> <span class=\"token operator\">&lt;&#061;<\/span> q<span class=\"token punctuation\">.<\/span>d_model <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">1024<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;d_model out of range: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span> <span class=\"token operator\">&lt;&#061;<\/span> q<span class=\"token punctuation\">.<\/span>num_layers <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">24<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;num_layers out of range: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n       
 <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span> <span class=\"token operator\">&lt;&#061;<\/span> q<span class=\"token punctuation\">.<\/span>num_heads <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">16<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;num_heads out of range: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">128<\/span> <span class=\"token operator\">&lt;&#061;<\/span> q<span class=\"token punctuation\">.<\/span>batch_size <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">256<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;batch_size out of range: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> <span 
class=\"token keyword\">not<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token number\">1e-4<\/span> <span class=\"token operator\">&lt;&#061;<\/span> q<span class=\"token punctuation\">.<\/span>learning_rate <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">1e-3<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;learning_rate out of range: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        allowed <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e13<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">3e13<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6e13<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e14<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token 
punctuation\">(<\/span><span class=\"token number\">3e14<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6e14<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e15<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">3e15<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6e15<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e16<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">3e16<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6e16<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e17<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span 
class=\"token number\">3e17<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">6e17<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e18<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">}<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token keyword\">in<\/span> allowed<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ValueError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;train_flops not allowed: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">_get_json<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> path<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token 
punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        url <span class=\"token operator\">&#061;<\/span> <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>self<span class=\"token punctuation\">.<\/span>base_url<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">\/<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>path<span class=\"token punctuation\">.<\/span>lstrip<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#039;\/&#039;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><br \/>\n        r <span class=\"token operator\">&#061;<\/span> requests<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>url<span class=\"token punctuation\">,<\/span> params<span class=\"token operator\">&#061;<\/span>params<span class=\"token punctuation\">,<\/span> timeout<span class=\"token operator\">&#061;<\/span>self<span class=\"token punctuation\">.<\/span>timeout_s<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># API error examples return {&#034;message&#034;: &#034;&#8230;&#034;}<\/span><br \/>\n        <span class=\"token keyword\">try<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            payload <span class=\"token 
operator\">&#061;<\/span> r<span class=\"token punctuation\">.<\/span>json<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">except<\/span> Exception <span class=\"token keyword\">as<\/span> e<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ScalingAPIError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Non-JSON response: status&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>status_code<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">, text&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>text<span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">200]<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">from<\/span> e<\/p>\n<p>        <span class=\"token keyword\">if<\/span> r<span class=\"token punctuation\">.<\/span>status_code <span class=\"token operator\">!&#061;<\/span> <span class=\"token number\">200<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            msg <span class=\"token operator\">&#061;<\/span> payload<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;message&#034;<\/span><span class=\"token punctuation\">,<\/span> payload<span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">raise<\/span> ScalingAPIError<span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token 
string\">f&#034;API error <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>r<span class=\"token punctuation\">.<\/span>status_code<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> &#064; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>url<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>msg<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span> payload<\/p>\n<p>    <span class=\"token comment\"># &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<\/span><br \/>\n    <span class=\"token comment\"># Public endpoints<\/span><br \/>\n    <span class=\"token comment\"># &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<\/span><br \/>\n    <span class=\"token keyword\">def<\/span> <span class=\"token function\">total_flops_used<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#8211;<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        endpoint <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;\/total_flops_used&#034;<\/span><br \/>\n        params <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;api_key&#034;<\/span><span class=\"token punctuation\">:<\/span> self<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">}<\/span><br \/>\n        hit <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>cache<span class=\"token 
punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> hit<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">return<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>hit<span class=\"token punctuation\">.<\/span>value<span class=\"token punctuation\">)<\/span><br \/>\n        out <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>_get_json<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># sample shows it returns a number (JSON scalar)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>put<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">,<\/span> out<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>out<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">previous_runs<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        endpoint <span class=\"token 
operator\">&#061;<\/span> <span class=\"token string\">&#034;\/previous_runs&#034;<\/span><br \/>\n        params <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;api_key&#034;<\/span><span class=\"token punctuation\">:<\/span> self<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">}<\/span><br \/>\n        hit <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> hit<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">return<\/span> hit<span class=\"token punctuation\">.<\/span>value<br \/>\n        out <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>_get_json<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>put<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">,<\/span> out<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span> out<\/p>\n<p>    <span class=\"token keyword\">def<\/span> <span class=\"token function\">loss<\/span><span class=\"token punctuation\">(<\/span>self<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">:<\/span> LossQuery<span class=\"token punctuation\">,<\/span> use_cache<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">bool<\/span> <span class=\"token operator\">&#061;<\/span> 
<span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> Any<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>_validate_loss_query<span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">)<\/span><\/p>\n<p>        endpoint <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;\/loss&#034;<\/span><br \/>\n        params <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;train_flops&#034;<\/span><span 
class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;api_key&#034;<\/span><span class=\"token punctuation\">:<\/span> self<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">}<\/span><\/p>\n<p>        <span class=\"token keyword\">if<\/span> use_cache<span class=\"token punctuation\">:<\/span><br \/>\n            hit <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">if<\/span> hit<span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">return<\/span> hit<span class=\"token punctuation\">.<\/span>value<\/p>\n<p>        out <span class=\"token operator\">&#061;<\/span> self<span class=\"token punctuation\">.<\/span>_get_json<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># example output: {&#034;loss&#034;: &#8230;, &#034;total_flops_used&#034;: &#8230;}<\/span><br \/>\n        self<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>put<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">,<\/span> out<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token 
keyword\">return<\/span> out<\/p>\n<p>The script above wraps all three endpoints and provides:<\/p>\n<ul>\n<li>Local validation of parameter ranges (avoids pointless 404 requests)<\/li>\n<li>Automatic caching<\/li>\n<li>Clearer error messages<\/li>\n<\/ul>\n<p>Finally, a minimal CLI, cs336_scaling\/query_api.py, is implemented as follows:<\/p>\n<p><span class=\"token keyword\">import<\/span> argparse<br \/>\n<span class=\"token keyword\">import<\/span> os<br \/>\n<span class=\"token keyword\">from<\/span> pprint <span class=\"token keyword\">import<\/span> pprint<\/p>\n<p><span class=\"token keyword\">from<\/span> api_client <span class=\"token keyword\">import<\/span> LossQuery<span class=\"token punctuation\">,<\/span> ScalingAPIClient<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    p <span class=\"token operator\">&#061;<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--api-key&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>os<span class=\"token punctuation\">.<\/span>environ<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;CS336_API_KEY&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--base-url&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;http:\/\/hyperturing.stanford.edu:8000&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--cache&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;runs\/api_cache.jsonl&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    sub <span class=\"token operator\">&#061;<\/span> p<span class=\"token punctuation\">.<\/span>add_subparsers<span class=\"token punctuation\">(<\/span>dest<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;cmd&#034;<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    sub<span class=\"token punctuation\">.<\/span>add_parser<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;total_flops_used&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    sub<span class=\"token punctuation\">.<\/span>add_parser<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;previous_runs&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    q <span class=\"token operator\">&#061;<\/span> sub<span class=\"token punctuation\">.<\/span>add_parser<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;loss&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token 
punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--d-model&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--num-layers&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--num-heads&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--batch-size&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token 
operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--learning-rate&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    q<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--train-flops&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    args <span class=\"token operator\">&#061;<\/span> p<span class=\"token punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> args<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> SystemExit<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Missing --api-key or env CS336_API_KEY&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    client <span class=\"token operator\">&#061;<\/span> ScalingAPIClient<span class=\"token punctuation\">(<\/span><br \/>\n        api_key<span class=\"token 
operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">,<\/span><br \/>\n        base_url<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>base_url<span class=\"token punctuation\">,<\/span><br \/>\n        cache_path<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>cmd <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;total_flops_used&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">.<\/span>total_flops_used<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>cmd <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;previous_runs&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        pprint<span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">.<\/span>previous_runs<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>cmd <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;loss&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        out <span 
class=\"token operator\">&#061;<\/span> client<span class=\"token punctuation\">.<\/span>loss<span class=\"token punctuation\">(<\/span><br \/>\n            LossQuery<span class=\"token punctuation\">(<\/span><br \/>\n                d_model<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span><br \/>\n                num_layers<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span><br \/>\n                num_heads<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span><br \/>\n                batch_size<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">,<\/span><br \/>\n                learning_rate<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">,<\/span><br \/>\n                train_flops<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n        pprint<span class=\"token punctuation\">(<\/span>out<span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span><\/p>\n<p><span class=\"token keyword\">if<\/span> __name__ <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;__main__&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    main<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>This script lets us quickly verify that the key
works, that the cache takes effect, and that a single loss query runs end to end.<\/p>\n<p>The commands to run are:<\/p>\n<p><span class=\"token comment\"># Keep the key in an environment variable (so it is not repeated in every command in the shell history)<\/span><br \/>\n<span class=\"token builtin class-name\">export<\/span> <span class=\"token assign-left variable\">CS336_API_KEY<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;your-key (SSH public key string, no line breaks)&#034;<\/span><\/p>\n<p><span class=\"token comment\"># 1) First verify the key &amp; network<\/span><br \/>\nuv run python cs336_scaling\/query_api.py total_flops_used<\/p>\n<p><span class=\"token comment\"># 2) List previous runs (also cached)<\/span><br \/>\nuv run python cs336_scaling\/query_api.py previous_runs<\/p>\n<p><span class=\"token comment\"># 3) Single loss query (see the handout for parameter ranges)<\/span><br \/>\nuv run python cs336_scaling\/query_api.py loss <span class=\"token punctuation\">\\<\/span><br \/>\n  --d-model <span class=\"token number\">1024<\/span> --num-layers <span class=\"token number\">24<\/span> --num-heads <span class=\"token number\">16<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --batch-size <span class=\"token number\">128<\/span> --learning-rate <span class=\"token number\">0.001<\/span> --train-flops <span class=\"token number\">10000000000000000<\/span><\/p>\n<p>When the same loss command is run twice in a row, the second run should hit runs\/api_cache.jsonl directly and send no new request.<\/p>\n<h4>2.2 Experiment Design \/ 
Search Script Implementation<\/h4>\n<p>cs336_scaling\/run_sweep.py is implemented as follows:<\/p>\n<p><span class=\"token keyword\">import<\/span> argparse<br \/>\n<span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">import<\/span> os<br \/>\n<span class=\"token keyword\">import<\/span> time<br \/>\n<span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> asdict<br \/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Dict<span class=\"token punctuation\">,<\/span> Iterable<span class=\"token punctuation\">,<\/span> List<\/p>\n<p><span class=\"token keyword\">from<\/span> api_client <span class=\"token keyword\">import<\/span> LossQuery<span class=\"token punctuation\">,<\/span> ScalingAPIClient<span class=\"token punctuation\">,<\/span> ScalingAPIError<\/p>\n<p><span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token comment\"># Utilities<\/span><br \/>\n<span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token keyword\">def<\/span> <span class=\"token function\">now_ms<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>time<span class=\"token punctuation\">.<\/span>time<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> <span 
class=\"token number\">1000<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">jsonl_append<\/span><span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">:<\/span> Path<span class=\"token punctuation\">,<\/span> obj<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    path<span class=\"token punctuation\">.<\/span>parent<span class=\"token punctuation\">.<\/span>mkdir<span class=\"token punctuation\">(<\/span>parents<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> exist_ok<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">with<\/span> path<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">open<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">,<\/span> encoding<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">as<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n        f<span class=\"token punctuation\">.<\/span>write<span class=\"token punctuation\">(<\/span>json<span class=\"token punctuation\">.<\/span>dumps<span class=\"token punctuation\">(<\/span>obj<span class=\"token punctuation\">,<\/span> ensure_ascii<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">False<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> <span 
class=\"token string\">&#034;\\n&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">iter_unique<\/span><span class=\"token punctuation\">(<\/span>seq<span class=\"token punctuation\">:<\/span> Iterable<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    seen <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">set<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    out<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> q <span class=\"token keyword\">in<\/span> seq<span class=\"token punctuation\">:<\/span><br \/>\n        key <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">(<\/span><br \/>\n            q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span><br \/>\n            q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span><br \/>\n            q<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span><br \/>\n            q<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token 
punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> key <span class=\"token keyword\">in<\/span> seen<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">continue<\/span><br \/>\n        seen<span class=\"token punctuation\">.<\/span>add<span class=\"token punctuation\">(<\/span>key<span class=\"token punctuation\">)<\/span><br \/>\n        out<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> out<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">estimate_nonemb_params<\/span><span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> num_layers<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token comment\"># Handout tip: non-embedding params \u2248 12 * n_layer * d_model^2<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> <span class=\"token number\">12.0<\/span> <span class=\"token operator\">*<\/span> num_layers <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>d_model <span 
class=\"token operator\">**<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token comment\"># Grid generator (coarse -&gt; refine)<\/span><br \/>\n<span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token keyword\">def<\/span> <span class=\"token function\">coarse_grid<\/span><span class=\"token punctuation\">(<\/span><br \/>\n    train_flops<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    batch_sizes<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    d_models<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    num_layers<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    num_heads<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    learning_rates<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span 
class=\"token punctuation\">,<\/span><br \/>\n<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Coarse exploration:<br \/>\n      - fewer shapes<br \/>\n      - a couple of learning rates<br \/>\n      - multiple compute levels<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    qs<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> train_flops<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">for<\/span> bs <span class=\"token keyword\">in<\/span> batch_sizes<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> d <span class=\"token keyword\">in<\/span> d_models<span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">for<\/span> nl <span class=\"token keyword\">in<\/span> num_layers<span class=\"token punctuation\">:<\/span><br \/>\n                    <span class=\"token keyword\">for<\/span> nh <span class=\"token keyword\">in<\/span> num_heads<span class=\"token punctuation\">:<\/span><br \/>\n                        <span class=\"token comment\"># require d_model divisible by num_heads (Transformer constraint)<\/span><br \/>\n                        <span class=\"token keyword\">if<\/span> d <span class=\"token operator\">%<\/span> nh <span class=\"token operator\">!&#061;<\/span> <span 
class=\"token number\">0<\/span><span class=\"token punctuation\">:<\/span><br \/>\n                            <span class=\"token keyword\">continue<\/span><br \/>\n                        <span class=\"token keyword\">for<\/span> lr <span class=\"token keyword\">in<\/span> learning_rates<span class=\"token punctuation\">:<\/span><br \/>\n                            qs<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span><br \/>\n                                LossQuery<span class=\"token punctuation\">(<\/span><br \/>\n                                    d_model<span class=\"token operator\">&#061;<\/span>d<span class=\"token punctuation\">,<\/span><br \/>\n                                    num_layers<span class=\"token operator\">&#061;<\/span>nl<span class=\"token punctuation\">,<\/span><br \/>\n                                    num_heads<span class=\"token operator\">&#061;<\/span>nh<span class=\"token punctuation\">,<\/span><br \/>\n                                    batch_size<span class=\"token operator\">&#061;<\/span>bs<span class=\"token punctuation\">,<\/span><br \/>\n                                    learning_rate<span class=\"token operator\">&#061;<\/span>lr<span class=\"token punctuation\">,<\/span><br \/>\n                                    train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                                <span class=\"token punctuation\">)<\/span><br \/>\n                            <span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> iter_unique<span class=\"token punctuation\">(<\/span>qs<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">refine_grid_around_best<\/span><span 
class=\"token punctuation\">(<\/span><br \/>\n    best<span class=\"token punctuation\">:<\/span> LossQuery<span class=\"token punctuation\">,<\/span><br \/>\n    train_flops<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    batch_sizes<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    d_model_mults<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    layer_deltas<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    head_candidates<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    lr_mults<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Local 
refinement around a &#034;best&#034; config (by loss at some compute).<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    qs<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> train_flops<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">for<\/span> bs <span class=\"token keyword\">in<\/span> batch_sizes<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> dm <span class=\"token keyword\">in<\/span> d_model_mults<span class=\"token punctuation\">:<\/span><br \/>\n                d <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token builtin\">round<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">.<\/span>d_model <span class=\"token operator\">*<\/span> dm<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n                d <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">max<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">64<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1024<\/span><span class=\"token punctuation\">,<\/span> d<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># API range<\/span><br \/>\n                <span class=\"token keyword\">for<\/span> dl <span class=\"token keyword\">in<\/span> 
layer_deltas<span class=\"token punctuation\">:<\/span><br \/>\n                    nl <span class=\"token operator\">&#061;<\/span> best<span class=\"token punctuation\">.<\/span>num_layers <span class=\"token operator\">&#043;<\/span> dl<br \/>\n                    nl <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">max<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">24<\/span><span class=\"token punctuation\">,<\/span> nl<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># API range<\/span><br \/>\n                    <span class=\"token keyword\">for<\/span> nh <span class=\"token keyword\">in<\/span> head_candidates<span class=\"token punctuation\">:<\/span><br \/>\n                        <span class=\"token keyword\">if<\/span> d <span class=\"token operator\">%<\/span> nh <span class=\"token operator\">!&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">:<\/span><br \/>\n                            <span class=\"token keyword\">continue<\/span><br \/>\n                        <span class=\"token keyword\">for<\/span> lm <span class=\"token keyword\">in<\/span> lr_mults<span class=\"token punctuation\">:<\/span><br \/>\n                            lr <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">.<\/span>learning_rate <span class=\"token operator\">*<\/span> lm<span class=\"token punctuation\">)<\/span><br \/>\n                            <span class=\"token comment\"># clamp to API range [1e-4, 1e-3]<\/span><br \/>\n           
                 lr <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">max<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e-4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e-3<\/span><span class=\"token punctuation\">,<\/span> lr<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n                            qs<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span><br \/>\n                                LossQuery<span class=\"token punctuation\">(<\/span><br \/>\n                                    d_model<span class=\"token operator\">&#061;<\/span>d<span class=\"token punctuation\">,<\/span><br \/>\n                                    num_layers<span class=\"token operator\">&#061;<\/span>nl<span class=\"token punctuation\">,<\/span><br \/>\n                                    num_heads<span class=\"token operator\">&#061;<\/span>nh<span class=\"token punctuation\">,<\/span><br \/>\n                                    batch_size<span class=\"token operator\">&#061;<\/span>bs<span class=\"token punctuation\">,<\/span><br \/>\n                                    learning_rate<span class=\"token operator\">&#061;<\/span>lr<span class=\"token punctuation\">,<\/span><br \/>\n                                    train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                                <span class=\"token punctuation\">)<\/span><br \/>\n                            <span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> iter_unique<span class=\"token punctuation\">(<\/span>qs<span 
class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token comment\"># Budget-aware runner<\/span><br \/>\n<span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token keyword\">def<\/span> <span class=\"token function\">would_consume_budget<\/span><span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">:<\/span> ScalingAPIClient<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">:<\/span> LossQuery<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">bool<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    If the cache already has this exact request, it&#039;s free (no extra FLOPs).<br \/>\n    We check client.cache directly via the endpoint&#043;params mapping used in api_client.py.<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    endpoint <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;\/loss&#034;<\/span><br \/>\n    params <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token 
punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">:<\/span> q<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;train_flops&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;api_key&#034;<\/span><span class=\"token punctuation\">:<\/span> client<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">}<\/span><br \/>\n    hit <span class=\"token operator\">&#061;<\/span> client<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span>endpoint<span class=\"token punctuation\">,<\/span> params<span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> hit <span class=\"token keyword\">is<\/span> <span class=\"token boolean\">None<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">run_queries<\/span><span class=\"token punctuation\">(<\/span><br \/>\n    client<span class=\"token punctuation\">:<\/span> ScalingAPIClient<span class=\"token punctuation\">,<\/span><br \/>\n    queries<span class=\"token punctuation\">:<\/span> List<span class=\"token 
punctuation\">[<\/span>LossQuery<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    max_fit_budget_flops<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">2e18<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    results_path<span class=\"token punctuation\">:<\/span> Path <span class=\"token operator\">&#061;<\/span> Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;runs\/sweep_results.jsonl&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    dry_run<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">bool<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token boolean\">False<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    sleep_s<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">0.0<\/span><span class=\"token punctuation\">,<\/span><br \/>\n<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Executes queries until the (estimated) budget would exceed max_fit_budget_flops.<br \/>\n    Notes:<br \/>\n      - total_flops_used is returned by the API and can be fetched anytime.<br \/>\n      - If we exceed the 2e18 scaling-law budget, the API will refuse future requests,<br \/>\n        so we stop conservatively.<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    <span class=\"token comment\"># starting 
point from API<\/span><br \/>\n    <span class=\"token keyword\">try<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        used0 <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">.<\/span>total_flops_used<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">except<\/span> ScalingAPIError<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token comment\"># If the key has never been used for a query, the endpoint may return 422; in that case used0&#061;0 is safe.<\/span><br \/>\n        used0 <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">0.0<\/span><\/p>\n<p>    planned_new <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">0.0<\/span><br \/>\n    n_new <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">0<\/span><br \/>\n    n_cached <span class=\"token operator\">&#061;<\/span> <span class=\"token number\">0<\/span><\/p>\n<p>    <span class=\"token comment\"># Pre-pass: compute how many are cached &amp; estimated extra FLOPs<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> q <span class=\"token keyword\">in<\/span> queries<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> would_consume_budget<span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            planned_new <span class=\"token operator\">&#043;&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span><br \/>\n            n_new <span class=\"token 
operator\">&#043;&#061;<\/span> <span class=\"token number\">1<\/span><br \/>\n        <span class=\"token keyword\">else<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            n_cached <span class=\"token operator\">&#043;&#061;<\/span> <span class=\"token number\">1<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; Sweep plan &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;queries_total: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>queries<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;cached_free:   <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>n_cached<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;new_queries:   <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>n_new<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token 
keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;api_used_now:  <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>used0<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> FLOPs&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;est_new_cost:  <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>planned_new<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> FLOPs&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;est_total:     <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">(<\/span>used0 <span class=\"token operator\">&#043;<\/span> planned_new<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> FLOPs&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;budget_limit:  <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>max_fit_budget_flops<span 
class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> FLOPs&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> dry_run<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;\\n[dry-run] not executing API calls.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">return<\/span><\/p>\n<p>    <span class=\"token comment\"># Execute with conservative budget guard<\/span><br \/>\n    used <span class=\"token operator\">&#061;<\/span> used0<br \/>\n    <span class=\"token keyword\">for<\/span> i<span class=\"token punctuation\">,<\/span> q <span class=\"token keyword\">in<\/span> <span class=\"token builtin\">enumerate<\/span><span class=\"token punctuation\">(<\/span>queries<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        is_new <span class=\"token operator\">&#061;<\/span> would_consume_budget<span class=\"token punctuation\">(<\/span>client<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">)<\/span><br \/>\n        est_after <span class=\"token operator\">&#061;<\/span> used <span class=\"token operator\">&#043;<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">if<\/span> is_new <span class=\"token keyword\">else<\/span> <span class=\"token number\">0.0<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        <span class=\"token keyword\">if<\/span> est_after <span class=\"token operator\">&gt;<\/span> 
max_fit_budget_flops<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;[STOP] Budget guard: would exceed limit if running next query. &#034;<\/span><\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;used&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>used<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">, next_cost&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>train_flops <span class=\"token keyword\">if<\/span> is_new <span class=\"token keyword\">else<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">, &#034;<\/span><\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;limit&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>max_fit_budget_flops<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><br \/>\n            <span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">return<\/span><\/p>\n<p>        rec <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token 
string\">&#034;ts_ms&#034;<\/span><span class=\"token punctuation\">:<\/span> now_ms<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;index&#034;<\/span><span class=\"token punctuation\">:<\/span> i<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;query&#034;<\/span><span class=\"token punctuation\">:<\/span> asdict<span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;nonemb_params_est&#034;<\/span><span class=\"token punctuation\">:<\/span> estimate_nonemb_params<span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span> q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;was_cached&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token keyword\">not<\/span> is_new<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">}<\/span><\/p>\n<p>        <span class=\"token keyword\">try<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            out <span class=\"token operator\">&#061;<\/span> client<span class=\"token punctuation\">.<\/span>loss<span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">,<\/span> use_cache<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            rec<span class=\"token punctuation\">[<\/span><span class=\"token 
string\">&#034;response&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> out<br \/>\n            <span class=\"token comment\"># Use the authoritative FLOPs count if the API returns it in the \/loss response<\/span><br \/>\n            <span class=\"token keyword\">if<\/span> <span class=\"token builtin\">isinstance<\/span><span class=\"token punctuation\">(<\/span>out<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">dict<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">and<\/span> <span class=\"token string\">&#034;total_flops_used&#034;<\/span> <span class=\"token keyword\">in<\/span> out<span class=\"token punctuation\">:<\/span><br \/>\n                used <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>out<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;total_flops_used&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">else<\/span><span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token comment\"># fallback estimate<\/span><br \/>\n                used <span class=\"token operator\">&#061;<\/span> est_after<br \/>\n            rec<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;api_used_after&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> used<br \/>\n            rec<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;status&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;ok&#034;<\/span><br \/>\n            <span class=\"token keyword\">print<\/span><span class=\"token 
punctuation\">(<\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;[<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>i<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">\/<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>queries<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">] ok &#034;<\/span><\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;C&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.1e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> d&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> L&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> H&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> &#034;<\/span><\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;bs&#061;<\/span><span class=\"token interpolation\"><span 
class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> lr&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>q<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> &#034;<\/span><\/span><br \/>\n                <span class=\"token string-interpolation\"><span class=\"token string\">f&#034;loss&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>out<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#039;loss&#039;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> used&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>used<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><br \/>\n            <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">except<\/span> Exception <span class=\"token keyword\">as<\/span> e<span class=\"token punctuation\">:<\/span><br \/>\n            rec<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;status&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token string\">&#034;error&#034;<\/span><br \/>\n            rec<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;error&#034;<\/span><span class=\"token 
punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">repr<\/span><span class=\"token punctuation\">(<\/span>e<span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;[<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>i<span class=\"token operator\">&#043;<\/span><span class=\"token number\">1<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">\/<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token builtin\">len<\/span><span class=\"token punctuation\">(<\/span>queries<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">] error: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>e<span class=\"token conversion-option punctuation\">!r<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>        jsonl_append<span class=\"token punctuation\">(<\/span>results_path<span class=\"token punctuation\">,<\/span> rec<span class=\"token punctuation\">)<\/span><\/p>\n<p>        <span class=\"token keyword\">if<\/span> sleep_s <span class=\"token operator\">&gt;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            time<span class=\"token punctuation\">.<\/span>sleep<span class=\"token punctuation\">(<\/span>sleep_s<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token comment\"># -----------------------------<\/span><br \/>\n<span class=\"token comment\"># CLI<\/span><br \/>\n<span class=\"token comment\"># 
-----------------------------<\/span><br \/>\n<span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    p <span class=\"token operator\">&#061;<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--api-key&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>os<span class=\"token punctuation\">.<\/span>environ<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;CS336_API_KEY&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--base-url&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;http:\/\/hyperturing.stanford.edu:8000&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--cache&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;runs\/api_cache.jsonl&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token 
punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;out&#034;<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;runs\/sweep_results.jsonl&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;budget&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token number\">2e18<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">help<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;scaling-law fit budget cap (FLOPs)&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;dry-run&#034;<\/span><span class=\"token punctuation\">,<\/span> action<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;store_true&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    p<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;sleep&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.0<\/span><span class=\"token 
punctuation\">)<\/span><\/p>\n<p>    sub <span class=\"token operator\">&#061;<\/span> p<span class=\"token punctuation\">.<\/span>add_subparsers<span class=\"token punctuation\">(<\/span>dest<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;mode&#034;<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># coarse mode<\/span><br \/>\n    c <span class=\"token operator\">&#061;<\/span> sub<span class=\"token punctuation\">.<\/span>add_parser<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;coarse&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;train-flops&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1e13<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e14<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e15<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e16<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e17<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e18<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n   
 c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;batch-sizes&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;d-models&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">256<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">512<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">768<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1024<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token 
string\">&#034;&#8211;num-layers&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">8<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">12<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">16<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">24<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;num-heads&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">8<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">16<\/span><span class=\"token 
punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    c<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;learning-rates&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1e-4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3e-4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e-3<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># refine mode: requires a seed config<\/span><br \/>\n    r <span class=\"token operator\">&#061;<\/span> sub<span class=\"token punctuation\">.<\/span>add_parser<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;refine&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;seed-d-model&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token 
punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;seed-num-layers&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;seed-num-heads&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;seed-batch-size&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;seed-learning-rate&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> required<span class=\"token operator\">&#061;<\/span><span class=\"token 
boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;train-flops&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">1e16<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3e16<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e17<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">3e17<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1e18<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;batch-sizes&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">256<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;d-model-mults&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0.75<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1.0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1.25<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;layer-deltas&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token operator\">&#8211;<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token 
punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;head-candidates&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">8<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">16<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    r<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#8211;lr-mults&#034;<\/span><span class=\"token punctuation\">,<\/span> nargs<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#043;&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1.0<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">2.0<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    args <span class=\"token operator\">&#061;<\/span> p<span class=\"token 
punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> args<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> SystemExit<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Missing &#8211;api-key or env CS336_API_KEY&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    client <span class=\"token operator\">&#061;<\/span> ScalingAPIClient<span class=\"token punctuation\">(<\/span><br \/>\n        api_key<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>api_key<span class=\"token punctuation\">,<\/span><br \/>\n        base_url<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>base_url<span class=\"token punctuation\">,<\/span><br \/>\n        cache_path<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>cache<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><\/p>\n<p>    out_path <span class=\"token operator\">&#061;<\/span> Path<span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>out<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>mode <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;coarse&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        queries <span class=\"token operator\">&#061;<\/span> coarse_grid<span class=\"token punctuation\">(<\/span><br \/>\n            train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token 
builtin\">int<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> x <span class=\"token keyword\">in<\/span> args<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            batch_sizes<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>batch_sizes<span class=\"token punctuation\">,<\/span><br \/>\n            d_models<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>d_models<span class=\"token punctuation\">,<\/span><br \/>\n            num_layers<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span><br \/>\n            num_heads<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span><br \/>\n            learning_rates<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>learning_rates<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">else<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        seed <span class=\"token operator\">&#061;<\/span> LossQuery<span class=\"token punctuation\">(<\/span><br \/>\n            d_model<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>seed_d_model<span class=\"token punctuation\">,<\/span><br \/>\n            num_layers<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>seed_num_layers<span class=\"token punctuation\">,<\/span><br \/>\n            num_heads<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>seed_num_heads<span 
class=\"token punctuation\">,<\/span><br \/>\n            batch_size<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>seed_batch_size<span class=\"token punctuation\">,<\/span><br \/>\n            learning_rate<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>seed_learning_rate<span class=\"token punctuation\">,<\/span><br \/>\n            train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e13<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span>  <span class=\"token comment\"># placeholder; overridden by the --train-flops values below<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n        queries <span class=\"token operator\">&#061;<\/span> refine_grid_around_best<span class=\"token punctuation\">(<\/span><br \/>\n            best<span class=\"token operator\">&#061;<\/span>seed<span class=\"token punctuation\">,<\/span><br \/>\n            train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> x <span class=\"token keyword\">in<\/span> args<span class=\"token punctuation\">.<\/span>train_flops<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            batch_sizes<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>batch_sizes<span class=\"token punctuation\">,<\/span><br \/>\n            d_model_mults<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>d_model_mults<span class=\"token punctuation\">,<\/span><br \/>\n            layer_deltas<span class=\"token 
operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>layer_deltas<span class=\"token punctuation\">,<\/span><br \/>\n            head_candidates<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>head_candidates<span class=\"token punctuation\">,<\/span><br \/>\n            lr_mults<span class=\"token operator\">&#061;<\/span>args<span class=\"token punctuation\">.<\/span>lr_mults<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><\/p>\n<p>    run_queries<span class=\"token punctuation\">(<\/span><br \/>\n        client<span class=\"token operator\">&#061;<\/span>client<span class=\"token punctuation\">,<\/span><br \/>\n        queries<span class=\"token operator\">&#061;<\/span>queries<span class=\"token punctuation\">,<\/span><br \/>\n        max_fit_budget_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>budget<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        results_path<span class=\"token operator\">&#061;<\/span>out_path<span class=\"token punctuation\">,<\/span><br \/>\n        dry_run<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">bool<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>dry_run<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        sleep_s<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>sleep<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">if<\/span> __name__ <span 
class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;__main__&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    main<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>The script above generates a batch of LossQuery objects and, before actually issuing each query, checks:<\/p>\n<p>1. Whether the configuration already appears in the local cache or the API history (if so, it does not count against the budget)<\/p>\n<p>2. For a new configuration, whether the estimated additional cost would push the total FLOPs past 2e18 (if so, the sweep stops)<\/p>\n<p>The workflow is as follows:<\/p>\n<p>0) Dry-run first to check whether the budget would be exceeded<\/p>\n<p>The budget cap is 2e18 FLOPs, and once it is exceeded further requests are rejected, so the safest approach is to dry-run first and let the script compute the estimated cost.<\/p>\n<p><span class=\"token builtin class-name\">export<\/span> <span class=\"token assign-left variable\">CS336_API_KEY<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;your-key&#034;<\/span><\/p>\n<p>uv run python cs336_scaling\/run_sweep.py --dry-run coarse <span class=\"token punctuation\">\\<\/span><br \/>\n  --train-flops 1e13 1e14 1e15 1e16 1e17 1e18 <span class=\"token punctuation\">\\<\/span><br \/>\n  --batch-sizes <span class=\"token number\">128<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --d-models <span class=\"token number\">128<\/span> <span class=\"token number\">256<\/span> <span class=\"token number\">512<\/span> <span class=\"token number\">1024<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  
--num-layers <span class=\"token number\">2<\/span> <span class=\"token number\">4<\/span> <span class=\"token number\">8<\/span> <span class=\"token number\">16<\/span> <span class=\"token number\">24<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --num-heads <span class=\"token number\">2<\/span> <span class=\"token number\">4<\/span> <span class=\"token number\">8<\/span> <span class=\"token number\">16<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --learning-rates 1e-4 3e-4 1e-3<\/p>\n<p>The script prints the total number of queries, the number of cache hits, the number of new queries, the estimated additional FLOPs, and the estimated total FLOPs.<\/p>\n<p>1) Run the actual coarse sweep (writes results to JSONL)<\/p>\n<p>uv run python cs336_scaling\/run_sweep.py coarse <span class=\"token punctuation\">\\<\/span><br \/>\n  --train-flops 1e13 1e14 1e15 1e16 1e17 1e18 <span class=\"token punctuation\">\\<\/span><br \/>\n  --batch-sizes <span class=\"token number\">128<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --d-models <span class=\"token number\">128<\/span> <span class=\"token number\">256<\/span> <span class=\"token number\">512<\/span> <span class=\"token number\">1024<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --num-layers <span class=\"token number\">2<\/span> <span class=\"token number\">4<\/span> <span class=\"token number\">8<\/span> <span class=\"token number\">16<\/span> <span class=\"token number\">24<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --num-heads <span class=\"token number\">2<\/span> <span class=\"token number\">4<\/span> <span class=\"token number\">8<\/span> <span class=\"token number\">16<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --learning-rates 1e-4 3e-4 
1e-3<\/p>\n<p>The outputs are written to:<\/p>\n<ul>\n<li>runs\/api_cache.jsonl: the cache<\/li>\n<li>runs\/sweep_results.jsonl: per-query records for this sweep (query, loss, FLOPs used, whether it was a cache hit, and so on)<\/li>\n<\/ul>\n<p>2) Pick the best coarse point as the seed, then refine<\/p>\n<p>uv run python cs336_scaling\/run_sweep.py refine <span class=\"token punctuation\">\\<\/span><br \/>\n  --seed-d-model <span class=\"token number\">512<\/span> --seed-num-layers <span class=\"token number\">16<\/span> --seed-num-heads <span class=\"token number\">8<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --seed-batch-size <span class=\"token number\">128<\/span> --seed-learning-rate <span class=\"token number\">0.0003<\/span> <span class=\"token punctuation\">\\<\/span><br \/>\n  --train-flops 1e16 3e16 1e17 3e17 1e18<\/p>\n<p>The seed is the lowest-loss configuration found in runs\/sweep_results.jsonl at a given compute budget (for example, train_flops&#061;1e18).<\/p>\n<h4>2.3 Scaling-law fitting and prediction script<\/h4>\n<p>The overall design for fitting and extrapolation is as follows:<\/p>\n<p>1. Aggregate the data from sweep_results.jsonl: each record contains the query (d_model\/layers\/heads\/batch\/lr\/train_flops) and response.loss<\/p>\n<p>2. 
Map the architecture hyperparameters to the model size N, using the approximation suggested in the assignment tip: <span class=\"katex--inline\">$N_{\\text{non-emb}} \\approx 12\\, n_{\\text{layer}}\\, d_{\\text{model}}^2$<\/span><\/p>\n<p>3. 
IsoFLOPs-stype \u7684 \u201c\u6bcf\u4e2a compute \u53d6\u6700\u4f18\u70b9\u201d&#xff1a;\u5bf9\u6bcf\u4e2a train_flops &#061; C&#xff0c;\u9009 loss \u6700\u5c0f\u7684\u914d\u7f6e\u4f5c\u4e3a <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        (<\/p>\n<p>        (<\/p>\n<p>        C<\/p>\n<p>        ,<\/p>\n<p>         N<\/p>\n<p>         opt<\/p>\n<p>        (<\/p>\n<p>        C<\/p>\n<p>        )<\/p>\n<p>        ,<\/p>\n<p>         L<\/p>\n<p>         opt<\/p>\n<p>        (<\/p>\n<p>        C<\/p>\n<p>        )<\/p>\n<p>        )<\/p>\n<p>        )<\/p>\n<p>       ((C, N_\\\\text{opt}(C), L_\\\\text{opt}(C)))<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mopen\">((<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord 
mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)))<\/span><\/span><\/span><\/span><\/span><\/p>\n<p>4. \u5728 log-log \u7a7a\u95f4\u62df\u5408\u4e24\u6761\u5e42\u5f8b&#xff1a;<\/p>\n<ul>\n<li>\u6700\u4f18\u6a21\u578b\u89c4\u6a21&#xff1a;<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>          N<\/p>\n<p>          opt<\/p>\n<p>         (<\/p>\n<p>         C<\/p>\n<p>         )<\/p>\n<p>         &#061;<\/p>\n<p>          k<\/p>\n<p>          N<\/p>\n<p>         ,<\/p>\n<p>          C<\/p>\n<p>           a<\/p>\n<p>           N<\/p>\n<p>        N_\\\\text{opt}(C) &#061; k_N,C^{a_N}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span 
class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6644em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 
0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li>\u6700\u4f18 loss&#xff1a;<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>          L<\/p>\n<p>          opt<\/p>\n<p>         (<\/p>\n<p>         C<\/p>\n<p>         )<\/p>\n<p>         &#061;<\/p>\n<p>          L<\/p>\n<p>          \u221e<\/p>\n<p>         &#043;<\/p>\n<p>          k<\/p>\n<p>          L<\/p>\n<p>          C<\/p>\n<p>           \u2212<\/p>\n<p>            a<\/p>\n<p>            L<\/p>\n<p>        L_\\\\text{opt}(C) &#061; L_\\\\infty &#043; k_LC^{-a_L}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" 
style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">&#043;<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.9213em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 
0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7713em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">\u2212<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>&#xff08;\u5e26\u4e00\u4e2a\u4e0d\u53ef\u8fbe\u4e0b\u754c <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>          L<\/p>\n<p>          \u221e<\/p>\n<p>        L_\\\\infty<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>&#xff0c;\u6bd4\u7eaf\u5e42\u5f8b\u66f4\u7a33&#xff09;<\/li>\n<\/ul>\n<p>5. \u5916\u63a8\u5230 1e19&#xff1a;\u5f97\u5230 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         N<\/p>\n<p>         opt<\/p>\n<p>        (<\/p>\n<p>        1<\/p>\n<p>        e<\/p>\n<p>        19<\/p>\n<p>        )<\/p>\n<p>       N_\\\\text{opt}(1e19)<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord mathnormal\">e<\/span><span class=\"mord\">19<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> \u548c <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         L<\/p>\n<p>         opt<\/p>\n<p>        (<\/p>\n<p>        1<\/p>\n<p>        e<\/p>\n<p>        19<\/p>\n<p>        )<\/p>\n<p>       L_\\\\text{opt}(1e19)<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span 
class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord mathnormal\">e<\/span><span class=\"mord\">19<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span><\/p>\n<p>6. \u7ed9\u51fa 1e19 \u7684 \u201c\u53ef\u63d0\u4ea4\u8d85\u53c2\u201d&#xff1a;\u5728\u5141\u8bb8\u8303\u56f4\u5185\u627e\u4e00\u4e2a\u7ed3\u6784 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>        (<\/p>\n<p>         d<\/p>\n<p>         model<\/p>\n<p>        ,<\/p>\n<p>         n<\/p>\n<p>         layer<\/p>\n<p>        ,<\/p>\n<p>         n<\/p>\n<p>         head<\/p>\n<p>        )<\/p>\n<p>       (d_\\\\text{model}, n_\\\\text{layer}, n_\\\\text{head})<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord 
mtight\">model<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">layer<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">head<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> \u4f7f\u5f97\u4f30\u7b97\u53c2\u6570\u91cf\u6700\u63a5\u8fd1 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         N<\/p>\n<p>         
opt<\/p>\n<p>        (<\/p>\n<p>        1<\/p>\n<p>        e<\/p>\n<p>        19<\/p>\n<p>        )<\/p>\n<p>       N_\\\\text{opt}(1e19)<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0361em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2806em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.109em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">opt<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord mathnormal\">e<\/span><span class=\"mord\">19<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>&#xff0c;batch \u53d6 256&#xff08;\u6216 128&#xff09;&#xff0c;learning rate \u53d6\u5728\u6700\u5927 compute&#xff08;1e18&#xff09;\u9644\u8fd1\u8868\u73b0\u6700\u597d\u7684 lr&#xff08;\u4e5f\u53ef\u4ee5\u6309 lr \u968f compute \u7684\u7ecf\u9a8c\u8d8b\u52bf\u505a\u8f7b\u5fae\u5916\u63a8&#xff09;\u3002<\/p>\n<p>\u9996\u5148 \u6570\u636e\u6c47\u603b\u5de5\u5177 cs336_scaling\/scaling_data.py \u7684\u5b9e\u73b0\u5982\u4e0b&#xff1a;<\/p>\n<p><span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> dataclass<br \/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> 
Dict<span class=\"token punctuation\">,<\/span> Iterable<span class=\"token punctuation\">,<\/span> List<span class=\"token punctuation\">,<\/span> Optional<span class=\"token punctuation\">,<\/span> Tuple<\/p>\n<p><span class=\"token decorator annotation punctuation\">&#064;dataclass<\/span><span class=\"token punctuation\">(<\/span>frozen<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><br \/>\n<span class=\"token keyword\">class<\/span> <span class=\"token class-name\">RunRow<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    d_model<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    num_layers<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    num_heads<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    batch_size<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    learning_rate<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><br \/>\n    train_flops<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><br \/>\n    loss<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">approx_nonemb_params<\/span><span class=\"token punctuation\">(<\/span>d_model<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> num_layers<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span 
class=\"token comment\"># Handout tip: non-embedding params \u2248 12 * n_layer * d_model^2<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> <span class=\"token number\">12.0<\/span> <span class=\"token operator\">*<\/span> num_layers <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>d_model <span class=\"token operator\">**<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">load_sweep_jsonl<\/span><span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span> <span class=\"token operator\">|<\/span> Path<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> List<span class=\"token punctuation\">[<\/span>RunRow<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    path <span class=\"token operator\">&#061;<\/span> Path<span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">)<\/span><br \/>\n    rows<span class=\"token punctuation\">:<\/span> List<span class=\"token punctuation\">[<\/span>RunRow<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> path<span class=\"token punctuation\">.<\/span>exists<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> FileNotFoundError<span class=\"token punctuation\">(<\/span>path<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">with<\/span> path<span class=\"token 
punctuation\">.<\/span><span class=\"token builtin\">open<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;r&#034;<\/span><span class=\"token punctuation\">,<\/span> encoding<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">as<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">for<\/span> line <span class=\"token keyword\">in<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n            line <span class=\"token operator\">&#061;<\/span> line<span class=\"token punctuation\">.<\/span>strip<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">if<\/span> <span class=\"token keyword\">not<\/span> line<span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">continue<\/span><br \/>\n            obj <span class=\"token operator\">&#061;<\/span> json<span class=\"token punctuation\">.<\/span>loads<span class=\"token punctuation\">(<\/span>line<span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">if<\/span> obj<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;status&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">!&#061;<\/span> <span class=\"token string\">&#034;ok&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">continue<\/span><br \/>\n            q <span class=\"token operator\">&#061;<\/span> obj<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;query&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token 
punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            resp <span class=\"token operator\">&#061;<\/span> obj<span class=\"token punctuation\">.<\/span>get<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;response&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token keyword\">if<\/span> <span class=\"token string\">&#034;loss&#034;<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token keyword\">in<\/span> resp<span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">continue<\/span><br \/>\n            rows<span class=\"token punctuation\">.<\/span>append<span class=\"token punctuation\">(<\/span><br \/>\n                RunRow<span class=\"token punctuation\">(<\/span><br \/>\n                    d_model<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    num_layers<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    num_heads<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token 
punctuation\">[<\/span><span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    batch_size<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    learning_rate<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>q<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;train_flops&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                    loss<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>resp<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;loss&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token punctuation\">)<\/span><br \/>\n            <span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token comment\"># Return the parsed rows; without this the function implicitly returns None.<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> rows<\/p>\n<p><span class=\"token keyword\">def<\/span> 
<span class=\"token function\">group_best_by_compute<\/span><span class=\"token punctuation\">(<\/span>rows<span class=\"token punctuation\">:<\/span> Iterable<span class=\"token punctuation\">[<\/span>RunRow<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> RunRow<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    best<span class=\"token punctuation\">:<\/span> Dict<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> RunRow<span class=\"token punctuation\">]<\/span> <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> r <span class=\"token keyword\">in<\/span> rows<span class=\"token punctuation\">:<\/span><br \/>\n        C <span class=\"token operator\">&#061;<\/span> r<span class=\"token punctuation\">.<\/span>train_flops<br \/>\n        <span class=\"token keyword\">if<\/span> <span class=\"token punctuation\">(<\/span>C <span class=\"token keyword\">not<\/span> <span class=\"token keyword\">in<\/span> best<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">or<\/span> <span class=\"token punctuation\">(<\/span>r<span class=\"token punctuation\">.<\/span>loss <span class=\"token operator\">&lt;<\/span> best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>loss<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            best<span class=\"token punctuation\">[<\/span>C<span class=\"token punctuation\">]<\/span> 
<span class=\"token operator\">&#061;<\/span> r<br \/>\n    <span class=\"token keyword\">return<\/span> best<\/p>\n<p>The fitting script cs336_scaling\/fit_scaling_laws.py is implemented as follows:<\/p>\n<p><span class=\"token keyword\">import<\/span> argparse<br \/>\n<span class=\"token keyword\">import<\/span> csv<br \/>\n<span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Dict<span class=\"token punctuation\">,<\/span> Tuple<\/p>\n<p><span class=\"token keyword\">import<\/span> numpy <span class=\"token keyword\">as<\/span> np<br \/>\n<span class=\"token keyword\">import<\/span> matplotlib<span class=\"token punctuation\">.<\/span>pyplot <span class=\"token keyword\">as<\/span> plt<\/p>\n<p><span class=\"token keyword\">from<\/span> scaling_data <span class=\"token keyword\">import<\/span> approx_nonemb_params<span class=\"token punctuation\">,<\/span> group_best_by_compute<span class=\"token punctuation\">,<\/span> load_sweep_jsonl<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">fit_powerlaw<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span> y<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Tuple<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token 
triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Fit y &#061; k * x^a using log-log linear regression.<br \/>\n    Returns (k, a).<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    lx <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">)<\/span><br \/>\n    ly <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>log<span class=\"token punctuation\">(<\/span>y<span class=\"token punctuation\">)<\/span><br \/>\n    a<span class=\"token punctuation\">,<\/span> logk <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>polyfit<span class=\"token punctuation\">(<\/span>lx<span class=\"token punctuation\">,<\/span> ly<span class=\"token punctuation\">,<\/span> <span class=\"token number\">1<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>exp<span class=\"token punctuation\">(<\/span>logk<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>a<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">fit_loss_with_floor<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">,<\/span> L<span class=\"token punctuation\">:<\/span> np<span class=\"token punctuation\">.<\/span>ndarray<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Dict<span class=\"token 
punctuation\">[<\/span><span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    Fit L(C) &#061; L_inf &#043; k * C^{-a}<br \/>\n    via a simple grid search over L_inf and log-log fit on (L - L_inf).<br \/>\n    This is robust and dependency-free.<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    <span class=\"token comment\"># L_inf must be below min(L)<\/span><br \/>\n    Lmin <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span>L<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token comment\"># a conservative grid: from Lmin-2.0 up to Lmin-0.01<\/span><br \/>\n    <span class=\"token comment\"># (you can widen if needed)<\/span><br \/>\n    candidates <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>linspace<span class=\"token punctuation\">(<\/span>Lmin <span class=\"token operator\">-<\/span> <span class=\"token number\">2.0<\/span><span class=\"token punctuation\">,<\/span> Lmin <span class=\"token operator\">-<\/span> <span class=\"token number\">0.01<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">200<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    best <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">,<\/span> <span 
class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;mse&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;inf&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token keyword\">for<\/span> Linf <span class=\"token keyword\">in<\/span> candidates<span class=\"token punctuation\">:<\/span><br \/>\n        y <span class=\"token operator\">&#061;<\/span> L <span class=\"token operator\">-<\/span> Linf<br \/>\n        <span class=\"token keyword\">if<\/span> np<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">any<\/span><span class=\"token punctuation\">(<\/span>y <span class=\"token operator\">&lt;&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">continue<\/span><br \/>\n        k<span class=\"token punctuation\">,<\/span> a <span class=\"token operator\">&#061;<\/span> fit_powerlaw<span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">,<\/span> y<span class=\"token punctuation\">)<\/span>          <span class=\"token comment\"># y &#061; k * C^a, but we need y &#061; k * C^{-aL}<\/span><br \/>\n        <span class=\"token comment\"># In our parameterization: y &#061; k * C^{-aL} &#061;&gt; log y &#061; log k - aL log C<\/span><br \/>\n        <span class=\"token comment\"># So slope returned is a &#061; -aL<\/span><br \/>\n        aL <span 
class=\"token operator\">&#061;<\/span> <span class=\"token operator\">-<\/span>a<br \/>\n        pred <span class=\"token operator\">&#061;<\/span> Linf <span class=\"token operator\">&#043;<\/span> k <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>C <span class=\"token operator\">**<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span>aL<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        mse <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>np<span class=\"token punctuation\">.<\/span>mean<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">(<\/span>pred <span class=\"token operator\">-<\/span> L<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">**<\/span> <span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">if<\/span> mse <span class=\"token operator\">&lt;<\/span> best<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;mse&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n            best <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>Linf<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>k<span class=\"token punctuation\">)<\/span><span class=\"token 
punctuation\">,<\/span> <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>aL<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;mse&#034;<\/span><span class=\"token punctuation\">:<\/span> mse<span class=\"token punctuation\">}<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> best<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token keyword\">is<\/span> <span class=\"token boolean\">None<\/span><span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">raise<\/span> RuntimeError<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Failed to fit L(C)&#061;L_inf&#043;k*C^{-a}: no valid Linf candidate.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">return<\/span> best<\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">plot_loglog_points_and_fit<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">,<\/span> y<span class=\"token punctuation\">,<\/span> fit_fn<span class=\"token punctuation\">,<\/span> out_path<span class=\"token punctuation\">:<\/span> Path<span class=\"token punctuation\">,<\/span> title<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">,<\/span> ylab<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">str<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    xs <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>logspace<span class=\"token punctuation\">(<\/span>np<span class=\"token 
punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">min<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> np<span class=\"token punctuation\">.<\/span>log10<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">max<\/span><span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">300<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ys <span class=\"token operator\">&#061;<\/span> fit_fn<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>figure<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>x<span class=\"token punctuation\">,<\/span> y<span class=\"token punctuation\">,<\/span> marker<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;o&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;None&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;best points&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>loglog<span class=\"token punctuation\">(<\/span>xs<span class=\"token punctuation\">,<\/span> ys<span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;-&#034;<\/span><span class=\"token punctuation\">,<\/span> label<span 
class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;fit&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>xlabel<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Compute budget C (FLOPs)&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>ylabel<span class=\"token punctuation\">(<\/span>ylab<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>title<span class=\"token punctuation\">(<\/span>title<span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>grid<span class=\"token punctuation\">(<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> which<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;both&#034;<\/span><span class=\"token punctuation\">,<\/span> linestyle<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;--&#034;<\/span><span class=\"token punctuation\">,<\/span> linewidth<span class=\"token operator\">&#061;<\/span><span class=\"token number\">0.5<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>legend<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>tight_layout<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>savefig<span class=\"token punctuation\">(<\/span>out_path<span class=\"token punctuation\">,<\/span> dpi<span class=\"token operator\">&#061;<\/span><span class=\"token number\">200<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    plt<span class=\"token punctuation\">.<\/span>close<span class=\"token 
punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    ap <span class=\"token operator\">&#061;<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--sweep&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;runs\/sweep_results.jsonl&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--outdir&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;runs\/scaling_fit&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--make-plots&#034;<\/span><span class=\"token punctuation\">,<\/span> action<span class=\"token operator\">&#061;<\/span><span 
class=\"token string\">&#034;store_true&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    args <span class=\"token operator\">&#061;<\/span> ap<span class=\"token punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    rows <span class=\"token operator\">&#061;<\/span> load_sweep_jsonl<span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>sweep<span class=\"token punctuation\">)<\/span><br \/>\n    best <span class=\"token operator\">&#061;<\/span> group_best_by_compute<span class=\"token punctuation\">(<\/span>rows<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Sort by compute<\/span><br \/>\n    Cs <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token builtin\">sorted<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">.<\/span>keys<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    Ns <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>approx_nonemb_params<span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span> best<span class=\"token 
punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> Cs<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><br \/>\n    Ls <span class=\"token operator\">&#061;<\/span> np<span class=\"token punctuation\">.<\/span>array<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>loss <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> Cs<span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span> dtype<span class=\"token operator\">&#061;<\/span>np<span class=\"token punctuation\">.<\/span>float64<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Fit N_opt(C) &#061; kN * C^aN<\/span><br \/>\n    kN<span class=\"token punctuation\">,<\/span> aN <span class=\"token operator\">&#061;<\/span> fit_powerlaw<span class=\"token punctuation\">(<\/span>Cs<span class=\"token punctuation\">,<\/span> Ns<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Fit L_opt(C) &#061; L_inf &#043; kL * C^{-aL}<\/span><br \/>\n    loss_fit <span class=\"token operator\">&#061;<\/span> fit_loss_with_floor<span class=\"token punctuation\">(<\/span>Cs<span class=\"token 
punctuation\">,<\/span> Ls<span class=\"token punctuation\">)<\/span><\/p>\n<p>    args<span class=\"token punctuation\">.<\/span>outdir<span class=\"token punctuation\">.<\/span>mkdir<span class=\"token punctuation\">(<\/span>parents<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">,<\/span> exist_ok<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">True<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Save points (for writeup tables \/ plots)<\/span><br \/>\n    csv_path <span class=\"token operator\">&#061;<\/span> args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;scaling_fit_points.csv&#034;<\/span><br \/>\n    <span class=\"token keyword\">with<\/span> csv_path<span class=\"token punctuation\">.<\/span><span class=\"token builtin\">open<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;w&#034;<\/span><span class=\"token punctuation\">,<\/span> newline<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;&#034;<\/span><span class=\"token punctuation\">,<\/span> encoding<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;utf-8&#034;<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">as<\/span> f<span class=\"token punctuation\">:<\/span><br \/>\n        w <span class=\"token operator\">&#061;<\/span> csv<span class=\"token punctuation\">.<\/span>writer<span class=\"token punctuation\">(<\/span>f<span class=\"token punctuation\">)<\/span><br \/>\n        w<span class=\"token punctuation\">.<\/span>writerow<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;train_flops&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token 
string\">&#034;loss_best&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;n_nonemb_params_est&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> Cs<span class=\"token punctuation\">:<\/span><br \/>\n            r <span class=\"token operator\">&#061;<\/span> best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><br \/>\n            w<span class=\"token punctuation\">.<\/span>writerow<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">[<\/span><br \/>\n                <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>loss<span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>batch_size<span class=\"token 
punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">,<\/span><br \/>\n                approx_nonemb_params<span class=\"token punctuation\">(<\/span>r<span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span> r<span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Save fit params<\/span><br \/>\n    out <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#034;best_points&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">[<\/span><br \/>\n            <span class=\"token punctuation\">{<\/span><br \/>\n                <span class=\"token string\">&#034;train_flops&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;loss&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>loss<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span 
class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>num_heads<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token 
punctuation\">.<\/span>batch_size<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>learning_rate<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;n_nonemb_params_est&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>approx_nonemb_params<span class=\"token punctuation\">(<\/span>best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>d_model<span class=\"token punctuation\">,<\/span> best<span class=\"token punctuation\">[<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>C<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">.<\/span>num_layers<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">}<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> C <span class=\"token keyword\">in<\/span> Cs<br \/>\n        <span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span 
class=\"token string\">&#034;fit&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n            <span class=\"token string\">&#034;n_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> kN<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> aN<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;form&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token string\">&#034;N_opt(C)&#061;k*C^a&#034;<\/span><span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token string\">&#034;l_opt&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token punctuation\">{<\/span><br \/>\n                <span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">:<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">:<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">:<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;mse&#034;<\/span><span 
class=\"token punctuation\">:<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;mse&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">,<\/span><br \/>\n                <span class=\"token string\">&#034;form&#034;<\/span><span class=\"token punctuation\">:<\/span> <span class=\"token string\">&#034;L_opt(C)&#061;L_inf &#043; k*C^{-a}&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">}<\/span><br \/>\n    <span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;scaling_fit.json&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">.<\/span>write_text<span class=\"token punctuation\">(<\/span>json<span class=\"token punctuation\">.<\/span>dumps<span class=\"token punctuation\">(<\/span>out<span class=\"token punctuation\">,<\/span> indent<span class=\"token operator\">&#061;<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; Fit results &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;N_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>kN<span class=\"token punctuation\">:<\/span><span 
class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>aN<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;L_opt(C) &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#039;L_inf&#039;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> &#043; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#039;k&#039;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\"> * C^(-<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#039;a&#039;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">)&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token 
punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Saved: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>args<span class=\"token punctuation\">.<\/span>outdir<span class=\"token operator\">\/<\/span><span class=\"token string\">&#039;scaling_fit.json&#039;<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Saved: <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>csv_path<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>make_plots<span class=\"token punctuation\">:<\/span><br \/>\n        plot_loglog_points_and_fit<span class=\"token punctuation\">(<\/span><br \/>\n            Cs<span class=\"token punctuation\">,<\/span> Ns<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token keyword\">lambda<\/span> x<span class=\"token punctuation\">:<\/span> kN <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>x <span class=\"token operator\">**<\/span> aN<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;nopt_vs_c.png&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            title<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Compute-optimal model size (from best-per-C 
points)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            ylab<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;N_nonemb_params_est&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token comment\"># For loss, loglog doesn&#039;t work with floor directly; plot (L-L_inf) for loglog visualization<\/span><br \/>\n        Linf <span class=\"token operator\">&#061;<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">]<\/span><br \/>\n        y <span class=\"token operator\">&#061;<\/span> Ls <span class=\"token operator\">-<\/span> Linf<br \/>\n        plot_loglog_points_and_fit<span class=\"token punctuation\">(<\/span><br \/>\n            Cs<span class=\"token punctuation\">,<\/span> y<span class=\"token punctuation\">,<\/span><br \/>\n            <span class=\"token keyword\">lambda<\/span> x<span class=\"token punctuation\">:<\/span> loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">]<\/span> <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>x <span class=\"token operator\">**<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span>loss_fit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            args<span class=\"token punctuation\">.<\/span>outdir <span class=\"token operator\">\/<\/span> <span class=\"token string\">&#034;lopt_minus_linf_vs_c.png&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            title<span 
class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;Compute-optimal loss gap (L - L_inf)&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n            ylab<span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;L_opt - L_inf&#034;<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token punctuation\">)<\/span><br \/>\n        <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Saved plots to outdir.&#034;<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">if<\/span> __name__ <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;__main__&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    main<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>Running the script produces the following outputs:<\/p>\n<ul>\n<li>runs\/scaling_fit.json: the fitted parameters plus the best point for each compute budget<\/li>\n<li>runs\/scaling_fit_points.csv: convenient for later plotting<\/li>\n<li>Two plots (nopt_vs_c.png and lopt_minus_linf_vs_c.png), written when plotting is enabled<\/li>\n<\/ul>\n<p>The prediction script cs336_scaling\/predict_1e19.py is implemented as follows:<\/p>\n<p><span class=\"token keyword\">import<\/span> argparse<br \/>\n<span class=\"token keyword\">import<\/span> json<br \/>\n<span class=\"token keyword\">from<\/span> dataclasses <span class=\"token keyword\">import<\/span> asdict<br \/>\n<span class=\"token keyword\">from<\/span> pathlib <span class=\"token keyword\">import<\/span> Path<br \/>\n<span class=\"token keyword\">from<\/span> typing <span class=\"token keyword\">import<\/span> Dict<span class=\"token punctuation\">,<\/span> Tuple<\/p>\n<p><span class=\"token keyword\">import<\/span> numpy <span class=\"token keyword\">as<\/span> np<\/p>\n<p><span class=\"token 
keyword\">from<\/span> api_client <span class=\"token keyword\">import<\/span> LossQuery<br \/>\n<span class=\"token keyword\">from<\/span> scaling_data <span class=\"token keyword\">import<\/span> approx_nonemb_params<\/p>\n<p>ALLOWED_D_MODEL <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token number\">64<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">96<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">160<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">192<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">256<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">320<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">384<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">512<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">640<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">768<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">896<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">1024<\/span><span class=\"token punctuation\">]<\/span><br \/>\nALLOWED_LAYERS <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">list<\/span><span class=\"token punctuation\">(<\/span><span class=\"token builtin\">range<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">25<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># 2..24<\/span><br \/>\nALLOWED_HEADS <span class=\"token operator\">&#061;<\/span> <span 
class=\"token punctuation\">[<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">4<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">8<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">16<\/span><span class=\"token punctuation\">]<\/span><br \/>\nALLOWED_BATCH <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">[<\/span><span class=\"token number\">128<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token number\">256<\/span><span class=\"token punctuation\">]<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">find_closest_arch<\/span><span class=\"token punctuation\">(<\/span>target_N<span class=\"token punctuation\">:<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">-<\/span><span class=\"token operator\">&gt;<\/span> Tuple<span class=\"token punctuation\">[<\/span>Dict<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    <span class=\"token triple-quoted-string string\">&#034;&#034;&#034;<br \/>\n    brute-force search over allowed ranges to find (d_model, num_layers, num_heads)<br \/>\n    that yields N_est closest to target_N, with d_model % num_heads &#061;&#061; 0.<br \/>\n    &#034;&#034;&#034;<\/span><br \/>\n    best <span class=\"token operator\">&#061;<\/span> <span class=\"token boolean\">None<\/span><br \/>\n    best_err <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;inf&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    best_N <span class=\"token operator\">&#061;<\/span> <span class=\"token 
boolean\">None<\/span><\/p>\n<p>    <span class=\"token keyword\">for<\/span> d <span class=\"token keyword\">in<\/span> ALLOWED_D_MODEL<span class=\"token punctuation\">:<\/span><br \/>\n        <span class=\"token keyword\">for<\/span> nl <span class=\"token keyword\">in<\/span> ALLOWED_LAYERS<span class=\"token punctuation\">:<\/span><br \/>\n            <span class=\"token keyword\">for<\/span> nh <span class=\"token keyword\">in<\/span> ALLOWED_HEADS<span class=\"token punctuation\">:<\/span><br \/>\n                <span class=\"token keyword\">if<\/span> d <span class=\"token operator\">%<\/span> nh <span class=\"token operator\">!&#061;<\/span> <span class=\"token number\">0<\/span><span class=\"token punctuation\">:<\/span><br \/>\n                    <span class=\"token keyword\">continue<\/span><br \/>\n                N <span class=\"token operator\">&#061;<\/span> approx_nonemb_params<span class=\"token punctuation\">(<\/span>d<span class=\"token punctuation\">,<\/span> nl<span class=\"token punctuation\">)<\/span><br \/>\n                err <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">abs<\/span><span class=\"token punctuation\">(<\/span>N <span class=\"token operator\">-<\/span> target_N<span class=\"token punctuation\">)<\/span> <span class=\"token operator\">\/<\/span> target_N<br \/>\n                <span class=\"token keyword\">if<\/span> err <span class=\"token operator\">&lt;<\/span> best_err<span class=\"token punctuation\">:<\/span><br \/>\n                    best_err <span class=\"token operator\">&#061;<\/span> err<br \/>\n                    best <span class=\"token operator\">&#061;<\/span> <span class=\"token punctuation\">{<\/span><span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">:<\/span> d<span class=\"token punctuation\">,<\/span> <span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">:<\/span> nl<span class=\"token 
punctuation\">,<\/span> <span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">:<\/span> nh<span class=\"token punctuation\">}<\/span><br \/>\n                    best_N <span class=\"token operator\">&#061;<\/span> N<br \/>\n    <span class=\"token keyword\">return<\/span> best<span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>best_N<span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">def<\/span> <span class=\"token function\">main<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    ap <span class=\"token operator\">&#061;<\/span> argparse<span class=\"token punctuation\">.<\/span>ArgumentParser<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--fit&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span>Path<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;runs\/scaling_fit\/scaling_fit.json&#034;<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--budget&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token 
operator\">&#061;<\/span><span class=\"token number\">1e19<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--batch&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">,<\/span> choices<span class=\"token operator\">&#061;<\/span>ALLOWED_BATCH<span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token number\">256<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    ap<span class=\"token punctuation\">.<\/span>add_argument<span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;--lr&#034;<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">type<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">,<\/span> default<span class=\"token operator\">&#061;<\/span><span class=\"token boolean\">None<\/span><span class=\"token punctuation\">,<\/span> <span class=\"token builtin\">help<\/span><span class=\"token operator\">&#061;<\/span><span class=\"token string\">&#034;override learning rate; if None, use lr from best point at max C&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    args <span class=\"token operator\">&#061;<\/span> ap<span class=\"token punctuation\">.<\/span>parse_args<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    data <span class=\"token operator\">&#061;<\/span> json<span class=\"token punctuation\">.<\/span>loads<span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>fit<span class=\"token punctuation\">.<\/span>read_text<span class=\"token 
punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    nfit <span class=\"token operator\">&#061;<\/span> data<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;fit&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;n_opt&#034;<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    lfit <span class=\"token operator\">&#061;<\/span> data<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;fit&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;l_opt&#034;<\/span><span class=\"token punctuation\">]<\/span><\/p>\n<p>    C <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>budget<span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># Predictions<\/span><br \/>\n    N_pred <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>nfit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>C <span class=\"token operator\">**<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>nfit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    L_pred <span class=\"token operator\">&#061;<\/span> <span class=\"token 
builtin\">float<\/span><span class=\"token punctuation\">(<\/span>lfit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;L_inf&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">&#043;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>lfit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;k&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span> <span class=\"token operator\">*<\/span> <span class=\"token punctuation\">(<\/span>C <span class=\"token operator\">**<\/span> <span class=\"token punctuation\">(<\/span><span class=\"token operator\">-<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>lfit<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;a&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token comment\"># pick lr from the best point at maximum observed C (usually 1e18) unless overridden<\/span><br \/>\n    best_points <span class=\"token operator\">&#061;<\/span> data<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;best_points&#034;<\/span><span class=\"token punctuation\">]<\/span><br \/>\n    maxC_point <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">max<\/span><span class=\"token punctuation\">(<\/span>best_points<span class=\"token punctuation\">,<\/span> key<span class=\"token operator\">&#061;<\/span><span class=\"token keyword\">lambda<\/span> x<span class=\"token punctuation\">:<\/span> x<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;train_flops&#034;<\/span><span class=\"token punctuation\">]<\/span><span 
class=\"token punctuation\">)<\/span><br \/>\n    lr <span class=\"token operator\">&#061;<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>lr<span class=\"token punctuation\">)<\/span> <span class=\"token keyword\">if<\/span> args<span class=\"token punctuation\">.<\/span>lr <span class=\"token keyword\">is<\/span> <span class=\"token keyword\">not<\/span> <span class=\"token boolean\">None<\/span> <span class=\"token keyword\">else<\/span> <span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>maxC_point<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>    arch<span class=\"token punctuation\">,<\/span> N_arch <span class=\"token operator\">&#061;<\/span> find_closest_arch<span class=\"token punctuation\">(<\/span>N_pred<span class=\"token punctuation\">)<\/span><\/p>\n<p>    suggested <span class=\"token operator\">&#061;<\/span> LossQuery<span class=\"token punctuation\">(<\/span><br \/>\n        d_model<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>arch<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;d_model&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        num_layers<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>arch<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;num_layers&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        num_heads<span class=\"token 
operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>arch<span class=\"token punctuation\">[<\/span><span class=\"token string\">&#034;num_heads&#034;<\/span><span class=\"token punctuation\">]<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        batch_size<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span>args<span class=\"token punctuation\">.<\/span>batch<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        learning_rate<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">float<\/span><span class=\"token punctuation\">(<\/span>lr<span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span><br \/>\n        train_flops<span class=\"token operator\">&#061;<\/span><span class=\"token builtin\">int<\/span><span class=\"token punctuation\">(<\/span><span class=\"token number\">1e18<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">,<\/span>  <span class=\"token comment\"># API only supports up to 1e18; 1e19 is for your final report prediction<\/span><br \/>\n    <span class=\"token punctuation\">)<\/span><\/p>\n<p>    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; Scaling-law prediction at 1e19 FLOPs &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Predicted N_opt (non-emb est) : <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>N_pred<span class=\"token punctuation\">:<\/span><span class=\"token 
format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;Predicted L_opt               : <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>L_pred<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.6f<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; Closest feasible architecture (API domain) &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;arch &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>arch<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">, N_est&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>N_arch<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3e<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">, rel_err&#061;<\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span><span class=\"token builtin\">abs<\/span><span class=\"token punctuation\">(<\/span>N_arch<span class=\"token 
operator\">-<\/span>N_pred<span class=\"token punctuation\">)<\/span><span class=\"token operator\">\/<\/span>N_pred<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">.3%<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;&#061;&#061;&#061; Suggested training hyperparams (submit) &#061;&#061;&#061;&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;batch_size must be 128 or 256 (you chose <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>args<span class=\"token punctuation\">.<\/span>batch<span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">).&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span>  <span class=\"token comment\"># handout requirement<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string-interpolation\"><span class=\"token string\">f&#034;learning_rate &#061; <\/span><span class=\"token interpolation\"><span class=\"token punctuation\">{<\/span>lr<span class=\"token punctuation\">:<\/span><span class=\"token format-spec\">g<\/span><span class=\"token punctuation\">}<\/span><\/span><span class=\"token string\">  (default from best &#064; max observed compute)&#034;<\/span><\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token 
keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;NOTE: API train_flops max is 1e18; 1e19 values are extrapolated for the report\/submission.&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span><span class=\"token string\">&#034;Suggested config (for Google form):&#034;<\/span><span class=\"token punctuation\">)<\/span><br \/>\n    <span class=\"token keyword\">print<\/span><span class=\"token punctuation\">(<\/span>json<span class=\"token punctuation\">.<\/span>dumps<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">{<\/span><br \/>\n        <span class=\"token string\">&#034;model_size_nonemb_params_est&#034;<\/span><span class=\"token punctuation\">:<\/span> N_arch<span class=\"token punctuation\">,<\/span>   <span class=\"token comment\"># what you&#039;ll report as &#034;model size&#034;<\/span><br \/>\n        <span class=\"token string\">&#034;arch&#034;<\/span><span class=\"token punctuation\">:<\/span> arch<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;batch_size&#034;<\/span><span class=\"token punctuation\">:<\/span> args<span class=\"token punctuation\">.<\/span>batch<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;learning_rate&#034;<\/span><span class=\"token punctuation\">:<\/span> lr<span class=\"token punctuation\">,<\/span><br \/>\n        <span class=\"token string\">&#034;predicted_loss_at_1e19&#034;<\/span><span class=\"token punctuation\">:<\/span> L_pred<span class=\"token punctuation\">,<\/span><br \/>\n    <span class=\"token punctuation\">}<\/span><span class=\"token punctuation\">,<\/span> indent<span class=\"token 
operator\">&#061;<\/span><span class=\"token number\">2<\/span><span class=\"token punctuation\">)<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p><span class=\"token keyword\">if<\/span> __name__ <span class=\"token operator\">&#061;&#061;<\/span> <span class=\"token string\">&#034;__main__&#034;<\/span><span class=\"token punctuation\">:<\/span><br \/>\n    main<span class=\"token punctuation\">(<\/span><span class=\"token punctuation\">)<\/span><\/p>\n<p>The script reads scaling_fit.json and computes <span class=\"katex--inline\">$N_\\text{opt}(10^{19})$<\/span> and <span class=\"katex--inline\">$L_\\text{opt}(10^{19})$<\/span>. It then searches the allowed range for the architecture hyperparameters (d_model\/layers\/heads) whose parameter count is closest to the target, and reports batch and lr (by default, the lr of the best run at the largest observed compute budget).<\/p>\n<p>Run it as follows:<\/p>\n<p><span class=\"token comment\"># 1) Fit (optionally add --make-plots)<\/span><br \/>\nuv run python cs336_scaling\/fit_scaling_laws.py <span class=\"token punctuation\">\\<\/span><br \/>\n  <span class=\"token parameter variable\">--sweep<\/span> runs\/sweep_results.jsonl <span class=\"token punctuation\">\\<\/span><br \/>\n  <span class=\"token parameter variable\">--outdir<\/span> runs\/scaling_fit <span
class=\"token punctuation\">\\<\/span><br \/>\n  --make-plots<\/p>\n<p><span class=\"token comment\"># 2) Extrapolate to 1e19 and print the final submittable triple<\/span><br \/>\nuv run python cs336_scaling\/predict_1e19.py <span class=\"token punctuation\">\\<\/span><br \/>\n  <span class=\"token parameter variable\">--fit<\/span> runs\/scaling_fit\/scaling_fit.json <span class=\"token punctuation\">\\<\/span><br \/>\n  <span class=\"token parameter variable\">--budget<\/span> 1e19 <span class=\"token punctuation\">\\<\/span><br \/>\n  <span class=\"token parameter variable\">--batch<\/span> <span class=\"token number\">256<\/span><\/p>\n<h4>2.4 Overall Design Analysis<\/h4>\n<p>In this assignment we use the course-provided training API to model and analyze the empirical scaling laws relating model size, training compute, and training loss. The goal is to predict, for a training budget of <span class=\"katex--inline\">$10^{19}$<\/span> FLOPs, the compute-optimal model size, the corresponding training hyperparameters, and the final training loss.<\/p>\n<p>Because I did not have access to the official API while working on this assignment, this section focuses on a complete account of the experimental design, the modeling approach, and the scaling-law fitting procedure. Every numerical result that can only be obtained from real API queries is marked as [placeholder] in the text, to be filled in if API access becomes available later.<\/p>\n<p>1. Problem background and constraints<\/p>\n<p>The training API abstracts the whole training process into a black-box interface: the user specifies the model architecture, the optimizer hyperparameters, and a training compute budget <span class=\"katex--inline\">$C &#061; \\text{train\\_flops}$<\/span>, and queries the resulting final loss.<\/p>\n<p>The problem has the following key constraints:<\/p>\n<ul>\n<li>The total compute budget for API queries used to fit the scaling laws is capped at <span class=\"katex--inline\">$2 \\times 10^{18}$<\/span> FLOPs; once it is exceeded, subsequent requests are rejected;<\/li>\n<li>train_flops can only take values from a given discrete set, with a maximum of <span class=\"katex--inline\">$10^{18}$<\/span> FLOPs;<\/li>\n<li>The assignment asks for predictions at a target compute budget of <span class=\"katex--inline\">$10^{19}$<\/span> FLOPs, so every conclusion is an extrapolation from runs at or below <span class=\"katex--inline\">$10^{18}$<\/span> FLOPs;<\/li>\n<li>In the final submitted training configuration, the batch size must be 128 or 256.<\/li>\n<\/ul>\n<p>Together, these constraints mean the experiments must be run under strict budget control and with reasonable modeling assumptions.<\/p>\n<p>2. Overall design of the experiment pipeline<\/p>\n<p>Around these constraints, we designed and implemented a modular experiment pipeline. The overall flow is as follows:<\/p>\n<p>1. API calls and local caching: all API queries go through a single wrapper interface, and a local cache avoids spending compute budget on repeated queries;<\/p>\n<p>2. Budget-aware experiment sweep: under the global FLOPs budget, different compute budgets and model architectures are explored in stages;<\/p>\n<p>3. Scaling-law fitting: compute-optimal points are constructed from the experiment results, and scaling laws for model size and loss are fitted to them;<\/p>\n<p>4. Extrapolation and final prediction: the fitted scaling laws are extrapolated to <span class=\"katex--inline\">$10^{19}$<\/span> FLOPs to produce the final submittable prediction.<\/p>\n<p>3.
Estimating model size<\/p>\n<p>The training API does not report a model&#039;s total parameter count directly; it describes model size only through the architecture hyperparameters (d_model, num_layers, num_heads), so an approximate mapping from hyperparameters to parameter count is needed.<\/p>\n<p>In this experiment we use the approximation suggested in the assignment handout and estimate the non-embedding parameter count as:<\/p>\n<p><span class=\"katex--display\">$$N \\approx 12 \\cdot n_{\\text{layer}} \\cdot d_{\\text{model}}^2$$<\/span><\/p>\n<p>4.
Compute-Optimal \u70b9\u7684\u6784\u9020\u65b9\u6cd5<\/p>\n<p>\u5728\u6bcf\u4e00\u4e2a\u56fa\u5b9a\u7684\u8bad\u7ec3\u8ba1\u7b97\u9884\u7b97 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         C<\/p>\n<p>         i<\/p>\n<p>       C_i<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> \u4e0b&#xff0c;API \u5141\u8bb8\u67e5\u8be2\u591a\u7ec4\u4e0d\u540c\u7ed3\u6784\u4e0e\u8d85\u53c2\u6570\u914d\u7f6e\u3002\u4e3a\u4e86\u6784\u9020\u7f29\u653e\u5b9a\u5f8b&#xff0c;\u6211\u4eec\u91c7\u7528 IsoFLOPs \u98ce\u683c \u7684\u7b56\u7565&#xff0c;\u5728\u6bcf\u4e2a <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         C<\/p>\n<p>         i<\/p>\n<p>       C_i<\/p>\n<p>    <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 
0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> \u4e0a\u9009\u53d6\u8bad\u7ec3\u635f\u5931\u6700\u5c0f\u7684\u914d\u7f6e\u4f5c\u4e3a\u8be5\u9884\u7b97\u4e0b\u7684\u6700\u4f18\u70b9&#xff1a;<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>          \u03b8<\/p>\n<p>          \u2217<\/p>\n<p>         (<\/p>\n<p>          C<\/p>\n<p>          i<\/p>\n<p>         )<\/p>\n<p>         &#061;<\/p>\n<p>           arg\u2009min<\/p>\n<p>           \u2061<\/p>\n<p>           \u03b8<\/p>\n<p>           :<\/p>\n<p>           train_flops<\/p>\n<p>           &#061;<\/p>\n<p>            C<\/p>\n<p>            i<\/p>\n<p>         L<\/p>\n<p>         (<\/p>\n<p>         \u03b8<\/p>\n<p>         )<\/p>\n<p>         \\\\theta^*(C_i) &#061; \\\\argmin_{\\\\theta:\\\\text{train\\\\_flops}&#061;C_i} L(\\\\theta) <\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">\u03b8<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span 
class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.9135em;vertical-align: -1.1635em\"><\/span><span class=\"mop op-limits\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6679em\"><span class=\"\" style=\"top: -2.1535em;margin-left: 0em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.0278em\">\u03b8<\/span><span class=\"mrel mtight\">:<\/span><span class=\"mord text mtight\"><span class=\"mord mtight\">train_flops<\/span><\/span><span class=\"mrel mtight\">&#061;<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3281em\"><span class=\"\" style=\"top: -2.357em;margin-left: -0.0715em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span 
class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.143em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"\"><span class=\"mop\"><span class=\"mord mathrm\" style=\"margin-right: 0.0139em\">arg<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathrm\">min<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 1.1635em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\">L<\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0278em\">\u03b8<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>\u4ece\u800c\u5f97\u5230\u4e00\u7ec4\u79bb\u6563\u7684 compute-optimal \u70b9&#xff1a;<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>         {<\/p>\n<p>         (<\/p>\n<p>          C<\/p>\n<p>          i<\/p>\n<p>         ,<\/p>\n<p>          N<\/p>\n<p>          \u2217<\/p>\n<p>         (<\/p>\n<p>          C<\/p>\n<p>          i<\/p>\n<p>         )<\/p>\n<p>         ,<\/p>\n<p>          L<\/p>\n<p>          \u2217<\/p>\n<p>         (<\/p>\n<p>          C<\/p>\n<p>          i<\/p>\n<p>         )<\/p>\n<p>         )<\/p>\n<p>          }<\/p>\n<p>           i<\/p>\n<p>           &#061;<\/p>\n<p>           1<\/p>\n<p>          m<\/p>\n<p>         \\\\{(C_i, N^*(C_i), L^*(C_i))\\\\}_{i&#061;1}^m <\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mopen\">{(<\/span><span 
class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span 
class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3117em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0715em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">i<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">))<\/span><span class=\"mclose\"><span class=\"mclose\">}<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7144em\"><span class=\"\" style=\"top: -2.453em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">i<\/span><span class=\"mrel mtight\">&#061;<\/span><span class=\"mord mtight\">1<\/span><\/span><\/span><\/span><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord 
mathnormal mtight\">m<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.247em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>5. Scaling Laws \u7684\u5efa\u6a21\u5f62\u5f0f<\/p>\n<p>\u57fa\u4e8e\u4e0a\u8ff0 compute-optimal \u70b9&#xff0c;\u6211\u4eec\u5206\u522b\u5bf9\u6a21\u578b\u89c4\u6a21\u4e0e\u8bad\u7ec3\u635f\u5931\u62df\u5408\u7f29\u653e\u5b9a\u5f8b\u3002<\/p>\n<p>5.1 \u6a21\u578b\u89c4\u6a21\u7684\u7f29\u653e\u89c4\u5f8b<\/p>\n<p>\u6211\u4eec\u5047\u8bbe compute-optimal \u6a21\u578b\u89c4\u6a21\u968f\u8ba1\u7b97\u9884\u7b97\u5448\u5e42\u5f8b\u589e\u957f&#xff1a;<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>          N<\/p>\n<p>          \u2217<\/p>\n<p>         (<\/p>\n<p>         C<\/p>\n<p>         )<\/p>\n<p>         &#061;<\/p>\n<p>          k<\/p>\n<p>          N<\/p>\n<p>         \u22c5<\/p>\n<p>          C<\/p>\n<p>           a<\/p>\n<p>           N<\/p>\n<p>         N^*(C) &#061; k_N \\\\cdot C^{a_N} <\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 
0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.7144em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7144em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord 
mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>In log space this relation is linear, so log-log linear regression gives stable estimates of the parameters <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">k_N<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> and <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">a_N<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.5806em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">a<\/span><span 
class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>.<\/p>\n<p>5.2 Scaling Law for Training Loss<\/p>\n<p>Because the training loss saturates at large compute, we use a power-law model with a lower-bound term:<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">L^*(C) &#061; L_\\infty &#043; k_L \\cdot C^{-a_L}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin 
mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">&#043;<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span 
class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8213em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8213em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">\u2212<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>where <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">L_\\infty<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> denotes the irreducible lower bound on the loss. In practice, we run a grid search over <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">L_\\infty<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> and, for each candidate satisfying <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">L - L_\\infty &gt; 0<\/span><span class=\"katex-html\"><span class=\"base\"><span 
class=\"strut\" style=\"height: 0.7667em;vertical-align: -0.0833em\"><\/span><span class=\"mord mathnormal\">L<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u2212<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&gt;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">0<\/span><\/span><\/span><\/span><\/span>, fit a linear regression of <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\log(L - L_\\infty)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\">L<\/span><span class=\"mspace\" 
style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u2212<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> on <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\log C<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8889em;vertical-align: -0.1944em\"><\/span><span class=\"mop\">lo<span style=\"margin-right: 0.0139em\">g<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><\/span><\/span><\/span><\/span>, which gives stable parameter estimates.<\/p>\n<p>6. 
Extrapolating to <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{19}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> FLOPs: Prediction Method<\/p>\n<p>Once the scaling laws have been fitted, they can be extrapolated to the target compute budget:<\/p>\n<p><span class=\"katex--display\"><span class=\"katex-display\"><span class=\"katex\"><span class=\"katex-mathml\">\\hat N^*(10^{19}) &#061; k_N \\cdot (10^{19})^{a_N}, \\qquad \\hat L^*(10^{19}) &#061; L_\\infty &#043; k_L \\cdot (10^{19})^{-a_L}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 
0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\"><span class=\"mclose\">)<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7144em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord 
mtight\"><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\" style=\"margin-right: 0.109em\">N<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\" style=\"margin-right: 2em\"><\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\">L<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.2222em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.7387em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span 
class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8333em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">&#043;<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.0315em\">k<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3283em\"><span class=\"\" style=\"top: -2.55em;margin-left: -0.0315em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mathnormal 
mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u22c5<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.1141em;vertical-align: -0.25em\"><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8641em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\"><span class=\"mclose\">)<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8213em\"><span class=\"\" style=\"top: -3.113em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">\u2212<\/span><span class=\"mord mtight\"><span class=\"mord mathnormal mtight\">a<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3448em\"><span class=\"\" style=\"top: -2.3567em;margin-left: 0em;margin-right: 0.0714em\"><span class=\"pstrut\" style=\"height: 2.5em\"><\/span><span class=\"sizing reset-size3 size1 mtight\"><span class=\"mord mathnormal mtight\">L<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" 
style=\"height: 0.1433em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/p>\n<p>Because the API accepts only a limited range of architecture settings, we also need to map the predicted optimal model size <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\hat N^*<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.9468em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> to a realizable discrete architecture. Concretely, we search for the configuration closest to <span class=\"katex--inline\"><span class=\"katex\"><span 
class=\"katex-mathml\">\\hat N^*<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.9468em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> among the architectures that satisfy the following constraints:<\/p>\n<ul>\n<li><span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">d_{\\text{model}} \\bmod n_{\\text{head}} &#061; 0<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: 
-2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">model<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.0556em\"><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\"><span class=\"mord\"><span class=\"mord mathrm\">mod<\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.0556em\"><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.5806em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">head<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">0<\/span><\/span><\/span><\/span><\/span>;<\/li>\n<li><span 
class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>         64<\/p>\n<p>         \u2264<\/p>\n<p>          d<\/p>\n<p>          model<\/p>\n<p>         \u2264<\/p>\n<p>         1024<\/p>\n<p>        64 \\\\le d_{\\\\text{model}} \\\\le 1024<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.7804em;vertical-align: -0.136em\"><\/span><span class=\"mord\">64<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2264<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">model<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2264<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">1024<\/span><\/span><\/span><\/span><\/span>&#xff1b;<\/li>\n<li><span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>         2<\/p>\n<p>         \u2264<\/p>\n<p>          n<\/p>\n<p>          layer<\/p>\n<p>         \u2264<\/p>\n<p>         24<\/p>\n<p>        2 \\\\le 
n_{\\\\text{layer}} \\\\le 24<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.7804em;vertical-align: -0.136em\"><\/span><span class=\"mord\">2<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2264<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.9221em;vertical-align: -0.2861em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">layer<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2264<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">24<\/span><\/span><\/span><\/span><\/span>\u3002<\/li>\n<\/ul>\n<p>7. 
Experiment design strategy under a limited budget<\/p>\n<p>Under a total budget of <span class=\"katex--inline\">$2 \\\\times 10^{18}$<\/span> FLOPs, we adopt a two-stage experimental strategy:<\/p>\n<p>Stage 1 (coarse sweep)<\/p>\n<p>At several compute-budget levels, select a small number of representative model sizes, layer counts, and learning rates to quickly establish the overall trend of training loss versus compute and to locate promising architecture regions.<\/p>\n<p>Stage 2 (refine sweep)<\/p>\n<p>Around the best-performing architectures at the high compute budgets, apply small local perturbations (for example, slightly adjusting model width, layer count, and learning rate) to improve the quality of the compute-optimal points at little additional compute cost.<\/p>\n<p>By estimating the budget before execution and reusing previously queried results through the cache, the whole process stays within the API budget limit.<\/p>\n<p>Concretely, the query strategy and budget allocation are as follows:<\/p>\n<p>In the coarse stage, we follow a \u201ccover first, then densify\u201d principle: for each discrete train_flops level, we make sure there are at least several (for example, 5-10) candidate architectures, so that the lowest-loss point at that compute level can later be selected reliably. At the same time, we explicitly limit the number of swept dimensions: we prioritize the architecture dimensions that most affect model capacity (d_model and num_layers) and use only a few learning-rate candidates (for example, representative values within 2-3 orders of magnitude), so that enough cross-scale information is obtained within the budget. For num_heads, we mainly treat it as a discrete option that satisfies the architecture constraint (<span class=\"katex--inline\">$d_{\\\\text{model}} \\\\bmod n_{\\\\text{head}} &#061; 0$<\/span>) rather than as a primary search dimension of the coarse stage.<\/p>\n<p>In the refine stage, we take the best points found by the coarse stage at the high-compute levels as seeds and apply a local grid perturbation around each seed, for example a \u00b125% relative change to d_model, a \u00b12 change to num_layers, and a x0.5\/x1\/x2 scaling of the learning rate. The core reason is that the scaling-law fit ultimately depends only on the \u201cfrontier of best points\u201d at each compute level, so spending extra budget on improving the quality of those best points is more effective than blindly enlarging the search space. On the implementation side, before each batch of queries a script counts cache hits versus new queries and sums the train_flops of the new queries to estimate the additional cost; as soon as the next query is expected to push the cumulative consumption beyond <span class=\"katex--inline\">$2 \\\\times 10^{18}$<\/span> FLOPs, the run stops early, so the API's rejection mechanism is not triggered.<\/p>\n<p>8. 
Experimental results (to be filled in)<\/p>\n<p>Running the fitting script produces the following results:<\/p>\n<ul>\n<li>The compute-optimal configuration and corresponding loss at each compute budget <span class=\"katex--inline\">$C_i$<\/span><\/li>\n<li>Fitted parameters of the model-size scaling law: <span class=\"katex--inline\">$N^*(C) &#061; \\\\underline{\\\\hspace{0.5cm}} \\\\cdot C^{\\\\underline{\\\\hspace{0.3cm}}}$<\/span><\/li>\n<li>Fitted parameters of the training-loss scaling law: <span class=\"katex--inline\">$L^*(C) &#061; \\\\underline{\\\\hspace{0.5cm}} &#043; \\\\underline{\\\\hspace{0.5cm}} \\\\cdot C^{-\\\\underline{\\\\hspace{0.3cm}}}$<\/span><\/li>\n<li>The corresponding visualization plots<\/li>\n<\/ul>\n<p>9. 
\u5728 <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\"><\/p>\n<p>          10<\/p>\n<p>          19<\/p>\n<p>        10^{19}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> FLOPs \u4e0b\u7684\u6700\u7ec8\u9884\u6d4b\u7ed3\u679c&#xff08;\u5f85\u8865\u9f50&#xff09;<\/p>\n<p>\u8fd0\u884c\u9884\u6d4b\u811a\u672c\u540e&#xff0c;\u5c06\u5f97\u5230\u4ee5\u4e0b\u7ed3\u679c&#xff1a;<\/p>\n<ul>\n<li>\u9884\u6d4b\u7684 compute-optimal \u6a21\u578b\u89c4\u6a21&#xff1a;<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>           N<\/p>\n<p>           ^<\/p>\n<p>          \u2217<\/p>\n<p>         (<\/p>\n<p>          10<\/p>\n<p>          19<\/p>\n<p>         )<\/p>\n<p>         &#061;<\/p>\n<p>          \u203e<\/p>\n<p>        \\\\hat N^*(10^{19}) &#061; \\\\underline{\\\\hspace{0.5cm}}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" 
style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.2em;vertical-align: -0.2em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 
0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li>\u6700\u63a5\u8fd1\u7684\u53ef\u884c\u6a21\u578b\u7ed3\u6784&#xff1a;<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\n<p>          d<\/p>\n<p>          model<\/p>\n<p>         &#061;<\/p>\n<p>          \u203e<\/p>\n<p>         ,<\/p>\n<p>         \u00a0<\/p>\n<p>          n<\/p>\n<p>          layer<\/p>\n<p>         &#061;<\/p>\n<p>          \u203e<\/p>\n<p>         ,<\/p>\n<p>         \u00a0<\/p>\n<p>          n<\/p>\n<p>          head<\/p>\n<p>         &#061;<\/p>\n<p>          \u203e<\/p>\n<p>        d_\\\\text{model}&#061;\\\\underline{\\\\hspace{0.5cm}},\\\\ n_{\\\\text{layer}}&#061;\\\\underline{\\\\hspace{0.5cm}},\\\\ n_{\\\\text{head}}&#061;\\\\underline{\\\\hspace{0.5cm}}<\/p>\n<p>     <\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">model<\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.7167em;vertical-align: -0.2861em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span 
class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\">\u00a0<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">layer<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6306em;vertical-align: -0.2em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" 
style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\">\u00a0<\/span><span class=\"mspace\" style=\"margin-right: 0.1667em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">head<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.2em;vertical-align: -0.2em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 
1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li>Training hyperparameters: <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\text{batch\\\\_size}&#061;\\\\underline{\\\\hspace{0.5cm}},\\\\ \\\\text{learning\\\\_rate}&#061;\\\\underline{\\\\hspace{0.5cm}}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.0044em;vertical-align: -0.31em\"><\/span><span class=\"mord text\"><span class=\"mord\">batch_size<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.0044em;vertical-align: -0.31em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><span class=\"mpunct\">,<\/span><span class=\"mspace\">\u00a0<\/span><span class=\"mspace\" 
style=\"margin-right: 0.1667em\"><\/span><span class=\"mord text\"><span class=\"mord\">learning_rate<\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.2em;vertical-align: -0.2em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<li>Predicted training loss: <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\hat L^*(10^{19}) &#061; \\\\underline{\\\\hspace{0.5cm}}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\">L<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span 
class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.2222em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.2em;vertical-align: -0.2em\"><\/span><span class=\"mord underline\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0em\"><span class=\"\" style=\"top: -2.84em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"underline-line\" style=\"border-bottom-width: 0.04em\"><\/span><\/span><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord\"><span class=\"mspace\" style=\"margin-right: 1.4226em\"><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" 
style=\"height: 0.2em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/li>\n<\/ul>\n<p>The results above are what this assignment ultimately requires us to submit.<\/p>\n<p>10. Discussion of results (assessing fit quality, extrapolation uncertainty, and principles for choosing the final hyperparameters)<\/p>\n<p>To assess fit quality, we use both numerical metrics and plots to judge whether the scaling law is trustworthy. Numerically, we report the mean squared error (MSE) and the coefficient of determination (R^2) of the fitted curve against the best-per-compute points, and we pay particular attention to the error in the high-compute region, because that region has the largest influence on the extrapolation to 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{19}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> FLOPs. Visually, we plot the scatter points and fitted lines for 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">N^*(C)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> and <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">(L^*(C)-L_\\\\infty)<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mopen\">(<\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.0715em\">C<\/span><span class=\"mclose\">)<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\">\u2212<\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">L<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.1514em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">\u221e<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> on log-log axes, and we further check whether the residuals show a systematic trend with compute, to identify the risk that the power-law assumption breaks down over some range of scales.<\/p>\n<p>Because the largest train_flops that the API can query is 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{18}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">18<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>, the prediction at <span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{19}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> FLOPs is an extrapolation beyond the observed scale, and its uncertainty comes mainly from three sources. First, the power-law fit itself may deviate from the true curve outside the observed range. Second, the compute-optimal points are approximated with a finite grid search and may still differ from the true optimum. Third, the predicted optimal parameter count 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\hat N^*(10^{19})<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span> has to be mapped onto a discrete architecture space (integer layers, discrete heads, a finite set of d_model values), which introduces an additional \u201cstructural quantization error\u201d. The report should therefore state explicitly that the final 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{19}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> prediction is an estimate obtained by extrapolating the fit, and should give both the \u201cclosest feasible architecture\u201d and its relative error.<\/p>\n<p>If we actually had to train a model at the predicted scale, we would first choose the architecture hyperparameters so that the non-embedding parameter count 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">N\\\\approx 12n_{\\\\text{layer}}d_{\\\\text{model}}^2<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.6833em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">\u2248<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 1.1002em;vertical-align: -0.2861em\"><\/span><span class=\"mord\">12<\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">layer<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2861em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -2.4169em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">model<\/span><\/span><\/span><\/span><\/span><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\">2<\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.2831em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span> is as close as possible to 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">\\\\hat N^*(10^{19})<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 1.1968em;vertical-align: -0.25em\"><\/span><span class=\"mord\"><span class=\"mord accent\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.9468em\"><span class=\"\" style=\"top: -3em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"mord mathnormal\" style=\"margin-right: 0.109em\">N<\/span><\/span><span class=\"\" style=\"top: -3.2523em\"><span class=\"pstrut\" style=\"height: 3em\"><\/span><span class=\"accent-body\" style=\"left: -0.1667em\"><span class=\"mord\">^<\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.6887em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mbin mtight\">\u2217<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mopen\">(<\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">19<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mclose\">)<\/span><\/span><\/span><\/span><\/span>, subject to 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">d_{\\\\text{model}}\\\\bmod n_{\\\\text{head}}&#061;0<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8444em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">d<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">model<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.0556em\"><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><span class=\"mbin\"><span class=\"mord\"><span class=\"mord mathrm\">mod<\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.0556em\"><\/span><span class=\"mspace\" style=\"margin-right: 0.2222em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.5806em;vertical-align: -0.15em\"><\/span><span class=\"mord\"><span class=\"mord mathnormal\">n<\/span><span class=\"msupsub\"><span class=\"vlist-t vlist-t2\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.3361em\"><span class=\"\" style=\"top: -2.55em;margin-left: 0em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord text mtight\"><span class=\"mord mtight\">head<\/span><\/span><\/span><\/span><\/span><\/span><span class=\"vlist-s\">\u200b<\/span><\/span><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.15em\"><span class=\"\"><\/span><\/span><\/span><\/span><\/span><\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><span class=\"mrel\">&#061;<\/span><span class=\"mspace\" style=\"margin-right: 0.2778em\"><\/span><\/span><span class=\"base\"><span class=\"strut\" style=\"height: 0.6444em\"><\/span><span class=\"mord\">0<\/span><\/span><\/span><\/span><\/span>. The batch size is fixed at 128 or 256 to satisfy the submission requirements. For the learning rate, we default to the learning rate of the best point at a high-compute tier (such as 
<span class=\"katex--inline\"><span class=\"katex\"><span class=\"katex-mathml\">10^{18}<\/span><span class=\"katex-html\"><span class=\"base\"><span class=\"strut\" style=\"height: 0.8141em\"><\/span><span class=\"mord\">1<\/span><span class=\"mord\"><span class=\"mord\">0<\/span><span class=\"msupsub\"><span class=\"vlist-t\"><span class=\"vlist-r\"><span class=\"vlist\" style=\"height: 0.8141em\"><span class=\"\" style=\"top: -3.063em;margin-right: 0.05em\"><span class=\"pstrut\" style=\"height: 2.7em\"><\/span><span class=\"sizing reset-size6 size3 mtight\"><span class=\"mord mtight\"><span class=\"mord mtight\">18<\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span><\/span>), and in the refine stage we verify its stability with local rescaling (x0.5\/x1\/x2), so that we get a hyperparameter configuration closer to compute-optimal while keeping the budget under control.<\/p>\n<p>OK, that is the overall design of this Scaling Laws implementation.<\/p>\n<h3>Conclusion<\/h3>\n<p>In this article we built the full core workflow for Scaling Laws in CS336 Assignment 3: starting from constructing compute-optimal points from the IsoFLOPs curves, we fit scaling laws for model size and training loss under a strict FLOPs budget, and extrapolated them to the target compute budget.<\/p>\n<p>Unlike the previous assignments, which focused on systems implementation and performance optimization, this assignment is about experiment design and decision-making itself: how to choose configurations worth exploring under a limited query budget, how to keep unproductive searches from distorting the fit, and how to map a continuous theoretical optimum onto a discrete, trainable space of model architectures. Around these questions we built a budget-aware experiment pipeline that ties together API calls, result caching, staged search, and scaling-law fitting.<\/p>\n<p>Note that, because access to the official API is restricted, this work was not fully validated by experiment; we have nevertheless walked through the overall design in full.<\/p>\n<p>With that, we have completed everything required in Assignment 3: Scaling. 
Starting with the next article we will move on to implementing Assignment 4: Data. Stay tuned &#x1f917;<\/p>\n<h3>Source Code Download Link<\/h3>\n<ul>\n<li>https:\/\/github.com\/Melody-Zhou\/stanford-cs336-spring2025-assignments<\/li>\n<\/ul>\n<h3>References<\/h3>\n<ul>\n<li>https:\/\/github.com\/stanford-cs336\/assignment3-scaling<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>\u76ee\u5f55\u524d\u8a001. Problem (chinchilla_isoflops): 5 points2. Problem (scaling_laws): 50 points2.1 API \u8c03\u7528\u4e0e\u7f13\u5b58\u5c42\u811a\u672c\u5b9e\u73b02.2 \u5b9e\u9a8c\u8bbe\u8ba1 \/ \u641c\u7d22\u811a\u672c\u5b9e\u73b02.3 \u7f29\u653e\u5b9a\u5f8b\u62df\u5408\u4e0e\u9884\u6d4b\u811a\u672c\u5b9e\u73b02.4 \u6574\u4f53\u8bbe\u8ba1\u601d\u8def\u5206\u6790\u7ed3\u8bed\u6e90\u7801\u4e0b\u8f7d\u94fe\u63a5\u53c2\u8003\u524d\u8a00 \u5728\u4e0a\u7bc7\u6587\u7ae0 \u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws <\/p>\n","protected":false},"author":2,"featured_media":78949,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[8752,8751,8754,8755,8753,75],"topic":[],"class_list":["post-78952","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-server","tag-assignment","tag-cs336","tag-isoflops","tag-power-law","tag-scaling-laws","tag-llm"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3<\/title>\n<meta name=\"robots\" 
content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.wsisp.com\/helps\/78952.html\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\" \/>\n<meta property=\"og:description\" content=\"\u76ee\u5f55\u524d\u8a001. Problem (chinchilla_isoflops): 5 points2. Problem (scaling_laws): 50 points2.1 API \u8c03\u7528\u4e0e\u7f13\u5b58\u5c42\u811a\u672c\u5b9e\u73b02.2 \u5b9e\u9a8c\u8bbe\u8ba1 \/ \u641c\u7d22\u811a\u672c\u5b9e\u73b02.3 \u7f29\u653e\u5b9a\u5f8b\u62df\u5408\u4e0e\u9884\u6d4b\u811a\u672c\u5b9e\u73b02.4 \u6574\u4f53\u8bbe\u8ba1\u601d\u8def\u5206\u6790\u7ed3\u8bed\u6e90\u7801\u4e0b\u8f7d\u94fe\u63a5\u53c2\u8003\u524d\u8a00 \u5728\u4e0a\u7bc7\u6587\u7ae0 \u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.wsisp.com\/helps\/78952.html\" \/>\n<meta property=\"og:site_name\" content=\"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\" \/>\n<meta property=\"article:published_time\" content=\"2026-02-28T09:42:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2026\/02\/20260228094237-69a2b88d0db57.png\" \/>\n<meta name=\"author\" content=\"admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"39 
\u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/78952.html\",\"url\":\"https:\/\/www.wsisp.com\/helps\/78952.html\",\"name\":\"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\",\"isPartOf\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/#website\"},\"datePublished\":\"2026-02-28T09:42:39+00:00\",\"dateModified\":\"2026-02-28T09:42:39+00:00\",\"author\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41\"},\"breadcrumb\":{\"@id\":\"https:\/\/www.wsisp.com\/helps\/78952.html#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.wsisp.com\/helps\/78952.html\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/78952.html#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u9996\u9875\",\"item\":\"https:\/\/www.wsisp.com\/helps\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#website\",\"url\":\"https:\/\/www.wsisp.com\/helps\/\",\"name\":\"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3\",\"description\":\"\u9999\u6e2f\u670d\u52a1\u5668_\u9999\u6e2f\u4e91\u670d\u52a1\u5668\u8d44\u8baf_\u670d\u52a1\u5668\u5e2e\u52a9\u6587\u6863_\u670d\u52a1\u5668\u6559\u7a0b\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.wsisp.com\/helps\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41\",\"name\":\"admin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery\",\"contentUrl\":\"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery\",\"caption\":\"admin\"},\"sameAs\":[\"http:\/\/wp.wsisp.com\"],\"url\":\"https:\/\/www.wsisp.com\/helps\/author\/admin\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.wsisp.com\/helps\/78952.html","og_locale":"zh_CN","og_type":"article","og_title":"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - \u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","og_description":"\u76ee\u5f55\u524d\u8a001. Problem (chinchilla_isoflops): 5 points2. 
Problem (scaling_laws): 50 points2.1 API \u8c03\u7528\u4e0e\u7f13\u5b58\u5c42\u811a\u672c\u5b9e\u73b02.2 \u5b9e\u9a8c\u8bbe\u8ba1 \/ \u641c\u7d22\u811a\u672c\u5b9e\u73b02.3 \u7f29\u653e\u5b9a\u5f8b\u62df\u5408\u4e0e\u9884\u6d4b\u811a\u672c\u5b9e\u73b02.4 \u6574\u4f53\u8bbe\u8ba1\u601d\u8def\u5206\u6790\u7ed3\u8bed\u6e90\u7801\u4e0b\u8f7d\u94fe\u63a5\u53c2\u8003\u524d\u8a00 \u5728\u4e0a\u7bc7\u6587\u7ae0 \u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws","og_url":"https:\/\/www.wsisp.com\/helps\/78952.html","og_site_name":"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","article_published_time":"2026-02-28T09:42:39+00:00","og_image":[{"url":"https:\/\/www.wsisp.com\/helps\/wp-content\/uploads\/2026\/02\/20260228094237-69a2b88d0db57.png"}],"author":"admin","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"admin","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"39 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.wsisp.com\/helps\/78952.html","url":"https:\/\/www.wsisp.com\/helps\/78952.html","name":"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement - 
\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","isPartOf":{"@id":"https:\/\/www.wsisp.com\/helps\/#website"},"datePublished":"2026-02-28T09:42:39+00:00","dateModified":"2026-02-28T09:42:39+00:00","author":{"@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41"},"breadcrumb":{"@id":"https:\/\/www.wsisp.com\/helps\/78952.html#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.wsisp.com\/helps\/78952.html"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.wsisp.com\/helps\/78952.html#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u9996\u9875","item":"https:\/\/www.wsisp.com\/helps"},{"@type":"ListItem","position":2,"name":"\u65af\u5766\u798f\u5927\u5b66 | CS336 | \u4ece\u96f6\u5f00\u59cb\u6784\u5efa\u8bed\u8a00\u6a21\u578b | Spring 2025 | \u7b14\u8bb0 | Assignment 3: Scaling Laws Implement"}]},{"@type":"WebSite","@id":"https:\/\/www.wsisp.com\/helps\/#website","url":"https:\/\/www.wsisp.com\/helps\/","name":"\u7f51\u7855\u4e92\u8054\u5e2e\u52a9\u4e2d\u5fc3","description":"\u9999\u6e2f\u670d\u52a1\u5668_\u9999\u6e2f\u4e91\u670d\u52a1\u5668\u8d44\u8baf_\u670d\u52a1\u5668\u5e2e\u52a9\u6587\u6863_\u670d\u52a1\u5668\u6559\u7a0b","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.wsisp.com\/helps\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"zh-Hans"},{"@type":"Person","@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/358e386c577a3ab51c4493330a20ad41","name":"admin","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/www.wsisp.com\/helps\/#\/schema\/person\/image\/","url":"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery","contentUrl":"https:\/\/gravatar.wp-china-yes.net\/avatar\/?s=96&d=mystery","caption":"admin"},"sameAs":["http:\/\/wp.wsisp.com"],"url":"https:\/\/www.wsisp.com\/helps\/author\/admin"}]}},"_links":{"self":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts\/78952","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/comments?post=78952"}],"version-history":[{"count":0,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/posts\/78952\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/media\/78949"}],"wp:attachment":[{"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/media?parent=78952"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/categories?post=78952"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/tags?post=78952"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/www.wsisp.com\/helps\/wp-json\/wp\/v2\/topic?post=78952"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}