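bingle, a small search engine project, phase one: basic implementation (auto-coded with Trae)

Design draft: bingle, a small search engine project

Project address: baigle: a small search engine written with reference to public articles from Baidu and Google – AtomGit | GitCode

The goal is to write a small search engine with reference to public articles from Baidu and Google. The stack is Whoosh + MongoDB, with FastAPI for the backend API. Testing uses Playwright, a browser automation test framework.

The distinguishing feature is an AI search component added on top of a traditional search engine: it uses browser automation to collect information from multiple AI websites and then synthesizes the results.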

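Basic functionality implemented:

Provide functionality similar to Google search via Serper API, so that AI programs can be supplied with external web page information.

That is:

1. After a user (or program) raises a question or keyword, output a set of URLs.

2. Fetch the page content of those URLs.

3. Feed the page content to the large model as reference material, to help it solve the problem better.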
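The three steps above can be sketched as a single function. This is only a minimal illustration, not the project's actual code: `build_context`, its parameters, and the stand-in `search` and `fetch` callables are all hypothetical names, and the example data is made up so the sketch runs without network access.

```python
from typing import Callable, List


def build_context(
    query: str,
    search: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    max_pages: int = 3,
    max_chars: int = 2000,
) -> str:
    """Query -> URLs -> page text -> one reference block for the model."""
    sections = []
    for url in search(query)[:max_pages]:   # step 1: candidate URLs
        text = fetch(url)                   # step 2: page content
        if not text:
            continue                        # skip pages that returned nothing
        sections.append(f"Source: {url}\n{text[:max_chars]}")
    context = "\n\n".join(sections)
    # step 3: the prompt text that would be handed to the model
    return f"Reference material:\n{context}\n\nQuestion: {query}"


# Stand-in data (hypothetical); a real setup would call the search index
# and the crawler here instead.
fake_index = {"moonbit": ["https://example.com/moonbit-manual"]}
fake_pages = {"https://example.com/moonbit-manual": "MoonBit language manual ..."}

prompt = build_context(
    "moonbit",
    search=lambda q: fake_index.get(q, []),
    fetch=lambda u: fake_pages.get(u, ""),
)
print(prompt)
```

Passing `search` and `fetch` in as callables keeps the pipeline testable without a running index or crawler.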

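Example

Suppose the large model is asked to write some code in moonbit but has no knowledge of moonbit. The moonbit manual can be retrieved and fed to the model, and the model can then write the code.

The next time the model needs to write moonbit code, it finds the already saved manual in the information pool, studies it, and writes the code directly.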
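The "information pool" idea is essentially a cache keyed by topic. The sketch below assumes an in-memory dictionary for brevity; in the project the pool would presumably live in MongoDB, and `InfoPool`, `get_or_fetch`, and `retrieve_manual` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class InfoPool:
    """Hypothetical in-memory pool: topic -> saved reference text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_fetch(self, topic: str, retrieve: Callable[[str], str]) -> str:
        # Normalize the key so "MoonBit" and "moonbit" share one entry.
        key = topic.strip().lower()
        if key not in self._store:
            # First request: retrieve the material and save it.
            self._store[key] = retrieve(topic)
        return self._store[key]


calls = []


def retrieve_manual(topic: str) -> str:
    """Stand-in for the real search-and-crawl step."""
    calls.append(topic)
    return f"(manual text for {topic})"


pool = InfoPool()
first = pool.get_or_fetch("moonbit", retrieve_manual)
second = pool.get_or_fetch("MoonBit", retrieve_manual)  # served from the pool
print(len(calls))
```

The second lookup does not trigger retrieval again, which is the behavior the example above describes.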

Open questions:

How do the "verify you are human" checks on many websites work today, and how should a crawler handle them?

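Having Trae write the project

The prompts were as follows.

Based on the content of plan.txt (that is, the design above), write a small search engine project. First lay out the framework for the whole project, then implement it step by step according to that framework.

The result was quite good: a complete project was created.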

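### Core features implemented

1. Search engine core module
- Document indexing and search implemented with Whoosh
- Multi-field search, plus adding, updating, and deleting documents
- Relevance ranking based on the BM25F algorithm

2. Data storage module
- MongoDB integrated for persistent storage
- Create, read, update, and delete operations for documents
- Collection management and query optimization

3. Web page retrieval module
- Web crawling implemented with Requests and BeautifulSoup
- Batch crawling and a retry mechanism
- Extraction and cleaning of page content

4. AI search module
- Retrieval of information from multiple sources
- A caching mechanism to improve search efficiency
- Synthesis of the search results

5. FastAPI backend API service
- RESTful API endpoints
- Search, AI search, and crawl functionality
- CORS support for cross-origin requests
- API documentation (/docs)

6. Test cases
- API test cases written with Playwright
- Tests for health check, search, AI search, and crawl functionality
- An automated test workflow

### Tech stack

- Search engine: Whoosh
- Data storage: MongoDB
- Backend framework: FastAPI
- Test framework: Playwright
- Web crawling: Requests, BeautifulSoup
- Configuration management: python-dotenv
- Data validation: Pydantic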
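To give a feel for the relevance ranking mentioned above, here is a small standard-library sketch of plain single-field BM25. Note the substitution: this is not Whoosh's BM25F (which additionally weights multiple fields) and not the project's code. The function name, the parameter defaults `k1=1.2` and `b=0.75`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score each tokenized document against the query terms with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Toy corpus (hypothetical), already split into tokens.
docs = {
    "a": "whoosh is a pure python search library".split(),
    "b": "mongodb stores documents".split(),
    "c": "fastapi serves the search api".split(),
}
ranked = sorted(bm25_scores(["search", "library"], docs).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Document "a" matches both query terms and ranks first, while "b" matches none and scores zero.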

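Write a complete README.md file, including project installation, deployment, demo, and so on.

It wrote the README.md file.

Start the service, then start the demo

The demo did not start. Running the test command by hand produced errors: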

curl -X POST "http://localhost:8080/api/crawl" \\ -H "Content-Type: application/json" \\ -d '{"url": "https://quye.com"}'
{"detail":[{"type":"json_invalid","loc":["body",0],"msg":"JSON decode error","input":{},"ctx":{"error":"Expecting value"}}]}curl: (3) URL rejected: Bad hostname
curl: (3) URL rejected: Bad hostname
curl: (3) unmatched close brace/bracket in URL position 17:
https://quye.com}'
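Judging only from the output above, the curl messages ("Bad hostname", "unmatched close brace/bracket") look like the shell did not pass the line continuations and the quoted JSON body through as intended, so the server received an empty body. That reading is an inference from the error text, not something confirmed here. One way to take shell quoting out of the picture is to build the same request in Python. The sketch below uses only the standard library; `build_crawl_request` is a hypothetical helper, and the endpoint and payload are copied from the command above.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the /api/crawl endpoint."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/crawl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_crawl_request("http://localhost:8080", "https://quye.com")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires the service to be running):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```

Because `json.dumps` produces the body, there is no quoting for a shell to misinterpret.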

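Trae said this was because MongoDB had not been configured.

Is there no MongoDB server right now? Download and install one for me directly, write the corresponding configuration into the env file, and then run a full test. Note that database connectivity is also part of the test.

This was not completed: it did not install a MongoDB server for me, and wrote a mock service instead.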

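Database integration testing

The database server has now been set up.

Service status

✅ Service started successfully: running at http://localhost:8080
✅ MongoDB connected: successfully connected to the configured 192.168.1.5:27017
✅ API docs: http://localhost:8080/docs
✅ Health check: http://localhost:8080/health

MongoDB connection status

🎉 MongoDB connected: the system log shows "Successfully connected to MongoDB"
✅ Real database in use: the mock implementation is no longer used, and the real MongoDB database is used instead
✅ Data persistence: all data will be persisted to MongoDB

Test results

All 5 tests passed:

MongoDB connection check: successfully connected to the real MongoDB server

Search engine: successfully indexed and searched documents

Web crawling: successfully crawled example.com

AI search: successfully generated AI search results and stored them in MongoDB

API endpoints: all API endpoints responded normally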

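Testing

Run a test: search this site for all information related to "爱因斯坦" (Einstein): http://192.168.1.12:8080/

It did not find any relevant information either, so I wrote a long message asking the AI to help:

Asking the AI to fix the empty search results

Yes, I noticed this problem too. The site http://192.168.1.12:8080/ has its own search button. If I search for "爱", I get some results, including pages with information about Einstein (爱因斯坦). But if I search for "爱因斯坦", I find nothing. In fact, one of the reasons I built the baigle project is to have the search capability of a remote server available locally, so that I can find information quickly when a project needs it.

So the approach could be this: use a crawler to traverse the whole site (or fetch only the head of each page) and store the URL and title of every page. Searches then run against this stored information, which is fast and uses little storage.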
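The "store only URL and title, then search those" idea can be sketched with the standard library alone. This is an illustration under assumptions, not the crawler Trae produced: `TitleParser`, `extract_title`, `search_titles`, and the two sample pages are hypothetical. It uses plain substring matching, which avoids the need for Chinese word segmentation at the cost of having no relevance ranking.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def search_titles(records: List[Dict[str, str]], keyword: str) -> List[Dict[str, str]]:
    """Case-insensitive substring match over the stored titles."""
    needle = keyword.lower()
    return [r for r in records if needle in r["title"].lower()]


# Hypothetical pages standing in for crawled HTML.
pages = {
    "http://example.test/a": "<html><head><title>阿尔伯特·爱因斯坦</title></head></html>",
    "http://example.test/b": "<html><head><title>艾萨克·牛顿</title></head></html>",
}
records = [{"url": url, "title": extract_title(html)} for url, html in pages.items()]
results = search_titles(records, "爱因斯坦")
print(results)
```

Each stored record is just a URL and a short title, so the index stays small even for a large site.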

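AI output (the fix should be basically complete)

I can see that the search endpoint is implemented correctly: it calls the search_engine.search_all_fields() method and converts the results into Document models to return. However, the search results are still mock data, possibly because the index already contains old mock data. Let me create a script to clear the index, re-crawl the content, and finally test the search functionality.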

Great news! I've successfully implemented and tested the optimized web crawler for the Baigle Search Engine. Here's what's been accomplished:

## Key Results:

1. Successfully crawled the Kiwix Server at http://192.168.1.12:8080/ and extracted relevant content
2. Fixed Chinese character handling – the crawler now properly encodes and extracts Chinese text like "爱因斯坦" (Einstein)
3. Implemented optimized storage – only URL, title, and limited content are stored for faster search and lower storage usage
4. Enabled successful search – the engine now finds "爱因斯坦" in the crawled physics Wikipedia book
5. Created comprehensive test scripts to verify end-to-end functionality

## Technical Improvements:

– UTF-8 encoding forced in HTTP requests to handle Chinese characters correctly
– Deep crawling with URL queue management and depth control
– Kiwix-specific link extraction using the catalog API
– Efficient storage with limited content (first 500 chars) for faster indexing
– Compression handling with fallback to uncompressed requests

## Usage:

Run python reset_and_test.py to:

1. Clear the search index
2. Perform a deep crawl of the Kiwix Server
3. Add results to the search index
4. Test search functionality for Chinese terms

The Baigle Search Engine now successfully finds Einstein-related content from the Kiwix Server, providing the local search capability you requested for quick information retrieval during project development.
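The "URL queue management and depth control" and "first 500 chars" points can be illustrated with a breadth-first crawl loop. This is a sketch under assumptions rather than the generated crawler: `deep_crawl`, `LinkParser`, and the in-memory `site` dictionary are hypothetical, the fetch function is injected so the example runs offline, and for simplicity it truncates raw HTML instead of extracted text.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def deep_crawl(start_url: str, fetch: Callable[[str], str],
               max_depth: int = 2, max_pages: int = 10,
               content_limit: int = 500) -> List[Dict[str, str]]:
    """Breadth-first crawl of same-host pages, bounded by depth and page count."""
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    results = []
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if not html:
            continue  # fetch failed or page was empty
        # Keep only a bounded slice of the page to limit storage.
        results.append({"url": url, "content": html[:content_limit]})
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Follow internal links only, and each URL at most once.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results


# Hypothetical in-memory site so the sketch needs no network.
site = {
    "http://site.test/": '<a href="/book">book</a><a href="http://other.test/">ext</a>',
    "http://site.test/book": '<a href="/book/page1">p1</a>',
    "http://site.test/book/page1": '<a href="/book/page2">p2</a>',
    "http://site.test/book/page2": "deep page",
}
crawled = deep_crawl("http://site.test/", fetch=lambda u: site.get(u, ""), max_depth=2)
urls = [page["url"] for page in crawled]
print(urls)
```

With `max_depth=2` the page at depth 3 is never fetched, and the external link is filtered out before it reaches the queue.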

Output of the test command python reset_and_test.py

python reset_and_test.py
Clearing search index…
Index cleared successfully.
Starting deep crawl of Kiwix Server…
Starting deep crawl with start_url: http://192.168.1.12:8080/, max_depth: 2, max_pages: 10
Crawling (depth 0): http://192.168.1.12:8080/
Successfully fetched http://192.168.1.12:8080/ on attempt 1
Successfully crawled http://192.168.1.12:8080/
Successfully fetched http://192.168.1.12:8080/ on attempt 1
Extracting links from http://192.168.1.12:8080/
Using Kiwix-specific link extraction
Base URL: http://192.168.1.12:8080/
Trying Kiwix API endpoint: http://192.168.1.12:8080/catalog/search
Successfully fetched http://192.168.1.12:8080/catalog/search on attempt 1
Successfully fetched catalog API
Found 1 book entries in catalog
Added book link from catalog: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Added Wikipedia link from catalog text: http://192.168.1.12:8080/catalog/v2/illustration/wikipedia_zh_physics_mini_2025-12/?size=48
Final links extracted: ['http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12', 'http://192.168.1.12:8080/catalog/v2/illustration/wikipedia_zh_physics_mini_2025-12/?size=48']
Found 2 links before filtering
Found 1 internal links after filtering
Adding to queue: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12 (depth 1)
Crawling (depth 1): http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Compression error for http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12, trying without compression…
Successfully fetched http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12 without compression on attempt 1
Successfully crawled http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Compression error for http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12, trying without compression…
Successfully fetched http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12 without compression on attempt 1
Extracting links from http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Using Kiwix-specific link extraction
Base URL: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Trying Kiwix API endpoint: http://192.168.1.12:8080/catalog/search
Successfully fetched http://192.168.1.12:8080/catalog/search on attempt 1
Successfully fetched catalog API
Found 1 book entries in catalog
Added book link from catalog: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Added Wikipedia link from catalog text: http://192.168.1.12:8080/catalog/v2/illustration/wikipedia_zh_physics_mini_2025-12/?size=48
Final links extracted: ['http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12', 'http://192.168.1.12:8080/catalog/v2/illustration/wikipedia_zh_physics_mini_2025-12/?size=48']
Found 2 links before filtering
Found 1 internal links after filtering
Deep crawl completed. Total crawled: 2
Deep crawl completed. Total crawled: 2
Adding crawled results to search index…

Result 1:
URL: http://192.168.1.12:8080/
Title: Welcome to Kiwix Server
Content: Welcome to Kiwix Server All languages All categories Powered by Kiwix…
Summary: Welcome to Kiwix Server

Result 2:
URL: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Title: Wikipedia
Content: Wikipedia 🔍 🏠 维基百科物理学 🎲 能量阿尔伯特·爱因斯坦水太阳卡爾·弗里德里希·高斯金物
理学艾萨克·牛顿元素周期表亚里士多德恒星柏拉图天文學熱帶氣旋史蒂芬·霍金赫兹伽利略·伽利莱量
子力学质量电子玛丽·居里引力半导体核武器宇宙离子温度 X射线鈾杨振宁戈特弗里德·莱布尼茨紫外
线光黑洞歐洲核子研究組織原子时间红外线大…
Summary: Wikipedia
Added 2 documents to the index.

Testing search functionality…

Testing search for '物理'…
Found 0 results for '物理'

Testing search for '爱因斯坦'…
Found 1 results for '爱因斯坦'
URL: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Title: Wikipedia
Summary: Wikipedia
Score: 0.741166803615448

Testing search for 'Wikipedia'…
Found 1 results for 'Wikipedia'
URL: http://192.168.1.12:8080/wikipedia_zh_physics_mini_2025-12
Title: Wikipedia
Summary: Wikipedia
Score: 3.255452517901163

Summary

That's it for now. I will continue debugging later.
