Finish the search problem
The code for this article is at: example_01_Assignment
Please try the problem yourself before reading the answer below.
Use a search policy to implement an agent. The agent receives two inputs: @param start (the starting station) and @param destination. It should return the optimal route through the Beijing Subway system.
Dataflow:
1. Get data from the web page.
- Get the web page source from: https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E5%9C%B0%E9%93%81/408485
- You may need @package requests (https://2.python-requests.org/en/master/) to get the response for a URL.
- You may need to save the page source to the file system.
- The goal of this step is to get the station information for all the subway lines.
- You may need to install @package beautiful soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract the URL information, or just use regular expressions. We recommend using both regular expressions and BeautifulSoup.
- You may need BFS to collect all the related page URLs starting from one URL. Question: why do we use BFS to traverse web pages (that is, to build a web spider)? Can DFS do this job? Which is better?
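The BFS traversal mentioned in the last bullet can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: `crawl_bfs`, `get_links`, and the `FAKE_SITE` link table are hypothetical names, and `get_links` stands in for whatever fetch-and-parse step you write with requests and BeautifulSoup. An in-memory link table is used so the sketch runs without network access.

```python
from collections import deque


def crawl_bfs(start_url, get_links, max_pages=50):
    """Visit pages breadth-first and return them in visiting order.

    get_links(url) is supplied by the caller (hypothetical here); it should
    return the URLs found on the page at url.
    """
    seen = {start_url}           # URLs already queued, to avoid revisiting
    order = []                   # pages in the order they were visited
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()    # FIFO: nearest pages are visited first
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Made-up link structure standing in for real pages.
FAKE_SITE = {
    "subway": ["line1", "line2"],
    "line1": ["station_a", "subway"],
    "line2": ["station_b"],
    "station_a": [],
    "station_b": [],
}

order = crawl_bfs("subway", lambda url: FAKE_SITE.get(url, []))
print(order)  # ['subway', 'line1', 'line2', 'station_a', 'station_b']
```

On the question in the bullet: DFS can also visit every reachable page. The difference is order. With a page limit, BFS stays close to the seed page, where the line pages are likely to be, while DFS may follow one chain of links far away from it. Replacing `popleft()` with `pop()` turns this sketch into a DFS if you want to compare.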
2. Preprocess the data from the page source.
- Starting from the page source fetched from the URL, you may need to do some further preprocessing of the page.
- You may need regular expressions to process the text information.
- You may need @package networkx and @package matplotlib to visualize the data.
- You should build a dictionary or graph that represents the connection information of the Beijing subway routes.
- You may need the defaultdict and set data structures to implement this step.
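One way to build the connection dictionary with defaultdict and set is sketched below. The input shape (a mapping from line name to an ordered list of stations) and the names `build_graph` and `SAMPLE_LINES` are assumptions for illustration; the station names are placeholders, not real Beijing stations.

```python
from collections import defaultdict


def build_graph(lines):
    """Turn {line name: [stations in order]} into {station: set of neighbours}."""
    graph = defaultdict(set)
    for stations in lines.values():
        # Consecutive stations on a line are directly connected, both ways.
        for a, b in zip(stations, stations[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Placeholder data: two lines that cross at station B.
SAMPLE_LINES = {
    "Line X": ["A", "B", "C"],
    "Line Y": ["D", "B", "E"],
}

graph = build_graph(SAMPLE_LINES)
print(sorted(graph["B"]))  # ['A', 'C', 'D', 'E']
```

Using a set for the neighbours means a pair of stations shared by two lines is stored only once.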
3. Build the search agent
Build the search agent based on the graph we built.
For example, when you run the agent with 奥体中心 as the start and 天安门 as the destination, you should get a result like:
奥体中心-> A -> B -> C -> ... -> 天安门
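A minimal sketch of such an agent is a breadth-first search over the station graph, which returns a route with the fewest stops. The function name `search` and the toy graph are assumptions for illustration; real station data would come from the preprocessing step.

```python
from collections import deque


def search(graph, start, destination):
    """Return a route with the fewest stops, or None if none exists."""
    queue = deque([[start]])     # each queue entry is a partial route
    visited = {start}
    while queue:
        path = queue.popleft()
        station = path[-1]
        if station == destination:
            return path
        # sorted() only makes the result deterministic for set-valued graphs.
        for nxt in sorted(graph.get(station, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Toy graph with placeholder station names.
TOY_GRAPH = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "E"},
    "D": {"B", "E"},
    "E": {"C", "D"},
}

route = search(TOY_GRAPH, "A", "E")
print(" -> ".join(route))  # A -> B -> C -> E
```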
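HTTP protocol
The HyperText Transfer Protocol (HTTP) is the most widely used network protocol on the Internet. All WWW documents must conform to this standard.
HTTP is used for communication between a client and a server. The protocol specifies the format in which the client sends requests to the server, and also the format of the response the server returns.
The side that requests text, images, or other information is called the client, and the side that responds with the information is called the server.
The client tells the server how it wants to access the information through the request method:
- GET: retrieve content
- POST: submit a form, which is how you crawl sites that require a login before they return data
- PUT: upload a file
Further reference: HTTP request status
Get familiar with 200, 404, and 503:
- 200 OK: the client request succeeded
- 404 Not Found: the requested resource does not exist, e.g. a wrong URL was entered
- 503 Service Unavailable: the server cannot handle the client's request right now and may recover after a while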
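The three status codes above can be looked up with Python's standard `http.HTTPStatus` enum. The helper `should_retry` is a hypothetical name, and the retry rule it encodes (only 503 is worth retrying) is just one reading of the descriptions above, not a general rule.

```python
from http import HTTPStatus


def should_retry(code):
    """Assumed policy: a 503 may recover later, so only that code is retried."""
    return code == HTTPStatus.SERVICE_UNAVAILABLE


for code in (200, 404, 503):
    # .phrase is the standard reason phrase for the status code.
    print(code, HTTPStatus(code).phrase, should_retry(code))
# 200 OK False
# 404 Not Found False
# 503 Service Unavailable True
```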
#### Requests
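A web page in pure HTML is usually called a static page, and the data on static pages is relatively easy to obtain.
For scraping static pages, the powerful Requests library lets you send HTTP requests easily.
Install Requests from the terminal:
pip install requests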
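Once Requests is installed, fetching a page and saving its source might look like the sketch below. `fetch_page` and `save_page` are hypothetical helper names. The optional `session` parameter is an assumption added so the function can be tried with a stand-in object instead of a real network call.

```python
from pathlib import Path


def fetch_page(url, session=None, timeout=10):
    """Return the body text of url, raising an error on a 4xx/5xx status."""
    if session is None:
        import requests  # imported here so the sketch loads without the package
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()   # turns 404, 503, etc. into an exception
    return response.text


def save_page(html, path):
    """Write the page source to a file so it need not be fetched again."""
    Path(path).write_text(html, encoding="utf-8")


# Example use (needs network access, so it is left commented out):
# html = fetch_page("https://baike.baidu.com/item/%E5%8C%97%E4%BA%AC%E5%9C%B0%E9%93%81/408485")
# save_page(html, "beijing_subway.html")
```

Some sites reject requests that lack a browser-like User-Agent header. If that happens, `session.headers.update({...})` on a `requests.Session` is one place to set it.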
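Further knowledge:
Regular expressions
The idea behind regular expressions is like looking for your boyfriend or girlfriend in a crowd: in your mind, he or she has very distinctive features.
In the same way, we can use regular expressions to find the content we need in a pile of text.
More seriously, a regular expression matches strings against a given pattern.
Mastering regular expressions takes a lot of time. We can start with the basics and study them in more depth when they come up in later work.
Using a regular expression involves the following steps:
- Identify the distinguishing features of the information you want to find
- Express those features with pattern symbols
- Extract the information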
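The three steps above can be sketched with Python's `re` module. The HTML snippet and the `/item/` link pattern are made up for illustration; a real page needs a pattern written against its actual markup.

```python
import re

# Made-up page fragment: two links we want and one we do not.
SAMPLE_HTML = """
<a href="/item/line-a">Line A</a>
<a href="/item/line-b">Line B</a>
<a href="/help">Help</a>
"""

# Step 1: the feature. The links we want all start with /item/.
# Step 2: the pattern. Capture the href and the link text in two groups.
LINE_LINK = re.compile(r'<a href="(/item/[^"]+)">([^<]+)</a>')

# Step 3: extract. findall returns one tuple of groups per match.
matches = LINE_LINK.findall(SAMPLE_HTML)
print(matches)  # [('/item/line-a', 'Line A'), ('/item/line-b', 'Line B')]
```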
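Here is a reference table you can look things up in:
Character | Description |
---|---|
`\` | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, `n` matches the character "n", while `\n` matches a newline. The sequence `\\` matches a literal `\`. |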
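The backslash row can be checked directly in Python. This is a small illustration using raw strings, so that Python itself does not interpret the backslashes before `re` sees them. The variable names are arbitrary.

```python
import re

text = "no\n"                                   # the letters n, o, then a newline

plain_n = re.findall(r"n", text)                # "n" matches the letter n
newline = re.findall(r"\n", text)               # "\n" matches the newline
backslash = re.findall(r"\\", r"a\b")           # "\\" matches one literal backslash

print(plain_n)    # ['n']
print(newline)    # ['\n']
print(backslash)  # ['\\']
```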
Now that we know how to use these tools, let's move on to the implementation.
Reuse the already implemented search agent as much as you can. You only need to define three functions: is_goal(), get_successor(), and strategy().
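A sketch of how the three functions could plug into one generic agent is shown below. The exact signatures are assumptions, since the original agent code is not shown here. The toy network, `STATION_LINES`, `count_transfers`, and the two strategy functions are hypothetical too. The agent does no pruning beyond avoiding cycles within a path, which keeps it correct for both strategies but makes it suitable only for small graphs.

```python
def search(start, is_goal, get_successor, strategy):
    """Expand candidate paths, best first, until one satisfies is_goal."""
    paths = [[start]]
    while paths:
        path = paths.pop(0)               # best candidate under the strategy
        if is_goal(path):
            return path
        for station in get_successor(path[-1]):
            if station not in path:       # never revisit a station on this path
                paths.append(path + [station])
        paths = strategy(paths)           # reorder candidates, best first
    return None


# Toy network: Line X runs A-B-C-D-E, Line Y joins A-F, Line Z joins F-E.
GRAPH = {"A": ["B", "F"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["A", "E"]}
STATION_LINES = {"A": {"X", "Y"}, "B": {"X"}, "C": {"X"},
                 "D": {"X"}, "E": {"X", "Z"}, "F": {"Y", "Z"}}


def count_transfers(path):
    """Fewest line changes needed to ride this exact sequence of stations."""
    transfers, current = 0, None
    for a, b in zip(path, path[1:]):
        edge_lines = STATION_LINES[a] & STATION_LINES[b]
        if current is None or current & edge_lines:
            # Stay on any line that still covers the whole stretch so far.
            current = edge_lines if current is None else current & edge_lines
        else:
            transfers += 1                # no common line: change trains
            current = edge_lines
    return transfers


def shortest_path_first(paths):
    return sorted(paths, key=len)


def minimum_transfer_first(paths):
    return sorted(paths, key=count_transfers)


def reached_e(path):
    return path[-1] == "E"


by_stops = search("A", reached_e, GRAPH.get, shortest_path_first)
by_transfers = search("A", reached_e, GRAPH.get, minimum_transfer_first)
print(by_stops)      # ['A', 'F', 'E']
print(by_transfers)  # ['A', 'B', 'C', 'D', 'E']
```

Only the strategy changes between the two calls. The route through F has fewer stops but needs one change of line, while the longer route stays on Line X the whole way.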
- Define different policies for the transfer system.
- For example: Shortest Path Priority, Minimum Transfer Priority, and Comprehensive Priority.
- Implement continuous transfer. Based on the agent you implemented, add this feature: besides the two stations @param start and @param destination, accept some additional stations, called @param by_way. The path should run from the start to the destination and also include every @param by_way station.
e.g.
1. Input: start=A, destination=B, by_way=[C]
Output: [A, …, C, …, B]
2. Input: start=A, destination=B, by_way=[C, D, E]
Output: [A … C … E … D … B]
# Depending on your policy, station E may be reached first.
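One possible way to support by_way is to try every visiting order of the by_way stations, join the shortest legs between consecutive stops, and keep the shortest overall route. This is a sketch under that assumption, not the only valid policy. `shortest_leg`, `route_via`, and the chain-shaped toy graph are hypothetical, and trying all orders is only practical for a handful of by_way stations.

```python
from collections import deque
from itertools import permutations


def shortest_leg(graph, start, goal):
    """Breadth-first search for a fewest-stops path from start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def route_via(graph, start, destination, by_way):
    """Shortest route from start to destination that visits every by_way station."""
    best = None
    for order in permutations(by_way):    # every possible visiting order
        stops = [start, *order, destination]
        route = [start]
        for a, b in zip(stops, stops[1:]):
            leg = shortest_leg(graph, a, b)
            if leg is None:               # this order cannot be completed
                route = None
                break
            route += leg[1:]              # skip the repeated joining station
        if route is not None and (best is None or len(route) < len(best)):
            best = route
    return best


# Toy chain of stations: A - C - E - D - B.
CHAIN = {"A": ["C"], "C": ["A", "E"], "E": ["C", "D"],
         "D": ["E", "B"], "B": ["D"]}

result = route_via(CHAIN, "A", "B", ["C", "D", "E"])
print(result)  # ['A', 'C', 'E', 'D', 'B']
```

Here E is reached before D even though by_way lists D first, which matches the second example above: the visiting order is chosen by the policy, not by the order of the input list.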
The Answer