2 Writing a Configuration-Based Spider
The previous chapter covered writing a basic spider, where you write all of the parsing and data storage yourself. That approach gives you maximum freedom, but it also requires considerably more code. DotnetSpider ships with a default, entity-configuration-based way of writing spiders. Here is an example:
public class EntitySpider : Spider
{
    public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
        options, services, logger)
    {
    }

    protected override async Task InitializeAsync(CancellationToken stoppingToken)
    {
        // Register the entity parser and the default storage as data flows
        AddDataFlow(new DataParser<CnblogsEntry>());
        AddDataFlow(GetDefaultStorage());

        // Seed the spider; the 网站 property is carried along as an environment value
        await AddRequestsAsync(
            new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
            new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));
    }

    [Schema("cnblogs", "news")]
    [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
    [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
    [GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
    [FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
    public class CnblogsEntry : EntityBase<CnblogsEntry>
    {
        protected override void Configure()
        {
            // An index on Title and a unique index on (WebSite, Guid)
            HasIndex(x => x.Title);
            HasIndex(x => new {x.WebSite, x.Guid}, true);
        }

        public int Id { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
        public string Category { get; set; }

        [Required]
        [StringLength(200)]
        [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
        public string WebSite { get; set; }

        [StringLength(200)]
        [ValueSelector(Expression = "//title")]
        [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
        public string Title { get; set; }

        [StringLength(40)]
        [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
        public string Guid { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
        public string News { get; set; }

        [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
        public string Url { get; set; }

        [ValueSelector(Expression = ".//div[@class='entry_summary']")]
        public string PlainText { get; set; }

        [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
        public DateTime CreationTime { get; set; }
    }
}
First, define a data entity. The entity must inherit from EntityBase<>; only entities that inherit from EntityBase<> can be handled by the framework's default parser DataParser and persisted by the entity storage EntityStorage.
Schema defines which database and which table the data entity is stored in. A table-name postfix is supported: weekly, monthly, or daily, as sketched below.
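For example, to roll the table over by month, a postfix can be passed to the Schema attribute. The TablePostfix parameter and its member names below are assumptions based on the description above, not verified API:
// Assumed sketch: store into cnblogs.news with a monthly table-name postfix.
// TablePostfix and its members (Month, Today, Monday) are assumed from the docs above.
[Schema("cnblogs", "news", TablePostfix.Month)]
public class CnblogsEntry : EntityBase<CnblogsEntry>
{
    // ... properties as in the example above
}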
EntitySelector defines how data objects are extracted from the text. If this attribute is not configured, the data object is page-level, i.e. one page produces exactly one data object, and therefore one row of data.
In the example above:
[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
means that the XPath selector extracts every content block matching .//div[@class='news_block']; each block becomes one data object, i.e. one row of data.
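By contrast, an entity declared without an EntitySelector is page-level. A minimal sketch (the entity name and table below are illustrative only):
// Page-level entity: no EntitySelector, so each downloaded page yields exactly one row.
[Schema("cnblogs", "pages")]
public class PageTitle : EntityBase<PageTitle>
{
    [ValueSelector(Expression = "//title")]
    public string Title { get; set; }
}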
GlobalValueSelector defines values that are queried from the text and stored temporarily as environment data, so that properties of the data entity can look them up; multiple selectors may be configured.
In the example above:
[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
[GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
means that if the XPath selector .//a[@class='current'] yields a value v, it is stored as { key: 类别, value: v }; an environment selector can then be configured on an entity property to pick up that value:
[ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
public string Category { get; set; }
FollowRequestSelector defines how follow-up links are extracted from the current text and added to the Scheduler. You can specify XPath expressions that select the elements containing the links, and you can also configure patterns that decide whether a request qualifies; links that do not match are ignored entirely. Even links added to the Scheduler in the spider's InitializeAsync are subject to the pattern constraint.
In the example above:
[FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
means that every link found inside the elements matched by the XPath selector //div[@class='pager'] is offered to the Scheduler.
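To restrict follow-up links, a pattern can be configured alongside the XPath. The Patterns property name and the regular expression below are assumptions for illustration, based on the description above:
// Assumed sketch: only follow links that look like news list pages.
[FollowRequestSelector(
    XPaths = new[] {"//div[@class='pager']"},
    Patterns = new[] {"news\\.cnblogs\\.com/n/page/\\d+"})]
public class CnblogsEntry : EntityBase<CnblogsEntry>
{
    // ...
}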
ValueSelector defines how a value is queried from the current text block and assigned to a property of the data entity. Note that every ValueSelector inside a data entity is evaluated with the element matched by EntitySelector as its root.
The supported selector types are XPath, Regex, Css, JsonPath and Environment. Environment means the value comes from environment data, whose sources are:
- Properties set when the Request is constructed
- Every value extracted by a GlobalValueSelector
- A set of system-defined values:
  ENTITY_INDEX: the position of the current data entity among all entities extracted from the current text
  GUID: a random GUID
  DATE: today's date, formatted as "yyyy-MM-dd"
  TODAY: today's date, formatted as "yyyy-MM-dd"
  DATETIME: the current time, formatted as "yyyy-MM-dd HH:mm:ss"
  NOW: the current time, formatted as "yyyy-MM-dd HH:mm:ss"
  MONTH: the first day of the current month, formatted as "yyyy-MM-dd"
  MONDAY: the Monday of the current week, formatted as "yyyy-MM-dd"
  SPIDER_ID: the ID of the current spider
  REQUEST_HASH: the hash of the request that produced the current data entity
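As a quick illustration, an extra property could record an entity's position within the page via one of the system-defined values (the Index property below is hypothetical, added only for this example):
// Hypothetical property: filled with the entity's index within the current page.
[ValueSelector(Expression = "ENTITY_INDEX", Type = SelectorType.Environment)]
public int Index { get; set; }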
Any data entity that inherits from EntityBase can be handled by the default data parser DataParser. As in the example above, we add a parser for it:
AddDataFlow(new DataParser<CnblogsEntry>());
and we can use the default storage:
AddDataFlow(GetDefaultStorage());
To use the default storage, the following settings are required in appsettings.json:
"StorageConnectionString": "Database='mysql';Data Source=192.168.124.200;password=1qazZAQ!;User ID=root;Port=3306;",
"Storage": "DotnetSpider.MySql.MySqlEntityStorage,DotnetSpider.MySql",
"StorageMode": "InsertIgnoreDuplicate"
StorageConnectionString is the database connection string, Storage is the storage type to use (the assembly name must be included), and StorageMode controls how rows are written:
Insert: plain insert; a duplicate-key violation may throw an exception and stop the spider. Supported by all databases.
InsertIgnoreDuplicate: insert the row if it does not violate a unique constraint, otherwise ignore it. Not supported by every database.
InsertAndUpdate: insert the row if it does not exist, otherwise update it.
Update: update only.
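Put together, the storage section of appsettings.json looks roughly like this (host, user and password are placeholders):
{
  "StorageConnectionString": "Database='mysql';Data Source=localhost;password=<password>;User ID=root;Port=3306;",
  "Storage": "DotnetSpider.MySql.MySqlEntityStorage,DotnetSpider.MySql",
  "StorageMode": "InsertIgnoreDuplicate"
}
With the configuration in place, a complete console program that runs the spider looks like this: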
class Program
{
    static async Task Main(string[] args)
    {
        Log.Logger = new LoggerConfiguration()
            .MinimumLevel.Information()
            .MinimumLevel.Override("Microsoft.Hosting.Lifetime", LogEventLevel.Warning)
            .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
            .MinimumLevel.Override("System", LogEventLevel.Warning)
            .MinimumLevel.Override("Microsoft.AspNetCore.Authentication", LogEventLevel.Warning)
            .Enrich.FromLogContext()
            .WriteTo.Console()
            .WriteTo.RollingFile("logs/spiders.log")
            .CreateLogger();

        var builder = Builder.CreateDefaultBuilder<EntitySpider>(options =>
        {
            // 1 request per second
            options.Speed = 1;
            // Request timeout
            options.RequestTimeout = 10;
        });
        builder.UseSerilog();
        builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
        await builder.Build().RunAsync();
        Environment.Exit(0);
    }
}
The output looks like this:
[19:16:12 INF] Argument: RequestedQueueCount, 100
[19:16:12 INF] Argument: Depth, 0
[19:16:12 INF] Argument: RequestTimeout, 10
[19:16:12 INF] Argument: RetriedTimes, 3
[19:16:12 INF] Argument: EmptySleepTime, 10
[19:16:12 INF] Argument: Speed, 1
[19:16:12 INF] Argument: ProxyTestUri, http://www.baidu.com
[19:16:12 INF] Argument: ProxySupplierUri, http://dev.kdlapi.com/api/getproxy/?orderid=948522717574797&num=100&protocol=1&method=2&an_an=1&an_ha=1&sep=1
[19:16:12 INF] Argument: UseProxy, False
[19:16:12 INF] Argument: RemoveOutboundLinks, False
[19:16:12 INF] Argument: StorageConnectionString, Database='mysql';Data Source=localhost;password=1qazZAQ!;User ID=root;Port=3306;
[19:16:12 INF] Argument: Storage, DotnetSpider.MySql.MySqlEntityStorage,DotnetSpider.MySql
[19:16:12 INF] Argument: ConnectionString,
[19:16:12 INF] Argument: Database, dotnetspider
[19:16:12 INF] Argument: StorageMode, InsertIgnoreDuplicate
[19:16:12 INF] Argument: MySqlFileType, LoadFile
[19:16:12 INF] Argument: SqlServerVersion, V2000
[19:16:12 INF] Argument: HBaseRestServer,
[19:16:12 INF] None proxy supplier
[19:16:12 INF] Statistics service starting
[19:16:12 INF] Agent register service starting
[19:16:13 INF] Statistics service started
[19:16:13 INF] Agent register service started
[19:16:13 INF] Agent starting
[19:16:13 INF] Initialize d70f3244-5805-4bb7-a134-6762b1df49db, 博客园
[19:16:13 INF] Agent started
[19:16:13 INF] d70f3244-5805-4bb7-a134-6762b1df49db, 博客园 DataFlows: DataParser`1 -> MySqlEntityStorage
[19:16:13 INF] Register topic DOTNET_SPIDER_D70F3244-5805-4BB7-A134-6762B1DF49DB
[19:16:13 INF] d70f3244-5805-4bb7-a134-6762b1df49db, 博客园 started
[19:16:13 INF] https://news.cnblogs.com/n/page/1/ download success
[19:16:14 INF] https://news.cnblogs.com/n/page/2/ download success
[19:16:15 INF] https://news.cnblogs.com/ download success
[19:16:16 INF] https://news.cnblogs.com/n/page/3/ download success
[19:16:17 INF] https://news.cnblogs.com/n/page/4/ download success
[19:16:17 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 11, success 4, failed 0, left 7
[19:16:18 INF] https://news.cnblogs.com/n/page/5/ download success
[19:16:19 INF] https://news.cnblogs.com/n/page/6/ download success
[19:16:20 INF] https://news.cnblogs.com/n/page/7/ download success
[19:16:21 INF] https://news.cnblogs.com/n/page/8/ download success
[19:16:22 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 14, success 9, failed 0, left 5
[19:16:22 INF] https://news.cnblogs.com/n/page/9/ download success
[19:16:23 INF] https://news.cnblogs.com/n/page/100/ download success
[19:16:24 INF] https://news.cnblogs.com/n/page/10/ download success
[19:16:25 INF] https://news.cnblogs.com/n/page/11/ download success
[19:16:26 INF] https://news.cnblogs.com/n/page/12/ download success
[19:16:27 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 22, success 14, failed 0, left 8
[19:16:27 INF] https://news.cnblogs.com/n/page/13/ download success
[19:16:28 INF] https://news.cnblogs.com/n/page/99/ download success
[19:16:29 INF] https://news.cnblogs.com/n/page/96/ download success
[19:16:30 INF] https://news.cnblogs.com/n/page/97/ download success