
2 Writing a Configuration-Based Spider

邹嵩 edited this page Apr 4, 2020 · 7 revisions

Configuration-Based Spider

The previous chapter showed how to write a basic spider. With that approach you implement all the parsing and data storage yourself, which gives maximum freedom but requires considerably more code. DotnetSpider also provides a default, entity-configuration-based way of writing spiders. See the sample code:

public class EntitySpider : Spider
	{
		public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger) : base(
			options, services, logger)
		{
		}

		protected override async Task InitializeAsync(CancellationToken stoppingToken)
		{
			AddDataFlow(new DataParser<CnblogsEntry>());
			AddDataFlow(GetDefaultStorage());
			await AddRequestsAsync(
				new Request("https://news.cnblogs.com/n/page/1/", new Dictionary<string, string> {{"网站", "博客园"}}),
				new Request("https://news.cnblogs.com/n/page/2/", new Dictionary<string, string> {{"网站", "博客园"}}));
		}

		[Schema("cnblogs", "news")]
		[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
		[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
		[GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
		[FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
		public class CnblogsEntry : EntityBase<CnblogsEntry>
		{
			protected override void Configure()
			{
				HasIndex(x => x.Title);
				HasIndex(x => new {x.WebSite, x.Guid}, true);
			}

			public int Id { get; set; }

			[Required]
			[StringLength(200)]
			[ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
			public string Category { get; set; }

			[Required]
			[StringLength(200)]
			[ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
			public string WebSite { get; set; }

			[StringLength(200)]
			[ValueSelector(Expression = "//title")]
			[ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
			public string Title { get; set; }

			[StringLength(40)]
			[ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
			public string Guid { get; set; }

			[ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
			public string News { get; set; }

			[ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
			public string Url { get; set; }

			[ValueSelector(Expression = ".//div[@class='entry_summary']")]
			public string PlainText { get; set; }

			[ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
			public DateTime CreationTime { get; set; }
		}
	}

Data Entity

First you need to define a data entity. The entity must inherit from EntityBase<>; only entities that derive from EntityBase<> can be handled by the framework's default parser DataParser and persisted by the entity storage EntityStorage.

Schema

Schema defines which database and table the data entity is stored in. Table name postfixes are supported: weekly, monthly, and daily.
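For instance, a daily table postfix could be declared directly on the attribute. This is a sketch: the TablePostfix parameter and its Today value are assumptions about the Schema constructor, not taken from the sample above.

```csharp
// Sketch: assumes Schema accepts a TablePostfix enum (e.g. Today, Monday, Month).
// With Today, data crawled on 2020-04-04 would land in a table named news_20200404.
[Schema("cnblogs", "news", TablePostfix.Today)]
public class CnblogsEntry : EntityBase<CnblogsEntry>
{
    // properties as in the sample above
}
```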

EntitySelector

Defines how data objects are extracted from the text. If this attribute is not configured, the data object is page-level: one page produces exactly one data object, i.e. one row of data.

In the sample code above:

[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]

means the XPath selector finds all content blocks matching .//div[@class='news_block']; each block becomes one data object, i.e. one row of data.

GlobalValueSelector

Defines values queried from the text that are staged in the environment data, where they can be referenced by the entity's property selectors. Multiple instances can be configured.

In the sample code above:

[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
[GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]

means that if the XPath query .//a[@class='current'] yields v, it is stored as { key: 类别, value: v }. The entity can then read the value with an environment selector:

[ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
public string Category { get; set; }

FollowRequestSelector

Defines how links are extracted from the current text and added to the Scheduler. You can specify XPath queries to locate elements that contain links, and you can configure a pattern to decide whether a request qualifies. Links that do not match are ignored entirely; even links added to the Scheduler in the spider's InitializeAsync are subject to the pattern constraint.

In the sample code above:

[FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]

means that every link found inside the elements matched by the XPath query //div[@class='pager'] is offered to the Scheduler.
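The pattern constraint described above could be added alongside the XPath. A sketch; the Patterns property name and the regex are assumptions, not part of the original sample:

```csharp
// Sketch: only links whose URL matches the paging regex are followed;
// everything else found in the pager element is ignored.
[FollowRequestSelector(
    XPaths = new[] {"//div[@class='pager']"},
    Patterns = new[] {@"news\.cnblogs\.com/n/page/\d+"})]
```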

ValueSelector

Defines how a value is queried from the current text block and assigned to an entity property. Note that every ValueSelector inside an entity queries relative to the element matched by the EntitySelector as its root.

The supported selector types are XPath, Regex, Css, JsonPath, and Environment. Environment denotes an environment value, which can come from:

  1. Properties set when the Request was constructed

  2. All values captured by GlobalValueSelector

  3. Certain system-defined values:

    ENTITY_INDEX: the index of the current entity among all entities parsed from the current text
    GUID: a random GUID
    DATE: today's date, as a string formatted "yyyy-MM-dd"
    TODAY: today's date, as a string formatted "yyyy-MM-dd"
    DATETIME: the current time, as a string formatted "yyyy-MM-dd HH:mm:ss"
    NOW: the current time, as a string formatted "yyyy-MM-dd HH:mm:ss"
    MONTH: the first day of the current month, formatted "yyyy-MM-dd"
    MONDAY: the Monday of the current week, formatted "yyyy-MM-dd"
    SPIDER_ID: the ID of the current spider
    REQUEST_HASH: the hash of the request the current entity belongs to
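The sample entity only uses XPath and Environment selectors; the remaining types follow the same attribute pattern. A sketch, with illustrative expressions that are assumptions rather than taken from the sample:

```csharp
// Regex: the value is the first match of the expression in the current block.
[ValueSelector(Expression = @"\d{4}-\d{2}-\d{2}", Type = SelectorType.Regex)]
public string PublishDate { get; set; }

// JsonPath: useful when the downloaded content is JSON rather than HTML.
[ValueSelector(Expression = "$.title", Type = SelectorType.JsonPath)]
public string JsonTitle { get; set; }
```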
    

DataParser

Any data entity that inherits from EntityBase can use the default data parser, DataParser. As in the sample above, add one with:

AddDataFlow(new DataParser<CnblogsEntry>());

Data Storage

We can use the default storage:

AddDataFlow(GetDefaultStorage());

To use the default storage, the following settings are required in appsettings.json:

"StorageConnectionString": "Database='mysql';Data Source=192.168.124.200;password=1qazZAQ!;User ID=root;Port=3306;",
"Storage": "DotnetSpider.MySql.MySqlEntityStorage,DotnetSpider.MySql",
"StorageMode": "InsertIgnoreDuplicate"

StorageConnectionString is the database connection string, Storage is the storage type to use (it must include the assembly information), and StorageMode sets the storage mode:

Insert: insert directly; a duplicate index violation can throw an exception and abort the spider. Supported by all databases
InsertIgnoreDuplicate: insert if the data does not violate a unique constraint, ignore it otherwise. Not supported by all databases
InsertAndUpdate: insert if the data does not exist, update it if it does
Update: update only

Creating and Running a Spider Instance

    class Program
    {
        static async Task Main(string[] args)
        {
            Log.Logger = new LoggerConfiguration()
                .MinimumLevel.Information()
                .MinimumLevel.Override("Microsoft.Hosting.Lifetime", LogEventLevel.Warning)
                .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
                .MinimumLevel.Override("System", LogEventLevel.Warning)
                .MinimumLevel.Override("Microsoft.AspNetCore.Authentication", LogEventLevel.Warning)
                .Enrich.FromLogContext()
                .WriteTo.Console().WriteTo.RollingFile("logs/spiders.log")
                .CreateLogger();

            var builder = Builder.CreateDefaultBuilder<EntitySpider>(options =>
            {
                // 1 request per second
                options.Speed = 1;
                // request timeout
                options.RequestTimeout = 10;
            });
            builder.UseSerilog();
            builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
            await builder.Build().RunAsync();
            Environment.Exit(0);
        }
    }

The output looks like this:

[19:16:12 INF] Argument: RequestedQueueCount, 100
[19:16:12 INF] Argument: Depth, 0
[19:16:12 INF] Argument: RequestTimeout, 10
[19:16:12 INF] Argument: RetriedTimes, 3
[19:16:12 INF] Argument: EmptySleepTime, 10
[19:16:12 INF] Argument: Speed, 1
[19:16:12 INF] Argument: ProxyTestUri, http://www.baidu.com
[19:16:12 INF] Argument: ProxySupplierUri, http://dev.kdlapi.com/api/getproxy/?orderid=948522717574797&num=100&protocol=1&method=2&an_an=1&an_ha=1&sep=1
[19:16:12 INF] Argument: UseProxy, False
[19:16:12 INF] Argument: RemoveOutboundLinks, False
[19:16:12 INF] Argument: StorageConnectionString, Database='mysql';Data Source=localhost;password=1qazZAQ!;User ID=root;Port=3306;
[19:16:12 INF] Argument: Storage, DotnetSpider.MySql.MySqlEntityStorage,DotnetSpider.MySql
[19:16:12 INF] Argument: ConnectionString, 
[19:16:12 INF] Argument: Database, dotnetspider
[19:16:12 INF] Argument: StorageMode, InsertIgnoreDuplicate
[19:16:12 INF] Argument: MySqlFileType, LoadFile
[19:16:12 INF] Argument: SqlServerVersion, V2000
[19:16:12 INF] Argument: HBaseRestServer, 
[19:16:12 INF] None proxy supplier
[19:16:12 INF] Statistics service starting
[19:16:12 INF] Agent register service starting
[19:16:13 INF] Statistics service started
[19:16:13 INF] Agent register service started
[19:16:13 INF] Agent starting
[19:16:13 INF] Initialize d70f3244-5805-4bb7-a134-6762b1df49db, 博客园
[19:16:13 INF] Agent started
[19:16:13 INF] d70f3244-5805-4bb7-a134-6762b1df49db, 博客园 DataFlows: DataParser`1 -> MySqlEntityStorage
[19:16:13 INF] Register topic DOTNET_SPIDER_D70F3244-5805-4BB7-A134-6762B1DF49DB
[19:16:13 INF] d70f3244-5805-4bb7-a134-6762b1df49db, 博客园 started
[19:16:13 INF] https://news.cnblogs.com/n/page/1/ download success
[19:16:14 INF] https://news.cnblogs.com/n/page/2/ download success
[19:16:15 INF] https://news.cnblogs.com/ download success
[19:16:16 INF] https://news.cnblogs.com/n/page/3/ download success
[19:16:17 INF] https://news.cnblogs.com/n/page/4/ download success
[19:16:17 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 11, success 4, failed 0, left 7
[19:16:18 INF] https://news.cnblogs.com/n/page/5/ download success
[19:16:19 INF] https://news.cnblogs.com/n/page/6/ download success
[19:16:20 INF] https://news.cnblogs.com/n/page/7/ download success
[19:16:21 INF] https://news.cnblogs.com/n/page/8/ download success
[19:16:22 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 14, success 9, failed 0, left 5
[19:16:22 INF] https://news.cnblogs.com/n/page/9/ download success
[19:16:23 INF] https://news.cnblogs.com/n/page/100/ download success
[19:16:24 INF] https://news.cnblogs.com/n/page/10/ download success
[19:16:25 INF] https://news.cnblogs.com/n/page/11/ download success
[19:16:26 INF] https://news.cnblogs.com/n/page/12/ download success
[19:16:27 INF] d70f3244-5805-4bb7-a134-6762b1df49db total 22, success 14, failed 0, left 8
[19:16:27 INF] https://news.cnblogs.com/n/page/13/ download success
[19:16:28 INF] https://news.cnblogs.com/n/page/99/ download success
[19:16:29 INF] https://news.cnblogs.com/n/page/96/ download success
[19:16:30 INF] https://news.cnblogs.com/n/page/97/ download success

