Web entities
To analyse the web, web pages have to be grouped into coherent sets.
Each set is defined by a common URL part.
The classic and natural way to group pages is by domain name: http://www.domain.tld
But this way of grouping pages hides the complexity of web sites, which are often made of several subdomains that in turn often contain subfolders. Hence, the seemingly simple way of grouping web pages by domain name can become irrelevant if:
- the web pages are hosted on a shared hosting service like blogspot.com. The blogspot.com platform hosts many different blogs. If one of them has to be included in a web corpus, a specific web entity has to be set, such as http://whatever.blogspot.com. In that case the entity is not the whole of http://www.blogspot.com but only the specific "whatever" blog.
- the research scope aims to look into interactions between internal services. Consider an institution like Sciences Po, whose website is http://www.sciences-po.fr. If one wants to consider the organisation as a whole, a classic domain name grouping like www.sciences-po.fr is correct. But if one wants to look into interactions between internal services, one has to group pages by service rather than aggregate them all under the domain. This defines two entities:
  - http://www.sciences-po.fr/recherche (research)
  - http://www.sciences-po.fr/dfc (professional training)
- the research requires focusing a web entity at whatever level is considered relevant, in order to adapt to the heterogeneity of the web. For example, it might on occasion be interesting to go down to page level to track links pointing to a specific article in an online newspaper that a community considers an excellent reference or a rallying point.
A web entity is represented as a URL pattern. The pattern defines both the boundaries of a set of web pages and a corresponding level: low when the entity is close to the domain name, high when the entity is close to the web resources. Hence, the higher the level, the smaller and more specific the entity. Conversely, the lower the level, the bigger and more general the entity. That way, HCI users can define their web entities to match their objectives and the structure of the web they explore.
A single URL pattern is not sufficient to define a single web entity. For one URL there is one and only one web page, but the reverse is not true: one web page can have multiple URLs. Back to the Sciences Po example, the research service actually has two different URLs: http://recherche.sciences-po.fr and http://sciences-po.fr/recherche.
Here both subdomains and subfolders have been used to point to the service website. To address this issue, a web entity needs to hold an alias for every encountered URL pattern that matches it. Thus, each alias pattern points to a common web entity, and both http://recherche.sciences-po.fr and http://sciences-po.fr/recherche point at the entity sciences-po/recherche. This notion of a conceptual web entity also helps resolve several problems related to the dynamics of the network. It is useful, for example, for web actors that changed name at some point but maintained their old URLs to keep old links working.
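The alias idea can be sketched in a few lines of Python. This is only an illustration: the `ALIASES` table, the `normalize` and `resolve_entity` helpers, and the choice to ignore the scheme and a leading `www.` are assumptions made for the example, not part of the design described here.

```python
# Hypothetical alias table: every URL prefix (alias) maps to one entity name.
ALIASES = {
    "recherche.sciences-po.fr": "sciences-po/recherche",
    "sciences-po.fr/recherche": "sciences-po/recherche",
    "sciences-po.fr/dfc": "sciences-po/dfc",
}


def normalize(url):
    """Drop the scheme and a leading 'www.' (a simplifying assumption)."""
    rest = url.split("://", 1)[-1]
    if rest.startswith("www."):
        rest = rest[len("www."):]
    return rest.rstrip("/")


def resolve_entity(url):
    """Return the entity whose alias is the longest prefix of the URL, or None."""
    target = normalize(url)
    best = None
    for alias, entity in ALIASES.items():
        # An alias matches the URL itself or anything below it.
        if target == alias or target.startswith(alias + "/"):
            if best is None or len(alias) > len(best[0]):
                best = (alias, entity)
    return best[1] if best else None
```

With this table, a page reached through the subdomain and a page reached through the subfolder resolve to the same entity, while a URL outside every alias resolves to nothing.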
A web entity is finite in time and may also change over time. URL aliases help keep web entities consistent over time, but temporality remains a crucial problem, in particular if the domain is lost or if the content changes too much to remain the same entity. An archiving level is currently under active consideration to take the dynamics of web entities into account.
Technically, the URL pattern is stored as a pointer to a specific stem in the memory structure, which is a tree of crawled URLs. All aliases of a web entity are also stems, which point at the main stem.
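A minimal sketch of such a tree, assuming each node is keyed by one token written in the LRU notation used in the rules further down this page. The `Stem` class and the `get_stem` and `find_entity` helpers are hypothetical names chosen for the illustration, and the real memory structure may be organised differently.

```python
class Stem:
    """One node of the URL tree; `alias_of` points at the main stem, if any."""

    def __init__(self):
        self.children = {}
        self.entity = None    # entity name when this stem is a main stem
        self.alias_of = None  # another Stem when this stem is an alias


def get_stem(root, tokens):
    """Walk the path for a token list, creating nodes, and return the last one."""
    node = root
    for token in tokens:
        node = node.children.setdefault(token, Stem())
    return node


def find_entity(root, tokens):
    """Return the entity of the deepest flagged stem on the path, or None."""
    node, found = root, None
    for token in tokens:
        node = node.children.get(token)
        if node is None:
            break
        main = node.alias_of or node  # follow the alias pointer if present
        if main.entity is not None:
            found = main.entity
    return found


root = Stem()
main = get_stem(root, ["s:http", "h:fr", "h:sciences-po", "p:recherche"])
main.entity = "sciences-po/recherche"
alias = get_stem(root, ["s:http", "h:fr", "h:sciences-po", "h:recherche"])
alias.alias_of = main
```

Looking up a page under either stem returns the same entity, because the alias stem only stores a pointer to the main stem and carries no entity of its own.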
Users will be able to create and edit web entities through the interface. This is the most common way to create web entities.
But it can be useful to create web entities automatically to reduce user intervention. For example, we may want every blog of a common platform like Blogger to be set as a separate entity, such as http://economicsofcontempt.blogspot.com/. To cover those use cases, there are 2 different methods to create web entities automatically.
This technique sets a default behaviour to generate web entities out of LRUs. In general, web mining tools set as their default rule that a web entity is created at the domain level. This would be translated as
(s:.*?\|(h:.*?\|)*?(h:.*?$)?)\|?(p.*)?$
note : problems in this regexp with trailing |
This default rule is unique for a corpus and must always match an LRU.
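To make the default rule concrete, here is a hedged Python sketch. It assumes an LRU is the scheme, then the host labels from the top-level domain down, then the path segments, joined by `|` without a trailing pipe, which is the layout suggested by the examples on this page. `url_to_lru`, `DEFAULT_RULE` and `default_entity_prefix` are hypothetical names, and the regular expression is an illustrative variant, not the regexp quoted above.

```python
import re
from urllib.parse import urlsplit


def url_to_lru(url):
    """Build an LRU: scheme, host labels from the TLD down, then path segments."""
    parts = urlsplit(url)
    tokens = ["s:" + parts.scheme]
    tokens += ["h:" + label for label in reversed(parts.hostname.split("."))]
    tokens += ["p:" + segment for segment in parts.path.split("/") if segment]
    return "|".join(tokens)


# Illustrative domain-level rule (an assumption): group 1 keeps the scheme
# token and every host token, and stops before the first path token.
DEFAULT_RULE = re.compile(r"^(s:[^|]*(?:\|h:[^|]*)+)")


def default_entity_prefix(lru):
    """Return the domain-level prefix of an LRU, or None if the rule fails."""
    match = DEFAULT_RULE.match(lru)
    return match.group(1) if match else None
```

The sketch expects absolute URLs with a scheme and a host; anything else would need extra handling before the rule can be said to always match.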
Here we can state that every subdomain of blogspot.com should be set as a web entity. This can be translated as an LRU prefix plus an LRU regex:
lru_prefix : s:http|h:com|h:blogspot
regex : (s:http|h:com|h:blogspot|h:.*?\|).*
note : problems in this regexp with trailing |
We need both prefix and regex for performance reasons. The lru_prefix helps identify which regex to test for a specific LRU.
Another example would be to set a different web entity for every subfolder of a domain. Let's say we want to study the interactions between the internal services of Sciences Po. Given that internal services are mapped to subfolders of the main www.sciences-po.fr, the web entity creation rule would be:
lru_prefix : s:http|h:fr|h:sciences-po
regex : (s:http|h:fr|h:sciences-po|(h:www|)?p:.*?\|).*
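The two rules above could be applied along the following lines. This is a sketch under assumptions: the `RULES` table and `entity_prefix_for` are hypothetical, the pipes are escaped so that they match literally instead of acting as alternation, and the captured group stops before the trailing pipe, which sidesteps the problem noted above without claiming to be the intended fix.

```python
import re

# Creation rules keyed by LRU prefix. Pipes are escaped to match literally.
RULES = {
    "s:http|h:com|h:blogspot": re.compile(
        r"^(s:http\|h:com\|h:blogspot\|h:[^|]*)"
    ),
    "s:http|h:fr|h:sciences-po": re.compile(
        r"^(s:http\|h:fr\|h:sciences-po\|(?:h:www\|)?p:[^|]*)"
    ),
}


def entity_prefix_for(lru):
    """Test only the regexes whose lru_prefix matches; return group 1 or None."""
    for prefix, regex in RULES.items():
        # Cheap string check first, so the regex only runs on likely matches.
        if lru == prefix or lru.startswith(prefix + "|"):
            match = regex.match(lru)
            if match:
                return match.group(1)
    return None
```

The prefix check here is a linear scan for brevity. A real implementation would more likely find the prefix while walking the URL tree, which is where the performance gain would come from.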
Since a rule is defined as an LRU regular expression, the web entity to be created is given by group 1 of the match result. But when multiple rules apply, which one should we choose?
We should choose the longest in terms of number of tokens in the matching LRU prefix returned by the regexp, i.e. the most specific rule.
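A possible reading of this selection step, sketched in Python. The function name `most_specific_prefix` and the two example rules are assumptions for the illustration, and ties are resolved in favour of the first rule listed, which the text above does not specify.

```python
import re


def most_specific_prefix(lru, rules):
    """Apply every rule and keep the group 1 match with the most tokens."""
    best = None
    for regex in rules:
        match = regex.match(lru)
        if match:
            candidate = match.group(1)
            # Tokens are separated by '|', so more pipes means more tokens.
            if best is None or candidate.count("|") > best.count("|"):
                best = candidate
    return best


example_rules = [
    # Domain level: scheme token plus every host token.
    re.compile(r"^(s:[^|]*(?:\|h:[^|]*)+)"),
    # First subfolder of sciences-po.fr, with an optional www host token.
    re.compile(r"^(s:http\|h:fr\|h:sciences-po\|(?:h:www\|)?p:[^|]*)"),
]
```

For a Sciences Po page both rules match, and the subfolder rule wins because its prefix holds one more token than the domain-level prefix.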
If no creation rule applies, do we need a universal web entity (s:http)?