This project crawls the Common Crawl corpus in order to build a graph of French open-data websites.
The final results of this project are viewable at http://french-opendata.data-publica.com/index.html .
The final results, with a graph representation of the French open-data subgraph of the web, are also available in the graph directory of this project.
Our project aims to build a subgraph of the web consisting of the French websites that mention open data. This graph lets viewers see popular websites and the connections between them, and which kinds of entities communicate with one another (Companies, Non-profits/Blogs, Government agencies). It is a good way to discover the actors of French open data and how they relate to one another.
We crawled the whole Common Crawl corpus. For each website, we computed two scores: an open-data score and a "French" score. If both are high enough, the website is kept for the graph, together with all its outgoing links (which we use to build the edges of the graph). Once the crawl is over and the websites are selected, two files are generated: one with the nodes and one with the edges between them.
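As a rough illustration of that filtering step, here is a minimal, self-contained sketch assuming simple keyword-frequency heuristics; the class name, thresholds, keyword lists and output file names are hypothetical and do not come from the code in the commonCrawl directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch only: class name, thresholds, keyword lists and output
// file names are hypothetical, not taken from the project's actual code.
public class OpenDataFilter {

    // Assumed cut-offs; the real thresholds would be tuned on the corpus.
    static final double OPEN_DATA_THRESHOLD = 0.5;
    static final double FRENCH_THRESHOLD = 1.0;

    /** Crude open-data score: occurrences of open-data keywords per 1000 words. */
    static double openDataScore(String text) {
        String[] keywords = {"open data", "opendata", "données ouvertes"};
        return keywordFrequency(text, keywords);
    }

    /** Crude "French" score: occurrences of common French function words per 1000 words. */
    static double frenchScore(String text) {
        String[] frenchWords = {" le ", " la ", " les ", " des ", " une ", " dans "};
        return keywordFrequency(text, frenchWords);
    }

    /** Counts keyword occurrences and normalizes them per 1000 words of text. */
    private static double keywordFrequency(String text, String[] keywords) {
        String lower = " " + text.toLowerCase() + " ";
        int words = lower.trim().isEmpty() ? 0 : lower.trim().split("\\s+").length;
        if (words == 0) {
            return 0.0;
        }
        long hits = 0;
        for (String kw : keywords) {
            int from = 0;
            while ((from = lower.indexOf(kw, from)) != -1) {
                hits++;
                from += kw.length();
            }
        }
        return 1000.0 * hits / words;
    }

    /** Keeps a site only if both scores pass, appending its node line and one edge line per outgoing link. */
    static void process(String domain, String pageText, List<String> outgoingDomains) throws IOException {
        double openData = openDataScore(pageText);
        double french = frenchScore(pageText);
        if (openData < OPEN_DATA_THRESHOLD || french < FRENCH_THRESHOLD) {
            return; // not recognized as a French open-data site: discard it
        }
        // One line per kept node: the domain plus its two scores.
        Files.write(Paths.get("nodes.tsv"),
                (domain + "\t" + openData + "\t" + french + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // One line per outgoing link: these become the edges of the graph.
        StringBuilder edges = new StringBuilder();
        for (String target : outgoingDomains) {
            edges.append(domain).append('\t').append(target).append('\n');
        }
        Files.write(Paths.get("edges.tsv"), edges.toString().getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

In the actual pipeline, the scores are presumably accumulated per website across all its crawled pages rather than computed on a single text as above.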
Important: to build the Java project (in the commonCrawl directory), you need to complete a configuration file with your Amazon credentials, located at commonCrawl/src/main/resources/aws.properties.
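A configuration file of this kind typically holds an AWS access key and secret key; the property names below are only a hypothetical illustration, so use whatever keys the aws.properties file in the repository actually expects.

```properties
# Hypothetical example: the real aws.properties defines the exact keys the
# build expects; these names are placeholders.
accessKey=YOUR_AWS_ACCESS_KEY_ID
secretKey=YOUR_AWS_SECRET_ACCESS_KEY
```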
The nodes are then manually grouped into categories: a type (Companies, Non-profits/Blogs, Government agencies) and a role (Open-Data Speaker, Open-Data Dealer).
The resulting dataset is then loaded into Gephi, to be spatialized and visualized.
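For reference, Gephi can import plain CSV node and edge tables through its spreadsheet importer; a minimal layout compatible with that importer (not necessarily the exact columns this project produces) looks like the following, with the manual category carried in an extra node column.

```
# nodes.csv -- Id identifies each node, Type carries the manual category
Id,Label,Type
data.gouv.fr,data.gouv.fr,Government agency
regardscitoyens.org,regardscitoyens.org,Non-profit/Blog

# edges.csv -- Source and Target reference node Ids
Source,Target
data.gouv.fr,regardscitoyens.org
```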
If you want to see some pictures of the draft graph, look in the preview directory. You'll find an overview, an image of the core, and images of two mini-clusters separated from the main graph. This preview will soon be completed with a dynamic general view of the graph, with categorized nodes, and a complete analysis of the results. These pictures are now obsolete compared to the ones in the graph directory.