This project crawls the Common Crawl corpus in order to build a graph of French open-data websites.
The final results of this project are viewable at http://french-opendata.data-publica.com/index.html .
The final results, with a graph representation of the French open-data subgraph of the web, are also available in the graph directory of this project.
Our project aims to build a subgraph of the web consisting of the French websites that mention open data. This graph lets viewers see popular websites and the connections between them, and which kinds of entities communicate with one another (Companies, Non-profits/Blogs, Government agencies). It is a good way to discover the actors of French open data and how they relate to one another.
We crawled the whole Common Crawl corpus. For each website, we computed two scores: an open-data score and a "French" score. If both are high enough, the website is kept for the graph, together with all its outgoing links (which we use to build the edges of the graph). Once the crawl is over and the websites are selected, two files are generated: one with the nodes and one with the edges between them.
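As a rough illustration of that filtering step, here is a minimal, self-contained sketch assuming simple keyword-frequency heuristics; the class name, thresholds, keyword lists and output file names are hypothetical and do not come from the code in the commonCrawl directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch only: class name, thresholds, keyword lists and output
// file names are hypothetical, not taken from the project's actual code.
public class OpenDataFilter {

    // Assumed cut-offs; the real thresholds would be tuned on the corpus.
    static final double OPEN_DATA_THRESHOLD = 0.5;
    static final double FRENCH_THRESHOLD = 1.0;

    /** Crude open-data score: occurrences of open-data keywords per 1000 words. */
    static double openDataScore(String text) {
        String[] keywords = {"open data", "opendata", "données ouvertes"};
        return keywordFrequency(text, keywords);
    }

    /** Crude "French" score: occurrences of common French function words per 1000 words. */
    static double frenchScore(String text) {
        String[] frenchWords = {" le ", " la ", " les ", " des ", " une ", " dans "};
        return keywordFrequency(text, frenchWords);
    }

    /** Counts keyword occurrences and normalizes them per 1000 words of text. */
    private static double keywordFrequency(String text, String[] keywords) {
        String lower = " " + text.toLowerCase() + " ";
        int words = lower.trim().isEmpty() ? 0 : lower.trim().split("\\s+").length;
        if (words == 0) {
            return 0.0;
        }
        long hits = 0;
        for (String kw : keywords) {
            int from = 0;
            while ((from = lower.indexOf(kw, from)) != -1) {
                hits++;
                from += kw.length();
            }
        }
        return 1000.0 * hits / words;
    }

    /** Keeps a site only if both scores pass, appending its node line and one edge line per outgoing link. */
    static void process(String domain, String pageText, List<String> outgoingDomains) throws IOException {
        double openData = openDataScore(pageText);
        double french = frenchScore(pageText);
        if (openData < OPEN_DATA_THRESHOLD || french < FRENCH_THRESHOLD) {
            return; // not recognized as a French open-data site: discard it
        }
        // One line per kept node: the domain plus its two scores.
        Files.write(Paths.get("nodes.tsv"),
                (domain + "\t" + openData + "\t" + french + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // One line per outgoing link: these become the edges of the graph.
        StringBuilder edges = new StringBuilder();
        for (String target : outgoingDomains) {
            edges.append(domain).append('\t').append(target).append('\n');
        }
        Files.write(Paths.get("edges.tsv"), edges.toString().getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

In the actual pipeline, the scores are presumably accumulated per website across all its crawled pages rather than computed on a single text as above.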
Important: to build the Java project (in the commonCrawl directory), you need to complete a configuration file with your Amazon credentials, located at commonCrawl/src/main/resources/aws.properties.
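A configuration file of this kind typically holds an AWS access key and secret key; the property names below are only a hypothetical illustration, so use whatever keys the aws.properties file in the repository actually expects.

```properties
# Hypothetical example: the real aws.properties defines the exact keys the
# build expects; these names are placeholders.
accessKey=YOUR_AWS_ACCESS_KEY_ID
secretKey=YOUR_AWS_SECRET_ACCESS_KEY
```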
The nodes are then manually grouped into categories: a type (Companies, Non-profits/Blogs, Government agencies) and a role (Open-Data Speaker, Open-Data Dealer).
The resulting dataset is then loaded into Gephi, to be spatialized and visualized.
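For reference, Gephi can import plain CSV node and edge tables through its spreadsheet importer; a minimal layout compatible with that importer (not necessarily the exact columns this project produces) looks like the following, with the manual category carried in an extra node column.

```
# nodes.csv -- Id identifies each node, Type carries the manual category
Id,Label,Type
data.gouv.fr,data.gouv.fr,Government agency
regardscitoyens.org,regardscitoyens.org,Non-profit/Blog

# edges.csv -- Source and Target reference node Ids
Source,Target
data.gouv.fr,regardscitoyens.org
```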
If you want to see some pictures of the draft graph, look in the preview directory. You'll find an overview, an image of the core, and images of two mini-clusters separated from the main graph. This preview will soon be completed with a dynamic general view of the graph, with categorized nodes, and a complete analysis of the results. These pictures are now obsolete compared to the ones in the graph directory.