Information-Retrieval

Keywords

Elasticsearch, MongoDB, Tornado Server, RESTful API, Python, Information Retrieval, Machine Learning, Web Crawler

Screenshots

Search web page
Elasticsearch result
Search Interface
Search Results

Introduction

Homework for my "Information Retrieval" course, written in Python 3.

Instructor: Virgil Pavlu
University: Northeastern University
Course: CS6200

Elasticsearch Index

Indexed more than 80,000 documents into Elasticsearch.
Optimized indexing time to around 15 minutes.
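
The notes above do not say how the indexing time was reduced. One common approach is to send documents in batches through the Elasticsearch bulk API instead of one request per document. The sketch below only builds the newline-delimited JSON bodies with the standard library; the index name `ir_homework`, the batch size, and the sample documents are hypothetical, and sending the bodies to a cluster is left out.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_bodies(docs, index_name, batch_size=1000):
    """Yield one newline-delimited JSON body per batch of (doc_id, source) pairs."""
    for chunk in batched(docs, batch_size):
        lines = []
        for doc_id, source in chunk:
            # Action line first, then the document itself on the next line.
            lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
            lines.append(json.dumps(source))
        # The bulk format requires the body to end with a newline.
        yield "\n".join(lines) + "\n"


# Hypothetical documents; the real collection has more than 80,000.
docs = [(f"doc-{i}", {"text": f"sample text {i}"}) for i in range(5)]
bodies = list(bulk_bodies(docs, index_name="ir_homework", batch_size=2))
```

Each body in `bodies` could then be posted to the `_bulk` endpoint, so five documents cost three requests here instead of five.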

Documents Index

Built my own "Elasticsearch"-style index from scratch.
Indexed the data along both the document dimension and the term dimension.
Maintaining both kinds of index improves indexing efficiency.
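
A minimal sketch of what indexing along both dimensions could look like, assuming a whitespace tokenizer and in-memory dictionaries (the actual homework code may store things differently). The function name `build_indexes` and the sample documents are made up for illustration.

```python
from collections import Counter, defaultdict


def build_indexes(docs):
    """Return (doc_index, term_index) for a {doc_id: text} mapping."""
    doc_index = {}                  # doc_id -> {"length": int, "terms": Counter}
    term_index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # simplistic tokenizer, for illustration only
        doc_index[doc_id] = {"length": len(tokens), "terms": Counter(tokens)}
        for position, term in enumerate(tokens):
            term_index[term].setdefault(doc_id, []).append(position)
    return doc_index, dict(term_index)


docs = {
    "d1": "ship sinks near port",
    "d2": "port authority reports ship collision",
}
doc_index, term_index = build_indexes(docs)

# Term dimension: which documents contain "ship", and at which positions.
ship_postings = term_index["ship"]
# Doc dimension: document length is available without scanning any postings.
d2_length = doc_index["d2"]["length"]
```

The term dimension answers "which documents contain this term", while the doc dimension answers "how long is this document and what does it contain", which scoring functions need for length normalization.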

Web Crawler

Topic: maritime accident.
Used breadth-first search to visit all pages in the early waves.
Applied a topic module to check the relevance of pages accurately.
Crawled 36,000 pages in total, more than 50% of which are relevant to the topic "maritime accident".
Identified wanted pages by the Content-Type header before downloading them.
Used a network session to retain cookies for fast, low-overhead re-access.
Sorted domains by last access time so that multiple threads can access different domains and speed up crawling.
Normalized href links to reduce the page drop rate.
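
The exact normalization rules used by the crawler are not listed here. The sketch below shows one plausible set, using only `urllib.parse`: resolve relative links, lowercase the scheme and host, drop default ports, collapse duplicate slashes in the path, and strip fragments. The function name `canonicalize` and the chosen rules are assumptions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(href, base_url):
    """Resolve `href` against `base_url` and return a normalized absolute URL."""
    parts = urlsplit(urljoin(base_url, href.strip()))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # An empty path becomes "/"; duplicate slashes are collapsed.
    path = parts.path or "/"
    while "//" in path:
        path = path.replace("//", "/")
    # Drop the fragment: it never changes which page is fetched.
    return urlunsplit((scheme, host, path, parts.query, ""))


relative = canonicalize("../c.html#top", "http://Example.com/a/b.html")
absolute = canonicalize("HTTP://Example.COM:80/news//index.html", "http://x.org/")
```

Mapping different spellings of the same address to one canonical form keeps the frontier from treating them as separate pages.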

Web Graph Computation

Applied PageRank and HITS to score each page within the whole page set.
Treated the in-links and out-links of pages as a directed graph.
Web graph computation follows the idea that “cream rises to the top”:
a good authority page is referenced by more and more pages,
a good hub page points to more and more good authority pages.
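
As a rough illustration of the two algorithms, here is a small pure-Python sketch of PageRank (power iteration) and HITS on a toy link graph. The damping factor 0.85, the fixed iteration count, and the four-page graph are assumptions for the example, not values taken from the homework.

```python
def all_pages(out_links):
    """Collect every page that appears as a source or a target."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    return pages


def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [linked pages]}."""
    pages = all_pages(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def hits(out_links, iterations=50):
    """Return (hub, authority) scores, L2-normalized after each iteration."""
    pages = all_pages(out_links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in out_links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph: "c" is linked from three pages, so it should score highest.
links = {"a": ["c"], "b": ["c", "a"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
hub, auth = hits(links)
```

On this graph both PageRank and the HITS authority score put "c" first, which matches the intuition that heavily referenced pages rise to the top.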

Web Interface Relevance Assessments

Used Tornado as the web server, which can be accessed remotely.
The server communicates with Elasticsearch to search and extract data.
MongoDB stores page info to speed up the web server.
Wrote Python-based HTML templates to generate the search result page automatically and flexibly.
Added login permissions to filter users.
Used application-layer information to pass parameters between pages.
After collecting manual assessments, computed R-precision, Average Precision, nDCG, precision, recall, and F1 for each query to evaluate the search results from the page set.
Drew precision-recall graphs to visualize how the distribution of search results compares with the true relevance of pages.
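
The evaluation measures listed above can be sketched in a few lines of Python. The ranked list, the relevance judgments, and the grades below are invented for the example. The nDCG here discounts by `log2(rank + 1)`; other DCG variants exist, and the course may have used a different one.

```python
import math


def precision_at(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranked, relevant, len(relevant)) if relevant else 0.0


def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, grades):
    """nDCG over the full ranked list; `grades` maps doc_id to graded relevance."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [grades.get(doc, 0) for doc in ranked]
    ideal = sorted(grades.values(), reverse=True)[:len(ranked)]
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0


# Invented example: 5 results, 3 relevant documents (one never retrieved).
ranked = ["d3", "d1", "d7", "d2", "d5"]
grades = {"d1": 2, "d2": 1, "d9": 2}
relevant = set(grades)

p = precision_at(ranked, relevant, 5)
r = recall_at(ranked, relevant, 5)
scores = {
    "precision": p,
    "recall": r,
    "f1": f1(p, r),
    "r_precision": r_precision(ranked, relevant),
    "average_precision": average_precision(ranked, relevant),
    "ndcg": ndcg(ranked, grades),
}
```

Computing `precision_at` and `recall_at` for every cutoff `k` gives the points needed for a precision-recall graph.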

Machine Learning for IR

With a better understanding of Elasticsearch, re-indexed the dataset with a new analyzer that uses the standard tokenizer, a lowercase filter, and the Porter2 stemmer.
Set a nested mapping to store feature details.
Distinguished documents by different Elasticsearch types.
For the labeled dataset, used 80% for training and 20% for testing.
Tried different feature combinations to improve the performance of the machine learning models.
Applied different machine learning models, including Linear Regression, Logistic Regression, SVM, and SVM-rank.
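
Two of the steps above lend themselves to a short sketch: the analyzer definition and the 80/20 split. The settings are written as a plain Python dict in the shape Elasticsearch expects for a custom analyzer; the names `stemmed_english` and `porter2_stemmer`, the random seed, and the sample labels are hypothetical, and the split shown is a simple shuffled split that may differ from the one actually used.

```python
import random

# Custom analyzer: standard tokenizer, then lowercase, then a Porter2 stemmer.
# The analyzer and filter names are placeholders chosen for this example.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "porter2_stemmer": {"type": "stemmer", "language": "porter2"},
            },
            "analyzer": {
                "stemmed_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter2_stemmer"],
                },
            },
        },
    },
}


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Invented labeled data: (doc_id, label) pairs.
labeled = [(f"doc-{i}", i % 2) for i in range(10)]
train, test = train_test_split(labeled)
```

For ranking tasks it is often safer to split by query rather than by document, so that all documents for one query land on the same side of the split.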

Folders and files

1_Elasticsearch_Index
2_Document_Index
3_Web_Crawler
4_Web-Graph-Computation
5_Web_Interface_Relevance_Assessments
6_Machine_Learning_for_IR
7_Unigram_Bigram_Classifier_4_Spam
8_Clustering_&_Topic_Models
README.md
