Fixed up some things. Added results
Raigo Aljand committed Jun 5, 2014
1 parent b4185d2 commit c976a30
Showing 13 changed files with 335 additions and 97 deletions.
133 changes: 94 additions & 39 deletions document.tex
@@ -6,72 +6,127 @@

\usepackage{utf8}

%math
\usepackage{amsmath}

\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}

% misc
\usepackage{hyperref}
\usepackage{hyphenat}
\usepackage{parskip}

\usepackage[nottoc]{tocbibind}
\usepackage{graphicx}

\usepackage{lastpage}
\usepackage{tabulary}

\graphicspath{ {img/} }


\DeclareUnicodeCharacter{00A0}{ }

\newcommand{\pypy}{PyPy}

\newcommand{\bulletsection}[1]{\section*{#1}
\addcontentsline{toc}{section}{#1}}

\begin{document}
\include{tex/title}

\begin{abstract}
The Estonian Wikipedia has many high-quality articles, but they are hard to
find among the huge set of articles. The aim of this thesis is to filter out
the high-quality articles from the low-quality ones using a machine learning
algorithm called logistic regression.

The main problem was separating the high-quality articles from the
low-quality ones. An algorithm was written that uses gradient descent to find
the logistic regression weights from a matrix of numerical data. The main
tasks of this thesis were therefore to find known high-quality and known
low-quality Wikipedia articles, translate them into numerical data, train the
machine learning algorithm in a small enough number of iterations and validate
the accuracy of the algorithm.

Research shows that the Estonian Wikipedia has a category of hand-picked
high-quality articles and a way to obtain a random article, which is labelled
as low-quality. When the algorithm is trained on these articles, its accuracy
is sufficient to filter the high-quality articles out of the whole Estonian
Wikipedia.

% TODO Put the right data in.
The thesis is in English and contains \pageref*{LastPage} pages of text,
[chapters] chapters, [figures] figures, [tables] tables, etc.
\end{abstract}

\listoftables
\clearpage

\listoffigures
\newpage
\clearpage

\tableofcontents
\newpage

\clearpage

\include{tex/introduction}
\include{tex/tools}

\include{tex/analyser}

\include{tex/results}

\bulletsection{Summary}
The aim of this thesis is to provide a way to filter out the high-quality
articles in the Estonian Wikipedia. The approach chosen here is logistic
regression, a machine learning algorithm.

The Estonian Wikipedia has a hand-picked category for high-quality articles.
The machine learning algorithm can be trained with these articles. Logistic
regression assigns a weight to each feature, or dimension, of an article. It
is also blind to data types and sees only scalar values, which means it cannot
identify whether an article uses a certain kind of template or contains
certain words or characters.

An article has 7 features: characters, pages that refer to it, pages that it
refers to, images it refers to, external links it contains, templates it links
to and categories it is in. These features were selected because they were
readily available through the PyWikiBot framework. Because logistic regression
accepts only scalar values, the count of each feature was used. An 8th
constant feature of 1 is prepended so that its weight can counteract the bias
in the data.
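
As an illustration, using standard logistic regression notation (the symbols
below are generic and are not taken from the thesis code), the model assigns a
page the probability
\[
h_w(x) = \frac{1}{1 + e^{-w^{T}x}}, \qquad
x = (1, x_1, x_2, \ldots, x_7)^{T},
\]
where $x_1, \ldots, x_7$ are the seven feature counts, the leading 1 is the
constant 8th feature and $w$ is the vector of eight weights. A page is
predicted to be high-quality when $h_w(x)$ is above a chosen threshold, for
example $0.5$.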

An algorithm was written using gradient descent to find the logistic
regression weights from the matrix of numerical data. Gradient descent is an
iterative algorithm for finding a local minimum (or maximum) of a function; in
this case we want to find the minimum of the error rate. Each iteration takes
the previous result and moves it towards the local minimum. The step size of
each iteration must be chosen carefully, or the algorithm might step over the
local minimum without lowering the error.
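
Written out as a sketch, with generic symbols rather than the variable names
used in the thesis code: if $E(w)$ is the error of the weights $w$ over the
training set and $\alpha$ is the step size, each iteration performs
\[
w_{k+1} = w_k - \alpha \nabla E(w_k),
\]
stopping once the change in $E$ (or in $w$) falls below a chosen tolerance.
For logistic regression with the usual cross-entropy error, the gradient is
$\nabla E(w) = \sum_i \bigl(h_w(x_i) - y_i\bigr) x_i$, where $x_i$ and $y_i$
are the feature vector and true label of the $i$-th training article.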

When the data has a large bias or large deviations, gradient descent must go
through many iterations to compensate for that. By normalising the data
beforehand, all the features get the same small bias and deviation. This is
also done in this thesis; otherwise the iteration count might grow so large
that it becomes hard to tell a forever-looping algorithm from one that
eventually stops.
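
A common way to do this, and a reasonable reading of the normalisation used
here (the exact scheme in the code may differ), is to rescale each feature by
its mean $\mu_j$ and standard deviation $\sigma_j$ computed over the training
set:
\[
x'_j = \frac{x_j - \mu_j}{\sigma_j}.
\]
The same $\mu_j$ and $\sigma_j$ must then be reused when normalising a new
article before prediction.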

Validation is done by setting aside 30\% of the articles gathered for
training and then comparing the labels the algorithm predicts for these
articles with their true labels.
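
In other words, if $m$ articles are held out for validation, the reported
error rate is simply
\[
\text{error rate} = \frac{\#\{\text{misclassified held-out articles}\}}{m}.
\]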

The average error rate of the algorithm detailed in this thesis is 2.22\%.
This is achieved with, on average, 714.8 iterations of gradient descent. This
error rate can be considered good enough for the purposes of the editors of
Vikipeedia.

One possible improvement is to stop using the PyWikiBot framework. Because of
the way the PyWikiBot framework's interface works, the program makes more
queries to the MediaWiki server than required. Each article has a summary page
about the article's metadata, which can be screen-scraped to get most of the
required features with one query. It is also possible to make a single
specific query through the MediaWiki API to get the required metadata.

Another possible improvement is to subjectively separate out bad articles and
use that data to teach the machine learning algorithm. In this thesis,
separating out the high-quality articles was chosen because they had already
been hand-picked by the Vikipeedia editors. The logistic regression algorithm
can also be made to accept multiple classifications instead of 2.

I conclude that it is entirely possible to use machine learning to sift out
the high-quality articles.

\bibliographystyle{plain}
\bibliography{loputoo}

\section*{Lisa 1}
\addcontentsline{toc}{section}{Lisa 1}

\end{document}
20 changes: 10 additions & 10 deletions loputoo.bib
@@ -117,16 +117,6 @@ @Misc{website:normalisation
note = {[Online; accessed \today]}
}

% currently not implemented
@Misc{website:line-search,
author = {Kris Hauser},
title = {B553 Lecture 4: Gradient Descent},
howpublished = {\url{http://homes.soic.indiana.edu/classes/spring2012/csci/b553-hauserk/gradient_descent.pdf}},
month = {January},
year = {2012},
note = {[Online; accessed \today; possible additional solution]},
}

%%%% \section{Used technologies}
%%% \subsection{Python}
@@ -256,3 +246,13 @@ @Misc{website:mediawiki-api
year = {2014},
note = {[Online; accessed \today]}
}



@Misc{wiki:good-article-requirements,
author = {Vikipeedia},
title = {Vikipeedia:Hea artikli nõuded --- Vikipeedia},
year = {2013},
howpublished = {\url{//et.wikipedia.org/w/index.php?title=Vikipeedia:Hea_artikli_n%C3%B5uded&oldid=3590326}},
note = {[Online; accessed \today]}
}
3 changes: 2 additions & 1 deletion tex/analyser.tex
@@ -1,5 +1,6 @@
\section{The implementation}
The purpose of this project is to separate high-quality articles from
low-quality articles.

\subsection{Prerequisites}
\input{tex/analyser/prerequisites}
63 changes: 42 additions & 21 deletions tex/analyser/architecture.tex
@@ -4,7 +4,7 @@
the analyse folder is the code written for this paper. The interface script is
also in the src folder and is named \verb;wiki-analyse.py;.

Most of the analysis code is programmed in a functional style. There are
no classes, only named tuples, and most functions don't change the inner
state of their parameters but return a new value. The only imperative part of
the program
is in the \verb;voidlib.py; file. There each function changes the state of the
@@ -33,39 +33,58 @@ \subsubsection{Machine learning}
PageModel is a simple data structure that contains a wiki Page and the label
assigned to that page.

The \verb;modellib; module contains the \\
\verb;predicted_label(train_result, page); function, which returns the
machine's prediction for the Page's label.


% XXX This could change.
The training data consists of the pages in the category ``Head
artiklid''\footnote{\url{https://et.wikipedia.org/wiki/Kategooria:Head_artiklid}}\footnote{``Good
articles'' in Estonian}, except for the 5 articles that are about the category
itself and are not examples of good articles. This category is hand-built by
the Vikipeedia team. An article must fulfil multiple requirements before it is
considered to be good:\cite{wiki:good-article-requirements}
\begin{description}
\item[Well written] It is clearly worded and has the correct
spelling. It conforms to the style requirements and doesn't use made-up words
or slang.
\item[Factually accurate and verifiable] Each paragraph has a citation to the
sources used, and the sources must be credible. A good article doesn't contain
original research.
\item[Covers the whole subject] The main aspects of the subject must be
covered while not being derailed to other subjects.
\item[Neutral] The subject is presented fairly and without contradiction.
\item[Stable] The article is not often changed because of ongoing arguments
or current events.
\item[Illustrated with pictures if possible] Each picture is marked with
copyright information that is compatible with Vikipeedia policy. The pictures
must be on topic and sufficiently explained. A good article may lack pictures
if it is complicated to obtain one.
\end{description}

PageModels with the pages from the ``Head artiklid'' category are built with
the label \verb;GOOD_PAGE;. Then the program asks for the same number of
random pages, whose PageModels are initialised with the label
\verb;AVERAGE_PAGE;. Most likely the \verb;AVERAGE_PAGE; set will contain some
very high-quality and very low-quality articles besides average ones, but this
averages out. The two sets are then added together and shuffled. Then 70\% of
the combined set is used for training and the rest is used to test the
precision of the bot. Sometimes the set the bot is trained with may be biased
and the user might want to train the bot again.
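
A minimal sketch of this preparation step is shown below. The names
\verb;PageModel;, \verb;GOOD_PAGE; and \verb;AVERAGE_PAGE; come from the code
described above, but the label values used here and the arguments
\verb;good_pages; and \verb;random_pages; are hypothetical stand-ins for the
page lists obtained through PyWikiBot.
\begin{verbatim}
import random
from collections import namedtuple

PageModel = namedtuple('PageModel', ['page', 'label'])
GOOD_PAGE, AVERAGE_PAGE = 1, 0

def build_sets(good_pages, random_pages, train_share=0.7):
    # Label the hand-picked good articles and an equal number
    # of random articles, then shuffle and split 70/30.
    models = ([PageModel(p, GOOD_PAGE) for p in good_pages] +
              [PageModel(p, AVERAGE_PAGE)
               for p in random_pages[:len(good_pages)]])
    random.shuffle(models)
    cut = int(len(models) * train_share)
    return models[:cut], models[cut:]
\end{verbatim}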

Each page has 7 features:
\begin{enumerate}
\item Length of the text of the Page.
\item Number of Pages that refer to this Page.
\item Number of pages this Page links to.
\item Number of images this Page links to.
\item Number of external links this Page contains.
\item Number of templates this Page links to.
\item Number of categories this Page is in.
\end{enumerate}
Common reasoning says that the higher these features are, the better a Page
would be. However, we don't know how one feature weighs against another
feature. That's what the machine learning algorithm figures out. These 7
features, with a prefix of the value 1, make up the vector x.
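
As a sketch, assuming the seven counts have already been obtained for a page
(the function and argument names below are hypothetical; numpy is a dependency
the project already has):
\begin{verbatim}
import numpy as np

def feature_vector(text_length, backlinks, links, images,
                   extlinks, templates, categories):
    # Prefix the constant 1 so that the first weight can act
    # as the bias term of the logistic regression model.
    return np.array([1, text_length, backlinks, links, images,
                     extlinks, templates, categories], dtype=float)
\end{verbatim}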

To keep the number of iterations small, the values of x are normalised before
they are used for training or prediction. It helps keep all the values of x
around the
@@ -120,4 +139,6 @@ \subsubsection{Machine learning}

\subsubsection{Searching}

The bot finds the good pages using a brute-force mechanism. It requests all
the pages from Wikipedia and then tries to predict whether each page is good
or average. It skips all pages that raise exceptions; mostly these are
redirects.
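
A minimal sketch of this search loop, assuming \verb;all_pages; is an iterable
of pages and \verb;train_result; holds the trained weights
(\verb;predicted_label; and \verb;GOOD_PAGE; are the names used in the code
described above):
\begin{verbatim}
def find_good_pages(all_pages, train_result):
    # predicted_label comes from the modellib module described above;
    # GOOD_PAGE is the label constant used when building PageModels.
    good = []
    for page in all_pages:
        try:
            if predicted_label(train_result, page) == GOOD_PAGE:
                good.append(page)
        except Exception:
            # Mostly redirects; skip any page that raises.
            continue
    return good
\end{verbatim}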
7 changes: 5 additions & 2 deletions tex/analyser/interface.tex
@@ -7,7 +7,10 @@
learning algorithm. If the argument is \verb;find;, the command will only search
for good pages using the result of the last retraining.

% TODO: make it so that I simply print the URLs of all the pages that I find
% to be good to the terminal, one by one, and there is no need to save a file
% in pickle format. Although, let's discuss this further.

After the bot has found all of the good pages, it will then write the list of
good pages in a file named \verb;good_pages.pkl; using the Python library
\verb;pickle; and print it out on the terminal.
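
A minimal sketch of this output step, assuming \verb;good_pages; is the list
collected by the search (\verb;good_pages.pkl; is the file name mentioned
above, the function name is hypothetical):
\begin{verbatim}
import pickle

def save_and_print(good_pages, path='good_pages.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(good_pages, f)  # write the list in pickle format
    for page in good_pages:
        print(page)                 # also print each good page
\end{verbatim}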
24 changes: 15 additions & 9 deletions tex/analyser/prerequisites.tex
@@ -1,23 +1,26 @@
\emph{The author used Arch Linux as of \today\ to make this program and hasn't
tested installing and running it on other systems. Proceed at your own risk.}

This project requires the Python 2 interpreter on the system. As of \today\
PyWikiBot does not support Python 3 and therefore this paper's code was also not
written in Python 3. The Python library numpy is also necessary.
One can install them with the terminal command \\
\verb;sudo pacman -S --needed python2 python2-numpy;.

A user-config.py configuration file must also exist. To generate it, there is a
script in the pywikibot folder. It's named \verb;generate_user_files.py; and
it must be run with the Python interpreter. This is a sample installation
process:
\begin{verbatim}
raigo@archofraigo ~/git/wiki-analyse-bot/core (git)-[master] %
python2 generate_user_files.py :(
Your default user directory is "/home/raigo/.pywikibot"
How to proceed? ([K]eep [c]hange)
Do you want to copy user files from an existing pywikipedia
installation? n
Create user-config.py file? Required for running bots ([y]es,
[N]o) y
1: anarchopedia
2: battlestarwiki
[...]
@@ -27,16 +30,19 @@
[...]
33: wiktionary
34: wowwiki
Select family of sites we are working on, just enter the number
not name (default: wikipedia):
This is the list of known language(s):
ab ace [...] es et eu [...] zh-yue zu
The language code of the site we're working on (default: 'en'):
et
Username (et wikipedia): AnalyseBot
Which variant of user_config.py:
[S]mall or [E]xtended (with further information)? S
Do you want to add any other projects? (y/N)
'/home/raigo/.pywikibot/user-config.py' written.
Create user-fixes.py file? Optional and for advanced users
([y]es, [N]o)
\end{verbatim}
Questions shown without an answer use the default answer, selected by pressing
Enter.
