Fixed up some things. Added results
Raigo Aljand committed Jun 5, 2014
1 parent b4185d2 commit c976a30
Showing 13 changed files with 335 additions and 97 deletions.
133 changes: 94 additions & 39 deletions document.tex
@@ -6,72 +6,127 @@

\usepackage{utf8}

%math
\usepackage{amsmath}

\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}

% misc
\usepackage{hyperref}
\usepackage{hyphenat}
\usepackage{parskip}

\usepackage[nottoc]{tocbibind}
\usepackage{graphicx}

\usepackage{lastpage}
\usepackage{tabulary}

\graphicspath{ {img/} }


\DeclareUnicodeCharacter{00A0}{ }

\newcommand{\pypy}{PyPy}

\newcommand{\bulletsection}[1]{\section*{#1}
\addcontentsline{toc}{section}{#1}}

\begin{document}
\include{tex/title}

\begin{abstract}
The Estonian Wikipedia has many high-quality articles, but they are hard to
find among the huge set of articles. The aim of this thesis is to filter out
the high-quality articles from the low-quality ones using a machine learning
algorithm called logistic regression.

The main problem was separating the high-quality articles from the
low-quality ones. An algorithm was written that uses gradient descent to find
the logistic regression weights from a matrix of numerical data. The main
tasks of this thesis were therefore to find known high-quality and known
low-quality Wikipedia articles, translate them into numerical data, train the
machine learning algorithm in a small enough number of iterations and validate
the accuracy of the algorithm.

Research shows that the Estonian Wikipedia has a category of hand-picked
high-quality articles and a way to obtain a random article, which is labelled
as low-quality. When the algorithm is trained on these articles, its accuracy
is sufficient to filter the high-quality articles out of the whole Estonian
Wikipedia.

% TODO Put the right data in.
The thesis is in English and contains \pageref*{LastPage} pages of text,
[chapters] chapters, [figures] figures, [tables] tables, etc.
\end{abstract}

\listoftables
\clearpage

\listoffigures
\newpage
\clearpage

\tableofcontents
\newpage

\clearpage

\include{tex/introduction}
\include{tex/tools}

\include{tex/analyser}

\include{tex/results}

\bulletsection{Summary}
The aim of this thesis is to provide a way to filter out the high-quality
articles in the Estonian Wikipedia. The approach chosen here is logistic
regression, a machine learning algorithm.

The Estonian Wikipedia has a hand-picked category for high-quality articles.
The machine learning algorithm can be trained with these articles. Logistic
regression assigns a weight to each feature, or dimension, of an article. It
is also blind to data types and sees only scalar values, which means it cannot
identify whether an article uses a certain kind of template or contains
certain words or characters.

An article has 7 features: characters, pages that refer to it, pages that it
refers to, images it refers to, external links it contains, templates it links
to and categories it is in. These features were selected because they were
readily available through the PyWikiBot framework. Because logistic regression
accepts only scalar values, the count of each feature was used. An 8th
constant feature of 1 is prepended so that its weight can counteract the bias
in the data.
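
As an illustration, using standard logistic regression notation (the symbols
below are generic and are not taken from the thesis code), the model assigns a
page the probability
\[
h_w(x) = \frac{1}{1 + e^{-w^{T}x}}, \qquad
x = (1, x_1, x_2, \ldots, x_7)^{T},
\]
where $x_1, \ldots, x_7$ are the seven feature counts, the leading 1 is the
constant 8th feature and $w$ is the vector of eight weights. A page is
predicted to be high-quality when $h_w(x)$ is above a chosen threshold, for
example $0.5$.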

An algorithm was written using gradient descent to find the logistic
regression weights from the matrix of numerical data. Gradient descent is an
iterative algorithm for finding a local minimum (or maximum) of a function; in
this case we want to find the minimum of the error rate. Each iteration takes
the previous result and moves it towards the local minimum. The step size of
each iteration must be chosen carefully, or the algorithm might step over the
local minimum without lowering the error.
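
Written out as a sketch, with generic symbols rather than the variable names
used in the thesis code: if $E(w)$ is the error of the weights $w$ over the
training set and $\alpha$ is the step size, each iteration performs
\[
w_{k+1} = w_k - \alpha \nabla E(w_k),
\]
stopping once the change in $E$ (or in $w$) falls below a chosen tolerance.
For logistic regression with the usual cross-entropy error, the gradient is
$\nabla E(w) = \sum_i \bigl(h_w(x_i) - y_i\bigr) x_i$, where $x_i$ and $y_i$
are the feature vector and true label of the $i$-th training article.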

When the data has a large bias or large deviations, gradient descent must go
through many iterations to compensate for that. By normalising the data
beforehand, all the features get the same small bias and deviation. This is
also done in this thesis; otherwise the iteration count might grow so large
that it becomes hard to tell a forever-looping algorithm from one that
eventually stops.
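
A common way to do this, and a reasonable reading of the normalisation used
here (the exact scheme in the code may differ), is to rescale each feature by
its mean $\mu_j$ and standard deviation $\sigma_j$ computed over the training
set:
\[
x'_j = \frac{x_j - \mu_j}{\sigma_j}.
\]
The same $\mu_j$ and $\sigma_j$ must then be reused when normalising a new
article before prediction.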

Validation is done by setting aside 30\% of the articles gathered for
training and then comparing the labels the algorithm predicts for these
articles with their true labels.
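
In other words, if $m$ articles are held out for validation, the reported
error rate is simply
\[
\text{error rate} = \frac{\#\{\text{misclassified held-out articles}\}}{m}.
\]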

The average error rate of the algorithm detailed in this thesis is 2.22\%.
This is achieved with, on average, 714.8 iterations of gradient descent. This
error rate can be considered good enough for the purposes of the editors of
Vikipeedia.

One possible improvement is to stop using the PyWikiBot framework. Because of
the way the PyWikiBot framework's interface works, the program makes more
queries to the MediaWiki server than required. Each article has a summary page
about the article's metadata, which can be screen-scraped to get most of the
required features with one query. It is also possible to make a single
specific query through the MediaWiki API to get the required metadata.

Another possible improvement is to subjectively separate out bad articles and
use that data to teach the machine learning algorithm. In this thesis,
separating out the high-quality articles was chosen because they had already
been hand-picked by the Vikipeedia editors. The logistic regression algorithm
can also be made to accept multiple classifications instead of 2.

I conclude that it is entirely possible to use machine learning to sift out
the high-quality articles.

\bibliographystyle{plain}
\bibliography{loputoo}

\section*{Lisa 1}
\addcontentsline{toc}{section}{Lisa 1}

\end{document}
20 changes: 10 additions & 10 deletions loputoo.bib
@@ -117,16 +117,6 @@ @Misc{website:normalisation
note = {[Online; accessed \today]}
}

% currently not implemented
@Misc{website:line-search,
author = {Kris Hauser},
title = {B553 Lecture 4: Gradient Descent},
howpublished = {\url{http://homes.soic.indiana.edu/classes/spring2012/csci/b553-hauserk/gradient_descent.pdf}},
month = {January},
year = {2012},
note = {[Online; accessed \today; possible additional solution]},
}

%%%% \section{Used technologies}
%%% \subsection{Python}
@@ -256,3 +246,13 @@ @Misc{website:mediawiki-api
year = {2014},
note = {[Online; accessed \today]}
}



@Misc{wiki:good-article-requirements,
author = {Vikipeedia},
title = {Vikipeedia:Hea artikli nõuded --- Vikipeedia},
year = {2013},
howpublished = {\url{//et.wikipedia.org/w/index.php?title=Vikipeedia:Hea_artikli_n%C3%B5uded&oldid=3590326}},
note = {[Online; accessed \today]}
}
3 changes: 2 additions & 1 deletion tex/analyser.tex
@@ -1,5 +1,6 @@
\section{The implementation}
The purpose of this project is to separate high-quality articles from
low-quality articles.

\subsection{Prerequisites}
\input{tex/analyser/prerequisites}
63 changes: 42 additions & 21 deletions tex/analyser/architecture.tex
@@ -4,7 +4,7 @@
the analyse folder is the code written for this paper. The interface script is
also in the src folder and is named \verb;wiki-analyse.py;.

Most of the analysis code is programmed in a functional style. There are
no classes, only named tuples, and most functions don't change the inner
state of their parameters but return a new value. The only imperative part of
the program
is in the \verb;voidlib.py; file. There each function changes the state of the
@@ -33,39 +33,58 @@ \subsubsection{Machine learning}
PageModel is a simple data structure that contains a wiki Page and the label
assigned to that page.

The \verb;modellib; module contains the \\
\verb;predicted_label(train_result, page); function, which returns the
machine's prediction for the Page's label.


% XXX This could change.
The training data consists of the pages in the category ``Head
artiklid''\footnote{\url{https://et.wikipedia.org/wiki/Kategooria:Head_artiklid}}\footnote{``Good
articles'' in Estonian}, except for the 5 articles that are about the category
itself and are not examples of good articles. This category is hand-built by
the Vikipeedia team. An article must fulfil multiple requirements before it is
considered to be good:\cite{wiki:good-article-requirements}
\begin{description}
\item[Well written] It is clearly worded and has the correct
spelling. It conforms to the style requirements and doesn't use made-up words
or slang.
\item[Factually accurate and verifiable] Each paragraph has a citation to the
sources used, and the sources must be credible. A good article doesn't contain
original research.
\item[Covers the whole subject] The main aspects of the subject must be
covered while not being derailed to other subjects.
\item[Neutral] The subject is presented fairly and without contradiction.
\item[Stable] The article is not often changed because of ongoing arguments
or current events.
\item[Illustrated with pictures if possible] Each picture is marked with
copyright information that is compatible with Vikipeedia policy. The pictures
must be on topic and sufficiently explained. A good article may lack pictures
if it is complicated to obtain one.
\end{description}

PageModels with the pages from the ``Head artiklid'' category are built with
the label \verb;GOOD_PAGE;. Then the program asks for the same number of
random pages, whose PageModels are initialised with the label
\verb;AVERAGE_PAGE;. Most likely the \verb;AVERAGE_PAGE; set will contain some
very high-quality and very low-quality articles besides average ones, but this
averages out. The two sets are then added together and shuffled. Then 70\% of
the combined set is used for training and the rest is used to test the
precision of the bot. Sometimes the set the bot is trained with may be biased
and the user might want to train the bot again.
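
A minimal sketch of this preparation step is shown below. The names
\verb;PageModel;, \verb;GOOD_PAGE; and \verb;AVERAGE_PAGE; come from the code
described above, but the label values used here and the arguments
\verb;good_pages; and \verb;random_pages; are hypothetical stand-ins for the
page lists obtained through PyWikiBot.
\begin{verbatim}
import random
from collections import namedtuple

PageModel = namedtuple('PageModel', ['page', 'label'])
GOOD_PAGE, AVERAGE_PAGE = 1, 0

def build_sets(good_pages, random_pages, train_share=0.7):
    # Label the hand-picked good articles and an equal number
    # of random articles, then shuffle and split 70/30.
    models = ([PageModel(p, GOOD_PAGE) for p in good_pages] +
              [PageModel(p, AVERAGE_PAGE)
               for p in random_pages[:len(good_pages)]])
    random.shuffle(models)
    cut = int(len(models) * train_share)
    return models[:cut], models[cut:]
\end{verbatim}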

Each page has 7 features:
\begin{enumerate}
\item Length of the text of the Page.
\item Number of Pages that refer to this Page.
\item Number of pages this Page links to.
\item Number of images this Page links to.
\item Number of external links this Page contains.
\item Number of templates this Page links to.
\item Number of categories this Page is in.
\end{enumerate}
Common reasoning says that the higher these features are, the better a Page
would be. However, we don't know how one feature weighs against another
feature. That's what the machine learning algorithm figures out. These 7
features, with a prefix of the value 1, make up the vector x.
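
As a sketch, assuming the seven counts have already been obtained for a page
(the function and argument names below are hypothetical; numpy is a dependency
the project already has):
\begin{verbatim}
import numpy as np

def feature_vector(text_length, backlinks, links, images,
                   extlinks, templates, categories):
    # Prefix the constant 1 so that the first weight can act
    # as the bias term of the logistic regression model.
    return np.array([1, text_length, backlinks, links, images,
                     extlinks, templates, categories], dtype=float)
\end{verbatim}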

To keep the number of iterations small, the values of x are normalised before
they are used for training or prediction. It helps keep all the values of x
around the
@@ -120,4 +139,6 @@ \subsubsection{Machine learning}

\subsubsection{Searching}

The bot finds the good pages using a brute-force mechanism. It requests all
the pages from Wikipedia and then tries to predict whether each page is good
or average. It skips all pages that raise exceptions; mostly these are
redirects.
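
A minimal sketch of this search loop, assuming \verb;all_pages; is an iterable
of pages and \verb;train_result; holds the trained weights
(\verb;predicted_label; and \verb;GOOD_PAGE; are the names used in the code
described above):
\begin{verbatim}
def find_good_pages(all_pages, train_result):
    # predicted_label comes from the modellib module described above;
    # GOOD_PAGE is the label constant used when building PageModels.
    good = []
    for page in all_pages:
        try:
            if predicted_label(train_result, page) == GOOD_PAGE:
                good.append(page)
        except Exception:
            # Mostly redirects; skip any page that raises.
            continue
    return good
\end{verbatim}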
7 changes: 5 additions & 2 deletions tex/analyser/interface.tex
@@ -7,7 +7,10 @@
learning algorithm. If the argument is \verb;find;, the command will only search
for good pages using the result of the last retraining.

% TODO: make it so that I simply print the URLs of all the pages that I find
% to be good to the terminal, one by one, and there is no need to save a file
% in pickle format. Although, let's discuss this further.

After the bot has found all of the good pages, it will then write the list of
good pages in a file named \verb;good_pages.pkl; using the Python library
\verb;pickle; and print it out on the terminal.
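
A minimal sketch of this output step, assuming \verb;good_pages; is the list
collected by the search (\verb;good_pages.pkl; is the file name mentioned
above, the function name is hypothetical):
\begin{verbatim}
import pickle

def save_and_print(good_pages, path='good_pages.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(good_pages, f)  # write the list in pickle format
    for page in good_pages:
        print(page)                 # also print each good page
\end{verbatim}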
24 changes: 15 additions & 9 deletions tex/analyser/prerequisites.tex
@@ -1,23 +1,26 @@
\emph{The author used Arch Linux as of \today\ to make this program and hasn't
tested installing and running it on other systems. Proceed at your own risk.}

This project requires the Python 2 interpreter on the system. As of \today\
PyWikiBot does not support Python 3 and therefore this paper's code was also not
written in Python 3. The Python library numpy is also necessary.
One can install them with the terminal command \\
\verb;sudo pacman -S --needed python2 python2-numpy;.

A user-config.py configuration file must also exist. To generate it, there is a
script in the pywikibot folder. It's named \verb;generate_user_files.py; and
it must be run with the Python interpreter. This is a sample installation
process:
\begin{verbatim}
raigo@archofraigo ~/git/wiki-analyse-bot/core (git)-[master] %
python2 generate_user_files.py :(
Your default user directory is "/home/raigo/.pywikibot"
How to proceed? ([K]eep [c]hange)
Do you want to copy user files from an existing pywikipedia
installation? n
Create user-config.py file? Required for running bots ([y]es,
[N]o) y
1: anarchopedia
2: battlestarwiki
[...]
@@ -27,16 +30,19 @@
[...]
33: wiktionary
34: wowwiki
Select family of sites we are working on, just enter the number
not name (default: wikipedia):
This is the list of known language(s):
ab ace [...] es et eu [...] zh-yue zu
The language code of the site we're working on (default: 'en'):
et
Username (et wikipedia): AnalyseBot
Which variant of user_config.py:
[S]mall or [E]xtended (with further information)? S
Do you want to add any other projects? (y/N)
'/home/raigo/.pywikibot/user-config.py' written.
Create user-fixes.py file? Optional and for advanced users
([y]es, [N]o)
\end{verbatim}
Questions shown without an answer use the default answer, selected by pressing
Enter.
