Commit
Merge pull request #6 from ichn-hu/master
update
Josep-h authored Mar 30, 2019
2 parents c95260a + 2e83a74 commit 662ac24
Showing 2 changed files with 26 additions and 25 deletions.
47 changes: 23 additions & 24 deletions docs/assignment-2/index.html

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion docs/assignment-2/index.md
@@ -28,7 +28,7 @@ In this part of the assignment, you are required to use logistic regression to d

Text classification is the task of assigning a given document to its corresponding genre, e.g. sports, news, health, etc. To represent the document in a form that logistic regression can conveniently process, we can encode it as a multi-hot vector.

-Firstly you will have to tokenize the document into words (or a list of strings), you should firstly ignore all the characters in `string.punctuation` and then make all `string.whitespace` characters a space , and then split the document by all the spaces to have a list of strings. Then you build a vocabulary on all the words in the training dataset, which simply maps each word in the training dataset to a number. For example,
+First, you will have to tokenize the document into words (i.e. a list of strings): ignore all the characters in `string.punctuation`, then turn every `string.whitespace` character into a space, and then split the document on spaces to obtain a list of strings. Subsequently, you should convert all characters to lowercase to facilitate further processing. Then you build a vocabulary over all the words in the training dataset, which simply maps each word in the training dataset to a number. For example,

```python
docs_toy = [
@@ -66,6 +66,8 @@ and you use the vocabulary to map the tokenized document to a multi-hot vector!

as you can verify, this is the representation of the above two documents.

+In practice, the vocabulary dictionary is quite large, which may cause the size of the multi-hot vector to exceed the memory limit! To address this problem, you can set a frequency threshold `min_count` and only consider words that occur at least `min_count` times in the overall training set. For this problem, `min_count = 10` is suitable.
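
A minimal sketch of the tokenization, vocabulary, and multi-hot steps described above, in plain Python (the helper names `tokenize`, `build_vocab`, and `multi_hot` are illustrative, not part of the handout):

```python
import string
from collections import Counter

def tokenize(doc):
    # Drop punctuation, turn every whitespace character into a space,
    # lowercase, and split on spaces to get a list of word strings.
    doc = "".join(ch for ch in doc if ch not in string.punctuation)
    doc = "".join(" " if ch in string.whitespace else ch for ch in doc)
    return doc.lower().split()

def build_vocab(tokenized_docs, min_count=10):
    # Keep only words that occur at least `min_count` times in the training set,
    # and map each surviving word to an integer index.
    counts = Counter(word for doc in tokenized_docs for word in doc)
    kept = [word for word, c in counts.items() if c >= min_count]
    return {word: idx for idx, word in enumerate(kept)}

def multi_hot(tokens, vocab):
    # A document becomes a 0/1 vector with a 1 at the index of every word it contains.
    vec = [0] * len(vocab)
    for word in tokens:
        if word in vocab:
            vec[vocab[word]] = 1
    return vec
```

With `min_count = 10`, rare words simply fall outside the vocabulary and contribute nothing to the multi-hot vector.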

Once you can represent each document as a vector (and the category of the document as a one-hot vector), you can use logistic regression!
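
For the one-hot category representation, a small illustrative sketch (the category list below is made up for the example):

```python
categories = ["sports", "news", "health"]  # hypothetical label set, for illustration only

def one_hot(label):
    # The category becomes a 0/1 vector with a single 1 at the label's index.
    vec = [0] * len(categories)
    vec[categories.index(label)] = 1
    return vec

one_hot("news")  # -> [0, 1, 0]
```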

Logistic regression is a kind of generalized linear model; the major (or only?) difference between logistic regression and least squares is that in logistic regression we apply a non-linear function after the linear transformation to enable a probabilistic interpretation of the output. For binary classification, the logistic sigmoid function
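
A minimal sketch of how that sigmoid is used in binary logistic regression, with assumed names for the weight vector `w` and bias `b`:

```python
import math

def sigmoid(z):
    # Logistic sigmoid: squashes a real-valued score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # Linear transformation followed by the sigmoid, so the output can be
    # read as P(y = 1 | x) in the binary classification case.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)
```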
