Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839
Merged
Changes from 3 commits
23 commits
f90cd9c (steremma) promoting the markup gives up information needed to find the intelinks
cdfb26a (steremma) Add interlinks to the output of `segment_wiki`
acc5221 (steremma) Fixed PEP 8
0057c7b (steremma) Refactoring identation and variable names
107d7f7 (steremma) Removed debugging code from script
4adcf86 (steremma) Fixed a bug where interlinks with a description or multiple names whe…
9bf6b87 (steremma) Now stripping whitespace off section titles
931e138 (steremma) Unit test `gensim.scripts.segment_wiki`
cd37315 (steremma) Fix Python 3.5 compatibility
c681a60 (steremma) Section text now completely clean from wiki markup
ead5386 (steremma) Added extra logging info to troublehsoot weird Travis behavior
193861c (steremma) Fix PEP 8
e170c06 (menshikh-iv) pin workers for segment_and_write_all_articles
b68507b (steremma) Merge branch 'interlinks' of https://github.com/steremma/gensim into …
0884f6d (steremma) Get rid of debugging stuff
58f63ca (steremma) Get rid of global logger
7682f30 (steremma) Interlinks are now mapping from the linked article's title to the act…
3b13d3b (steremma) Moved regex outside function
e038f52 (steremma) Interlink extraction is now optional and controlled with the `-i` com…
68ca8b1 (steremma) Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
94c2b3d (steremma) PEP 8 long lines
3c838a6 (steremma) made scripts tests aware of the optional interlinks argument
7f9ed71 (steremma) Updated script help output for interlinks
@@ -70,27 +70,28 @@ def segment_all_articles(file_path, min_article_character=200, workers=None):

     Yields
     ------
-    (str, list of (str, str))
-        Structure contains (title, [(section_heading, section_content), ...]).
+    (str, list of (str, str), list of str)
+        Structure contains (title, [(section_heading, section_content), ...], [interlink, ...]).

     """
     with smart_open(file_path, 'rb') as xml_fileobj:
         wiki_sections_corpus = _WikiSectionsCorpus(
             xml_fileobj, min_article_character=min_article_character, processes=workers)
         wiki_sections_corpus.metadata = True
         wiki_sections_text = wiki_sections_corpus.get_texts_with_sections()
-        for article_title, article_sections in wiki_sections_text:
-            yield article_title, article_sections
+        for article_title, article_sections, article_interlinks in wiki_sections_text:
+            yield article_title, article_sections, article_interlinks


 def segment_and_write_all_articles(file_path, output_file, min_article_character=200, workers=None):
     """Write article title and sections to `output_file` (or stdout, if output_file is None).

-    The output format is one article per line, in json-line format with 3 fields::
+    The output format is one article per line, in json-line format with 4 fields::

         'title' - title of article,
         'section_titles' - list of titles of sections,
-        'section_texts' - list of content from sections.
+        'section_texts' - list of content from sections,
+        'section_interlinks' - list of interlinks in the article.

     Parameters
     ----------
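The four-field json-line format described in the docstring above could be consumed roughly as follows. This is a minimal sketch, not part of the PR: the sample record is invented for illustration, and the helper name `iter_articles` is hypothetical rather than a gensim function.

```python
import io
import json

# Hypothetical sample line shaped like the docstring describes; not real output.
SAMPLE = (
    '{"title": "Anarchism", '
    '"section_titles": ["Introduction", "History"], '
    '"section_texts": ["Intro text.", "History text."], '
    '"section_interlinks": ["political philosophy", "state"]}\n'
)


def iter_articles(fileobj):
    """Yield (title, [(section_title, section_text), ...], interlinks) per line."""
    for line in fileobj:
        # Each line is one self-contained JSON object.
        article = json.loads(line)
        sections = list(zip(article["section_titles"], article["section_texts"]))
        # .get() keeps files written with the older 3-field format loadable.
        yield article["title"], sections, article.get("section_interlinks", [])


articles = list(iter_articles(io.StringIO(SAMPLE)))
```

Falling back to an empty list when `section_interlinks` is absent is a design choice made for this sketch; a stricter reader might prefer to raise instead.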
@@ -115,8 +116,13 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character

     try:
         article_stream = segment_all_articles(file_path, min_article_character, workers=workers)
-        for idx, (article_title, article_sections) in enumerate(article_stream):
-            output_data = {"title": article_title, "section_titles": [], "section_texts": []}
+        for idx, (article_title, article_sections, article_interlinks) in enumerate(article_stream):
+            output_data = {"title": article_title,
+                           "section_titles": [],
+                           "section_texts": [],
+                           "section_interlinks": article_interlinks
+                           }
+
             for section_heading, section_content in article_sections:
                 output_data["section_titles"].append(section_heading)
                 output_data["section_texts"].append(section_content)
Review comment on the "section_interlinks" line: If you don't split interlinks by section, this key should be named "interlinks" instead of "section_interlinks".
@@ -171,9 +177,10 @@ def segment(page_xml):
         Content from page tag.

     Returns
+
     -------
-    (str, list of (str, str))
-        Structure contains (title, [(section_heading, section_content)]).
+    (str, list of (str, str), list of str)
+        Structure contains (title, [(section_heading, section_content), ...], [interlink, ...]).

     """
     elem = cElementTree.fromstring(page_xml)
@@ -186,6 +193,7 @@ def segment(page_xml):
     lead_section_heading = "Introduction"
     top_level_heading_regex = r"\n==[^=].*[^=]==\n"
     top_level_heading_regex_capture = r"\n==([^=].*[^=])==\n"
+    interlink_regex_capture = r"\[\[(.*?)\]\]"

     title = elem.find(title_path).text
     text = elem.find(text_path).text
@@ -203,7 +211,14 @@ def segment(page_xml):

     section_contents = [filter_wiki(section_content) for section_content in section_contents]
     sections = list(zip(section_headings, section_contents))
-    return title, sections
+
+    interlinks = []
+    for filtered_content in section_contents:
+        section_interlinks = re.findall(interlink_regex_capture, filtered_content)
+        legit_interlinks = [i for i in section_interlinks if '[' not in i and ']' not in i]
+        interlinks.extend(legit_interlinks)
+
+    return title, sections, interlinks


 class _WikiSectionsCorpus(WikiCorpus):
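To make the behaviour of the new pattern and the bracket filter in `segment` concrete, here is a small standalone sketch. The regex is copied from the diff; the wrapper name `extract_interlinks` and the sample markup are assumptions made up for illustration, and the sketch runs on raw markup rather than on the output of `filter_wiki`.

```python
import re

interlink_regex_capture = r"\[\[(.*?)\]\]"  # same pattern as in the diff


def extract_interlinks(text):
    """Return the [[...]] targets in `text`, skipping matches that contain brackets."""
    candidates = re.findall(interlink_regex_capture, text)
    # A non-greedy match that starts at an outer [[ and ends at an inner ]]
    # still contains '[' characters; the filter drops such partial matches.
    return [c for c in candidates if "[" not in c and "]" not in c]


# Hypothetical raw wiki markup, invented for this example.
raw = "See [[anarchy]] and [[State (polity)|the state]]. [[File:x.png|[[nested]] caption]]"
links = extract_interlinks(raw)
# links == ["anarchy", "State (polity)|the state"]
```

Note that a piped link keeps its `target|description` form here, and that text with no `[[...]]` markup left in it yields an empty list.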
@@ -256,8 +271,8 @@ def get_texts_with_sections(self):

         Yields
         ------
-        (str, list of (str, str))
-            Structure contains (title, [(section_heading, section_content), ...]).
+        (str, list of (str, str), list of str)
+            Structure contains (title, [(section_heading, section_content), ...], [interlink, ...]).

         """
         skipped_namespace, skipped_length, skipped_redirect = 0, 0, 0
@@ -267,7 +282,7 @@ def get_texts_with_sections(self):
         # process the corpus in smaller chunks of docs, because multiprocessing.Pool
         # is dumb and would load the entire input into RAM at once...
         for group in utils.chunkize(page_xmls, chunksize=10 * self.processes, maxsize=1):
-            for article_title, sections in pool.imap(segment, group):  # chunksize=10):
+            for article_title, sections, interlinks in pool.imap(segment, group):  # chunksize=10):
                 # article redirects are pruned here
                 if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):  # filter non-articles
                     skipped_namespace += 1
@@ -282,7 +297,7 @@ def get_texts_with_sections(self):

                 total_articles += 1
                 total_sections += len(sections)
-                yield (article_title, sections)
+                yield (article_title, sections, interlinks)
         logger.info(
             "finished processing %i articles with %i sections (skipped %i redirects, %i stubs, %i ignored namespaces)",
             total_articles, total_sections, skipped_redirect, skipped_length, skipped_namespace)
@@ -321,3 +336,15 @@ def get_texts_with_sections(self):
     )

     logger.info("finished running %s", sys.argv[0])
+
+    print("-----Now checking output--------\n\n\n")
+    for line in smart_open(args.output):
+        # decode each JSON line into a Python dictionary object
+        article = json.loads(line)
+
+        # each article has a "title" and a list of "section_titles" and "section_texts".
+        print("Article title: %s" % article['title'])
+        print("Article interlinks: %s" % article['section_interlinks'])
+        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
+            print("Section title: %s" % section_title)
+            print("Section text: %s" % section_text)
Review comment on the print("-----Now checking output--------\n\n\n") line: Why? This isn't needed here.
Review comment: Please use hanging indents (instead of vertical alignment).
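For readers unfamiliar with the two styles, the difference can be sketched as below. This assumes the comment refers to the `output_data` dict literal in `segment_and_write_all_articles`; the page does not say which lines it is attached to, and the variable values are placeholders.

```python
article_title = "Example"          # placeholder values for illustration
article_interlinks = ["a", "b"]

# Vertical alignment: continuation lines line up with the opening delimiter.
vertical = {"title": article_title,
            "section_titles": [],
            "section_texts": [],
            "section_interlinks": article_interlinks}

# Hanging indent: nothing follows the opening brace, items are indented one
# level, and the closing brace sits on its own line.
hanging = {
    "title": article_title,
    "section_titles": [],
    "section_texts": [],
    "section_interlinks": article_interlinks,
}
```

Both literals build the same dictionary; the hanging form avoids re-aligning every line when the variable name changes.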