Add preprocessing in summarize function to remove newlines in the middle of sentences #1575

Closed
diegospd opened this issue Sep 7, 2017 · 0 comments
Labels: bug, difficulty easy

diegospd commented Sep 7, 2017

Description

The summarize function depends on the text having at least 10 sentences, as measured by clean_text_by_sentences. If the text is shorter than that, summarization fails in an undocumented way. Moreover, clean_text_by_sentences cannot properly handle text with newlines in the middle of a sentence. I suggest adding a preprocessing step that removes them.

I'm currently using this to work around the bug:

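import re
# `text` is the raw input string that will be passed to summarize.
# \s+ already matches \n, \r and \t, so one substitution collapses
# every run of whitespace into a single space.
text = re.sub(r'\s+', ' ', text)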
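As a rough sketch of what the suggested preprocessing step might look like, the workaround can be wrapped in a small helper that runs before summarization. The name strip_inline_newlines and the sample text are hypothetical and not part of gensim. Unlike the snippet above, this version also trims leading and trailing whitespace.

```python
import re

# Precompiled pattern: one or more whitespace characters of any kind
# (spaces, \n, \r, \t, ...).
_WHITESPACE_RUN = re.compile(r"\s+")


def strip_inline_newlines(text):
    """Collapse every run of whitespace into a single space.

    Hypothetical helper, not part of gensim. It removes hard line
    wraps so that a sentence broken across lines is seen as one
    sentence by a downstream sentence splitter.
    """
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Sample input with newlines, a carriage return and a tab mid-sentence.
wrapped = "The quick brown fox\njumps over\r\n\tthe lazy dog. It then\nruns away."
print(strip_inline_newlines(wrapped))
# prints: The quick brown fox jumps over the lazy dog. It then runs away.
```

The cleaned string would then be passed to summarize in place of the raw text. Note that this also flattens paragraph breaks, which is assumed to be acceptable here because only sentence boundaries matter for the sentence count.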

Versions

  • Python 3.6.2 (default, Jul 17 2017, 16:44:45)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)]
  • NumPy 1.13.0
  • SciPy 0.19.1
  • gensim 2.3.0
  • FAST_VERSION 0