kvesik/bits-and-pieces

bits-and-pieces

This repo contains various utility scripts.

cleansplitwikidump.py
Performs additional cleaning on the output of WikiExtractor. Run this script from within the text/ directory that WikiExtractor.py creates (text/ should contain a series of directories AA, AB, AC, ..., each of which contains a series of text files wiki_00, wiki_01, ...). The script reads all of the WikiExtractor.py output and cleans/reorganizes it by:

  • removing the tags (and their contents) at the beginning and end of each article
  • lowercasing each word
  • removing punctuation, digits, and tags from the edges of each word
  • omitting words altogether if they still contain any of the above
  • omitting any duplicate words
  • writing the remaining/cleaned words, one per line, to the file "cleanedwikiwords.txt"
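
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names clean_word, clean_lines, and clean_dump are hypothetical, the article wrapper is assumed to be a line of the form <doc ...> or </doc>, "punctuation" is assumed to mean ASCII punctuation only, and words are kept in first-seen order.

```python
import re
import string
from pathlib import Path

# Assumption: "punctuation" and "digits" mean the ASCII sets from the string module.
EDGE_CHARS = string.punctuation + string.digits
# Assumption: each article is wrapped in a <doc ...> line and a </doc> line.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>\s*$")
# Any run of tags glued to the start or end of a token.
EDGE_TAGS = re.compile(r"^(?:<[^>]*>)+|(?:<[^>]*>)+$")


def clean_word(token):
    """Return the cleaned word, or None if it must be omitted."""
    word = EDGE_TAGS.sub("", token).lower().strip(EDGE_CHARS)
    # Omit empty results and words that still contain punctuation or digits.
    if not word or any(ch in EDGE_CHARS for ch in word):
        return None
    return word


def clean_lines(lines):
    """Yield each cleaned word once, in first-seen order."""
    seen = set()
    for line in lines:
        if DOC_TAG.match(line):
            continue  # skip the article wrapper lines
        for token in line.split():
            word = clean_word(token)
            if word is not None and word not in seen:
                seen.add(word)
                yield word


def clean_dump(text_dir=".", out_name="cleanedwikiwords.txt"):
    """Read every AA/wiki_00-style file under text_dir and write the word list."""
    def all_lines():
        for path in sorted(Path(text_dir).glob("*/wiki_*")):
            with path.open(encoding="utf-8") as handle:
                yield from handle

    with open(Path(text_dir) / out_name, "w", encoding="utf-8") as out:
        for word in clean_lines(all_lines()):
            out.write(word + "\n")
```

Calling clean_dump() from inside text/ would mirror the usage described above. Note that the set of seen words is held in memory, which may matter for a full Wikipedia dump.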
