bits-and-pieces

This repo contains various utility scripts.

cleansplitwikidump.py
Performs additional cleaning after running WikiExtractor. Run this script from within the text/ directory that WikiExtractor.py creates (text/ should contain a series of directories AA, AB, AC, ..., each of which contains a series of text files wiki_00, wiki_01, ...). The script reads all of the output from WikiExtractor.py and cleans and reorganizes it by:

removing the <doc> tags (and their contents) at the beginning and end of each article
lowercasing each word
removing punctuation, digits, and tags from the edges of each word
omitting words altogether if they still contain any of the above
omitting any duplicate words
writing the remaining/cleaned words, one per line, to the file "cleanedwikiwords.txt"
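
The steps above can be sketched roughly as follows. This is not the code from cleansplitwikidump.py, only a minimal illustration of the same pipeline. It assumes that articles are wrapped in `<doc ...>` and `</doc>` tags (the usual WikiExtractor output), that "punctuation" means ASCII punctuation, and that input files are UTF-8. The names `clean_words` and `write_clean_words` are hypothetical.

```python
import re
import string
from pathlib import Path

# Assumption: each article is wrapped in <doc ...> ... </doc>.
DOC_TAG = re.compile(r"</?doc\b[^>]*>")
# Assumption: any <...> run at the very start or end of a word is a tag.
EDGE_TAG = re.compile(r"^(?:<[^<>]*>)+|(?:<[^<>]*>)+$")
# Characters trimmed from word edges (ASCII punctuation and digits only).
EDGE_CHARS = string.punctuation + string.digits


def clean_words(lines):
    """Yield cleaned, de-duplicated words in first-seen order."""
    seen = set()
    for line in lines:
        # Drop the article-level tags together with their attributes.
        line = DOC_TAG.sub(" ", line)
        for raw in line.split():
            word = EDGE_TAG.sub("", raw.lower()).strip(EDGE_CHARS)
            # Omit words that are empty or still contain punctuation,
            # digits, or tag brackets after trimming.
            if not word or any(ch in EDGE_CHARS for ch in word):
                continue
            if word not in seen:
                seen.add(word)
                yield word


def write_clean_words(root=".", out_name="cleanedwikiwords.txt"):
    """Read every */wiki_* file under root and write one word per line."""
    files = sorted(Path(root).glob("*/wiki_*"))

    def all_lines():
        for path in files:
            with path.open(encoding="utf-8") as fh:
                yield from fh

    with open(Path(root) / out_name, "w", encoding="utf-8") as out:
        for word in clean_words(all_lines()):
            out.write(word + "\n")
```

Calling `write_clean_words(".")` from within text/ would produce cleanedwikiwords.txt in that directory. The set of seen words is held in memory, which keeps the sketch simple but may need rethinking for a full dump.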
