Commit
Merge pull request #29 from AjaxMultiCommentary/ocr_debug
Ocr debug
Showing 39 changed files with 1,100 additions and 538 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
from ajmc.commons.file_management import walk_dirs | ||
from ajmc.corpora import variables as vs | ||
from ajmc.corpora.corpora_classes import Corpus | ||
|
||
|
||
DONE = [ | ||
    'forum_romanum', | ||
    'corpus_scriptorum_latinorum', | ||
    'canonical-latinLit', | ||
    'canonical-greekLit', | ||
    'perseus_secondary', | ||
    'perseus_legacy', | ||
    'First1KGreek', | ||
    'propylaeum_BOOKS', | ||
    'propylaeum_DOK', | ||
    'agoraclass', | ||
] | ||
|
||
corpora_stats = {} | ||
|
||
for corpus_id in walk_dirs(vs.ROOT_STORING_DIR): | ||
    corpus_id = corpus_id.stem | ||
    corpus_id = 'EpibauCorpus' | ||
    if corpus_id in DONE: | ||
        continue | ||
    print('---------------------------------') | ||
    print(corpus_id) | ||
    try: | ||
        corpus = Corpus.auto_init(corpus_id) | ||
        corpora_stats[corpus_id] = len(corpus.get_plain_text()) | ||
        print(corpora_stats[corpus_id]) | ||
    except Exception as e: | ||
        print('Skipping corpus:', corpus_id, e) | ||
    break |
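
The per-corpus counting loop above can be sketched without the `ajmc` dependency. This is a minimal illustration, not the repository's code: `collect_char_counts`, `fake_loader`, and the in-memory `_FAKE_CORPORA` are hypothetical names, and `load_text` merely stands in for whatever returns a corpus's plain text (in the diff, `Corpus.auto_init(corpus_id).get_plain_text()`). Unlike the committed script, the sketch has no hard-coded corpus id and no trailing `break`, so it visits every id once.

```python
from typing import Callable, Dict, Iterable, Set


def collect_char_counts(corpus_ids: Iterable[str],
                        done: Set[str],
                        load_text: Callable[[str], str]) -> Dict[str, int]:
    """Return {corpus_id: number of characters}, skipping ids in `done`."""
    stats: Dict[str, int] = {}
    for corpus_id in corpus_ids:
        if corpus_id in done:
            continue  # already processed in an earlier run
        try:
            stats[corpus_id] = len(load_text(corpus_id))
        except Exception as e:  # keep going when one corpus fails to load
            print('Skipping corpus:', corpus_id, e)
    return stats


# Hypothetical in-memory corpora, used only to demonstrate the function.
_FAKE_CORPORA = {'agoraclass': 'abc', 'EpibauCorpus': 'salve munde'}


def fake_loader(corpus_id: str) -> str:
    return _FAKE_CORPORA[corpus_id]  # raises KeyError for unknown ids


stats = collect_char_counts(['agoraclass', 'EpibauCorpus', 'missing'],
                            done={'agoraclass'},
                            load_text=fake_loader)
print(stats)
```

Passing the loader as a parameter keeps the skip-and-count logic testable on its own; the real script would presumably pass a small wrapper around `Corpus.auto_init`.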
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
from pathlib import Path | ||
|
||
# Read an XML file with bs4 | ||
from bs4 import BeautifulSoup | ||
|
||
file_path = Path('/Users/sven/Desktop/lawiki-20240320-pages-articles-multistream.xml') | ||
soup = BeautifulSoup(file_path.read_text('utf-8'), features='xml') | ||
|
||
# Find all the elements named 'page' in the soup | ||
pages = soup.find_all('page') | ||
|
||
print(len(pages)) | ||
|
||
print(pages[0].prettify()) | ||
|
||
# We now get the text of the first page | ||
text = pages[0].text | ||
|
||
# We now estimate the size of the text in GB (len() counts characters, not bytes) | ||
size = len(text) / 1e9 |
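
The script above loads the entire dump into memory before parsing it. For a large file, a streaming approach may be preferable. The following is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse` rather than bs4; `iter_page_texts`, `local_name`, and the tiny inline `sample` document (including its namespace URI) are assumptions made for illustration, not part of the commit. It measures size in UTF-8 bytes instead of characters.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit('}', 1)[-1]


def iter_page_texts(source):
    """Yield the concatenated text of each <page> element, one at a time."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if local_name(elem.tag) == 'page':
            yield ''.join(elem.itertext())
            elem.clear()  # drop the children of the page we just processed


# Hypothetical stand-in for a dump file; a path string works as `source` too.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://example.org/ns">'
    b'<page><title>Roma</title><text>urbs</text></page>'
    b'<page><title>Gallia</title><text>omnis</text></page>'
    b'</mediawiki>'
)

n_pages = 0
n_bytes = 0
for page_text in iter_page_texts(sample):
    n_pages += 1
    n_bytes += len(page_text.encode('utf-8'))

size_gb = n_bytes / 1e9
print(n_pages, n_bytes, size_gb)
```

Note that `elem.clear()` empties each processed page but leaves the emptied element attached to the root, so memory still grows slowly on very large inputs.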