Add scripts to auto sort lexemes in a lexicon
Showing 3 changed files with 106 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
@echo off | ||
setlocal | ||
|
||
REM Input and output file names | ||
set input_file=lexicon.pls | ||
set output_file=sorted_lexicon.pls | ||
|
||
REM If %1 is set, then set the input file to %1 | ||
if not "%~1" == "" set "input_file=%~1" | ||
|
||
REM If %2 is set, then set the output file to %2 | ||
if not "%~2" == "" set "output_file=%~2" | ||
|
||
REM Check if the input file exists | ||
if not exist "%input_file%" ( | ||
echo Input file %input_file% not found! | ||
exit /b 1 | ||
) | ||
|
||
REM Check that Python is installed (this script does not install it) | ||
|
||
if not exist "%PYTHON%" ( | ||
echo Python is not installed! | ||
exit /b 1 | ||
) | ||
|
||
REM Call the Python script located next to this batch file to perform the sorting | ||
python "%~dp0sort_lexemes.py" "%input_file%" "%output_file%" | ||
|
||
endlocal |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
import xml.etree.ElementTree as ET | ||
|
||
def sort_lexemes(input_file, output_file): | ||
    tree = ET.parse(input_file) | ||
    root = tree.getroot() | ||
|
||
    # Namespace used in the XML file | ||
    namespace = {'pls': 'http://www.w3.org/2005/01/pronunciation-lexicon'} | ||
|
||
    # Find all lexeme elements | ||
    lexemes = root.findall('pls:lexeme', namespace) | ||
|
||
    # Sort lexemes by the first grapheme value (case-insensitive) | ||
    def get_grapheme(lex): | ||
        grapheme = lex.find('pls:grapheme', namespace) | ||
        return (grapheme.text or "") if grapheme is not None else "" | ||
    sorted_lexemes = sorted(lexemes, key=lambda lex: get_grapheme(lex).lower()) | ||
|
||
    # Remove `ns0:` prefix from all tags recursively | ||
    for elem in root.iter(): | ||
        elem.tag = elem.tag.replace('{http://www.w3.org/2005/01/pronunciation-lexicon}', '') | ||
|
||
    # Add `xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"` back into the lexicon element | ||
    root.set('xmlns', 'http://www.w3.org/2005/01/pronunciation-lexicon') | ||
|
||
    # Remove existing lexeme elements | ||
    for lex in lexemes: | ||
        root.remove(lex) | ||
|
||
    # Append sorted lexeme elements | ||
    for lex in sorted_lexemes: | ||
        root.append(lex) | ||
|
||
    # Write the sorted XML to the output file | ||
    tree.write(output_file, encoding='UTF-8', xml_declaration=True) | ||
|
||
if __name__ == "__main__": | ||
    input_file = "lexicon.pls" | ||
    output_file = "lexicon.pls"  # Warning: This will overwrite the input file | ||
|
||
    # Set input_file and output_file to the arguments passed to the script | ||
    import sys | ||
    if len(sys.argv) > 1: | ||
        input_file = sys.argv[1] | ||
    if len(sys.argv) > 2: | ||
        output_file = sys.argv[2] | ||
|
||
    sort_lexemes(input_file, output_file) | ||
    print(f"Sorting completed successfully. Output saved to {output_file}.") |
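The tag-stripping loop and the manually added `xmlns` attribute in `sort_lexemes.py` work around ElementTree's default `ns0:` prefixes. One possible alternative, sketched here and not part of this commit, is to register the PLS namespace under the empty prefix before serializing, which leaves the element tags untouched. `PLS_NS`, `SAMPLE`, and `serialize_with_default_namespace` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Hypothetical in-memory lexicon used only for this illustration.
SAMPLE = (
    f'<lexicon xmlns="{PLS_NS}" version="1.0">'
    "<lexeme><grapheme>zebra</grapheme></lexeme>"
    "</lexicon>"
)


def serialize_with_default_namespace(xml_text):
    # Map the PLS namespace to the empty prefix so ElementTree writes
    # xmlns="..." on the root element instead of ns0: prefixes.
    # Note: this registration is global to the running process.
    ET.register_namespace("", PLS_NS)
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


print(serialize_with_default_namespace(SAMPLE))
```

Because the registration is process-wide, this approach suits a single-purpose script like this one better than a library that handles several XML vocabularies.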
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# About | ||
These scripts sort all the `<lexeme>` elements in a lexicon alphabetically (ignoring case) by the text content of each lexeme's first `<grapheme>` element. | ||
|
||
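To make the ordering concrete, here is a minimal sketch of the same sort applied to a small in-memory lexicon. The sample entries and the names `SAMPLE` and `sorted_first_graphemes` are invented for illustration. As in `sort_lexemes.py`, only the first `<grapheme>` of each `<lexeme>` is compared, and the comparison ignores case.

```python
import xml.etree.ElementTree as ET

NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Hypothetical sample data; the first lexeme has two graphemes.
SAMPLE = """<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0">
  <lexeme><grapheme>zebra</grapheme><grapheme>aardvark</grapheme></lexeme>
  <lexeme><grapheme>Banana</grapheme></lexeme>
  <lexeme><grapheme>apple</grapheme></lexeme>
</lexicon>"""


def sorted_first_graphemes(xml_text):
    root = ET.fromstring(xml_text)
    lexemes = root.findall("pls:lexeme", NS)

    def key(lex):
        grapheme = lex.find("pls:grapheme", NS)  # first <grapheme> only
        return (grapheme.text or "").lower() if grapheme is not None else ""

    ordered = sorted(lexemes, key=key)
    return [lex.find("pls:grapheme", NS).text for lex in ordered]


print(sorted_first_graphemes(SAMPLE))  # ['apple', 'Banana', 'zebra']
```

The second grapheme `aardvark` does not move its lexeme to the front, because only the first grapheme is used as the sort key.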
# Prerequisites | ||
These scripts require Python 3.6 or later (the Python script uses f-strings). Download and install Python here: | ||
https://www.python.org/downloads/ | ||
|
||
# Simple Option | ||
This is the simplest way to use the scripts; it requires no command-line usage at all. | ||
|
||
Copy or move **both** `sort_lexemes.bat` and `sort_lexemes.py` into the folder that contains the `lexicon.pls` you want to sort. | ||
|
||
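Double-click the `sort_lexemes.bat` file. The sorted lexicon is written to `sorted_lexicon.pls` in the same folder; the original `lexicon.pls` is left unchanged. | ||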
|
||
# Advanced Option | ||
You can also run either script from the command line without moving it, which requires only basic command-line knowledge. Use either the `.bat` script or the `.py` script; the `.bat` script simply calls the `.py` script. Note that running `sort_lexemes.py` without arguments sorts `lexicon.pls` in the current directory and overwrites it in place. | ||
|
||
From the Windows "Command Prompt" or the (more modern) Windows "Terminal", in either "Command Prompt" or "Windows PowerShell" mode, copy and paste **one** of the following commands and replace the placeholder text. | ||
|
||
## `sort_lexemes.bat` Usage | ||
``` | ||
.\sort_lexemes.bat PATH_TO_EXISTING_LEXICON.pls PATH_TO_NEW_SORTED_LEXICON.pls | ||
``` | ||
|
||
## `sort_lexemes.py` Usage | ||
``` | ||
python .\sort_lexemes.py "PATH_TO_EXISTING_LEXICON.pls" "PATH_TO_NEW_SORTED_LEXICON.pls" | ||
``` |
I removed this Python checker in a later commit since I forgot to test this before pushing this commit. It didn't work.
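The check likely failed because `%PYTHON%` is not an environment variable that Windows or the Python installer defines by default, so `if not exist "%PYTHON%"` would report Python as missing on most machines. One alternative is to look the interpreter up on `PATH`; in a batch file, `where python` serves a similar purpose. The sketch below illustrates that lookup in Python with `shutil.which`. `find_interpreter` is a hypothetical helper, not part of the commit.

```python
import shutil


def find_interpreter(names=("python", "python3", "py")):
    """Return the path of the first interpreter found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


found = find_interpreter()
if found is None:
    print("Python was not found on PATH.")
else:
    print(f"Using interpreter at {found}")
```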