Add scripts to auto sort lexemes in a lexicon
ryankhart committed Jul 14, 2024
1 parent f87112a commit 1514d95
Showing 3 changed files with 106 additions and 0 deletions.
29 changes: 29 additions & 0 deletions lexicons/sort_lexemes.bat
@@ -0,0 +1,29 @@
@echo off
setlocal

REM Input and output file names
set input_file=lexicon.pls
set output_file=sorted_lexicon.pls

REM If %1 is set, use it as the input file (%~1 strips surrounding quotes)
if not "%~1" == "" set "input_file=%~1"

REM If %2 is set, use it as the output file (%~2 strips surrounding quotes)
if not "%~2" == "" set "output_file=%~2"

REM Check if the input file exists
if not exist "%input_file%" (
    echo Input file %input_file% not found!
    exit /b 1
)

REM Check that Python is installed

ryankhart Jul 14, 2024

Author Collaborator

I removed this Python checker in a later commit since I forgot to test this before pushing this commit. It didn't work.

if not exist "%PYTHON%" (
echo Python is not installed!
exit /b 1
)

REM Call the Python script next to this batch file to perform the sorting
python "%~dp0sort_lexemes.py" "%input_file%" "%output_file%"

endlocal
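
The Python check in the batch file above tests whether a file named by `%PYTHON%` exists. A standard Python install does not set that variable, which is presumably why the check always failed. The underlying idea, looking an interpreter up on `PATH`, can be sketched as follows. Python is used here only to keep the examples in one language, and the function name `find_interpreter` and its candidate list are assumptions for illustration, not part of the commit.

```python
import shutil

def find_interpreter(candidates=("python", "python3", "py")):
    """Return the full path of the first candidate found on PATH, or None.

    The candidate names are illustrative assumptions, not names taken
    from the commit.
    """
    for name in candidates:
        path = shutil.which(name)  # searches PATH, like typing the name in a shell
        if path is not None:
            return path
    return None

interpreter = find_interpreter()
if interpreter is None:
    print("Python is not installed or is not on PATH!")
else:
    print(f"Found interpreter at {interpreter}")
```

In a batch file the same lookup would more likely be a call to the Windows `where` command followed by an `errorlevel` test, rather than an `if exist` on an environment variable.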
49 changes: 49 additions & 0 deletions lexicons/sort_lexemes.py
@@ -0,0 +1,49 @@
import xml.etree.ElementTree as ET

def sort_lexemes(input_file, output_file):
    tree = ET.parse(input_file)
    root = tree.getroot()

    # Namespace used in the XML file
    namespace = {'pls': 'http://www.w3.org/2005/01/pronunciation-lexicon'}

    # Find all lexeme elements
    lexemes = root.findall('pls:lexeme', namespace)

    # Sort lexemes by the first grapheme value (case-insensitive)
    def get_grapheme(lex):
        grapheme = lex.find('pls:grapheme', namespace)
        # An empty <grapheme/> has text None, so fall back to ""
        return (grapheme.text or "") if grapheme is not None else ""
    sorted_lexemes = sorted(lexemes, key=lambda lex: get_grapheme(lex).lower())

    # Remove `ns0:` prefix from all tags recursively
    for elem in root.iter():
        elem.tag = elem.tag.replace('{http://www.w3.org/2005/01/pronunciation-lexicon}', '')

    # Add `xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"` back into the lexicon element
    root.set('xmlns', 'http://www.w3.org/2005/01/pronunciation-lexicon')

    # Remove existing lexeme elements
    for lex in lexemes:
        root.remove(lex)

    # Append sorted lexeme elements
    for lex in sorted_lexemes:
        root.append(lex)

    # Write the sorted XML to the output file
    tree.write(output_file, encoding='UTF-8', xml_declaration=True)

if __name__ == "__main__":
    input_file = "lexicon.pls"
    output_file = "lexicon.pls"  # Warning: This will overwrite the input file

    # Set input_file and output_file to the arguments passed to the script
    import sys
    if len(sys.argv) > 1:
        input_file = sys.argv[1]
    if len(sys.argv) > 2:
        output_file = sys.argv[2]

    sort_lexemes(input_file, output_file)
    print(f"Sorting completed successfully. Output saved to {output_file}.")
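
The script above avoids `ns0:` prefixes by stripping the namespace from every tag and then writing `xmlns` back as a plain attribute. An alternative worth considering, shown here only as a sketch and not as what the commit does, is to register the PLS namespace as the default namespace with `ET.register_namespace`, so the elements keep their namespace and the serializer emits `xmlns="..."` itself. The function name `sort_lexemes_keep_namespace` and the sample lexicon are hypothetical.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

def sort_lexemes_keep_namespace(xml_text):
    """Sort <lexeme> children by first grapheme without stripping namespaces."""
    # Registering "" as the prefix makes PLS the default namespace on output.
    # Note: this is process-wide state in ElementTree.
    ET.register_namespace("", PLS_NS)
    root = ET.fromstring(xml_text)
    ns = {"pls": PLS_NS}
    lexemes = root.findall("pls:lexeme", ns)

    def sort_key(lex):
        grapheme = lex.find("pls:grapheme", ns)
        text = (grapheme.text or "") if grapheme is not None else ""
        return text.lower()

    # Detach the lexemes, then re-attach them in sorted order
    for lex in lexemes:
        root.remove(lex)
    root.extend(sorted(lexemes, key=sort_key))
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-entry lexicon, used only to exercise the function
SAMPLE = (
    '<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" version="1.0">'
    "<lexeme><grapheme>zebra</grapheme></lexeme>"
    "<lexeme><grapheme>Apple</grapheme></lexeme>"
    "</lexicon>"
)

result = sort_lexemes_keep_namespace(SAMPLE)
print(result)
```

The trade-off is that the tree stays namespace-correct in memory, at the cost of touching ElementTree's global prefix registry.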
28 changes: 28 additions & 0 deletions lexicons/sort_lexemes_README.md
@@ -0,0 +1,28 @@
# About
These scripts sort all the `<lexeme>` elements in a lexicon alphabetically, ignoring case, by the text content of each lexeme's first `<grapheme>` element.
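
As a rough illustration of that ordering, the sketch below sorts three made-up lexemes. The sample entries are hypothetical and the PLS namespace is left out for brevity, so this is a simplified model of the sort key rather than the script itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample entries, not taken from any real lexicon
SAMPLE = """<lexicon>
  <lexeme><grapheme>banana</grapheme></lexeme>
  <lexeme><grapheme>Cherry</grapheme><grapheme>apricot</grapheme></lexeme>
  <lexeme><grapheme>apple</grapheme></lexeme>
</lexicon>"""

def first_grapheme(lexeme):
    # find() returns only the first matching child, so "apricot" is ignored
    grapheme = lexeme.find("grapheme")
    return (grapheme.text or "") if grapheme is not None else ""

root = ET.fromstring(SAMPLE)
ordered = sorted(root.findall("lexeme"), key=lambda lex: first_grapheme(lex).lower())
sort_keys = [first_grapheme(lex) for lex in ordered]
print(sort_keys)  # ['apple', 'banana', 'Cherry']
```

The second lexeme is placed by `Cherry`, its first grapheme, even though its second grapheme would sort earlier.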

# Prerequisites
These scripts require Python. Download and install it here:
https://www.python.org/downloads/

# Simple Option
This is the simplest way to use the scripts; it requires no command-line use at all.

Copy, move, or drag and drop **both** the `sort_lexemes.bat` and `sort_lexemes.py` files into the folder that contains the `lexicon.pls` you want to sort.

Double-click the `sort_lexemes.bat` file. The sorted lexicon is written to `sorted_lexicon.pls` in the same folder.

# Advanced Option
You can also run either script from the command line without moving it, which requires only basic command-line knowledge. Use either the `.bat` script or the `.py` script; the `.bat` script simply calls the `.py` script.

From the Windows "Command Prompt" or the (more modern) Windows "Terminal", in either "Command Prompt" mode or "Windows PowerShell" mode, copy and paste **one** of the following commands and replace the placeholder text.

## `sort_lexemes.bat` Usage
```
.\sort_lexemes.bat PATH_TO_EXISTING_LEXICON.pls PATH_TO_NEW_SORTED_LEXICON.pls
```

## `sort_lexemes.py` Usage
```
python .\sort_lexemes.py "PATH_TO_EXISTING_LEXICON.pls" "PATH_TO_NEW_SORTED_LEXICON.pls"
```
