This custom step identifies the language for text data in an input table and creates a new column containing the language's ISO 639-1 code. It makes use of the textManagement.identifyLanguage CAS action (link provided in documentation below).
Business applications such as social media analysis, customer review analysis, and government or law enforcement work often deal with text in multiple languages. This custom step identifies the language of each observation so that observations can be segmented by language downstream and then analysed using the relevant language pack.
Here's a general idea of how this custom step works (the below is an animated GIF)
Tested in Viya 4, Stable 2022.12
- A SAS Viya 4 environment (monthly release 2022.12 or later) with SAS Studio Flows.
- At runtime: an active connection to CAS. This custom step runs on data loaded in Cloud Analytics Services (CAS), so ensure you are connected to CAS before running it.
- Input port: connect an input CAS table.
- Document ID: select a column which serves as an ID for each observation.
- Text variable: select a character column which contains the text to be analyzed.
- Output columns: select columns which you would like copied over to the output table.
- Output port: connect an output CAS table.
The custom step results in a new column, _language_, which contains the ISO 639-1 language code.
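As a rough illustration of the downstream segmentation this column enables, the sketch below groups rows by their `_language_` value once the output table has been pulled into Python as a list of dictionaries. The function name `segment_by_language`, the sample rows, the `doc_id` and `text` column names, and the `"unknown"` bucket for missing codes are all assumptions made for this example; they are not part of the custom step or of the CAS action.

```python
from collections import defaultdict


def segment_by_language(rows, language_column="_language_"):
    """Group row dictionaries by the language code in `language_column`.

    Rows with a missing or blank code go into an "unknown" bucket
    (an assumption of this sketch, not documented behaviour of the step).
    """
    segments = defaultdict(list)
    for row in rows:
        # Normalise the code: treat None as blank, trim spaces, lowercase.
        code = (row.get(language_column) or "").strip().lower()
        segments[code or "unknown"].append(row)
    return dict(segments)


# Hypothetical rows standing in for the step's output table.
sample_rows = [
    {"doc_id": 1, "text": "The service was excellent.", "_language_": "en"},
    {"doc_id": 2, "text": "Le service était excellent.", "_language_": "fr"},
    {"doc_id": 3, "text": "Great value for money.", "_language_": "en"},
    {"doc_id": 4, "text": "", "_language_": None},
]

segments = segment_by_language(sample_rows)
for code, docs in sorted(segments.items()):
    print(code, [d["doc_id"] for d in docs])
```

Each resulting segment could then be routed to whichever language-specific analysis applies to it.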
- textManagement.identifyLanguage CAS action: https://go.documentation.sas.com/doc/en/pgmsascdc/default/casanpg/n0qdvvymlj69d7n18dfvh6ipjn2k.htm#p0sk06te8li0uyn14l3kt8i4gvdw
- Refer to the steps listed here.
- Sundaresh Sankaran
Version: 1.0 (15FEB2023)