sanskrit_text package

Submodules

sanskrit_text.cli module

Console interface for sanskrit-text.

sanskrit_text.cli._read_text_argument(text_arg: str | None) → str: Return text from argument or stdin if argument is missing.

sanskrit_text.cli.build_parser() → ArgumentParser: Build the top-level argument parser.

sanskrit_text.cli.main(argv=None) → int: Entry point for the skt console script.

Module contents

Sanskrit Text Utility

sanskrit_text.ord_unicode(ch: str) → str

Get the 4-character Unicode identifier corresponding to a character

Parameters: ch (str) – Single character
Returns: 4-character Unicode identifier
Return type: str

sanskrit_text.chr_unicode(u: str) → str

Get the Unicode character corresponding to a 4-character identifier

Parameters: u (str) – 4-character Unicode identifier
Returns: Single character
Return type: str
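
As a rough illustration of the identifier format, the round trip can be sketched with the standard library alone. The helper names below are hypothetical, and the zero-padded uppercase hexadecimal form is an assumption; the package may format identifiers differently.

```python
def ord_unicode_sketch(ch: str) -> str:
    """Return the code point of a single character as 4 hex digits."""
    # Assumption: zero-padded, uppercase hexadecimal
    return format(ord(ch), "04X")


def chr_unicode_sketch(u: str) -> str:
    """Return the character for a 4-character hexadecimal identifier."""
    return chr(int(u, 16))


print(ord_unicode_sketch("अ"))     # 0905
print(chr_unicode_sketch("0915"))  # क
```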

sanskrit_text.form_pratyaahaara(letters: List[str]) → str: Form a pratyaahaara from a list of letters

sanskrit_text.resolve_pratyaahaara(pratyaahaara: str) → List[List[str]]: Resolve pratyaahaara into all possible lists of characters
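
Why "all possible lists": a marker such as ण् occurs twice in the Maheshwara sutras, so one pratyaahaara can describe more than one run of letters. The sketch below shows the idea with a hand-written sutra table. The name resolve_sketch and the input form (first letter followed by the marker with halanta) are assumptions, not the package's confirmed behaviour.

```python
from typing import List

# The fourteen Maheshwara sutras as (letters, marker) pairs
SUTRAS = [
    ("अ इ उ", "ण्"),
    ("ऋ ऌ", "क्"),
    ("ए ओ", "ङ्"),
    ("ऐ औ", "च्"),
    ("ह य व र", "ट्"),
    ("ल", "ण्"),
    ("ञ म ङ ण न", "म्"),
    ("झ भ", "ञ्"),
    ("घ ढ ध", "ष्"),
    ("ज ब ग ड द", "श्"),
    ("ख फ छ ठ थ च ट त", "व्"),
    ("क प", "य्"),
    ("श ष स", "र्"),
    ("ह", "ल्"),
]


def resolve_sketch(pratyaahaara: str) -> List[List[str]]:
    """Return every run that starts at the first letter and ends before the marker."""
    start, marker = pratyaahaara[0], pratyaahaara[1:]

    # Flatten the sutras into one sequence, tagging markers so that they
    # are never mistaken for ordinary letters
    sequence = []
    for letters, sutra_marker in SUTRAS:
        sequence.extend((letter, False) for letter in letters.split())
        sequence.append((sutra_marker, True))

    results = []
    for i, (item, is_marker) in enumerate(sequence):
        if is_marker or item != start:
            continue
        for j in range(i + 1, len(sequence)):
            end, end_is_marker = sequence[j]
            if end_is_marker and end == marker:
                # Keep the letters between start and marker, dropping markers
                results.append([x for x, m in sequence[i:j] if not m])
    return results


print(resolve_sketch("अच्"))       # one list: the nine vowels
print(len(resolve_sketch("अण्")))  # 2, because ण् closes two different sutras
```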

sanskrit_text.clean(text: str, punct: bool = False, digits: bool = False, spaces: bool = True, allow: list = None) → str

Clean a line of Sanskrit (Devanagari) text

Parameters:

text (str) – Input string
punct (bool, optional) – If True, punctuation is kept. The default is False.
digits (bool, optional) – If True, digits are kept. The default is False.
spaces (bool, optional) – If False, spaces are removed. It is recommended to not change the default value unless it is specifically relevant to a use-case. The default is True.
allow (list, optional) – List of characters to allow. The default is None.

Returns:

Clean version of the string

Return type:

str
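
The interaction of the flags can be sketched as a character filter over the Devanagari Unicode block. This is a minimal illustration only: clean_sketch is a hypothetical name, treating only the dandas as punctuation is an assumption, and collapsing runs of whitespace is a choice made here rather than documented behaviour.

```python
DANDAS = {"\u0964", "\u0965"}                      # । and ॥
DIGITS = {chr(c) for c in range(0x0966, 0x0970)}   # ० to ९


def clean_sketch(text, punct=False, digits=False, spaces=True, allow=None):
    """Keep Devanagari characters plus whatever the flags permit."""
    allowed = set(allow or [])
    kept = []
    for ch in text:
        if ch in allowed:
            kept.append(ch)
        elif ch in DANDAS:
            if punct:
                kept.append(ch)
        elif ch in DIGITS:
            if digits:
                kept.append(ch)
        elif ch.isspace():
            if spaces:
                kept.append(" ")
        elif 0x0900 <= ord(ch) <= 0x097F:
            kept.append(ch)
    # Assumption: collapse whitespace left behind by removed characters
    return " ".join("".join(kept).split())


line = "रामः gacchati १२ ।"
print(clean_sketch(line))                            # रामः
print(clean_sketch(line, punct=True, digits=True))   # रामः १२ ।
```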

sanskrit_text.split_lines(text: str, pattern='[।॥\\r\\n]+') → List[str]

Split a string into a list of strings using regular expression

Parameters:

text (str) – Input string
pattern (regexp, optional) – Regular expression corresponding to the split points. The default is r'[।॥\r\n]+'.

Returns:

List of strings

Return type:

List[str]
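
The default pattern treats any run of dandas, carriage returns and newlines as one split point. A minimal sketch with re.split follows; split_lines_sketch is a hypothetical name, and stripping each piece and dropping empty ones are assumptions made here.

```python
import re
from typing import List


def split_lines_sketch(text: str, pattern: str = r"[।॥\r\n]+") -> List[str]:
    """Split on runs of danda, double danda, CR or LF."""
    pieces = re.split(pattern, text)
    # Assumption: surrounding whitespace and empty pieces are not wanted
    return [piece.strip() for piece in pieces if piece.strip()]


verse = "धर्मक्षेत्रे कुरुक्षेत्रे ।\nसमवेता युयुत्सवः ॥"
print(split_lines_sketch(verse))
# ['धर्मक्षेत्रे कुरुक्षेत्रे', 'समवेता युयुत्सवः']
```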

sanskrit_text.trim_matra(line: str) → str

Trim trailing mātrā and related markers from the end of a string.

This is a simple orthographic heuristic intended for rough normalisation (for example, comparing or grouping word-final consonant bases). It is not a linguistically realistic notion of stemming or lemmatisation and should not be used as such.

The function removes, in order:

1. A final anusvāra/halanta/visarga, if present.
2. A final mātrā character, if present after step (1).

If the input string is empty, it is returned unchanged.
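
The two removal steps can be sketched as follows. The name trim_matra_sketch is hypothetical, and the exact character sets are assumptions: the package may treat a slightly different set of signs as mātrā.

```python
# Anusvāra (U+0902), visarga (U+0903) and halanta (U+094D)
FINAL_MARKERS = {"\u0902", "\u0903", "\u094D"}
# Dependent vowel signs U+093E to U+094C, plus vocalic l and ll signs
MATRAS = {chr(c) for c in range(0x093E, 0x094D)} | {"\u0962", "\u0963"}


def trim_matra_sketch(line: str) -> str:
    """Drop one trailing marker, then one trailing mātrā."""
    if not line:
        return line
    if line[-1] in FINAL_MARKERS:      # step 1
        line = line[:-1]
    if line and line[-1] in MATRAS:    # step 2
        line = line[:-1]
    return line


print(trim_matra_sketch("रामः"))  # राम
print(trim_matra_sketch("देवी"))  # देव
print(trim_matra_sketch("गतिं"))  # गत
```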

sanskrit_text.is_laghu(syllable: str) → bool: Checks if the current syllable is Laghu

sanskrit_text.toggle_matra(syllable: str) → str: Change the Laghu syllable to Guru and Guru to Laghu (if possible)

sanskrit_text.marker_to_swara(m: str) → str: Convert a Matra to the corresponding Swara

sanskrit_text.swara_to_marker(s: str) → str: Convert a Swara to the corresponding Matra
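
The Matra and Swara conversion amounts to a lookup between dependent vowel signs and independent vowels. A small sketch with a hand-written table (the table name and its coverage are assumptions; the package may cover more signs):

```python
# Dependent vowel sign -> independent vowel
MARKER_TO_SWARA = {
    "\u093E": "\u0906",  # ा -> आ
    "\u093F": "\u0907",  # ि -> इ
    "\u0940": "\u0908",  # ी -> ई
    "\u0941": "\u0909",  # ु -> उ
    "\u0942": "\u090A",  # ू -> ऊ
    "\u0943": "\u090B",  # ृ -> ऋ
    "\u0947": "\u090F",  # े -> ए
    "\u0948": "\u0910",  # ै -> ऐ
    "\u094B": "\u0913",  # ो -> ओ
    "\u094C": "\u0914",  # ौ -> औ
}
# The reverse direction is the inverted table
SWARA_TO_MARKER = {swara: marker for marker, swara in MARKER_TO_SWARA.items()}

print(MARKER_TO_SWARA["\u0940"])            # ई
print(SWARA_TO_MARKER["ए"] == "\u0947")     # True
```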

sanskrit_text.get_anunaasika(ch: str) → str: Get the appropriate anunaasika from the character’s group

sanskrit_text.fix_anuswara(text: str) → str: Check every anuswaara in the text and change to anunaasika if applicable
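
The rule behind these two functions can be sketched like this: an anuswaara followed by a consonant from one of the five varga groups is replaced by that group's nasal plus a halanta. The function names are hypothetical, and returning an empty string for letters outside the five groups is an assumption.

```python
# The five varga groups; the last letter of each is its nasal
VARGAS = ["कखगघङ", "चछजझञ", "टठडढण", "तथदधन", "पफबभम"]
ANUSWARA = "\u0902"
HALANTA = "\u094D"


def get_anunaasika_sketch(ch: str) -> str:
    """Return the nasal of the varga containing ch, or '' if there is none."""
    for varga in VARGAS:
        if ch and ch in varga:
            return varga[-1]
    return ""


def fix_anuswara_sketch(text: str) -> str:
    """Replace anuswaara with the class nasal where the next letter allows it."""
    out = []
    for i, ch in enumerate(text):
        if ch == ANUSWARA and i + 1 < len(text):
            nasal = get_anunaasika_sketch(text[i + 1])
            if nasal:
                out.append(nasal + HALANTA)
                continue
        out.append(ch)
    return "".join(out)


print(fix_anuswara_sketch("संगीत"))    # सङ्गीत
print(fix_anuswara_sketch("संस्कृत"))  # संस्कृत (unchanged, स is not in a varga)
```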

sanskrit_text.get_syllables_word(word: str, technical: bool = False) → List[str]

Get syllables from a Sanskrit (Devanagari) word

Parameters:

word (str) – Sanskrit (Devanagari) word to get syllables from. Spaces, if present, are ignored.
technical (bool, optional) – If True, ensures that each element contains at most one Swara or Vyanjana. The default is False.

Returns:

List of syllables

Return type:

List[str]

sanskrit_text.get_syllables(text: str, technical: bool = False) → List[List[List[str]]]

Get syllables from a Sanskrit (Devanagari) text

Parameters:

text (str) – Sanskrit (Devanagari) text to get syllables from
technical (bool, optional) – If True, ensures that each element contains at most one Swara or Vyanjana. The default is False.

Returns:

List of syllables in a nested list format. Nesting Levels: Text -> Lines -> Words

Return type:

List[List[List[str]]]
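
To make the Text -> Lines -> Words nesting concrete, here is a sketch that builds the same shape from a simple orthographic splitter. Everything in it is an assumption for illustration: the regular expression is a rough stand-in for the package's syllabification, the function names are hypothetical, and lines are taken from newlines only.

```python
import re
from typing import List

CONSONANT = "[\u0915-\u0939]"  # क to ह
AKSHARA = re.compile(
    # consonant cluster joined by halanta, optional vowel sign or final
    # halanta, optional anusvāra/visarga
    f"(?:{CONSONANT}\u094D)*{CONSONANT}(?:[\u093E-\u094C]|\u094D)?[\u0902\u0903]?"
    # or an independent vowel with optional anusvāra/visarga
    "|[\u0905-\u0914][\u0902\u0903]?"
)


def syllables_word_sketch(word: str) -> List[str]:
    """Innermost level: one list of syllables per word."""
    return AKSHARA.findall(word)


def syllables_sketch(text: str) -> List[List[List[str]]]:
    """Outer levels: a list of lines, each a list of words."""
    return [
        [syllables_word_sketch(word) for word in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


result = syllables_sketch("धर्मक्षेत्रे कुरुक्षेत्रे\nसमवेता")
print(len(result))      # 2 lines
print(len(result[0]))   # 2 words in the first line
print(result[0][0])     # ['ध', 'र्म', 'क्षे', 'त्रे']
```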

sanskrit_text.split_varna_word(word: str, technical: bool = True) → List[str]

Obtain the Varna decomposition of a Sanskrit (Devanagari) word

Parameters:

word (str) – Sanskrit (Devanagari) word to be split.
technical (bool, optional) – If True, vowels and vowel signs are treated independently in the split, which is more useful for analysis. The default is True.

Returns:

List of Varna

Return type:

List[str]

sanskrit_text.split_varna(text: str, technical: bool = True, flat: bool = False) → List[List[List[str]]]

Obtain the Varna decomposition of a Sanskrit (Devanagari) text

Parameters:

text (str) – Sanskrit (Devanagari) text to be split.
technical (bool, optional) – If True, vowels and vowel signs are treated independently in the split, which is more useful for analysis. The default is True.
flat (bool, optional) – If True, a single list is returned instead of nested lists. The default is False.

Returns:

Varna decomposition of the text in a nested list format. Nesting Levels: Text -> Lines -> Words

Varna decomposition of each word is a List[char].
List of Varna decomposition of each word from a line.
List of Varna decomposition of each line from the text.

If flat=True, Varna decomposition of the entire text is presented as a single list, also containing whitespace markers. Lines are separated by a newline character '\n' and words are separated by a space character ' '.

Return type:

List[List[List[str]]] or List[str]
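
The relationship between the nested and flat forms can be sketched as a plain flattening step that inserts the whitespace markers described above. The helper name and the hand-written sample decomposition are assumptions; whether the package emits markers in exactly these positions is not confirmed here.

```python
from typing import List


def flatten_varna_sketch(nested: List[List[List[str]]]) -> List[str]:
    """Flatten Text -> Lines -> Words into one list with ' ' and '\\n' markers."""
    flat: List[str] = []
    for line_index, line in enumerate(nested):
        if line_index:
            flat.append("\n")      # marker between lines
        for word_index, word in enumerate(line):
            if word_index:
                flat.append(" ")   # marker between words
            flat.extend(word)
    return flat


# Hand-written sample: two words on the first line, one on the second
nested = [[["र्", "आ", "म्", "अ"], ["व्", "अ"]], [["ग्", "अ"]]]
print(flatten_varna_sketch(nested))
# ['र्', 'आ', 'म्', 'अ', ' ', 'व्', 'अ', '\n', 'ग्', 'अ']
```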

sanskrit_text.join_varna(viccheda: str, technical: bool = True) → str

Join Varna decomposition to form a Sanskrit (Devanagari) word

Parameters:

viccheda (list) – Viccheda output obtained by split_varna_word with technical=True (or output of split_varna with technical=True and flat=True). IMPORTANT: technical=True is necessary.
technical (bool) – WARNING: Currently unused. Value of the same parameter passed to split_varna_word.

Note

Currently only works for the viccheda generated with technical=True

Returns: s – Sanskrit word
Return type: str

sanskrit_text.get_ucchaarana_vector(letter: str, abbrev=False) → Dict[str, int]

Get ucchaarana sthaana and prayatna based vector of a letter

Parameters:

letter (str) – Sanskrit letter
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

vector – One-hot vector indicating utpatti sthaana, aabhyantara prayatna and baahya prayatna of a letter

Return type:

Dict[str, int]

sanskrit_text.get_ucchaarana_vectors(word: str, abbrev: bool = False) → List[Tuple[str, Dict[str, int]]]

Get ucchaarana sthaana and prayatna based vector of a word or text

Parameters:

word (str) – Sanskrit word (or text)
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

vectors – List of (letter, vector)

Return type:

List[Tuple[str, Dict[str, int]]]

sanskrit_text.get_signature_letter(letter: str, abbrev: bool = False) → Dict[str, str]

Get ucchaarana sthaana and prayatna based signature of a letter

Parameters:

letter (str) – Sanskrit letter
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

signature – utpatti sthaana, aabhyantara prayatna and baahya prayatna of a letter

Return type:

Dict[str, str]

sanskrit_text.get_signature_word(word: str, abbrev: bool = False) → List[Tuple[str, Dict[str, str]]]

Get ucchaarana sthaana and prayatna based signature of a word

Parameters:

word (str) – Sanskrit word (or text). Caution: If multiple words are provided, the spaces are not included in the output list.
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

List of (letter, signature)

Return type:

List[Tuple[str, Dict[str, str]]]

sanskrit_text.get_signature(text: str, abbrev: bool = False) → List[List[List[Tuple[str, Dict[str, str]]]]]

Get ucchaarana sthaana and prayatna based signature of a Sanskrit text

Parameters:

text (str) – Sanskrit text (can contain newlines, spaces)
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

List of (letter, signature) for words in a nested list format. Nesting Levels: Text -> Lines -> Words

Return type:

List[List[List[Tuple[str, Dict[str, str]]]]]

sanskrit_text.get_ucchaarana_letter(letter: str, dimension: int = 0, abbrev: bool = False) → str

Get ucchaarana sthaana or prayatna of a letter

Parameters:

letter (str) – Sanskrit letter
dimension (int) – Aspect to return: 0 for sthaana, 1 for aabhyantara prayatna, 2 for baahya prayatna. The default is 0.
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

ucchaarana sthaana or prayatna of a letter

Return type:

str

sanskrit_text.get_ucchaarana_word(word: str, dimension: int = 0, abbrev: bool = False) → List[Tuple[str, str]]

Get ucchaarana of a word

Parameters:

word (str) – Sanskrit word (or text). Caution: If multiple words are provided, the spaces are not included in the output list.
dimension (int) – Aspect to return: 0 for sthaana, 1 for aabhyantara prayatna, 2 for baahya prayatna. The default is 0.
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

List of (letter, ucchaarana)

Return type:

List[Tuple[str, str]]

sanskrit_text.get_ucchaarana(text: str, dimension: int = 0, abbrev: bool = False) → List[List[List[Tuple[str, str]]]]

Get ucchaarana list of a Sanskrit text

Parameters:

text (str) – Sanskrit text (can contain newlines, spaces)
dimension (int) – Aspect to return: 0 for sthaana, 1 for aabhyantara prayatna, 2 for baahya prayatna. The default is 0.
abbrev (bool) – If True, the output will contain English abbreviations; otherwise, the output will contain Sanskrit names. The default is False.

Returns:

List of (letter, ucchaarana) for words in a nested list format. Nesting Levels: Text -> Lines -> Words

Return type:

List[List[List[Tuple[str, str]]]]

sanskrit_text.get_sthaana_letter(letter: str, abbrev: bool = False): Wrapper for get_ucchaarana_letter for sthaana

sanskrit_text.get_sthaana_word(word: str, abbrev: bool = False): Wrapper for get_ucchaarana_word for sthaana

sanskrit_text.get_sthaana(text: str, abbrev: bool = False): Wrapper for get_ucchaarana for sthaana

sanskrit_text.get_aabhyantara_letter(letter: str, abbrev: bool = False): Wrapper for get_ucchaarana_letter for aabhyantara

sanskrit_text.get_aabhyantara_word(word: str, abbrev: bool = False): Wrapper for get_ucchaarana_word for aabhyantara

sanskrit_text.get_aabhyantara(text: str, abbrev: bool = False): Wrapper for get_ucchaarana for aabhyantara

sanskrit_text.get_baahya_letter(letter: str, abbrev: bool = False): Wrapper for get_ucchaarana_letter for baahya

sanskrit_text.get_baahya_word(word: str, abbrev: bool = False): Wrapper for get_ucchaarana_word for baahya

sanskrit_text.get_baahya(text: str, abbrev: bool = False): Wrapper for get_ucchaarana for baahya
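
The wrapper pattern can be pictured as the general function with its dimension argument fixed (0 for sthaana, 1 for aabhyantara, 2 for baahya). The sketch below uses functools.partial over a stand-in function. The stand-in, its one-letter toy table and the labels in it are illustrative assumptions and may not match the package's data or its actual wrapper implementation.

```python
from functools import partial

# Toy table for a single letter: (sthaana, aabhyantara, baahya).
# Hypothetical layout and labels, for illustration only.
TOY_UCCHAARANA = {"क": ("कण्ठ", "स्पृष्ट", "अघोष")}


def ucchaarana_letter_sketch(letter: str, dimension: int = 0, abbrev: bool = False) -> str:
    """Stand-in for the general lookup; abbrev is accepted but ignored here."""
    return TOY_UCCHAARANA[letter][dimension]


# Each wrapper is the general function with dimension fixed
sthaana_letter_sketch = partial(ucchaarana_letter_sketch, dimension=0)
aabhyantara_letter_sketch = partial(ucchaarana_letter_sketch, dimension=1)
baahya_letter_sketch = partial(ucchaarana_letter_sketch, dimension=2)

print(sthaana_letter_sketch("क"))      # कण्ठ
print(aabhyantara_letter_sketch("क"))  # स्पृष्ट
print(baahya_letter_sketch("क"))       # अघोष
```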