chemistry_tools.names

Functions for working with IUPAC names for chemicals.

Functions:

cas_from_iupac_name(iupac_name)

Returns the corresponding CAS registry number for the given IUPAC name.

get_IUPAC_parts(string)

Splits an IUPAC name for a compound into its constituent parts.

get_IUPAC_sort_order(iupac_names)

Returns the order the given IUPAC names should be sorted in.

get_sorted_parts(iupac_names)

Returns the constituent parts of the IUPAC names sorted into order.

iupac_name_from_cas(cas_number)

Returns the corresponding IUPAC name for the given CAS registry number.

sort_IUPAC_names(iupac_names)

Sort a list of IUPAC names into order.

sort_array_by_name(array[, name_col, reverse])

Sort a list of lists by the IUPAC name in each row.

sort_dataframe_by_name(df, name_col[, reverse])

Sorts a pandas.DataFrame by the IUPAC name in each row.

Data:

multiplier_regex

Regular expression to match multiplying prefixes such as mono-, di- and tri-.

re_strings

List of regular expressions to decompose an IUPAC name.

cas_from_iupac_name(iupac_name)[source]

Returns the corresponding CAS registry number for the given IUPAC name.

Parameters

iupac_name (str) – The IUPAC name to search for.

Return type

str

Returns

The CAS registry number.

get_IUPAC_parts(string)[source]

Splits an IUPAC name for a compound into its constituent parts.

Parameters

string (str) – The IUPAC name to split.

Return type

List[str]

Returns

A list of constituent parts.
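
As a rough illustration of what splitting a name into constituent parts can look like, the following minimal sketch cuts a name at the boundaries of every regular-expression match. The function `split_name` and the short `PATTERNS` list are hypothetical stand-ins written for this example; they are not the module's implementation, and the real re_strings list is considerably longer.

```python
import re
from typing import List, Sequence

# Hypothetical, much shorter stand-in for the module's re_strings list.
PATTERNS = [
    re.compile(p)
    for p in ("(di|tri|tetra)", "nitro", "phenyl", "amine", "toluene", "methyl")
]


def split_name(name: str, patterns: Sequence["re.Pattern"] = PATTERNS) -> List[str]:
    """Cut ``name`` at the start and end of every non-empty pattern match."""
    cuts = {0, len(name)}
    for pattern in patterns:
        for match in pattern.finditer(name):
            if match.start() != match.end():  # ignore empty matches
                cuts.update(match.span())
    points = sorted(cuts)
    # Slice the name between each pair of neighbouring cut points.
    return [name[a:b] for a, b in zip(points, points[1:])]


print(split_name("diphenylamine"))    # ['di', 'phenyl', 'amine']
print(split_name("trinitrotoluene"))  # ['tri', 'nitro', 'toluene']
```

Text that no pattern recognises is kept as its own part in this sketch, so nothing is dropped from the name.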

get_IUPAC_sort_order(iupac_names)[source]

Returns the order the given IUPAC names should be sorted in.

Useful when sorting arrays that contain other data in addition to the name, e.g.

>>> sort_order = get_IUPAC_sort_order([row[0] for row in data])
>>> sorted_data = sorted(data, key=lambda row: sort_order[row[0]])

where row[0] is the name of the compound.

Parameters

iupac_names (Sequence[str]) – The list of IUPAC names to sort.

Return type

Dict[str, int]

Returns

Dictionary mapping the IUPAC names to the order in which they should be sorted.
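
The shape of the returned dictionary, and how it acts as a sort key, can be seen in a self-contained sketch. The helper `toy_sort_order` is hypothetical and ranks names alphabetically, which is only an assumption made so the example runs without the library; the ordering produced by get_IUPAC_sort_order itself may differ.

```python
from typing import Dict, Sequence


def toy_sort_order(names: Sequence[str]) -> Dict[str, int]:
    """Map each name to its rank.

    Stand-in only: uses plain case-insensitive alphabetical order,
    NOT the library's IUPAC-aware ordering.
    """
    ranked = sorted(set(names), key=str.lower)
    return {name: index for index, name in enumerate(ranked)}


# Each row holds a name plus extra data (here, molar mass in g/mol).
data = [["toluene", 92.14], ["aniline", 93.13], ["benzene", 78.11]]

sort_order = toy_sort_order([row[0] for row in data])
sorted_data = sorted(data, key=lambda row: sort_order[row[0]])

print(sort_order)   # {'aniline': 0, 'benzene': 1, 'toluene': 2}
print(sorted_data)  # rows ordered aniline, benzene, toluene
```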

get_sorted_parts(iupac_names)[source]

Returns the constituent parts of the IUPAC names sorted into order.

The parts returned are in reverse order (i.e. 'diphenylamine' becomes ['amine', 'phenyl', 'di']).

Parameters

iupac_names (Sequence[str]) – The IUPAC names to split into parts and sort.

Return type

List[List[str]]
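
One plausible reason for returning the parts in reverse order is that comparing reversed lists looks at the parent part (such as 'amine') first and at multiplier prefixes last. The sketch below illustrates that idea with hand-written part lists; both the part lists and the use of Python's default list comparison are assumptions made for the example, not the library's confirmed behaviour.

```python
from typing import Dict, List

# Hand-written part lists; in practice these would come from splitting the names.
parts_by_name: Dict[str, List[str]] = {
    "dinitrobenzene": ["di", "nitro", "benzene"],
    "diphenylamine": ["di", "phenyl", "amine"],
    "nitrobenzene": ["nitro", "benzene"],
}

# Reverse each list so the parent part comes first, then sort the lists.
reversed_parts = sorted(parts[::-1] for parts in parts_by_name.values())

for parts in reversed_parts:
    print(parts)
# ['amine', 'phenyl', 'di']
# ['benzene', 'nitro']
# ['benzene', 'nitro', 'di']
```

With this comparison the two benzene derivatives end up next to each other, and the unprefixed name sorts ahead of the one carrying a multiplier.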

iupac_name_from_cas(cas_number)[source]

Returns the corresponding IUPAC name for the given CAS registry number.

Parameters

cas_number (str) – The CAS registry number to search for.

Return type

str

Returns

The IUPAC name.
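
Because cas_number is passed as a plain string, it can be worth checking its format before a lookup. The sketch below validates the standard CAS check digit (the last digit equals the weighted sum of the preceding digits modulo 10). The helper `has_valid_cas_checksum` is hypothetical and is not part of this module.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_FORMAT = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def has_valid_cas_checksum(cas_number: str) -> bool:
    """Return True if ``cas_number`` is well formed and its check digit matches."""
    match = CAS_FORMAT.match(cas_number.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... counting from the right-hand end.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


print(has_valid_cas_checksum("7732-18-5"))  # True  (water)
print(has_valid_cas_checksum("71-43-2"))    # True  (benzene)
print(has_valid_cas_checksum("7732-18-4"))  # False (wrong check digit)
```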

multiplier_regex

Type:    Pattern

Regular expression to match multiplying prefixes such as mono-, di- and tri-.

Pattern

(mono)*(di)*(tri)*(tetra)*(penta)*(hexa)*(hepta)*(octa)*(nona)*(deca)*(undeca)*(dodeca)*(trideca)*(tetradeca)*(pentadeca)*(hexadeca)*(heptadeca)*(octadeca)*(nonadeca)*(icosa)*(henicosa)*(docosa)*(tricosa)*(triaconta)*(hentriaconta)*(dotriaconta)*(tetraconta)*(pentaconta)*(hexaconta)*(heptaconta)*(octaconta)*(nonaconta)*(hecta)*(dicta)*(tricta)*(tetracta)*(pentacta)*(hexacta)*(heptacta)*(octacta)*(nonacta)*(kilia)*(dilia)*(trilia)*(tetralia)*(pentalia)*(hexalia)*(heptalia)*(octalia)*(nonalia)*
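
Every group in this pattern is followed by `*`, so the pattern also matches the empty string. A caller therefore needs to inspect the matched text rather than test whether a match object was returned. The sketch below shows this with only the first five groups of the pattern; the shortened pattern and the helper `leading_multiplier` are assumptions made for the example.

```python
import re

# First five groups of the documented pattern, shortened for illustration.
short_multiplier_regex = re.compile("(mono)*(di)*(tri)*(tetra)*(penta)*")


def leading_multiplier(name: str) -> str:
    """Return the multiplying prefix at the start of ``name``, or '' if none."""
    match = short_multiplier_regex.match(name)
    # ``match`` is never None here, because the pattern can match nothing at all.
    return match.group(0)


print(repr(leading_multiplier("dinitrobenzene")))  # 'di'
print(repr(leading_multiplier("tetranitro")))      # 'tetra'
print(repr(leading_multiplier("nitrobenzene")))    # ''
```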

re_strings = [re.compile('((\\d+),?)+(\\d+)-'), re.compile('(mono)*(di)*(tri)*(tetra)*(penta)*(hexa)*(hepta)*(octa)*(nona)*(deca)*(undeca)*(dodeca)*(trideca)*(tetradeca)*(pentadeca)*(hexadeca)*(heptadeca)*(octadeca)*(nonadeca)*(icosa)*(henicosa)*(docosa)*(tri), re.compile('nitro'), re.compile('phenyl'), re.compile('aniline'), re.compile('anisole'), re.compile('benzene'), re.compile('centralite'), re.compile('formamide'), re.compile('glycerine'), re.compile('nitrate'), re.compile('glycol'), re.compile('phthalate'), re.compile('picrate'), re.compile('toluene'), re.compile('methyl'), re.compile('(?<!m)ethyl'), re.compile('propyl'), re.compile('butyl'), re.compile(' '), re.compile('\\('), re.compile('\\)'), re.compile('hydroxyl'), re.compile('amin[oe]'), re.compile('amide')]

Type:    List[Pattern]

List of regular expressions to decompose an IUPAC name.

sort_IUPAC_names(iupac_names)[source]

Sort a list of IUPAC names into order.

Parameters

iupac_names (Sequence[str]) – The list of IUPAC names to sort.

Return type

List[str]

Returns

The list of sorted IUPAC names.

sort_array_by_name(array, name_col=0, reverse=False)[source]

Sort a list of lists by the IUPAC name in each row.

Parameters
  • array (List[List[Any]]) – The rows to sort.

  • name_col (int) – The index of the column containing the IUPAC names. Default 0.

  • reverse (bool) – Whether the names should be sorted in reverse order. Default is False, which sorts from A-Z.

Return type

List[List[Any]]

Returns

The sorted array.
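
The roles of name_col and reverse can be sketched with the standard library alone. The function `sort_rows_by_name` below is a hypothetical stand-in that orders names alphabetically; it is not this module's implementation, whose ordering may differ.

```python
from typing import Any, Callable, List


def sort_rows_by_name(
    rows: List[List[Any]],
    name_col: int = 0,
    reverse: bool = False,
    key: Callable[[str], Any] = str.lower,  # stand-in for an IUPAC-aware key
) -> List[List[Any]]:
    """Return a new list of rows ordered by the name held in column ``name_col``."""
    return sorted(rows, key=lambda row: key(row[name_col]), reverse=reverse)


# The name sits in the second column here, so name_col=1.
rows = [[78.11, "benzene"], [92.14, "toluene"], [93.13, "aniline"]]

print(sort_rows_by_name(rows, name_col=1))
# [[93.13, 'aniline'], [78.11, 'benzene'], [92.14, 'toluene']]
print(sort_rows_by_name(rows, name_col=1, reverse=True))
# [[92.14, 'toluene'], [78.11, 'benzene'], [93.13, 'aniline']]
```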

sort_dataframe_by_name(df, name_col, reverse=False)[source]

Sorts a pandas.DataFrame by the IUPAC name in each row.

Parameters
  • df (DataFrame) – The DataFrame to sort.

  • name_col (str) – The name of the column containing the IUPAC names.

  • reverse (bool) – Whether the names should be sorted in reverse order. Default is False, which sorts from A-Z.

Return type

DataFrame

Returns

The sorted DataFrame.