`chemistry_tools.names`

Functions for working with IUPAC names for chemicals.

Functions:

`cas_from_iupac_name`(iupac_name)	Returns the corresponding CAS registry number for the given IUPAC name.
`get_IUPAC_parts`(string)	Splits an IUPAC name for a compound into its constituent parts.
`get_IUPAC_sort_order`(iupac_names)	Returns the order the given IUPAC names should be sorted in.
`get_sorted_parts`(iupac_names)	Returns the constituent parts of the IUPAC names sorted into order.
`iupac_name_from_cas`(cas_number)	Returns the corresponding IUPAC name for the given CAS registry number.
`sort_IUPAC_names`(iupac_names)	Sort a list of IUPAC names into order.
`sort_array_by_name`(array[, name_col, reverse])	Sort a list of lists by the IUPAC name in each row.
`sort_dataframe_by_name`(df, name_col[, reverse])	Sorts a `pandas.DataFrame` by the IUPAC name in each row.

Data:

`multiplier_regex`	Regular expression to match multiplying prefixes such as mono- and di-.
`re_strings`	List of regular expressions to decompose an IUPAC name.

cas_from_iupac_name(iupac_name)[source]

Returns the corresponding CAS registry number for the given IUPAC name.

Parameters: iupac_name (str) – The IUPAC name to search.
Return type: str
Returns: The CAS registry number.

get_IUPAC_parts(string)[source]

Splits an IUPAC name for a compound into its constituent parts.

Parameters: string (str) – The IUPAC name to split.
Return type: List[str]
Returns: A list of constituent parts.

get_IUPAC_sort_order(iupac_names)[source]

Returns the order the given IUPAC names should be sorted in.

Useful when sorting arrays that contain data in addition to the name, for example:

>>> sort_order = get_IUPAC_sort_order([row[0] for row in data])
>>> sorted_data = sorted(data, key=lambda row: sort_order[row[0]])

Here, row[0] is the name of the compound in each row.

Parameters: iupac_names (Sequence[str]) – The list of IUPAC names to sort.
Return type: Dict[str, int]
Returns: Dictionary mapping the IUPAC names to the order in which they should be sorted.
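
The pattern above can be sketched end to end as follows. The ordering that get_IUPAC_sort_order actually produces is not reproduced here: `toy_sort_order` is a hypothetical stand-in that ranks names alphabetically, ignoring case, and only the shape of its return value (a dictionary mapping each name to its rank) follows the documentation. The rows in `data` are made-up illustrative values.

```python
from typing import Dict, List, Sequence


def toy_sort_order(names: Sequence[str]) -> Dict[str, int]:
    """Hypothetical stand-in: map each name to its rank in a case-insensitive sort."""
    return {name: rank for rank, name in enumerate(sorted(names, key=str.lower))}


# Each row holds the compound name first, followed by other (made-up) data.
data: List[list] = [
    ["toluene", 1.0],
    ["aniline", 2.0],
    ["benzene", 3.0],
]

# Build the lookup once, then use it as the sort key for the full rows.
sort_order = toy_sort_order([row[0] for row in data])
sorted_data = sorted(data, key=lambda row: sort_order[row[0]])
```

Building the dictionary once and looking names up in the key function keeps the extra columns attached to their names while the rows are reordered.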

get_sorted_parts(iupac_names)[source]

Returns the constituent parts of the IUPAC names sorted into order.

The parts returned are in reverse order (i.e. 'diphenylamine' becomes ['amine', 'phenyl', 'di']).

Parameters: iupac_names (Sequence[str])
Return type: List[List[str]]
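
To make the reversed-parts convention concrete, here is a minimal sketch that splits a name into known fragments and then reverses them. It is not the library's implementation: `TOY_PATTERNS`, `toy_parts` and `toy_reversed_parts` are hypothetical names, and the pattern list is a small hand-picked subset chosen only so that the documented 'diphenylamine' example works.

```python
import re
from typing import List

# Hypothetical, heavily abbreviated pattern list for illustration only.
TOY_PATTERNS = [
    re.compile(p) for p in ("di", "tri", "nitro", "phenyl", "amine", "toluene")
]


def toy_parts(name: str) -> List[str]:
    """Split name left to right into the first matching fragment at each position."""
    parts: List[str] = []
    pos = 0
    while pos < len(name):
        for pattern in TOY_PATTERNS:
            match = pattern.match(name, pos)
            if match:
                parts.append(match.group(0))
                pos = match.end()
                break
        else:
            # No pattern matched here: keep the character so nothing is lost.
            parts.append(name[pos])
            pos += 1
    return parts


def toy_reversed_parts(name: str) -> List[str]:
    """Return the fragments in reverse order, so the parent compound comes first."""
    return toy_parts(name)[::-1]


print(toy_reversed_parts("diphenylamine"))  # ['amine', 'phenyl', 'di']
```

Reversing the parts places the parent compound ahead of its substituents and multiplying prefix, so a plain list comparison looks at the parent first.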

iupac_name_from_cas(cas_number)[source]

Returns the corresponding IUPAC name for the given CAS registry number.

Parameters: cas_number (str) – The CAS registry number to search for.
Return type: str
Returns: The IUPAC name.

multiplier_regex

Type: Pattern

Regular expression to match multiplying prefixes such as mono- and di-.

Pattern

(mono)*(di)*(tri)*(tetra)*(penta)*(hexa)*(hepta)*(octa)*(nona)*(deca)*(undeca)*(dodeca)*(trideca)*(tetradeca)*(pentadeca)*(hexadeca)*(heptadeca)*(octadeca)*(nonadeca)*(icosa)*(henicosa)*(docosa)*(tricosa)*(triaconta)*(hentriaconta)*(dotriaconta)*(tetraconta)*(pentaconta)*(hexaconta)*(heptaconta)*(octaconta)*(nonaconta)*(hecta)*(dicta)*(tricta)*(tetracta)*(pentacta)*(hexacta)*(heptacta)*(octacta)*(nonacta)*(kilia)*(dilia)*(trilia)*(tetralia)*(pentalia)*(hexalia)*(heptalia)*(octalia)*(nonalia)*
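
One consequence of this pattern is worth illustrating: every group is followed by `*`, so the pattern can match the empty string and a match at the start of a name always succeeds, even when no prefix is present. The sketch below uses an abbreviated copy of the pattern (the first four prefixes only); `short_multiplier` and `leading_multiplier` are hypothetical names used for illustration.

```python
import re

# Abbreviated copy of the documented pattern: first four prefixes only.
short_multiplier = re.compile("(mono)*(di)*(tri)*(tetra)*")


def leading_multiplier(name: str) -> str:
    """Return the multiplying prefix at the start of name, or '' if there is none."""
    # match() cannot return None here, because the pattern also matches "".
    return short_multiplier.match(name).group(0)


print(leading_multiplier("diphenylamine"))    # di
print(leading_multiplier("trinitrotoluene"))  # tri
print(repr(leading_multiplier("phenyl")))     # ''
```

Code that relies on such a pattern should therefore test for an empty match rather than for `None`.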

re_strings = [re.compile('((\\d+),?)+(\\d+)-'), re.compile('(mono)*(di)*(tri)*(tetra)*(penta)*(hexa)*(hepta)*(octa)*(nona)*(deca)*(undeca)*(dodeca)*(trideca)*(tetradeca)*(pentadeca)*(hexadeca)*(heptadeca)*(octadeca)*(nonadeca)*(icosa)*(henicosa)*(docosa)*(tri), re.compile('nitro'), re.compile('phenyl'), re.compile('aniline'), re.compile('anisole'), re.compile('benzene'), re.compile('centralite'), re.compile('formamide'), re.compile('glycerine'), re.compile('nitrate'), re.compile('glycol'), re.compile('phthalate'), re.compile('picrate'), re.compile('toluene'), re.compile('methyl'), re.compile('(?<!m)ethyl'), re.compile('propyl'), re.compile('butyl'), re.compile(' '), re.compile('\\('), re.compile('\\)'), re.compile('hydroxyl'), re.compile('amin[oe]'), re.compile('amide')]

Type: List[Pattern]

List of regular expressions to decompose an IUPAC name.
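
Two of the listed patterns are easier to follow with a short demonstration. The sketch below compiles them on their own, outside the list; the variable names are hypothetical. The negative lookbehind in `(?<!m)ethyl` stops "ethyl" from matching inside "methyl", and the first pattern in the list picks up comma-separated locants followed by a hyphen.

```python
import re

# Copied from the list above; raw strings avoid doubling the backslashes.
ethyl = re.compile(r"(?<!m)ethyl")
locants = re.compile(r"((\d+),?)+(\d+)-")

# "ethyl" preceded by "m" is rejected by the lookbehind.
in_methyl = ethyl.search("dimethyl phthalate")

# A free-standing "ethyl" is found as usual.
in_ethyl = ethyl.search("diethyl phthalate")

# Locants at the start of a name, including the trailing hyphen.
found_locants = locants.match("2,4,6-trinitrotoluene")

print(in_methyl)               # None
print(in_ethyl.group(0))       # ethyl
print(found_locants.group(0))  # 2,4,6-
```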

sort_IUPAC_names(iupac_names)[source]

Sort a list of IUPAC names into order.

Parameters: iupac_names (Sequence[str]) – The list of IUPAC names to sort
Return type: List[str]
Returns: The list of sorted IUPAC names.

sort_array_by_name(array, name_col=0, reverse=False)[source]

Sort a list of lists by the IUPAC name in each row.

Parameters:

array (List[List[Any]])
name_col (int) – The index of the column containing the IUPAC names. Default 0.
reverse (bool) – Whether the names should be sorted in reverse order. Default is False, which sorts from A-Z.

Return type: List[List[Any]]
Returns: The sorted array.
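
The roles of name_col and reverse can be sketched with a stand-in function. This is only an illustration of the call shape: `toy_sort_array_by_name` is hypothetical and orders rows by a plain case-insensitive comparison of the names, which is an assumption and not the IUPAC-aware ordering the library applies. The rows are made-up values.

```python
from typing import Any, List


def toy_sort_array_by_name(
    array: List[List[Any]],
    name_col: int = 0,
    reverse: bool = False,
) -> List[List[Any]]:
    """Hypothetical stand-in: sort rows by the name found in column name_col."""
    # Assumption: case-insensitive alphabetical order replaces the real ordering.
    return sorted(array, key=lambda row: row[name_col].lower(), reverse=reverse)


# Made-up rows: an identifier first, the compound name in column 1.
rows = [
    [101, "toluene"],
    [102, "aniline"],
    [103, "benzene"],
]

ascending = toy_sort_array_by_name(rows, name_col=1)
descending = toy_sort_array_by_name(rows, name_col=1, reverse=True)
```

The input list is left unchanged because `sorted` returns a new list.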

sort_dataframe_by_name(df, name_col, reverse=False)[source]

Sorts a pandas.DataFrame by the IUPAC name in each row.

Parameters:

df (DataFrame)
name_col (str) – The name of the column containing the IUPAC names.
reverse (bool) – Whether the names should be sorted in reverse order. Default is False, which sorts from A-Z.

Return type: DataFrame
Returns: The sorted DataFrame.
