Conversation

@merkys
Member

@merkys merkys commented Sep 18, 2025

This PR introduces two properties _cheminfo_components_stdinchi and _cheminfo_components_stdinchi_counts to communicate the standard InChIs for all connected components in a structure.

@vaitkus vaitkus linked an issue Sep 18, 2025 that may be closed by this pull request
- "array"
- "null"
x-optimade-dimensions:
  names: ["dim__cheminfo_components_stdinchi"]
Member Author

This can probably be shortened, as the OPTIMADE specification does not seem to impose any restrictions on the names of dimensions.

The InChI identifier is defined by the InChI Trust (https://www.inchi-trust.org).
Every connected component in the structure MUST be represented by a separate InChI string in the list.
Values MUST start with `InChI=` and MUST be unique.
examples:
Contributor

Could we somehow add explanations for the examples?

Member Author

There is no structured way to do so, but other properties provide the explanations in the description text. I will follow that here.

Member

I think some brief text on examples that describe our intention would be nice too, e.g., co-crystals
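
To make the intent concrete, here is a minimal Python sketch of how a client might check the constraints quoted in the description (every value starts with `InChI=`, values are unique) for a co-crystal-like entry. The entry is made up for illustration and does not describe a real structure, the index-by-index pairing with `_cheminfo_components_stdinchi_counts` is an assumption based on the PR summary, and `check_components` is a hypothetical helper rather than part of any OPTIMADE library.

```python
def check_components(inchis, counts):
    """Return a list of problems found; an empty list means the values look consistent."""
    problems = []
    for s in inchis:
        if not s.startswith("InChI="):
            problems.append(f"not an InChI string: {s!r}")
    if len(set(inchis)) != len(inchis):
        problems.append("InChI values are not unique")
    # Assumption: the counts list is paired index-by-index with the InChI list.
    if len(counts) != len(inchis):
        problems.append("counts and InChI lists differ in length")
    return problems


# Made-up entry: urea and water in a 1:2 ratio (illustrative only).
entry = {
    "_cheminfo_components_stdinchi": [
        "InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)",
        "InChI=1S/H2O/h1H2",
    ],
    "_cheminfo_components_stdinchi_counts": [1, 2],
}

problems = check_components(
    entry["_cheminfo_components_stdinchi"],
    entry["_cheminfo_components_stdinchi_counts"],
)
print(problems)  # an empty list for this entry
```

The check is purely syntactic: it does not verify that each string is a valid standard InChI, which would need an InChI-aware toolkit.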

@a-e-day
Collaborator

a-e-day commented Sep 18, 2025

One other thought - should there be equivalent _cheminfo_components_stdinchikey and _cheminfo_components_stdinchikey_counts too for consistency?

@merkys
Member Author

merkys commented Sep 18, 2025

@a-e-day

One other thought - should there be equivalent _cheminfo_components_stdinchikey and _cheminfo_components_stdinchikey_counts too for consistency?

Excellent point - let's first get the current PR ready and merged, and then I can base the InChIKey one on this!

@merkys
Member Author

merkys commented Sep 18, 2025

@vaitkus has pointed out that ..._counts is not very clear. Possible alternatives are ratios and stoichiometry. Both seem OK to me; which of them do you think would be clearer?

@ijbruno
Collaborator

ijbruno commented Sep 18, 2025

@vaitkus has pointed out that ..._counts is not very clear. Possible alternatives are ratios and stoichiometry. Both seem OK to me; which of them do you think would be clearer?

Not sure which of those is best but is multiplier another alternative?

@a-e-day
Collaborator

a-e-day commented Sep 18, 2025

I agree that _counts is maybe not that clear. I would have preferred ratios (shorter, easier to spell, more obvious to non-chemists) but just have a slight doubt that _ratios may be confused with elements_ratios which are floats and sum to 1 unlike these ratios which are integers and don't sum to 1. So I might be inclined to go for stoichiometry - cocrystal molar stoichiometry seems to be a well accepted thing. Maybe it would be more obvious if the field ended "component_stoichiometry" (or "component_ratios") so if it were called _cheminfo_stdinchikey_components_stoichiometry rather than _cheminfo_components_stdinchikey_stoichiometry (and correspondingly _cheminfo_stdinchikey_components rather than _cheminfo_components_stdinchikey)? I'm not sure about multipliers - it seems less well used in this context...

@ijbruno
Collaborator

ijbruno commented Sep 19, 2025

Looking at Wikipedia and IUPAC pages on stoichiometry, ratios are referred to a lot. A ratio is a part of the stoichiometry, perhaps even integral to defining it. I thus incline towards ratios. Perhaps:

component_inchi_ratios
component_inchikey_ratios

(although I'm not sure if it should be ratio singular)

On whether component should come before or after: my concern with e.g. inchi_components is that it could imply "components of an InChI", hence my preference for component_inchis.

@merkys
Member Author

merkys commented Sep 19, 2025

_ratios do indeed seem to convey the meaning that they have to sum up to 1. At least this is what is done with elements_ratios property in the main OPTIMADE specification.

Regarding inchi_components vs. components_inchi: we decided to go with the latter after @ijbruno's argument that there might be confusion as to when the split is done: before structure -> InChI or InChI -> structure (although I am not sure this is easily possible...)

Regarding singular vs. plural, I think in OPTIMADE list-valued properties are usually named plural. Not sure this is the rule, but I am just following the precedent.

@a-e-day
Collaborator

a-e-day commented Sep 24, 2025

Happy to go with ratios if Ian and everyone else think this is best, and yes, I take your point about putting components before inchi.

@vaitkus
Contributor

vaitkus commented Sep 29, 2025

I disliked count since to me it implies whole numbers which are not sufficient for the cases we discussed (e.g. partial solvent, low occupancy moieties, etc.).

I like ratio, but that to me does not imply that the numbers have to sum to one. Are there any rules on how the ratios are expressed that we want to impose? That is, a ratio of 1:1:1 may be expressed in multiple different ways (0.33:0.33:0.33, 2:2:2, etc.). I would lean towards allowing databases to follow their own conventions; however, this would reduce the usefulness of queries on the ratio field (if we want to allow those at all).
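
As a rough illustration of the ambiguity, the following Python sketch reduces differently expressed ratios to the smallest equivalent list of integers, which is one possible canonical form a client could compare on. The function name `canonical_ratios` and the `max_denominator` tolerance are assumptions made for this example; nothing here is prescribed by the PR or by OPTIMADE.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def canonical_ratios(values, max_denominator=100):
    """Reduce a list of ratios to the smallest equivalent list of integers."""
    # Approximate each value by a fraction with a bounded denominator.
    fracs = [Fraction(v).limit_denominator(max_denominator) for v in values]
    # Scale by the least common multiple of the denominators to get integers.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs), 1)
    ints = [int(f * lcm) for f in fracs]
    # Divide out the common factor so that 2:2:2 becomes 1:1:1.
    common = reduce(gcd, ints, 0)
    if common == 0:
        raise ValueError("ratios must contain at least one non-zero value")
    return [i // common for i in ints]


print(canonical_ratios([2, 2, 2]))           # [1, 1, 1]
print(canonical_ratios([0.33, 0.33, 0.33]))  # [1, 1, 1]
print(canonical_ratios([1, 0.25]))           # [4, 1]
```

Rounded decimals only normalise cleanly when all values share the same rounding: `[0.33, 0.67]` comes out as `[33, 67]` rather than `[1, 2]`, which is exactly the kind of convention mismatch that would make queries on this field less useful.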

Member

@ml-evs ml-evs left a comment

Thanks for the discussion here, happy to go with the consensus on either ratio/stoichiometry. I don't think it's particularly confusing wrt. elements_ratios; we should just choose the best name for our namespace. As @merkys said, the main thing that should decide it is whether we allow querying on this field; as most implementations do not implement the correlated list queries required for elements_ratios anyway, I think we can safely leave it as a "MAY" without designing around it.

Just one further comment below

description: |-
  The standard InChI identifiers for the connected components of the structure.
  The InChI identifier is defined by the InChI Trust (https://www.inchi-trust.org).
  Every connected component in the structure MUST be represented by a separate InChI string in the list.
Member

Do we need to be clear that the components MUST NOT (or MAY) overlap, and whether parts of the structure may not be described at all in these components?

e.g., imagining a single caffeine molecule in a (perhaps unimportant or complicated) solvent, am I allowed to just list caffeine as the component inchi?
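
One way to frame the question: if the components are derived as connected components of the bonding graph, they cannot overlap by construction, so what remains open is whether a provider may leave some of them out. The Python sketch below illustrates this on a toy graph. The atom-index and bond-list representation, and the helper names `connected_components` and `uncovered_atoms`, are assumptions for this example only; how bonds are perceived in a real structure is outside its scope.

```python
def connected_components(n_atoms, bonds):
    """Group atom indices 0..n_atoms-1 into connected components (union-find)."""
    parent = list(range(n_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n_atoms):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def uncovered_atoms(n_atoms, listed_components):
    """Atoms that belong to none of the listed components."""
    covered = {i for comp in listed_components for i in comp}
    return sorted(set(range(n_atoms)) - covered)


# Toy graph: atoms 0-2 form one molecule, atoms 3-4 another (e.g. a solvent).
components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
print(components)                           # [[0, 1, 2], [3, 4]]
print(uncovered_atoms(5, components[:1]))   # [3, 4]
```

Listing only the first component, as in the caffeine-in-solvent case, leaves atoms 3 and 4 undescribed, which the current wording ("Every connected component ... MUST be represented") would appear to rule out.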

@ml-evs ml-evs force-pushed the main branch 2 times, most recently from 01c9b8a to 10c3ab4 Compare October 17, 2025 19:52
Development

Successfully merging this pull request may close these issues.

Add the _cheminfo_components_stdinchi property

6 participants