Commit e361239

Change logic to detect docstring style mismatch (#271)
1 parent bf4c402 commit e361239

12 files changed (+340 -91 lines changed)

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
 # Change Log

+## [0.8.1] - 2025-11-03
+
+- Changed
+  - The logic to detect docstring style mismatches, fixing a false positive
+    case where non-Sphinx style docstrings are detected as Sphinx style
+    (because there are some rST keywords in them)
+- Full diff
+  - https://github.com/jsh9/pydoclint/compare/0.8.0...0.8.1
+
 ## [0.8.0] - 2025-11-03

 - Added

docs/style_mismatch.md

Lines changed: 36 additions & 33 deletions
@@ -7,11 +7,12 @@ ______________________________________________________________________
 **Table of Contents**

 - [1. How does _pydoclint_ detect the style of a docstring?](#1-how-does-pydoclint-detect-the-style-of-a-docstring)
-  - [1.1. Numpy-style pattern detection (enhanced detection)](#11-numpy-style-pattern-detection-enhanced-detection)
-  - [1.2. Fallback to size-based detection](#12-fallback-to-size-based-detection)
+  - [1.1. Keyword heuristics for each style](#11-keyword-heuristics-for-each-style)
+  - [1.2. Handling ambiguous or missing matches](#12-handling-ambiguous-or-missing-matches)
+  - [1.3. What happens after a mismatch is detected?](#13-what-happens-after-a-mismatch-is-detected)
 - [2. How accurate is this detection heuristic?](#2-how-accurate-is-this-detection-heuristic)
 - [3. Can I turn this off?](#3-can-i-turn-this-off)
-- [4. Is it much slower to parse a docstring in all 3 styles?](#4-is-it-much-slower-to-parse-a-docstring-in-all-3-styles)
+- [4. Is it much slower to parse a docstring with the heuristics?](#4-is-it-much-slower-to-parse-a-docstring-with-the-heuristics)
 - [5. What violation code is associated with style mismatch?](#5-what-violation-code-is-associated-with-style-mismatch)
 - [6. How to fix this violation code?](#6-how-to-fix-this-violation-code)

@@ -27,40 +28,40 @@ config option.

 _pydoclint_ detects the style of a docstring with this procedure:

-### 1.1. Numpy-style pattern detection (enhanced detection)
+### 1.1. Keyword heuristics for each style

-As of recent updates, _pydoclint_ first checks if the docstring contains
-numpy-style section headers with dashes. If it detects patterns like:
+We now rely on lightweight heuristics that look for style-specific keywords at
+the indentation level where the docstring begins:

-```
-Returns
--------
+- **NumPy**: section headers followed by dashed underlines (for example,
+  `Returns` + `-------`), using a curated list of keywords.
+- **Google**: top-level section headers such as `Args:`, `Returns:`, `Yields:`,
+  `Raises:`, `Examples:`, or `Notes:` with matching indentation.
+- **Sphinx/reST**: top-level field lists such as `:param`, `:type`, `:raises`,
+  `:return:`, `:rtype:`, `:yield:`, or `:ytype:`.

-Parameters
-----------
+Each helper only considers keywords that start at the same indentation level as
+the opening triple quotes to avoid counting inline roles or nested blocks.

-Examples
---------
-```
+### 1.2. Handling ambiguous or missing matches

-It immediately identifies the docstring as numpy-style and parses it
-accordingly, even if it may not be fully parsable as numpy style. This
-pattern-based detection looks for common section headers (Args, Arguments,
-Parameters, Returns, Yields, Raises, Examples, Notes, See Also, References)
-followed by 3 or more dashes on the next line.
+- **Exactly one match** We parse the docstring using the detected style. If it
+  differs from the configured style, DOC003 is emitted. Google parse failures
+  are also treated as style mismatches because malformed Google sections almost
+  always indicate another style.
+- **No matches** We assume the docstring uses the configured style and skip
+  style mismatch warnings entirely.
+- **Multiple matches** The docstring appears to mix styles (for example, Google
+  `Args:` plus Sphinx `:param` directives), so we emit DOC003 for every
+  configured style.

-### 1.2. Fallback to size-based detection
+### 1.3. What happens after a mismatch is detected?

-If no numpy-style patterns are detected, _pydoclint_ falls back to the original
-size-based detection:
-
-- It attempts to parse the docstring in all 3 styles: numpy, Google, and Sphinx
-- It then compares the "size" of the parsed docstring objects
-- The "size" is a human-made metric to measure how "fully parsed" a docstring
-  object is. For example, a docstring object without the return section is
-  larger in "size" than that with the return section (all others being equal)
-- The style that yields the largest "size" is considered the style of the
-  docstring
+When DOC003 is triggered we still return the docstring parsed in the configured
+style, but we suppress many follow-up checks that would otherwise generate
+cascading false positives (argument type-hint expectations, return/yield/raise
+consistency, etc.). This keeps the feedback focused on resolving the style
+mismatch first.

 ## 2. How accurate is this detection heuristic?

@@ -84,10 +85,12 @@ Actually, this style mismatch detection feature is by default _off_.
 You can turn this feature on by setting `--check-style-mismatch` (or `-csm`) to
 `True` (or `--check-style-mismatch=True`).

-## 4. Is it much slower to parse a docstring in all 3 styles?
+## 4. Is it much slower to parse a docstring with the heuristics?

-It is not. The authors of _pydoclint_ benchmarked some very large code bases,
-and here are the results (as of 2025/01/12):
+No. The new detection flow usually parses at most one style per docstring, but
+even when we fall back to the configured style the cost is still negligible.
+For reference, benchmarking large code bases (as of 2025/01/12) shows the
+overhead of style detection is only a few percent:

 |                              | numpy | scikit-learn | Bokeh | Airflow |
 | ---------------------------- | ----- | ------------ | ----- | ------- |
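
Review note on the docs change above: the base-indentation rule from section 1.1 can be illustrated with a minimal sketch. This is not the code from this commit. `base_indent`, `has_keyword_at_base_indent`, and both sample docstrings are hypothetical names made up for illustration; only the two keyword tuples are copied from the diff of `pydoclint/utils/parse_docstring.py`. The sketch assumes the docstring text starts on its own line, so that every top-level line shares the same indentation.

```python
# Keyword tuples copied from the diff; everything else here is illustrative.
SPHINX_KEYWORDS = (
    ':param ', ':type ', ':raises ', ':return:', ':rtype:', ':yield:', ':ytype:',
)
GOOGLE_KEYWORDS = ('Args:', 'Returns:', 'Yields:', 'Raises:', 'Examples:', 'Notes:')


def base_indent(docstring: str) -> int:
    """Return the smallest indentation among non-empty lines (0 if none)."""
    indents = [
        len(line) - len(line.lstrip())
        for line in docstring.splitlines()
        if line.strip()
    ]
    return min(indents, default=0)


def has_keyword_at_base_indent(
        docstring: str,
        keywords: tuple[str, ...],
) -> bool:
    """Return True if a line at the base indentation starts with a keyword."""
    indent = base_indent(docstring)
    for line in docstring.splitlines():
        stripped = line.lstrip()
        if not stripped:
            continue  # blank lines never count

        if len(line) - len(stripped) != indent:
            continue  # nested or inline content is ignored

        if stripped.startswith(keywords):  # str.startswith accepts a tuple
            return True

    return False


# A Google-style docstring that quotes a Sphinx field inside a nested block.
google_doc = """
Scale a value.

Args:
    x: The input value. Older docs described it as
        :param x: the input value
"""

# A plain Sphinx-style docstring.
sphinx_doc = """
Scale a value.

:param x: The input value.
:return: The scaled value.
"""

print(has_keyword_at_base_indent(google_doc, GOOGLE_KEYWORDS))  # True
print(has_keyword_at_base_indent(google_doc, SPHINX_KEYWORDS))  # False
print(has_keyword_at_base_indent(sphinx_doc, SPHINX_KEYWORDS))  # True
print(has_keyword_at_base_indent(sphinx_doc, GOOGLE_KEYWORDS))  # False
```

The nested `:param x:` line in `google_doc` sits deeper than the base indentation, so it is not counted as a Sphinx field. That is the false-positive case the changelog entry describes.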

muff.toml

Lines changed: 3 additions & 0 deletions
@@ -1,9 +1,12 @@
 # Docs: https://docs.astral.sh/ruff/configuration

 exclude = ["tests/test_data"]
+fix = true
 line-length = 79
 output-format = "grouped"
+show-fixes = true
 target-version = "py310"
+unsafe-fixes = true

 [format]
 docstring-code-format = true

pydoclint/utils/parse_docstring.py

Lines changed: 129 additions & 33 deletions
@@ -6,6 +6,25 @@

 from pydoclint.utils.doc import Doc

+_SPHINX_KEYWORDS = (
+    ':param ',
+    ':type ',
+    ':raises ',
+    ':return:',
+    ':rtype:',
+    ':yield:',
+    ':ytype:',
+)
+
+_GOOGLE_KEYWORDS = (
+    'Args:',
+    'Returns:',
+    'Yields:',
+    'Raises:',
+    'Examples:',
+    'Notes:',
+)
+

 def _containsNumpyStylePattern(docstring: str) -> bool:
     # Check if docstring contains numpy-style section headers with dashes.
@@ -31,6 +50,72 @@ def _containsNumpyStylePattern(docstring: str) -> bool:
     return bool(re.search(pattern, docstring, re.MULTILINE | re.IGNORECASE))


+def _containsSphinxStylePattern(docstring: str) -> bool:
+    """
+    Check if docstring contains Sphinx-style field lists at base indentation.
+
+    Only lines that have the same leading indentation as the docstring
+    definition (i.e., the opening triple quotes) count as valid Sphinx
+    directives. Lines with more or fewer leading spaces are ignored.
+    """
+    leadingIndent = _detectDocstringIndent(docstring)
+    for line in docstring.splitlines():
+        stripped = line.lstrip()
+        if stripped == '':
+            continue
+
+        currentIndent = len(line) - len(stripped)
+        if currentIndent != leadingIndent:
+            continue
+
+        for keyword in _SPHINX_KEYWORDS:
+            if stripped.startswith(keyword):
+                return True
+
+    return False
+
+
+def _containsGoogleStylePattern(docstring: str) -> bool:
+    """
+    Check if docstring contains Google-style section headers at base indent.
+    """
+    leadingIndent = _detectDocstringIndent(docstring)
+    for line in docstring.splitlines():
+        stripped = line.lstrip()
+        if stripped == '':
+            continue
+
+        currentIndent = len(line) - len(stripped)
+        if currentIndent != leadingIndent:
+            continue
+
+        for keyword in _GOOGLE_KEYWORDS:
+            if stripped.startswith(keyword):
+                return True
+
+    return False
+
+
+def _detectDocstringIndent(docstring: str) -> int:
+    """
+    Detect the leading indentation level of a docstring.
+
+    This approximates the column where the opening triple quotes are placed by
+    measuring the smallest indentation across non-empty lines.
+    """
+    indent: int | None = None
+    for line in docstring.splitlines():
+        stripped = line.lstrip()
+        if stripped == '':
+            continue
+
+        currentIndent = len(line) - len(stripped)
+        if indent is None or currentIndent < indent:
+            indent = currentIndent
+
+    return 0 if indent is None else indent
+
+
 def parseDocstring(
         docstring: str,
         userSpecifiedStyle: str,
@@ -39,40 +124,51 @@ def parseDocstring(
     Parse docstring in all 3 docstring styles and return the one that is parsed
     with the most likely style.
     """
-    # Check if docstring contains numpy-style section headers with dashes
-    if _containsNumpyStylePattern(docstring):
-        # Force numpy style parsing when numpy pattern is detected
-        docNumpy, excNumpy = parseDocstringInGivenStyle(docstring, 'numpy')
-        return docNumpy, excNumpy, userSpecifiedStyle != 'numpy'
-
-    docNumpy, excNumpy = parseDocstringInGivenStyle(docstring, 'numpy')
-    docGoogle, excGoogle = parseDocstringInGivenStyle(docstring, 'google')
-    docSphinx, excSphinx = parseDocstringInGivenStyle(docstring, 'sphinx')
-
-    docstrings: dict[str, Doc] = {
-        'numpy': docNumpy,
-        'google': docGoogle,
-        'sphinx': docSphinx,
-    }
-    docstringSizes: dict[str, int] = {
-        'numpy': docNumpy.docstringSize,
-        'google': docGoogle.docstringSize,
-        'sphinx': docSphinx.docstringSize,
-    }
-    parsingExceptions: dict[str, ParseError | None] = {
-        'numpy': excNumpy,
-        'google': excGoogle,
-        'sphinx': excSphinx,
+    isLikelyNumpy: bool = _containsNumpyStylePattern(docstring)
+    isLikelyGoogle: bool = _containsGoogleStylePattern(docstring)
+    isLikelySphinx: bool = _containsSphinxStylePattern(docstring)
+
+    if isLikelyNumpy:
+        # Numpy-style headers with dashes are strong indicators; ignore other
+        # potential matches when they appear alongside them.
+        isLikelyGoogle = False
+        isLikelySphinx = False
+
+    likelyStyles = {
+        'numpy': isLikelyNumpy,
+        'google': isLikelyGoogle,
+        'sphinx': isLikelySphinx,
     }
-    # Whichever style has the largest docstring size, we think that it is
-    # the actual style that the docstring is written in.
-    maxDocstringSize = max(docstringSizes.values())
-    styleMismatch: bool = docstringSizes[userSpecifiedStyle] < maxDocstringSize
-    return (
-        docstrings[userSpecifiedStyle],
-        parsingExceptions[userSpecifiedStyle],
-        styleMismatch,
-    )
+    matchedStyles = [
+        style for style, matched in likelyStyles.items() if matched
+    ]
+
+    styleMismatch: bool
+
+    if len(matchedStyles) == 1:
+        detectedStyle = matchedStyles[0]
+        if detectedStyle == userSpecifiedStyle:
+            doc, exc = parseDocstringInGivenStyle(docstring, detectedStyle)
+            # The Google parser raises hard errors when sections are malformed,
+            # which is a strong signal the docstring is effectively written in
+            # a different style. Numpy/Sphinx parsers are more permissive, so
+            # we surface only the parsing error (DOC001) without flagging a
+            # style mismatch in those cases.
+            styleMismatch = exc is not None and detectedStyle == 'google'
+            return doc, exc, styleMismatch
+
+        doc, exc = parseDocstringInGivenStyle(docstring, detectedStyle)
+        styleMismatch = True
+        return doc, exc, styleMismatch
+
+    if len(matchedStyles) == 0:
+        doc, exc = parseDocstringInGivenStyle(docstring, userSpecifiedStyle)
+        styleMismatch = False
+        return doc, exc, styleMismatch
+
+    doc, exc = parseDocstringInGivenStyle(docstring, userSpecifiedStyle)
+    styleMismatch = True
+    return doc, exc, styleMismatch


 def parseDocstringInGivenStyle(
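
Review note on the hunk above: the branching in the new `parseDocstring` reduces to a small decision table. The following is a minimal sketch, assuming the three heuristic results have already been collected into a set of style names. `decide` is a hypothetical helper, not part of pydoclint. It deliberately leaves out the one case that depends on the real parser, where the detected style equals the configured Google style but parsing fails.

```python
def decide(matched: set[str], configured: str) -> tuple[str, bool]:
    """
    Return (style to parse with, style-mismatch flag).

    Mirrors the branch order in the diff, ignoring parse errors.
    """
    if 'numpy' in matched:
        # Dashed numpy headers win over any Google/Sphinx keyword hits.
        matched = {'numpy'}

    if len(matched) == 1:
        (detected,) = matched
        # Parse with the detected style; flag a mismatch only if it
        # differs from the configured one.
        return detected, detected != configured

    if not matched:
        # Nothing recognizable: trust the configured style.
        return configured, False

    # Google and Sphinx keywords both present: mixed styles.
    return configured, True


print(decide({'numpy', 'sphinx'}, 'numpy'))   # ('numpy', False)
print(decide({'sphinx'}, 'google'))           # ('sphinx', True)
print(decide(set(), 'google'))                # ('google', False)
print(decide({'google', 'sphinx'}, 'google')) # ('google', True)
```

Written this way, it is easy to see that a single non-matching style is parsed with the detected style, while the mixed-style branch is parsed with the configured style.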
