Commit f805356
[receiver/github] add concurrency limits and pull request time filtering
1 parent 320ef1c commit f805356

File tree

11 files changed: +313 −24 lines changed


.chloggen/gh-scrape-limits.yaml

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+# Use this changelog template to create an entry for release notes.
+
+# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
+change_type: enhancement
+
+# The name of the component, or a single word describing the area of concern, (e.g. receiver/filelog)
+component: receiver/github
+
+# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
+note: Add concurrency limit and pull request filtering to reduce rate limiting
+
+# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
+issues: [43388]
+
+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext:
+
+# If your change doesn't affect end users or the exported elements of any package,
+# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
+# Optional: The change log or logs in which this entry should be included.
+# e.g. '[user]' or '[user, api]'
+# Include 'user' if the change is relevant to end users.
+# Include 'api' if there is a change to a library API.
+# Default: '[user]'
+change_logs: []

receiver/githubreceiver/README.md

Lines changed: 6 additions & 0 deletions

@@ -77,6 +77,8 @@ receivers:
         enabled: true
     github_org: <myfancyorg>
     search_query: "org:<myfancyorg> topic:<o11yalltheway>" # Recommended optional query override, defaults to "{org,user}:<github_org>"
+    max_concurrent_requests: 100 # Optional, default: 100
+    pull_request_lookback_days: 30 # Optional, default: 30
     endpoint: "https://selfmanagedenterpriseserver.com" # Optional
     auth:
       authenticator: bearertokenauth/github
@@ -97,6 +99,10 @@ service:
 
 `search_query` (optional): A filter to narrow down repositories. Defaults to `org:<github_org>` (or `user:<username>`). For example, use `repo:<org>/<repo>` to target a specific repository. Any valid GitHub search syntax is allowed.
 
+`max_concurrent_requests` (optional, default: 100): Maximum concurrent API requests to prevent exceeding GitHub's secondary rate limit of 100 concurrent requests. Set lower if sharing the token with other tools, or higher (at your own risk) if you understand the implications.
+
+`pull_request_lookback_days` (optional, default: 30): Days to look back for merged pull requests. Set to 0 for unlimited history. Open pull requests are always fetched regardless of this setting.
+
 `metrics` (optional): Enable or disable metrics scraping. See the [metrics documentation](./documentation.md) for details.
 
 ### Scraping

receiver/githubreceiver/go.mod

Lines changed: 1 addition & 0 deletions

@@ -31,6 +31,7 @@ require (
 	go.uber.org/goleak v1.3.0
 	go.uber.org/multierr v1.11.0
 	go.uber.org/zap v1.27.1
+	golang.org/x/sync v0.18.0
 )
 
 require (

receiver/githubreceiver/go.sum

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default.

receiver/githubreceiver/internal/scraper/githubscraper/README.md

Lines changed: 81 additions & 14 deletions

@@ -42,23 +42,90 @@ to prevent abuse and maintain API availability. The following secondary limit is
 particularly relevant:
 
 - **Concurrent Requests Limit**: The API allows no more than 100 concurrent
-  requests. This limit is shared across the REST and GraphQL APIs. Since the
-  scraper creates a goroutine per repository, having more than 100 repositories
-  returned by the `search_query` will result in exceeding this limit.
-  It is recommended to use the `search_query` config option to limit the number of
-  repositories that are scraped. We recommend one instance of the receiver per
-  team (note: `team` is not a valid quantifier when searching repositories `topic`
-  is). Reminder that each instance of the receiver should have its own
-  corresponding token for authentication as this is what rate limits are tied to.
+  requests. This limit is shared across the REST and GraphQL APIs. The scraper
+  provides a `max_concurrent_requests` configuration option (default: 100) to
+  control concurrency and reduce the likelihood of exceeding this limit.
 
-In summary, we recommend the following:
+## Configuration Options for Rate Limiting
+
+To reduce rate limit issues, the GitHub scraper provides two configuration
+options:
+
+### Concurrency Control
+
+```yaml
+receivers:
+  github:
+    scrapers:
+      github:
+        max_concurrent_requests: 100 # Default: 100
+```
+
+The `max_concurrent_requests` option limits how many repositories are scraped
+concurrently. GitHub's API enforces a secondary rate limit of 100 concurrent
+requests (shared between REST and GraphQL APIs). The default value of 100
+respects this limit.
+
+**When to adjust:**
+- Set lower (e.g., 50) if you're also using GitHub's API from other tools with
+  the same token
+- Set higher than 100 only if you understand the risk of secondary rate limit
+  errors
+- The receiver will warn if this value exceeds 100
+
+### Pull Request Time Filtering
+
+```yaml
+receivers:
+  github:
+    scrapers:
+      github:
+        pull_request_lookback_days: 30 # Default: 30
+```
+
+The `pull_request_lookback_days` option limits how far back to query for merged
+pull requests. Open pull requests are always fetched regardless of age. The
+scraper will stop paginating through merged PRs once it encounters PRs older
+than the lookback period, significantly reducing API calls for repositories with
+large PR histories.
+
+**When to adjust:**
+- Set to 0 to fetch all historical pull requests (may consume significant API
+  quota)
+- Increase (e.g., 90) if your team's PR cycle time is longer
+- Decrease (e.g., 7) if you only care about very recent merged PRs
+
+**Note:** The implementation fetches open and merged PRs separately to enable
+early termination of pagination for merged PRs, minimizing unnecessary API
+calls.
+
+### Example Configuration
+
+```yaml
+receivers:
+  github:
+    collection_interval: 300s # 5 minutes
+    scrapers:
+      github:
+        github_org: myorg
+        search_query: "org:myorg topic:observability"
+        max_concurrent_requests: 50
+        pull_request_lookback_days: 30
+        endpoint: https://github.example.com # For GitHub Enterprise
+        auth:
+          authenticator: bearertokenauth/github
+```
+
+### Recommendations
+
+Based on the limitations above, we recommend:
 
 - One instance of the receiver per team
-- Each instance of the receiver should have its own token
-- Leverage `search_query` config option to limit repositories returned to 100 or
-  less per instance
-- `collection_interval` should be long enough to avoid rate limiting (see above
-  formula). A sensible default is `300s`.
+- Each instance should have its own token
+- Use `search_query` to limit repositories to a reasonable number
+- Set `collection_interval` to 300s (5 minutes) or higher to avoid primary rate limits
+- Use `max_concurrent_requests: 100` (default) to prevent secondary rate limit errors
+- Use `pull_request_lookback_days: 30` (default) to limit historical PR queries
 
 **Additional Resources:**
 

receiver/githubreceiver/internal/scraper/githubscraper/config.go

Lines changed: 26 additions & 0 deletions

@@ -4,7 +4,10 @@
 package githubscraper // import "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/scraper/githubscraper"
 
 import (
+	"errors"
+
 	"go.opentelemetry.io/collector/config/confighttp"
+	"go.uber.org/multierr"
 
 	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal"
 	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/metadata"
@@ -19,4 +22,27 @@ type Config struct {
 	GitHubOrg string `mapstructure:"github_org"`
 	// SearchQuery is the query to use when defining a custom search for repository data
 	SearchQuery string `mapstructure:"search_query"`
+	// MaxConcurrentRequests limits the number of concurrent API requests to prevent
+	// exceeding GitHub's secondary rate limit of 100 concurrent requests.
+	// Default: 100
+	MaxConcurrentRequests int `mapstructure:"max_concurrent_requests"`
+	// PullRequestLookbackDays limits how far back to query for merged/closed pull requests.
+	// Open pull requests are always fetched regardless of age.
+	// Set to 0 to fetch all historical pull requests.
+	// Default: 30
+	PullRequestLookbackDays int `mapstructure:"pull_request_lookback_days"`
+}
+
+func (cfg *Config) Validate() error {
+	var errs error
+
+	if cfg.MaxConcurrentRequests <= 0 {
+		errs = multierr.Append(errs, errors.New("max_concurrent_requests must be greater than 0"))
+	}
+
+	if cfg.PullRequestLookbackDays < 0 {
+		errs = multierr.Append(errs, errors.New("pull_request_lookback_days cannot be negative"))
+	}
+
+	return errs
 }
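The `Validate` method above accumulates both checks rather than failing fast, so a misconfigured user sees every problem at once. A minimal, self-contained sketch of the same pattern follows; the `cfg` type and `validate` function are hypothetical stand-ins, and the standard library's `errors.Join` is used here in place of `go.uber.org/multierr`, which behaves similarly for this purpose.

```go
package main

import (
	"errors"
	"fmt"
)

// cfg mirrors the two fields added to the scraper Config in this commit.
type cfg struct {
	maxConcurrentRequests   int
	pullRequestLookbackDays int
}

// validate collects every violation instead of returning on the first,
// the same shape as the Validate method in config.go.
func (c *cfg) validate() error {
	var errs []error
	if c.maxConcurrentRequests <= 0 {
		errs = append(errs, errors.New("max_concurrent_requests must be greater than 0"))
	}
	if c.pullRequestLookbackDays < 0 {
		errs = append(errs, errors.New("pull_request_lookback_days cannot be negative"))
	}
	return errors.Join(errs...) // nil when both checks pass
}

func main() {
	fmt.Println((&cfg{100, 30}).validate()) // prints: <nil>
	fmt.Println((&cfg{0, -1}).validate())   // both messages, one per line
}
```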

receiver/githubreceiver/internal/scraper/githubscraper/config_test.go

Lines changed: 64 additions & 2 deletions

@@ -23,9 +23,71 @@ func TestConfig(t *testing.T) {
 	clientConfig.Timeout = 15 * time.Second
 
 	expectedConfig := &Config{
-		MetricsBuilderConfig: metadata.DefaultMetricsBuilderConfig(),
-		ClientConfig:         clientConfig,
+		MetricsBuilderConfig:    metadata.DefaultMetricsBuilderConfig(),
+		ClientConfig:            clientConfig,
+		MaxConcurrentRequests:   100,
+		PullRequestLookbackDays: 30,
 	}
 
 	assert.Equal(t, expectedConfig, defaultConfig)
 }
+
+func TestConfigValidate(t *testing.T) {
+	tests := []struct {
+		name        string
+		config      *Config
+		expectedErr string
+	}{
+		{
+			name: "valid config with defaults",
+			config: &Config{
+				MaxConcurrentRequests:   100,
+				PullRequestLookbackDays: 30,
+			},
+			expectedErr: "",
+		},
+		{
+			name: "valid config with zero lookback (unlimited)",
+			config: &Config{
+				MaxConcurrentRequests:   50,
+				PullRequestLookbackDays: 0,
+			},
+			expectedErr: "",
+		},
+		{
+			name: "invalid negative max_concurrent_requests",
+			config: &Config{
+				MaxConcurrentRequests:   -1,
+				PullRequestLookbackDays: 30,
+			},
+			expectedErr: "max_concurrent_requests must be greater than 0",
+		},
+		{
+			name: "invalid zero max_concurrent_requests",
+			config: &Config{
+				MaxConcurrentRequests:   0,
+				PullRequestLookbackDays: 30,
+			},
+			expectedErr: "max_concurrent_requests must be greater than 0",
+		},
+		{
+			name: "invalid negative lookback days",
+			config: &Config{
+				MaxConcurrentRequests:   100,
+				PullRequestLookbackDays: -1,
+			},
+			expectedErr: "pull_request_lookback_days cannot be negative",
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			err := tt.config.Validate()
+			if tt.expectedErr == "" {
+				assert.NoError(t, err)
+			} else {
+				assert.ErrorContains(t, err, tt.expectedErr)
+			}
+		})
+	}
+}

receiver/githubreceiver/internal/scraper/githubscraper/factory.go

Lines changed: 8 additions & 4 deletions

@@ -18,8 +18,10 @@ import (
 // This file implements factory for the GitHub Scraper as part of the GitHub Receiver
 
 const (
-	TypeStr            = "scraper"
-	defaultHTTPTimeout = 15 * time.Second
+	TypeStr                        = "scraper"
+	defaultHTTPTimeout             = 15 * time.Second
+	defaultMaxConcurrentRequests   = 100
+	defaultPullRequestLookbackDays = 30
 )
 
 type Factory struct{}
@@ -28,8 +30,10 @@ func (*Factory) CreateDefaultConfig() internal.Config {
 	clientConfig := confighttp.NewDefaultClientConfig()
 	clientConfig.Timeout = defaultHTTPTimeout
 	return &Config{
-		MetricsBuilderConfig: metadata.DefaultMetricsBuilderConfig(),
-		ClientConfig:         clientConfig,
+		MetricsBuilderConfig:    metadata.DefaultMetricsBuilderConfig(),
+		ClientConfig:            clientConfig,
+		MaxConcurrentRequests:   defaultMaxConcurrentRequests,
+		PullRequestLookbackDays: defaultPullRequestLookbackDays,
 	}
 }
 

receiver/githubreceiver/internal/scraper/githubscraper/github_scraper.go

Lines changed: 32 additions & 1 deletion

@@ -17,6 +17,7 @@ import (
 	"go.opentelemetry.io/collector/pdata/pmetric"
 	"go.opentelemetry.io/collector/receiver"
 	"go.uber.org/zap"
+	"golang.org/x/sync/semaphore"
 
 	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/metadata"
 )
@@ -30,6 +31,7 @@ type githubScraper struct {
 	logger *zap.Logger
 	mb     *metadata.MetricsBuilder
 	rb     *metadata.ResourceBuilder
+	sem    *semaphore.Weighted // Concurrency limiter
 }
 
 func (ghs *githubScraper) start(ctx context.Context, host component.Host) (err error) {
@@ -95,6 +97,28 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
 
 	ghs.mb.RecordVcsRepositoryCountDataPoint(now, int64(count))
 
+	// Log a warning if the repository count exceeds the configured concurrency limit
+	if len(repos) > ghs.cfg.MaxConcurrentRequests {
+		ghs.logger.Sugar().Warnf(
+			"Found %d repositories but max_concurrent_requests is set to %d. "+
+				"Consider using search_query to reduce repository count or increase max_concurrent_requests. "+
+				"Note: GitHub's API limit is 100 concurrent requests.",
+			len(repos), ghs.cfg.MaxConcurrentRequests,
+		)
+	}
+
+	// Log a warning if max_concurrent_requests exceeds GitHub's limit
+	if ghs.cfg.MaxConcurrentRequests > 100 {
+		ghs.logger.Sugar().Warnf(
+			"max_concurrent_requests is set to %d which exceeds GitHub's API limit of 100 concurrent requests. "+
+				"This may result in secondary rate limit errors.",
+			ghs.cfg.MaxConcurrentRequests,
+		)
+	}
+
+	// Initialize the semaphore that bounds concurrent per-repository scrapes
+	ghs.sem = semaphore.NewWeighted(int64(ghs.cfg.MaxConcurrentRequests))
+
 	// Get the ref (branch) count (future branch data) for each repo and record
 	// the given metrics
 	var wg sync.WaitGroup
@@ -110,6 +134,13 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
 		go func() {
 			defer wg.Done()
 
+			// Acquire a semaphore slot before making API calls
+			if err := ghs.sem.Acquire(ctx, 1); err != nil {
+				ghs.logger.Sugar().Errorf("failed to acquire semaphore: %v", err)
+				return
+			}
+			defer ghs.sem.Release(1)
+
 			branches, count, err := ghs.getBranches(ctx, genClient, name, trunk)
 			if err != nil {
 				ghs.logger.Sugar().Errorf("error getting branch count: %v", zap.Error(err))
@@ -164,7 +195,7 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
 			ghs.mb.RecordVcsContributorCountDataPoint(now, int64(contribs), url, name)
 
 			// Get change (pull request) data
-			prs, err := ghs.getPullRequests(ctx, genClient, name)
+			prs, err := ghs.getPullRequests(ctx, genClient, name, ghs.cfg.PullRequestLookbackDays)
 			if err != nil {
 				ghs.logger.Sugar().Errorf("error getting pull requests: %v", zap.Error(err))
 			}

0 commit comments
