Skip to content

Conversation

@michaelkedar
Copy link
Member

@michaelkedar michaelkedar commented Nov 25, 2025

This is pretty much a direct translation of the existing python code (which makes some of it unidiomatic go), changed to use the new Vulnerability instead of Bug.

One difference is the alias and upstream computations now run at the same time, and the writes to the Vulnerability protos now happen together. I had to modify the recoverer to be able to handle multiple updating multiple fields when the GCS generation is mismatched to support this properly.

Also ported over the datastore emulator manager, and made some interfaces for the google cloud clients (with the added benefit of introducing write retries which resolves #4355)
This has allowed my to bring over the existing tests.

I've renamed the alias-computation cron to relations - Cloud Deploy is going to create a new workload for this, and the old one is going to have to be deleted manually.

In a follow-up, I'm going to add another goroutine to calculate the related field, since these aren't being kept up to date at the moment.

@michaelkedar
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a major rewrite of the alias and upstream computation logic from Python to Go, and the changes are extensive but well-executed. The introduction of interfaces for cloud clients (CloudStorage, Publisher) and mock implementations for testing is a significant improvement in code quality and testability. The new relations Go command is well-structured, and the test coverage is comprehensive. The addition of a datastore emulator for testing is also a great enhancement to the development workflow. I've identified a minor resource leak and a point of code complexity that could be simplified, but overall this is a high-quality contribution.

Comment on lines +135 to +152
for _, upstream := range group.UpstreamIDs {
if _, ok := visited[upstream]; !ok && !slices.Contains(toVisit, upstream) {
toVisit = append(toVisit, upstream)
} else {
if u, ok := upstreamMap[vulnID]; !ok {
upstreamMap[vulnID] = []string{upstream}
} else if !slices.Contains(u, upstream) {
upstreamMap[vulnID] = append(u, upstream)
}
}
}
// Add the immediate upstreams of the vuln to the map
upstreamMap[vulnID] = group.UpstreamIDs
for _, upstream := range group.UpstreamIDs {
if _, ok := visited[upstream]; !ok && !slices.Contains(toVisit, upstream) {
toVisit = append(toVisit, upstream)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code for populating toVisit and upstreamMap appears to be more complex than necessary and contains some redundant logic. The first loop from line 135 seems to be partially overwritten by the assignment on line 147 and the second loop on line 148. This can be simplified for clarity and to remove the redundancy.

if len(group.UpstreamIDs) == 0 {
	continue
}
// Add the immediate upstreams of the vuln to the map
upstreamMap[vulnID] = group.UpstreamIDs
for _, upstream := range group.UpstreamIDs {
	if _, ok := visited[upstream]; !ok && !slices.Contains(toVisit, upstream) {
		toVisit = append(toVisit, upstream)
	}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this logic is basically directly translated from the python implementation:

for upstream in upstreams:
if upstream not in visited and upstream not in to_visit:
to_visit.add(upstream)
else:
if bug_id not in upstream_map:
upstream_map[bug_id] = set([upstream])
else:
upstream_map[bug_id].add(upstream)
# Add the immediate upstreams of the bug to the dict
upstream_map[bug_id] = upstreams
to_visit.update(upstreams - visited)

@michaelkedar
Copy link
Member Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a significant and well-executed pull request that rewrites the alias and upstream computation logic from Python to Go. The new Go implementation is well-structured, and the introduction of interfaces for cloud clients and comprehensive test utilities is a fantastic improvement for testability and maintainability. The core logic appears to be a faithful translation of the original Python code. I've identified a couple of critical compilation errors related to the misuse of sync.WaitGroup that need to be addressed. Otherwise, the changes are excellent.

@michaelkedar michaelkedar changed the title feat: rewrite alias/upstream computation in Go (WIP) feat: rewrite alias/upstream computation in Go Nov 26, 2025
@michaelkedar michaelkedar marked this pull request as ready for review November 26, 2025 23:48
vuln_proto.aliases[:] = aliases
if aliases_modified > modified:
modified = aliases_modified
fields = field.split(',')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe in the PR description what this change is for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added:

One difference is the alias and upstream computations now run at the same time, and the writes to the Vulnerability protos now happen together. I had to modify the recoverer to be able to handle multiple updating multiple fields when the GCS generation is mismatched to support this properly.

// generate the final files, sending the data to be written to `routerToWriteCh`.
// 5. A pool of `writer` workers receive the file data and write it to the output.
gcsObjToDownloaderCh := make(chan *storage.ObjectHandle)
gcsObjToDownloaderCh := make(chan string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename this variable since it's taking in a path now
nit2: Alias string to a gcsPath type and put that as the channel type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to gcsPathToDownloaderCh
re: nit2 - I don't really see the benefit of doing that - it's fairly standard to use strings as paths in go (e.g. in the path package)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more to clarify whether you are meant to pass in a path, a name, or a vuln ID...etc. This makes it a bit more self documenting. I think you don't even need to cast when passing in a string for aliases

if errors.Is(err, iterator.Done) {
break
}
for path, err := range vulnClient.Objects(ctx, gcsProtoPrefix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so clean!

query := datastore.NewQuery("Vulnerability").FilterField("upstream_raw", ">", "")
it := cl.Run(ctx, query)

vulns := make(map[string][]string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here about what vulns is suppose to contain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this to rawUpstreams which is more descriptive

"google.golang.org/api/iterator"
)

func computeUpstream(targetBugUpstream []string, bugs map[string][]string) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment describing what this function does, and what the two args are. it's unclear why targetBugUpstream is a list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added back in all the original docstrings from python to the go code, which should make it slightly clearer.

I also changed the function signature to be computeUpstream(vulnID []string, rawUpstreams map[string][]string) []string , which is simpler and clearer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add some retries on GCS operations in Go exporter

2 participants