
Conversation

@staru09 (Contributor) commented Jul 12, 2025

This PR is a WIP for issue #130.
It adds summarizer and correct flags to improve and summarize transcripts generated by Deepgram/Whisper, using the OpenAI API.

@staru09 (Contributor, Author) commented Jul 12, 2025

@kouloumos please ignore the previous one and review this.
The last one was more complex, and we may not need it anyway.

@kouloumos (Member) left a comment

Thank you for your contribution, Aru! This is a really solid foundation.

I've added a few small notes inline. I also have a question about the llm service: should we split its responsibilities from the start?

I'm thinking ahead to our plans for a more advanced correction system. Do you think the llm service might get too complex if it has to manage corrections, summaries, and potentially more in the future?

Perhaps we could introduce a CorrectionService and a SummarizerService? They might just be simple LLM prompts for now, but it would give us a dedicated place to build out that more complex logic later.

An added benefit might be simplifying the Transcription class. If it initializes a list of services from the start, we might be able to get rid of the internal flags that track what needs to be done.

What are your thoughts on that trade-off? Happy to discuss it further.
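
A minimal sketch of what that split might look like (class names, the LLMClient protocol, and prompt wording here are assumptions for illustration, not existing project code):

from typing import Protocol


class LLMClient(Protocol):
    # hypothetical interface: any client that can complete a prompt
    def complete(self, prompt: str, model: str) -> str: ...


class CorrectionService:
    """Corrects a raw transcript; just a single LLM prompt for now."""

    def __init__(self, llm: LLMClient, model: str):
        self.llm = llm
        self.model = model

    def run(self, text: str) -> str:
        prompt = f"Correct the following transcript without changing its meaning:\n\n{text}"
        return self.llm.complete(prompt, self.model)


class SummarizerService:
    """Summarizes a transcript; a dedicated place for richer logic later."""

    def __init__(self, llm: LLMClient, model: str):
        self.llm = llm
        self.model = model

    def run(self, text: str) -> str:
        prompt = f"Summarize the following transcript:\n\n{text}"
        return self.llm.complete(prompt, self.model)

The Transcription class could then hold a list of such services and run them in order, instead of tracking per-feature flags.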

Additional git productivity tip:

  • Instead of merging the main branch into your feature branch, prefer to rebase your branch on top of main. This creates a clean git history, where code changes are only made in non-merge commits. The basic rebasing workflow is shown below.
git checkout your_branch_name
git fetch origin
git rebase origin/main
# If conflicts arise, resolve them, stage the resolved files, and continue with:
# git rebase --continue
# Repeat until rebase is complete.
git push -f # (force push to GitHub)

        pass

    @abstractmethod
    def summarize_text(self, text: str, model: str) -> str:
@kouloumos (Member):

Why not call this summarize_transcript, following the same pattern as correct_transcript?
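
i.e. something like:

    @abstractmethod
    def summarize_transcript(self, text: str, model: str) -> str:
        ...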

Comment on lines 505 to 567
text_exporter = self.exporters.get("text")

# --- Step 1: Save the original raw transcript if it exists ---
if transcript.outputs["raw"] and text_exporter:
    raw_output_dir = text_exporter.get_output_path(transcript)
    raw_file_path = text_exporter.construct_file_path(
        directory=raw_output_dir,
        filename=f"{transcript.title}_raw",
        file_type="txt",
        include_timestamp=False
    )
    text_exporter.write_to_file(transcript.outputs["raw"], raw_file_path)
    self.logger.info(f"Raw transcript saved to: {raw_file_path}")

# --- Step 2: Perform LLM Correction ---
if self.llm_service and self.correct and transcript.outputs["raw"]:
    self.logger.info(f"Correcting transcript with {self.llm_provider}...")
    corrected_transcript_text = self.llm_service.correct_transcript(
        transcript.outputs["raw"], self.llm_correction_model
    )

    if text_exporter:
        corrected_output_dir = text_exporter.get_output_path(transcript)
        corrected_file_path = text_exporter.construct_file_path(
            directory=corrected_output_dir,
            filename=f"{transcript.title}_corrected",
            file_type="txt",
            include_timestamp=False
        )
        text_exporter.write_to_file(corrected_transcript_text, corrected_file_path)
        self.logger.info(f"Corrected transcript saved to: {corrected_file_path}")

    # Overwrite the main "raw" output so subsequent steps use the corrected version
    transcript.outputs["raw"] = corrected_transcript_text

# --- Step 3: Perform LLM Summarization ---
if self.llm_service and self.summarize_llm and transcript.outputs["raw"]:
    self.logger.info(f"Summarizing transcript with {self.llm_provider}...")
    summary_text = self.llm_service.summarize_text(
        transcript.outputs["raw"], self.llm_summary_model
    )
    transcript.summary = summary_text

    if text_exporter:
        summary_output_dir = text_exporter.get_output_path(transcript)
        summary_file_path = text_exporter.construct_file_path(
            directory=summary_output_dir,
            filename=f"{transcript.title}_summary",
            file_type="txt",
            include_timestamp=False
        )
        text_exporter.write_to_file(summary_text, summary_file_path)
        self.logger.info(f"Summary saved to: {summary_file_path}")

# --- Step 4: Run existing exporters for final outputs (Markdown, JSON, etc.) ---
if self.markdown or self.github_handler:
    transcript.outputs["markdown"] = self.write_to_markdown_file(
        transcript,
    )

if "text" in self.exporters:
    try:
        transcript.outputs["text"] = self.exporters["text"].export(
            transcript, add_timestamp=False
        )
    except Exception as e:
        self.logger.warning(f"Text exporter failed: {e}")

if "json" in self.exporters:
    transcript.outputs["json"] = self.exporters["json"].export(
        transcript
@kouloumos (Member):

Interesting approach here with the raw and corrected text exports. I see how that could help with verifying the correction algorithm.

I'm thinking about the long-term workflow, and I wonder if we can make the verification process even smoother by leveraging git. What do you think about this potential flow?

  • When the --github flag is used, we could create two separate commits: a first one with the initial transcript, and a second with the corrections applied. This would allow us to see the exact impact of the algorithm in the PR's diff.
    • for this to work you would need to store the corrected transcript to a different field and use it in the start() method.
  • When the flag isn't used, we'd simply save the final, corrected text based on our export settings.

This would also help us enforce that the export step always happens at the very end of the postprocessing chain. I am now thinking that export() should probably be a method by itself.

Let me know if this makes sense or if you foresee any issues with this approach.
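
A rough sketch of the flow I mean (the "corrected" output key and the github_handler.commit calls are hypothetical, not the current API):

# keep raw and corrected outputs separate so the GitHub flow can
# commit them in two steps (field/method names are assumptions)
transcript.outputs["corrected"] = self.llm_service.correct_transcript(
    transcript.outputs["raw"], self.llm_correction_model
)

if self.github_handler:
    # first commit: the unmodified transcript, so the second commit's
    # diff shows exactly what the correction algorithm changed
    self.github_handler.commit(transcript.outputs["raw"], message="Add raw transcript")
    self.github_handler.commit(transcript.outputs["corrected"], message="Apply LLM corrections")
else:
    # without --github, just export the final corrected text per export settings
    transcript.outputs["raw"] = transcript.outputs["corrected"]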

Comment on lines 527 to 535
corrected_output_dir = text_exporter.get_output_path(transcript)
corrected_file_path = text_exporter.construct_file_path(
    directory=corrected_output_dir,
    filename=f"{transcript.title}_corrected",
    file_type="txt",
    include_timestamp=False
)
text_exporter.write_to_file(corrected_transcript_text, corrected_file_path)
self.logger.info(f"Corrected transcript saved to: {corrected_file_path}")
@kouloumos (Member):

This seems redundant; we already have export methods that you can use here.
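
For instance (a sketch reusing the export() call that already appears in Step 4 of this method):

# instead of building paths by hand, set the corrected text on the
# transcript and reuse the existing text exporter
transcript.outputs["raw"] = corrected_transcript_text
transcript.outputs["text"] = self.exporters["text"].export(transcript, add_timestamp=False)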

transcriber.py Outdated
correct_transcript = click.option(
    "--correct",
    is_flag=True,
    default=False,
@kouloumos (Member):

You can use default=settings.config.get(...) here (and on the rest of the new flags) so that users can set their defaults via the config.ini file.
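
For example, a sketch assuming settings.config is dict-like (the key name and help text are placeholders):

correct_transcript = click.option(
    "--correct",
    is_flag=True,
    # "correct" as a key name is an assumption; use whatever matches config.ini
    default=settings.config.get("correct", False),
    help="Correct the transcript using the configured LLM provider.",
)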

transcriber.py Outdated
Comment on lines 219 to 224
summarize_llm = click.option(
    "--summarize-llm",
    is_flag=True,
    default=False,
    help="Generate a new summary using the configured LLM provider.",
)
@kouloumos (Member):

Try to use the existing --summarize flag. It is supposed to work with Deepgram, but we never actually use it, so feel free to repurpose it.

@staru09 (Contributor, Author):

Do you think we should use the LLM for correction and do summarization with Deepgram only?

@kouloumos (Member):

No, that's not what I meant. There is already a --summarize flag; I'm suggesting you use that instead of adding a new one.

@staru09 (Contributor, Author):

Okay, got it. Sorry for the misunderstanding.

Comment on lines 15 to 18
[llm]
llm_provider = openai
llm_correction_model = gpt-4o
llm_summary_model = gpt-3.5
@kouloumos (Member) commented Jul 13, 2025:

Add those under the [DEFAULT] section instead of here.
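
i.e.:

[DEFAULT]
llm_provider = openai
llm_correction_model = gpt-4o
llm_summary_model = gpt-3.5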

Comment on lines 548 to 558
if text_exporter:
    summary_output_dir = text_exporter.get_output_path(transcript)
    summary_file_path = text_exporter.construct_file_path(
        directory=summary_output_dir,
        filename=f"{transcript.title}_summary",
        file_type="txt",
        include_timestamp=False
    )
    text_exporter.write_to_file(summary_text, summary_file_path)
    self.logger.info(f"Summary saved to: {summary_file_path}")

@kouloumos (Member):

We shouldn't need all this code for the paths.

Also, logging should happen inside the services.
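
For example (hypothetical names; a sketch, not the project's actual interfaces), the caller could shrink to:

# the service logs its own progress internally, and the existing
# exporter handles path construction
transcript.summary = self.summarizer_service.run(transcript.outputs["raw"])
transcript.outputs["text"] = self.exporters["text"].export(transcript, add_timestamp=False)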

@staru09 (Contributor, Author) commented Jul 13, 2025

> I'm thinking ahead to our plans for a more advanced correction system. Do you think the llm service might get too complex if it has to manage corrections, summaries, and potentially more in the future?
>
> Perhaps we could introduce a CorrectionService and a SummarizerService? They might just be simple LLM prompts for now, but it would give us a dedicated place to build out that more complex logic later.
>
> An added benefit might be simplifying the Transcription class. If it initializes a list of services from the start, we might be able to get rid of the internal flags that track what needs to be done.
>
> What are your thoughts on that trade-off? Happy to discuss it further.

I think we should split the llm_service into a correction service and a summarization service, so we at least have placeholders to build on for further improvements.

@staru09 force-pushed the llm_feats branch 4 times, most recently from d1c6a8d to bc4597e on August 17, 2025 at 14:25