DocForge

GitHub repository: https://github.com/Shriyabh11/DocForge

Google Meet link: https://meet.google.com/qxg-jphp-hnb

Aim

To build, fine-tune, and evaluate a transformer-based model that automatically generates complete function-level documentation from source code, including descriptions, parameter explanations, and return values.

Introduction

Software documentation is essential for maintainable and understandable code, yet developers often neglect writing detailed documentation due to time constraints and project complexity.

DocForge addresses this problem by automatically generating high-quality documentation directly from source code using deep learning and natural language processing techniques. The project fine-tunes CodeT5, a transformer model specialized for code understanding and generation, on the Code2Doc dataset containing curated function-documentation pairs across multiple programming languages.

The system analyzes code structure and semantics to produce human-readable descriptions, parameter explanations, and return value summaries. The final model is integrated into a Streamlit web application for real-time documentation generation.

Literature Survey and Technologies Used

Literature Survey

Code2Doc (2025) introduced a high-quality curated dataset for automatic code documentation generation using strict filtering and quality scoring techniques.
CodeT5 (2021) proposed an encoder-decoder transformer architecture specifically designed for programming language understanding and code generation tasks.
Attention Is All You Need (2017) introduced the transformer architecture and attention mechanisms that form the foundation of modern large language models.
BLEU and ROUGE-L metrics are widely used evaluation techniques for measuring text generation quality and similarity between generated and reference outputs.

Technologies Used

Python
PyTorch
Hugging Face Transformers
CodeT5-base
Streamlit
Kaggle GPU (NVIDIA T4)
BLEU and ROUGE-L evaluation metrics

Methodology

Dataset

The Code2Doc dataset containing 13,358 curated code-documentation pairs was used. The dataset includes Python, Java, TypeScript, JavaScript, and C++ functions.

The dataset undergoes:

Basic filtering
Quality scoring
Deduplication
AI-content detection

Dataset Statistics

The Code2Doc dataset used for training contains 13,358 curated code-documentation pairs after a four-stage filtering process. It covers five programming languages: Python, Java, TypeScript, JavaScript, and C++. Java makes up the largest share with 61.4% of the samples, while Python contributes 27%. The mean quality score of the retained pairs is 6.93/10, above the 6.0 threshold used during curation.

Four-Stage Curation Pipeline

Stage 1 — Basic Filtering: removes trivial docs, test functions, placeholders

Stage 2 — Quality Scoring: 8-dimension weighted score, threshold ≥ 6.0

Stage 3 — Deduplication: exact hash + MinHash-LSH near-duplicate removal

Stage 4 — AI Content Detection: heuristic flagging of LLM-generated docs
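The quality-threshold and exact-hash steps above can be illustrated with a minimal sketch. It is not the Code2Doc implementation: the record fields (`code`, `doc`, `quality`), the whitespace normalization, and the brute-force Jaccard helper (a simple stand-in for MinHash-LSH, which approximates the same similarity at scale) are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash identically."""
    return re.sub(r"\s+", " ", code).strip()


def quality_filter(pairs, threshold=6.0):
    """Keep pairs whose (precomputed, hypothetical) quality score meets the threshold."""
    return [p for p in pairs if p["quality"] >= threshold]


def exact_dedup(pairs):
    """Keep the first occurrence of each normalized code body."""
    seen, kept = set(), []
    for pair in pairs:
        digest = hashlib.sha256(normalize(pair["code"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(pair)
    return kept


def shingles(code: str, k: int = 3):
    """Set of k-token windows used to compare two functions."""
    tokens = normalize(code).split(" ")
    return {tuple(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity of shingle sets (what MinHash estimates)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


samples = [
    {"code": "def add(a, b):\n    return a + b", "doc": "Add two numbers.", "quality": 7.5},
    {"code": "def add(a, b):  return a + b", "doc": "Adds.", "quality": 6.2},
    {"code": "def noop(): pass", "doc": "TODO", "quality": 3.0},
]
curated = exact_dedup(quality_filter(samples))
```

In this toy run the third sample falls below the 6.0 threshold and the second is dropped as a whitespace-only copy of the first.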

Model & Architecture

CodeT5 — Seq2Seq Fine-Tuning

CodeT5 is an encoder-decoder transformer pre-trained on large-scale code corpora (GitHub, CodeSearchNet). It is purpose-built for code understanding and generation, which makes it well suited to the code → docstring task.

Input Format

Input: "Summarize {language}: {function_code}"

Target: "{documentation}"
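As a small illustration of the format above, the helper below builds one training example from a function and its documentation. The function name `build_example` and the dictionary keys are hypothetical; only the prompt template comes from the text.

```python
def build_example(language: str, function_code: str, documentation: str) -> dict:
    """Format one code-documentation pair using the prompt template above."""
    return {
        # Encoder input: task prefix, language name, then the raw function source.
        "input": f"Summarize {language}: {function_code}",
        # Decoder target: the reference documentation, unchanged.
        "target": documentation,
    }


example = build_example(
    "Python",
    "def add(a, b):\n    return a + b",
    "Return the sum of a and b.",
)
```

Both strings would then be tokenized before being passed to the model.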

How It Works

The encoder reads the full source code using bidirectional multi-head self-attention
Each token attends to all other tokens — capturing long-range code structure
The decoder generates the docstring token by token
Cross-attention allows the decoder to focus on relevant parts of the encoded source
Output ends when the model produces the end-of-sequence token
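The decoding loop described in the last three points can be sketched without a real model. In the snippet below, `toy_next_token` is a hypothetical stub that stands in for the decoder (which in practice would score the vocabulary using cross-attention over the encoded source), and the `"</s>"` string is used as a placeholder end-of-sequence marker. Greedy selection is an assumption; the text does not state which decoding strategy the project uses.

```python
from typing import Callable, List

EOS = "</s>"  # placeholder end-of-sequence marker


def greedy_decode(
    next_token: Callable[[List[str], List[str]], str],
    source_tokens: List[str],
    max_new_tokens: int = 256,
) -> List[str]:
    """Generate tokens one at a time until EOS or the length limit is reached."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        # Each step sees the full source and everything generated so far.
        token = next_token(source_tokens, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


def toy_next_token(source_tokens: List[str], generated: List[str]) -> str:
    """Hypothetical stub: replays a fixed docstring, then emits EOS."""
    script = ["Return", "the", "sum", "."]
    return script[len(generated)] if len(generated) < len(script) else EOS


docstring_tokens = greedy_decode(toy_next_token, ["def", "add", "(", "a", ",", "b", ")"])
```

The `max_new_tokens` limit plays the same role as the 256-token output length: generation stops there even if no end-of-sequence token has been produced.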

CodeT5-base was selected because its encoder-decoder architecture is designed specifically for code understanding and generation, which makes it better suited to generating structured documentation than general-purpose models such as Llama 3.1 8B.

The model has 222M parameters and was fine-tuned on the Code2Doc dataset with 9,192 training, 1,022 validation, and 1,293 test samples, using an input length of 512 tokens and an output length of 256 tokens. Training ran on NVIDIA T4 GPUs through Kaggle with an effective batch size of 16, achieved through gradient accumulation.

Training loss decreased steadily and validation BLEU increased across epochs, reaching 0.512 in the final epoch.
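To make the gradient-accumulation figure concrete, the arithmetic can be sketched as follows. The per-device batch size of 4 and the 4 accumulation steps are assumptions chosen only because they multiply to the reported effective batch size of 16; the actual split used in the project is not stated.

```python
import math


def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_train_samples: int, effective_batch: int) -> int:
    """Optimizer updates per epoch, counting a final partial batch."""
    return math.ceil(num_train_samples / effective_batch)


# Assumed split: 4 samples per forward pass, gradients accumulated over 4 passes.
batch = effective_batch_size(per_device_batch=4, accumulation_steps=4)
steps = optimizer_steps_per_epoch(num_train_samples=9192, effective_batch=batch)
```

With 9,192 training samples this works out to 575 updates per epoch if the last partial batch is kept.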

Evaluation

The generated documentation was evaluated using:

BLEU Score
ROUGE-L Score
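For intuition about ROUGE-L, the sketch below computes an F1 score from the longest common subsequence (LCS) of two whitespace-tokenized, lowercased strings. This is a simplified illustration, not the evaluation code used in the project: standard ROUGE packages apply their own tokenization and optional stemming, so their scores can differ.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (simplified)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("Return the sum", "Return the total sum")
```

Here the LCS has 3 tokens, so precision is 3/3, recall is 3/4, and the F1 score is 6/7.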

Deployment

The final trained model was integrated into a Streamlit web application for interactive documentation generation.

Dashboard

Results

Key Findings

CodeT5 significantly outperformed the fine-tuned Llama 3.1 8B baseline despite being much smaller.
Hyperparameter tuning improved both BLEU and ROUGE-L performance.
High-quality curated datasets proved more effective than larger noisy datasets.

Conclusions

DocForge demonstrates that domain-specific transformer models can effectively generate accurate and useful code documentation. CodeT5 fine-tuned on the curated Code2Doc dataset reached a validation BLEU of 0.512 and outperformed the much larger fine-tuned Llama 3.1 8B baseline.

Future Scope

Support additional programming languages
Improve semantic accuracy
Add repository-level documentation generation
Include human feedback evaluation mechanisms

References

Karaman, R.K. & Akarsu, M. (2025). Code2Doc: A Quality-First Curated Dataset for Code Documentation.

Wang, Y. et al. (2021). CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation.

Vaswani, A. et al. (2017). Attention Is All You Need.

Papineni, K. et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation.

Lin, C.Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries.

Mentors and Mentees Details

Mentors

Shriya Bharadwaj
Priyadharshni S

Mentees

Dhruv Bhavesh Chokshi
Harsh Raj
Aadit Munje
Shreevarna S Rao
Dharsini Nakulan

Virtual Expo 2026
