Virtual Expo 2026

DocForge

Envision CompSoc

Github repository: https://github.com/Shriyabh11/DocForge

Google meet link:https://meet.google.com/qxg-jphp-hnb

Aim

To build, fine-tune, and evaluate a transformer-based model that automatically generates complete function-level documentation from source code, including descriptions, parameter explanations, and return values.

Introduction

Software documentation is essential for maintainable and understandable code, yet developers often neglect writing detailed documentation due to time constraints and project complexity.

DocForge addresses this problem by automatically generating high-quality documentation directly from source code using deep learning and natural language processing techniques. The project fine-tunes CodeT5, a transformer model specialized for code understanding and generation, on the Code2Doc dataset containing curated function-documentation pairs across multiple programming languages.

The system analyzes code structure and semantics to produce human-readable descriptions, parameter explanations, and return value summaries. The final model is integrated into a Streamlit web application for real-time documentation generation.

Literature Survey and Technologies Used

Literature Survey

  • Code2Doc (2025) introduced a high-quality curated dataset for automatic code documentation generation using strict filtering and quality scoring techniques.
  • CodeT5 (2021) proposed an encoder-decoder transformer architecture specifically designed for programming language understanding and code generation tasks.
  • Attention Is All You Need (2017) introduced the transformer architecture and attention mechanisms that form the foundation of modern large language models.
  • BLEU and ROUGE-L metrics are widely used evaluation techniques for measuring text generation quality and similarity between generated and reference outputs.

Technologies Used

  • Python
  • PyTorch
  • Hugging Face Transformers
  • CodeT5-base
  • Streamlit
  • Kaggle GPU (NVIDIA T4)
  • BLEU and ROUGE-L evaluation metrics

Methodology

Dataset 

The Code2Doc dataset containing 13,358 curated code-documentation pairs was used. The dataset includes Python, Java, TypeScript, JavaScript, and C++ functions.

The dataset undergoes:

  • Basic filtering
  • Quality scoring
  • Deduplication
  • AI-content detection

Dataset Statistics

The Code2Doc dataset used for training contains 13,358 curated code-documentation pairs after a four-stage filtering process. It includes five programming languages: Python, Java, TypeScript, JavaScript, and C++. Java formed the largest portion of the dataset with 61.4% samples, while Python contributed 27%. The dataset achieved a mean quality score of 6.93/10, ensuring reliable training data.

Four-Stage Curation Pipeline

Stage 1 — Basic Filtering: removes trivial docs, test functions, placeholders

Stage 2 — Quality Scoring: 8-dimension weighted score, threshold ≥ 6.0

Stage 3 — Deduplication: exact hash + MinHash-LSH near-duplicate removal

Stage 4 — AI Content Detection: heuristic flagging of LLM-generated docs

 

Model & Architecture

CodeT5 — Seq2Seq Fine-Tuning

CodeT5 is an encoder-decoder transformer pre-trained on large-scale code corpora (GitHub, CodeSearchNet). It is purpose-built for code understanding and generation, making it ideally suited for the code → docstring task.

Input Format

Input:   "Summarize {language}: {function_code}"

Target:  "{documentation}"

How It Works

  • The encoder reads the full source code using bidirectional multi-head self-attention
  • Each token attends to all other tokens — capturing long-range code structure
  • The decoder generates the docstring token by token
  • Cross-attention allows the decoder to focus on relevant parts of the encoded source
  • Output ends when the model produces the end-of-sequence token

CodeT5-base was selected for this project because it is specifically designed for code understanding and generation tasks using an encoder-decoder architecture, making it more suitable for generating structured documentation than general-purpose models like Llama 3.1 8B. The model, containing 222M parameters, was fine-tuned on the Code2Doc dataset with 9,192 training, 1,022 validation, and 1,293 test samples, using an input length of 512 tokens and output length of 256 tokens. Training was performed on NVIDIA T4 GPUs through Kaggle with an effective batch size of 16 using gradient accumulation. The model showed consistent improvement during training, with decreasing loss values and increasing validation BLEU scores, reaching 0.512 in the final epoch, indicating improved documentation generation performance.

Evaluation

The generated documentation was evaluated using:

  • BLEU Score
  • ROUGE-L Score

Deployment

The final trained model was integrated into a Streamlit web application for interactive documentation generation.

Dashboard

Results

Key Findings

  • CodeT5 significantly outperformed the fine-tuned Llama 3.1 8B baseline despite being much smaller.
  • Hyperparameter tuning improved both BLEU and ROUGE-L performance.
  • High-quality curated datasets proved more effective than larger noisy datasets.

Conclusions

DocForge demonstrates that domain-specific transformer models can effectively generate accurate and useful code documentation. Fine-tuning CodeT5 on curated datasets produced strong performance and significantly outperformed larger general-purpose language models.

Future Scope

  • Support additional programming languages
  • Improve semantic accuracy 
  • Add repository-level documentation generation
  • Include human feedback evaluation mechanisms

References

Karaman, R.K. & Akarsu, M. (2025). Code2Doc: A Quality-First Curated Dataset for Code Documentation.

Wang, Y. et al. (2021). CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation.

Vaswani, A. et al. (2017). Attention Is All You Need.

Papineni, K. et al. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation.

Lin, C.Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries.

Mentors and Mentees Details

Mentors

  • Shriya Bharadwaj
  • Priyadharshni S

Mentees

  • Dhruv Bhavesh Chokshi
  • Harsh Raj
  • Aadit Munje
  • Shreevarna S Rao
  • Dharsini Nakulan

 

Report Information

Explore More Projects

View All 2026 Projects