Introduction 

Building NLP systems for domain-specific environments requires more than selecting a single model architecture. Real-world classification workflows often involve continuous experimentation, iterative training pipelines, evaluation tuning, and scalable experiment management.

In this project, our team not only researched and evaluated multiple NLP approaches but also implemented end-to-end classification pipelines for both single-label and multi-label classification workflows. The work focused on building an experimentation-driven NLP ecosystem capable of supporting traditional SpaCy-based pipelines, deep learning architectures, and transformer-based models while maintaining reproducibility, scalability, and efficient experiment tracking across iterations.

The implementation involved designing standardized training and evaluation workflows, benchmarking different model architectures, optimizing classification performance for domain-specific text, and integrating MLOps tooling such as MLflow for experiment tracking and DVC for dataset and model versioning.

Figure 1: End-to-end NLP classification architecture illustrating data ingestion, preprocessing, experimentation across SpaCy, deep learning, and transformer-based models, experiment tracking with MLflow, dataset and model versioning with DVC, and deployment through Dockerized AWS Spot Instance infrastructure with Amazon EFS-backed model storage. 

Exploring Multiple NLP Approaches

SpaCy-Based Classification Pipelines 

The initial workflow leveraged SpaCy-based classification pipelines for lightweight and fast experimentation. These pipelines provided an efficient starting point for training custom classifiers on structured domain-specific datasets. 
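
As a rough illustration of such a starting point, the sketch below sets up a blank pipeline with a text categorizer using the spaCy v3 API. The label names and the helper functions `to_cats` and `build_textcat` are hypothetical and are not taken from the project's code. spaCy is imported lazily, so the label helper can be used on its own.

```python
from typing import Dict, Iterable, List


def to_cats(gold: Iterable[str], labels: List[str]) -> Dict[str, float]:
    """Build the label -> score dict that spaCy text categorizers train on."""
    gold_set = set(gold)
    return {label: 1.0 if label in gold_set else 0.0 for label in labels}


def build_textcat(labels: List[str], multilabel: bool = False):
    """Create a blank English pipeline with a text categorizer (spaCy v3)."""
    import spacy  # lazy import: to_cats stays usable without spaCy installed

    nlp = spacy.blank("en")
    # "textcat" assumes exactly one label per text, while
    # "textcat_multilabel" scores every label independently.
    component = "textcat_multilabel" if multilabel else "textcat"
    textcat = nlp.add_pipe(component)
    for label in labels:
        textcat.add_label(label)
    return nlp


# Hypothetical label set, for illustration only.
LABELS = ["SYMPTOM", "TREATMENT", "OUTCOME"]
print(to_cats(["SYMPTOM", "OUTCOME"], LABELS))
```

The returned pipeline still has to be initialized and trained on annotated examples before it produces meaningful scores.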

Figure 2: Lightweight SpaCy classification pipeline used for rapid experimentation, efficient training, and low-latency inference in domain-specific NLP workflows. 

The team explored:

  • Designing scalable text preprocessing workflows involving normalization, tokenization, stopword removal, lemmatization, and train-validation data preparation to improve data consistency across classification experiments.
  • Building custom SpaCy-based text categorization pipelines for both single-label and multi-label classification tasks.
  • Tuning confidence score thresholds to reduce false positives, improve classification reliability, and better control prediction behavior for production-oriented NLP workflows.
  • Optimizing pipeline execution and inference performance to support faster experimentation cycles, lightweight deployments, and efficient model serving within containerized environments.
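
The threshold tuning mentioned above can be sketched in plain Python. This is a minimal example under assumptions: the scores, gold labels, candidate thresholds, and the 0.9 precision target are invented for illustration, and the function names are hypothetical rather than part of the project's code.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def predict_labels(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep every label whose confidence score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


def precision_recall(
    scores: Sequence[float], gold: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall for one label at a given threshold."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted:
            fp += 1
        elif is_positive:
            fn += 1
    # By convention, report 1.0 when there is nothing to be wrong about.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def lowest_threshold_for_precision(
    scores: Sequence[float],
    gold: Sequence[bool],
    target: float,
    candidates: Sequence[float],
) -> Optional[float]:
    """Return the smallest candidate threshold whose precision meets the target."""
    for threshold in sorted(candidates):
        precision, _ = precision_recall(scores, gold, threshold)
        if precision >= target:
            return threshold
    return None


# Invented validation scores for a single label.
SCORES = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30]
GOLD = [True, True, False, True, False, False]
CANDIDATES = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

print(lowest_threshold_for_precision(SCORES, GOLD, 0.9, CANDIDATES))  # 0.7
print(predict_labels({"SYMPTOM": 0.91, "TREATMENT": 0.42, "OUTCOME": 0.73}, 0.7))
```

Choosing the lowest threshold that meets a precision target keeps recall as high as possible while still limiting false positives. Other selection criteria, such as maximizing F1, would fit the same structure.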

Deep Learning Model Experiments

To improve contextual understanding and classification accuracy, the workflow expanded into deep learning-based architectures.

Figure 3: End-to-end deep learning classification pipeline demonstrating how embedding representations and sequence-learning architectures are used to capture contextual information within domain-specific text. 

The experimentation included:

  • Implementing CNN-based text classification models to capture localized contextual patterns and evaluate lightweight deep learning approaches for domain-specific sentence classification tasks.
  • Exploring RNN and BiLSTM architectures to better understand sequential dependencies and contextual relationships within structured and unstructured textual data.

Figure 4: Evolution from static word embeddings to contextual transformer representations, enabling richer semantic feature extraction for domain-specific NLP classification tasks. 

  • Experimenting with embedding-based sequence modeling techniques using pretrained and trainable embeddings to improve semantic representation learning across classification workflows.
  • Building and evaluating both single-label and multi-label deep learning classification pipelines to support different prediction and annotation strategies.
  • Performing comparative evaluation across imbalanced datasets to analyze model robustness, label bias, generalization capability, and classification consistency under varying data distributions.
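
To make the imbalanced-data comparison concrete, the sketch below computes per-class F1 and macro-averaged F1 for a single-label task in plain Python. The toy labels are invented and the function names are hypothetical. It only illustrates why accuracy alone can hide label bias; it does not reproduce the project's evaluation code.

```python
from collections import Counter
from typing import Dict, List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1 score for each label in a single-label classification task."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denominator if denominator else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare labels count as much as common ones."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Invented data: a model that always predicts the majority label.
GOLD = ["common"] * 9 + ["rare"]
PRED = ["common"] * 10

print(accuracy(GOLD, PRED))      # 0.9
print(per_class_f1(GOLD, PRED))  # F1 for "rare" is 0.0
print(macro_f1(GOLD, PRED))
```

Here accuracy is 0.9 even though the rare label is never predicted, while macro F1 falls below 0.5 and exposes the bias.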

Transformer-Based Architectures 

As experimentation matured, transformer-based models were introduced to capture richer contextual representations and improve semantic understanding across domain-specific classification workflows.

The work included:

  • Integrating contextual embedding models including transformer-based architectures like SciBERT to improve representation learning for domain-specific textual data.
  • Exploring domain-adapted transformer workflows using SciBERT-based pipelines within both NLTK-driven preprocessing workflows and SpaCy transformer architectures for advanced text classification tasks.

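      import spacy

      # Requires scispaCy and its en_core_sci_scibert model package to be installed.
      nlp = spacy.load("en_core_sci_scibert")

      doc = nlp("Patient shows improved appetite")

      # Note: transformer pipelines typically expose contextual output via
      # doc._.trf_data; doc.vector may be empty if no static vectors are loaded.
      print(doc.vector.shape)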

  • Implementing fine-tuning strategies for transformer-based classification models to optimize performance across single-label and multi-label prediction scenarios.
  • Building hybrid transformer + NLP pipelines combining traditional preprocessing techniques with contextual transformer embeddings to improve classification consistency and semantic understanding.
  • Evaluating transformer architectures against baseline SpaCy and deep learning models to compare contextual learning capability, classification accuracy, and inference trade-offs.
  • Experimenting with transformer-integrated SpaCy pipelines using transformer-enabled components for streamlined training, inference, and domain-specific NLP experimentation workflows.
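
The difference between the single-label and multi-label prediction scenarios mentioned above largely comes down to how the classifier's output logits are turned into decisions. The following plain-Python sketch shows the usual convention: softmax with argmax for single-label, and independent sigmoids with a threshold for multi-label. The logits, label names, and 0.5 threshold are invented for illustration and are not the project's settings.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits into probabilities that sum to 1 (single-label case)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (multi-label case)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def single_label_decision(logits: Sequence[float], labels: Sequence[str]) -> str:
    """Pick the one label with the highest softmax probability."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]


def multi_label_decision(
    logits: Sequence[float], labels: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Keep every label whose independent sigmoid probability reaches the threshold."""
    return [label for label, x in zip(labels, logits) if sigmoid(x) >= threshold]


# Invented logits for three hypothetical labels.
LABELS = ["SYMPTOM", "TREATMENT", "OUTCOME"]
LOGITS = [2.0, 1.5, -3.0]

print(single_label_decision(LOGITS, LABELS))  # SYMPTOM
print(multi_label_decision(LOGITS, LABELS))   # ['SYMPTOM', 'TREATMENT']
```

With the same logits, the single-label head commits to one class, whereas the multi-label head can return several labels or none.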

In these experiments, transformer-based models improved contextual understanding and semantic feature extraction, especially for complex domain-specific text where traditional NLP and sequence-learning architectures showed limitations.

Experiment Tracking with MLflow 

To manage growing experimentation complexity, MLflow was integrated into the workflow for centralized experiment tracking.

Figure 5: Centralized MLflow experiment repository used to organize, monitor, and manage multiple NLP classification runs across different model architectures. 

The platform was used to:

  • Track model parameters, hyperparameters, and training configurations across multiple NLP experiments, enabling consistent comparison between different model architectures and training strategies.
  • Store and monitor evaluation metrics such as accuracy, precision, recall, F1-score, and loss values to measure model performance throughout the experimentation lifecycle.

Figure 6: Example of MLflow-based experiment tracking, illustrating real-time visualization of training loss, training accuracy, and validation accuracy for a deep learning classification workflow. 

  • Compare experiment runs and benchmark results across SpaCy, deep learning, and transformer-based models to identify the most effective configurations.

Figure 7: Comparative analysis of NLP experiment runs in MLflow, enabling side-by-side evaluation of hyperparameters, model configurations, and classification performance metrics. 

  • Log training artifacts including trained models, configuration files, and evaluation outputs, providing centralized access to experiment-related assets.
  • Improve reproducibility and collaboration by maintaining a structured history of experiments, making it easier to revisit, validate, and reproduce previous results.

MLflow provided a unified view of experimentation across SpaCy, deep learning, and transformer-based workflows, enabling efficient comparison of model performance, reproducible training processes, and streamlined collaboration during iterative model development. 
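
A minimal sketch of this kind of tracking is shown below, using MLflow's fluent API (`set_experiment`, `start_run`, `log_params`, `log_metrics`). The experiment name, run name, configuration keys, and the `flatten` and `log_run` helpers are assumptions made for illustration, not the project's actual setup. MLflow is imported lazily, so the config helper works without it installed.

```python
from typing import Any, Dict, List


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config into dotted keys, e.g. {"optimizer.lr": 0.001}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_run(
    experiment: str,
    run_name: str,
    config: Dict[str, Any],
    history: List[Dict[str, float]],
) -> None:
    """Log one training run: parameters once, metrics once per epoch."""
    import mlflow  # lazy import: flatten stays usable without MLflow installed

    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(flatten(config))
        for epoch, metrics in enumerate(history):
            # The step argument lets the MLflow UI plot each metric over epochs.
            mlflow.log_metrics(metrics, step=epoch)


# Hypothetical configuration, for illustration only.
CONFIG = {"model": "bilstm", "optimizer": {"name": "adam", "lr": 0.001}}
print(flatten(CONFIG))
```

A call such as `log_run("nlp-classification", "bilstm-baseline", CONFIG, history)` would then record the run under the given experiment, where `history` is a list of per-epoch metric dictionaries like `{"train_loss": 0.42, "val_accuracy": 0.88}`.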

Dataset and Model Versioning with DVC

As datasets and model artifacts evolved, DVC was incorporated to improve version control and reproducibility.

Figure 8: DVC-based dataset and model versioning workflow illustrating traceable dataset evolution, model artifact management, and reproducible training pipelines.

The workflow supported:

  • Versioning datasets alongside model development activities, enabling the team to track data changes, compare dataset revisions, and maintain consistency across training experiments.
  • Managing model artifacts through version-controlled storage, making it easier to track model evolution, validate improvements, and maintain historical records of trained models.
  • Building reproducible experiment pipelines by linking specific datasets, configurations, and model versions to individual training runs.
  • Supporting reliable rollback and recovery capabilities, allowing previous dataset and model versions to be restored whenever validation, comparison, or troubleshooting was required.
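
DVC tracks files by content hash, which is what makes it possible to tie a training run to an exact dataset and model version. The sketch below illustrates that underlying idea in plain Python. It is not DVC's implementation or API: the SHA-256 choice, the manifest layout, and the function names are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Hash a config via canonical JSON, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def build_manifest(dataset: Path, model: Path, config: Dict[str, Any]) -> Dict[str, str]:
    """Record which dataset, config, and model artifact belong to one run."""
    return {
        "dataset": file_fingerprint(dataset),
        "config": config_fingerprint(config),
        "model": file_fingerprint(model),
    }


# The same settings in a different key order produce the same fingerprint.
print(config_fingerprint({"lr": 0.001, "epochs": 5})
      == config_fingerprint({"epochs": 5, "lr": 0.001}))  # True
```

Storing such a manifest next to each run's results means any later change to the data, the configuration, or the model artifact shows up as a changed fingerprint.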

Key Takeaways

This project demonstrated that building effective domain-specific NLP classification systems requires continuous experimentation across multiple modeling approaches rather than relying on a single architecture. Different techniques offered unique advantages in terms of contextual understanding, training complexity, inference performance, and scalability.

Through the implementation and evaluation of SpaCy-based pipelines, deep learning architectures, and transformer-powered workflows, the team established a structured framework for comparing and optimizing classification performance across both single-label and multi-label use cases.

The integration of MLOps practices through MLflow and DVC further enhanced experiment management, reproducibility, and collaboration by providing centralized experiment tracking and version-controlled datasets and model artifacts.

Overall, the approach enabled faster experimentation cycles, reliable benchmarking across model generations, and the development of scalable NLP workflows capable of adapting to evolving domain-specific classification requirements.