Natural Language Processing Toolkit

The Natural Language Processing Toolkit (NLTK) is a Python-based application that offers a suite of tools for processing natural language data. It provides APIs for quickly applying pretrained NLP models to your text, covering tasks such as Text Summarization and Sentence Similarity. It also includes a demo user interface built with Streamlit.

Introduction

Natural Language Processing (NLP) is a field of computer science and artificial intelligence that focuses on the interaction between computers and humans in natural language. It involves developing algorithms and models that can analyze, understand, and generate human language. NLP is used in a wide range of applications, including Text Summarization, Sentence Similarity, Chatbots, Grammar Correction, and more.

Our Approaches

01. Text Summarization

Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, whereas other models can generate entirely new text.

Our Text Summarization approach uses a LongT5 model that has been fine-tuned on a large dataset of text-summary pairs. Fine-tuning involves feeding the model pairs of input texts and their corresponding summaries and optimizing it to predict accurate summaries.

The model was fine-tuned using techniques such as transfer learning, curriculum learning, and multi-task learning to improve its performance. Additionally, techniques such as beam search and length normalization can be applied to improve the quality of the generated summaries.
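To make beam search and length normalization concrete, here is a minimal sketch over a toy next-token table. The table `NEXT_TOKEN_PROBS`, the function names, and the `alpha` exponent are all illustrative assumptions; the actual toolkit would take token probabilities from the LongT5 decoder, and its decoding settings are not stated in this document.

```python
import math

# Hypothetical toy "model": probability of the next token given the last one.
# A real summarizer would get these probabilities from the LongT5 decoder.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"sat": 0.4, "</s>": 0.6},
    "dog": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}


def normalized_score(log_prob, length, alpha):
    """Total log-probability divided by length**alpha (alpha=0 disables it)."""
    return log_prob / (length ** alpha)


def beam_search(beam_width=2, max_len=5, alpha=0.7):
    """Return the best token sequence under length-normalized scoring."""
    beams = [(["<s>"], 0.0)]  # each beam is (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, log_prob in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidate = (tokens + [token], log_prob + math.log(prob))
                if token == "</s>":
                    finished.append(candidate)  # complete hypothesis
                else:
                    candidates.append(candidate)
        # Keep only the beam_width most probable unfinished hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished beams if nothing ended
    # Length counts generated tokens only, so the "<s>" marker is excluded.
    best = max(pool, key=lambda c: normalized_score(c[1], len(c[0]) - 1, alpha))
    return best[0]


print(beam_search(alpha=0.0))  # ranking by raw log-probability
print(beam_search(alpha=1.0))  # ranking by average log-probability per token
```

In this toy setup, raw log-probability favors the shorter hypothesis, while full normalization (`alpha=1.0`) lets a longer one win. That bias toward short outputs is what length normalization is meant to counter.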

02. Sentence Similarity

The Sentence Similarity model is first fed a pair of input sentences, and the final hidden state of the [CLS] token is extracted. The [CLS] token serves as an aggregated representation of the two input sentences. A fully connected layer is then added on top of the [CLS] token to produce a similarity score between 0 and 1 for the pair. The model is trained on a dataset of sentence pairs with corresponding similarity scores, using either mean squared error loss or binary cross-entropy loss. Once trained, it can be used to compute the similarity between new pairs of input sentences.
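The scoring head described above can be sketched in plain Python. The 4-dimensional vector, the weights, and the bias below are made-up illustrative values (a BERT-style encoder would typically output a much larger [CLS] vector), and using a sigmoid to map the score into the 0 to 1 range is an assumption, since the document does not name the activation.

```python
import math


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def similarity_score(cls_vector, weights, bias):
    """Single-output fully connected layer followed by a sigmoid."""
    logit = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return sigmoid(logit)


def mse_loss(pred, target):
    """Mean squared error for one example."""
    return (pred - target) ** 2


def bce_loss(pred, target, eps=1e-12):
    """Binary cross-entropy for one example; pred is clipped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


# Hypothetical final hidden state of the [CLS] token for one sentence pair.
cls_vector = [0.2, -0.1, 0.4, 0.3]
# Hypothetical parameters of the fully connected layer.
weights = [0.5, -0.25, 1.0, 0.0]
bias = -0.1

score = similarity_score(cls_vector, weights, bias)
print(f"similarity score: {score:.4f}")
print(f"MSE vs. label 0.8: {mse_loss(score, 0.8):.4f}")
print(f"BCE vs. label 1.0: {bce_loss(score, 1.0):.4f}")
```

During training, one of the two losses would be minimized over many labeled sentence pairs, updating both the encoder and the layer's weights.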

03. Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing task that aims to identify and extract entities such as names, locations, organizations, and dates from text. spaCy is a popular Python library for NLP that provides an easy-to-use interface for NER. The basic approach for NER using spaCy involves the following steps:

04. Grammar Correction

Grammar Correction uses a language model to generate grammatically correct sentences from input text. Our approach uses techniques such as sequence-to-sequence models and transformers. The model is trained on a large corpus of text to learn the patterns of grammar and syntax, and is then used to generate sentences that adhere to those rules. The quality of the generated sentences depends on the complexity of the model and on the quality and quantity of the training data.

05. Comment Classification

Comment Classification detects whether text contains toxic content such as threatening language, insults, obscenities, identity-based hate, or sexually explicit language. Our approach uses a BERT model trained on a large civil comments dataset.
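One common way to turn a classifier's raw outputs into per-category flags is to apply an independent sigmoid to each output and compare it with a threshold. The sketch below assumes that multi-label setup; the label names, the example logits, and the 0.5 threshold are illustrative assumptions, not the toolkit's actual configuration.

```python
import math

# Hypothetical label set mirroring the categories named above.
LABELS = ["threat", "insult", "obscene", "identity_hate", "sexually_explicit"]


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def flag_labels(logits, threshold=0.5):
    """Return {label: probability} for every label at or above the threshold.

    Each label is scored independently, so one comment can receive
    several flags at once, or none at all.
    """
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    return {label: p for label, p in probs.items() if p >= threshold}


# Made-up logits standing in for the BERT classifier's output on one comment.
example_logits = [-3.0, 2.0, 0.5, -1.5, -4.0]
flagged = flag_labels(example_logits)
for label, prob in flagged.items():
    print(f"{label}: {prob:.2f}")
```

Raising the threshold trades recall for precision: fewer comments are flagged, and those that are flagged carry higher confidence.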

Usage

Step 01

Open the NLP Toolkit site: https://experiment.saigontechnology.vn/nlp-toolkit/. Alternatively, open the main Saigon Technology AI Research Lab page at https://experiment.saigontechnology.vn/, select the NLP Toolkit section, and click the “Try our demo” button.

Step 02

On the NLP Toolkit page, choose a demo in the sidebar to start.

Step 03

Step 3.1: (Text Summarization) Enter the text in the text area, or simply enter an article URL. The summary of the text or article is displayed at the bottom of the page.

Step 3.2: (Sentence Similarity) Enter the reference sentence and the target sentence in the sidebar, then click the “Submit” button.

Step 3.3: (Named Entity Recognition) Enter the sentence in the text area. Press “Ctrl + Enter” to submit the sentence.

Step 3.4: (Grammar Correction) Select a sample sentence or enter your own. Press “Ctrl + Enter” to submit the sentence.

Step 3.5: (Comment Classifier) Input your sentence in the text area. Press “Ctrl + Enter” to submit your sentence.
