Beginner’s Guide to FinGPT: Training with LoRA and ChatGLM2–6B

Cost-Effective FinGPT Training: One Notebook, $10 GPU

ByFintech @ AI4Finance Foundation
Oct 6, 2023 · 10 min read

Welcome to this comprehensive guide for beginners diving into the realm of Financial Large Language Models (FinLLMs) with FinGPT. This blog post demystifies the process of training FinGPT using Low-Rank Adaptation (LoRA) on the robust base model ChatGLM2-6B.

Code:

Tool: Google Colab is required to run the Jupyter Notebook

  • Google Colab provides a standardized cloud environment for everyone, which keeps the setup reproducible. In contrast, local environments vary from user to user, so there is no one-size-fits-all recipe for running the notebook locally.

Part 1: Preparing the Data

  • 1.1 Initialize Directories
  • 1.2 Load and Prepare Dataset
  • 1.3 Concatenate and Shuffle Dataset

Part 2: Dataset Formatting and Tokenization

  • 2.1 Dataset Formatting
  • 2.2 Tokenization
  • 2.3 Save the dataset

Part 3: Set up FinGPT training parameters with LoRA on ChatGLM2-6B

  • 3.1 Training Arguments Setup
  • 3.2 Quantization Config Setup
  • 3.3 Model Loading & Preparation
  • 3.4 LoRA Config & Setup

Part 4: Loading Data and Training FinGPT

  • 4.1 Loading Your Data
  • 4.2 Training Configuration and Launch
  • 4.3 Model Saving and Download

Part 5: Inference and Benchmarks using FinGPT

  • 5.1 Load the model
  • 5.2 Run Benchmarks
  • 5.3 Compare it with FinGPT V3.1 results

Part 1: Preparing the Data

Data preparation is a crucial step in training Financial Large Language Models. Here, we’ll guide you on how to get your dataset ready for FinGPT using Python.

In this section, you will initialize your working directory and load a financial sentiment dataset. Let’s break down the steps:

1.1 Initialize Directories:

This block checks if certain paths exist; if they do, it deletes them to avoid data conflicts and creates a new directory for the upcoming data.

import os
import shutil

jsonl_path = "../data/dataset_new.jsonl"
save_path = '../data/dataset_new'

if os.path.exists(jsonl_path):
    os.remove(jsonl_path)

if os.path.exists(save_path):
    shutil.rmtree(save_path)

directory = "../data"
if not os.path.exists(directory):
    os.makedirs(directory)

1.2 Load and Prepare Dataset:

  • Import the necessary libraries from the datasets package.
  • Load the Twitter Financial News Sentiment (TFNS) dataset and convert it to a Pandas dataframe.
  • Map numerical labels to their corresponding sentiments (negative, positive, neutral).
  • Add an instruction to each data entry, which is crucial for instruction tuning.
  • Convert the Pandas dataframe back to a Hugging Face Dataset object.
from datasets import load_dataset
import datasets

dic = {
    0: "negative",
    1: "positive",
    2: "neutral",
}

tfns = load_dataset('zeroshot/twitter-financial-news-sentiment')
tfns = tfns['train']
tfns = tfns.to_pandas()
tfns['label'] = tfns['label'].apply(lambda x:dic[x])
tfns['instruction'] = 'What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.'
tfns.columns = ['input', 'output', 'instruction']
tfns = datasets.Dataset.from_pandas(tfns)
tfns

1.3 Concatenate and Shuffle Dataset:

Concatenating the dataset with itself doubles the number of training examples seen per epoch, and shuffling with a fixed seed keeps the run reproducible.

tmp_dataset = datasets.concatenate_datasets([tfns]*2)
train_dataset = tmp_dataset
print(tmp_dataset.num_rows)

all_dataset = train_dataset.shuffle(seed = 42)
all_dataset.shape

Your training data is now loaded and prepared; next, we format and tokenize it.

Part 2: Dataset Formatting and Tokenization

Once your data is prepared, the next steps involve formatting the dataset for model ingestion and tokenizing the input data. Below, we provide a step-by-step breakdown of the code snippets shared.

2.1 Dataset Formatting:

You must structure your data in a specific format that aligns with the training process.

import json
from tqdm.notebook import tqdm


def format_example(example: dict) -> dict:
    context = f"Instruction: {example['instruction']}\n"
    if example.get("input"):
        context += f"Input: {example['input']}\n"
    context += "Answer: "
    target = example["output"]
    return {"context": context, "target": target}


data_list = []
for item in all_dataset.to_pandas().itertuples():
    tmp = {}
    tmp["instruction"] = item.instruction
    tmp["input"] = item.input
    tmp["output"] = item.output
    data_list.append(tmp)


# save to a jsonl file
with open("../data/dataset_new.jsonl", 'w') as f:
    for example in tqdm(data_list, desc="formatting.."):
        f.write(json.dumps(format_example(example)) + '\n')

After formatting, each line of dataset_new.jsonl contains a context (instruction, input, and answer prompt) and a target (the sentiment label).
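Here is one illustrative line (the tweet text below is invented for demonstration; the real entries come from the TFNS dataset):

{"context": "Instruction: What is the sentiment of this tweet? Please choose an answer from {negative/neutral/positive}.\nInput: $ABC shares slide after the company cuts its full-year guidance\nAnswer: ", "target": "negative"}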

2.2 Tokenization

Tokenization is the process of converting input text into tokens that can be fed into the model.

## install the packages needed to run this code block
!pip install protobuf transformers==4.30.2 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate

import datasets
from transformers import AutoTokenizer, AutoConfig

model_name = "THUDM/chatglm2-6b"
jsonl_path = "../data/dataset_new.jsonl"  # updated path
save_path = '../data/dataset_new'  # updated path
max_seq_length = 512
skip_overlength = True

# The preprocess function tokenizes the prompt and target, combines them into input IDs,
# and then trims or pads the sequence to the maximum sequence length.
def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}

# The read_jsonl function reads each line from the JSONL file, preprocesses it using the preprocess function,
# and then yields each preprocessed example.
def read_jsonl(path, max_seq_length, skip_overlength=False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(
        model_name, trust_remote_code=True, device_map='auto')
    with open(path, "r") as f:
        for line in tqdm(f.readlines()):
            example = json.loads(line)
            feature = preprocess(tokenizer, config, example, max_seq_length)
            if skip_overlength and len(feature["input_ids"]) > max_seq_length:
                continue
            feature["input_ids"] = feature["input_ids"][:max_seq_length]
            yield feature

2.3 Save the dataset

# The script then creates a Hugging Face Dataset object from the generator and saves it to disk.
save_path = '../data/dataset_new'

dataset = datasets.Dataset.from_generator(
    lambda: read_jsonl(jsonl_path, max_seq_length, skip_overlength)
)
dataset.save_to_disk(save_path)

Part 3: Set up FinGPT training parameters with LoRA on ChatGLM2-6B

Training a model can be resource-intensive, so make sure you have a capable GPU (a quick way to check which GPU you were assigned is shown right after this list):

  • You will need a Google Colab GPU plan; Colab Pro is sufficient, or you can simply buy 100 compute units for $10.
  • An NVIDIA A100 is recommended due to its high memory capacity.
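
This optional sanity check uses plain PyTorch (already installed in Part 2) to confirm that a CUDA GPU is attached to the runtime:

import torch

# Confirm a CUDA GPU is attached to the Colab runtime
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))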

3.1 Training Arguments Setup:

Initialize and set training arguments.

# Note: besides the packages installed in Part 2, this part relies on peft,
# bitsandbytes, loguru and datasets; install them with pip if they are missing.
from typing import List, Dict, Optional
import torch
from loguru import logger
from transformers import (
    AutoModel,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import (
    TaskType,
    LoraConfig,
    get_peft_model,
    set_peft_model_state_dict,
    prepare_model_for_kbit_training,
    prepare_model_for_int8_training,
)
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

training_args = TrainingArguments(
    output_dir='./finetuned_model',  # saved model path
    logging_steps=500,
    # max_steps=10000,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=1000,
    save_steps=500,
    fp16=True,
    # bf16=True,
    torch_compile=False,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    remove_unused_columns=False,
)

3.2 Quantization Config Setup:

Set quantization configuration to reduce model size without losing significant precision.

# Quantization
q_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

3.3 Model Loading & Preparation:

Load the base model and tokenizer, and prepare the model for INT8 training.

  • Runtime -> Change runtime type -> A100 GPU
  • Restart the runtime and rerun the cells if the model fails to load
# Load tokenizer & model
# need massive space
model_name = "THUDM/chatglm2-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=q_config,
    trust_remote_code=True,
    device='cuda'
)
model = prepare_model_for_int8_training(model, use_gradient_checkpointing=True)

3.4 LoRA Config & Setup:

Implement Low-Rank Adaptation (LoRA) and print trainable parameters.

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


# LoRA
target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['chatglm']
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias='none',
)
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

# optionally resume from an existing checkpoint
resume_from_checkpoint = None
if resume_from_checkpoint is not None:
    checkpoint_name = os.path.join(resume_from_checkpoint, 'pytorch_model.bin')
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(
            resume_from_checkpoint, 'adapter_model.bin'
        )
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        logger.info(f'Restarting from {checkpoint_name}')
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        logger.info(f'Checkpoint {checkpoint_name} not found')

model.print_trainable_parameters()
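
For intuition about why so few parameters are trainable: LoRA freezes each original weight matrix W of shape d × k and learns a low-rank update B·A of rank r, so only r·(d + k) extra parameters are trained per adapted matrix. A back-of-the-envelope sketch (the 4096 × 4096 shape is purely illustrative, not ChatGLM2-6B's actual layer size):

# Illustrative LoRA parameter count for one d x k weight matrix adapted with rank r
d, k, r = 4096, 4096, 8
full_params = d * k           # parameters in the frozen base weight
lora_params = r * (d + k)     # parameters in the low-rank factors B (d x r) and A (r x k)
print(f"{lora_params} trainable vs. {full_params} frozen "
      f"({100 * lora_params / full_params:.2f}% of the matrix)")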

Part 4: Loading Data and Training FinGPT

In this part, we load the pre-processed data and launch the training of your FinGPT model. Here’s a stepwise breakdown of the script:

  • As in Part 3, a Google Colab GPU plan is needed; Colab Pro is sufficient, or buy 100 compute units for $10.

4.1 Loading Your Data:


# load data
from datasets import load_from_disk
import datasets

dataset = datasets.load_from_disk("../data/dataset_new")
dataset = dataset.train_test_split(0.2, shuffle=True, seed = 42)

4.2 Training Configuration and Launch:

  • Customize the Trainer class for specific loss computation, prediction step, and model-saving methods.
  • Define a data collator function to process batches of data during training.
  • Set up TensorBoard for logging, instantiate your modified trainer, and begin training.
class ModifiedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        ).loss

    def prediction_step(self, model: torch.nn.Module, inputs, prediction_loss_only: bool, ignore_keys=None):
        with torch.no_grad():
            res = model(
                input_ids=inputs["input_ids"].to(model.device),
                labels=inputs["labels"].to(model.device),
            ).loss
        return (res, None, None)

    def save_model(self, output_dir=None, _internal_call=False):
        from transformers.trainer import TRAINING_ARGS_NAME

        os.makedirs(output_dir, exist_ok=True)
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
        saved_params = {
            k: v.to("cpu") for k, v in self.model.named_parameters() if v.requires_grad
        }
        torch.save(saved_params, os.path.join(output_dir, "adapter_model.bin"))


def data_collator(features: list) -> dict:
    len_ids = [len(feature["input_ids"]) for feature in features]
    longest = max(len_ids)
    input_ids = []
    labels_list = []
    for ids_l, feature in sorted(zip(len_ids, features), key=lambda x: -x[0]):
        ids = feature["input_ids"]
        seq_len = feature["seq_len"]
        labels = (
            [tokenizer.pad_token_id] * (seq_len - 1) + ids[(seq_len - 1):] + [tokenizer.pad_token_id] * (longest - ids_l)
        )
        ids = ids + [tokenizer.pad_token_id] * (longest - ids_l)
        _ids = torch.LongTensor(ids)
        labels_list.append(torch.LongTensor(labels))
        input_ids.append(_ids)
    input_ids = torch.stack(input_ids)
    labels = torch.stack(labels_list)
    return {
        "input_ids": input_ids,
        "labels": labels,
    }

from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback

# Train
# Took about 10 compute units
# Took 40 mins to train
writer = SummaryWriter()
trainer = ModifiedTrainer(
    model=model,
    args=training_args,              # Trainer args
    train_dataset=dataset["train"],  # Training set
    eval_dataset=dataset["test"],    # Testing set
    data_collator=data_collator,     # Data Collator
    callbacks=[TensorBoardCallback(writer)],
)
trainer.train()
writer.close()
# save model
model.save_pretrained(training_args.output_dir)

4.3 Model Saving and Download:

After training, save and download your model. You can also check the model’s size.

Make sure to save the model; otherwise, if you restart the session, the model will be lost and you will have to retrain it.

!zip -r /content/saved_model.zip /content/{training_args.output_dir}
# download to local
from google.colab import files
files.download('/content/saved_model.zip')
# save to google drive
from google.colab import drive
drive.mount('/content/drive')


# save the finetuned model to google drive
!cp -r "/content/finetuned_model" "/content/drive/MyDrive"

def get_folder_size(folder_path):
    total_size = 0
    for dirpath, _, filenames in os.walk(folder_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / 1024 / 1024  # Size in MB

model_size = get_folder_size(training_args.output_dir)
print(f"Model size: {model_size} MB")
# Output: Model size: 29.84746265411377 MB

Now your model is trained and saved! You can download it and use it to generate financial insights or for other relevant tasks in the finance domain. TensorBoard lets you visualize the training dynamics and performance of your model in real time; one way to open it inside Colab is shown below.
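
Since SummaryWriter() writes to the default ./runs directory, you can display the dashboard directly in the notebook (assuming the standard TensorBoard notebook extension, which Colab ships with):

# Launch TensorBoard inside the notebook, pointing at SummaryWriter's default log directory
%load_ext tensorboard
%tensorboard --logdir runs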

Happy FinGPT Training! 🚀

Part 5: Inference and Benchmarks using FinGPT

Now that your model is trained, let’s see how to use it for inference and how to run the benchmarks.

5.1 Load the model

#clone the FinNLP repository
!git clone https://github.com/AI4Finance-Foundation/FinNLP.git

import sys
sys.path.append('/content/FinNLP/')


from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

from peft import PeftModel
import torch

# Load benchmark datasets from FinNLP
from finnlp.benchmarks.fpb import test_fpb
from finnlp.benchmarks.fiqa import test_fiqa , add_instructions
from finnlp.benchmarks.tfns import test_tfns
from finnlp.benchmarks.nwgi import test_nwgi

# load model from google drive
from google.colab import drive
drive.mount('/content/drive')


# Define the path you want to check
path_to_check = "/content/drive/My Drive/finetuned_model"

# Check if the specified path exists
if os.path.exists(path_to_check):
    print("Path exists.")
else:
    print("Path does not exist.")


## load the chatglm2-6b base model and attach the adapter finetuned in this session
base_model = "THUDM/chatglm2-6b"
peft_model = training_args.output_dir

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()

## alternatively, load the finetuned adapter saved to Google Drive
base_model = "THUDM/chatglm2-6b"
peft_model = "/content/drive/My Drive/finetuned_model"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model, trust_remote_code=True, load_in_8bit=True, device_map="auto")

model = PeftModel.from_pretrained(model, peft_model)

model = model.eval()
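
Before running the full benchmarks, it can help to sanity-check the loaded model on a single prompt. Below is a minimal sketch using the standard transformers generate API and the same prompt format used during training; the tweet text is invented for illustration:

# Quick single-prompt sanity check (prompt format matches the training data)
prompt = (
    "Instruction: What is the sentiment of this tweet? "
    "Please choose an answer from {negative/neutral/positive}.\n"
    "Input: $XYZ stock jumps after strong quarterly earnings\n"
    "Answer: "
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)
# Decode only the newly generated tokens (the answer)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))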

5.2 Run Benchmarks:

We use four datasets as benchmarks in FinNLP.

batch_size = 8

# TFNS Test Set, len 2388
# Available: 84.85 compute units
res = test_tfns(model, tokenizer, batch_size = batch_size)
# Available: 83.75 compute units
# Took about 1 compute unit for inference


# FPB, len 1212
res = test_fpb(model, tokenizer, batch_size = batch_size)

# FiQA, len 275
res = test_fiqa(model, tokenizer, prompt_fun = add_instructions, batch_size = batch_size)

# NWGI, len 4047
res = test_nwgi(model, tokenizer, batch_size = batch_size)

5.3 Compare it with FinGPT V3.1 results

Comparison

TFNS:
FinGPT V3.1:

  • Acc: 0.876
  • F1 macro: 0.841
  • F1 weighted (following BloombergGPT): 0.875

This notebook:

  • Acc: 0.856
  • F1 macro: 0.806
  • F1 weighted (following BloombergGPT): 0.850

Since we trained on the TFNS training split, strong results on its test set are expected.

FPB:
FinGPT V3.1:

  • Acc: 0.856
  • F1 macro: 0.841
  • F1 weighted: 0.855

This notebook:

  • Acc: 0.741
  • F1 macro: 0.655
  • F1 weighted: 0.694

Considering the FPB dataset was not included in our training set, the obtained zero-shot results are acceptable.

FiQA:
FinGPT V3.1:

  • Acc: 0.836
  • F1 macro: 0.746
  • F1 weighted: 0.850

This notebook:

  • Acc: 0.48
  • F1 macro: 0.5
  • F1 weighted: 0.49

Since the FiQA dataset wasn’t part of our training set, our model’s zero-shot performance is relatively poor compared to FinGPT V3.1.

NWGI:
FinGPT V3.1:

  • Acc: 0.642
  • F1 macro: 0.650
  • F1 weighted: 0.642

This notebook:

  • Acc: 0.521
  • F1 macro: 0.500
  • F1 weighted: 0.490

The results are reasonable, given that NWGI was also not part of our training data.

Conclusion

  • The training and testing of FinGPT in this exercise used a total of 20 compute units: 10 for training and another 10 for inference.
  • At $10 for 100 compute units, that works out to roughly $2 to train and test FinGPT.
  • This cost-effectiveness is primarily attributable to the Low-Rank Adaptation (LoRA) method, which keeps both model training and inference economical.

This exercise provided insights into the performance of your trained FinGPT model across various benchmarks. While there are areas where it excels, certain benchmarks highlight opportunities for improvement and tuning. Exploring additional training data and refining the model further will likely lead to enhanced performance across different financial NLP tasks, making it a powerful tool for various applications in the finance sector.

Happy Experimenting with FinGPT! 🚀
