Using BertModel in PyTorch

2024-04-02 19:04:59 · 951 views · 泡泡鱼

Introduction

Environment: Python 3.5+, PyTorch 0.4.1/1.0.0

Installation:


pip install pytorch-pretrained-bert

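Required arguments:

--data_dir: "str": Root directory of the data. It holds the three data files train.xxx, dev.xxx and test.xxx.

--vocab_dir: "str": Path to the vocabulary file.

--bert_model: "str": Location of the pretrained BERT model. It must be a gz archive, e.g. "..x/xx/bert-base-chinese.tar.gz ", containing a bert_config.json file and a pytorch_model.bin file.

--task_name: "str": Selects the dataset to use, e.g. "cola", which maps to the corresponding dataset.

--output_dir: "str": Directory where the model's predictions and parameters are saved.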
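
The arguments above could be declared with argparse roughly as follows. This is a minimal sketch, not the original script: the parser layout, the help strings and the sample paths are assumptions for illustration, and only the five flag names come from the list above.

```python
import argparse


def build_parser():
    """Declare the five required arguments listed above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="BERT arguments (sketch)")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="root directory holding train.xxx / dev.xxx / test.xxx")
    parser.add_argument("--vocab_dir", type=str, required=True,
                        help="path to the vocabulary file")
    parser.add_argument("--bert_model", type=str, required=True,
                        help="archive with bert_config.json and pytorch_model.bin")
    parser.add_argument("--task_name", type=str, required=True,
                        help="name used to pick the dataset, e.g. cola")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="where predictions and model parameters are written")
    return parser


# Hypothetical sample values, used only to show how parsing works.
args = build_parser().parse_args([
    "--data_dir", "data/cola",
    "--vocab_dir", "pretrained/vocab.txt",
    "--bert_model", "pretrained/bert-base-chinese.tar.gz",
    "--task_name", "cola",
    "--output_dir", "output",
])
print(args.task_name, args.bert_model)
```

In a real script, parse_args() would be called without a list so that the values come from the command line.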

A simple example:

Import the required packages


import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

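Create the tokenizer


tokenizer = BertTokenizer.from_pretrained(vocab_dir)  # vocab_dir: the value passed as --vocab_dir

Required argument: --vocab_dir, the vocabulary file (plain text, one token per line).

Methods:

tokenize: Splits an input sentence into tokens using the vocabulary from --vocab_dir and greedy longest-match-first matching. Returns a list of tokens.

convert_tokens_to_ids: Converts a list of tokens into the list of their vocabulary ids.

convert_ids_to_tokens: Converts a list of ids back into a list of tokens.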


text = '[CLS] 武松打老虎 [SEP] 你在哪 [SEP]'
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0,0,0,0, 1,1, 1, 1, 1, 1, 1, 1]
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

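The special markers ([CLS]/[SEP]) appear to be tokenized incorrectly here: they are lower-cased and split into pieces instead of being kept as single tokens. In addition, Chinese BERT encodes text at the character level, so the sentences come out as individual Chinese characters: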


['[', 'cl', '##s', ']', '武', '松', '打', '老', '虎', '[', 'sep', ']', '你', '在', '哪', '[', 'sep', ']']
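
One way to avoid splitting the markers is to tokenize each sentence on its own and add [CLS]/[SEP] to the token list afterwards. The sketch below illustrates the idea under one assumption: char_tokenize is a hypothetical stand-in for tokenizer.tokenize (one token per character), used so that the example runs without a vocabulary file. build_pair is likewise an illustrative helper name.

```python
def char_tokenize(sentence):
    """Hypothetical stand-in for tokenizer.tokenize: one token per character."""
    return [ch for ch in sentence if not ch.isspace()]


def build_pair(sentence_a, sentence_b, tokenize=char_tokenize):
    """Build the token list and segment ids for a sentence pair."""
    tokens_a = tokenize(sentence_a)
    tokens_b = tokenize(sentence_b)
    # Special markers are appended as whole tokens, so they are never split.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair("武松打老虎", "你在哪")
print(tokens)
print(segment_ids)
```

With a real tokenizer, the resulting token list can be passed to tokenizer.convert_tokens_to_ids as in the example above.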

Create the BERT model and load the pretrained weights:


model = BertModel.from_pretrained(bert_model)  # bert_model: the value passed as --bert_model

Move everything to the GPU:


tokens_tensor = tokens_tensor.cuda()
segments_tensors = segments_tensors.cuda()
model.cuda()

Forward pass:


encoded_layers, pooled_output = model(tokens_tensor, segments_tensors)

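Parameters:

input_ids: (batch_size, seq_len) tensor of token ids for the input instances.

token_type_ids=None: (batch_size, seq_len) tensor of segment ids. An instance may contain two sentences; 0 marks the tokens of the first sentence and 1 the tokens of the second.

attention_mask=None: (batch_size, seq_len) tensor with 1 for real tokens and 0 for padding. It masks attention for instances that are shorter than the longest one in the batch.

output_all_encoded_layers=True: Controls whether the outputs of all encoder layers are returned, or only that of the last layer.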
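
To batch sentences of different lengths, the id lists are usually padded to a common length, and attention_mask marks which positions are real. A minimal sketch, assuming a pad id of 0 and a hypothetical helper name pad_batch:

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad id lists to the longest length and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids))
                      for ids in batch_ids]
    return input_ids, attention_mask


batch = [
    [101, 2769, 3221, 704, 1744, 782, 102],
    [101, 2769, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Both lists would then be wrapped with torch.tensor(...) before being passed to the model as input_ids and attention_mask.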

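Return values:

encoded_layers: a list of length num_hidden_layers whose elements are tensors of shape (batch_size, sequence_length, hidden_size), one per encoder layer. With output_all_encoded_layers=False, only the last layer's tensor is returned.

pooled_output: (batch_size, hidden_size) tensor obtained by passing the last encoder layer's hidden state for the first token, [CLS], through a Linear layer and a Tanh() activation. It represents the sentence as a whole.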
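
The relationship between the two return values can be illustrated with plain tensors. The sketch below uses random data and a freshly initialised nn.Linear as a stand-in for the pretrained pooler weights (an assumption for illustration), so the numbers are meaningless; only the shapes and the operations matter.

```python
import torch
from torch import nn

batch_size, seq_len, hidden_size = 2, 7, 768

# Stand-in for the last element of encoded_layers (random values).
last_layer = torch.randn(batch_size, seq_len, hidden_size)

# Stand-in for the pooler's Linear layer (randomly initialised here).
dense = nn.Linear(hidden_size, hidden_size)

# Take the hidden state of the first token ([CLS]) of every instance,
# then apply Linear followed by Tanh.
cls_state = last_layer[:, 0]
pooled_output = torch.tanh(dense(cls_state))

print(cls_state.shape)
print(pooled_output.shape)
```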

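Addendum: using BERT with PyTorch

The main steps are as follows:

Download the model and place it in a directory.

Load the model and tokenizer with BertModel and BertTokenizer from transformers.

Use the tokenizer's encode and decode functions to encode and decode text; note the add_special_tokens and skip_special_tokens parameters.

The input to forward is a tensor of shape [batch_size, seq_length]; also pay attention to the attention_mask parameter.

The output is a tuple whose first element is the hidden_state of BERT's last transformer layer, with size [batch_size, seq_length, hidden_size]. This is BERT's final output, which is then used for downstream tasks.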


# -*- encoding: utf-8 -*-
import os
import warnings
from os.path import abspath, dirname

import torch
from transformers import BertModel, BertTokenizer

warnings.filterwarnings('ignore')

root_dir = dirname(dirname(dirname(abspath(__file__))))
# Download the pretrained model from the official site and place it in this directory
pretrained_path = os.path.join(root_dir, 'pretrained/bert_zh')
# Load the BERT model from disk
model = BertModel.from_pretrained(pretrained_path)
# Load the vocabulary from the same directory
tokenizer = BertTokenizer.from_pretrained(pretrained_path)
print(f'vocab size :{tokenizer.vocab_size}')
# Encode '[PAD]' and '[SEP]'
print(tokenizer.encode('[PAD]'))
print(tokenizer.encode('[SEP]'))
# Encode a Chinese sentence. Special tokens are added by default:
# [CLS] at the start of the sentence and [SEP] at the end
ids = tokenizer.encode("我是中国人", add_special_tokens=True)
# In the result, 101 is the id of [CLS] and 2769 is the id of "我"
# [101, 2769, 3221, 704, 1744, 782, 102]
print(ids)
# Decode the ids back into text; special tokens are not skipped by default
print(tokenizer.decode([101, 2769, 3221, 704, 1744, 782, 102], skip_special_tokens=False))
# print(model)
inputs = torch.tensor(ids).unsqueeze(0)
# forward: result[0] is the last hidden state (inputs is already a tensor)
result = model(inputs)
# [1, 7, 768]: 5 characters plus [CLS] and [SEP]
print(result[0].size())
# [1, 768]
print(result[1].size())
for name, parameter in model.named_parameters():
  # Print the name of every parameter, layer by layer
  print(name)
  # Every parameter has requires_grad=True by default, i.e. it is trainable
  print(parameter.requires_grad)
  # To train only transformer layer 11, freeze everything else.
  # Match the full prefix, since a bare '11' could match unrelated names.
  parameter.requires_grad = 'encoder.layer.11.' in name
print([p.requires_grad for name, p in model.named_parameters()])
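
After freezing, it is common to hand only the still-trainable parameters to the optimizer. The sketch below shows the pattern on a small stand-in module rather than on BERT itself. The module, its layer names and the learning rate are assumptions for illustration.

```python
import torch
from torch import nn

# Small stand-in model with two named layers (not BERT).
model = nn.Sequential()
model.add_module("layer0", nn.Linear(8, 8))
model.add_module("layer1", nn.Linear(8, 8))

# Freeze everything except the parameters of layer1.
for name, parameter in model.named_parameters():
    parameter.requires_grad = name.startswith("layer1.")

# Give the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)

# One dummy training step: gradients appear only on layer1.
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

print(len(trainable))
print(model.layer0.weight.grad is None)
```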

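How to build an attention mask:

Here 101 is [CLS], 102 is [SEP] and 0 is [PAD]. Note that notpad on its own is the usual attention_mask for BertModel; the final mask additionally excludes [CLS] and [SEP].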


>>> import torch
>>> a = torch.tensor([[101, 3, 4, 23, 11, 1, 102, 0, 0, 0]])
>>> a
tensor([[101,  3,  4, 23, 11,  1, 102,  0,  0,  0]])
>>> notpad = a!=0
>>> notpad
tensor([[ True, True, True, True, True, True, True, False, False, False]])
>>> notcls = a!=101
>>> notcls
tensor([[False, True, True, True, True, True, True, True, True, True]])
>>> notsep = a!=102
>>> notsep
tensor([[ True, True, True, True, True, True, False, True, True, True]])
>>> mask = notpad & notcls & notsep
>>> mask
tensor([[False, True, True, True, True, True, False, False, False, False]])
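
A mask like this one, which excludes [CLS], [SEP] and [PAD], can be used for example to average hidden states over the content tokens only. A minimal sketch with random hidden states; the hidden size of 4 and the variable names are assumptions for illustration.

```python
import torch

a = torch.tensor([[101, 3, 4, 23, 11, 1, 102, 0, 0, 0]])
# True only for content tokens: not [PAD], not [CLS], not [SEP].
mask = (a != 0) & (a != 101) & (a != 102)

# Stand-in for the model's last hidden state: [batch, seq_len, hidden].
hidden = torch.randn(1, 10, 4)

# Zero out the excluded positions, then divide by the number of kept tokens.
weights = mask.unsqueeze(-1).float()
mean_state = (hidden * weights).sum(dim=1) / weights.sum(dim=1)

print(mask.sum().item())
print(mean_state.shape)
```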

The above is based on personal experience; I hope it serves as a useful reference. Corrections for any errors or omissions are welcome.

Article link: https://lsjlt.com/news/122441.html (please credit the source link when reposting)
