Intent Recognition: Data Processing

  1. Commonly imported packages
    import numpy as np
    from collections import defaultdict  # word-frequency dict
    import pandas as pd
    import gensim  # word2vec
    from sklearn.model_selection import train_test_split  # split the dataset
    from operator import itemgetter
    from tqdm import tqdm
    import os
    import pickle
    
    ### usage
    dict1 = defaultdict(int)
    dict1[word] += 1  # no KeyError when word is new
    
    data = pd.read_csv('')[:20].text  # text column of the first 20 rows
    
    # load a trained word2vec model
    model_w2v = gensim.models.Word2Vec.load('data/wiki.Mode')
    # model_w2v.wv.most_similar("民生银行")  # find the most similar words
    # model_w2v.wv.get_vector("民生银行")  # look up one word's vector
    # model_w2v.wv.syn0  # same as model_w2v.wv.vectors: the full embedding matrix
    # model_w2v.wv.vocab  # words and their vocabulary entries
    # model_w2v.wv.index2word  # the word at each index
    # note: gensim >= 4.0 removed syn0, vocab, and index2word in favour of
    # vectors, key_to_index, and index_to_key
    
    
    train_words,test_words,train_labels,test_labels = train_test_split(x,label,test_size=0.2,random_state=42)
    
    lists = itemgetter(*[0, 1, 2])([a, b, c, d])  # result: (a, b, c)
    
    pickle.dump(obj, open('', 'wb'))  # dump(object, file): object first, then file
    obj = pickle.load(open('', 'rb'))  # load(file) takes only the file object
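A small, self-contained example tying these utilities together (the toy corpus is made up for illustration, and an in-memory buffer stands in for the pickle file):

```python
import io
import pickle
from collections import defaultdict
from operator import itemgetter

# word-frequency counting: defaultdict(int) returns 0 for unseen keys
corpus = ["我 想 查 余额", "查 余额", "我 想 转账"]
vocab = defaultdict(int)
for sentence in corpus:
    for word in sentence.split():
        vocab[word] += 1

# itemgetter fetches several positions at once and returns a tuple
first_two = itemgetter(0, 1)(list(vocab))

# pickle round-trip: dump(obj, file) then load(file)
buf = io.BytesIO()
pickle.dump(dict(vocab), buf)
buf.seek(0)
restored = pickle.load(buf)
```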

     

  2. Device conversion for data processing
    import torch
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
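Once `device` is set, models and tensors are moved onto it with `.to(device)`. A quick sanity check (the layer sizes here are arbitrary, for illustration only):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# every tensor in a forward pass must live on the same device as the model
model = torch.nn.Linear(3, 4).to(device)
x = torch.zeros(2, 3).to(device)
out = model(x)  # shape (2, 4), on `device`
```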

     

  3. Data processing order
    1. Initialize parameters
      parameters = {
          # minimum word-frequency threshold
          'min_count_word':1,
          'word2ind':None,
          'ind2word':None,
          'ind2embedding':None,
          'output_size':None,
          'epoch':20,
          'batch_size':10,
          'embedding_dim':300,
          'hidden_size':128,
          'num_layers':2, # number of stacked LSTM layers
          'dropout':0.5,
          'cuda':device,
          'lr':0.01,
          'num_unknow':0
      }

       

    2. Load the data file into a list (csv/txt)
    3. Create the dataset, generating:
      1. vocab (word-frequency dict), built with defaultdict
      2. add <unk> and <pad> to vocab
      3. drop words below the min-frequency threshold
      4. build word2index, index2word, and ind2embedding (one-hot, word2vec, or BERT)
      5. store them in parameters
    4. Create batch_yield
      1. Shuffle the data, e.g. permutation = np.random.permutation(len(train_datas)). The permutation() function returns a shuffled copy of the original array; by contrast, shuffle() reorders the original array in place
      2. # np.array() is fundamentally different from list: all elements of an np.array() share one dtype, so fancy indexing such as chars = chars[permutation] can be used to reorder the data
      3. Iterate over the data, converting words to indices with itemgetter and appending them to a new list batch_x
      4. When batch_x reaches batch_size, map the indices to embeddings via ind2embedding with itemgetter; all sequences must match the maximum length, padded with <pad> where they fall short
      5. Return x, y, keys, epoch with yield
    5. Optionally save the processed dataset and parameters with pickle
    6. Fetch the processed results
      while True:
          x, y, keys, epoch = next(train_yield)
          if not keys:
              break
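Steps 3, 4, and 6 above can be sketched end to end. This is a minimal illustration under assumptions, not the article's original code: build_vocab and batch_yield are hypothetical helpers, the sentences are invented, and the embedding lookup is left out (batch_x holds word indices):

```python
import numpy as np
from collections import defaultdict

def build_vocab(sentences, min_count_word=1):
    # step 3: frequency dict -> drop rare words -> add <pad>/<unk> -> index maps
    freq = defaultdict(int)
    for sent in sentences:
        for w in sent:
            freq[w] += 1
    words = ["<pad>", "<unk>"] + [w for w, c in freq.items() if c >= min_count_word]
    word2ind = {w: i for i, w in enumerate(words)}
    ind2word = {i: w for w, i in word2ind.items()}
    return word2ind, ind2word

def batch_yield(datas, labels, word2ind, batch_size, epochs):
    # step 4: shuffle, map words to indices, pad to the batch maximum, yield
    pad_ind, unk_ind = word2ind["<pad>"], word2ind["<unk>"]
    for epoch in range(epochs):
        perm = np.random.permutation(len(datas))  # shuffled copy of the indices
        for start in range(0, len(datas), batch_size):
            idx = perm[start:start + batch_size]
            batch_x = [[word2ind.get(w, unk_ind) for w in datas[i]] for i in idx]
            batch_y = [labels[i] for i in idx]
            max_len = max(len(s) for s in batch_x)
            batch_x = [s + [pad_ind] * (max_len - len(s)) for s in batch_x]
            yield batch_x, batch_y, True, epoch
    yield [], [], False, epochs  # sentinel so `if not keys: break` fires

sentences = [["查", "余额"], ["转账", "给", "他"], ["查", "话费"]]
labels = [0, 1, 0]
word2ind, ind2word = build_vocab(sentences)
train_yield = batch_yield(sentences, labels, word2ind, batch_size=2, epochs=1)
```

Consuming train_yield with the while loop from step 6 drains every batch and stops cleanly at the sentinel.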

       
