Intent Recognition: Data Processing

  1. Commonly imported packages
    import numpy as np
    from collections import defaultdict  # word-frequency dict
    import pandas as pd
    import gensim  # word2vec
    from sklearn.model_selection import train_test_split  # split the dataset
    from operator import itemgetter
    from tqdm import tqdm
    import os
    import pickle
    
    ### usage
    dict1 = defaultdict(int)
    dict1[word] += 1  # no KeyError when word is new
    
    data = pd.read_csv('')[:20].text  # text column of the first 20 rows
    
    # load a trained word2vec model
    model_w2v = gensim.models.Word2Vec.load('data/wiki.Mode')
    # model_w2v.wv.most_similar("民生银行")  # find the most similar words
    # model_w2v.wv.get_vector("民生银行")  # look up one word's vector
    # model_w2v.wv.syn0  # same as model_w2v.wv.vectors: the full embedding matrix
    # model_w2v.wv.vocab  # words and their vocabulary entries
    # model_w2v.wv.index2word  # the word at each index
    # note: gensim >= 4.0 removed syn0, vocab, and index2word in favour of
    # vectors, key_to_index, and index_to_key
    
    
    train_words,test_words,train_labels,test_labels = train_test_split(x,label,test_size=0.2,random_state=42)
    
    lists = itemgetter(*[0, 1, 2])([a, b, c, d])  # result: (a, b, c)
    
    pickle.dump(obj, open('', 'wb'))  # dump(object, file): object first, then file
    obj = pickle.load(open('', 'rb'))  # load(file) takes only the file object
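A small, self-contained example tying these utilities together (the toy corpus is made up for illustration, and an in-memory buffer stands in for the pickle file):

```python
import io
import pickle
from collections import defaultdict
from operator import itemgetter

# word-frequency counting: defaultdict(int) returns 0 for unseen keys
corpus = ["我 想 查 余额", "查 余额", "我 想 转账"]
vocab = defaultdict(int)
for sentence in corpus:
    for word in sentence.split():
        vocab[word] += 1

# itemgetter fetches several positions at once and returns a tuple
first_two = itemgetter(0, 1)(list(vocab))

# pickle round-trip: dump(obj, file) then load(file)
buf = io.BytesIO()
pickle.dump(dict(vocab), buf)
buf.seek(0)
restored = pickle.load(buf)
```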

     

  2. Device conversion for data processing
    import torch
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
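Once `device` is set, models and tensors are moved onto it with `.to(device)`. A quick sanity check (the layer sizes here are arbitrary, for illustration only):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# every tensor in a forward pass must live on the same device as the model
model = torch.nn.Linear(3, 4).to(device)
x = torch.zeros(2, 3).to(device)
out = model(x)  # shape (2, 4), on `device`
```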

     

  3. Data processing order
    1. Initialize parameters
      parameters = {
          # minimum word-frequency threshold
          'min_count_word':1,
          'word2ind':None,
          'ind2word':None,
          'ind2embedding':None,
          'output_size':None,
          'epoch':20,
          'batch_size':10,
          'embedding_dim':300,
          'hidden_size':128,
          'num_layers':2, # number of stacked LSTM layers
          'dropout':0.5,
          'cuda':device,
          'lr':0.01,
          'num_unknow':0
      }

       

    2. Load the data file into a list (csv/txt)
    3. Create the dataset, generating:
      1. vocab (word-frequency dict), built with defaultdict
      2. add <unk> and <pad> to vocab
      3. drop words below the min-frequency threshold
      4. build word2index, index2word, and ind2embedding (one-hot, word2vec, or BERT)
      5. store them in parameters
    4. Create batch_yield
      1. Shuffle the data, e.g. permutation = np.random.permutation(len(train_datas)). The permutation() function returns a shuffled copy of the original array; by contrast, shuffle() reorders the original array in place
      2. # np.array() is fundamentally different from list: all elements of an np.array() share one dtype, so fancy indexing such as chars = chars[permutation] can be used to reorder the data
      3. Iterate over the data, converting words to indices with itemgetter and appending them to a new list batch_x
      4. When batch_x reaches batch_size, map the indices to embeddings via ind2embedding with itemgetter; all sequences must match the maximum length, padded with <pad> where they fall short
      5. Return x, y, keys, epoch with yield
    5. Optionally save the processed dataset and parameters with pickle
    6. Fetch the processed results
      while True:
          x, y, keys, epoch = next(train_yield)
          if not keys:
              break
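Steps 3, 4, and 6 above can be sketched end to end. This is a minimal illustration under assumptions, not the article's original code: build_vocab and batch_yield are hypothetical helpers, the sentences are invented, and the embedding lookup is left out (batch_x holds word indices):

```python
import numpy as np
from collections import defaultdict

def build_vocab(sentences, min_count_word=1):
    # step 3: frequency dict -> drop rare words -> add <pad>/<unk> -> index maps
    freq = defaultdict(int)
    for sent in sentences:
        for w in sent:
            freq[w] += 1
    words = ["<pad>", "<unk>"] + [w for w, c in freq.items() if c >= min_count_word]
    word2ind = {w: i for i, w in enumerate(words)}
    ind2word = {i: w for w, i in word2ind.items()}
    return word2ind, ind2word

def batch_yield(datas, labels, word2ind, batch_size, epochs):
    # step 4: shuffle, map words to indices, pad to the batch maximum, yield
    pad_ind, unk_ind = word2ind["<pad>"], word2ind["<unk>"]
    for epoch in range(epochs):
        perm = np.random.permutation(len(datas))  # shuffled copy of the indices
        for start in range(0, len(datas), batch_size):
            idx = perm[start:start + batch_size]
            batch_x = [[word2ind.get(w, unk_ind) for w in datas[i]] for i in idx]
            batch_y = [labels[i] for i in idx]
            max_len = max(len(s) for s in batch_x)
            batch_x = [s + [pad_ind] * (max_len - len(s)) for s in batch_x]
            yield batch_x, batch_y, True, epoch
    yield [], [], False, epochs  # sentinel so `if not keys: break` fires

sentences = [["查", "余额"], ["转账", "给", "他"], ["查", "话费"]]
labels = [0, 1, 0]
word2ind, ind2word = build_vocab(sentences)
train_yield = batch_yield(sentences, labels, word2ind, batch_size=2, epochs=1)
```

Consuming train_yield with the while loop from step 6 drains every batch and stops cleanly at the sentinel.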

       
