Feature Engineering
Features directly available from the raw data
Article features: category_id is the article's category, created_at_ts is its creation time (which determines its timeliness), and words_count is its length in words. Very long articles are generally clicked less often, although some users do prefer long reads.
Article content embedding features: these were already used during recall. They are optional here, and other kinds of embeddings (e.g., Word2Vec) can also be tried.
User device information.
Once the feature engineering below is done, these directly usable features can simply be joined in by article_id or user_id (a minimal merge sketch follows). But first we need to build some features on top of the recall results and create labels, turning the problem into a supervised learning dataset.
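As a minimal sketch of that join (not the tutorial's exact code; candidates_df, articles_df and users_df are hypothetical names), the side features are simply merged in by id once the candidate (user, article) pairs exist:

import pandas as pd

def attach_side_features(candidates_df, articles_df, users_df):
    # article-side features: category_id, created_at_ts, words_count, ...
    out = candidates_df.merge(articles_df, how='left',
                              left_on='click_article_id', right_on='article_id')
    # user-side features: device info, habits, ...
    out = out.merge(users_df, how='left', on='user_id')
    return out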
How the supervised dataset is constructed
From the recall stage we get a dictionary of the form {user_id: [list of articles the user may click]}. For each user and each candidate article we can then build one row of a supervised dataset. For example, for user1 with the recall list {user1: [item1, item2, item3]}, we get the three rows (user1, item1), (user1, item2), (user1, item3); these are the first two columns of the supervised dataset (a minimal sketch follows).
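A minimal sketch of this conversion, including the labelling step that is implemented later in this section by recall_dict_2_df and get_rank_label_df; recall_dict and last_click are hypothetical inputs here:

def recall_dict_to_rows(recall_dict, last_click):
    """recall_dict: {user_id: [(item_id, score), ...]}
       last_click:  {user_id: article_id of the last click (the prediction target)}"""
    rows = []
    for user_id, items in recall_dict.items():
        for item_id, score in items:
            # label is 1 when the candidate is the article the user actually clicked last
            label = 1.0 if last_click.get(user_id) == item_id else 0.0
            rows.append((user_id, item_id, score, label))
    return rows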
How the features are constructed
We know that the article a user clicks next is strongly related to the articles in their click history, e.g., same topic, similar content, and so on. So an important family of features combines the candidate article with the user's historical clicks. We already have a dataset whose first two columns are (user, candidate article), and the goal is to predict the user's last click. A natural idea is to relate each candidate to the user's last few clicks: this uses the historical click information while staying close in time to the last click, which matters because news is highly time-sensitive, and the last click is usually strongly related to the few clicks just before it. For each candidate article we therefore build the following features relative to the last few clicks (a small sketch follows the list):
Similarity between the candidate item and the last few clicked articles (embedding dot product) --- directly tied to the user's historical behaviour
Statistics of those similarities (max, min, sum, mean) --- statistics smooth out fluctuations and outliers
Word-count difference between the candidate item and the last few clicked articles --- reflects the user's preference for article length
Creation-time difference between the candidate item and the last few clicked articles --- reflects the user's preference for article freshness
If YouTubeDNN recall was used, we can also build a user-to-candidate-item similarity feature
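A minimal sketch of the similarity features for a single (user, candidate) pair, assuming item_emb maps article_id to an embedding vector and hist_items holds the user's last N clicked article ids (the full version is the create_feature function later in this section):

import numpy as np

def candidate_sim_feats(candidate_id, hist_items, item_emb):
    # dot-product similarity with each of the last N clicks, plus simple statistics
    sims = [np.dot(item_emb[h], item_emb[candidate_id]) for h in hist_items]
    return sims + [max(sims), min(sims), sum(sims), sum(sims) / len(sims)]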
Building the features
First, from the click-log data, get each user's last click and their click history.
Build features based on the users' historical behaviour, using the historical click table, the final recall list, the article info table and the embedding vectors.
Create the labels to form the final supervised learning dataset.
Import the required packages
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import logging
import time
import lightgbm as lgb
from gensim.models import Word2Vec
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
DataFrame memory-reduction helper
def reduce_mem(df):
    """Reduce memory usage of a DataFrame by downcasting numeric columns."""
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    # the print statement was lost in the original extraction; restored to match the outputs shown below
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(
        end_mem, 100 * (start_mem - end_mem) / start_mem, (time.time() - starttime) / 60))
    return df
data_path = './raw_data/'
save_path = './temp_results/'
Data loading
Splitting the training and validation sets
We split off a validation set so that model parameters can be validated offline. To fully mimic the test set, we sample some users from the training set and take all of their records as the validation set. Doing this split in advance also spreads out the cost of building ranking features; building them for the whole dataset in one pass can take quite a long time.
def trn_val_split(all_click_df, sample_user_nums):
    all_click = all_click_df
    all_user_ids = all_click.user_id.unique()

    # sample validation users without replacement
    sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False)

    click_val = all_click[all_click['user_id'].isin(sample_user_ids)]
    click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]

    # the last click of each validation user is the validation answer; the rest is their history
    click_val = click_val.sort_values(['user_id', 'click_timestamp'])
    val_ans = click_val.groupby('user_id').tail(1)

    click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)

    # keep only users that appear in both the history and the answers
    val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())]
    click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]

    return click_trn, click_val, val_ans
Getting the click history and the last click
def get_hist_and_last_click(all_click):
    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])
    click_last_df = all_click.groupby('user_id').tail(1)

    # if a user has only one click, keep it in the history to avoid empty histories
    def hist_func(user_df):
        if len(user_df) == 1:
            return user_df
        else:
            return user_df[:-1]

    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)

    return click_hist_df, click_last_df
Loading the training, validation and test sets
def get_trn_val_tst_data(data_path, offline=True):
    if offline:
        click_trn_data = pd.read_csv(data_path + 'train_click_log.csv')
        click_trn_data = reduce_mem(click_trn_data)
        # note: sample_user_nums must be defined globally before calling with offline=True
        click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)
    else:
        click_trn = pd.read_csv(data_path + 'train_click_log.csv')
        click_trn = reduce_mem(click_trn)
        click_val = None
        val_ans = None

    click_tst = pd.read_csv(data_path + 'testA_click_log.csv')

    return click_trn, click_val, click_tst, val_ans
Loading the recall lists
def get_recall_list(save_path, single_recall_model=None, multi_recall=False):
    if multi_recall:
        return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))

    if single_recall_model == 'i2i_itemcf':
        return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))
    elif single_recall_model == 'i2i_emb_itemcf':
        return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))
    elif single_recall_model == 'user_cf':
        return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))
    elif single_recall_model == 'youtubednn':
        return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))
Generating click-sequence embeddings with Word2Vec
Embedding
For background on embeddings, see my blog post.
Word2Vec
Word2Vec is based on the distributional hypothesis: a word's context expresses its meaning well. It learns word vectors in an unsupervised way.
Word2Vec has two classic model variants: skip-gram and CBOW.
Skip-gram predicts the surrounding context words from the center word.
CBOW predicts the center word from the surrounding context words (the original post illustrates this with a window size of 2).
(Figure from the original post: skip-gram vs. CBOW architectures.)
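To make the two objectives concrete, here is a small illustration (not from the original post) of the training pairs each variant extracts from a toy sentence with a window of 2:

sentence = ['the', 'quick', 'brown', 'fox', 'jumps']

def training_pairs(tokens, window=2):
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
        skipgram += [(center, c) for c in context]   # skip-gram: center word -> each context word
        cbow.append((context, center))               # CBOW: all context words -> center word
    return skipgram, cbow

sg_pairs, cbow_pairs = training_pairs(sentence)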
Training Word2Vec with gensim
Key parameters: the values used here (vector size, sg, window, seed, workers, min_count, iter) are all set in the training function below.
Notes
The training corpus must be a 2-D list of strings, e.g. [['北', '京', '你', '好'], ['上', '海', '你', '好']].
The model has a number of default argument values; in Jupyter you can inspect them with Word2Vec??
For a detailed explanation of the underlying theory, see the blog post "Word2Vec讲解".
def train_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):
    click_df = click_df.sort_values('click_timestamp')
    # Word2Vec expects sequences of strings
    click_df['click_article_id'] = click_df['click_article_id'].astype(str)
    # one "sentence" per user: the sequence of clicked article ids
    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()
    docs = docs['click_article_id'].values.tolist()

    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)

    # gensim 3.x API (size/iter keyword names)
    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1)

    # map each article id to its learned vector
    item_w2v_emb_dict = {k: w2v[k] for k in click_df['click_article_id']}
    pickle.dump(item_w2v_emb_dict, open(save_path + save_name, 'wb'))

    return item_w2v_emb_dict
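The function above uses the gensim 3.x keyword names (size, iter) and the old w2v[key] lookup. If you are on gensim 4.x, the equivalent call looks roughly like this (a sketch, assuming the same docs list of article-id sequences built inside the function):

# gensim 4.x renamed size -> vector_size and iter -> epochs, and vectors live on .wv
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=docs, vector_size=16, sg=1, window=5,
               seed=2020, workers=24, min_count=1, epochs=1)
item_w2v_emb_dict = {k: w2v.wv[k] for k in set(i for doc in docs for i in doc)}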
Loading the article embeddings
def get_embedding(save_path, all_click_df):
    if os.path.exists(save_path + 'item_content_emb.pkl'):
        item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))
    else:
        print('item_content_emb.pkl does not exist...')

    if os.path.exists(save_path + 'item_w2v_emb.pkl'):
        item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))
    else:
        item_w2v_emb_dict = train_item_word2vec(all_click_df)

    if os.path.exists(save_path + 'item_youtube_emb.pkl'):
        item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))
    else:
        print('item_youtube_emb.pkl does not exist...')

    if os.path.exists(save_path + 'user_youtube_emb.pkl'):
        user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))
    else:
        print('user_youtube_emb.pkl does not exist...')

    return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict
Loading the article info
def get_article_info_df():
    article_info_df = pd.read_csv(data_path + 'articles.csv')
    article_info_df = reduce_mem(article_info_df)
    return article_info_df
Reading the data
click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)
-- Mem. usage decreased to 23.34 Mb (69.4% reduction),time spend:0.00 min
click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)

if click_val is not None:
    click_val_hist, click_val_last = click_val, val_ans
else:
    click_val_hist, click_val_last = None, None

click_tst_hist = click_tst
Negative sampling on the training data
Through recall we have turned the data into triples of the form (user1, item1, label). The positive and negative samples are extremely imbalanced, so we first downsample the negatives. This both mitigates the class-imbalance problem and reduces the cost of building the ranking features. What should we keep in mind when doing this negative sampling?
Only downsample the negatives (if there is a good way to augment the positives, that could be considered as well).
After sampling, make sure every user and every article still appears in the sampled data.
The downsampling ratio can be tuned manually according to the situation.
After negative sampling, update each user's recalled article list, because relative-position information may be used later when building features.
def recall_dict_2_df(recall_list_dict):
    df_row_list = []  # [user, item, score]
    for user, recall_list in tqdm(recall_list_dict.items()):
        for item, score in recall_list:
            df_row_list.append([user, item, score])

    col_names = ['user_id', 'sim_item', 'score']
    recall_list_df = pd.DataFrame(df_row_list, columns=col_names)

    return recall_list_df
def neg_sample_recall_data(recall_items_df, sample_rate=0.001):
    pos_data = recall_items_df[recall_items_df['label'] == 1]
    neg_data = recall_items_df[recall_items_df['label'] == 0]

    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data) / len(neg_data))

    # grouped sampling: keep at least 1 and at most 5 negatives per group
    def neg_sample_func(group_df):
        neg_num = len(group_df)
        sample_num = max(int(neg_num * sample_rate), 1)
        sample_num = min(sample_num, 5)
        return group_df.sample(n=sample_num, replace=True)

    # sample per user and per article so that every user and every article is still represented
    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)
    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)

    # merge the two samples and drop duplicate (user, item) pairs, keeping the highest score
    neg_data_new = neg_data_user_sample.append(neg_data_item_sample)
    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')

    # combine with all the positives
    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)

    return data_new
def get_rank_label_df(recall_list_df, label_df, is_test=False):
    # the test set has no labels; use -1 as a placeholder
    if is_test:
        recall_list_df['label'] = -1
        return recall_list_df

    label_df = label_df.rename(columns={'click_article_id': 'sim_item'})
    recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']],
                                           how='left', on=['user_id', 'sim_item'])
    # a matched click timestamp means the candidate was actually clicked -> label 1
    recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)
    del recall_list_df_['click_timestamp']

    return recall_list_df_
def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,
                                  click_trn_last, click_val_last, recall_list_df):
    # training set: label the candidates, then negative-sample
    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]
    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)
    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)

    if click_val_hist is not None:
        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]
        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)
        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)
    else:
        val_user_item_label_df = None

    # test set: no labels available
    tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]
    tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)

    return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df
recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf')
recall_list_df = recall_dict_2_df(recall_list_dict)
100%|██████████| 250000/250000 [00:00<00:00, 289677.16it/s]
trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = \
    get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,
                                  click_trn_last, click_val_last, recall_list_df)
pos_data_num: 3 neg_data_num: 1999997 pos/neg: 1.500002250003375e-06
trn_user_item_label_df.label
0 1.0
1 1.0
2 1.0
3 0.0
4 0.0
...
224357 0.0
224358 0.0
224359 0.0
224360 0.0
224361 0.0
Name: label, Length: 224362, dtype: float64
Converting the recall data back into a dictionary
def make_tuple_func(group_df):
    # turn each user's rows into a list of (item, score, label) tuples
    row_data = []
    for name, row_df in group_df.iterrows():
        row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))

    return row_data
trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))

if val_user_item_label_df is not None:
    val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
    val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))
else:
    val_user_item_label_tuples_dict = None

tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))
Feature engineering
Building features related to the user's historical behaviour
For each item recalled for each user, we build features as follows:
For each user, get the item_id of the last N clicked items.
For each recalled item of that user, compute the similarities with those last N clicked items and their statistics (sum, max, min, mean), the creation-time difference features, the word-count difference features and, if available, the user-to-item similarity feature.
def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1):
    """
    Build features based on the users' historical behaviour.
    :param users_id: user ids
    :param recall_list: the recalled candidate articles for each user
    :param click_hist_df: the users' historical clicks
    :param articles_info: article information
    :param articles_emb: article embeddings; can be item_content_emb, item_w2v_emb or item_youtube_emb
    :param user_emb: user embeddings (user_youtube_emb); optional, but if used, articles_emb must be
                     item_youtube_emb so that the dimensions match
    :param N: number of most recent clicks to use; many users in the testA log have only one click,
              so the default is 1 to avoid missing values
    """
    all_user_feas = []
    for user_id in tqdm(users_id):
        # the user's last N clicked articles
        hist_user_items = click_hist_df[click_hist_df['user_id'] == user_id]['click_article_id'][-N:]

        # features for each candidate article of this user
        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):
            a_create_time = articles_info[articles_info['article_id'] == article_id]['created_at_ts'].values[0]
            a_words_count = articles_info[articles_info['article_id'] == article_id]['words_count'].values[0]
            single_user_fea = [user_id, article_id]

            # similarity, creation-time difference and word-count difference with each of the last N clicks
            sim_fea = []
            time_fea = []
            word_fea = []
            for hist_item in hist_user_items:
                b_create_time = articles_info[articles_info['article_id'] == hist_item]['created_at_ts'].values[0]
                b_words_count = articles_info[articles_info['article_id'] == hist_item]['words_count'].values[0]

                sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))
                time_fea.append(abs(a_create_time - b_create_time))
                word_fea.append(abs(a_words_count - b_words_count))

            single_user_fea.extend(sim_fea)
            single_user_fea.extend(time_fea)
            single_user_fea.extend(word_fea)
            # statistics of the similarities
            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])

            # user-to-item similarity, only when user embeddings are provided
            if user_emb:
                single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))

            single_user_fea.extend([score, rank, label])
            all_user_feas.append(single_user_fea)

    # column names
    id_cols = ['user_id', 'click_article_id']
    sim_cols = ['sim' + str(i) for i in range(N)]
    time_cols = ['time_diff' + str(i) for i in range(N)]
    word_cols = ['word_diff' + str(i) for i in range(N)]
    sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']
    user_item_sim_cols = ['user_item_sim'] if user_emb else []
    user_score_rank_label = ['score', 'rank', 'label']
    cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label

    df = pd.DataFrame(all_user_feas, columns=cols)

    return df
article_info_df = get_article_info_df()
all_click = click_trn.append(click_tst)
item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)
-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min
2020-12-02 20:24:15,896:INFO:collecting all words and their counts
2020-12-02 20:24:15,897:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-12-02 20:24:15,903:INFO:PROGRESS: at sentence #10000, processed 25727 words, keeping 3473 word types
2020-12-02 20:24:15,910:INFO:PROGRESS: at sentence #20000, processed 53883 words, keeping 5811 word types
2020-12-02 20:24:15,917:INFO:PROGRESS: at sentence #30000, processed 84881 words, keeping 7676 word types
2020-12-02 20:24:15,925:INFO:PROGRESS: at sentence #40000, processed 118390 words, keeping 9297 word types
2020-12-02 20:24:15,935:INFO:PROGRESS: at sentence #50000, processed 154179 words, keeping 10844 word types
2020-12-02 20:24:15,944:INFO:PROGRESS: at sentence #60000, processed 192350 words, keeping 12357 word types
2020-12-02 20:24:15,953:INFO:PROGRESS: at sentence #70000, processed 233685 words, keeping 13473 word types
2020-12-02 20:24:15,965:INFO:PROGRESS: at sentence #80000, processed 281335 words, keeping 14939 word types
2020-12-02 20:24:15,977:INFO:PROGRESS: at sentence #90000, processed 329973 words, keeping 16420 word types
2020-12-02 20:24:15,988:INFO:PROGRESS: at sentence #100000, processed 379428 words, keeping 17904 word types
2020-12-02 20:24:15,998:INFO:PROGRESS: at sentence #110000, processed 431464 words, keeping 18928 word types
2020-12-02 20:24:16,010:INFO:PROGRESS: at sentence #120000, processed 489655 words, keeping 20157 word types
2020-12-02 20:24:16,025:INFO:PROGRESS: at sentence #130000, processed 550375 words, keeping 21588 word types
2020-12-02 20:24:16,038:INFO:PROGRESS: at sentence #140000, processed 613031 words, keeping 22923 word types
2020-12-02 20:24:16,051:INFO:PROGRESS: at sentence #150000, processed 678645 words, keeping 24209 word types
2020-12-02 20:24:16,066:INFO:PROGRESS: at sentence #160000, processed 749559 words, keeping 25743 word types
2020-12-02 20:24:16,086:INFO:PROGRESS: at sentence #170000, processed 831064 words, keeping 27232 word types
2020-12-02 20:24:16,104:INFO:PROGRESS: at sentence #180000, processed 914233 words, keeping 28612 word types
2020-12-02 20:24:16,127:INFO:PROGRESS: at sentence #190000, processed 1004976 words, keeping 29699 word types
2020-12-02 20:24:16,152:INFO:PROGRESS: at sentence #200000, processed 1112623 words, keeping 31116 word types
2020-12-02 20:24:16,171:INFO:PROGRESS: at sentence #210000, processed 1200577 words, keeping 31798 word types
2020-12-02 20:24:16,193:INFO:PROGRESS: at sentence #220000, processed 1285942 words, keeping 32381 word types
2020-12-02 20:24:16,212:INFO:PROGRESS: at sentence #230000, processed 1380836 words, keeping 33131 word types
2020-12-02 20:24:16,239:INFO:PROGRESS: at sentence #240000, processed 1498710 words, keeping 34213 word types
2020-12-02 20:24:16,266:INFO:collected 35380 word types from a corpus of 1630633 raw words and 250000 sentences
2020-12-02 20:24:16,266:INFO:Loading a fresh vocabulary
2020-12-02 20:24:16,569:INFO:effective_min_count=1 retains 35380 unique words (100% of original 35380, drops 0)
2020-12-02 20:24:16,569:INFO:effective_min_count=1 leaves 1630633 word corpus (100% of original 1630633, drops 0)
2020-12-02 20:24:16,638:INFO:deleting the raw counts dictionary of 35380 items
2020-12-02 20:24:16,639:INFO:sample=0.001 downsamples 73 most-common words
2020-12-02 20:24:16,639:INFO:downsampling leaves estimated 1452855 word corpus (89.1% of prior 1630633)
2020-12-02 20:24:16,695:INFO:estimated required memory for 35380 words and 16 dimensions: 22218640 bytes
2020-12-02 20:24:16,695:INFO:resetting layer weights
2020-12-02 20:24:16,889:INFO:training model with 24 workers on 35380 vocabulary and 16 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2020-12-02 20:24:17,583:INFO:worker thread finished; awaiting finish of 23 more threads
2020-12-02 20:24:17,587:INFO:worker thread finished; awaiting finish of 22 more threads
2020-12-02 20:24:17,593:INFO:worker thread finished; awaiting finish of 21 more threads
2020-12-02 20:24:17,595:INFO:worker thread finished; awaiting finish of 20 more threads
2020-12-02 20:24:17,595:INFO:worker thread finished; awaiting finish of 19 more threads
2020-12-02 20:24:17,615:INFO:worker thread finished; awaiting finish of 18 more threads
2020-12-02 20:24:17,625:INFO:worker thread finished; awaiting finish of 17 more threads
2020-12-02 20:24:17,627:INFO:worker thread finished; awaiting finish of 16 more threads
2020-12-02 20:24:17,628:INFO:worker thread finished; awaiting finish of 15 more threads
2020-12-02 20:24:17,629:INFO:worker thread finished; awaiting finish of 14 more threads
2020-12-02 20:24:17,629:INFO:worker thread finished; awaiting finish of 13 more threads
2020-12-02 20:24:17,633:INFO:worker thread finished; awaiting finish of 12 more threads
2020-12-02 20:24:17,658:INFO:worker thread finished; awaiting finish of 11 more threads
2020-12-02 20:24:17,666:INFO:worker thread finished; awaiting finish of 10 more threads
2020-12-02 20:24:17,667:INFO:worker thread finished; awaiting finish of 9 more threads
2020-12-02 20:24:17,667:INFO:worker thread finished; awaiting finish of 8 more threads
2020-12-02 20:24:17,668:INFO:worker thread finished; awaiting finish of 7 more threads
2020-12-02 20:24:17,668:INFO:worker thread finished; awaiting finish of 6 more threads
2020-12-02 20:24:17,669:INFO:worker thread finished; awaiting finish of 5 more threads
2020-12-02 20:24:17,673:INFO:worker thread finished; awaiting finish of 4 more threads
2020-12-02 20:24:17,675:INFO:worker thread finished; awaiting finish of 3 more threads
2020-12-02 20:24:17,677:INFO:worker thread finished; awaiting finish of 2 more threads
2020-12-02 20:24:17,688:INFO:worker thread finished; awaiting finish of 1 more threads
2020-12-02 20:24:17,692:INFO:worker thread finished; awaiting finish of 0 more threads
2020-12-02 20:24:17,693:INFO:EPOCH - 1 : training on 1630633 raw words (1453218 effective words) took 0.8s, 1829115 effective words/s
2020-12-02 20:24:17,693:INFO:training on a 1630633 raw words (1453218 effective words) took 0.8s, 1808166 effective words/s
2020-12-02 20:24:17,694:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict,
                                        click_trn_hist, article_info_df, item_content_emb_dict)

if val_user_item_label_tuples_dict is not None:
    val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict,
                                            click_val_hist, article_info_df, item_content_emb_dict)
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict,
                                        click_tst_hist, article_info_df, item_content_emb_dict)
100%|██████████| 200000/200000 [09:22<00:00, 355.53it/s]
100%|██████████| 50000/50000 [16:08<00:00, 51.64it/s]
trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)

if val_user_item_feats_df is not None:
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)

tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)
User and article features
User-related features
Now we move to the feature engineering proper: joining in the existing features and building more on top of them. The head of the click log (the existing columns) looks like this:
   user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type
0   249999            160974    1506959142820                  4                  1        17              1            13                    2
1   249999            160417    1506959172820                  4                  1        17              1            13                    2
2   249998            160974    1506959056066                  4                  1        12              1            13                    2
3   249998            202557    1506959086066                  4                  1        12              1            13                    2
4   249997            183665    1506959088613                  4                  1        17              1            15                    5
articles = pd.read_csv(data_path + 'articles.csv')
articles = reduce_mem(articles)

# combine the click logs (train + validation if present + test)
if click_val is not None:
    all_data = click_trn.append(click_val)
else:
    all_data = click_trn
all_data = all_data.append(click_tst)
all_data = reduce_mem(all_data)

# join in the article information
all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')
all_data.shape
-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min
-- Mem. usage decreased to 46.65 Mb (62.5% reduction),time spend:0.00 min
(1630633, 13)
Analysis
1. Distinguish user activity level from click times and click counts
If the gaps between a user's clicks are small and the user clicks many articles, we consider that user active. There are of course many possible ways to measure user activity; here we give just one. We write a function that produces an activity-level feature, with the following logic:
Group by user_id; for each user, compute the number of clicked articles and the mean time gap between consecutive clicks.
Take the reciprocal of the click count, min-max normalize it together with the mean time gap, and add the two; the smaller the resulting value, the more clicks and the shorter the gaps, i.e., the more active the user.
If a user clicked only once, the mean time gap between clicks is undefined; in that case the feature gets a fixed filler value to set such users apart.
def active_level(all_data, cols):
    """
    Build the user activity-level feature.
    :param all_data: the dataset
    :param cols: the columns used
    """
    data = all_data[cols]
    data.sort_values(['user_id', 'click_timestamp'], inplace=True)
    user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].
                            agg({'click_article_id': np.size, 'click_timestamp': {list}}).values,
                            columns=['user_id', 'click_size', 'click_timestamp'])

    # mean gap between consecutive click timestamps (filler value 1 if the user has a single click)
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))])

    user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # reciprocal of the click count
    user_act['click_size'] = 1 / user_act['click_size']

    # min-max normalize both components and add them; smaller means more active
    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())
    user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min())
    user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']

    user_act['user_id'] = user_act['user_id'].astype('int')
    del user_act['click_timestamp']

    return user_act
user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
   user_id  click_size  time_diff_mean  active_level
0        0    0.499466        0.000048      0.499515
1        1    0.499466        0.000048      0.499515
2        2    0.499466        0.000048      0.499515
3        3    0.499466        0.000048      0.499515
4        4    0.499466        0.000048      0.499515
2. Measure article hotness from click times and click counts
Following the same idea: if an article is clicked many times within short time intervals, the article is hot. The implementation is basically the same as above, except that we group by article:
Group by article; for each article, compute the time gaps between its clicks.
Take the reciprocal of the number of users, normalize it together with the time gaps, and add them; the smaller the value, the more clicks within shorter intervals, i.e., the hotter the article.
Note that this is only one possible way to judge article hotness.
def hot_level(all_data, cols):
    """
    Build the article hotness feature.
    :param all_data: the dataset
    :param cols: the columns used
    """
    data = all_data[cols]
    data.sort_values(['click_article_id', 'click_timestamp'], inplace=True)
    article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].
                               agg({'user_id': np.size, 'click_timestamp': {list}}).values,
                               columns=['click_article_id', 'user_num', 'click_timestamp'])

    # mean gap between consecutive click timestamps (filler value 1 if the article has a single click)
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j - i for i, j in list(zip(l[:-1], l[1:]))])

    article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # reciprocal of the number of users
    article_hot['user_num'] = 1 / article_hot['user_num']

    # min-max normalize both components and add them; smaller means hotter
    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())
    article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min())
    article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']

    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')
    del article_hot['click_timestamp']

    return article_hot
article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
   click_article_id  user_num  time_diff_mean  hot_level
0                 3         1             0.0          1
1                69         1             0.0          1
2                84         1             0.0          1
3                94         1             0.0          1
4               125         1             0.0          1
Users' habits
Based on the original log table, we build a user-level DataFrame (analogous to the articles table) that stores user-specific information, mainly click habits and preference features:
Device habit: take the user's most frequently used device (the mode).
Time habit: a statistic over the click timestamps of the user's history (ideally one would extract the hour from the timestamp to see at what time of day the user tends to click); for now we simply use the scaled timestamps and take their mean.
Topic preference: judge which topics the user prefers from the categories of the clicked articles; ideally this would be multi-hot encoded, but let's first try a simpler version.
Word-count preference: the word-count habit of the articles the user likes.
All of the above can be obtained by grouping by user and aggregating.
Device habit
def device_fea(all_data, cols):
    """
    Build the user device features.
    :param all_data: the dataset
    :param cols: the columns used
    """
    user_device_info = all_data[cols]

    # for each user, take the most frequent value (mode) of each device column
    user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()

    return user_device_info
device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']
user_device_info = device_fea(all_data, device_cols)
   user_id  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type
0        0                  4                  1        17              1            25                    2
1        1                  4                  1        17              1            25                    6
2        2                  4                  3        20              1            25                    2
3        3                  4                  3         2              1            25                    2
4        4                  4                  1        12              1            16                    1
Time habit
def user_time_hob_fea(all_data, cols):
    """
    Build the user time-habit features.
    :param all_data: the dataset
    :param cols: the columns used
    """
    user_time_hob_info = all_data[cols]

    # min-max scale the click timestamp and the article creation timestamp
    mm = MinMaxScaler()
    user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])
    user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])

    user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()

    user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)
    return user_time_hob_info
user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts']
user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)
Topic preference
Here we first collect the categories of the articles each user has clicked into a list; later, when everything is merged, we build a separate feature that is 1 if the candidate article's category is in this list and 0 otherwise.
def user_cat_hob_fea(all_data, cols):
    """
    Build the user topic-preference feature.
    :param all_data: the dataset
    :param cols: the columns used
    """
    user_category_hob_info = all_data[cols]
    # list of clicked categories per user
    user_category_hob_info = user_category_hob_info.groupby('user_id').agg({list}).reset_index()

    user_cat_hob_info = pd.DataFrame()
    user_cat_hob_info['user_id'] = user_category_hob_info['user_id']
    user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']

    return user_cat_hob_info
user_category_hob_cols = ['user_id', 'category_id']
user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)
Word-count preference
user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()
user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)
Merging and saving the user features
# merge all user-level features into one table
user_info = pd.merge(user_act_fea, user_device_info, on='user_id')
user_info = user_info.merge(user_time_hob_info, on='user_id')
user_info = user_info.merge(user_cat_hob_info, on='user_id')
user_info = user_info.merge(user_wcou_info, on='user_id')
user_info.to_csv(save_path + 'user_info.csv', index=False)
Loading the user features directly
If the user feature engineering above has already been done, its results can simply be read back in.
user_info = pd.read_csv(save_path + 'user_info.csv')
if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):
    trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')

if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):
    tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')

if os.path.exists(save_path + 'val_user_item_feats_df.csv'):
    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
else:
    val_user_item_feats_df = None
# join the user features onto the candidate feature tables
trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id', how='left')
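One caveat with the reload path above: after user_info round-trips through CSV, cate_list comes back as a string such as "[281, 375]" rather than a Python list, so the is_cat_hab feature built further below would compare against individual characters. A small hedged fix (not part of the original tutorial) is to parse it right after the merge:

import ast

# parse cate_list back into a real list if it was re-read from CSV
for df in [d for d in (trn_user_item_feats_df, val_user_item_feats_df, tst_user_item_feats_df) if d is not None]:
    if df['cate_list'].dtype == object and isinstance(df['cate_list'].iloc[0], str):
        df['cate_list'] = df['cate_list'].apply(ast.literal_eval)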
trn_user_item_feats_df.columns
Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',
'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',
'click_size', 'time_diff_mean', 'active_level', 'click_environment',
'click_deviceGroup', 'click_os', 'click_country', 'click_region',
'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',
'words_hbo'],
dtype='object')
Loading the article features directly
articles = pd.read_csv(data_path + 'articles.csv')
articles = reduce_mem(articles)
-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min
# join the article features onto the candidate feature tables
trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
Whether the recalled article's category is among the user's preferred topics
trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)

if val_user_item_feats_df is not None:
    val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)
else:
    val_user_item_feats_df = None

tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)
# drop the columns that are no longer needed
del trn_user_item_feats_df['cate_list']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['cate_list']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['cate_list']

del trn_user_item_feats_df['article_id']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['article_id']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['article_id']
Saving the features
trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)

if val_user_item_feats_df is not None:
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)

tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)
Summary
Feature engineering and data cleaning/transformation are a crucial part of a competition: the data and the features set the ceiling of what machine learning can achieve, while algorithms and models merely approach that ceiling. The quality of the feature engineering therefore often decides the final result. By constructing new features we can dig more information out of the data and further strengthen its expressive power. In this part we first turned the prediction problem into a supervised learning problem by building features and labels, and then built a series of features around the user and article profiles. In addition, to keep the positive and negative samples balanced, we applied negative sampling.