Datawhale team learning - building a recommendation system, Task 05: user inverted index table

The following study notes come from Datawhale's team-learning recommendation system course, project address:

Having covered how the news materials are built and how the front end and back end interact, what remains is to understand how the recommendation process is implemented in this project. The following content mainly follows the md files under docs in the project; there are some differences from the actual code, but the overall logic is the same.

Inverted index

In an unprocessed database, the document ID is generally the index and the document content is the record. An inverted index simply flips this around: a word (or term) becomes the index and the document IDs become the record, so that the documents containing a given word can be found directly from that word.

A detailed introduction to inverted indexes can be found here:
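For example, a minimal inverted index over a toy document set can be built like this (a sketch for illustration, not code from the project):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text; returns word -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "sports news today", 2: "finance news update", 3: "sports and finance"}
index = build_inverted_index(docs)
print(index["news"])    # [1, 2]
print(index["sports"])  # [1, 3]
```

Looking up `index["news"]` goes straight from the word to the documents that contain it, instead of scanning every document.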

The construction of the recommendation process mainly consists of two parts: offline and online.


Offline

The offline part performs offline computation based on the previously stored material portraits and user portraits, generates a hot page list and a recommended page list for each user, and caches them so that the online service can fetch them quickly. The following describes how these two lists are generated and cached to redis.

List of popular pages

  • When the user clicks the hot button, the online service returns data that the offline part has already generated for that user and cached in redis. A hot page shows articles with high hot values, where an article's hot value is computed from its release time and the user behavior it has accumulated (the number of likes, favorites and reads).
  • After the materials are processed in the early morning of each day, we have each article's publication time (a static feature) and the number of likes, favorites and reads it has accumulated so far (dynamic features). We can then traverse every article in the material pool, subtract its publication time from the current time to get its age, filter out articles published too long ago, and combine the remaining articles' dynamic features with a heat formula to compute each article's hot value. Sorting by hot value yields the hot article list, which is cached to redis as a zset; a zset is used because redis keeps its members automatically sorted by score (here, the hot value). This is a public hot list, which serves as the initialization state of each user's own hot list.
  • Because users differ in preferences and interests, the articles they click on the same hot page differ, and content already exposed to a user is usually filtered out before showing the page again. Therefore, when a user logs in, we generate a separate hot page list for that user, initialized from the public list above. Afterwards, when the user clicks the hot page, articles are fetched from the user's own hot list.
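As a concrete check of the heat formula described above (the 0.6/0.3/0.1 weights and the 72-hour decay match the code below):

```python
def hot_value(likes, collections, reads, hours_since_publish):
    """Heat formula from the offline hot-list job: weighted engagement,
    scaled by 10 and decayed over a 72-hour (3-day) window."""
    return (likes * 0.6 + collections * 0.3 + reads * 0.1) * 10 / (1 + hours_since_publish / 72)

# An article with 10 likes, 5 favorites and 100 reads, published 24 hours ago:
# (6 + 1.5 + 10) * 10 / (1 + 1/3) = 175 / 1.333... ≈ 131.25
print(round(hot_value(10, 5, 100, 24), 2))
```

Note how the denominator grows with the article's age, so an older article needs proportionally more engagement to keep the same hot value.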

The code is under recprocess/recall

# requires "from datetime import datetime" at module level
def get_hot_rec_list(self):
        """Get the likes, favorites and creation time of each article, compute its hot value, and store the hot recommendation list in redis"""
        # Traverse all articles in the material pool
        for item in self.feature_protrail_collection.find():
            news_id = item['news_id']
            news_cate = item['cate']
            news_ctime = item['ctime']
            news_likes_num = item['likes']
            news_collections_num = item['collections']
            news_read_num = item['read_num']
            news_hot_value = item['hot_value']

            # Time conversion and time difference; this assumes the current time is
            # later than the news creation time, and no exception is caught yet
            news_ctime_standard = datetime.strptime(news_ctime, "%Y-%m-%d %H:%M")
            cur_time_standard = datetime.now()
            time_day_diff = (cur_time_standard - news_ctime_standard).days
            time_hour_diff = (cur_time_standard - news_ctime_standard).seconds / 3600

            # Keep only content from the last 3 days
            if time_day_diff > 3:
                continue

            # Heat score using the Rubik's Cube show heat formula (the weights can be adjusted).
            # Since likes, collections and reads are all cumulative counts, this computes
            # the article's current heat rather than an increment.
            # An alternative form: news_hot_value = (news_likes_num * 6 + news_collections_num * 3 + news_read_num * 1) * 10 / (time_hour_diff + 1) ** 1.2
            # 72 hours corresponds to 3 days
            news_hot_value = (news_likes_num * 0.6 + news_collections_num * 0.3 + news_read_num * 0.1) * 10 / (1 + time_hour_diff / 72)

            # Update the article's hot_value in the material pool
            item['hot_value'] = news_hot_value
            self.feature_protrail_collection.update({'news_id': news_id}, item)

            # Save to redis; the zset keeps members sorted by hot value
            self.reclist_redis.zadd('hot_list', {'{}_{}'.format(news_cate, news_id): news_hot_value}, nx=True)

In short, the material pool is traversed every day; for each article, the hot value is computed from its dynamic information and static features, and sorting by hot value produces the public hot template that initializes each user's individual hot list.
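To see why a zset is convenient here: ZREVRANGE returns members ordered by score from high to low, which is exactly the hot list ordering the online service needs. A pure-Python stand-in for that behavior (hypothetical, for illustration only):

```python
def zrevrange_sim(zset, start, stop):
    """Mimic redis ZREVRANGE on a {member: score} dict:
    members sorted by score, highest first, inclusive stop index."""
    ordered = sorted(zset, key=zset.get, reverse=True)
    return ordered[start:stop + 1]

# Members keyed as "{cate}_{news_id}", scored by hot value, as in the code above
hot_list = {"cate1_n1": 131.2, "cate2_n2": 88.0, "cate1_n3": 250.7}
print(zrevrange_sim(hot_list, 0, 1))  # ['cate1_n3', 'cate1_n1']
```

With a real redis zset, this ordering is maintained automatically on every `zadd`, so the offline job never has to re-sort the list itself.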

Recommended page list

The recommendation page is the core of the recommendation system. For each user we generate a different recommendation page; this is the familiar "a thousand users, a thousand faces". How is this done? The saved user portraits and item portraits are used to build features, and a model predicts a ranking, producing the so-called personalization. For a new user, no user portrait has been stored in advance, so the user is handled as a cold start. This part is therefore split in two: cold start and personalized recommendation.

  • Cold start: cold start mainly targets new users. Using rough information such as age and gender (collected when the user registers), we pick articles from a few broad categories suited to that age and gender, then generate a cold start recommendation list for the new user based on the articles' popularity. Of course, this is only a simple approach; cold start is actually a much more complicated scenario, and interested readers can consult other materials.

The code is under recprocess/cold_start

def generate_cold_start_news_list_to_redis_for_register_user(self):
        """Create a cold start news list for registered users"""
        # Group registered users by age and gender, then copy the matching
        # cold start list to each user's own redis key
        for user_info in self.register_user_sess.query(RegisterUser).all():
            if int(user_info.age) < 23 and user_info.gender == "female":
                redis_key = "cold_start_group:{}".format(str(1))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) >= 23 and user_info.gender == "female":
                redis_key = "cold_start_group:{}".format(str(2))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) < 23 and user_info.gender == "male":
                redis_key = "cold_start_group:{}".format(str(3))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
            elif int(user_info.age) >= 23 and user_info.gender == "male":
                redis_key = "cold_start_group:{}".format(str(4))
                self.copy_redis_sorted_set(user_info.userid, redis_key)
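The four branches above amount to a small mapping from (age, gender) to a cold start group. Extracted as a standalone function for clarity (a sketch mirroring the code, with 23 as the age threshold and gender assumed to be "female" or "male"):

```python
def cold_start_group(age, gender):
    """Map a registered user's rough profile to a cold start group id,
    mirroring the four branches above (age threshold 23)."""
    if gender == "female":
        return 1 if age < 23 else 2
    return 3 if age < 23 else 4

print(cold_start_group(20, "female"))  # 1
print(cold_start_group(30, "male"))    # 4
```

Each group id corresponds to a `cold_start_group:{id}` sorted set in redis, which `copy_redis_sorted_set` then copies to the new user's own key.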
  • Personalization: personalized recommendation mainly targets old users. We capture their interests through the normal recommendation pipeline to achieve personalization and improve the user experience. This part therefore follows the familiar recall → ranking → re-ranking → list generation flow. Recall quickly selects, from the massive item pool, a small set of items the user is potentially interested in, based on some user features; it emphasizes speed. Ranking integrates more features and uses a more complex model to score candidates for personalized recommendation; it emphasizes accuracy. Re-ranking takes the ranking results and applies various business strategies, such as deduplication, insertion, shuffling and diversity guarantees, which are mainly driven by product strategy and user experience.


First, users are split into two groups. A new user goes through the cold start process, which generates a cold start recommendation list from the user's rough information. An old user goes through the personalized process, which generates a personalized list via recall → ranking → re-ranking, etc. In the end, both lists are stored in redis.
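The recall → ranking → re-ranking flow for old users can be sketched as a pipeline of pluggable stages (all function names here are hypothetical, not from the project):

```python
def personalized_pipeline(user_id, recall_fns, score_fn, rerank_fn, top_k=10):
    """Hypothetical recall -> ranking -> re-ranking sketch.

    recall_fns: callables, each returning candidate item ids for the user
    score_fn:   scores a (user, item) pair; higher is better
    rerank_fn:  applies business rules (dedup, diversity, ...) to the ranked list
    """
    # Recall: merge candidates from several fast sources
    candidates = set()
    for recall in recall_fns:
        candidates.update(recall(user_id))
    # Ranking: score each candidate with the (more expensive) model
    ranked = sorted(candidates, key=lambda item: score_fn(user_id, item), reverse=True)
    # Re-ranking: business strategies, then truncate to the page size
    return rerank_fn(user_id, ranked)[:top_k]

# Toy usage with stub stages:
recalls = [lambda u: ["n1", "n2"], lambda u: ["n2", "n3"]]
score = lambda u, item: {"n1": 0.2, "n2": 0.9, "n3": 0.5}[item]
rerank = lambda u, items: items  # identity: no business rules in this toy
print(personalized_pipeline("u1", recalls, score, rerank, top_k=2))  # ['n2', 'n3']
```

The shape matters more than the stubs: recall fans out cheaply, ranking scores a small candidate set precisely, and re-ranking is the last place to enforce business rules.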


Online

The online part provides services for the actions a user triggers while using the APP or system. When a user first enters the system, they land on the news recommendation page, and the system fetches and displays the recommendation page articles for that user. When a user enters the hot page, the system fetches and displays that user's hot page list.

  • Get the recommendation page list: this service is triggered when the user first enters the system and when the user refreshes or pulls down while browsing the recommendation page. When triggered, the system first determines whether the user is new or old.

    • If new, read from the cold start list stored offline and select a specified number of articles to recommend (for example, 10 articles at a time). Before recommending, already-exposed articles must be removed (repeated exposure hurts the user experience), so for each user we also keep an exposure list for deduplication; whenever a batch of articles is exposed, the exposure list is updated.
    • If old, read from the personalized recommendation list stored offline. As above, select the specified number of articles, remove exposed ones, generate the final recommendation list, and update the user's exposure record.
  • Get the hot page list: this service is triggered when the user clicks the hot page and when the user refreshes while browsing it. Again, new and old users are distinguished.

    • If new, first generate a hot page list for the user from the public hot template stored offline, then read from it, select a specified number of articles to recommend, remove exposed ones, generate the final recommendation list, and update the exposure record as above.
    • If old, read from the user's own hot list stored offline, select a specified number of articles to recommend, remove exposed ones, generate the final recommendation list, and update the exposure record.
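The exposure-deduplication step that both services share can be expressed as a small helper (a pure-Python sketch of the logic; the project keeps the exposure record in a redis set instead):

```python
def select_unexposed(candidates, exposed, n):
    """Pick up to n candidate news ids the user has not seen yet,
    then add them to the exposure record."""
    picked = []
    for news_id in candidates:
        if news_id in exposed:
            continue  # already shown on the recommendation or hot page
        picked.append(news_id)
        if len(picked) == n:
            break
    exposed.update(picked)  # newly shown articles are now exposed too
    return picked

exposed = {"n2"}
print(select_unexposed(["n1", "n2", "n3", "n4"], exposed, 2))  # ['n1', 'n3']
print(sorted(exposed))  # ['n1', 'n2', 'n3']
```

Because the exposure record is updated immediately after selection, the next refresh naturally skips everything already shown.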

The code is under recprocess

def get_hot_list(self, user_id):
        """Hot page list results"""
        hot_list_key_prefix = "user_id_hot_list:"
        hot_list_user_key = hot_list_key_prefix + str(user_id)

        user_exposure_prefix = "user_exposure:"
        user_exposure_key = user_exposure_prefix + str(user_id)

        # When there is no data for this user yet, copy the public hot list
        if self.reclist_redis_db.exists(hot_list_user_key) == 0: # returns 1 if the key exists, 0 if it does not
            print("copy a hot_list for {}".format(hot_list_user_key))
            # Generate a hot page list for the current user by copying hot_list
            # to a key containing the user_id
            self.reclist_redis_db.zunionstore(hot_list_user_key, ["hot_list"])

        # A page shows 10 items by default, but more candidates are pulled here,
        # because some of them may already have been exposed on the recommendation page
        article_num = 200

        # zrevrange returns a list of members sorted by score from high to low
        candiate_id_list = self.reclist_redis_db.zrevrange(hot_list_user_key, 0, article_num-1)

        if len(candiate_id_list) > 0:
            # Fetch the news content by news_id and return a list whose elements
            # are the news info dictionaries, in display order
            news_info_list = []
            selected_news = []   # record what was actually chosen
            cou = 0

            # Exposure list
            if self.exposure_redis_db.exists(user_exposure_key) > 0:
                exposure_list = self.exposure_redis_db.smembers(user_exposure_key)
                news_expose_set = set(map(lambda x: x.split(':')[0], exposure_list))
            else:
                news_expose_set = set()

            for i in range(len(candiate_id_list)):
                candiate = candiate_id_list[i]
                news_id = candiate.split('_')[1]

                # Skip articles already exposed, whether on the recommendation page or the hot page
                if news_id in news_expose_set:
                    continue

                # TODO Some news may have no static info; the bug is that json.loads()
                # on malformed data in redis raises an error, so the data in redis needs cleaning.
                # News that cannot be json-loaded could be filtered out during material processing.
                try:
                    news_info_dict = self.get_news_detail(news_id)
                except Exception as e:
                    with open("/home/recsys/news_rec_server/logs/news_bad_cases.log", "a+") as f:
                        f.write(news_id + "\n")
                        print("there are not news detail info for {}".format(news_id))
                    continue

                # Make sure the json sent to the front end uses the expected quoting on keys
                news_info_list.append(news_info_dict)
                news_expose_set.add(news_id)
                # Note that the original member key contains the category information
                selected_news.append(candiate)

                cou += 1
                if cou == 10:
                    break

            if len(selected_news) > 0:
                # Manually delete the returned items from the cached list; this is critical.
                # The number of removed elements is returned so deletion can be verified.
                removed_num = self.reclist_redis_db.zrem(hot_list_user_key, *selected_news)
                print("the numbers of be removed:", removed_num)

            # Exposure reset
            return news_info_list
        else:
            # TODO temporary workaround, not ideal: when the user's list is exhausted,
            # copy the public hot list again and retry
            self.reclist_redis_db.zunionstore(hot_list_user_key, ["hot_list"])
            print("copy a hot_list for {}".format(hot_list_user_key))
            # After re-copying everything following a full refresh, remember to clear today's exposure data
            return self.get_hot_list(user_id)


Posted by bigphpn00b on Thu, 02 Jun 2022 08:42:47 +0530