Hive e-commerce data warehouse

Hive e-commerce data warehouse (II)


Business terms
1. user
The user takes the device as the judgment standard. In mobile statistics, each independent device is considered as an independent user. The Android system identifies an independent user according to the IMEI number, and the IOS system identifies an independent user according to the OpenUDID, one user for each mobile phone.

2. new user
Users who use the application online for the first time. If a user opens an APP for the first time, the user is defined as a new user; Uninstalling and reinstalling devices will not be counted as a new addition. New users include daily new users, weekly new users and monthly new users.

3. active users
The user who opens the application is an active user, regardless of the user's usage. A device that is opened multiple times a day is counted as an active user.

4. weekly (monthly) active users
For a user who has started an application in a natural week (month), only one active user is recorded for multiple starts in that week (month).

5. monthly activity rate
The proportion between monthly active users and the total number of users accumulated up to this month.

6. silent users
The user only starts once on the day of installation (the next day), and there is no restart in the subsequent time. This indicator can reflect the quality of new users and the matching degree between users and APP.

7. version distribution
The number of new users, active users and startup times of different versions in each day of the week. It is helpful to judge the pros and cons of APP versions and user behavior habits.

8. return users this week
Users who did not start the app last week and started the app this week.

9. active users for consecutive n weeks
Start at least once a week for n consecutive weeks.

10. loyal users
Users who have been active for more than 5 weeks

11. continuously active users
Active users for 2 weeks or more

12. recently lost users
Users who have not started the application for consecutive n (2 < = n < = 4) weeks. (not started in week n+1)

13. retained users
New users within a certain period of time who still use the application after a certain period of time are considered as retained users; The proportion of these users in the new users at that time is the retention rate.

For example, 200 new users were added in May. Among these 200 users, 100 started applications in June, 80 in July and 50 in August; Then the retention rate of new users in May is 50% after one month, 40% after two months and 25% after three months.

14. user freshness
The proportion of new and old users who start the application every day, that is, the proportion of new users in the number of active users.

15. single use duration
Length of time used for each startup.

16. calculation standard of startup times
The IOS platform application is returned to the background for an independent startup; For Android platform, we stipulate that the interval between two starts is less than 30 seconds, and one start is calculated. In the process of using, if the user exits the application for 30 seconds due to sending and receiving SMS or answering the phone, and then returns to the application again, the two behaviors should be continuous rather than independent, so they can be counted as one use behavior, that is, one startup. The standard of 30 seconds is mostly used in the industry, but users can still customize this time interval.

System function
1 collect_set function
1) Create original data table
hive (gmall)>

drop table if exists student ;
create table student 
(name string, area string, course string, score int);

2) Insert data into the original data table
hive (gmall)>

insert into table student values('zhang3','bj','math',88);
insert into table student values('li4','bj','math',99);
insert into table student values('wang5','sh','chinese',92);
insert into table student values('zhao6','sh','chinese',54);
insert into table student values('tian7','bj','chinese',91);

3) Querying data in a table
hive (gmall)>

select * from student ;

student.name  student.area student.course  student.score
zhang3  bj       math    88
li4     bj       math    99
wang5   sh      chinese  92
zhao6   sh      chinese  54
tian7   bj      chinese  91

4) Aggregate the data of different rows in the same group into a set

hive (gmall)> 
select course, collect_set(area), avg(score) 
from student group by course;


chinese ["sh","bj"]     79.0
math    ["bj"]          93.5

5) Subscript can take a
hive (gmall)>

select course, collect_set(area)[0], avg(score) 
from student group by course;

chinese sh      79.0
math    bj      93.5

2 date processing function
1) date_format function (sort date according to format)

hive (gmall)> select date_format('2019-02-10','yyyy-MM');

2019-02

2) date_add function (plus or minus date)

hive (gmall)> select date_add('2019-02-10',1);
2019-02-11

hive (gmall)> select date_add('2019-02-10',-1);
2019-02-09

3) next_day function
(1) Take the next Monday of the current day

hive (gmall)> select next_day('2019-02-12','MO');
2019-02-18

Note: English from Monday to Sunday
(Monday´╝îTuesday,Wednesday,Thursday,Friday,Saturday,Sunday)

(2) Take the Monday of the current week

hive (gmall)> select date_add(next_day('2019-02-12','MO'),-7);
2019-02-11

4) last_day function (find the last day of the month)

hive (gmall)> select last_day('2019-02-10');
2019-02-28

Demand 1: user active theme

1.1 DWS layer

Objective: to make statistics on the details of each equipment in the current day, current week and current month

1.1.1 daily active equipment details dws_uv_detail_day

1.1.2 weekly user access details dws_uv_detail_wk

Obtain the weekly user access details according to the daily user access details.

1.1.3 monthly active equipment details dws_uv_detail_mn

1.2 ADS layer

Target: number of active devices in the current day, current week and current month

1.2.1 number of active devices ads_uv_count


1) Create table statement

hive (gmall)>
drop table if exists gmall.ads_uv_count;
create external table gmall.ads_uv_count( 
`dt` string COMMENT 'Statistical date',
`day_count` bigint COMMENT 'Number of users of the day',
`wk_count`  bigint COMMENT 'Number of users in the current week',
`mn_count`  bigint COMMENT 'Number of users in the current month',
`is_weekend` string COMMENT 'Y,N Is it a weekend,Used to get the final result of this week',
`is_monthend` string COMMENT 'Y,N Month end,Used to get the final result of this month' 
) COMMENT 'Number of active devices' 
row format delimited fields terminated by '\t' 
location '/warehouse/gmall/ads/ads_uv_count/'
;

2) Import data

hive (gmall)>
insert into table gmall.ads_uv_count 
select  
  '2019-02-10' dt,
   daycount.ct,
   wkcount.ct,
   mncount.ct,
   if(date_add(next_day('2019-02-10','MO'),-1)='2019-02-10','Y','N') ,
   if(last_day('2019-02-10')='2019-02-10','Y','N') 
from 
(
   select  
      '2019-02-10' dt,
       count(*) ct
   from gmall.dws_uv_detail_day
   where dt='2019-02-10'  
)daycount join 
( 
   select  
     '2019-02-10' dt,
     count (*) ct
   from gmall.dws_uv_detail_wk
   where wk_dt=concat(date_add(next_day('2019-02-10','MO'),-7),'_' ,date_add(next_day('2019-02-10','MO'),-1) )
) wkcount on daycount.dt=wkcount.dt
join 
( 
   select  
     '2019-02-10' dt,
     count (*) ct
   from gmall.dws_uv_detail_mn
   where mn=date_format('2019-02-10','yyyy-MM')  
)mncount on daycount.dt=mncount.dt
;

1.2.2 ADS layer loading data script

1) Create a script in the /home/hadoop/bin directory of node03

[hadoop@node03 bin]$ vim ads_uv_log.sh

Write the following in the script

#!/bin/bash

# Define variables for easy modification
APP=gmall
hive=/kkb/install/hive-1.1.0-cdh5.14.2/bin/hive

# If it is the entered date, the entered date will be used; If no date is entered, take the day before the current time
if [ -n "$1" ] ;then
    do_date=$1
else 
    do_date=`date -d "-1 day" +%F`  
fi 

sql="
  set hive.exec.dynamic.partition.mode=nonstrict;

insert into table "$APP".ads_uv_count 
select  
  '$do_date' dt,
   daycount.ct,
   wkcount.ct,
   mncount.ct,
   if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
   if(last_day('$do_date')='$do_date','Y','N') 
from 
(
   select  
      '$do_date' dt,
       count(*) ct 
   from "$APP".dws_uv_detail_day
   where dt='$do_date'  
)daycount   join 
( 
   select  
     '$do_date' dt,
     count (*) ct
   from "$APP".dws_uv_detail_wk
   where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_' ,date_add(next_day('$do_date','MO'),-1) )
)  wkcount  on daycount.dt=wkcount.dt
join 
( 
   select  
     '$do_date' dt,
     count (*) ct
   from "$APP".dws_uv_detail_mn
   where mn=date_format('$do_date','yyyy-MM')  
)mncount on daycount.dt=mncount.dt;
"

$hive -e "$sql"

2) Increase script execution permission

[hadoop@node03 bin]$ chmod 777 ads_uv_log.sh

3) Script usage

[hadoop@node03 bin]$ ./ads_uv_log.sh 2019-02-11

Requirement 2: new topics for users

Users who use the application online for the first time. If a user opens an APP for the first time, the user is defined as a new user; Uninstalling and reinstalling devices will not be counted as a new addition. New users include daily new users, weekly new users and monthly new users.

2.1 DWS layer (daily new equipment list)



Use the daily active user table Left Join to add a daily equipment table. The associated condition is mid_ The IDs are equal. If it is a daily newly added device, it is null in the daily newly added device table.

2.2 ADS layer (daily newly added equipment table ads_new_mid_count)

Demand 3: user retention theme

3.1 demand objectives


3.2 DWS layer

3.2.1 DWS of daily user details_ User_ Retention_ Day

3.2.2 DWS of user details retained in n days_ User_ Retention_ Day

1) Import data (new users' access and retention details in the first 1,2,3, and n days before calculation every day)

hive (gmall)>
insert overwrite table gmall.dws_user_retention_day
partition(dt="2019-02-11")
select
    nm.mid_id,
    nm.user_id,
    nm.version_code,
    nm.version_name,
    nm.lang,
    nm.source,
    nm.os,
    nm.area,
    nm.model,
    nm.brand,
    nm.sdk_version,
    nm.gmail,
    nm.height_width,
    nm.app_time,
    nm.network,
    nm.lng,
    nm.lat,
    nm.create_date,
    1 retention_day 
from gmall.dws_uv_detail_day ud join gmall.dws_new_mid_day nm  on ud.mid_id =nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-1)

union all
select  
    nm.mid_id,
    nm.user_id , 
    nm.version_code , 
    nm.version_name , 
    nm.lang , 
    nm.source, 
    nm.os, 
    nm.area, 
    nm.model, 
    nm.brand, 
    nm.sdk_version, 
    nm.gmail, 
    nm.height_width,
    nm.app_time,
    nm.network,
    nm.lng,
    nm.lat,
    nm.create_date,
    2 retention_day 
from  gmall.dws_uv_detail_day ud join gmall.dws_new_mid_day nm 
      on ud.mid_id =nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-2)

union all
select  
    nm.mid_id,
    nm.user_id , 
    nm.version_code , 
    nm.version_name , 
    nm.lang , 
    nm.source, 
    nm.os, 
    nm.area, 
    nm.model, 
    nm.brand, 
    nm.sdk_version, 
    nm.gmail, 
    nm.height_width,
    nm.app_time,
    nm.network,
    nm.lng,
    nm.lat,
    nm.create_date,
    3 retention_day 
from  gmall.dws_uv_detail_day ud join gmall.dws_new_mid_day nm   on ud.mid_id =nm.mid_id 
where ud.dt='2019-02-11' and nm.create_date=date_add('2019-02-11',-3);

(1) Union will de duplicate the union result set, which is less efficient than union all
(2) union all does not de duplicate the result set, so it is efficient

3.3 ADS layer

3.3.1 number of retained users ads_user_retention_day_count

3.3.2 retained user ratio ads_user_retention_day_rate

Demand 4: number of silent users

Silent user: refers to the user who started only on the day of installation and started one week ago

4.1 DWS layer

Use daily living schedule dws_uv_detail_day as DWS layer data

4.2 ADS layer

4.2.1 number of silent users ads_silent_count

Demand 5: number of returned users this week

Reflow this week = active this week - new this week - active last week

5.1 DWS layer

Use daily living schedule dws_uv_detail_day as DWS layer data

5.2 ADS layer

5.2.1 number of returned users this week ads_back_count

Demand 6: number of lost users

Lost users: we call them lost users who have not logged in in the last 7 days

6.1 DWS layer

Use daily living schedule dws_uv_detail_day as DWS layer data

6.2 ADS layer

Demand 7: number of active users in the last three consecutive weeks

Users who have been active continuously in the last three weeks: statistics are usually made for the data of the previous three weeks on Monday, and the data is calculated once a week.

7.1 DWS layer

Use weekly schedule dws_uv_detail_wk as DWS layer data

7.2 ADS layer


1) Create table statement

hive (gmall)>
drop table if exists  gmall.ads_continuity_wk_count;
create external table gmall.ads_continuity_wk_count( 
    `dt` string COMMENT 'Statistical date,Usually end week Sunday date,If calculated once a day,Available current day date',
    `wk_dt` string COMMENT 'Duration',
    `continuity_count` bigint
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';

2) Import data of the week of February 20, 2019

hive (gmall)>
insert into table gmall.ads_continuity_wk_count
select 
     '2019-02-20',
     concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-1)),
     count(*)
from 
(
    select mid_id
    from gmall.dws_uv_detail_wk
    where wk_dt>=concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-7*2-1)) 
    and wk_dt<=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
    group by mid_id
    having count(*)=3
)t1;

Demand 8: number of active users for three consecutive days in the last seven days

Note: number of active users for 3 consecutive days in the last 7 days

8.1 DWS layer

Use daily living schedule dws_uv_detail_day as DWS layer data

8.2 ADS layer


1) Create table statement

hive (gmall)>
drop table if exists gmall.ads_continuity_uv_count;
create external table gmall.ads_continuity_uv_count( 
    `dt` string COMMENT 'Statistical date',
    `wk_dt` string COMMENT 'Date of last 7 days',
    `continuity_count` bigint
) COMMENT 'Number of continuously active devices'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';

2) Write out SQL statements for importing data

hive (gmall)>
insert into table gmall.ads_continuity_uv_count
select
    '2019-02-12',
    concat(date_add('2019-02-12',-6),'_','2019-02-12'),
    count(*)
from
(
    select mid_id
    from
    (
        select mid_id      
        from
        (
            select 
                mid_id,
                date_sub(dt,rank) date_dif
            from
            (
                select 
                    mid_id,
                    dt,
                    rank() over(partition by mid_id order by dt) rank
                from gmall.dws_uv_detail_day
                where dt>=date_add('2019-02-12',-6) and dt<='2019-02-12'
            )t1
        )t2 
        group by mid_id,date_dif
        having count(*)>=3
    )t3 
    group by mid_id
)t4;

Tags: hive

Posted by Nuggit on Mon, 30 May 2022 06:59:09 +0530