SQL Training for E-Commerce

Haider Raza
7 min readSep 14, 2020

A couple of days ago I had the chance to work on some SQL queries for an e-commerce website. Along the way, I was stuck on some relations and went for google help. Found some really good open-source platforms which helped me in understanding some new concepts. later on, I thought if those resources were not present online if no one would have uploaded those questions and no one would have answered them will I be able to complete my work in due time?. After this realization, I thought now is the time to give back, what I have learned from the Tech community.

In this Article starting from some very basic queries I will work my way to some complex queries. The Data Warehouse I am using is a very basic E-commerce DWH.

In the user life cycle when a new user is registered it will make a new row in the ‘users’ table. If any user will make a sale new rows will be created in the ‘sales’ and ‘payments’ table. User activity will be recorded in the ‘activities’ table. One new row per user will be created for every new activity. ‘Activities.activity_is_last’ boolean will be updated accordingly and will only be true for only one row per user at any time. Device information per activity will be recorded in the ‘devices’ and ‘informations’ table.

Question 1. The number of daily active users per device?

The simple question we need to calculate the number of daily active users per device. we have three types of devices Mobile, Tablet, and PC. So we have to calculate how many users are using Mobiles, Tablets, and PCs per day.

select Date, device_id, count(user_id) as ‘No of active users’ from (
select date (activities.activity_timestamp) as date, devices.device_id, users.user_id
from activities
inner join users
inner join devices
on activities.activity_user_id = users.user_id
and activities.activity_device_id = devices.device_id
group by date,device_id,user_id)as users_per_device_per_date

group by date,device_id
order by date desc;
Figure (1.1)

First I would explain the inner query which is in italics (Just for identification) in the above complete query. I have joined the ‘activities’ table to ‘users’ and ‘devices’ table. The reason is to get distinct users and devices from activities. For example on the date, ‘XX-XX-XXXX’ one user with user_id ‘1001’ has 5 different activities. 2 from Tablet, 1 from Mobile, and 2 from PC. In this case, our final result should be one active user per device for date ‘XX-XX-XXXX’ and user_id ‘1001’. Because we do not want activity repetitions from the same user_id. As a result of the inner query, we will get a table with Date, device_id, and user_id shown in Figure (1.1).

Figure (1.2)

Now in outer Query, we are only doing the Count of user_ids to get the actual count per device per date. In the end, ordering it by date. The final result is shown in figure (1.2).

Kindly note one important factor for the above type of questions, Always extract complete ‘date’ from the timestamp. if you will only extract ‘day’ it will create repetitions. For example, the day value extracted from timestamps 14 August 2020 and 14 September 2020 will be the same equal to 14.

Question 2. Total sale amount per city for 2020?

Easy to calculate metric and gives really good information about demographical analysis of your customer base. Per the city, in 2020 we have to calculate the total payment amount which is also equal to the total sale amount.

select city, sum(pay_amount) as ‘Pay Amount’ from(
select users.user_city as city,users.user_id,payments.pay_sale_id,payments.pay_amount
from sales
inner join users
inner join payments
on users.user_id = sales.sale_user_id
and sales.sale_id = payments.pay_sale_id
where payments.pay_timestamp between ‘2020–01–01’ and ‘2020–12–31’
group by city,user_id,sale_id)as sale_amount_per_city_per_user_id

group by city;
Figure (2.1)

I have used the same pattern as question 1 inner query in italics and outer query in bold. First I will explain the inner query. I have joined the ‘sales’ table to ‘users’ and ‘payments’ table on user_id and sales_id using where condition on the timestamp to get only data where the year is 2020. Grouped by city, user_id, and sale_id to avoid any repetitions. This will result in all user’s payments in the year 2020 by city result shown in figure (2.1).

Figure (2.2)

Now in the outer query, we just have to sum the payment value to get the complete pay amount value for that city. The final result is shown in fig (2.2).

Question 3. The number of users whose last activity was done by device_name=‘Phone’ but also has one additional activity with any other device?

This metric will give us some insight into the number of users who have shifted from one device to another device. In our specific case, we are calculating users who have shifted to the phone from any other device.

select count(users.user_id) as ‘No of users’ From users 
where user_id in (
select activities.activity_user_id from activities inner join
devices
on devices.device_id = activities.activity_device_id
where activities.activity_is_last = 1 and
devices.device_name = ‘Phone’
)
and
user_id in (
select activities.activity_user_id from activities inner join
devices
on devices.device_id = activities.activity_device_id
where
devices.device_name <> ‘Phone’
)

Joined ‘users’ and ‘activites’ table on device_id. Then using the AND operator between users with last activity done by phone and users with any other activity not done by phone. The count of all users is our result.

Question 4. The number of page_visits and Registrations per date in the year 2020?

We have to calculate the number of new Registrations and the number of page visits in the year 2020 per date. This result will give show us the relation between new registration and customer engagement with our platform on that day. This will show how interactive and engaging is our platform for new customers. If the result shows that we are having new registration but a lesser number of page visits then our product team has to work on customer engagement. Different results can be derived from it, to increase customer engagement resulting in more page visits.

select coalesce (date1,date2) as Date,no_of_visits1 as ‘No of Visits’,no_of_regs2 as ‘No of Registerations’ from 
(
SELECT d1.date1,d1.no_of_visits1,d2.date2,d2.no_of_regs2 from
(
select date (inf.info_timestamp)as date1,sum(inf.info_page_visits)as no_of_visits1
from informations as inf
where inf.info_timestamp between ‘2020–01–01’ and ‘2021–01–01’
group by date1) as d1
left outer join
(
select date (u.user_registeration_timestamp)as date2,count(u.user_id)as no_of_regs2
from users as u
where u.user_registeration_timestamp between ‘2020–01–01’ and ‘2021–01–01’
group by date2
) as d2
on d1.date1 = d2.date2
UNION
SELECT d3.date3,d3.no_of_visits3,d4.date4,d4.no_of_regs4 from
(
select date (inf.info_timestamp)as date3,sum(inf.info_page_visits)as no_of_visits3
from informations as inf
where inf.info_timestamp between ‘2020–01–01’ and ‘2021–01–01’
group by date3) as d3
right outer join
(
select date (u.user_registeration_timestamp)as date4,count(u.user_id)as no_of_regs4
from users as u
where u.user_registeration_timestamp between ‘2020–01–01’ and ‘2021–01–01’
group by date4
) as d4
on d3.date3 = d4.date4)as full_table;

The above query is very simple let me break it down for you. I have used full outer join on two tables. Table one showing the number of page visits per date and table two showing the number of user registrations per date. MySQL does not support full outer join so to get the functionality of full outer join you have to union right outer join and left outer join.

Figure (4.1)

After joining these two tables the resulting table will have date1 corresponding to the number of visits and date2 corresponding to the number of registrations shown in figure (4.1).

Figure (4.2)

For the final result, I have used the coalesce operator to join Date1 and Date 2 result shown in figure (4.2). On dates from ‘2020–01–02’ to ‘2020–06–01’ we have some visits but no new registration. On dates ‘2020–07–02’ and ‘2020–08–10’ we have new registrations but no page visits.

We can also find out the number of page visits only on those days we have new registrations in the year 2020.

select coalesce(usr.date2,pg.date1) as Date ,usr.no_of_regs as 'No of registrations',pg.page_visits as 'No of page visits' from (
select date(u.user_registeration_timestamp) as date2, count(u.user_id) as no_of_regs from users as u
group by date2) as usr
left join (select date(inf.info_timestamp) as date1 ,sum(info_page_visits) as page_visits from informations as inf
group by date1) as pg
on pg.date1=usr.date2
where usr.date2 between '2020-01-01' and '2020-12-31'
Figure (4.3)

As you can see in ‘users’ table figure (4.3) we have four new registrations on the date “2020–05–02”, three new registrations on the date “2020–07–02” and one new registration on the date “2020–08–10” in the year 2020.

Figure (4.4)

As you can see in the ‘activities’ table figure (4.4) we have 10-page visits on the date “2020–05–02” and 14-page visits on “2020–07–02”.

Figure (4.5)

This will be our result shown in figure (4.5). Showing us the number of page visits per day in 2020 where we have new registrations. You can observe that on the date “2020–08–10” we had one new registration but no page visits.

--

--