Airbnb Dataset – Extracting Host information

The Inside Airbnb London dataset is regularly scraped from the Airbnb site a surprising amount of data is duplicated, or incomplete.

CREATE TABLE [dbo].[listingsRaw](
	[id] [bigint] NOT NULL,
	[listing_url] [nvarchar](50) NOT NULL,
	[scrape_id] [datetime2](7) NOT NULL,
	[last_scraped] [date] NOT NULL,
	[source] [nvarchar](50) NOT NULL,
	[name] [nvarchar](max) NOT NULL,
	[description] [nvarchar](max) NULL,
	[neighborhood_overview] [nvarchar](1050) NULL,
	[picture_url] [nvarchar](150) NULL,
	[host_id] [int] NOT NULL,
	[host_url] [nvarchar](50) NOT NULL,
	[host_name] [nvarchar](50) NULL,
	[host_since] [date] NULL,
	[host_location] [nvarchar](100) NULL,
	[host_about] [nvarchar](max) NULL,
	[host_response_time] [nvarchar](50) NULL,
	[host_response_rate] [nvarchar](50) NULL,
	[host_acceptance_rate] [nvarchar](50) NULL,
	[host_is_superhost] [bit] NULL,
	[host_thumbnail_url] [nvarchar](150) NULL,
	[host_picture_url] [nvarchar](150) NULL,
	[host_neighbourhood] [nvarchar](50) NULL,
	[host_listings_count] [int] NULL,
	[host_total_listings_count] [int] NULL,
	[host_verifications] [nvarchar](50) NOT NULL,
	[host_has_profile_pic] [bit] NULL,
	[host_identity_verified] [bit] NULL,
	[neighbourhood] [nvarchar](100) NULL,
	[neighbourhood_cleansed] [nvarchar](50) NOT NULL,
	[neighbourhood_group_cleansed] [nvarchar](1) NULL,
	[latitude] [float] NOT NULL,
	[longitude] [float] NOT NULL,
	[property_type] [nvarchar](50) NOT NULL,
	[room_type] [nvarchar](50) NOT NULL,
	[accommodates] [tinyint] NOT NULL,
	[bathrooms] [nvarchar](1) NULL,
	[bathrooms_text] [nvarchar](50) NULL,
	[bedrooms] [tinyint] NULL,
	[beds] [tinyint] NULL,
	[amenities] [nvarchar](max) NOT NULL,
	[price] [money] NOT NULL,
	[minimum_nights] [smallint] NOT NULL,
	[maximum_nights] [int] NOT NULL,
	[minimum_minimum_nights] [smallint] NULL,
	[maximum_minimum_nights] [int] NULL,
	[minimum_maximum_nights] [int] NULL,
	[maximum_maximum_nights] [int] NULL,
	[minimum_nights_avg_ntm] [float] NULL,
	[maximum_nights_avg_ntm] [float] NULL,
	[calendar_updated] [nvarchar](1) NULL,
	[has_availability] [bit] NOT NULL,
	[availability_30] [tinyint] NOT NULL,
	[availability_60] [tinyint] NOT NULL,
	[availability_90] [tinyint] NOT NULL,
	[availability_365] [smallint] NOT NULL,
	[calendar_last_scraped] [date] NOT NULL,
	[number_of_reviews] [smallint] NOT NULL,
	[number_of_reviews_ltm] [int] NOT NULL,
	[number_of_reviews_l30d] [tinyint] NOT NULL,
	[first_review] [date] NULL,
	[last_review] [date] NULL,
	[review_scores_rating] [float] NULL,
	[review_scores_accuracy] [float] NULL,
	[review_scores_cleanliness] [float] NULL,
	[review_scores_checkin] [float] NULL,
	[review_scores_communication] [float] NULL,
	[review_scores_location] [float] NULL,
	[review_scores_value] [float] NULL,
	[license] [nvarchar](max) NULL,
	[instant_bookable] [bit] NOT NULL,
	[calculated_host_listings_count] [int] NULL,
	[calculated_host_listings_count_entire_homes] [int] NOT NULL,
	[calculated_host_listings_count_private_rooms] [int] NOT NULL,
	[calculated_host_listings_count_shared_rooms] [int] NOT NULL,
	[reviews_per_month] [float] NULL
)
SQL Server Management Studio displaying raw Listings

The ListingsRaw table had 87946 rows and there were 53394 hosts so some hosts must have many listings

SELECT COUNT(*) FROM listingsRaw

SELECT COUNT(DISTINCT HOST_ID) FROM listingsRaw

When I queried the TOP(5) hosts

SELECT [Host_id], COUNT(Host_id) as HostCount
FROM listingsRaw
GROUP BY host_id
ORDER BY COUNT(Host_id) DESC, host_id ASC
Host_id	    HostCount
129230780	595
33889201	312
314162972	301
28820321	288
1432477	    259

I was interested in the five hosts with the most listings.

Host_id: 129230780
Name: STK Homes	
About: Hello from Staykeepers Homes! We are a group of people who love traveling and sharing their experiences with others. But more so, we like to think of ourselves as people who constantly search to challenge and improve themselves.	
URL: https://www.airbnb.com/users/show/129230780

Host_id: 33889201
Name: Veeve
About: Greetings from the Veeve team!   Veeve manage the largest collection of homes to rent for short stays in London since 2011, launching in Paris in 2017. Our rental homes are predominantly privately-owned, lived-in homes in vibrant neighbourhoods. Thanks to our fully managed service, when you stay in a Veeve property you can benefit from the spaciousness, personal style and relaxed feel of a home – but with all the amenities of a hotel.   Once a booking is confirmed our Customer Service team will become your main point of contact and get in touch with you to organise your arrival at the property (including airport transfers, upon request); after which our on-the-ground management team will look after all aspects of your stay, with in-stay support available to help with any queries.   We have strict cleanliness guidelines for our cleaners and all staff who enter our properties and have reinforced existing cleaning procedures in line with official WHO, British and French government guidance.   As always, our team ensure that all properties are cleaned to a professional standard before and after each stay, with increased focus on touch-point cleaning (door handles, kitchen surfaces, kettles, taps etc.).   All staff who enter properties will follow the below procedures:    - Carry and regularly use alcohol-based hand rub - Sanitise their mobile phones  - Focus extra attention when cleaning key touch-point areas (door handles, kitchen surfaces, kettles, taps etc.) - Wear protective gear where appropriate - Wash and disinfect work shoes   We might be able to arrange for property viewings, upon request, for stays of over 28 nights.   Please be aware once a booking is confirmed you will be required to complete a Rental Agreement and also  ID verification via our partners Superhog. This is mandatory for all our stays and in accordance to short lets requirements with Veeve.  If you have any questions, please do not hesitate to contact us and we will be sure to answer you promptly.
URL: https://www.airbnb.com/users/show/33889201

Host_id: 314162972
Name: Blueground	
About: We’re Blueground, a global proptech company with several thousand move-in-ready apartments in a growing number of major cities around the world. With flexible terms and homes in vibrant, centrally based neighborhoods, you’ll feel at home and free to roam for as long as you want — a month, a year, or longer.   Each apartment is thoughtfully designed with exclusive furnishings, fully equipped kitchens, and incredible amenities – making every day a five-star experience. From day one, you’ll enjoy high-speed Wi-Fi, premium linens, and smart home entertainment. Plus, access to pools, gyms, and outdoor spaces in select buildings.  Why stress over your apartment? We provide a hassle-free alternative — a consistent, quality guest experience that starts even before you arrive. Because we let you book our most up-to-date apartment listings online, confirm with a click, pay securely, and check in easily.   During your stay Upon arrival, you’ll either be greeted personally by a Blueground team member or given self-check-in instructions. The entire apartment is yours! You’ll enjoy reliable support via email, phone, and our Guest App, where you can request everything from a home cleaning to extra towels.   We’ll share all details upon confirmation of your stay.	
URL: https://www.airbnb.com/users/show/314162972

Host_id: 28820321
Name: Veronica
About: I moved to London to follow my love who is a Londoner, I have a 4 years old daughter and two cats. we live in a beautiful house surrounded by parks and trees in Muswell Hill.  The Airbnb experience has been very positive and from that experience I set up a company called HelloGuest to provide excellent hassle-free Airbnb management services for short stay rental properties in London and across all the UK advertised on Airbnb.  Our services cover everything from full property management (guest communication, check-in & out, cleaning & laundry, manage the listing, 24/7 support service, etc) to partly managed or individual services depending on your needs.  Our aim is to provide professional services for high quality properties to deliver an excellent guest experience.  We are here to help, so please get in touch with any questions you have.
URL: https://www.airbnb.com/users/show/28820321

host_id: 1432477
Name: City Relay
About: City Relay is a property management company that looks after a bright and diverse portfolio of properties in London! We are committed to offering the best stay to each and every one of our guests. We provide some unique services that most of the other hosts cannot provide.   All our properties are selected carefully in order to offer you the best stay! From the moment you book to the moment you arrive at the apartment, we are helping you through all the processes so that your stay is hassle-free!   In order to address health and safety concerns, all our properties receive a deep professional cleaning and disinfection after each stay.	
URL: https://www.airbnb.com/users/show/1432477

After looking at the listings table this was my first attempt at a Host table.

CREATE TABLE [dbo].[Host](
	[Id] [bigint] NOT NULL,
	[Url] [nvarchar](150) NOT NULL,
	[Name] [nvarchar](50) NOT NULL,
	[Since] [date] NOT NULL,
	[About] [nvarchar](max) NULL,
	[Response_time] [nvarchar](50) NOT NULL,
	[Response_rate] [nvarchar](50) NOT NULL,
	[Acceptance_rate] [nvarchar](50) NOT NULL,
	[Thumbnail_url] [nvarchar](150) NOT NULL,
	[Picture_url] [nvarchar](150) NOT NULL,
	[Identity_verified] [bit] NOT NULL,
	[Identity_verifications] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Host] PRIMARY KEY CLUSTERED 
(
	[Id] ASC
)

Which I populated with the following SELECT INTO statement

SELECT DISTINCT [Host_id] as Id
   ,host_url as url
   ,[Host_Name] as Name
   ,host_since as Since
   ---,host_location as
   ,host_about as About
   ,host_response_time as Response_time
   ,host_response_rate as ResponseRate
   ,host_acceptance_rate as Acceptance_rate
   --,host_is_superhost
   ,host_thumbnail_url as thumbnail_url
   ,host_picture_url as picture_url
   --,host_neighbourhood
   --,host_listings_count
   --,host_total_listings_count
   ,host_verifications as Identity_verifications
   --,host_has_profile_pic
   ,host_identity_verified as Identity_verified
FROM listingsRaw
INTO Host

After extracting the hosts information both needed some more “curation” so I wrote some queries (like the following) to find incomplete records.

SELECT id, Url from Host WHERE Url is null 
SELECT id, Name from Host WHERE Name is null 
SELECT id, Since from Host WHERE Since is null 
SELECT id, response_time from Host WHERE response_time is null 
SELECT id, response_rate from Host WHERE response_rate is null 
SELECT id, Acceptance_rate from Host WHERE Acceptance_rate is null 
SELECT id, Thumbnail_url from Host WHERE Thumbnail_url is null 
SELECT id, Picture_url from Host WHERE Picture_url is null 
SELECT id, Identity_verified from Host WHERE Identity_verified is null 
SELECT id, Identity_verifications from Host WHERE Identity_verifications is null 
SELECT id, About from Host WHERE About is null 

At the end of my data cleansing the Listing table has 87940 rows and the Host table 53388 rows.

Airbnb Dataset – Initial import of the Listings

For the next series of random posts about ASP.NET Core Minimal APIs I wanted a much larger dataset for testing. When looking for datasets for another project I stumbled across Inside Airbnb which has a selection of datasets scraped from the Airbnb website. I started with the Melbourne Victoria Australia dataset which had 24652 listings.

SSMS Import Flat File task configuration

I used the SQL Server Managment Studio (SSMS) “Import flat file” task to load listings.csv.

SSMS Import Flat File task column selection failure

I had to manually update some of the “suggested” column types which “broke” the import flat file task.

SSMS Import Flat File task column designer

The tinyint or smallint columns which caused conversion errors were changed to an integer column and any text column that which caused a truncation error was changed to an NVARCHAR(MAX) column.

SSMS Import Flat File task data preview

The Melbourne Victoria Australia dataset turned out to be pretty “nasty” with lots of nullable and columns that looked like several values had been concatenated.

CREATE TABLE [dbo].[listings](
	[id] [bigint] NOT NULL,
	[listing_url] [nvarchar](50) NOT NULL,
	[scrape_id] [datetime2](7) NOT NULL,
	[last_scraped] [date] NOT NULL,
	[source] [nvarchar](50) NOT NULL,
	[name] [nvarchar](100) NOT NULL,
	[description] [nvarchar](1050) NULL,
	[neighborhood_overview] [nvarchar](1050) NULL,
	[picture_url] [nvarchar](150) NOT NULL,
	[host_id] [int] NOT NULL,
	[host_url] [nvarchar](50) NOT NULL,
	[host_name] [nvarchar](50) NULL,
	[host_since] [date] NULL,
	[host_location] [nvarchar](50) NULL,
	[host_about] [nvarchar](max) NULL,
	[host_response_time] [nvarchar](50) NULL,
	[host_response_rate] [nvarchar](50) NULL,
	[host_acceptance_rate] [nvarchar](50) NULL,
	[host_is_superhost] [bit] NULL,
	[host_thumbnail_url] [nvarchar](150) NULL,
	[host_picture_url] [nvarchar](150) NULL,
	[host_neighbourhood] [nvarchar](50) NULL,
	[host_listings_count] [int] NULL,
	[host_total_listings_count] [int] NULL,
	[host_verifications] [nvarchar](50) NOT NULL,
	[host_has_profile_pic] [nvarchar](50) NULL,
	[host_identity_verified] [bit] NULL,
	[neighbourhood] [nvarchar](max) NULL,
	[neighbourhood_cleansed] [nvarchar](50) NOT NULL,
	[neighbourhood_group_cleansed] [nvarchar](1) NULL,
	[latitude] [float] NOT NULL,
	[longitude] [float] NOT NULL,
	[property_type] [nvarchar](50) NOT NULL,
	[room_type] [nvarchar](50) NOT NULL,
	[accommodates] [tinyint] NOT NULL,
	[bathrooms] [nvarchar](1) NULL,
	[bathrooms_text] [nvarchar](50) NULL,
	[bedrooms] [tinyint] NULL,
	[beds] [tinyint] NULL,
	[amenities] [nvarchar](max) NOT NULL,
	[price] [money] NOT NULL,
	[minimum_nights] [smallint] NOT NULL,
	[maximum_nights] [int] NOT NULL,
	[minimum_minimum_nights] [smallint] NOT NULL,
	[maximum_minimum_nights] [int] NOT NULL,
	[minimum_maximum_nights] [int] NOT NULL,
	[maximum_maximum_nights] [int] NOT NULL,
	[minimum_nights_avg_ntm] [float] NOT NULL,
	[maximum_nights_avg_ntm] [float] NOT NULL,
	[calendar_updated] [nvarchar](1) NULL,
	[has_availability] [bit] NOT NUL\L,
	[availability_30] [tinyint] NOT NULL,
	[availability_60] [tinyint] NOT NULL,
	[availability_90] [tinyint] NOT NULL,
	[availability_365] [smallint] NOT NULL,
	[calendar_last_scraped] [date] NOT NULL,
	[number_of_reviews] [int] NOT NULL,
	[number_of_reviews_ltm] [int] NOT NULL,
	[number_of_reviews_l30d] [tinyint] NOT NULL,
	[first_review] [date] NULL,
	[last_review] [date] NULL,
	[review_scores_rating] [float] NULL,
	[review_scores_accuracy] [float] NULL,
	[review_scores_cleanliness] [float] NULL,
	[review_scores_checkin] [float] NULL,
	[review_scores_communication] [float] NULL,
	[review_scores_location] [float] NULL,
	[review_scores_value] [float] NULL,
	[license] [nvarchar](1) NULL,
	[instant_bookable] [bit] NOT NULL,
	[calculated_host_listings_count] [int] NOT NULL,
	[calculated_host_listings_count_entire_homes] [tinyint] NOT NULL,
	[calculated_host_listings_count_private_rooms] [tinyint] NOT NULL,
	[calculated_host_listings_count_shared_rooms] [bit] NULL,
	[reviews_per_month] [float] NULL,
 CONSTRAINT [PK_listings] PRIMARY KEY CLUSTERED 
(
	[id] ASC
)

On the Kaggle website there was a study of an Airbnb dataset from Singapore which looked better but was a significantly smaller with only 3483 listings. After some “exploration” of the available datasets on Inside Airbnb the London one looked pretty good.

Inside Airbnb map of London listings

The “suggested” table structure for the London dataset looks a lot better so I will use if for my ASP.NET Core Minimal APIs posts

CREATE TABLE [dbo].[listings](
	[id] [bigint] NOT NULL,
	[name] [nvarchar](max) NOT NULL,
	[host_id] [int] NOT NULL,
	[host_name] [nvarchar](50) NULL,
	[neighbourhood_group] [nvarchar](1) NULL,
	[neighbourhood] [nvarchar](50) NOT NULL,
	[latitude] [float] NOT NULL,
	[longitude] [float] NOT NULL,
	[room_type] [nvarchar](50) NOT NULL,
	[price] [money] NOT NULL,
	[minimum_nights] [smallint] NOT NULL,
	[number_of_reviews] [int] NOT NULL,
	[last_review] [date] NULL,
	[reviews_per_month] [float] NULL,
	[calculated_host_listings_count] [int] NOT NULL,
	[availability_365] [smallint] NOT NULL,
	[number_of_reviews_ltm] [int] NOT NULL,
	[license] [nvarchar](50) NULL,
 CONSTRAINT [PK_listings] PRIMARY KEY CLUSTERED 
(
	[id] ASC
)

Some of the columns will need to be “unpacked” e.g. “Rental unit in Islington · ★4.80 · 1 bedroom · 1 bed · 1 shared bath” but this shouldn’t be an issue.