Advanced Tutorial
What to Expect from This Tutorial?
In this tutorial, you will create a table and insert a large dataset (two million rows of the New York taxi data). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN.
This tutorial assumes you have access to a running ClickHouse service. If not, check out the Quick Start.
1. Create a New Table
The New York City taxi data contains the details of millions of taxi rides, with columns like pickup and drop-off times and locations, cost, tip amount, tolls, payment type and so on. Let's create a table to store this data...
- Connect to the SQL console
If you need a SQL client connection, your ClickHouse Cloud service has an associated web based SQL console; expand Connect to SQL console below for details.
Connect to SQL console
From your ClickHouse Cloud services list, click on a service.
This will redirect you to the SQL console.
If you are using self-managed ClickHouse you can connect to the SQL console at https://hostname:8443/play (check with your ClickHouse administrator for the details).
- Create the following
trips
table in thedefault
database:CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
2. Insert the Dataset
Now that you have a table created, let's add the NYC taxi data. It is in CSV files in S3, and you can load the data from there.
-
The following command inserts ~2,000,000 rows into your
trips
table from two different files in S3:trips_1.tsv.gz
andtrips_2.tsv.gz
:INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames', "
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
") SETTINGS input_format_try_infer_datetimes = 0 -
Wait for the
INSERT
to finish - it might take a moment for the 150 MB of data to be downloaded.NoteThe
s3
function cleverly knows how to decompress the data, and theTabSeparatedWithNames
format tells ClickHouse that the data is tab-separated and also to skip the header row of each file. -
When the insert is finished, verify it worked:
SELECT count() FROM trips
You should see about 2M rows (1,999,657 rows, to be precise).
NoteNotice how quickly and how few rows ClickHouse had to process to determine the count? You can get back the count in 0.001 seconds with only 6 rows processed.
-
If you run a query that needs to hit every row, you will notice considerably more rows need to be processed, but the run time is still blazing fast:
SELECT DISTINCT(pickup_ntaname) FROM trips
This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The
pickup_ntaname
column represents the name of the neighborhood in New York City where the taxi ride originated.
3. Analyze the Data
Let's run some queries to analyze the 2M rows of data...
-
We will start with some simple calculations, like computing the average tip amount:
SELECT round(avg(tip_amount), 2) FROM trips
The response is:
┌─round(avg(tip_amount), 2)─┐
│ 1.68 │
└───────────────────────────┘ -
This query computes the average cost based on the number of passengers:
SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_countThe
passenger_count
ranges from 0 to 9:┌─passenger_count─┬─average_total_amount─┐
│ 0 │ 22.69 │
│ 1 │ 15.97 │
│ 2 │ 17.15 │
│ 3 │ 16.76 │
│ 4 │ 17.33 │
│ 5 │ 16.35 │
│ 6 │ 16.04 │
│ 7 │ 59.8 │
│ 8 │ 36.41 │
│ 9 │ 9.81 │
└─────────────────┴──────────────────────┘ -
Here is a query that calculates the daily number of pickups per neighborhood:
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASCThe result looks like:
┌─pickup_date─┬─pickup_ntaname───────────────────────────────────────────┬─number_of_trips─┐
│ 2015-07-01 │ Brooklyn Heights-Cobble Hill │ 13 │
│ 2015-07-01 │ Old Astoria │ 5 │
│ 2015-07-01 │ Flushing │ 1 │
│ 2015-07-01 │ Yorkville │ 378 │
│ 2015-07-01 │ Gramercy │ 344 │
│ 2015-07-01 │ Fordham South │ 2 │
│ 2015-07-01 │ SoHo-TriBeCa-Civic Center-Little Italy │ 621 │
│ 2015-07-01 │ Park Slope-Gowanus │ 29 │
│ 2015-07-01 │ Bushwick South │ 5 │ -
This query computes the length of the trip and groups the results by that value:
SELECT
avg(tip_amount) AS avg_tip,
avg(fare_amount) AS avg_fare,
avg(passenger_count) AS avg_passenger,
count() AS count,
truncate(date_diff('second', pickup_datetime, dropoff_datetime)/60) as trip_minutes
FROM trips
WHERE trip_minutes > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESCThe result looks like:
┌──────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬──count─┬─trip_minutes─┐
│ 1.9600000381469727 │ 8 │ 1 │ 1 │ 27511 │
│ 0 │ 12 │ 2 │ 1 │ 27500 │
│ 0.542166673981895 │ 19.716666666666665 │ 1.9166666666666667 │ 60 │ 1439 │
│ 0.902499997522682 │ 11.270625001192093 │ 1.95625 │ 160 │ 1438 │