=====================================
git clone https://github.com/OSU-IDEA-Lab/Join-Game.git
Create a new directory called "executables" outside of the Join-Game directory. In the commands below, replace each quoted placeholder path, along with its quotation marks, with the actual path on your machine.
./configure --prefix="/path/to/executables/directory" --enable-depend --enable-assert --enable-debug
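For example, with a hypothetical executables directory at /home/alice/executables, the configure line becomes:
./configure --prefix=/home/alice/executables --enable-depend --enable-assert --enable-debug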
make
make install
export PATH="/path/to/executables/directory"/bin:$PATH
export PGDATA=DemoDir
initdb
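Since PGDATA is set to DemoDir, initdb will create the cluster there; equivalently, you can pass the data directory explicitly:
initdb -D DemoDir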
In the command below, replace portNumber with a port number of your choice.
"/path/to/executables/directory"/bin/pg_ctl -D "/path/to/Join-Game"/DemoDir -o "-p portNumber" -l logfile start
psql -p portNumber template1
Replace databaseName with the name of your database.
create database databaseName;
\q
"/path/to/executables/directory"/bin/pg_ctl -D "path/to/Join-Game"/DemoDir -o "-p 1997" -l logfile stop
lsof -i :portNumber
psql -p portNumber databaseName
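As a worked example, assuming a hypothetical port 5433 and database name tpch, the create-and-connect steps look like this:
psql -p 5433 template1
create database tpch;
\q
psql -p 5433 tpch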
=====================================
To run similarity joins, follow the steps below:
- make world
- make install-world
- restart the postgres server
- connect to psql
- CREATE EXTENSION fuzzystrmatch;
Now, you can run the similarity joins.
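As a quick sanity check that fuzzystrmatch is installed, run a standalone levenshtein call; it should return 3:
SELECT levenshtein('kitten', 'sitting');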
=====================================
Linux users: run make -f makefile_linux.original. You will see the dbgen and dists.dss files.
These two are used for TPC-H data generation.
- Inside the dbgen folder, run the command below. It generates data with skew z = 0 at scale factor 10 (about 10 GB).
  ./dbgen -s 10.0 -z 0
- Run the command below to see the first few rows of a table.
  head customer.tbl
  If you observe carefully, each tuple has a | symbol at the end. For query loading and processing, the trailing | is not required, so we remove it in the next step.
- Remove the "|" at the end of the tuple in all the tables:
  sed -i 's/|$//' *.tbl
  After running the above command, the | is removed for all the tuples in all the tables. You can also run the script remove.sh for multiple tables.
- Shuffle each table. If you do not shuffle, the tuples will be in sorted order. Run the script shuffle_tables.sh for multiple tables; a minimal sketch of this step follows this list.
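As an illustration, a minimal sketch of the shuffle-and-load steps might look like the following, assuming GNU shuf is available and a customer table already exists in databaseName; the actual remove.sh and shuffle_tables.sh scripts may differ:
shuf customer.tbl -o customer.tbl   # in-place shuffle; shuf reads all input before writing
psql -p portNumber databaseName -c "\copy customer FROM 'customer.tbl' WITH (FORMAT csv, DELIMITER '|')"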
- To upload the Cars, WDC and Movies datasets, just run the data_uploader.bash file (see the configuration note at the end of this section).
- You can download these datasets at the links below.
- Cars Dataset: parking_tickets table: http://www.kaggle.com/datasets/new-york-city/nyc-parking-tickets and Car_brands table: http://www.back4app.com/database/back4app/car-make-model-dataset
  Query used (with threshold set to 1, 2, and 3 in separate runs):
  SELECT Car_brands1.make, parking_tickets1.vehicle_make FROM Car_brands1 JOIN parking_tickets1 ON levenshtein(trim(Car_brands1.make::varchar(10)), trim(parking_tickets1.vehicle_make::varchar(10))) <= threshold;
- WDC Dataset: webdatacommons.org/largescaleproductcorpus/v2 You can divide the data into two separate tables, WDC1 and WDC2, to join them.
  Query used (threshold = 1, 2, and 3):
  SELECT wdc1Brands.brand, wdc2Brands.brand FROM wdc1Brands JOIN wdc2Brands ON levenshtein(trim(wdc1Brands.brand::varchar(10)), trim(wdc2Brands.brand::varchar(10))) <= threshold;
- Movies Dataset: IMDB table: https://developer.imdb.com/non-commercial-datasets/ and OMDB table: https://www.omdbapi.com/
  Query used (threshold = 9, 10, and 11):
  EXPLAIN ANALYSE SELECT imdb.title, omdbMovies.title FROM imdb JOIN omdbMovies ON levenshtein(trim(imdb.title::varchar(50)), trim(omdbMovies.title::varchar(50))) <= threshold;
- Open the data_uploader.bash file and replace "database_name", "username" and "port" with your database name, username and port. Also open the python files referenced in the bash file and replace the paths of the data files for all these tables with their correct paths on your machine.
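For reference, a typical invocation inside such a bash script looks like the sketch below; the table name and CSV path here are assumptions for illustration, not the script's actual contents:
# hypothetical example; adjust to match data_uploader.bash
psql -d "database_name" -U "username" -p "port" -c "\copy parking_tickets FROM '/path/to/parking_tickets.csv' WITH (FORMAT csv, HEADER)"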