OSU-IDEA-Lab/Join-Game

How to run experiments?

To upload the datasets used in the join game experiments, run the data_uploader.bash script after updating the database_name, username, and port of your PostgreSQL database in it. You also need to update the path of the location that contains all the table data: the part of each path starting from the data directory is already present, and any path segment that precedes it needs to be updated.
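
For reference, data_uploader.bash exposes its connection settings as the placeholders "database_name", "username", and "port" (see the data-uploading section below). A hedged sketch of the edited values and the invocation; the values here are illustrative only, and whether the script takes them as shell variables or inline psql arguments depends on the script itself:

  database_name="joingame"   # your PostgreSQL database
  username="alice"           # your PostgreSQL user
  port="5432"                # the port your server listens on
  bash data_uploader.bash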

=====================================

Installation steps for PostgreSQL:

  1. Clone the repository.
    git clone https://github.com/OSU-IDEA-Lab/Join-Game.git

  2. Create a new directory called "executables" outside of the Join-Game directory. In the commands below, replace everything in quotation marks (including the quotation marks themselves) with the corresponding path.

  3. Configure, build, and install from inside the Join-Game directory.
    ./configure --prefix="/path/to/executables/directory" --enable-depend --enable-assert --enable-debug
    make
    make install

  4. Add the executables to your PATH and initialize a data directory.
    export PATH="/path/to/executables/directory"/bin:$PATH
    export PGDATA=DemoDir
    initdb

  5. Start the server. Replace portNumber with your port number of choice.
    "/path/to/executables/directory"/bin/pg_ctl -D "path/to/Join-Game"/DemoDir -o "-p portNumber" -l logfile start

  6. Connect to the template database and create your own. Replace databaseName with the name of your database.
    psql -p portNumber template1
    create database databaseName;
    \q

  7. To connect to your database:
    psql -p portNumber databaseName

  8. To stop the server later, and to check which process is listening on the port:
    "/path/to/executables/directory"/bin/pg_ctl -D "path/to/Join-Game"/DemoDir -o "-p portNumber" -l logfile stop
    lsof -i :portNumber
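
As a quick sanity check that the server is up and reachable on the chosen port (substitute your actual portNumber and databaseName), you can ask it for its version string:

  psql -p portNumber -d databaseName -c "SELECT version();"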

=====================================

To run similarity joins, follow the steps below:

  1. make world
  2. make install-world
  3. Restart the PostgreSQL server.
  4. Connect with psql.
  5. CREATE EXTENSION fuzzystrmatch;

Now you can run the similarity joins.
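
Before running the full joins, a quick check that the extension is working (levenshtein is the distance function used by all the similarity queries below):

  SELECT levenshtein('kitten', 'sitting');
  -- returns 3, the edit distance between the two strings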

=====================================

Uploading data to your PostgreSQL database:

Compiling Code

Linux users
Run make -f makefile_linux.original. You will see the dbgen executable and the dists.dss file.

These two are used for TPC-H data generation.

Preparing TPC-H dataset

The command ./dbgen -h shows the list of options.
  1. Inside the dbgen folder, run the command below.
    ./dbgen -s 10.0 -z 0
    This generates data with skew factor z = 0 at scale factor 10 (roughly 10 GB).
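
    After generation you should see one .tbl file per TPC-H table in the dbgen directory; listing them is a quick sanity check:
      ls *.tbl
      customer.tbl  lineitem.tbl  nation.tbl  orders.tbl  part.tbl  partsupp.tbl  region.tbl  supplier.tbl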

  2. Run the command below to see the first few rows of a table.
    head customer.tbl
    If you look closely, each tuple ends with a | symbol.
    For query loading and processing, the trailing | is not needed, so we remove it in the next step.

  3. Remove the trailing "|" from every tuple in all the tables.
    sed -i 's/|$//' *.tbl

After running the above command, the trailing | is removed from every tuple in all the tables.

You can also run the script remove.sh to process multiple tables at once; a minimal sketch of it follows.
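
remove.sh ships with the repository; a minimal equivalent, for reference, just loops the same sed command over every table file:

  #!/bin/bash
  # strip the trailing '|' from every tuple in every TPC-H table file
  for f in *.tbl; do
      sed -i 's/|$//' "$f"
  done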

  4. Shuffle each table; if you do not shuffle, the tuples will be loaded in sorted order.
    Run the script shuffle_tables.sh to shuffle multiple tables (a minimal sketch follows).
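
    shuffle_tables.sh is provided in the repository; a minimal sketch of the same idea uses shuf, which reads the whole file before writing, so shuffling in place is safe:

      #!/bin/bash
      # randomize the tuple order of every TPC-H table file
      for f in *.tbl; do
          shuf "$f" -o "$f"
      done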

Preparing Similarity Join dataset

  1. To upload the Cars, WDC, and Movies datasets, just run the data_uploader.bash file.

  2. You can download these datasets at the links below.

    1. Cars Dataset: parking_tickets table: http://www.kaggle.com/datasets/new-york-city/nyc-parking-tickets and Car_brands table: http://www.back4app.com/database/back4app/car-make-model-dataset
      Query used (run separately with thresholds 1, 2, and 3): SELECT Car_brands1.make, parking_tickets1.vehicle_make FROM Car_brands1 JOIN parking_tickets1 ON levenshtein(trim(Car_brands1.make::varchar(10)), trim(parking_tickets1.vehicle_make::varchar(10))) <= threshold;
    2. WDC Dataset: webdatacommons.org/largescaleproductcorpus/v2. You can divide the data into two separate tables, WDC1 and WDC2, to join them.
      Query used (run separately with thresholds 1, 2, and 3): SELECT wdc1Brands.brand, wdc2Brands.brand FROM wdc1Brands JOIN wdc2Brands ON levenshtein(trim(wdc1Brands.brand::varchar(10)), trim(wdc2Brands.brand::varchar(10))) <= threshold;
    3. Movies Dataset: IMDB table: https://developer.imdb.com/non-commercial-datasets/ and OMDB table: https://www.omdbapi.com/
      Query used (run separately with thresholds 9, 10, and 11): EXPLAIN ANALYSE SELECT imdb.title, omdbMovies.title FROM imdb JOIN omdbMovies ON levenshtein(trim(imdb.title::varchar(50)), trim(omdbMovies.title::varchar(50))) <= threshold;

  3. Open the data_uploader.bash file and replace "database_name", "username", and "port" with your database name, username, and port. You also need to open the Python files referenced in the bash file and replace the paths of the data files for all these tables with their correct locations.
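
For reference, a concrete runnable instance of the Cars query at a single threshold (assuming the Car_brands1 and parking_tickets1 tables are loaded and the fuzzystrmatch extension is installed):

  EXPLAIN ANALYSE
  SELECT Car_brands1.make, parking_tickets1.vehicle_make
  FROM Car_brands1
  JOIN parking_tickets1
    ON levenshtein(trim(Car_brands1.make::varchar(10)),
                   trim(parking_tickets1.vehicle_make::varchar(10))) <= 1;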
