pplacer error #330

Closed
michellehauer opened this issue Jun 25, 2021 · 1 comment

michellehauer commented Jun 25, 2021

I am running GTDB-Tk on a computer cluster and encountering an error with pplacer.

I already saw issue #170. I am using 1 CPU and, after allocating 100 GB and then 204 GB of memory, I tried using bigmem. It still didn't work.

Per a suggestion in issue #170, I ran
pplacer -m WAG -j 1 -c /gpfs/data/rbeinart/Databases/gtdbtk-1.4.0/db/pplacer/gtdb_r95_bac120.refpkg -o /tmp/pplacer.bac120.json ./align/gtdbtk.bac120.user_msa.fasta
which gave:

Running pplacer v1.1.alpha19-0-g807f6f3 analysis on ./align/gtdbtk.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
query bin.3 is not the same length as the reference alignment (got 5037; expected 5040)

Output log from running gtdbtk command:

[2021-06-25 19:13:44] INFO: Completed 4 genomes in 4.65 minutes (1.16 minutes/genome).
[2021-06-25 19:13:44] TASK: Identifying TIGRFAM protein families.
[2021-06-25 19:16:02] INFO: Completed 4 genomes in 2.30 minutes (1.74 genomes/minute).
[2021-06-25 19:16:02] TASK: Identifying Pfam protein families.
[2021-06-25 19:16:29] INFO: Completed 4 genomes in 26.79 seconds (6.70 seconds/genome).
[2021-06-25 19:16:29] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2021-06-25 19:16:29] TASK: Summarising identified marker genes.
[2021-06-25 19:16:33] INFO: Completed 4 genomes in 3.96 seconds (1.01 genomes/second).
[2021-06-25 19:16:33] INFO: Done.
[2021-06-25 19:16:33] INFO: Aligning markers in 4 genomes with 1 CPUs.
[2021-06-25 19:16:33] INFO: Processing 4 genomes identified as bacterial.
[2021-06-25 19:16:38] INFO: Read concatenated alignment for 45,555 GTDB genomes.
[2021-06-25 19:16:38] TASK: Generating concatenated alignment for each marker.
[2021-06-25 19:16:40] INFO: Completed 4 genomes in 2.26 seconds (1.77 genomes/second).
[2021-06-25 19:16:40] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2021-06-25 19:16:44] INFO: Completed 120 markers in 3.39 seconds (35.43 markers/second).
[2021-06-25 19:16:44] DEBUG: Successfully written all markers to: ./align/intermediate_results/markers
[2021-06-25 19:16:44] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2021-06-25 19:17:38] INFO: Completed 45,559 sequences in 54.14 seconds (841.47 sequences/second).
[2021-06-25 19:17:38] INFO: Masked bacterial alignment from 41,084 to 5,037 AAs.
[2021-06-25 19:17:38] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-06-25 19:17:38] INFO: Creating concatenated alignment for 45,559 bacterial GTDB and user genomes.
[2021-06-25 19:17:38] INFO: Creating concatenated alignment for 4 bacterial user genomes.
[2021-06-25 19:17:38] INFO: Done.
[2021-06-25 19:17:39] TASK: Placing 4 bacterial genomes into reference tree with pplacer using 1 CPUs (be patient).
[2021-06-25 19:17:39] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2021-06-25 19:17:39] DEBUG: pplacer -m wag -j 1 -c /gpfs/data/rbeinart/cbreusing/miniconda3/envs/gtdbtk/share/gtdbtk-1.5.0/db/pplacer/gtdb_r202_bac120.refpkg -o ./classify/intermediate_results/pplacer/pplacer.bac120.json ./align/gtdbtk.bac120.user_msa.fasta
[2021-06-25 19:39:00] ERROR: Controlled exit resulting from an unrecoverable error or warning.

================================================================================
EXCEPTION: PplacerException
  MESSAGE: An error was encountered while running pplacer.
________________________________________________________________________________

Traceback (most recent call last):
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/__main__.py", line 95, in main
    gt_parser.parse_options(args)
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/main.py", line 718, in parse_options
    self.classify(options)
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/main.py", line 440, in classify
    classify.run(genomes,
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/classify.py", line 444, in run
    classify_tree = self.place_genomes(user_msa_file,
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/classify.py", line 240, in place_genomes
    pplacer.run(self.pplacer_cpus, 'wag', pplacer_ref_pkg, pplacer_json_out,
  File "/users/mhauer1/miniconda3/envs/gtdbtk/lib/python3.8/site-packages/gtdbtk/external/pplacer.py", line 92, in run
    raise PplacerException(
gtdbtk.exceptions.PplacerException: An error was encountered while running pplacer.
================================================================================

Output file in classify/intermediate_results/pplacer/pplacer.bac120.out:

Running pplacer v1.1.alpha19-0-g807f6f3 analysis on ./align/gtdbtk.bac120.user_msa.fasta...
Didn't find any reference sequences in given alignment file. Using supplied reference alignment.
Pre-masking sequences... sequence length cut from 5037 to 5002.
Determining figs... figs disabled.
Allocating memory for internal nodes... done.
Caching likelihood information on reference tree... 

Any suggestions?

@pchaumeil
Collaborator

Hello,
It seems there is a conflict between Release 95 and Release 202 of the GTDB-Tk databases.
Release 95 trims the alignment to 5040 AA and Release 202 trims it to 5037 AA. Looking at the log, the alignment step was done with R202 ("Masked bacterial alignment from 41,084 to 5,037 AAs."), but pplacer is still using Release 95 to place the genomes, which is why it expects 5040 AA in the MSA.
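
If it helps, the column count of the user MSA can be checked independently of pplacer. Below is a minimal sketch in plain Python, assuming an uncompressed FASTA file; the helper names `fasta_lengths` and `mismatched` are made up for illustration and are not part of GTDB-Tk or pplacer.

```python
def fasta_lengths(lines):
    """Return {sequence_id: aligned_length} for FASTA text given as lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Sequence data may be wrapped over several lines; gaps count as columns.
            lengths[current] += len(line)
    return lengths


def mismatched(lengths, expected):
    """Return only the sequences whose length differs from the expected column count."""
    return {name: n for name, n in lengths.items() if n != expected}


# Toy input (hypothetical sequences, not real GTDB-Tk output).
example = [">bin.3", "MKV--A", "LL", ">bin.4", "MKV--ALLQ"]
lengths = fasta_lengths(example)
print(lengths)                  # {'bin.3': 8, 'bin.4': 9}
print(mismatched(lengths, 8))   # {'bin.4': 9}
```

On the real data you would pass an open file handle for ./align/gtdbtk.bac120.user_msa.fasta as `lines` and use 5037 (R202) or 5040 (R95) as `expected`, depending on which reference package pplacer is pointed at.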

I would recommend downloading a fresh copy of the GTDB-Tk Release 202 data and placing it in a newly created folder.
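
To see which release a given pplacer command points at, the release tag can be read from the refpkg file name. This is a rough sketch that assumes the `gtdb_r<release>_<marker set>.refpkg` naming seen in the two commands in this thread; the function name `refpkg_release` is hypothetical.

```python
import re

# Assumed naming pattern, taken from the paths quoted in this issue:
# gtdb_r95_bac120.refpkg, gtdb_r202_bac120.refpkg
RELEASE_RE = re.compile(r"gtdb_r(\d+)_([a-z]+\d+)\.refpkg")


def refpkg_release(path):
    """Return the GTDB release number embedded in a refpkg path, or None."""
    match = RELEASE_RE.search(path)
    return int(match.group(1)) if match else None


# Paths copied from the manual pplacer run and from the GTDB-Tk debug log above.
manual = "/gpfs/data/rbeinart/Databases/gtdbtk-1.4.0/db/pplacer/gtdb_r95_bac120.refpkg"
logged = ("/gpfs/data/rbeinart/cbreusing/miniconda3/envs/gtdbtk/share/"
          "gtdbtk-1.5.0/db/pplacer/gtdb_r202_bac120.refpkg")

print(refpkg_release(manual))  # 95
print(refpkg_release(logged))  # 202
print(refpkg_release(manual) == refpkg_release(logged))  # False
```

A mismatch like the one printed here means the MSA and the reference package may come from different releases, which is consistent with the 5037 versus 5040 message.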
