Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  You are expected to use this module via the Bio.SeqIO functions, with the 
  73  following format names: 
  74   
  75   - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  76   - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  77      offset of 33 (e.g. from the NCBI Short Read Archive). These can hold PHRED 
  78      scores from 0 to 93. 
  79   - "fastq-sanger" is an alias for "fastq". 
  80   - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  81      files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  82      scores from -5 to 62. 
  83   - "fastq-illumina" means new Illumina 1.3+ style FASTQ files, using PHRED 
  84      scores but with an ASCII offset 64, allowing PHRED scores from 0 to 62. 
  85   
  86  We could potentially add support for "qual-solexa" meaning QUAL files which 
  87  contain Solexa scores, but thus far there isn't any reason to use such files. 
  88   
  89  For example, consider the following short FASTQ file:: 
  90   
  91      @EAS54_6_R1_2_1_413_324 
  92      CCCTTCTTGTCTTCAGCGTTTCTCC 
  93      + 
  94      ;;3;;;;;;;;;;;;7;;;;;;;88 
  95      @EAS54_6_R1_2_1_540_792 
  96      TTGGCAGGCCAAGGCCGATGGATCA 
  97      + 
  98      ;;;;;;;;;;;7;;;;;-;;;3;83 
  99      @EAS54_6_R1_2_1_443_348 
 100      GTTGCTTCTGGCGTGGGTGGGGGGG 
 101      + 
 102      ;;;;;;;;;;;9;7;;.7;393333 
 103   
 104  This contains three reads of length 25.  From the read length these were 
 105  probably originally from an early Solexa/Illumina sequencer but this file 
 106  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 107  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 108  "fastq" as the format name: 
 109   
 110      >>> from Bio import SeqIO 
 111      >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 112      ...     print record.id, record.seq 
 113      EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 114      EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 115      EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 116   
 117  The qualities are held as a list of integers in each record's annotation: 
 118   
 119      >>> print record 
 120      ID: EAS54_6_R1_2_1_443_348 
 121      Name: EAS54_6_R1_2_1_443_348 
 122      Description: EAS54_6_R1_2_1_443_348 
 123      Number of features: 0 
 124      Per letter annotation for: phred_quality 
 125      Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 126      >>> print record.letter_annotations["phred_quality"] 
 127      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 128   
 129  You can use the SeqRecord format method to show this in the QUAL format: 
 130   
 131      >>> print record.format("qual") 
 132      >EAS54_6_R1_2_1_443_348 
 133      26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 134      24 18 18 18 18 
 135      <BLANKLINE> 
 136   
 137  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 138   
 139      >>> print record.format("fastq") 
 140      @EAS54_6_R1_2_1_443_348 
 141      GTTGCTTCTGGCGTGGGTGGGGGGG 
 142      + 
 143      ;;;;;;;;;;;9;7;;.7;393333 
 144      <BLANKLINE> 
 145   
 146  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 147  of 64): 
 148   
 149      >>> print record.format("fastq-illumina") 
 150      @EAS54_6_R1_2_1_443_348 
 151      GTTGCTTCTGGCGTGGGTGGGGGGG 
 152      + 
 153      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 154      <BLANKLINE> 
 155   
 156  You can also get Biopython to convert the scores and show a Solexa style 
 157  FASTQ file: 
 158   
 159      >>> print record.format("fastq-solexa") 
 160      @EAS54_6_R1_2_1_443_348 
 161      GTTGCTTCTGGCGTGGGTGGGGGGG 
 162      + 
 163      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 164      <BLANKLINE> 
 165   
 166  Notice that this is actually the same output as above using "fastq-illumina" 
 167  as the format! The reason for this is all these scores are high enough that 
 168  the PHRED and Solexa scores are almost equal. The differences become apparent 
 169  for poor quality reads. See the functions solexa_quality_from_phred and 
 170  phred_quality_from_solexa for more details. 
 171   
 172  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 173  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 174   
 175      >>> sub_rec = record[5:15] 
 176      >>> print sub_rec 
 177      ID: EAS54_6_R1_2_1_443_348 
 178      Name: EAS54_6_R1_2_1_443_348 
 179      Description: EAS54_6_R1_2_1_443_348 
 180      Number of features: 0 
 181      Per letter annotation for: phred_quality 
 182      Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 183      >>> print sub_rec.letter_annotations["phred_quality"] 
 184      [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 185      >>> print sub_rec.format("fastq") 
 186      @EAS54_6_R1_2_1_443_348 
 187      TTCTGGCGTG 
 188      + 
 189      ;;;;;;9;7; 
 190      <BLANKLINE> 
 191       
 192  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 193   
 194      >>> from Bio import SeqIO 
 195      >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 196      >>> out_handle = open("Quality/temp.qual", "w") 
 197      >>> SeqIO.write(record_iterator, out_handle, "qual") 
 198      3 
 199      >>> out_handle.close() 
 200   
 201  You can of course read in a QUAL file, such as the one we just created: 
 202   
 203      >>> from Bio import SeqIO 
 204      >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 205      ...     print record.id, record.seq 
 206      EAS54_6_R1_2_1_413_324 ????????????????????????? 
 207      EAS54_6_R1_2_1_540_792 ????????????????????????? 
 208      EAS54_6_R1_2_1_443_348 ????????????????????????? 
 209   
 210  Notice that QUAL files don't have a proper sequence present!  But the quality 
 211  information is there: 
 212   
 213      >>> print record 
 214      ID: EAS54_6_R1_2_1_443_348 
 215      Name: EAS54_6_R1_2_1_443_348 
 216      Description: EAS54_6_R1_2_1_443_348 
 217      Number of features: 0 
 218      Per letter annotation for: phred_quality 
 219      UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 220      >>> print record.letter_annotations["phred_quality"] 
 221      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 222   
 223  Just to keep things tidy, if you are following this example yourself, you can 
 224  delete this temporary file now: 
 225   
 226      >>> import os 
 227      >>> os.remove("Quality/temp.qual") 
 228   
 229  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 230  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 231  would have to read the two in separately and then combine the data.  However, 
 232  since this is such a common thing to want to do, there is a helper iterator 
 233  defined in this module that does this for you - PairedFastaQualIterator. 
 234   
 235  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 236  then a simple dictionary approach would work: 
 237   
 238      >>> from Bio import SeqIO 
 239      >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta")) 
 240      >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"): 
 241      ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 242   
 243  You can then access any record by its key, and get both the sequence and the 
 244  quality scores. 
 245   
 246      >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq") 
 247      @EAS54_6_R1_2_1_540_792 
 248      TTGGCAGGCCAAGGCCGATGGATCA 
 249      + 
 250      ;;;;;;;;;;;7;;;;;-;;;3;83 
 251      <BLANKLINE> 
 252   
 253  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 254  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 255  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 256  for the more recent variant), as this cannot be detected reliably 
 257  automatically. 
 258   
 259  To illustrate this problem, let's consider an artifical example: 
 260   
 261      >>> from Bio.Seq import Seq 
 262      >>> from Bio.Alphabet import generic_dna 
 263      >>> from Bio.SeqRecord import SeqRecord 
 264      >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 265      ... description="Made up!") 
 266      >>> print test.format("fasta") 
 267      >Test Made up! 
 268      NACGTACGTA 
 269      <BLANKLINE> 
 270      >>> print test.format("fastq") 
 271      Traceback (most recent call last): 
 272       ... 
 273      ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 274   
 275  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 276  or FASTQ format we need to provide some quality scores. These are held as a 
 277  list of integers (one for each base) in the letter_annotations dictionary: 
 278   
 279      >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40] 
 280      >>> print test.format("qual") 
 281      >Test Made up! 
 282      0 1 2 3 4 5 10 20 30 40 
 283      <BLANKLINE> 
 284      >>> print test.format("fastq") 
 285      @Test Made up! 
 286      NACGTACGTA 
 287      + 
 288      !"#$%&+5?I 
 289      <BLANKLINE> 
 290   
 291  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 292  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 293  the letter "I": 
 294   
 295      >>> ord('!') - 33 
 296      0 
 297      >>> ord('I') - 33 
 298      40 
 299      >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 300      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 301   
 302  Similarly, we could produce an Illumina 1.3+ style FASTQ file using PHRED 
 303  scores with an offset of 64: 
 304   
 305      >>> print test.format("fastq-illumina") 
 306      @Test Made up! 
 307      NACGTACGTA 
 308      + 
 309      @ABCDEJT^h 
 310      <BLANKLINE> 
 311   
 312  And we can check this too - the first PHRED score was zero, and this mapped to 
 313  "@", while the final score was 40 and this mapped to "h": 
 314   
 315      >>> ord("@") - 64 
 316      0 
 317      >>> ord("h") - 64 
 318      40 
 319      >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 320      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 321   
 322  Notice how different the standard Sanger FASTQ and the Illumina 1.3+ style 
 323  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 324  format to consider which encodes Solexa scores instead of PHRED scores. 
 325   
 326  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 327  scores (rounding to one decimal place): 
 328   
 329      >>> for q in [0,1,2,3,4,5,10,20,30,40]: 
 330      ...     print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)) 
 331      PHRED 0 maps to Solexa -5.0 
 332      PHRED 1 maps to Solexa -5.0 
 333      PHRED 2 maps to Solexa -2.3 
 334      PHRED 3 maps to Solexa -0.0 
 335      PHRED 4 maps to Solexa 1.8 
 336      PHRED 5 maps to Solexa 3.3 
 337      PHRED 10 maps to Solexa 9.5 
 338      PHRED 20 maps to Solexa 20.0 
 339      PHRED 30 maps to Solexa 30.0 
 340      PHRED 40 maps to Solexa 40.0 
 341   
 342  Now here is the record using the old Solexa style FASTQ file: 
 343   
 344      >>> print test.format("fastq-solexa") 
 345      @Test Made up! 
 346      NACGTACGTA 
 347      + 
 348      ;;>@BCJT^h 
 349      <BLANKLINE> 
 350   
 351  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 352   
 353      >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 354      [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 355   
 356  This explains why the last few letters of this FASTQ output matched that using 
 357  the Illumina 1.3+ format - high quality PHRED scores and Solexa scores are 
 358  approximately equal. 
 359   
 360  """ 
 361  __docformat__ = "epytext en" #Don't just use plain text in epydoc API pages! 
 362   
 363  from Bio.Alphabet import single_letter_alphabet 
 364  from Bio.Seq import Seq, UnknownSeq 
 365  from Bio.SeqRecord import SeqRecord 
 366  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 367  from math import log 
 368  import warnings 
 369   
 370  # define score offsets. See discussion for differences between Sanger and 
 371  # Solexa offsets. 
 372  SANGER_SCORE_OFFSET = 33 
 373  SOLEXA_SCORE_OFFSET = 64 
 374   
375 -def solexa_quality_from_phred(phred_quality):
376 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 377 378 PHRED and Solexa quality scores are both log transformations of a 379 probality of error (high score = low probability of error). This function 380 takes a PHRED score, transforms it back to a probability of error, and 381 then re-expresses it as a Solexa score. This assumes the error estimates 382 are equivalent. 383 384 How does this work exactly? Well the PHRED quality is minus ten times the 385 base ten logarithm of the probability of error:: 386 387 phred_quality = -10*log(error,10) 388 389 Therefore, turning this round:: 390 391 error = 10 ** (- phred_quality / 10) 392 393 Now, Solexa qualities use a different log transformation:: 394 395 solexa_quality = -10*log(error/(1-error),10) 396 397 After substitution and a little manipulation we get:: 398 399 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 400 401 However, real Solexa files use a minimum quality of -5. This does have a 402 good reason - a random a random base call would be correct 25% of the time, 403 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 404 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 405 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 406 407 Taken literally, this logarithic formula would map a PHRED quality of zero 408 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 409 score of zero means a probability of error of one (i.e. the base call is 410 definitely wrong), which is worse than random! In practice, a PHRED quality 411 of zero usually means a default value, or perhaps random - and therefore 412 mapping it to the minimum Solexa score of -5 is reasonable. 413 414 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 415 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 416 quality of zero to -5.0 as well. 417 418 Note this function will return a floating point number, it is up to you to 419 round this to the nearest integer if appropriate. e.g. 420 421 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2) 422 80.00 423 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2) 424 50.00 425 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2) 426 19.96 427 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2) 428 9.54 429 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2) 430 3.35 431 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2) 432 1.80 433 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2) 434 -0.02 435 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2) 436 -2.33 437 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2) 438 -5.00 439 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2) 440 -5.00 441 442 Notice that for high quality reads PHRED and Solexa scores are numerically 443 equal. The differences are important for poor quality reads, where PHRED 444 has a minimum of zero but Solexa scores can be negative. 445 446 Finally, as a special case where None is used for a "missing value", None 447 is returned: 448 449 >>> print solexa_quality_from_phred(None) 450 None 451 """ 452 if phred_quality is None: 453 #Assume None is used as some kind of NULL or NA value; return None 454 #e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 455 return None 456 elif phred_quality > 0: 457 #Solexa uses a minimum value of -5, which after rounding matches a 458 #random nucleotide base call. 459 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10)) 460 elif phred_quality == 0: 461 #Special case, map to -5 as discussed in the docstring 462 return -5.0 463 else: 464 raise ValueError("PHRED qualities must be positive (or zero), not %s" \ 465 % repr(phred_quality))
466
467 -def phred_quality_from_solexa(solexa_quality):
468 """Convert a Solexa quality (which can be negative) to a PHRED quality. 469 470 PHRED and Solexa quality scores are both log transformations of a 471 probality of error (high score = low probability of error). This function 472 takes a Solexa score, transforms it back to a probability of error, and 473 then re-expresses it as a PHRED score. This assumes the error estimates 474 are equivalent. 475 476 The underlying formulas are given in the documentation for the sister 477 function solexa_quality_from_phred, in this case the operation is:: 478 479 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 480 481 This will return a floating point number, it is up to you to round this to 482 the nearest integer if appropriate. e.g. 483 484 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2) 485 80.00 486 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2) 487 20.04 488 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2) 489 10.41 490 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2) 491 3.01 492 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2) 493 1.19 494 495 Note that a solexa_quality less then -5 is not expected, will trigger a 496 warning, but will still be converted as per the logarithmic mapping 497 (giving a number between 0 and 1.19 back). 498 499 As a special case where None is used for a "missing value", None is 500 returned: 501 502 >>> print phred_quality_from_solexa(None) 503 None 504 """ 505 if solexa_quality is None: 506 #Assume None is used as some kind of NULL or NA value; return None 507 return None 508 if solexa_quality < -5: 509 import warnings 510 warnings.warn("Solexa quality less than -5 passed, %s" \ 511 % repr(solexa_quality)) 512 return 10*log(10**(solexa_quality/10.0) + 1, 10)
513
514 -def _get_phred_quality(record):
515 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 516 517 If there are no PHRED qualities, but there are Solexa qualities, those are 518 used instead after conversion. 519 """ 520 try: 521 return record.letter_annotations["phred_quality"] 522 except KeyError: 523 pass 524 try: 525 return [phred_quality_from_solexa(q) for \ 526 q in record.letter_annotations["solexa_quality"]] 527 except KeyError: 528 raise ValueError("No suitable quality scores found in " 529 "letter_annotations of SeqRecord (id=%s)." \ 530 % record.id)
531 532 #Only map 0 to 93, we need to give a warning on truncating at 93 533 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \ 534 for qp in range(0, 93+1)) 535 #Only map -5 to 93, we need to give a warning on truncating at 93 536 _solexa_to_sanger_quality_str = dict( \ 537 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \ 538 for qs in range(-5, 93+1))
539 -def _get_sanger_quality_str(record):
540 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 541 542 >>> from Bio.Seq import Seq 543 >>> from Bio.SeqRecord import SeqRecord 544 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 545 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]}) 546 >>> _get_sanger_quality_str(r) 547 'SI?5+!' 548 549 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 550 the PHRED qualities are integers, this function is able to use a very fast 551 pre-cached mapping. However, if they are floats which differ slightly, then 552 it has to do the appropriate rounding - which is slower: 553 554 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 555 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]}) 556 >>> _get_sanger_quality_str(r2) 557 'SI?5+!' 558 559 If your scores include a None value, this raises an exception: 560 561 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 562 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]}) 563 >>> _get_sanger_quality_str(r3) 564 Traceback (most recent call last): 565 ... 566 TypeError: A quality value of None was found 567 568 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 569 scores are used in preference: 570 571 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 572 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0], 573 ... "solexa_quality":[-5,-4,0,None,0,40]}) 574 >>> _get_sanger_quality_str(r4) 575 'SI?5+!' 576 577 If there are no PHRED scores, but there are Solexa scores, these are used 578 instead (after the approriate conversion): 579 580 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 581 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]}) 582 >>> _get_sanger_quality_str(r5) 583 'I?5+$"' 584 585 Again, integer Solexa scores can be looked up in a pre-cached mapping making 586 this very fast. You can still use approximate floating point scores: 587 588 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 589 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]}) 590 >>> _get_sanger_quality_str(r6) 591 'I?5+$"' 592 593 Notice that due to the limited range of printable ASCII characters, a 594 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 595 file (using ASCII 126, the tilde). This function will issue a warning 596 in this situation. 597 """ 598 #TODO - This functions works and is fast, but it is also ugly 599 #and there is considerable repetition of code for the other 600 #two FASTQ variants. 601 try: 602 #These take priority (in case both Solexa and PHRED scores found) 603 qualities = record.letter_annotations["phred_quality"] 604 except KeyError: 605 #Fall back on solexa scores... 606 pass 607 else: 608 #Try and use the precomputed mapping: 609 try: 610 return "".join([_phred_to_sanger_quality_str[qp] \ 611 for qp in qualities]) 612 except KeyError: 613 #Could be a float, or a None in the list, or a high value. 614 pass 615 if None in qualities: 616 raise TypeError("A quality value of None was found") 617 if max(qualities) >= 93.5: 618 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ") 619 #This will apply the truncation at 93, giving max ASCII 126 620 return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \ 621 for qp in qualities]) 622 #Fall back on the Solexa scores... 623 try: 624 qualities = record.letter_annotations["solexa_quality"] 625 except KeyError: 626 raise ValueError("No suitable quality scores found in " 627 "letter_annotations of SeqRecord (id=%s)." \ 628 % record.id) 629 #Try and use the precomputed mapping: 630 try: 631 return "".join([_solexa_to_sanger_quality_str[qs] \ 632 for qs in qualities]) 633 except KeyError: 634 #Either no PHRED scores, or something odd like a float or None 635 pass 636 if None in qualities: 637 raise TypeError("A quality value of None was found") 638 #Must do this the slow way, first converting the PHRED scores into 639 #Solexa scores: 640 if max(qualities) >= 93.5: 641 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ") 642 #This will apply the truncation at 93, giving max ASCII 126 643 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \ 644 for qs in qualities])
645 646 #Only map 0 to 62, we need to give a warning on truncating at 62 647 assert 62+SOLEXA_SCORE_OFFSET == 126 648 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \ 649 for qp in range(0, 62+1)) 650 #Only map -5 to 62, we need to give a warning on truncating at 62 651 _solexa_to_illumina_quality_str = dict( \ 652 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 653 for qs in range(-5, 62+1))
654 -def _get_illumina_quality_str(record):
655 """Returns an Illumina 1.3+ FASTQ encoded quality string (PRIVATE). 656 657 Notice that due to the limited range of printable ASCII characters, a 658 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 659 file (using ASCII 126, the tilde). This function will issue a warning 660 in this situation. 661 """ 662 #TODO - This functions works and is fast, but it is also ugly 663 #and there is considerable repetition of code for the other 664 #two FASTQ variants. 665 try: 666 #These take priority (in case both Solexa and PHRED scores found) 667 qualities = record.letter_annotations["phred_quality"] 668 except KeyError: 669 #Fall back on solexa scores... 670 pass 671 else: 672 #Try and use the precomputed mapping: 673 try: 674 return "".join([_phred_to_illumina_quality_str[qp] \ 675 for qp in qualities]) 676 except KeyError: 677 #Could be a float, or a None in the list, or a high value. 678 pass 679 if None in qualities: 680 raise TypeError("A quality value of None was found") 681 if max(qualities) >= 62.5: 682 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ") 683 #This will apply the truncation at 62, giving max ASCII 126 684 return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \ 685 for qp in qualities]) 686 #Fall back on the Solexa scores... 687 try: 688 qualities = record.letter_annotations["solexa_quality"] 689 except KeyError: 690 raise ValueError("No suitable quality scores found in " 691 "letter_annotations of SeqRecord (id=%s)." \ 692 % record.id) 693 #Try and use the precomputed mapping: 694 try: 695 return "".join([_solexa_to_illumina_quality_str[qs] \ 696 for qs in qualities]) 697 except KeyError: 698 #Either no PHRED scores, or something odd like a float or None 699 pass 700 if None in qualities: 701 raise TypeError("A quality value of None was found") 702 #Must do this the slow way, first converting the PHRED scores into 703 #Solexa scores: 704 if max(qualities) >= 62.5: 705 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ") 706 #This will apply the truncation at 62, giving max ASCII 126 707 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 708 for qs in qualities])
709 710 #Only map 0 to 62, we need to give a warning on truncating at 62 711 assert 62+SOLEXA_SCORE_OFFSET == 126 712 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \ 713 for qs in range(-5, 62+1)) 714 #Only map -5 to 62, we need to give a warning on truncating at 62 715 _phred_to_solexa_quality_str = dict(\ 716 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \ 717 for qp in range(0, 62+1))
718 -def _get_solexa_quality_str(record):
719 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 720 721 Notice that due to the limited range of printable ASCII characters, a 722 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 723 file (using ASCII 126, the tilde). This function will issue a warning 724 in this situation. 725 """ 726 #TODO - This functions works and is fast, but it is also ugly 727 #and there is considerable repetition of code for the other 728 #two FASTQ variants. 729 try: 730 #These take priority (in case both Solexa and PHRED scores found) 731 qualities = record.letter_annotations["solexa_quality"] 732 except KeyError: 733 #Fall back on PHRED scores... 734 pass 735 else: 736 #Try and use the precomputed mapping: 737 try: 738 return "".join([_solexa_to_solexa_quality_str[qs] \ 739 for qs in qualities]) 740 except KeyError: 741 #Could be a float, or a None in the list, or a high value. 742 pass 743 if None in qualities: 744 raise TypeError("A quality value of None was found") 745 if max(qualities) >= 62.5: 746 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ") 747 #This will apply the truncation at 62, giving max ASCII 126 748 return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \ 749 for qs in qualities]) 750 #Fall back on the PHRED scores... 751 try: 752 qualities = record.letter_annotations["phred_quality"] 753 except KeyError: 754 raise ValueError("No suitable quality scores found in " 755 "letter_annotations of SeqRecord (id=%s)." \ 756 % record.id) 757 #Try and use the precomputed mapping: 758 try: 759 return "".join([_phred_to_solexa_quality_str[qp] \ 760 for qp in qualities]) 761 except KeyError: 762 #Either no PHRED scores, or something odd like a float or None 763 #or too big to be in the cache 764 pass 765 if None in qualities: 766 raise TypeError("A quality value of None was found") 767 #Must do this the slow way, first converting the PHRED scores into 768 #Solexa scores: 769 if max(qualities) >= 62.5: 770 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ") 771 return "".join([chr(min(126, 772 int(round(solexa_quality_from_phred(qp))) + \ 773 SOLEXA_SCORE_OFFSET)) \ 774 for qp in qualities])
775 776 #TODO - Default to nucleotide or even DNA?
777 -def FastqGeneralIterator(handle):
778 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 779 780 This code does not try to interpret the quality string numerically. It 781 just returns tuples of the title, sequence and quality as strings. For 782 the sequence and quality, any whitespace (such as new lines) is removed. 783 784 Our SeqRecord based FASTQ iterators call this function internally, and then 785 turn the strings into a SeqRecord objects, mapping the quality string into 786 a list of numerical scores. If you want to do a custom quality mapping, 787 then you might consider calling this function directly. 788 789 For parsing FASTQ files, the title string from the "@" line at the start 790 of each record can optionally be omitted on the "+" lines. If it is 791 repeated, it must be identical. 792 793 The sequence string and the quality string can optionally be split over 794 multiple lines, although several sources discourage this. In comparison, 795 for the FASTA file format line breaks between 60 and 80 characters are 796 the norm. 797 798 WARNING - Because the "@" character can appear in the quality string, 799 this can cause problems as this is also the marker for the start of 800 a new sequence. In fact, the "+" sign can also appear as well. Some 801 sources recommended having no line breaks in the quality to avoid this, 802 but even that is not enough, consider this example:: 803 804 @071113_EAS56_0053:1:1:998:236 805 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 806 +071113_EAS56_0053:1:1:998:236 807 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 808 @071113_EAS56_0053:1:1:182:712 809 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 810 + 811 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 812 @071113_EAS56_0053:1:1:153:10 813 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 814 + 815 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 816 @071113_EAS56_0053:1:3:990:501 817 TGGGAGGTTTTATGTGGA 818 AAGCAGCAATGTACAAGA 819 + 820 IIIIIII.IIIIII1@44 821 @-7.%<&+/$/%4(++(% 822 823 This is four PHRED encoded FASTQ entries originally from an NCBI source 824 (given the read length of 36, these are probably Solexa Illumna reads where 825 the quality has been mapped onto the PHRED values). 826 827 This example has been edited to illustrate some of the nasty things allowed 828 in the FASTQ format. Firstly, on the "+" lines most but not all of the 829 (redundant) identifiers are ommited. In real files it is likely that all or 830 none of these extra identifiers will be present. 831 832 Secondly, while the first three sequences have been shown without line 833 breaks, the last has been split over multiple lines. In real files any line 834 breaks are likely to be consistent. 835 836 Thirdly, some of the quality string lines start with an "@" character. For 837 the second record this is unavoidable. However for the fourth sequence this 838 only happens because its quality string is split over two lines. A naive 839 parser could wrongly treat any line starting with an "@" as the beginning of 840 a new sequence! This code copes with this possible ambiguity by keeping 841 track of the length of the sequence which gives the expected length of the 842 quality string. 843 844 Using this tricky example file as input, this short bit of code demonstrates 845 what this parsing function would return: 846 847 >>> handle = open("Quality/tricky.fastq", "rU") 848 >>> for (title, sequence, quality) in FastqGeneralIterator(handle): 849 ... print title 850 ... print sequence, quality 851 071113_EAS56_0053:1:1:998:236 852 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 853 071113_EAS56_0053:1:1:182:712 854 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 855 071113_EAS56_0053:1:1:153:10 856 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 857 071113_EAS56_0053:1:3:990:501 858 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 859 >>> handle.close() 860 861 Finally we note that some sources state that the quality string should 862 start with "!" (which using the PHRED mapping means the first letter always 863 has a quality score of zero). This rather restrictive rule is not widely 864 observed, so is therefore ignored here. One plus point about this "!" rule 865 is that (provided there are no line breaks in the quality sequence) it 866 would prevent the above problem with the "@" character. 867 """ 868 #We need to call handle.readline() at least four times per record, 869 #so we'll save a property look up each time: 870 handle_readline = handle.readline 871 872 #Skip any text before the first record (e.g. blank lines, comments?) 873 while True: 874 line = handle_readline() 875 if line == "" : return #Premature end of file, or just empty? 876 if line[0] == "@": 877 break 878 879 while True: 880 if line[0] != "@": 881 raise ValueError("Records in Fastq files should start with '@' character") 882 title_line = line[1:].rstrip() 883 #Will now be at least one line of quality data - in most FASTQ files 884 #just one line! We therefore use string concatenation (if needed) 885 #rather using than the "".join(...) trick just in case it is multiline: 886 seq_string = handle_readline().rstrip() 887 #There may now be more sequence lines, or the "+" quality marker line: 888 while True: 889 line = handle_readline() 890 if not line: 891 raise ValueError("End of file without quality information.") 892 if line[0] == "+": 893 #The title here is optional, but if present must match! 894 second_title = line[1:].rstrip() 895 if second_title and second_title != title_line: 896 raise ValueError("Sequence and quality captions differ.") 897 break 898 seq_string += line.rstrip() #removes trailing newlines 899 #This is going to slow things down a little, but assuming 900 #this isn't allowed we should try and catch it here: 901 if " " in seq_string or "\t" in seq_string: 902 raise ValueError("Whitespace is not allowed in the sequence.") 903 seq_len = len(seq_string) 904 905 #Will now be at least one line of quality data... 906 quality_string = handle_readline().rstrip() 907 #There may now be more quality data, or another sequence, or EOF 908 while True: 909 line = handle_readline() 910 if not line : break #end of file 911 if line[0] == "@": 912 #This COULD be the start of a new sequence. However, it MAY just 913 #be a line of quality data which starts with a "@" character. We 914 #should be able to check this by looking at the sequence length 915 #and the amount of quality data found so far. 916 if len(quality_string) >= seq_len: 917 #We expect it to be equal if this is the start of a new record. 918 #If the quality data is longer, we'll raise an error below. 919 break 920 #Continue - its just some (more) quality data. 921 quality_string += line.rstrip() 922 923 if seq_len != len(quality_string): 924 raise ValueError("Lengths of sequence and quality values differs " 925 " for %s (%i and %i)." \ 926 % (title_line, seq_len, len(quality_string))) 927 928 #Return the record and then continue... 929 yield (title_line, seq_string, quality_string) 930 if not line : return #StopIteration at end of file 931 assert False, "Should not reach this line"
932 933 #This is a generator function!
934 -def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
935 """Generator function to iterate over FASTQ records (as SeqRecord objects). 936 937 - handle - input file 938 - alphabet - optional alphabet 939 - title2ids - A function that, when given the title line from the FASTQ 940 file (without the beginning >), will return the id, name and 941 description (in that order) for the record as a tuple of 942 strings. If this is not given, then the entire title line 943 will be used as the description, and the first word as the 944 id and name. 945 946 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 947 948 For each sequence in a (Sanger style) FASTQ file there is a matching string 949 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 950 values with an offset of 33. 951 952 For example, consider a file containing three short reads:: 953 954 @EAS54_6_R1_2_1_413_324 955 CCCTTCTTGTCTTCAGCGTTTCTCC 956 + 957 ;;3;;;;;;;;;;;;7;;;;;;;88 958 @EAS54_6_R1_2_1_540_792 959 TTGGCAGGCCAAGGCCGATGGATCA 960 + 961 ;;;;;;;;;;;7;;;;;-;;;3;83 962 @EAS54_6_R1_2_1_443_348 963 GTTGCTTCTGGCGTGGGTGGGGGGG 964 + 965 ;;;;;;;;;;;9;7;;.7;393333 966 967 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 968 string encoding the PHRED qualities using a ASCI values with an offset of 969 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 970 971 Using this module directly you might run: 972 973 >>> handle = open("Quality/example.fastq", "rU") 974 >>> for record in FastqPhredIterator(handle): 975 ... print record.id, record.seq 976 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 977 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 978 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 979 >>> handle.close() 980 981 Typically however, you would call this via Bio.SeqIO instead with "fastq" 982 (or "fastq-sanger") as the format: 983 984 >>> from Bio import SeqIO 985 >>> handle = open("Quality/example.fastq", "rU") 986 >>> for record in SeqIO.parse(handle, "fastq"): 987 ... print record.id, record.seq 988 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 989 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 990 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 991 >>> handle.close() 992 993 If you want to look at the qualities, they are record in each record's 994 per-letter-annotation dictionary as a simple list of integers: 995 996 >>> print record.letter_annotations["phred_quality"] 997 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 998 """ 999 assert SANGER_SCORE_OFFSET == ord("!") 1000 #Originally, I used a list expression for each record: 1001 # 1002 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1003 # 1004 #Precomputing is faster, perhaps partly by avoiding the subtractions. 1005 q_mapping = dict() 1006 for letter in range(0, 255): 1007 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET 1008 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1009 if title2ids: 1010 id, name, descr = title2ids(title_line) 1011 else: 1012 descr = title_line 1013 id = descr.split()[0] 1014 name = id 1015 record = SeqRecord(Seq(seq_string, alphabet), 1016 id=id, name=name, description=descr) 1017 qualities = [q_mapping[letter] for letter in quality_string] 1018 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1019 raise ValueError("Invalid character in quality string") 1020 #For speed, will now use a dirty trick to speed up assigning the 1021 #qualities. We do this to bypass the length check imposed by the 1022 #per-letter-annotations restricted dict (as this has already been 1023 #checked by FastqGeneralIterator). This is equivalent to: 1024 #record.letter_annotations["phred_quality"] = qualities 1025 dict.__setitem__(record._per_letter_annotations, 1026 "phred_quality", qualities) 1027 yield record
1028 1029 #This is a generator function!
1030 -def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1031 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1032 1033 The optional arguments are the same as those for the FastqPhredIterator. 1034 1035 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1036 encoding the Solexa integer qualities using ASCII values with an offset 1037 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1038 will NOT perform any automatic conversion when loading. 1039 1040 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1041 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1042 1043 For example, consider a file containing these five records:: 1044 1045 @SLXA-B3_649_FC8437_R1_1_1_610_79 1046 GATGTGCAATACCTTTGTAGAGGAA 1047 +SLXA-B3_649_FC8437_R1_1_1_610_79 1048 YYYYYYYYYYYYYYYYYYWYWYYSU 1049 @SLXA-B3_649_FC8437_R1_1_1_397_389 1050 GGTTTGAGAAAGAGAAATGAGATAA 1051 +SLXA-B3_649_FC8437_R1_1_1_397_389 1052 YYYYYYYYYWYYYYWWYYYWYWYWW 1053 @SLXA-B3_649_FC8437_R1_1_1_850_123 1054 GAGGGTGTTGATCATGATGATGGCG 1055 +SLXA-B3_649_FC8437_R1_1_1_850_123 1056 YYYYYYYYYYYYYWYYWYYSYYYSY 1057 @SLXA-B3_649_FC8437_R1_1_1_362_549 1058 GGAAACAAAGTTTTTCTCAACATAG 1059 +SLXA-B3_649_FC8437_R1_1_1_362_549 1060 YYYYYYYYYYYYYYYYYYWWWWYWY 1061 @SLXA-B3_649_FC8437_R1_1_1_183_714 1062 GTATTATTTAATGGCATACACTCAA 1063 +SLXA-B3_649_FC8437_R1_1_1_183_714 1064 YYYYYYYYYYWYYYYWYWWUWWWQQ 1065 1066 Using this module directly you might run: 1067 1068 >>> handle = open("Quality/solexa_example.fastq", "rU") 1069 >>> for record in FastqSolexaIterator(handle): 1070 ... print record.id, record.seq 1071 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1072 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1073 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1074 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1075 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1076 >>> handle.close() 1077 1078 Typically however, you would call this via Bio.SeqIO instead with 1079 "fastq-solexa" as the format: 1080 1081 >>> from Bio import SeqIO 1082 >>> handle = open("Quality/solexa_example.fastq", "rU") 1083 >>> for record in SeqIO.parse(handle, "fastq-solexa"): 1084 ... print record.id, record.seq 1085 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1086 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1087 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1088 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1089 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1090 >>> handle.close() 1091 1092 If you want to look at the qualities, they are recorded in each record's 1093 per-letter-annotation dictionary as a simple list of integers: 1094 1095 >>> print record.letter_annotations["solexa_quality"] 1096 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1097 1098 These scores aren't very good, but they are high enough that they map 1099 almost exactly onto PHRED scores: 1100 1101 >>> print "%0.2f" % phred_quality_from_solexa(25) 1102 25.01 1103 1104 Let's look at faked example read which is even worse, where there are 1105 more noticeable differences between the Solexa and PHRED scores:: 1106 1107 @slxa_0001_1_0001_01 1108 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1109 +slxa_0001_1_0001_01 1110 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1111 1112 Again, you would typically use Bio.SeqIO to read this file in (rather than 1113 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1114 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1115 as shown above. This example has only as one entry, so instead we can 1116 use the Bio.SeqIO.read() function: 1117 1118 >>> from Bio import SeqIO 1119 >>> handle = open("Quality/solexa_faked.fastq", "rU") 1120 >>> record = SeqIO.read(handle, "fastq-solexa") 1121 >>> handle.close() 1122 >>> print record.id, record.seq 1123 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1124 >>> print record.letter_annotations["solexa_quality"] 1125 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1126 1127 These quality scores are so low that when converted from the Solexa scheme 1128 into PHRED scores they look quite different: 1129 1130 >>> print "%0.2f" % phred_quality_from_solexa(-1) 1131 2.54 1132 >>> print "%0.2f" % phred_quality_from_solexa(-5) 1133 1.19 1134 1135 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1136 method to output the record(s): 1137 1138 >>> print record.format("fastq-solexa") 1139 @slxa_0001_1_0001_01 1140 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1141 + 1142 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1143 <BLANKLINE> 1144 1145 Note this output is slightly different from the input file as Biopython 1146 has left out the optional repetition of the sequence identifier on the "+" 1147 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1148 output format instead, and Biopython will do the conversion for you: 1149 1150 >>> print record.format("fastq") 1151 @slxa_0001_1_0001_01 1152 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1153 + 1154 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1155 <BLANKLINE> 1156 1157 >>> print record.format("qual") 1158 >slxa_0001_1_0001_01 1159 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1160 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1161 1 1 1162 <BLANKLINE> 1163 1164 As shown above, the poor quality Solexa reads have been mapped to the 1165 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1166 """ 1167 q_mapping = dict() 1168 for letter in range(0, 255): 1169 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1170 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1171 if title2ids: 1172 id, name, descr = title_line 1173 else: 1174 descr = title_line 1175 id = descr.split()[0] 1176 name = id 1177 record = SeqRecord(Seq(seq_string, alphabet), 1178 id=id, name=name, description=descr) 1179 qualities = [q_mapping[letter] for letter in quality_string] 1180 #DO NOT convert these into PHRED qualities automatically! 1181 if qualities and (min(qualities) < -5 or max(qualities)>62): 1182 raise ValueError("Invalid character in quality string") 1183 #Dirty trick to speed up this line: 1184 #record.letter_annotations["solexa_quality"] = qualities 1185 dict.__setitem__(record._per_letter_annotations, 1186 "solexa_quality", qualities) 1187 yield record
1188 1189 #This is a generator function!
1190 -def FastqIlluminaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1191 """Parse new Illumina 1.3+ FASTQ like files (which differ in the quality mapping). 1192 1193 The optional arguments are the same as those for the FastqPhredIterator. 1194 1195 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1196 encoding PHRED integer qualities using ASCII values with an offset of 64. 1197 1198 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1199 with an ASCII offset of 64. They are approximately equal but only for high 1200 qaulity reads. If you have an old Solexa/Illumina file with negative 1201 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1202 1203 >>> from Bio import SeqIO 1204 >>> record = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-solexa") 1205 >>> print record.id, record.seq 1206 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1207 >>> print record.letter_annotations["solexa_quality"] 1208 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1209 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina") 1210 Traceback (most recent call last): 1211 ... 1212 ValueError: Invalid character in quality string 1213 1214 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1215 """ 1216 q_mapping = dict() 1217 for letter in range(0, 255): 1218 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1219 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1220 if title2ids: 1221 id, name, descr = title2ids(title_line) 1222 else: 1223 descr = title_line 1224 id = descr.split()[0] 1225 name = id 1226 record = SeqRecord(Seq(seq_string, alphabet), 1227 id=id, name=name, description=descr) 1228 qualities = [q_mapping[letter] for letter in quality_string] 1229 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1230 raise ValueError("Invalid character in quality string") 1231 #Dirty trick to speed up this line: 1232 #record.letter_annotations["phred_quality"] = qualities 1233 dict.__setitem__(record._per_letter_annotations, 1234 "phred_quality", qualities) 1235 yield record
1236
1237 -def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1238 """For QUAL files which include PHRED quality scores, but no sequence. 1239 1240 For example, consider this short QUAL file:: 1241 1242 >EAS54_6_R1_2_1_413_324 1243 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1244 26 26 26 23 23 1245 >EAS54_6_R1_2_1_540_792 1246 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1247 26 18 26 23 18 1248 >EAS54_6_R1_2_1_443_348 1249 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1250 24 18 18 18 18 1251 1252 Using this module directly you might run: 1253 1254 >>> handle = open("Quality/example.qual", "rU") 1255 >>> for record in QualPhredIterator(handle): 1256 ... print record.id, record.seq 1257 EAS54_6_R1_2_1_413_324 ????????????????????????? 1258 EAS54_6_R1_2_1_540_792 ????????????????????????? 1259 EAS54_6_R1_2_1_443_348 ????????????????????????? 1260 >>> handle.close() 1261 1262 Typically however, you would call this via Bio.SeqIO instead with "qual" 1263 as the format: 1264 1265 >>> from Bio import SeqIO 1266 >>> handle = open("Quality/example.qual", "rU") 1267 >>> for record in SeqIO.parse(handle, "qual"): 1268 ... print record.id, record.seq 1269 EAS54_6_R1_2_1_413_324 ????????????????????????? 1270 EAS54_6_R1_2_1_540_792 ????????????????????????? 1271 EAS54_6_R1_2_1_443_348 ????????????????????????? 1272 >>> handle.close() 1273 1274 Becase QUAL files don't contain the sequence string itself, the seq 1275 property is set to an UnknownSeq object. As no alphabet was given, this 1276 has defaulted to a generic single letter alphabet and the character "?" 1277 used. 1278 1279 By specifying a nucleotide alphabet, "N" is used instead: 1280 1281 >>> from Bio import SeqIO 1282 >>> from Bio.Alphabet import generic_dna 1283 >>> handle = open("Quality/example.qual", "rU") 1284 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1285 ... print record.id, record.seq 1286 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1287 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1288 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1289 >>> handle.close() 1290 1291 However, the quality scores themselves are available as a list of integers 1292 in each record's per-letter-annotation: 1293 1294 >>> print record.letter_annotations["phred_quality"] 1295 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1296 1297 You can still slice one of these SeqRecord objects with an UnknownSeq: 1298 1299 >>> sub_record = record[5:10] 1300 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"] 1301 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1302 """ 1303 #Skip any text before the first record (e.g. blank lines, comments) 1304 while True: 1305 line = handle.readline() 1306 if line == "" : return #Premature end of file, or just empty? 1307 if line[0] == ">": 1308 break 1309 1310 while True: 1311 if line[0] != ">": 1312 raise ValueError("Records in Fasta files should start with '>' character") 1313 if title2ids: 1314 id, name, descr = title2ids(line[1:].rstrip()) 1315 else: 1316 descr = line[1:].rstrip() 1317 id = descr.split()[0] 1318 name = id 1319 1320 qualities = [] 1321 line = handle.readline() 1322 while True: 1323 if not line : break 1324 if line[0] == ">": break 1325 qualities.extend([int(word) for word in line.split()]) 1326 line = handle.readline() 1327 1328 if qualities and min(qualities) < 0: 1329 raise ValueError(("Negative quality score %i found in %s. " + \ 1330 "Are these Solexa scores, not PHRED scores?") \ 1331 % (min(qualities), id)) 1332 1333 #Return the record and then continue... 1334 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1335 id = id, name = name, description = descr) 1336 #Dirty trick to speed up this line: 1337 #record.letter_annotations["phred_quality"] = qualities 1338 dict.__setitem__(record._per_letter_annotations, 1339 "phred_quality", qualities) 1340 yield record 1341 1342 if not line : return #StopIteration 1343 assert False, "Should not reach this line"
1344
1345 -class FastqPhredWriter(SequentialSequenceWriter):
1346 """Class to write standard FASTQ format files (using PHRED quality scores). 1347 1348 Although you can use this class directly, you are strongly encouraged 1349 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1350 or the alias "fastq-sanger". For example, this code reads in a standard 1351 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1352 Sanger style FASTQ file: 1353 1354 >>> from Bio import SeqIO 1355 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1356 >>> out_handle = open("Quality/temp.fastq", "w") 1357 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1358 3 1359 >>> out_handle.close() 1360 1361 You might want to do this if the original file included extra line breaks, 1362 which while valid may not be supported by all tools. The output file from 1363 Biopython will have each sequence on a single line, and each quality 1364 string on a single line (which is considered desirable for maximum 1365 compatibility). 1366 1367 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1368 quality scores) is converted into a standard Sanger style FASTQ file using 1369 PHRED qualities: 1370 1371 >>> from Bio import SeqIO 1372 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1373 >>> out_handle = open("Quality/temp.fastq", "w") 1374 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1375 5 1376 >>> out_handle.close() 1377 1378 This code is also called if you use the .format("fastq") method of a 1379 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1380 1381 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1382 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1383 warning is issued. 1384 1385 P.S. To avoid cluttering up your working directory, you can delete this 1386 temporary file now: 1387 1388 >>> import os 1389 >>> os.remove("Quality/temp.fastq") 1390 """ 1391 assert SANGER_SCORE_OFFSET == ord("!") 1392
1393 - def write_record(self, record):
1394 """Write a single FASTQ record to the file.""" 1395 assert self._header_written 1396 assert not self._footer_written 1397 self._record_written = True 1398 #TODO - Is an empty sequence allowed in FASTQ format? 1399 if record.seq is None: 1400 raise ValueError("No sequence for record %s" % record.id) 1401 seq_str = str(record.seq) 1402 qualities_str = _get_sanger_quality_str(record) 1403 if len(qualities_str) != len(seq_str): 1404 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1405 % (record.id, len(seq_str), len(qualities_str))) 1406 1407 #FASTQ files can include a description, just like FASTA files 1408 #(at least, this is what the NCBI Short Read Archive does) 1409 id = self.clean(record.id) 1410 description = self.clean(record.description) 1411 if description and description.split(None, 1)[0]==id: 1412 #The description includes the id at the start 1413 title = description 1414 elif description: 1415 title = "%s %s" % (id, description) 1416 else: 1417 title = id 1418 1419 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1420
1421 -class QualPhredWriter(SequentialSequenceWriter):
1422 """Class to write QUAL format files (using PHRED quality scores). 1423 1424 Although you can use this class directly, you are strongly encouraged 1425 to use the Bio.SeqIO.write() function instead. For example, this code 1426 reads in a FASTQ file and saves the quality scores into a QUAL file: 1427 1428 >>> from Bio import SeqIO 1429 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1430 >>> out_handle = open("Quality/temp.qual", "w") 1431 >>> SeqIO.write(record_iterator, out_handle, "qual") 1432 3 1433 >>> out_handle.close() 1434 1435 This code is also called if you use the .format("qual") method of a 1436 SeqRecord. 1437 1438 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1439 1440 >>> import os 1441 >>> os.remove("Quality/temp.qual") 1442 """
1443 - def __init__(self, handle, wrap=60, record2title=None):
1444 """Create a QUAL writer. 1445 1446 Arguments: 1447 - handle - Handle to an output file, e.g. as returned 1448 by open(filename, "w") 1449 - wrap - Optional line length used to wrap sequence lines. 1450 Defaults to wrapping the sequence at 60 characters 1451 Use zero (or None) for no wrapping, giving a single 1452 long line for the sequence. 1453 - record2title - Optional function to return the text to be 1454 used for the title line of each record. By default 1455 a combination of the record.id and record.description 1456 is used. If the record.description starts with the 1457 record.id, then just the record.description is used. 1458 1459 The record2title argument is present for consistency with the 1460 Bio.SeqIO.FastaIO writer class. 1461 """ 1462 SequentialSequenceWriter.__init__(self, handle) 1463 #self.handle = handle 1464 self.wrap = None 1465 if wrap: 1466 if wrap < 1: 1467 raise ValueError 1468 self.wrap = wrap 1469 self.record2title = record2title
1470
1471 - def write_record(self, record):
1472 """Write a single QUAL record to the file.""" 1473 assert self._header_written 1474 assert not self._footer_written 1475 self._record_written = True 1476 1477 if self.record2title: 1478 title = self.clean(self.record2title(record)) 1479 else: 1480 id = self.clean(record.id) 1481 description = self.clean(record.description) 1482 if description and description.split(None, 1)[0]==id: 1483 #The description includes the id at the start 1484 title = description 1485 elif description: 1486 title = "%s %s" % (id, description) 1487 else: 1488 title = id 1489 self.handle.write(">%s\n" % title) 1490 1491 qualities = _get_phred_quality(record) 1492 try: 1493 #This rounds to the nearest integer. 1494 #TODO - can we record a float in a qual file? 1495 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1496 except TypeError, e: 1497 if None in qualities: 1498 raise TypeError("A quality value of None was found") 1499 else: 1500 raise e 1501 1502 if self.wrap: 1503 while qualities_strs: 1504 line = qualities_strs.pop(0) 1505 while qualities_strs \ 1506 and len(line) + 1 + len(qualities_strs[0]) < self.wrap: 1507 line += " " + qualities_strs.pop(0) 1508 self.handle.write(line + "\n") 1509 else: 1510 data = " ".join(qualities_strs) 1511 self.handle.write(data + "\n")
1512
1513 -class FastqSolexaWriter(SequentialSequenceWriter):
1514 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1515 1516 This outputs FASTQ files like those from the early Solexa/Illumina 1517 pipeline, using Solexa scores and an ASCII offset of 64. These are 1518 NOT compatible with the standard Sanger style PHRED FASTQ files. 1519 1520 If your records contain a "solexa_quality" entry under letter_annotations, 1521 this is used, otherwise any "phred_quality" entry will be used after 1522 conversion using the solexa_quality_from_phred function. If neither style 1523 of quality scores are present, an exception is raised. 1524 1525 Although you can use this class directly, you are strongly encouraged 1526 to use the Bio.SeqIO.write() function instead. For example, this code 1527 reads in a FASTQ file and re-saves it as another FASTQ file: 1528 1529 >>> from Bio import SeqIO 1530 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1531 >>> out_handle = open("Quality/temp.fastq", "w") 1532 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1533 5 1534 >>> out_handle.close() 1535 1536 You might want to do this if the original file included extra line breaks, 1537 which (while valid) may not be supported by all tools. The output file 1538 from Biopython will have each sequence on a single line, and each quality 1539 string on a single line (which is considered desirable for maximum 1540 compatibility). 1541 1542 This code is also called if you use the .format("fastq-solexa") method of 1543 a SeqRecord. For example, 1544 1545 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1546 >>> print record.format("fastq-solexa") 1547 @Test PHRED qualities from 40 to 0 inclusive 1548 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1549 + 1550 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1551 <BLANKLINE> 1552 1553 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1554 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1555 a warning is issued. 1556 1557 P.S. Don't forget to delete the temp file if you don't need it anymore: 1558 1559 >>> import os 1560 >>> os.remove("Quality/temp.fastq") 1561 """
1562 - def write_record(self, record):
1563 """Write a single FASTQ record to the file.""" 1564 assert self._header_written 1565 assert not self._footer_written 1566 self._record_written = True 1567 1568 #TODO - Is an empty sequence allowed in FASTQ format? 1569 if record.seq is None: 1570 raise ValueError("No sequence for record %s" % record.id) 1571 seq_str = str(record.seq) 1572 qualities_str = _get_solexa_quality_str(record) 1573 if len(qualities_str) != len(seq_str): 1574 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1575 % (record.id, len(seq_str), len(qualities_str))) 1576 1577 #FASTQ files can include a description, just like FASTA files 1578 #(at least, this is what the NCBI Short Read Archive does) 1579 id = self.clean(record.id) 1580 description = self.clean(record.description) 1581 if description and description.split(None, 1)[0]==id: 1582 #The description includes the id at the start 1583 title = description 1584 elif description: 1585 title = "%s %s" % (id, description) 1586 else: 1587 title = id 1588 1589 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1590
1591 -class FastqIlluminaWriter(SequentialSequenceWriter):
1592 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1593 1594 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1595 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1596 compatible with the standard Sanger style PHRED FASTQ files which use an 1597 ASCII offset of 32. 1598 1599 Although you can use this class directly, you are strongly encouraged to 1600 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1601 instead. This code is also called if you use the .format("fastq-illumina") 1602 method of a SeqRecord. For example, 1603 1604 >>> from Bio import SeqIO 1605 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1606 >>> print record.format("fastq-illumina") 1607 @Test PHRED qualities from 40 to 0 inclusive 1608 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1609 + 1610 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1611 <BLANKLINE> 1612 1613 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1614 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1615 warning is issued. 1616 """
1617 - def write_record(self, record):
1618 """Write a single FASTQ record to the file.""" 1619 assert self._header_written 1620 assert not self._footer_written 1621 self._record_written = True 1622 1623 #TODO - Is an empty sequence allowed in FASTQ format? 1624 if record.seq is None: 1625 raise ValueError("No sequence for record %s" % record.id) 1626 seq_str = str(record.seq) 1627 qualities_str = _get_illumina_quality_str(record) 1628 if len(qualities_str) != len(seq_str): 1629 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1630 % (record.id, len(seq_str), len(qualities_str))) 1631 1632 #FASTQ files can include a description, just like FASTA files 1633 #(at least, this is what the NCBI Short Read Archive does) 1634 id = self.clean(record.id) 1635 description = self.clean(record.description) 1636 if description and description.split(None, 1)[0]==id: 1637 #The description includes the id at the start 1638 title = description 1639 elif description: 1640 title = "%s %s" % (id, description) 1641 else: 1642 title = id 1643 1644 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1645
1646 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None):
1647 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1648 1649 For example, consider this short QUAL file with PHRED quality scores:: 1650 1651 >EAS54_6_R1_2_1_413_324 1652 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1653 26 26 26 23 23 1654 >EAS54_6_R1_2_1_540_792 1655 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1656 26 18 26 23 18 1657 >EAS54_6_R1_2_1_443_348 1658 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1659 24 18 18 18 18 1660 1661 And a matching FASTA file:: 1662 1663 >EAS54_6_R1_2_1_413_324 1664 CCCTTCTTGTCTTCAGCGTTTCTCC 1665 >EAS54_6_R1_2_1_540_792 1666 TTGGCAGGCCAAGGCCGATGGATCA 1667 >EAS54_6_R1_2_1_443_348 1668 GTTGCTTCTGGCGTGGGTGGGGGGG 1669 1670 You can parse these separately using Bio.SeqIO with the "qual" and 1671 "fasta" formats, but then you'll get a group of SeqRecord objects with 1672 no sequence, and a matching group with the sequence but not the 1673 qualities. Because it only deals with one input file handle, Bio.SeqIO 1674 can't be used to read the two files together - but this function can! 1675 For example, 1676 1677 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1678 ... open("Quality/example.qual", "rU")) 1679 >>> for record in rec_iter: 1680 ... print record.id, record.seq 1681 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1682 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1683 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1684 1685 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1686 they are in each record's per-letter-annotation dictionary as a simple 1687 list of integers: 1688 1689 >>> print record.letter_annotations["phred_quality"] 1690 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1691 1692 If you have access to data as a FASTQ format file, using that directly 1693 would be simpler and more straight forward. Note that you can easily use 1694 this function to convert paired FASTA and QUAL files into FASTQ files: 1695 1696 >>> from Bio import SeqIO 1697 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1698 ... open("Quality/example.qual", "rU")) 1699 >>> out_handle = open("Quality/temp.fastq", "w") 1700 >>> SeqIO.write(rec_iter, out_handle, "fastq") 1701 3 1702 >>> out_handle.close() 1703 1704 And don't forget to clean up the temp file if you don't need it anymore: 1705 1706 >>> import os 1707 >>> os.remove("Quality/temp.fastq") 1708 """ 1709 from Bio.SeqIO.FastaIO import FastaIterator 1710 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \ 1711 title2ids=title2ids) 1712 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \ 1713 title2ids=title2ids) 1714 1715 #Using zip(...) would create a list loading everything into memory! 1716 #It would also not catch any extra records found in only one file. 1717 while True: 1718 try: 1719 f_rec = fasta_iter.next() 1720 except StopIteration: 1721 f_rec = None 1722 try: 1723 q_rec = qual_iter.next() 1724 except StopIteration: 1725 q_rec = None 1726 if f_rec is None and q_rec is None: 1727 #End of both files 1728 break 1729 if f_rec is None: 1730 raise ValueError("FASTA file has more entries than the QUAL file.") 1731 if q_rec is None: 1732 raise ValueError("QUAL file has more entries than the FASTA file.") 1733 if f_rec.id != q_rec.id: 1734 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \ 1735 % (f_rec.id, q_rec.id)) 1736 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1737 raise ValueError("Sequence length and number of quality scores disagree for %s" \ 1738 % f_rec.id) 1739 #Merge the data.... 1740 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"] 1741 yield f_rec
1742 #Done 1743 1744
1745 -def _test():
1746 """Run the Bio.SeqIO module's doctests. 1747 1748 This will try and locate the unit tests directory, and run the doctests 1749 from there in order that the relative paths used in the examples work. 1750 """ 1751 import doctest 1752 import os 1753 if os.path.isdir(os.path.join("..", "..", "Tests")): 1754 print "Runing doctests..." 1755 cur_dir = os.path.abspath(os.curdir) 1756 os.chdir(os.path.join("..", "..", "Tests")) 1757 assert os.path.isfile("Quality/example.fastq") 1758 assert os.path.isfile("Quality/example.fasta") 1759 assert os.path.isfile("Quality/example.qual") 1760 assert os.path.isfile("Quality/tricky.fastq") 1761 assert os.path.isfile("Quality/solexa_faked.fastq") 1762 doctest.testmod(verbose=0) 1763 os.chdir(cur_dir) 1764 del cur_dir 1765 print "Done"
1766 1767 if __name__ == "__main__": 1768 _test() 1769