If you are sequencing more than one organism, you may want to distinguish similar read names from different projects, because reads from different projects are occasionally added to the same assembly project by accident. Sequence data from different organisms will typically assemble into distinct contigs even when entered into the same assembly, but if this happens you want to avoid being misled by spurious paired-end relationships. For example, imagine you are sequencing both fox and rabbit genomes, and are using the following naming system:
Forward Names | Reverse Names |
Fox genome | |
0123fwdfox.abi | 0123revfox.abi |
0124fwdfox.abi | 0124revfox.abi |
Rabbit genome | |
0123fwdrabbit.abi | 0123revrabbit.abi |
0124fwdrabbit.abi | 0124revrabbit.abi |
0124fwdrabbit.scf | 0124revrabbit.scf |
Here, sequence names comprise a number of digits, followed by “fwd” for forward or “rev” for reverse, followed by some letters defining the project (organism), followed by a file extension. Therefore, two reads form a pair when the parts of their names before and after the “fwd” or “rev” match.
To make life difficult, let’s assume both projects include both “.abi” and “.scf” files, and that only reads with the same extension can safely be considered pairs. One valid pair of expressions for these names is:
Forward Name | Reverse Name |
(\d+)fwd(\D+\.)(\D{3}) | (\d+)rev(\D+\.)(\D{3}) |
This specification defines pairs as reads whose names match one or more digits preceding “fwd” or “rev,” AND match the next one or more non-digits followed by a period, AND match exactly three non-digits following the period. In other words, the text captured by each of the three phrases (strings in parentheses) must be identical in both names before two reads can qualify as a pair. Note that in this case “fox” or “rabbit” could have been used instead of “\D+” in the middle phrase. The advantage of using “\D+” is that the same expression is valid for both projects. It is also valid for any future projects using the same convention, where the only change is the letters between “fwd” or “rev” and the period.
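The pairing rule can be tried outside SeqMan with a short script. The following is a minimal sketch that uses Python's `re` module as a stand-in for SeqMan's expression matching. It assumes that the whole read name must match the expression and that two reads pair only when every captured phrase is identical. The function name `find_pairs` and the sample list are illustrative only and are not part of SeqMan.

```python
import re

# The forward and reverse expressions from the table above.
FWD = re.compile(r"(\d+)fwd(\D+\.)(\D{3})")
REV = re.compile(r"(\d+)rev(\D+\.)(\D{3})")

def find_pairs(names):
    """Return (forward, reverse) name pairs whose captured phrases all agree.

    Assumption: the entire name must match the expression (fullmatch).
    """
    # Index forward reads by their captured phrases: (digits, project., ext).
    forward = {}
    for name in names:
        m = FWD.fullmatch(name)
        if m:
            forward[m.groups()] = name

    # A reverse read pairs with the forward read that has identical phrases.
    pairs = []
    for name in names:
        m = REV.fullmatch(name)
        if m and m.groups() in forward:
            pairs.append((forward[m.groups()], name))
    return pairs

names = [
    "0123fwdfox.abi", "0123revfox.abi",
    "0123fwdrabbit.abi", "0123revrabbit.abi",
    "0124fwdrabbit.abi", "0124revrabbit.scf",  # extensions differ: not a pair
]
pairs = find_pairs(names)
for fwd_name, rev_name in pairs:
    print(fwd_name, "<->", rev_name)
```

In this sketch the fox and rabbit reads numbered 0123 are never paired with each other, and the last two names stay unpaired because their extensions differ.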
You may notice that the expression: (\d+)fwd(\D+\.)(\D{3}) has a pair of parentheses it does not need, and can be simplified to: (\d+)fwd(\D+\.\D{3})
Either expression can be used in SeqMan; which one you choose is a matter of personal preference.
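To see why the three-phrase and two-phrase forms behave the same, compare the text each one captures. The sketch below again uses Python's `re` module as a stand-in for SeqMan and assumes whole-name matching. The helper `pairing_key` is hypothetical: it joins the captured phrases into one string so the two forms can be compared directly.

```python
import re

THREE_PHRASES = r"(\d+)fwd(\D+\.)(\D{3})"
TWO_PHRASES = r"(\d+)fwd(\D+\.\D{3})"

def pairing_key(pattern, name):
    """Concatenate the captured phrases, or return None if the name does not match."""
    m = re.fullmatch(pattern, name)
    return "".join(m.groups()) if m else None

names = ["0123fwdfox.abi", "0124fwdrabbit.scf", "0123revfox.abi"]
for name in names:
    # Both forms capture the same characters; only the grouping differs.
    print(name, pairing_key(THREE_PHRASES, name), pairing_key(TWO_PHRASES, name))
```

Both forms capture the digits, the project letters with the period, and the extension. The simplified form only merges the last two phrases into one, so the reads that pair are the same.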