Convention 2

If you are sequencing more than one organism, you may want to distinguish similar read names from different projects. Occasionally, sequence reads from different projects may accidentally be added to the same assembly project. Sequence data from different organisms will typically assemble into distinct contigs even when entered into the same assembly, but if this happens, you want to avoid being misled by spurious paired-end relationships. For example, imagine you are sequencing both fox and rabbit genomes and are using the following naming system:

 

Project          Forward Names        Reverse Names
Fox genome       0123fwdfox.abi       0123revfox.abi
                 0124fwdfox.abi       0124revfox.abi
Rabbit genome    0123fwdrabbit.abi    0123revrabbit.abi
                 0124fwdrabbit.abi    0124revrabbit.abi
                 0124fwdrabbit.scf    0124revrabbit.scf

 

Here, each sequence name consists of a run of digits, followed by “fwd” for forward or “rev” for reverse, followed by letters identifying the project (organism), followed by a file extension. Two reads therefore form a pair when the parts of their names before and after “fwd” or “rev” match.

 

To make life difficult, let’s assume both projects include both “.abi” and “.scf” files, and that only reads with the same extension can safely be considered pairs. One valid pair of expressions for these names is:

 

Forward Name               Reverse Name
(\d+)fwd(\D+\.)(\D{3})     (\d+)rev(\D+\.)(\D{3})

 

This specification defines pairs as reads whose names match one or more digits preceding “fwd” or “rev”, AND match the following one or more non-digits ending in a period, AND match exactly three non-digits after the period. In other words, all three phrases (the strings in parentheses) must match before two reads can qualify as a pair. Note that in this case “fox” or “rabbit” could have been used instead of “\D+” in the middle phrase. The advantage of using “\D+” is that the same expression is valid for both projects. It is also valid for any future project that uses the same convention and changes only the letters between “fwd” or “rev” and the period.
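The pairing rule described above can be tried outside SeqMan. The following is a minimal sketch in Python, assuming that Python's `re` syntax treats `\d`, `\D`, and `{3}` the same way for these names; the function name `find_pairs` and the sample list are hypothetical, and the use of `fullmatch` (the whole name must match) is an assumption rather than a statement of how SeqMan applies the expressions.

```python
import re

# The two expressions from the table, written as Python raw strings.
FORWARD = re.compile(r"(\d+)fwd(\D+\.)(\D{3})")
REVERSE = re.compile(r"(\d+)rev(\D+\.)(\D{3})")


def find_pairs(names):
    """Return sorted (forward, reverse) name pairs whose captured phrases all match.

    Hypothetical helper for illustration; not part of SeqMan.
    """
    forward = {}  # captured phrases -> forward read name
    reverse = {}  # captured phrases -> reverse read name
    for name in names:
        match = FORWARD.fullmatch(name)
        if match:
            forward[match.groups()] = name
            continue
        match = REVERSE.fullmatch(name)
        if match:
            reverse[match.groups()] = name
    # Two reads pair only when every parenthesized phrase is identical.
    return sorted((forward[key], reverse[key]) for key in forward if key in reverse)


# Sample names following the fox/rabbit convention (illustrative only).
reads = [
    "0123fwdfox.abi", "0123revfox.abi",
    "0124fwdrabbit.abi", "0124revrabbit.scf",
    "0124fwdrabbit.scf", "0123revrabbit.abi",
]
pairs = find_pairs(reads)
for fwd, rev in pairs:
    print(fwd, "<->", rev)
```

In this sample, 0124fwdrabbit.abi stays unpaired because the only reverse read numbered 0124 for the rabbit project has the “.scf” extension, and 0123revrabbit.abi stays unpaired because no forward rabbit read carries the number 0123.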

 

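You may notice that the expression (\d+)fwd(\D+\.)(\D{3}) has a pair of parentheses it does not need. Because the second and third phrases are adjacent, merging them captures the same text, so the expression can be simplified to (\d+)fwd(\D+\.\D{3}).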

 

Either expression can be used in SeqMan; which one you choose is a matter of personal preference.
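As a quick check that the three-phrase and two-phrase forward expressions discussed above select the same text, the sketch below compares what each captures. It assumes Python's `re` syntax is close enough for these patterns; `pairing_key` and the sample names are hypothetical and exist only for this illustration.

```python
import re

# Original form with three phrases, and the simplified form with two.
THREE_GROUPS = re.compile(r"(\d+)fwd(\D+\.)(\D{3})")
TWO_GROUPS = re.compile(r"(\d+)fwd(\D+\.\D{3})")


def pairing_key(pattern, name):
    """Join the captured phrases into one string, or return None if the name does not match.

    Hypothetical helper for illustration; not part of SeqMan.
    """
    match = pattern.fullmatch(name)
    return "".join(match.groups()) if match else None


# Illustrative names: two forward reads, one reverse read, one with a longer extension.
names = [
    "0123fwdfox.abi",
    "0124fwdrabbit.scf",
    "0124revrabbit.scf",
    "0124fwdrabbit.fasta",
]
for name in names:
    print(name, pairing_key(THREE_GROUPS, name), pairing_key(TWO_GROUPS, name))
```

Both patterns capture the same characters for every name they accept and reject the same names, so the choice between them does not change which reads are paired.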