Sequence functions - User Guide to SeqNinja - 18.0

Functions resulting in sequence expressions always start with a dollar sign ($) and include:

Objective	Expression	Example	Comments
Reverse-complement sequences, or to assign a file to the reverse-complement of another file	`~(sequence_set)` `complement(sequence_set)`	`$rc=~(foo.fasta)` `foo_rc.fasta=complement(foo.fasta)` `MG1655_rc.fasta=~("C:\data\MG1655_e_coli_k_12substrands.fasta")`	Though the command is “complement” (for brevity), SeqNinja is actually calculating the reverse complement of the selection.
Cut a sequence into smaller pieces	`cut(sequence_set, size)`	`bar.fasta=cut(foo.gb, %i)`
Specify an overlap when cutting a sequence	`cut(sequence_set, size, offset)`	`bar.fasta=cut(foo.gb, 180, 60)`	This can be used to create faux reads from an assembled sequence.
Extract sub-sequences from sequences corresponding to the given features	`extract(sequence_set, 'feature_type[,...]')`	`bar.fasta=extract(foo.gb, 'CDS')` `bar.fasta=extract(foo.gb, 'CDS,gene')`	Single quotes are required for arguments other than the sequence set.
Extract matching features	`extract(sequence_set, 'feature_type:/tag="value"[,...]')`	`bar.fasta=extract(foo.gb, 'CDS:/gene="thrC"')` `bar.fasta=extract(foo.gb, 'CDS:/gene="thrC",CDS:/gene="thrL"')`	When extracting matching features, a qualifier can be optionally specified. If no qualifier is specified, the result includes all features of the given type. If a qualifier is specified, the result includes features that include a matching qualifier. Note: Writing many small extractions to a file format supporting features can be slow. For each extraction, all of the features are evaluated for intersection.
Extract matching features using wildcards	`extract(sequence_set, 'feature_type:/tag="value"[,...]')`	`bar.fasta=extract(foo.gb, 'CDS:/gene="thr?"')`	Wildcards can be used in the qualifier value. A ‘?’ matches exactly one arbitrary character, and a ‘*’ matches zero or more arbitrary characters. The example matches CDS features with four-character gene names beginning with “thr”.
Ignore source sequence qualifiers	`translate(sequence_set , '/codon_start=ignore')`	`$A=translate("myfile.gb", '/codon_start=ignore')`	Including “/codon_start=ignore” in the qualifiers causes any “/codon_start” qualifiers in the source sequence to be ignored.
Translate all source sequences from DNA/RNA to protein	`translate(sequence_set)`	`$A=translate("myfile.fasta")` `$B=translate("myfile.fasta", '/transl_table=11')` `Test.gb=translate("myfile.gb", '/transl_table=11')` `$C=translate( "myfile.fasta", '/transl_table=VERTM' )`	If the input sequence is annotated, each CDS feature will be translated separately, and any translation table and/or codon_start annotations will be honored. If the input is unannotated, or contains no CDS features, the entire sequence will be translated. Note that only complete codons (groups of three bases) are translated; an “extra” base or two at the end will not be reflected in the output. The standard genetic code is used by default unless a different one is specified in the file or in the qualifier overrides. The first codon in a sequence is translated as a start codon, if recognized as such in the genetic code. Otherwise, the default translation is used. Translating a DNA or RNA sequence that contains ambiguity codes may or may not produce an amino acid sequence with ambiguities. For example, the result of translate(“RAT”) is the ambiguity code “B”, but the result of translate(“ACN”) is “T”.
Override defaults or values specified in the file	`translate(sequence_set, '/tag=value [,/tag=value...]')`	`$A=translate("myfile.gb", '/transl_table=11,/codon_start=ignore')`	This argument is a comma-separated list of qualifiers.
Mark the sequences in a set as being DNA, RNA or protein.	`dna(sequence-set)` `rna(sequence-set)` `protein(sequence-set)`	`example.gb=protein("myfile.fasta")` `bar.fasta=protein("myfile.fasta")` `("NAN",rend)`	This ability may be useful for sequences originating in formats where type is unspecified, such as FASTA files. If you use any of these functions, the information will be added to the output file, where allowed (e.g., .gbk). The presence of sequence type information may affect the results of searching in an endpoint expression. For example, in DNA, “N”=anything, whereas in protein, “N”=asparagine.
Mark the sequences in a set as being circular or linear.	`circular(sequence-set)` `linear(sequence-set)`	`$A=circular(myfile.fasta)` `$B=linear(myfile.gb)` `myfile.gb=circular(myfile.fasta)("TAG",lend+4)` `myfile.gb=circular("ATG"+$B+"TAG")`	This ability may be useful for sequences originating in formats where topology is unspecified, such as FASTA files. If you use either of these functions, the information will be added to the output file, where allowed (e.g., .gbk). The presence of topology information may affect the results of searching in an endpoint expression. For example, a match can cross the origin in a circular sequence, but not in a linear one.
Collect multiple sequence-sets into one	`collect(sources)` (specified either as individual arguments or as single-quoted file patterns)	`$a=collect(foo.fasta, bar.fasta, baz.fasta)` `$a=collect("KSLLQQLLTE", "ARTKQTAR", "RPKPLVDP")` `$a=collect('C:/MyFolder/*.fasta')` `collect('*.gb', '*.genbank')`	This function accepts one or more arguments, each of which may be a sequence expression or a file pattern. A file pattern is a single-quoted search string that can match zero or more file paths. Wildcards (asterisks) may be used in the filename part of this string. This function allows: (1) multi-sequence files to be defined within a script; (2) multiple literals or files to be used in the function argument; (3) a custom subset of the sequences in a file to be built by collecting individual sequences within it; (4) a custom set to be defined once in a script and then re-used later; (5) filename patterns to be combined so as to refer to filenames with multiple extensions (see lower-most example at left). Note that it is possible for a sequence to appear more than once in the resulting output.
Strip specified data out of sequences in a set	`strip (sequence expression) (named arguments)+`	`strip("foo.gb")` `strip("foo.gb", data='features')` `strip("foo.gb", data='features,comments')` `strip("foo.gb", features='CDS')` `strip("foo.gb", features!='CDS')` `strip("foo.gb", features='CDS,gene')` `strip("foo.gb", features='CDS:/gene=yaaA')`	If no arguments are provided, the function strips the sequences in the argument down to their name and residues. Most meta-information is stripped, including features and comments. Arguments: `data` – Types of data to exclude entirely. `features` – Features to exclude, specified with the same syntax as the feature-matching argument in the `extract()` function. Additionally, the pattern is negated when the operator is `!=`. For example, the argument `features!='CDS,gene'` means to strip everything except for CDS and gene features.
Annotate sequences in the first argument with features obtained from a from* argument.	`annotate (sequence expression)+(named arguments)+`	`annotate( to.fasta, fromFile='myfile.vcf' )` `annotate( to.fasta, fromFile='myfile.vcf', ids='chr1,chr2,chr3,chrX' )` `annotate( to.fasta, fromSequences=from.gb )` `annotate( to.gb, fromSequences=from.gb )` `annotate( to.gb, fromSequences=from.gb, features='CDS' )` `annotate( to.gb, fromSequences=from.gb, features!='CDS' )` `annotate( to.gb, fromSequences=from.gb, features='CDS:/gene=yaaA' )`	Arguments: `fromFile` – Features are obtained from a features file. Supported formats include .vcf (variant call format) and .starff (SeqNinja feature files). Within the features file, the features for a particular sequence must be in a contiguous block. Blocks of features can be in any order in the file; they do not have to be in the same order as the sequences provided in the first argument. The `ids` argument (below) is necessary when the identification of a sequence differs between the sequence file and the feature file. `fromSequences` – Features are obtained from sequences parallel and aligned to those in the first set. `features` – [optional] Features to include, specified with the same syntax as the feature-matching argument in the `extract()` function. The pattern is negated when the operator is `!=`. For example, the argument `features!='CDS,gene'` means to copy everything except for CDS and gene features. `ids` – [optional] Necessary with `fromFile` when the identification of a sequence differs between the sequence file and the feature file. Its value is a list of the sequence IDs in the feature file, in the order of the sequences from the first argument.
Sample the sequences in the input set	`sample(sequence-set, argument=value)` (See description of arguments, below)	`sample("foo.fasta", from=10000, to=20000, by=10)` `sample("foo.fasta", p='0.95')` `sample("foo.fasta", name='GEK*')`	This can be useful for separating reads into different sets, or for reducing a very large number of reads to a smaller number (e.g., because of software limitations). Each of the arguments is optional. Any combination of arguments can be used, in any order. Each argument may be used at most once. The output sequences are in the same order in which they appear in the original set. To sample everything other than the specified value, precede the value with an exclamation mark `!`.
Sample the sequences in the input set	Arguments for the expression `sample`: `from` – defines the inclusive lower bound of the included sequences. `to` – defines the inclusive upper bound of the included sequences. `by` – includes only every nth element. `p` – is the probability that each element will be included. Calculations are made separately for each element. The elements chosen can differ between executions. The number of elements chosen is not deterministic. For large numbers of sequences, it is likely close to [p * number of sequences]. `name` – specifies matching sequence names, with a single-quoted string that may include an asterisk `*` as a wildcard. To exclude specific sequence names, you may use the alternate operator `!=`. For example, `sample("seq.gb", name!="rat*")` includes all sequences whose name does not start with “rat”. `minLength` – integral expression specifying the minimum length of matching sequences, inclusive. `maxLength` – integral expression specifying the maximum length of matching sequences, inclusive. `contains` – matches any sequence on either strand that contains the single-quoted value, which may contain ambiguity codes. `startsWith` – sequence begins with the specified characters. If the sequence type might be DNA, a match can occur on either strand. `endsWith` – sequence ends with the specified characters. If the sequence type might be DNA, a match can occur on either strand.
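
The wildcard rules described for `extract()` qualifier values (a ‘?’ matches exactly one character, a ‘*’ matches zero or more) can be illustrated outside SeqNinja with a short Python sketch. This is not SeqNinja code: the helper name `matches_qualifier` and the list of gene names are hypothetical, and the sketch assumes matching is case-sensitive, which the guide does not state. Python's `fnmatchcase` also gives special meaning to square brackets, which may not apply in SeqNinja.

```python
from fnmatch import fnmatchcase


def matches_qualifier(value: str, pattern: str) -> bool:
    """Return True if value matches pattern.

    '?' stands for exactly one character and '*' for zero or more.
    Case-sensitive matching is an assumption made for this sketch.
    """
    return fnmatchcase(value, pattern)


# Hypothetical gene names, used only to show which values each pattern selects.
genes = ["thrA", "thrB", "thrC", "thrL", "thr", "thrAB", "talB"]

# 'thr?' keeps only four-character names beginning with "thr".
four_letter_thr = [g for g in genes if matches_qualifier(g, "thr?")]

# 'thr*' keeps every name beginning with "thr", including "thr" itself.
any_thr = [g for g in genes if matches_qualifier(g, "thr*")]

print(four_letter_thr)  # ['thrA', 'thrB', 'thrC', 'thrL']
print(any_thr)          # ['thrA', 'thrB', 'thrC', 'thrL', 'thr', 'thrAB']
```

The first pattern corresponds to the `'CDS:/gene="thr?"'` example in the table: “thr” and “thrAB” are rejected because ‘?’ must consume exactly one character.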

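The note on translating ambiguous sequences (“RAT” gives “B”, “ACN” gives “T”) can be reproduced with a minimal Python sketch. It is an illustration of the idea, not SeqNinja's implementation: it expands each IUPAC nucleotide code into its possible bases, translates every resulting codon with the standard genetic code, and collapses the set of amino acids into one letter. The use of “Z” for Glu/Gln and “X” for any other mixed result is an assumption of this sketch; the guide confirms only the “B” and “T” cases. Start-codon handling, `/transl_table`, `/codon_start`, and CDS features are not modeled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the concrete bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Amino acid ambiguity codes (assumed here): B = Asp or Asn, Z = Glu or Gln.
AMBIGUOUS_AA = {frozenset("DN"): "B", frozenset("EQ"): "Z"}


def translate_codon(codon: str) -> str:
    """Translate one codon that may contain IUPAC ambiguity codes."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    options = {CODON_TABLE["".join(c)] for c in expansions}
    if len(options) == 1:
        return next(iter(options))
    # Any other mixture is reported as 'X' (an assumption of this sketch).
    return AMBIGUOUS_AA.get(frozenset(options), "X")


def translate(sequence: str) -> str:
    """Translate complete codons only; one or two trailing bases are dropped."""
    usable = len(sequence) - len(sequence) % 3
    return "".join(translate_codon(sequence[i:i + 3]) for i in range(0, usable, 3))


print(translate("RAT"))      # B  (AAT -> N, GAT -> D)
print(translate("ACN"))      # T  (all four ACx codons encode Thr)
print(translate("ATGGCCA"))  # MA (the trailing "A" is ignored)
```

“ACN” stays unambiguous because the third codon position is fully degenerate for threonine, whereas the two expansions of “RAT” encode different amino acids and therefore collapse to the ambiguity code.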