README

spexs is a program for generating patterns that are common in files of input sequences.

spexs -h
   Gives the current command-line help

http://ep.ebi.ac.uk/

spexs generates patterns that are common in files of input sequences.

Input file format:
 * sequence on one line
 * different input classes can be in separate files, or they may be labeled:
    * label of the sequence (class) can be at the end of the line (after space)
 * # is comment, as well as > (Fasta's 'id' line separator)
   ( ../Programs/Fasta2spexs )

Usage: spexs parameters

Parameters:
   [-h]                        # Show this help
   [-v nr]                     # Verbose level

   # Input sequences and minimum number of occurrences in each set
   -f filename [-ms nr] [-f filename [-ms nr]]*
   [-minstringsinclass class-char nr]

   # Total minimum number of positions and strings
   [-minpos nr] [-minstrings nr]

   # Continue search even if pattern matches only one position or string
   [-onepos] [-onestr]

   # Pattern language
   [-cf charset_file]          # Generated automatically in first run
   [-maxgrp nr]                # How many group positions

   # "Flexible wildcards"
   # How many gaps of length [min_gap,max_gap] are allowed
   [-max_gap_nr nr] [-min_gap nr] [-max_gap nr]
   # minimum nr of positions between gaps, initial value
   [-no_gap_len nr] [-init_gap_len nr]
   [-only_print_if_gap_allowed]

   [-genorder nr]              # In which order to generate patterns
        0 - breadth-first
        1 - depth-first
        2 - most frequent-first
        3 - least frequent-first
        X - others to be added
   [-freq_set nr]              # Frequency-based on which input set? (default- all)
   [-minlevel nr] [-maxlevel nr]   # Pattern 'length' restrictions

   # Goodness functions
   [-costf nr]                 # Cost function to calculate for each pattern
   [-showratio nr] [-goodratio float]   # Goodness ratio; print criteria
   [-binomial_prob float]      # Binomial probability bound
        # -binomial_prob 0.01  - P to observe at least nr. is less than 1%
        # -binomial_prob -0.01 - P to observe at most nr. is less than 1%
   [-showsets nr]

   # Experimental
   [-abandonequal nr]          # how to deal with equal position sets
   [-reportduplicates nr]

######################################################################

-f filename [-f filename ...]

   There can be many input files. Each file contains examples from a
   different input example class, unless the example class is given at
   the end of each line in the file.

   Input file format: input sequences on separate lines.
   A file can contain comment lines that begin with # at the start of the line.
   Fasta-type sequence names are currently treated as comments.
   Fasta-type splitting of an input sequence over multiple lines is not supported (yet).

   Input classes are assigned in the order of the filenames on the command line: 1,2,3,...

   If a line ends with a single separated character, that character is
   taken as the class name. E.g.:

   ATCTGATGCTATGATTCGATT +
   ATCTGATGCTATGATTCGATT -
   ...

   Lines can be of unlimited length...

-f filename -ms 10

   Find only patterns that occur in at least 10 sequences read from file filename

-cf charset_file

   The character set file should contain the full alphabet used in the
   examples. Only the character groups defined in this file are tried.
   The background distribution probabilities for each character are
   used for calculating the "probability measure" for each pattern.

   Stop characters can be used in the sequences so that patterns never
   cross the boundaries marked by the stop characters. For example, in
   DNA the reverse complement can be written on the same line; the
   pattern occurrence in a sequence is then counted only once.

   If no char set file is given, the default.charset file will be
   constructed automatically. This is probably the best way to start
   defining your own charset file.

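As an illustration of where per-character background probabilities could come from, the sketch below counts character frequencies in an input file laid out as described above (one sequence per line, # and > lines treated as comments, optional class label after a space). It is not the code spexs itself uses; the function name char_frequencies is hypothetical, and any special boundary or stop symbols that spexs may add on its own are not modelled.

```python
from collections import Counter


def char_frequencies(lines):
    """Return the relative frequency of each character in the sequences.

    Assumptions (taken from the input format notes, not from spexs code):
    lines starting with '#' or '>' are comments, and anything after the
    first whitespace on a line (e.g. a class label) is not sequence data.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line[0] in "#>":
            continue  # skip blank lines and comment / Fasta id lines
        sequence = line.split()[0]  # drop an optional trailing class label
        counts.update(sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in sorted(counts.items())}


if __name__ == "__main__":
    demo = [
        "# two labeled example sequences",
        "ATCTGATGCTATGATTCGATT +",
        "ATCTGATGCTATGATTCGATT -",
    ]
    for ch, freq in char_frequencies(demo).items():
        print(ch, round(freq, 6))
```

Comparing such counts with the values in an automatically generated charset file is one way to sanity-check a hand-edited file.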
Format of charset_file:

######################################################################
# Default character class file for data
alphabet= %ACGT
# Character probabilities based on occurrence frequency
p= _ 0.0851064
p= % 0.0212766
p= A 0.340426
p= C 0.0638298
p= G 0.191489
p= T 0.297872
# TOTAL: 1

# STOP character list
# List of STOP characters (members of alphabet)
# STOP= %?!   # Patterns do not cross over these stop characters
STOP= %       # Don't "jump" over % char

# Character groups
# g= ILE [hydro_fob]
# g= AT       # If no group name, then [AT] will be used automatically
# g= ATCG .   # . - any nucleotide
g= AT         # default pair is AT, output as [AT]
######################################################################

[-abandonequal nr]       // how to deal with equal position sets
[-reportduplicates nr]   # probably some "debugging" for abandonequal

   Tries to eliminate "equal" patterns based purely on the lists of
   occurrences of the patterns (not on the pattern contents)

-minlevel nr -maxlevel nr   // Pattern 'length' restrictions

   This is the "level" in the search tree (abandonequal can be confused by that)

-minpos nr -minstrings nr   // Total minimum number of positions and strings (sequences)

   At least nr sequences should contain the pattern

-genorder nr

   nr = 0   [default] breadth first
   nr = 1   depth first
   nr = 2   Go in decreasing number of sequences covered

-maxgrp 1

   How many groups per pattern are allowed. Groups are taken from the -cf file

-max_gap_nr nr     # how many gaps
-min_gap nr        # of minimum length
-max_gap nr        # and maximum length
                   # "built-in", not most conservative actual gaps found
-no_gap_len nr     minimum nr of positions between gaps,
-init_gap_len nr   initial value (internal nr)
-only_print_if_gap_allowed 1   # can change the no_gap_len for first gap

-costf 1   # sort of I-measure (nr. of bits)

-showsets nr

   Where nr is
   0   # implicit default: don't output sets
   1   # Notify implementation (List or Bit-array) and set-size
   2   # show full sets in either L or B format (internal representations)
   3   # always show as List format, even if Bit array
   4   # Show sets in the following format:

         Pos:Str:5<,248:4,315:5,382:6,449:7,516:8>

       Meaning: 5 elements;
         pos 248 in 4th string
         pos 315 in 5th string
         pos 382 in 6th string
         ...

   Position right now is the position in the catenation of all strings.
   For Alvis this should be OK right now. But in future I think the
   relative position INSIDE the string is important, and will be
   implemented.

-onepos -onestr

   Continue the search for patterns that occur in only 1 seq and/or at only 1 pos

When there are multiple files (classes) of sequences:

   -f file -ms min      # min nr of occurrences in that file
   -f file2 -ms min2    # in second file

   -ms relates to the preceding input file

-genorder 2 can be influenced when there are two or more sets of sequences:

   -freq_set nr   // If frequency-based search, then based on which input set (default- all)

For two sets of sequences:

   -showratio 1
      Turn on printing of the simple ratio (ratio in 1st set / ratio in 2nd set)

   -goodratio float
      Print only if that ratio is greater than or equal to float

   -binomial_prob float
      Print the binomial probability, set the threshold to float
      - What is the probability in input set 1 if the background
        probability is calculated from set 2
        BP:0.03
      Old:
      - How probable is it to have at least that number (OP)
      - How probable it is to have this number or less (UP)

Output should be interpreted as follows:

   spexs -f UPS-150 -f RAND -minstrings 20

   TAGTG 9.95163 1006/1059 1:516/546 2:490/513

First is the pattern, second is the (un)probability measure. The pattern
has altogether 1059 occurrences in 1006 sequences. Of these occurrences,
546 (516 sequences) were in the first dataset and 513 (490 sequences)
in the second dataset.

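The output line above can be split into its fields mechanically. The following is a minimal sketch of such a parser; the field layout is inferred only from the example lines in this README, and the function name parse_spexs_line and the dictionary keys are hypothetical. Any trailing KEY:value field that is not a set number is simply kept as a float.

```python
def parse_spexs_line(line):
    """Split one spexs result line into named fields.

    Layout assumed from the README example:
      PATTERN SCORE SEQS/OCCS SET:SEQS/OCCS ... [KEY:value ...]
    """
    fields = line.split()
    pattern = fields[0]
    score = float(fields[1])
    n_seqs, n_occs = (int(x) for x in fields[2].split("/"))

    per_set = {}  # set number -> (sequences, occurrences)
    extras = {}   # any other KEY:value field, stored as float
    for field in fields[3:]:
        key, _, value = field.partition(":")
        if key.isdigit():
            seqs, occs = (int(x) for x in value.split("/"))
            per_set[int(key)] = (seqs, occs)
        else:
            extras[key] = float(value)

    return {
        "pattern": pattern,
        "score": score,
        "sequences": n_seqs,
        "occurrences": n_occs,
        "per_set": per_set,
        "extras": extras,
    }


if __name__ == "__main__":
    result = parse_spexs_line("TAGTG 9.95163 1006/1059 1:516/546 2:490/513")
    print(result["pattern"], result["sequences"], result["occurrences"])
    for set_nr, (seqs, occs) in result["per_set"].items():
        print("set", set_nr, "sequences", seqs, "occurrences", occs)
```

In the example line the per-set sequence counts (516 and 490) add up to the total of 1006, and the per-set occurrence counts (546 and 513) add up to 1059, which is a convenient consistency check when post-processing output.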
   spexs -f UPS-150 -f RAND -minstrings 20 -showratio 1 -goodratio 2

   GAAAATTTTTC 19.9946 23/23 1:22/22 2:1/1 R:22.2353

Pattern GAAAATTTTTC occurred in 22 sequences from file UPS-150 and only
1 from file RAND. The goodness measure (ratio in UPS-150 / ratio in RAND)
is 22.2353

spexs -f TEST -h
spexs -f TEST -onepos
spexs -f TEST -onepos -onestr
spexs -f TEST -onestr
spexs -f TEST -onestr
spexs -f TEST -onepos -onestr
cp default.charclass CH
em CH
spexs -f TEST -onepos -onestr -cf CH
cat CH
em CH
spexs -f TEST -onepos -onestr -cf CH
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 |less
less TEST
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 |less
less TEST
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -abandonequal 1
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -abandonequal 1 |less
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -abandonequal 1 > t
less t
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minlevel 10
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minlevel 10 |less
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minlevel 10 -maxlevel 12 -
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minpos 2
less TEST
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minpos 2 -minstrings 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minpos 1 -minstrings 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minpos 3 -minstrings 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 |less
spexs -f Upstream -onepos -onestr -cf CH -maxgrp 1 -minstrings 500
spexs -f Upstream -onepos -onestr -cf CH -maxgrp 1 -minstrings 30
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 0
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 1
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 0 > o.0
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 1 > o.1
sort o.0 > O.0.s
sort o.1 > O.1.s
diff O.*
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 0
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 1
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 2 | sort > O.2.s
diff O.[12].*
spexs -f Upstream -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 2
spexs -f TEST -onepos -onestr -cf CH -maxgrp 1 -minstrings 2 -genorder 2 -max_gap_nr 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1|less
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10
less TEST
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed 1
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed 1 -init_gap_len 2
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed 1 -init_gap_len 1
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed 1 -init_gap_len -2
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 2 -max_gap 10 -no_gap_len 3 -only_print_if_gap_allowed 1 -init_gap_len -2
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 2 -max_gap 10 -no_gap_len 3 -only_pri
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -reportduplicates
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -reportduplicates 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -reportduplicates 1|less
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 0
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 2
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 3
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 2
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -costf 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 0
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 1
spexs -f Upstream -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 1
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 2
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 3
less TEST
spexs -f TEST -onepos -onestr -cf CH -minstrings 2 -genorder 2 -max_gap_nr 1 -showsets 4
em TEST2
spexs -f TEST -f TEST2
spexs -f TEST -f TEST2 -onepos -onestr -cf CH -minstrings 2 -genorder 2
spexs -f TEST -ms 3 -f TEST2 -onepos -onestr -cf CH -genorder 2
spexs -f TEST -ms 2 -f TEST2 -onepos -onestr -cf CH -genorder 2
spexs -f TEST -ms 2 -f TEST2 -ms 2 -onepos -onestr -cf CH -genorder 2
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 |less
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -freq_set 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -freq_set 1 |less
spexs -f TEST -ms 1 -f TEST2 -ms 1 -f TEST2 -onepos -onestr -cf CH -genorder 2 -freq_set 1 |less
spexs -f TEST -ms 1 -f TEST2 -ms 1 -f TEST2 -onepos -onestr -cf CH -genorder 2 -showratio 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1|less
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 2 |less
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 1.5
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 1 -binomial_prob
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -goodratio 1 -binomial_prob 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -showratio 1 -binomial_prob 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 1
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 0.5
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 0.0001
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 0.01
spexs -f TEST -ms 1 -f TEST2 -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 0.5
spexs -f Upstream -ms 1 -f Random -ms 1 -onepos -onestr -cf CH -genorder 2 -binomial_prob 0.5
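
The last commands above experiment with different -binomial_prob thresholds. As a rough illustration of the quantity described for that option (how probable the count seen in set 1 is when the background probability is calculated from set 2), here is a minimal sketch of the two binomial tails. This README does not state exactly which counts spexs uses or how it derives the background rate, so the function names, the choice p = k2 / n2, and the example numbers are all assumptions made for illustration.

```python
from math import comb


def binomial_tail_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def binomial_tail_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real spexs run:
    k1, n1 = 22, 100   # sequences matching the pattern in set 1, set 1 size
    k2, n2 = 1, 100    # sequences matching the pattern in set 2, set 2 size

    p_background = k2 / n2  # assumed background rate, estimated from set 2
    print("P(at least k1):", binomial_tail_at_least(k1, n1, p_background))
    print("P(at most k1): ", binomial_tail_at_most(k1, n1, p_background))
```

Under this reading, a positive threshold such as 0.01 corresponds to the "at least" tail and a negative one such as -0.01 to the "at most" tail, matching the two cases listed in the help text.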