
Commit

10/23/2020
-Removed Cytoscape import upon database installation
-Fixed multi-match issues with matched datasets (lateral merge)
-Allow for direct integration of h5 and mtx files (see the usage sketch below)
-Disabled exon.bed export to prevent unexplained random processing errors during BAM file export
nsalomonis committed Oct 24, 2020
1 parent 6cf103d commit fc11e16
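
Both the cellHarmony query handling in UI.py and the new mergeFiles logic below route sparse 10x inputs through ChromiumProcessing.import10XSparseMatrix before any downstream analysis. A minimal usage sketch, assuming the function returns the path of the flattened tab-delimited matrix it writes; the input path and species code here are placeholders:

from import_scripts import ChromiumProcessing

expr_input_dir = '/data/filtered_feature_bc_matrix.h5'  # hypothetical 10x input path
species = 'Hs'                                           # hypothetical species code

# Convert .h5/.mtx sparse matrices to a flat expression file, mirroring the UI.py hunk below
if '.h5' in expr_input_dir or '.mtx' in expr_input_dir:
    expr_input_dir = ChromiumProcessing.import10XSparseMatrix(
        expr_input_dir, species, 'cellHarmony-Query')
print(expr_input_dir)  # path to the converted tab-delimited expression file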
Showing 10 changed files with 282 additions and 43 deletions.
19 changes: 13 additions & 6 deletions Config/options.txt
@@ -20,7 +20,7 @@ dataset_name Give a name to this dataset enter InputCELFiles --- --- --- ---
input_cel_dir Select the CEL file containing folder folder InputCELFiles --- --- --- --- --- --- ---
input_fastq_dir (optional) Select fastq files to run in Kallisto folder InputCELFiles NA NA NA NA NA --- NA
output_CEL_dir Select an AltAnalyze result output directory folder InputCELFiles --- --- --- --- --- --- ---
multithreading Use multithreading for read genomic annotation comboBox InputCELFiles yes NA NA NA NA NA yes|no NA
multithreading Use multithreading for read genomic annotation comboBox InputCELFiles no NA NA NA NA NA yes|no NA
build_exon_bedfile Build exon coordinate bed file to obtain BAM file exon counts\k(see the online tutorial for additional details and information) single-checkbox InputCELFiles NA NA NA NA NA --- NA
channel_to_extract Extract data from the following channels comboBox InputCELFiles NA NA NA green|red|green/red ratio|red/green ratio NA NA NA
remove_xhyb Remove probesets that have large cross-hybridization scores single-checkbox InputCELFiles --- --- NA NA NA NA NA
@@ -139,10 +139,16 @@ modelDiscovery (LineageProflier) Iterative model discovery comboBox LineageProfi
input_data_file Select the tab-delimited expression file for ID Translation file IDConverter --- --- --- --- --- --- ---
input_source Select the file ID system (first column) comboBox IDConverter --- --- --- --- --- --- ---
output_source Select the ID system to add to this file comboBox IDConverter --- --- --- --- --- --- ---
input_file1 Select the first file to merge file MergeFiles --- --- --- --- --- --- ---
input_file2 Select the second file to merge file MergeFiles --- --- --- --- --- --- ---
input_file3 Select the third file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file4 Select the fourth file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file1 Select the 1st file to merge file MergeFiles --- --- --- --- --- --- ---
input_file2 Select the 2nd file to merge file MergeFiles --- --- --- --- --- --- ---
input_file3 Select the 3rd file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file4 Select the 4th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file5 Select the 5th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file6 Select the 6th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file7 Select the 7th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file8 Select the 8th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file9 Select the 9th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file10 Select the 10th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
join_option Join files based on their comboBox MergeFiles Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union
ID_option Only return one-to-one ID relationships comboBox MergeFiles False|True False|True False|True False|True False|True False|True False|True
output_merge_dir Select the folder to save the merged file folder MergeFiles --- --- --- --- --- --- ---
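
The join_option and ID_option rows above only declare GUI choices; the merge itself is handled in import_scripts/mergeFiles.py. A minimal sketch of what Intersection versus Union means for the merged row IDs, using made-up gene symbols and plain dictionaries:

file1 = {'TP53': ['5.2'], 'GAPDH': ['9.1']}   # hypothetical ID -> values from the 1st file
file2 = {'TP53': ['4.8'], 'ACTB': ['8.7']}    # hypothetical ID -> values from the 2nd file

intersection_ids = [uid for uid in file1 if uid in file2]             # IDs found in every file
union_ids = list(file1) + [uid for uid in file2 if uid not in file1]  # IDs found in any file

print(intersection_ids)  # ['TP53']
print(union_ids)         # ['TP53', 'GAPDH', 'ACTB']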
@@ -175,12 +181,13 @@ FoldDiff Fold change filter cutoff enter PredictGroups 4 --- --- --- --- --- -
rho_cutoff Minimum Pearson correlation cutoff enter PredictGroups 0.2 --- --- --- --- --- --- ---
SamplesDiffering Minimum number of cells differing enter PredictGroups 4 --- --- --- --- --- --- ---
dynamicCorrelation ICGS will identify an optimal correlation cutoff comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
removeOutliers Remove low expression outlier samples comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
removeOutliers Remove low expression outlier cells comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
featuresToEvaluate Features to evaluate comboBox PredictGroups Genes Genes|AltExon|Both Genes|AltExon|Both Genes|AltExon|Both Genes Genes|AltExon|Both Genes|AltExon|Both Genes
restrictBy Restrict genes to protein coding comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
column_metric_predict Select the column clustering metric comboBox PredictGroups http://docs.scipy.org/doc/scipy/reference/spatial.distance.html cosine braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule
column_method_predict Select the column clustering method comboBox PredictGroups http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach
k (optional) number of user-defined ICGS clusters (k) enter PredictGroups --- --- --- --- --- --- ---
downsample (optional) Cells to down-sample to (PageRank) enter PredictGroups 2500 --- --- --- --- --- --- ---
GeneSelectionPredict (optional) Enter genes to build clusters from (guides) enter PredictGroups --- --- --- --- --- --- ---
GeneSetSelectionPredict (optional) Or select guide GeneSet/Ontology comboBox PredictGroups
PathwaySelectionPredict (optional) Select guide specific GeneSet(s) multiple-comboBox PredictGroups
16 changes: 9 additions & 7 deletions GO_Elite.py
@@ -229,12 +229,14 @@ def moveMAPPFinderFiles(input_dir):
#try: UI.WarningWindow(print_out,' Exit ')
#except Exception: print print_out
proceed = 'no'
while proceed == 'no':
try: os.remove(fn); proceed = 'yes'
except Exception:
print 'Tried to move the file',mappfinder_input,'to an archived folder, but it is currently open.'
print 'Please close this file and hit return or quit GO-Elite'
inp = sys.stdin.readline()

#while proceed == 'no':
try: os.remove(fn); proceed = 'yes'
except Exception:
pass
#print 'Tried to move the file',mappfinder_input,'to an archived folder, but it is currently open.'
#print 'Please close this file and hit return or quit GO-Elite'
#inp = sys.stdin.readline()

def checkPathwayType(filename):
type='GeneSet'
@@ -1982,7 +1984,7 @@ def visualizePathways(species_code,oraDirTogeneDir,combined_results):
except Exception:
pass
wp_end_time = time.time(); time_diff = int(wp_end_time-wp_start_time)
print "Wikipathways output in %d seconds" % time_diff
print "Gene set results output in %d seconds" % time_diff

def makeVariableList(wp_to_visualize,species_code,mod,imageType):
variable_list=[]
3 changes: 3 additions & 0 deletions LineageProfilerIterate.py
@@ -2645,6 +2645,7 @@ def importAndCombineExpressionFiles(species,reference_exp_file,query_exp_file,cl
output_dir = root_dir+'/exp.'+string.replace(output_dir,'exp.','')
output_dir =string.replace(output_dir,'-OutliersRemoved','')
groups_dir = string.replace(output_dir,'exp.','groups.')

ref_exp_db,ref_headers,ref_col_clusters,cluster_format_reference = importExpressionFile(reference_exp_file,customLabels=customLabels)
cluster_results = clustering.remoteImportData(query_exp_file,geneFilter=ref_exp_db)
if len(cluster_results[0])>0: filterIDs = ref_exp_db
@@ -3587,6 +3588,7 @@ def importExpressionFile(input_file,ignoreClusters=False,filterIDs=False,customL
gene_to_symbol={}
symbol_to_gene={}
row_count=0

for line in open(input_file,'rU').xreadlines():
data = line.rstrip()
data = string.replace(data,'"','')
@@ -4290,6 +4292,7 @@ def convertICGSClustersToExpression(heatmap_file,query_exp_file,returnCentroids=

matrix_exp, column_header_exp, row_header_exp, dataset_name, group_db_exp = clustering.importData(expdir,geneFilter=row_header)
percent_found = (len(row_header_exp)*1.00)/len(row_header)

if percent_found<0.5:
print "...Incompatible primary ID (Symbol), converting to Ensembl"
import gene_associations; from import_scripts import OBO_import
17 changes: 17 additions & 0 deletions UI.py
@@ -1090,6 +1090,9 @@ def runLineageProfiler(fl, expr_input_dir, vendor, custom_markerFinder, geneMode
except: cellLabels = False
if cellLabels == '':
cellLabels = None
if '.h5' in expr_input_dir or '.mtx' in expr_input_dir:
from import_scripts import ChromiumProcessing
expr_input_dir = ChromiumProcessing.import10XSparseMatrix(expr_input_dir,species,'cellHarmony-Query')
try: LineageProfilerIterate.runLineageProfiler(species,platform,expr_input_dir,expr_input_dir,
codingtype,compendium_platform,customMarkers=custom_markerFinder,
geneModels=geneModel,modelSize=modelSize,fl=fl,label_file=cellLabels)
@@ -3302,11 +3305,13 @@ def getOnlineEliteDatabase(file_location_defaults,db_version,new_species_codes,u
dbs_added = 0

AltAnalyze_folders = read_directory(''); Cytoscape_found = 'no'
"""
for dir in AltAnalyze_folders:
if 'Cytoscape_' in dir: Cytoscape_found='yes'
if Cytoscape_found == 'no':
fln,status = update.download(goelite_url+'Cytoscape/cytoscape.tar.gz','','')
if 'Internet' not in status: print "Cytoscape program folder downloaded."
"""

count = verifyFileLength('AltDatabase/TreeView/TreeView.jar')
if count==0:
@@ -5415,6 +5420,12 @@ def rebootAltAnalyzeGUI(selected_parameters,user_variables):
input_file2 = gu.Results()['input_file2']
input_file3 = gu.Results()['input_file3']
input_file4 = gu.Results()['input_file4']
input_file5 = gu.Results()['input_file5']
input_file6 = gu.Results()['input_file6']
input_file7 = gu.Results()['input_file7']
input_file8 = gu.Results()['input_file8']
input_file9 = gu.Results()['input_file9']
input_file10 = gu.Results()['input_file10']
join_option = gu.Results()['join_option']
ID_option = gu.Results()['ID_option']
output_merge_dir = gu.Results()['output_merge_dir']
@@ -5426,6 +5437,12 @@ def rebootAltAnalyzeGUI(selected_parameters,user_variables):
files_to_merge = [input_file1, input_file2]
if len(input_file3)>0: files_to_merge.append(input_file3)
if len(input_file4)>0: files_to_merge.append(input_file4)
if len(input_file5)>0: files_to_merge.append(input_file5)
if len(input_file6)>0: files_to_merge.append(input_file6)
if len(input_file7)>0: files_to_merge.append(input_file7)
if len(input_file8)>0: files_to_merge.append(input_file8)
if len(input_file9)>0: files_to_merge.append(input_file9)
if len(input_file10)>0: files_to_merge.append(input_file10)
values = files_to_merge, join_option, ID_option, output_merge_dir
StatusWindow(values,analysis) ### display an window with download status
AltAnalyze.AltAnalyzeSetup((selected_parameters[:-1],user_variables)); sys.exit()
134 changes: 134 additions & 0 deletions import_scripts/mergeFiles.py
@@ -55,9 +55,143 @@ def cleanUpLine(line):
data = string.replace(data,'"','')
return data

def latteralMerge(files_to_merge,original_filename):
""" Merging files can be dangerous, if there are duplicate IDs (e.g., gene symbols).
To overcome issues in redundant gene IDs that are improperly matched (one row with zeros
and the other with values), this function determines if a lateral merge is more appropriate.
The latter merge:
1) Checks to see if the IDs are the same with the same order between the two or more datasets
2) merges the two or more matrices without looking at the genes.
Note: This function is attempts to be memory efficient and should be updated in the future to
merge blocks of row IDs sequentially."""

files_to_merge_revised = []
for filename in files_to_merge:
### If a sparse matrix - rename and convert to flat file
if '.h5' in filename or '.mtx' in filename:
from import_scripts import ChromiumProcessing
import export
file = export.findFilename(filename)
export_name = file[:-4]+'-filt'
if file == 'filtered_feature_bc_matrix.h5' or file == 'raw_feature_bc_matrix.h5' or file == 'filtered_gene_bc_matrix.h5' or file == 'raw_gene_bc_matrix.h5':
export_name = export.findParentDir(filename)
export_name = export.findFilename(export_name[:-1])
if file == 'matrix.mtx.gz' or file == 'matrix.mtx':
parent = export.findParentDir(filename)
export_name = export.findParentDir(parent)
export_name = export.findFilename(export_name[:-1])
filename = ChromiumProcessing.import10XSparseMatrix(filename,'species',export_name)
files_to_merge_revised.append(filename)
files_to_merge = files_to_merge_revised
print files_to_merge

includeFilenames = True
file_uids = {}
for filename in files_to_merge:
firstRow=True
fn=filepath(filename); x=0
if '/' in filename:
file = string.split(filename,'/')[-1][:-4]
else:
file = string.split(filename,'\\')[-1][:-4]
for line in open(fn,'rU').xreadlines():
data = cleanUpLine(line)
if '\t' in data:
t = string.split(data,'\t')
elif ',' in data:
t = string.split(data,',')
else:
t = string.split(data,'\t')
if firstRow:
firstRow = False
else:
uid = t[0]
try:
file_uids[file].append(uid)
except:
file_uids[file] = [uid]

perfectMatch = True
for file1 in file_uids:
uids1 = file_uids[file1]
for file2 in file_uids:
uids2 = file_uids[file2]
if uids1 != uids2:
print file1,file2
perfectMatch = False

if perfectMatch:
print 'All ordered IDs match in the files ... performing lateral merge instead of key ID merge to prevent multi-matches...'
firstRow=True
increment = 5000
low = 1
high = 5000
added = 1
eo = open(output_dir+'/MergedFiles.txt','w')
import collections

def exportMergedRows(low,high):
uid_values=collections.OrderedDict()
for filename in files_to_merge:
fn=filepath(filename); x=0; file_uids = {}
if '/' in filename:
file = string.split(filename,'/')[-1][:-4]
else:
file = string.split(filename,'\\')[-1][:-4]
firstRow=True
row_count = 0
uids=[] ### Over-ride this for each file
for line in open(fn,'rU').xreadlines():
row_count+=1
if row_count<=high and row_count>=low:
data = cleanUpLine(line)
if '\t' in data:
t = string.split(data,'\t')
elif ',' in data:
t = string.split(data,',')
else:
t = string.split(data,'\t')
if firstRow and low==1:
file = string.replace(file,'_matrix_CPTT','')
if includeFilenames:
header = [s + "."+file for s in t[1:]] ### add filename suffix
else:
header = t[1:]
try: uid_values[row_count]+=header
except: uid_values[row_count]=header
uids.append('UID')
firstRow=False
else:
uid = t[0]
try: uid_values[row_count] += t[1:]
except: uid_values[row_count] = t[1:]
uids.append(uid)
i=0
for index in uid_values:
uid = uids[i]
eo.write(string.join([uid]+uid_values[index],'\t')+'\n')
i+=1
print 'completed',low,high

uid_list = file_uids[file]
while (len(uid_list)+increment)>high:
exportMergedRows(low,high)
high+=increment
low+=increment
eo.close()
return True
else:
print 'Different identifier order in the input files encountered...'
return False

def combineAllLists(files_to_merge,original_filename,includeColumns=False):
headers =[]; files=[]

run = latteralMerge(files_to_merge,original_filename)
if run:
return ### Exit Merge Function

import collections
all_keys=collections.OrderedDict()
dataset_data=collections.OrderedDict()
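
For orientation, a minimal standalone sketch of the lateral merge that latteralMerge (above) performs: only when every file lists identical row IDs in identical order are the value columns concatenated side by side, avoiding key-based matching of duplicate gene symbols. File names here are hypothetical, and unlike this sketch the committed code processes rows in blocks of 5,000 to limit memory use:

import csv

def lateral_merge(paths, out_path):
    # Load each tab-delimited table fully (the committed code streams row blocks instead)
    tables = []
    for path in paths:
        with open(path) as f:
            tables.append([row for row in csv.reader(f, delimiter='\t')])
    # 1) Require identical, identically ordered IDs (first column, header excluded)
    ids = [[row[0] for row in table[1:]] for table in tables]
    if any(current != ids[0] for current in ids):
        return False  # identifier order differs; fall back to a key-based merge
    # 2) Concatenate value columns row by row without re-matching IDs
    with open(out_path, 'w') as out:
        for rows in zip(*tables):
            merged = [rows[0][0]] + [value for row in rows for value in row[1:]]
            out.write('\t'.join(merged) + '\n')
    return True

# Example with hypothetical files: lateral_merge(['exp.set1.txt', 'exp.set2.txt'], 'MergedFiles.txt')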