
Commit

10/23/2020
-Removed Cytoscape import upon database installation
-Fixed multi-match issues with matched datasets (lateral merge)
-Allow for direct integration of h5 and mtx files (see the usage sketch below)
-Disabled exon.bed export to prevent unexplained random processing errors during BAM file export
nsalomonis committed Oct 24, 2020
1 parent 6cf103d commit fc11e16
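
Both the cellHarmony query handling in UI.py and the new mergeFiles logic below route sparse 10x inputs through ChromiumProcessing.import10XSparseMatrix before any downstream analysis. A minimal usage sketch, assuming the function returns the path of the flattened tab-delimited matrix it writes; the input path and species code here are placeholders:

from import_scripts import ChromiumProcessing

expr_input_dir = '/data/filtered_feature_bc_matrix.h5'  # hypothetical 10x input path
species = 'Hs'                                           # hypothetical species code

# Convert .h5/.mtx sparse matrices to a flat expression file, mirroring the UI.py hunk below
if '.h5' in expr_input_dir or '.mtx' in expr_input_dir:
    expr_input_dir = ChromiumProcessing.import10XSparseMatrix(
        expr_input_dir, species, 'cellHarmony-Query')
print(expr_input_dir)  # path to the converted tab-delimited expression file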
Showing 10 changed files with 282 additions and 43 deletions.
19 changes: 13 additions & 6 deletions Config/options.txt
@@ -20,7 +20,7 @@ dataset_name Give a name to this dataset enter InputCELFiles --- --- --- ---
input_cel_dir Select the CEL file containing folder folder InputCELFiles --- --- --- --- --- --- ---
input_fastq_dir (optional) Select fastq files to run in Kallisto folder InputCELFiles NA NA NA NA NA --- NA
output_CEL_dir Select an AltAnalyze result output directory folder InputCELFiles --- --- --- --- --- --- ---
multithreading Use multithreading for read genomic annotation comboBox InputCELFiles yes NA NA NA NA NA yes|no NA
multithreading Use multithreading for read genomic annotation comboBox InputCELFiles no NA NA NA NA NA yes|no NA
build_exon_bedfile Build exon coordinate bed file to obtain BAM file exon counts\k(see the online tutorial for additional details and information) single-checkbox InputCELFiles NA NA NA NA NA --- NA
channel_to_extract Extract data from the following channels comboBox InputCELFiles NA NA NA green|red|green/red ratio|red/green ratio NA NA NA
remove_xhyb Remove probesets that have large cross-hybridization scores single-checkbox InputCELFiles --- --- NA NA NA NA NA
@@ -139,10 +139,16 @@ modelDiscovery (LineageProflier) Iterative model discovery comboBox LineageProfi
input_data_file Select the tab-delimited expression file for ID Translation file IDConverter --- --- --- --- --- --- ---
input_source Select the file ID system (first column) comboBox IDConverter --- --- --- --- --- --- ---
output_source Select the ID system to add to this file comboBox IDConverter --- --- --- --- --- --- ---
input_file1 Select the first file to merge file MergeFiles --- --- --- --- --- --- ---
input_file2 Select the second file to merge file MergeFiles --- --- --- --- --- --- ---
input_file3 Select the third file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file4 Select the fourth file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file1 Select the 1st file to merge file MergeFiles --- --- --- --- --- --- ---
input_file2 Select the 2nd file to merge file MergeFiles --- --- --- --- --- --- ---
input_file3 Select the 3rd file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file4 Select the 4th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file5 Select the 5th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file6 Select the 6th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file7 Select the 7th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file8 Select the 8th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file9 Select the 9th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
input_file10 Select the 10th file to merge (optional) file MergeFiles --- --- --- --- --- --- ---
join_option Join files based on their comboBox MergeFiles Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union Intersection|Union
ID_option Only return one-to-one ID relationships comboBox MergeFiles False|True False|True False|True False|True False|True False|True False|True
output_merge_dir Select the folder to save the merged file folder MergeFiles --- --- --- --- --- --- ---
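
The join_option and ID_option rows above only declare GUI choices; the merge itself is handled in import_scripts/mergeFiles.py. A minimal sketch of what Intersection versus Union means for the merged row IDs, using made-up gene symbols and plain dictionaries:

file1 = {'TP53': ['5.2'], 'GAPDH': ['9.1']}   # hypothetical ID -> values from the 1st file
file2 = {'TP53': ['4.8'], 'ACTB': ['8.7']}    # hypothetical ID -> values from the 2nd file

intersection_ids = [uid for uid in file1 if uid in file2]             # IDs found in every file
union_ids = list(file1) + [uid for uid in file2 if uid not in file1]  # IDs found in any file

print(intersection_ids)  # ['TP53']
print(union_ids)         # ['TP53', 'GAPDH', 'ACTB']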
@@ -175,12 +181,13 @@ FoldDiff Fold change filter cutoff enter PredictGroups 4 --- --- --- --- --- -
rho_cutoff Minimum Pearson correlation cutoff enter PredictGroups 0.2 --- --- --- --- --- --- ---
SamplesDiffering Minimum number of cells differing enter PredictGroups 4 --- --- --- --- --- --- ---
dynamicCorrelation ICGS will identify an optimal correlation cutoff comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
removeOutliers Remove low expression outlier samples comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
removeOutliers Remove low expression outlier cells comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
featuresToEvaluate Features to evaluate comboBox PredictGroups Genes Genes|AltExon|Both Genes|AltExon|Both Genes|AltExon|Both Genes Genes|AltExon|Both Genes|AltExon|Both Genes
restrictBy Restrict genes to protein coding comboBox PredictGroups yes yes|no yes|no yes|no yes|no yes|no yes|no yes|no
column_metric_predict Select the column clustering metric comboBox PredictGroups http://docs.scipy.org/doc/scipy/reference/spatial.distance.html cosine braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule braycurtis|canberra|chebyshev|cityblock|correlation|cosine|dice|euclidean|hamming|jaccard|kulsinski|mahalanobis|matching|minkowski|rogerstanimoto|russellrao|seuclidean|sokalmichener|sokalsneath|sqeuclidean|yule
column_method_predict Select the column clustering method comboBox PredictGroups http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach average|single|complete|weighted|ward|hopach
k (optional) number of user-defined ICGS clusters (k) enter PredictGroups --- --- --- --- --- --- ---
downsample (optional) Cells to down-sample to (PageRank) enter PredictGroups 2500 --- --- --- --- --- --- ---
GeneSelectionPredict (optional) Enter genes to build clusters from (guides) enter PredictGroups --- --- --- --- --- --- ---
GeneSetSelectionPredict (optional) Or select guide GeneSet/Ontology comboBox PredictGroups
PathwaySelectionPredict (optional) Select guide specific GeneSet(s) multiple-comboBox PredictGroups
16 changes: 9 additions & 7 deletions GO_Elite.py
@@ -229,12 +229,14 @@ def moveMAPPFinderFiles(input_dir):
#try: UI.WarningWindow(print_out,' Exit ')
#except Exception: print print_out
proceed = 'no'
while proceed == 'no':
try: os.remove(fn); proceed = 'yes'
except Exception:
print 'Tried to move the file',mappfinder_input,'to an archived folder, but it is currently open.'
print 'Please close this file and hit return or quit GO-Elite'
inp = sys.stdin.readline()

#while proceed == 'no':
try: os.remove(fn); proceed = 'yes'
except Exception:
pass
#print 'Tried to move the file',mappfinder_input,'to an archived folder, but it is currently open.'
#print 'Please close this file and hit return or quit GO-Elite'
#inp = sys.stdin.readline()

def checkPathwayType(filename):
type='GeneSet'
@@ -1982,7 +1984,7 @@ def visualizePathways(species_code,oraDirTogeneDir,combined_results):
except Exception:
pass
wp_end_time = time.time(); time_diff = int(wp_end_time-wp_start_time)
print "Wikipathways output in %d seconds" % time_diff
print "Gene set results output in %d seconds" % time_diff

def makeVariableList(wp_to_visualize,species_code,mod,imageType):
variable_list=[]
3 changes: 3 additions & 0 deletions LineageProfilerIterate.py
@@ -2645,6 +2645,7 @@ def importAndCombineExpressionFiles(species,reference_exp_file,query_exp_file,cl
output_dir = root_dir+'/exp.'+string.replace(output_dir,'exp.','')
output_dir =string.replace(output_dir,'-OutliersRemoved','')
groups_dir = string.replace(output_dir,'exp.','groups.')

ref_exp_db,ref_headers,ref_col_clusters,cluster_format_reference = importExpressionFile(reference_exp_file,customLabels=customLabels)
cluster_results = clustering.remoteImportData(query_exp_file,geneFilter=ref_exp_db)
if len(cluster_results[0])>0: filterIDs = ref_exp_db
@@ -3587,6 +3588,7 @@ def importExpressionFile(input_file,ignoreClusters=False,filterIDs=False,customL
gene_to_symbol={}
symbol_to_gene={}
row_count=0

for line in open(input_file,'rU').xreadlines():
data = line.rstrip()
data = string.replace(data,'"','')
@@ -4290,6 +4292,7 @@ def convertICGSClustersToExpression(heatmap_file,query_exp_file,returnCentroids=

matrix_exp, column_header_exp, row_header_exp, dataset_name, group_db_exp = clustering.importData(expdir,geneFilter=row_header)
percent_found = (len(row_header_exp)*1.00)/len(row_header)

if percent_found<0.5:
print "...Incompatible primary ID (Symbol), converting to Ensembl"
import gene_associations; from import_scripts import OBO_import
17 changes: 17 additions & 0 deletions UI.py
@@ -1090,6 +1090,9 @@ def runLineageProfiler(fl, expr_input_dir, vendor, custom_markerFinder, geneMode
except: cellLabels = False
if cellLabels == '':
cellLabels = None
if '.h5' in expr_input_dir or '.mtx' in expr_input_dir:
from import_scripts import ChromiumProcessing
expr_input_dir = ChromiumProcessing.import10XSparseMatrix(expr_input_dir,species,'cellHarmony-Query')
try: LineageProfilerIterate.runLineageProfiler(species,platform,expr_input_dir,expr_input_dir,
codingtype,compendium_platform,customMarkers=custom_markerFinder,
geneModels=geneModel,modelSize=modelSize,fl=fl,label_file=cellLabels)
@@ -3302,11 +3305,13 @@ def getOnlineEliteDatabase(file_location_defaults,db_version,new_species_codes,u
dbs_added = 0

AltAnalyze_folders = read_directory(''); Cytoscape_found = 'no'
"""
for dir in AltAnalyze_folders:
if 'Cytoscape_' in dir: Cytoscape_found='yes'
if Cytoscape_found == 'no':
fln,status = update.download(goelite_url+'Cytoscape/cytoscape.tar.gz','','')
if 'Internet' not in status: print "Cytoscape program folder downloaded."
"""

count = verifyFileLength('AltDatabase/TreeView/TreeView.jar')
if count==0:
@@ -5415,6 +5420,12 @@ def rebootAltAnalyzeGUI(selected_parameters,user_variables):
input_file2 = gu.Results()['input_file2']
input_file3 = gu.Results()['input_file3']
input_file4 = gu.Results()['input_file4']
input_file5 = gu.Results()['input_file5']
input_file6 = gu.Results()['input_file6']
input_file7 = gu.Results()['input_file7']
input_file8 = gu.Results()['input_file8']
input_file9 = gu.Results()['input_file9']
input_file10 = gu.Results()['input_file10']
join_option = gu.Results()['join_option']
ID_option = gu.Results()['ID_option']
output_merge_dir = gu.Results()['output_merge_dir']
@@ -5426,6 +5437,12 @@ def rebootAltAnalyzeGUI(selected_parameters,user_variables):
files_to_merge = [input_file1, input_file2]
if len(input_file3)>0: files_to_merge.append(input_file3)
if len(input_file4)>0: files_to_merge.append(input_file4)
if len(input_file5)>0: files_to_merge.append(input_file5)
if len(input_file6)>0: files_to_merge.append(input_file6)
if len(input_file7)>0: files_to_merge.append(input_file7)
if len(input_file8)>0: files_to_merge.append(input_file8)
if len(input_file9)>0: files_to_merge.append(input_file9)
if len(input_file10)>0: files_to_merge.append(input_file10)
values = files_to_merge, join_option, ID_option, output_merge_dir
StatusWindow(values,analysis) ### display an window with download status
AltAnalyze.AltAnalyzeSetup((selected_parameters[:-1],user_variables)); sys.exit()
134 changes: 134 additions & 0 deletions import_scripts/mergeFiles.py
@@ -55,9 +55,143 @@ def cleanUpLine(line):
data = string.replace(data,'"','')
return data

def latteralMerge(files_to_merge,original_filename):
""" Merging files can be dangerous, if there are duplicate IDs (e.g., gene symbols).
To overcome issues in redundant gene IDs that are improperly matched (one row with zeros
and the other with values), this function determines if a lateral merge is more appropriate.
The latter merge:
1) Checks to see if the IDs are the same with the same order between the two or more datasets
2) merges the two or more matrices without looking at the genes.
Note: This function is attempts to be memory efficient and should be updated in the future to
merge blocks of row IDs sequentially."""

files_to_merge_revised = []
for filename in files_to_merge:
### If a sparse matrix - rename and convert to flat file
if '.h5' in filename or '.mtx' in filename:
from import_scripts import ChromiumProcessing
import export
file = export.findFilename(filename)
export_name = file[:-4]+'-filt'
if file == 'filtered_feature_bc_matrix.h5' or file == 'raw_feature_bc_matrix.h5' or file == 'filtered_gene_bc_matrix.h5' or file == 'raw_gene_bc_matrix.h5':
export_name = export.findParentDir(filename)
export_name = export.findFilename(export_name[:-1])
if file == 'matrix.mtx.gz' or file == 'matrix.mtx':
parent = export.findParentDir(filename)
export_name = export.findParentDir(parent)
export_name = export.findFilename(export_name[:-1])
filename = ChromiumProcessing.import10XSparseMatrix(filename,'species',export_name)
files_to_merge_revised.append(filename)
files_to_merge = files_to_merge_revised
print files_to_merge

includeFilenames = True
file_uids = {}
for filename in files_to_merge:
firstRow=True
fn=filepath(filename); x=0
if '/' in filename:
file = string.split(filename,'/')[-1][:-4]
else:
file = string.split(filename,'\\')[-1][:-4]
for line in open(fn,'rU').xreadlines():
data = cleanUpLine(line)
if '\t' in data:
t = string.split(data,'\t')
elif ',' in data:
t = string.split(data,',')
else:
t = string.split(data,'\t')
if firstRow:
firstRow = False
else:
uid = t[0]
try:
file_uids[file].append(uid)
except:
file_uids[file] = [uid]

perfectMatch = True
for file1 in file_uids:
uids1 = file_uids[file1]
for file2 in file_uids:
uids2 = file_uids[file2]
if uids1 != uids2:
print file1,file2
perfectMatch = False

if perfectMatch:
print 'All ordered IDs match in the files ... performing lateral merge instead of key ID merge to prevent multi-matches...'
firstRow=True
increment = 5000
low = 1
high = 5000
added = 1
eo = open(output_dir+'/MergedFiles.txt','w')
import collections

def exportMergedRows(low,high):
uid_values=collections.OrderedDict()
for filename in files_to_merge:
fn=filepath(filename); x=0; file_uids = {}
if '/' in filename:
file = string.split(filename,'/')[-1][:-4]
else:
file = string.split(filename,'\\')[-1][:-4]
firstRow=True
row_count = 0
uids=[] ### Over-ride this for each file
for line in open(fn,'rU').xreadlines():
row_count+=1
if row_count<=high and row_count>=low:
data = cleanUpLine(line)
if '\t' in data:
t = string.split(data,'\t')
elif ',' in data:
t = string.split(data,',')
else:
t = string.split(data,'\t')
if firstRow and low==1:
file = string.replace(file,'_matrix_CPTT','')
if includeFilenames:
header = [s + "."+file for s in t[1:]] ### add filename suffix
else:
header = t[1:]
try: uid_values[row_count]+=header
except: uid_values[row_count]=header
uids.append('UID')
firstRow=False
else:
uid = t[0]
try: uid_values[row_count] += t[1:]
except: uid_values[row_count] = t[1:]
uids.append(uid)
i=0
for index in uid_values:
uid = uids[i]
eo.write(string.join([uid]+uid_values[index],'\t')+'\n')
i+=1
print 'completed',low,high

uid_list = file_uids[file]
while (len(uid_list)+increment)>high:
exportMergedRows(low,high)
high+=increment
low+=increment
eo.close()
return True
else:
print 'Different identifier order in the input files encountered...'
return False

def combineAllLists(files_to_merge,original_filename,includeColumns=False):
headers =[]; files=[]

run = latteralMerge(files_to_merge,original_filename)
if run:
return ### Exit Merge Function

import collections
all_keys=collections.OrderedDict()
dataset_data=collections.OrderedDict()
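
For orientation, a minimal standalone sketch of the lateral merge that latteralMerge (above) performs: only when every file lists identical row IDs in identical order are the value columns concatenated side by side, avoiding key-based matching of duplicate gene symbols. File names here are hypothetical, and unlike this sketch the committed code processes rows in blocks of 5,000 to limit memory use:

import csv

def lateral_merge(paths, out_path):
    # Load each tab-delimited table fully (the committed code streams row blocks instead)
    tables = []
    for path in paths:
        with open(path) as f:
            tables.append([row for row in csv.reader(f, delimiter='\t')])
    # 1) Require identical, identically ordered IDs (first column, header excluded)
    ids = [[row[0] for row in table[1:]] for table in tables]
    if any(current != ids[0] for current in ids):
        return False  # identifier order differs; fall back to a key-based merge
    # 2) Concatenate value columns row by row without re-matching IDs
    with open(out_path, 'w') as out:
        for rows in zip(*tables):
            merged = [rows[0][0]] + [value for row in rows for value in row[1:]]
            out.write('\t'.join(merged) + '\n')
    return True

# Example with hypothetical files: lateral_merge(['exp.set1.txt', 'exp.set2.txt'], 'MergedFiles.txt')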