Sufficiency test
[22]:
%load_ext autoreload
%autoreload 2
from creme import creme
from creme import utils
from creme import custom_model
import pandas as pd
import matplotlib.pyplot as plt
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
Load Enformer and example sequences
[93]:
data_dir = '../../../data'
track_index = [5111]
model = custom_model.Enformer(track_index=track_index)
[24]:
fasta_path = f'{data_dir}/GRCh38.primary_assembly.genome.fa'
seq_parser = utils.SequenceParser(fasta_path)
gene = 'GATA2_chr3_128487916_-'
gene_name, chrom, start, strand = gene.split('_')
wt_seq = seq_parser.extract_seq_centered(chrom, int(start), '-', model.seq_length)
[25]:
# TSS bin indices
bins = [447, 448]
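Why bins 447 and 448? Enformer predicts 896 output bins of 128 bp each, centered within its 196,608 bp input, so the two central bins (447 and 448) straddle the midpoint of the sequence — i.e. the TSS of a TSS-centered sequence. A quick sanity check of that geometry (`bin_to_coords` is a helper written here for illustration, not part of CREME):

```python
# Enformer output geometry: 896 bins x 128 bp, centered in a 196,608 bp input.
seq_length = 196_608
n_bins, bin_width = 896, 128
covered = n_bins * bin_width            # 114,688 bp covered by predictions
offset = (seq_length - covered) // 2    # uncovered flank on each side

def bin_to_coords(b):
    """Offset range (relative to sequence start) spanned by output bin b."""
    start = offset + b * bin_width
    return start, start + bin_width

print(bin_to_coords(447), bin_to_coords(448))
# bin 448 starts exactly at the sequence midpoint
```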
[26]:
wt = model.predict(wt_seq)[0,:,0]
[27]:
utils.plot_track([wt], color='green', zoom=[0, 896], marks=bins)
[27]:
<Axes: >
Sufficiency test
In this example we will test the sufficiency of (5kb) tiles within the enhancing context sequence of the GATA2 gene.
To run the test we need:
- a loaded model
- one-hot encoded WT sequence
- TSS tile coordinates (or any other fixed tile coordinates that are always embedded in the shuffled backgrounds)
- a list of tile coordinates to test
- num_shuffle - the number of shuffles to perform
- tile sequence (one-hot), which can be extracted from the coordinates if tile_seq is None
- optionally, mean=False to keep per-shuffle results instead of averaging them
- optionally, return_seqs=True to return the shuffled sequences for later use
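The quantity behind the test is simple: compare predicted TSS activity when the CRE tile is embedded alongside the TSS in shuffled backgrounds against a control with the TSS alone. A minimal NumPy sketch of that comparison (the arrays and names here are hypothetical stand-ins for model predictions, not CREME's API):

```python
import numpy as np

# Toy predictions standing in for Enformer outputs (hypothetical values).
wt_activity = 200.0                            # WT sequence activity at the TSS bins
mut_activity = np.array([150., 160., 155.])    # TSS + CRE tile, one value per shuffle
control_activity = np.array([30., 35., 32.])   # TSS alone, one value per shuffle

# Averaging over shuffles corresponds to mean=True; keeping the arrays to mean=False.
mean_mut = mut_activity.mean()
mean_control = control_activity.mean()

# Fraction of WT activity recovered when the CRE accompanies the TSS.
frac_recovered = mean_mut / wt_activity
print(frac_recovered)  # 0.775
```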
[28]:
perturb_window = 5000
N_shuffles = 10
tss_tile, cre_tiles = utils.set_tile_range(model.seq_length, perturb_window)
cre_tiles = cre_tiles[19:20]
print(f'Enhancing tile at position {cre_tiles[0][0]} - {cre_tiles[0][1]}')
print(f'TSS tile at center position {tss_tile[0]} - {tss_tile[1]}')
Enhancing tile at position 100804 - 105804
TSS tile at center position 95804 - 100804
Sufficiency of an enhancing tile of GATA2 gene TSS:
As we can see, the GATA2 TSS activity is low on its own (the control case of just the TSS) compared to the high activity of the WT. However, when we embed the enhancing CRE together with the TSS, the activity recovers — not fully, but to a much higher level than the control case.
[40]:
_, pred_mut, pred_control, control_sequences = creme.sufficiency_test(model, wt_seq, tss_tile, cre_tiles, N_shuffles, mean=False, return_seqs=True)
[41]:
ax=utils.plot_track(pred_mut.mean(axis=1)[:,:,0], color='green', label='TSS+CRE')
utils.plot_track([wt], ax=ax, zoom=[400, 500], marks=[447,448], color='grey', label='WT')
utils.plot_track(pred_control.mean(axis=1)[:,:,0], color='purple', label='just TSS', ax=ax, zoom=[400, 500], marks=[447,448])
ax.legend();
Fine-tile search
Note that the search parameters must be supplied as lists, e.g. scales=[500, 50], thresholds=[0.9, 0.7], N_batches=[1, 10] will run a fine-tile search with two stages: (i) at 500 bp, removing 1 tile at a time until the fraction explained drops below 0.9, and (ii) at 50 bp, removing 10 tiles at a time until the fraction explained drops below 0.7.
Once we set these custom parameters, we can run the search using:
- a loaded model
- one-hot encoded WT sequence
- control sequences (these should already contain the embedded TSS)
- TSS activity when the whole CRE is embedded (with the TSS) in background sequences
- tile start coordinate
- tile end coordinate
- scales (window sizes)
- thresholds (to stop the search)
- fraction of step size to move the sub-tile
- batch sizes
[67]:
track_index = [5111]
bins = [447, 448]
model = custom_model.Enformer(track_index=track_index, bin_index=bins)
[76]:
mut = pred_mut[:,:,bins,:].mean()
print(f'Mean CRE+TSS activity across bins 447 and 448 is {mut}')
scales = [500, 50]
thresholds = [0.9, 0.7]
frac=1
N_batches = [1, 10]
Mean CRE+TSS activity across bins 447 and 448 is 193.565185546875
What smallest set of sub-tiles recovers 80% of the activity of the GATA2 enhancer?
GATA2 TSS activity is low on its own (the control case of just the TSS embedded in shuffled sequences) compared to the high activity of the WT. However, when we embed an enhancing CRE (identified previously) together with the TSS, the activity recovers — not fully, but to a much higher level than the control case. But do we need the whole CRE for this effect?
[78]:
%%time
optimization_results = creme.prune_sequence(model,
wt_seq,
control_sequences,
mut,
cre_tiles[0][0],
cre_tiles[0][1],
scales,
thresholds,
frac,
N_batches)
Tile size = 500, threshold = 0.9
Starting score: 1.0
10
Starting optimization...
score = 1.0
Number of tiles to test: 10
Number of tiles at the end of iteration: 10, score = 1.0085970163345337, bps = 4500.0
score = 1.0085970163345337
Number of tiles to test: 9
Number of tiles at the end of iteration: 9, score = 1.011488437652588, bps = 4000.0
score = 1.011488437652588
Number of tiles to test: 8
Number of tiles at the end of iteration: 8, score = 1.007453203201294, bps = 3500.0
score = 1.007453203201294
Number of tiles to test: 7
Number of tiles at the end of iteration: 7, score = 0.9873055219650269, bps = 3000.0
score = 0.9873055219650269
Number of tiles to test: 6
Number of tiles at the end of iteration: 6, score = 0.9530054926872253, bps = 2500.0
score = 0.9530054926872253
Number of tiles to test: 5
Number of tiles at the end of iteration: 5, score = 0.9063980579376221, bps = 2000.0
score = 0.9063980579376221
Number of tiles to test: 4
Number of tiles at the end of iteration: 4, score = 0.8719760775566101, bps = 1500.0
Tile size = 50, threshold = 0.7
Starting score: 0.9063980579376221
40
Starting optimization...
score = 0.9063980579376221
Number of tiles to test: 40
Number of tiles at the end of iteration: 40, score = 0.9034410119056702, bps = 1500.0
score = 0.9034410119056702
Number of tiles to test: 30
Number of tiles at the end of iteration: 30, score = 0.8665218353271484, bps = 1000.0
score = 0.8665218353271484
Number of tiles to test: 20
Number of tiles at the end of iteration: 20, score = 0.7589319348335266, bps = 500.0
score = 0.7589319348335266
Number of tiles to test: 10
Number of tiles at the end of iteration: 10, score = 0.09304965287446976, bps = 0.0
CPU times: user 39.3 s, sys: 8.79 s, total: 48.1 s
Wall time: 7min 45s
Breakdown of results: the run above went through several iterations of pruning, each time testing whether the threshold for the acceptable fraction explained had been crossed. At the 20-tile iteration (the second-to-last) the score is still above the threshold, while the final iteration drops below it. To re-create the final sequence of 50 bp sub-tiles, we use the `insert_coords` values stored in the results, as shown below.
[92]:
final_sequences = control_sequences.copy()
for embed_start, embed_end in optimization_results[50]['insert_coords']:
final_sequences[:, embed_start: embed_end, :] = wt_seq[embed_start: embed_end, :].copy()
final_prediction = model.predict(final_sequences)
final_prediction.mean() / mut
[92]:
0.75893193