Craigslist Rent

Interpreting Housing rent in Dallas using open textual data¶

Dashboard

Load Libraries¶

In [9]:

Copied!





import pandas as pd
import numpy as np
import re, string
from tensorflow.keras.layers import  Dropout, Dense
from tensorflow.keras.models import Sequential
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import pandas as pd
import numpy as np
import re, string
from tensorflow.keras.layers import  Dropout, Dense
from tensorflow.keras.models import Sequential
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

Preparing Text Data¶

In [10]:

Copied!





pt1= pd.read_excel('./Data/Craigslist_All.xlsx')
# pt1.drop_duplicates(subset ="Description", 
#                      keep = False, inplace = True) 
print(pt1.shape)
pt1 = pt1[["Rent","Bath","Bed","Area","Description"]]
pt1.head()
pt1= pd.read_excel('./Data/Craigslist_All.xlsx')
# pt1.drop_duplicates(subset ="Description", 
#                      keep = False, inplace = True) 
print(pt1.shape)
pt1 = pt1[["Rent","Bath","Bed","Area","Description"]]
pt1.head()

(6087, 10)

Out[10]:

	Rent	Bath	Bed	Area	Description
0	1150	2Ba	2BR	1141	2/2 bath Ready Now. This is a beautiful spacio...
1	1140	1Ba	2BR	876	Garden-style charm meets clean, modern design ...
2	910	1Ba	1BR	659	VIEW OUR WEBSITE: \n\nhttp://TheThreadApt.com ...
3	759	1Ba	1BR	600	The great luxuries you have always wanted in a...
4	1070	1Ba	1BR	678	Up to $500* off plus no deposit and half-off f...

Split by Whitespace and Remove Punctuation¶

In [12]:

Copied!

from string import digits
from string import digits

In [13]:

Copied!





def clean_text(article):
    clean1 = re.sub(r'['+string.punctuation + '’—”'+']', "", article.lower())
    return re.sub(r'\W+', ' ', clean1)
remove_digits = str.maketrans('', '', digits)
pt1['tokenized'] = pt1['Description'].map(lambda x: clean_text(str(x).translate(remove_digits).strip().replace('xa0',"")))

pt1.head()
def clean_text(article):
    clean1 = re.sub(r'['+string.punctuation + '’—”'+']', "", article.lower())
    return re.sub(r'\W+', ' ', clean1)
remove_digits = str.maketrans('', '', digits)
pt1['tokenized'] = pt1['Description'].map(lambda x: clean_text(str(x).translate(remove_digits).strip().replace('xa0',"")))

pt1.head()

Out[13]:

	Rent	Bath	Bed	Area	Description	Area_Cate	tokenized
0	1150	2Ba	2BR	1141	2/2 bath Ready Now. This is a beautiful spacio...	Area:1000-1500	bath ready now this is a beautiful spacious a...
1	1140	1Ba	2BR	876	Garden-style charm meets clean, modern design ...	Area:<1000	gardenstyle charm meets clean modern design in...
2	910	1Ba	1BR	659	VIEW OUR WEBSITE: \n\nhttp://TheThreadApt.com ...	Area:<1000	view our website httpthethreadaptcom the threa...
3	759	1Ba	1BR	600	The great luxuries you have always wanted in a...	Area:<1000	the great luxuries you have always wanted in a...
4	1070	1Ba	1BR	678	Up to $500* off plus no deposit and half-off f...	Area:<1000	up to off plus no deposit and halfoff fees res...

Lemmatization¶

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.

In [14]:

Copied!

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [20]:

Copied!





lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for i, row in pt1.iterrows():
    word = ""
    for w in row['tokenized'].split():
        word1 = wordnet_lemmatizer.lemmatize(w, pos = "n")
        word2 = wordnet_lemmatizer.lemmatize(word1, pos = "v")
        word3 = wordnet_lemmatizer.lemmatize(word2, pos = ("a"))
        word  = word + " " + word3
    lemma_word.append(word)
pt1['lemma_word'] = lemma_word
pt1['lemma_word'].head()
lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for i, row in pt1.iterrows():
    word = ""
    for w in row['tokenized'].split():
        word1 = wordnet_lemmatizer.lemmatize(w, pos = "n")
        word2 = wordnet_lemmatizer.lemmatize(word1, pos = "v")
        word3 = wordnet_lemmatizer.lemmatize(word2, pos = ("a"))
        word  = word + " " + word3
    lemma_word.append(word)
pt1['lemma_word'] = lemma_word
pt1['lemma_word'].head()

Out[20]:

0     bath ready now this be a beautiful spacious a...
1     gardenstyle charm meet clean modern design in...
2     view our website httpthethreadaptcom the thre...
3     the great luxury you have always want in a on...
4     up to off plus no deposit and halfoff fee res...
Name: lemma_word, dtype: object

Filtering Extremes¶

Rent < 2.5% and Rent > 97.5%

In [15]:

Copied!

pt1["Rent"].plot(kind="density")
pt1["Rent"].plot(kind="density")

Out[15]:

<Axes: ylabel='Density'>

No description has been provided for this image

In [16]:

Copied!





n_025 = pt1['Rent'].quantile(0.025)
n_975= pt1['Rent'].quantile(0.975)
pt1 = pt1[pt1['Rent']>=n_025]
pt1 = pt1[pt1['Rent']<=n_975] 
pt1["Rent"].plot(kind="density")
n_025 = pt1['Rent'].quantile(0.025)
n_975= pt1['Rent'].quantile(0.975)
pt1 = pt1[pt1['Rent']>=n_025]
pt1 = pt1[pt1['Rent']<=n_975] 
pt1["Rent"].plot(kind="density")

Out[16]:

<Axes: ylabel='Density'>

Categorize Area as well¶

In [17]:

Copied!





# Define the bins and labels for categorization
bins = [0, 1000, 1500, 2000, 2500, 3000, 4000, float('inf')]
labels = ['<1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-4000', '>4000']

# Categorize the "Area" variable
pt1['Area_Cate'] = pd.cut(pt1['Area'], bins=bins, labels=labels, right=False)

# Add the prefix "Area_" to the categories
pt1['Area_Cate'] = 'Area:' + pt1['Area_Cate'].astype(str)

# View the result
pt1.head()
# Define the bins and labels for categorization
bins = [0, 1000, 1500, 2000, 2500, 3000, 4000, float('inf')]
labels = ['<1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-4000', '>4000']

# Categorize the "Area" variable
pt1['Area_Cate'] = pd.cut(pt1['Area'], bins=bins, labels=labels, right=False)

# Add the prefix "Area_" to the categories
pt1['Area_Cate'] = 'Area:' + pt1['Area_Cate'].astype(str)

# View the result
pt1.head()

Out[17]:

	Rent	Bath	Bed	Area	Description	Area_Cate	tokenized
0	1150	2Ba	2BR	1141	2/2 bath Ready Now. This is a beautiful spacio...	Area:1000-1500	bath ready now this is a beautiful spacious a...
1	1140	1Ba	2BR	876	Garden-style charm meets clean, modern design ...	Area:<1000	gardenstyle charm meets clean modern design in...
2	910	1Ba	1BR	659	VIEW OUR WEBSITE: \n\nhttp://TheThreadApt.com ...	Area:<1000	view our website httpthethreadaptcom the threa...
3	759	1Ba	1BR	600	The great luxuries you have always wanted in a...	Area:<1000	the great luxuries you have always wanted in a...
4	1070	1Ba	1BR	678	Up to $500* off plus no deposit and half-off f...	Area:<1000	up to off plus no deposit and halfoff fees res...

Remove stop words and merge all info into corpus¶

In [24]:

Copied!

# nltk.download('stopwords')
# nltk.download('stopwords')

In [21]:

Copied!

from nltk.corpus import stopwords
from nltk.corpus import stopwords

In [26]:

Copied!





X = []
for i, row in pt1.iterrows():
    for sw in stopwords.words('english'):
        words = row['Bath'] + " " + row['Bath'] + row['Area_Cate'] + " " + row['lemma_word']
        words.replace(sw,"")
    X.append(words)
    
pt1['corpus'] = X
pt1.head()
X = []
for i, row in pt1.iterrows():
    for sw in stopwords.words('english'):
        words = row['Bath'] + " " + row['Bath'] + row['Area_Cate'] + " " + row['lemma_word']
        words.replace(sw,"")
    X.append(words)
    
pt1['corpus'] = X
pt1.head()

Out[26]:

	Rent	Bath	Bed	Area	Description	Area_Cate	tokenized	lemma_word	corpus
0	1150	2Ba	2BR	1141	2/2 bath Ready Now. This is a beautiful spacio...	Area:1000-1500	bath ready now this is a beautiful spacious a...	bath ready now this be a beautiful spacious a...	2Ba 2Ba Area:1000-1500 bath ready now this...
1	1140	1Ba	2BR	876	Garden-style charm meets clean, modern design ...	Area:<1000	gardenstyle charm meets clean modern design in...	gardenstyle charm meet clean modern design in...	1Ba 1Ba Area:<1000 gardenstyle charm meet ...
2	910	1Ba	1BR	659	VIEW OUR WEBSITE: \n\nhttp://TheThreadApt.com ...	Area:<1000	view our website httpthethreadaptcom the threa...	view our website httpthethreadaptcom the thre...	1Ba 1Ba Area:<1000 view our website httpth...
3	759	1Ba	1BR	600	The great luxuries you have always wanted in a...	Area:<1000	the great luxuries you have always wanted in a...	the great luxury you have always want in a on...	1Ba 1Ba Area:<1000 the great luxury you ha...
4	1070	1Ba	1BR	678	Up to $500* off plus no deposit and half-off f...	Area:<1000	up to off plus no deposit and halfoff fees res...	up to off plus no deposit and halfoff fee res...	1Ba 1Ba Area:<1000 up to off plus no depos...

In [27]:

Copied!





pt1['num_wds'] = pt1['corpus'].str.split().apply(lambda x: len(set(x)))
ax=pt1['num_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('ADs Length in Words\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Words', fontsize=18)
pt1['num_wds'] = pt1['corpus'].str.split().apply(lambda x: len(set(x)))
ax=pt1['num_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('ADs Length in Words\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Words', fontsize=18)

Out[27]:

Text(0.5, 0, 'Number of Words')

In [28]:

Copied!

pt1 = pt1[pt1['num_wds']>=10]
pt1 = pt1[pt1['num_wds']>=10]

Tokenizing the Text¶

Use Keras' Tokenizer to convert text to sequences of integers.
The number of words (num_words) to keep in the vocabulary is based on the unique words in your corpus.

In [29]:

Copied!

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [30]:

Copied!





# Initialize and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(pt1['corpus'])
sequences = tokenizer.texts_to_sequences(pt1['corpus'])

# Pad sequences to ensure uniform input size
max_length = max(len(seq) for seq in sequences)
X_padded = pad_sequences(sequences, maxlen=max_length)

# Number of unique words in the corpus
num_words = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
num_words
# Initialize and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(pt1['corpus'])
sequences = tokenizer.texts_to_sequences(pt1['corpus'])

# Pad sequences to ensure uniform input size
max_length = max(len(seq) for seq in sequences)
X_padded = pad_sequences(sequences, maxlen=max_length)

# Number of unique words in the corpus
num_words = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
num_words

Out[30]:

Preparing the Target and Splitting the Dataset¶

Prepare your target variable and split the data into training and testing sets.

In [31]:

Copied!

from sklearn.model_selection import train_test_split
from sklearn.model_selection import train_test_split

In [32]:

Copied!

# Target variable
y = pt1['Rent'].values

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=1)
# Target variable
y = pt1['Rent'].values

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=1)

LSTM Modeling¶

Create an LSTM model using Keras.

This example uses a LSTM layer followed by a dense layer with linear activation to predict the rent.

In [41]:

Copied!





from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt

In [35]:

Copied!





model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=100, input_length=max_length))  # 100-dimensional word vectors
model.add(LSTM(64))  # 64 LSTM units
model.add(Dense(1, activation='linear'))  # Linear activation for regression
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=100, input_length=max_length))  # 100-dimensional word vectors
model.add(LSTM(64))  # 64 LSTM units
model.add(Dense(1, activation='linear'))  # Linear activation for regression

In [55]:

Copied!





# model.compile(optimizer='adam', loss='mean_squared_error',metrics=['mean_absolute_error'])
# # Define checkpoint callback to save the model weights
# checkpoint_path = "model_checkpoint.h5"
# checkpoint_callback = ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# history = model.fit(X_train, y_train, batch_size=128, epochs=200, validation_split=0.2, callbacks=[checkpoint_callback])
# test_mse = model.evaluate(X_test, y_test)
# print(f'Test MSE: {test_mse}')
# model.compile(optimizer='adam', loss='mean_squared_error',metrics=['mean_absolute_error'])
# # Define checkpoint callback to save the model weights
# checkpoint_path = "model_checkpoint.h5"
# checkpoint_callback = ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# history = model.fit(X_train, y_train, batch_size=128, epochs=200, validation_split=0.2, callbacks=[checkpoint_callback])
# test_mse = model.evaluate(X_test, y_test)
# print(f'Test MSE: {test_mse}')

In [52]:

Copied!





# Print test distribution
plt.hist(y_test, bins=20, color='blue', alpha=0.5, label='Actual')
plt.hist(model.predict(X_test), bins=20, color='red', alpha=0.5, label='Predicted')
plt.legend()
plt.title('Test Distribution - Actual vs Predicted')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

# Evaluate the model on the test set
test_mse, test_mae = model.evaluate(X_test, y_test, verbose=0)

# Calculate MAPE
y_pred = model.predict(X_test)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'Test MSE: {test_mse:.4f}')
print(f'Test MAE: {test_mae:.4f}')
print(f'Test MAPE: {mape:.4f}%')
# Print test distribution
plt.hist(y_test, bins=20, color='blue', alpha=0.5, label='Actual')
plt.hist(model.predict(X_test), bins=20, color='red', alpha=0.5, label='Predicted')
plt.legend()
plt.title('Test Distribution - Actual vs Predicted')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

# Evaluate the model on the test set
test_mse, test_mae = model.evaluate(X_test, y_test, verbose=0)

# Calculate MAPE
y_pred = model.predict(X_test)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f'Test MSE: {test_mse:.4f}')
print(f'Test MAE: {test_mae:.4f}')
print(f'Test MAPE: {mape:.4f}%')

37/37 [==============================] - 14s 354ms/step

37/37 [==============================] - 14s 380ms/step
Test MSE: 50717.8789
Test MAE: 152.9201
Test MAPE: 30.9241%

In [53]:

Copied!





# Plot training & validation loss values
plt.plot(history.history['loss'], label='Training MSE')
plt.plot(history.history['val_loss'], label='Validation MSE')
plt.title('Model MSE')
plt.ylabel('MSE')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'], label='Training MSE')
plt.plot(history.history['val_loss'], label='Validation MSE')
plt.title('Model MSE')
plt.ylabel('MSE')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()

Test¶

Sample 1

In [54]:

Copied!





def predict_rent(description):
    # Preprocess the description
    cleaned = clean_text(description)
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized = ' '.join([lemmatizer.lemmatize(word) for word in cleaned.split()])
    seq = tokenizer.texts_to_sequences([lemmatized])
    padded = pad_sequences(seq, maxlen=max_length)
    
    # Predict
    predicted_rent = model.predict(padded)
    return predicted_rent[0][0]

# Example usage
description = """
GET LUCKY WITH ONE MONTH FREE ON 2-BEEDROOM APARTMENT HOMES!!
Limited time MOVE IN SPECIALS!

Looking for the pot of gold with your new 2bedroom 1bath home? STOP looking We HAVE IT for you!!

Our 2bedroom 1bedroom is 896sq ft with very spacious living room, beautiful hardwood flooring. There is a closet for extra storage in the hallway.

Apartment Amenities
*Beautiful Landscaped Property
*2 Sparkling Swimming Pools
*2 Fully Equipped Playgrounds Areas
*2 Laundry Care Facilities
*BBQ Grills & Picnic Areas
*Soccer Field
*Pet Park
*Wooded Setting with Mature Trees
*Convenient to Shopping, Dining & Schools
Contact us today to schedule a viewing! Online leasing is also available.
You will love our Community, Call us Today!

"""
predicted = predict_rent(description)
predicted
def predict_rent(description):
    # Preprocess the description
    cleaned = clean_text(description)
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized = ' '.join([lemmatizer.lemmatize(word) for word in cleaned.split()])
    seq = tokenizer.texts_to_sequences([lemmatized])
    padded = pad_sequences(seq, maxlen=max_length)
    
    # Predict
    predicted_rent = model.predict(padded)
    return predicted_rent[0][0]

# Example usage
description = """
GET LUCKY WITH ONE MONTH FREE ON 2-BEEDROOM APARTMENT HOMES!!
Limited time MOVE IN SPECIALS!

Looking for the pot of gold with your new 2bedroom 1bath home? STOP looking We HAVE IT for you!!

Our 2bedroom 1bedroom is 896sq ft with very spacious living room, beautiful hardwood flooring. There is a closet for extra storage in the hallway.

Apartment Amenities
*Beautiful Landscaped Property
*2 Sparkling Swimming Pools
*2 Fully Equipped Playgrounds Areas
*2 Laundry Care Facilities
*BBQ Grills & Picnic Areas
*Soccer Field
*Pet Park
*Wooded Setting with Mature Trees
*Convenient to Shopping, Dining & Schools
Contact us today to schedule a viewing! Online leasing is also available.
You will love our Community, Call us Today!

"""
predicted = predict_rent(description)
predicted

1/1 [==============================] - 0s 320ms/step

Out[54]:

1091.9424

Sample 2

In [56]:

Copied!





# Example usage
description = """
Look today MOVE tomorrow! Up to $450.00 off your total move in!
Get Started for FREE!

Stylish apartment homes with a convenient Lewisville location. Experience peaceful, cozy living complete with a refreshing swimming pool, and a community playground. Conveniently situated in Lewisville, you can spend a day shopping in Old Town Lewisville, or have fun in the sun on Lewisville Lake, then return home to appealing amenities, like wood-burning fireplaces, private patios, and hardwood style flooring, that Windsor Court provides!

Residents have easy access to I-35, and Hwy 121, Lewisville Lake, the MCL Grand Theatre, multiple golf clubs and so much more. Dive into the rich Texas culture with quick access to prime Texas barbecue and regional micro-breweries. Pair this fantastic location, with spacious floor plans and you’ve found your next home.

Barrow is an apartment community located in Denton County and the 75067 ZIP Code. This area is served by the Lewisville Independent attendance zone.

Unique Features
On-Site Management
Technology Package
Gated
Smart Thermostats & Door Locks
Washer/dryer Connections All Apartments*
Two Pools
Playground
Onsite Gym*
Grill Area

Call for our Move in Special!
"""
predicted = predict_rent(description)
predicted
# Example usage
description = """
Look today MOVE tomorrow! Up to $450.00 off your total move in!
Get Started for FREE!

Stylish apartment homes with a convenient Lewisville location. Experience peaceful, cozy living complete with a refreshing swimming pool, and a community playground. Conveniently situated in Lewisville, you can spend a day shopping in Old Town Lewisville, or have fun in the sun on Lewisville Lake, then return home to appealing amenities, like wood-burning fireplaces, private patios, and hardwood style flooring, that Windsor Court provides!

Residents have easy access to I-35, and Hwy 121, Lewisville Lake, the MCL Grand Theatre, multiple golf clubs and so much more. Dive into the rich Texas culture with quick access to prime Texas barbecue and regional micro-breweries. Pair this fantastic location, with spacious floor plans and you’ve found your next home.

Barrow is an apartment community located in Denton County and the 75067 ZIP Code. This area is served by the Lewisville Independent attendance zone.

Unique Features
On-Site Management
Technology Package
Gated
Smart Thermostats & Door Locks
Washer/dryer Connections All Apartments*
Two Pools
Playground
Onsite Gym*
Grill Area

Call for our Move in Special!
"""
predicted = predict_rent(description)
predicted

1/1 [==============================] - 0s 388ms/step

Out[56]:

1229.9175

Understand the model¶

Attention Weights¶

In the realm of natural language processing (NLP) and sequence modeling, attention weights play a pivotal role in understanding how models process input sequences and make predictions. At its core, attention mechanism mimics the human cognitive process of selectively focusing on relevant information while processing a task.
In the context of NLP, attention weights represent the degree of importance or relevance assigned to each word or token in a sequence when generating an output, such as a translation, sentiment score, or prediction.
- Higher attention weights indicate that the model assigns greater significance to the corresponding word, suggesting its crucial role in determining the output.

In [65]:

Copied!

from keras.models import Model
from keras.models import Model

In [80]:

Copied!





# Define a custom model
attention_model = Model(inputs=model.input,
                        outputs=[model.output, model.layers[1].output])

# Generate predictions and attention weights for a sample
predictions, attention_weights = attention_model.predict(X_test[:1])

# Get the mapping from word index to word
index_to_word = {index: word for word, index in tokenizer.word_index.items()}

# Plot attention weights with real words on y-axis
plt.figure(figsize=(9, 12))  # Set the size of the figure
plt.barh([index_to_word[i+1] for i in range(len(attention_weights[0]))], attention_weights[0])
plt.ylabel('Word')
plt.xlabel('Attention Weight')
plt.title('Attention Weights')
plt.show()
# Define a custom model
attention_model = Model(inputs=model.input,
                        outputs=[model.output, model.layers[1].output])

# Generate predictions and attention weights for a sample
predictions, attention_weights = attention_model.predict(X_test[:1])

# Get the mapping from word index to word
index_to_word = {index: word for word, index in tokenizer.word_index.items()}

# Plot attention weights with real words on y-axis
plt.figure(figsize=(9, 12))  # Set the size of the figure
plt.barh([index_to_word[i+1] for i in range(len(attention_weights[0]))], attention_weights[0])
plt.ylabel('Word')
plt.xlabel('Attention Weight')
plt.title('Attention Weights')
plt.show()

1/1 [==============================] - 1s 871ms/step