I am attempting to utilize StratifiedKFold in order to create train/test/val splits for a non-sklearn machine learning workflow. The goal is to split the DataFrame and maintain that division. My approach involves using .values as I am working with pandas ...
Hello, I am a Python newcomer and I'm currently following a tutorial. However, I encountered the following error: NameError: name 'tree' is not defined. The goal of my program is to classify fruits as either apples or oranges based on t ...
Currently digging into the Titanic dataset, I've been experimenting with applying an SVM to various individual features using the code snippet below: quanti_vars = ['Age','Pclass','Fare','Parch'] imp_med = Sim ...
I am using a library called UnbalancedDataset for oversampling. X_train_features.shape is (30962, 15637) and y_train.shape is (30962,). type(X_train_features) is scipy.sparse.csr.csr_matrix. An index ...
After running GridSearchCV with a Ridge model and PolynomialFeatures in the preprocessing pipeline, I have successfully trained the model. To access the coefficients of the best model, I used the following code: pipe = Pipeline( ...
I'm hoping I won't need an example set. In my 2D array, each sub-array contains words from sentences. To build a vocabulary of words, I am utilizing the CountVectorizer and applying fit_transform to the entire 2D array effectively. However, I have sente ...
Although I'm not a programmer by trade, I am faced with the task of determining a relationship between two variables in an equation. Despite extensively searching through Google, I haven't been able to understand how to input my data into sklearn linear_mo ...
My dataset contains a mix of categorical and non-categorical values. To handle this, I used OneHotEncoder for the categorical values and StandardScaler for the continuous values. transformerVectoriser = ColumnTransformer(transformers=[('Vector Cat', OneHot ...
Looking to reduce the training time of my models, I decided to utilize a high-end EC2 instance. I experimented with the c5.18xlarge instance that has 2 CPUs and ran several models with the parameter n_jobs=-1. However, I noticed that only one CPU was being ...
Currently, I am utilizing pandas.get_dummies to encode categorical features during the fitting and classification process. Recently, I observed that when using Imputer(), it is inserting averages in the "off" categorical switches that are added in datafram ...
I'm currently working on evaluating a geodesic distance matrix for the TOSCA dataset, specifically looking at a 3D mesh example like the one shown below: https://i.stack.imgur.com/CofTU.png During my analysis, I experimented with two different Pyth ...
After filling NaN values in a column with the KNN imputer from scikit-learn, I noticed that some NaNs were still present in the imputed column. What could be causing this? I have already compared the count of NaNs before and after the ...
I am using PyDev in Eclipse 4.2 on OS X Mountain Lion. I have installed the SciPy Superpack and can access all related packages, such as scikit-learn and Matplotlib, from the Python interpreter and IPython. However, when attemptin ...
https://i.stack.imgur.com/aVQeV.jpg I am facing an issue while trying to convert float data for use in a decision tree. Every time I apply the label encoder, I get an error stating that the argument must be a string or number. ...
Currently, I am experimenting with Random Forest Regression using criterion = mae (mean absolute error) instead of mse (mean squared error). This change is resulting in a significant impact on computation time. Specifically, the processing time has incre ...
I am currently exploring the connection between the sklearn's .fit() method and the .predict() method. I have not been able to find a comprehensive answer to this question on other online forums, although similar topics have been discussed (see here). Whi ...
Whenever I run a sample code, I keep running into this issue: "RuntimeError: A pipeline has not yet been optimized. Please call fit() first." The Challenge with TPOT Automated Machine Learning in Python. I am attempting to replicate the example: Dataset ...
I have been following the documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html to cross-validate a Logistic Regression classifier. The scores I obtained are: [ 0.78571429 0.64285714 0.85714286 ...
Using a custom model function with sklearn appears to yield incorrect distances. I have two vectors; I typically compute their cosine similarity and then use 1 - cosine_similarity as a distance metric. Here's ...
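The "1 - cosine_similarity" distance described in this excerpt can be sketched with the standard library alone. This is a minimal illustration, not the poster's code: the helper names `cosine_similarity` and `cosine_distance` below are hypothetical and are not the sklearn functions of similar names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The similarity is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance in [0, 2]: 0 for parallel vectors, 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)
```

Comparing a hand-computed value like this against the value a custom metric returns can help show whether a discrepancy comes from the metric itself or from how the vectors are passed in.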
To find the nearest neighbors for dataset A, which contains 25,000 rows, I am fitting a KNN model on dataset B, which has 13 million rows. The ultimate objective is to identify 25,000 rows within dataset B that closely res ...
In an attempt to cluster data based on object names, x_coordinate, y_coordinate, and corresponding temperature, I am experimenting with the mean square clustering algorithm. The goal is to group nearby objects according to location and temperature in order ...
I'm having some trouble understanding the function of each step in this particular pipeline. Could someone provide a detailed explanation of how this pipeline is functioning? I have a general idea, but more clarity would be greatly appreciated. Wha ...
Currently, I am exploring text classification with scikit-learn's TfidfVectorizer and the Nearest Neighbor algorithm. My challenge lies in determining similarity metrics between two datasets, each containing 18000 entries. I am grappling with decidin ...
When dealing with imbalanced data, I attempted to use a Random Forest Classifier with X representing the features and y representing the labels (with 90% of values as 0 and 10% as 1). Uncertain about how to handle stratification within Cross Validation, I ...
Currently, I am working on implementing a classification algorithm using a dataset related to medicinal research. My main focus is to achieve good recall in disease recognition. In order to do so, I had the idea of creating a scorer like the following: re ...
Although a similar question was raised [here], the solution provided did not work for me. In fact, another user commented on the answer, stating that it was incorrect. Despite this, the original poster (who also answered their own question) has not respond ...
Can someone explain why, after using sffs.k_feature_names_, I am only getting the positions of the best columns and not their actual names? https://i.stack.imgur.com/Rwf8K.png ...
Currently, I am working on analyzing a dataset named Adult and attempting to execute a KNN (K Nearest Neighbors) algorithm on certain columns in a new dataframe that I have created. A few of these columns have been normalized. However, during the process, ...
from sklearn.model_selection import cross_validate scores = cross_validate(LogisticRegression(class_weight='balanced',max_iter=100000), X,y, cv=5, scoring=('roc_auc', 'average_precision','f1','recall','balanced_accuracy')) scores['t ...
Within Sklearn, the n_jobs parameter is utilized in various functions to specify the number of cores to be used. This allows users to dictate the amount of processing power allocated for a specific task; for instance, inputting 1 uses one core while -1 s ...
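The mapping from an n_jobs value to a worker count can be sketched as follows. This is a standard-library illustration of the convention documented for scikit-learn and joblib (negative values count back from the number of CPUs, so -1 means all of them and -2 means all but one); `resolve_n_jobs` is a hypothetical helper name, not a library function.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs setting into a concrete worker count."""
    if n_cpus is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        n_cpus = os.cpu_count() or 1
    if n_jobs is None:
        # None means a single job unless a parallel backend context says otherwise.
        return 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Negative values count back from the CPU count: n_cpus + 1 + n_jobs.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs
```

For example, on a machine reporting 8 CPUs, `resolve_n_jobs(-1, n_cpus=8)` gives 8 and `resolve_n_jobs(-2, n_cpus=8)` gives 7.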
Currently, I am in the process of developing a predictive model using linear regression on a dataset containing 157673 records. The data is stored in a CSV file and follows this format: Timestamp,Signal_1,Signal_2,Signal_3,Signal_4,Signal_5 2021-04-13 ...
As a Python beginner, I have a question regarding the behavior of this particular code block. train_features, test_features, train_labels, test_labels = train_test_split( df.drop(labels=[21], axis=1), df[21], test_size=0.2, random_state=37 I am looking t ...
The task involves encoding all the text and categorical features, then combining them to create a data matrix. However, I get an error due to incompatible row dimensions. Progress so far: Encoding categorical feature using Label Encoder from sklearn.p ...
I am currently working with a boosted trees model and have both probabilities and classification for a test data set. My goal is to plot the roc_curve for this data, but I am struggling to determine how to define thresholds/alpha for the roc curve in sciki ...
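On the question of thresholds: one common approach is to let the distinct predicted scores themselves serve as the thresholds, so none has to be chosen by hand (sklearn's roc_curve likewise returns a thresholds array alongside the false and true positive rates). The sketch below illustrates that idea with the standard library only; `roc_points` is a hypothetical helper and the sample labels and scores are made up for illustration.

```python
def roc_points(y_true, y_score):
    """Return (threshold, fpr, tpr) for each distinct score, highest first."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute a ROC curve")
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


# Made-up labels and predicted probabilities, for illustration only.
example = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Plotting the second element of each tuple against the third traces the curve; the first element shows which score cutoff produced each point.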
I am encountering an error while chaining the estimators and attempting to view them. I am new to Python, and this is my first time experimenting with the pipeline function. from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression ...
Is it possible to achieve a specific dimension when using the MultiLabelBinarizer in sklearn? For instance, given the following code: from sklearn.preprocessing import MultiLabelBinarizer y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]] MultiL ...
As a novice in machine learning, I am tuning the hyperparameters of an SVM with the goal of achieving at least 95% accuracy on the MNIST digit dataset. The csv file being utilized by the code contains 784 attribut ...
import pandas as pd from sklearn.tree import DecisionTreeClassifier Health_data = pd.read_csv("Health_dataset.csv") X = Health_data.drop(columns='Conditions') y = Health_data['Conditions'] model = DecisionTreeClassifier() model.fit(X, y) Enc ...
I am working with a unique document. It's not your typical text. It's full of scientific terminologies. The content of this document looks like this: RepID,Txt 1,K9G3P9 4H477 -Q207KL41 98464 ... Q207KL41 2,D84T8X4 -D9W4S2 -D9W4S2 8E8E65 ... D9W4S2 3,-05L8 ...
I am currently exploring ways to predict company sales using LSTM. However, I have come across examples that only utilize two variables - time and sales. I believe that this may not be sufficient for accurate forecasting. Upon further research, I discovere ...
Currently, I am exploring the use of sklearnex/scikit-learn-intelex for GPU acceleration. The code snippet below is what I have implemented based on the instructions provided in 'Patching several algorithms': try: from sklearnex import patch_sklearn ...
I have been using scikit-image to successfully classify road features. Take a look at the results here: https://i.stack.imgur.com/zRJNh.jpg. However, I am facing challenges in the next step of classifying these features. Specifically, I need to classify fe ...
Trying to create a complex pipeline using custom classes resulted in an error: TypeError: fit_transform() takes 2 positional arguments but 3 were given Despite attempting solutions found for similar issues, the error persisted. class NewLabelBinarizer(L ...
Encountering some unusual errors while using the LassoCV() regressor with a grouped cross-validation object. To be more specific, when working with dataframe df and target column y, I want to conduct LeaveOneGroupOut() cross-validation. When I try the fol ...
Currently, I am in the process of training a binary classifier using python and the well-known scikit-learn module's SVM class. Upon completing the training phase, I utilize the predict method to classify data based on the guidelines outlined in sci-kit's ...
I have been implementing the PassiveAggressiveRegressor incremental classifier in my project. After every use of the partial_fit function, I make sure to save the model into a pickle file. from sklearn import linear_model import numpy as np import time X ...
Recently, I delved into exploring NLP concepts and decided to enhance my knowledge by following Python tutorials. One tutorial caught my attention when they utilized the sparse matrix of word counts, generated with CountVectorizer, as input for TfidfTransf ...
I have been utilizing the PCA class from sklearn.decomposition to reduce the dimensionality of my feature space for visualization purposes. I have a question about the outcome: Upon implementing the fit and transform functions of the PCA class, I receive ...