
# ml_preprocessing #

Data preprocessing algorithms

## What is data preprocessing? #

Data preprocessing is a set of techniques for data preparation before one can use the data in Machine Learning algorithms.

## Why is it needed? #

Let's say you have a dataset:

``````
----------------------------------------------------------------------------------------
| Gender | Country | Height (cm) | Weight (kg) | Diabetes (1 - Positive, 0 - Negative) |
----------------------------------------------------------------------------------------
| Female | France  |     165     |     55      |                    1                  |
----------------------------------------------------------------------------------------
| Female | Spain   |     155     |     50      |                    0                  |
----------------------------------------------------------------------------------------
| Male   | Spain   |     175     |     75      |                    0                  |
----------------------------------------------------------------------------------------
| Male   | Russia  |     173     |     77      |                   N/A                 |
----------------------------------------------------------------------------------------
``````

Everything seems fine so far. Say you're about to train a classifier to predict whether a person has diabetes. But there is an obstacle: how can the string-valued columns (`Gender`, `Country`) be used in mathematical equations? Things get even worse because of the empty (N/A) value in the `Diabetes` column. There has to be a way to convert this data to a valid numerical representation, and this is where data preprocessing techniques come into play. You need to decide how to convert string data (also known as categorical data) to numbers and how to treat empty values. You could come up with your own algorithms for all of these operations, but there are already a number of well-known, well-performing techniques for these conversions.
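To make the two decisions concrete, here is a minimal hand-rolled sketch in plain Dart. It does not use this library's API: the helpers `distinctValues` and `oneHot` are hypothetical names introduced only for illustration, and dropping the row with the missing label is just one possible policy, assumed here for simplicity.

```dart
/// Returns the distinct values of [column] in order of first appearance.
List<String> distinctValues(List<String> column) {
  final seen = <String>[];
  for (final value in column) {
    if (!seen.contains(value)) seen.add(value);
  }
  return seen;
}

/// One-hot encodes [value]: 1.0 at the position of the matching category,
/// 0.0 everywhere else.
List<double> oneHot(String value, List<String> categories) =>
    categories.map((c) => c == value ? 1.0 : 0.0).toList();

void main() {
  // The dataset from the table above; `null` stands for the N/A label.
  final rows = [
    ['Female', 'France', 165, 55, 1],
    ['Female', 'Spain', 155, 50, 0],
    ['Male', 'Spain', 175, 75, 0],
    ['Male', 'Russia', 173, 77, null],
  ];

  // Assumed policy for empty values: drop rows whose label is missing.
  final labelled = rows.where((row) => row[4] != null).toList();

  // Collect the categories of each string-valued column.
  final genders =
      distinctValues(labelled.map((row) => row[0] as String).toList());
  final countries =
      distinctValues(labelled.map((row) => row[1] as String).toList());

  for (final row in labelled) {
    final features = <double>[];
    features.addAll(oneHot(row[0] as String, genders));
    features.addAll(oneHot(row[1] as String, countries));
    features.add((row[2] as int).toDouble());
    features.add((row[3] as int).toDouble());
    print('$features -> ${row[4]}');
  }
}
```

Each remaining row becomes a purely numerical feature vector followed by its label, which is the kind of representation a classifier can consume.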

In this library, all the data preprocessing operations are narrowed to just one entity - `DataFrame`.

## DataFrame #

`DataFrame` is a factory that creates instances of different adapters for data. For example, one can create a csv reader that makes working with csv data easier: you only need to point to where the dataset resides, and then you get the features and labels in a convenient, data-science-friendly format. One can also specify how to treat categorical data.

## A simple usage example #

Let's download some data from Kaggle, for instance the Black Friday dataset. It's an interesting dataset with a large number of observations (approx. 538,000 rows) and a good number of categorical features.

First, import all necessary libraries:

``````
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';
``````

Then, we should read the csv and create a data frame:

``````
final dataFrame = DataFrame.fromCsv('example/black_friday/black_friday.csv',
  labelName: 'Purchase\r',
  columns: [ZRange.closed(2, 3), ZRange.closed(5, 7), ZRange.closed(11, 11)],
  rows: [ZRange.closed(0, 20)],
  categories: {
    'Gender': CategoricalDataEncoderType.oneHot,
    'Age': CategoricalDataEncoderType.oneHot,
    'City_Category': CategoricalDataEncoderType.oneHot,
    'Stay_In_Current_City_Years': CategoricalDataEncoderType.oneHot,
    'Marital_Status': CategoricalDataEncoderType.oneHot,
  },
);
``````

The input parameters need some explanation:

• labelName - the name of the column that contains the dependent variable
• columns - a set of intervals that specify which columns to read
• rows - the same as columns, but specifies which rows to read
• categories - the columns that contain categorical data, and the encoders these columns should be processed with. In this particular case, we want to encode all the categorical columns with the one-hot encoder
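As a rough illustration of how a set of intervals picks out columns, the sketch below expands inclusive ranges into a sorted list of indices. It assumes that a closed range includes both of its endpoints, as the name suggests. `ClosedRange` and `expandRanges` are hypothetical stand-ins written for this example; they are not part of `xrange` or of this library, and the library's own range handling may differ.

```dart
/// Hypothetical stand-in for an inclusive integer interval.
class ClosedRange {
  const ClosedRange(this.first, this.last);

  final int first;
  final int last;
}

/// Expands [ranges] into a sorted list of unique indices.
List<int> expandRanges(List<ClosedRange> ranges) {
  final indices = <int>{};
  for (final range in ranges) {
    // Both endpoints are included, because the ranges are closed.
    for (var i = range.first; i <= range.last; i++) {
      indices.add(i);
    }
  }
  return indices.toList()..sort();
}

void main() {
  // The same intervals as in the `columns` parameter above.
  final columns = expandRanges(const [
    ClosedRange(2, 3),
    ClosedRange(5, 7),
    ClosedRange(11, 11),
  ]);
  print(columns); // [2, 3, 5, 6, 7, 11]
}
```

Under that assumption, the three intervals passed as `columns` select six columns in total.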

It's time to take a look at our processed data! Let's read it:

``````
final features = await dataFrame.features;
final labels = await dataFrame.labels;

print(features);
print(labels);
``````

In the output we will see only numerical data, which is exactly what we wanted to achieve.

# Changelog #

## 3.2.0 #

• `ml_linalg` 9.0.0 supported

## 3.1.0 #

• `Categorical data processing`: `encoders` parameter added to `DataFrame.fromCsv` constructor

## 3.0.0 #

• `xrange` library supported: it's possible to provide `ZRange` object now instead of `tuple2` to specify a range of indices

## 2.0.0 #

• `DataFrame` introduced

## 1.1.0 #

• `Float32x4InterceptPreprocessor` added
• `readme` updated

## 1.0.0 #

• Package published

example/main.dart

``````
import 'dart:async';

import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';

Future main() async {
  // Let's create a data frame from a csv file.
  //
  // `labelIdx: 3` means that the label (the dependent variable, in Machine
  // Learning terms) is the dataset column with index 3
  //
  // `categories: {...}` means that we want to encode the values of the
  // `position` column with the one-hot encoder and the values of the
  // `country` column with the ordinal encoder
  //
  // `rows: [ZRange.closed(0, 6)]` means that we want to read the csv's rows
  // with indices from 0 to 6
  //
  // `columns: [ZRange.closed(0, 3)]` means that we want to read the csv's
  // columns with indices from 0 to 3
  final data = DataFrame.fromCsv('example/dataset.csv', labelIdx: 3,
    categories: {
      'position': CategoricalDataEncoderType.oneHot,
      'country': CategoricalDataEncoderType.ordinal,
    },
    rows: [ZRange.closed(0, 6)],
    columns: [ZRange.closed(0, 3)],
  );

  // Let's read the preprocessed features and labels
  final features = await data.features;
  final labels = await data.labels;

  // And print the result
  print(features);
  print(labels);

  // That's all you have to do to use the data further in different
  // applications
}
``````
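The example above mixes one-hot and ordinal encoding. For contrast with one-hot encoding, here is a minimal sketch of ordinal encoding in plain Dart: every distinct category is replaced by a single integer. The helper `fitOrdinal` is a hypothetical name, and assigning codes in order of first appearance is an assumption made for this sketch; the codes the library's ordinal encoder actually assigns may differ.

```dart
/// Builds a category -> integer code mapping, assigning codes in order of
/// first appearance (an assumption made for this sketch).
Map<String, int> fitOrdinal(Iterable<String> values) {
  final codes = <String, int>{};
  for (final value in values) {
    // A new category gets the next free code: the current size of the map.
    codes.putIfAbsent(value, () => codes.length);
  }
  return codes;
}

void main() {
  final countries = ['France', 'Spain', 'Spain', 'Russia'];
  final codes = fitOrdinal(countries);

  // Replace every category with its integer code.
  final encoded = countries.map((country) => codes[country]).toList();
  print(encoded); // [0, 1, 1, 2]
}
```

Ordinal encoding keeps one column per feature, while one-hot encoding produces one column per category; which one fits better depends on whether an order among the categories is meaningful for the model.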

## Use this package as a library

### 1. Depend on it

``````
dependencies:
  ml_preprocessing: ^3.2.0

``````

### 2. Install it

You can install packages from the command line:

with pub:

``````
$ pub get

``````

with Flutter:

``````
$ flutter packages get

``````

Alternatively, your editor might support `pub get` or `flutter packages get`. Check the docs for your editor to learn more.

### 3. Import it

Now in your Dart code, you can use:

``````
import 'package:ml_preprocessing/ml_preprocessing.dart';
``````

| Version | Released |
|---|---|
| 3.2.0 | Apr 16, 2019 |
| 3.1.0 | Apr 5, 2019 |
| 3.0.0 | Apr 1, 2019 |
| 2.0.0 | Mar 25, 2019 |
| 1.1.0 | Jan 25, 2019 |
| 1.0.0 | Jan 25, 2019 |

• Popularity (how popular the package is relative to other packages): 30
• Health (code health derived from static analysis): 100
• Maintenance (how tidy and up-to-date the package is): 100
• Overall (weighted score of the above): 65

We analyzed this package on Apr 16, 2019, and provided a score, details, and suggestions below. The analysis completed using:

• Dart: 2.2.0
• pana: 0.12.14

#### Platforms

Detected platforms: Flutter, other

Primary library: `package:ml_preprocessing/ml_preprocessing.dart` with components: `io`.

#### Health suggestions

Fix `lib/src/data_frame/data_frame.dart`. (-0.50 points)

Analysis of `lib/src/data_frame/data_frame.dart` reported 1 hint:

line 10 col 3: Prefer using /// for doc comments.

Format `lib/src/categorical_encoder/encoder_factory_impl.dart`.

Run `dartfmt` to format `lib/src/categorical_encoder/encoder_factory_impl.dart`.

Format `lib/src/categorical_encoder/encoder_mixin.dart`.

Run `dartfmt` to format `lib/src/categorical_encoder/encoder_mixin.dart`.

Fix additional 14 files with analysis or formatting issues.

Additional issues in the following files (run `dartfmt` to format each one):

• `lib/src/categorical_encoder/one_hot_encoder.dart`
• `lib/src/categorical_encoder/ordinal_encoder.dart`
• `lib/src/data_frame/csv_data_frame.dart`
• `lib/src/data_frame/encoders_processor/encoders_processor.dart`
• `lib/src/data_frame/encoders_processor/encoders_processor_factory.dart`
• `lib/src/data_frame/encoders_processor/encoders_processor_factory_impl.dart`
• `lib/src/data_frame/encoders_processor/encoders_processor_impl.dart`
• `lib/src/data_frame/header_extractor/header_extractor_factory_impl.dart`
• `lib/src/data_frame/header_extractor/header_extractor_impl.dart`
• `lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_factory_impl.dart`
• `lib/src/data_frame/index_ranges_combiner/index_ranges_combiner_impl.dart`
• `lib/src/data_frame/validator/error_messages.dart`
• `lib/src/data_frame/validator/params_validator_impl.dart`
• `lib/src/data_frame/variables_extractor/variables_extractor_impl.dart`

#### Dependencies

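| Package | Constraint | Resolved | Available |
|---|---|---|---|
| **Direct dependencies** | | | |
| Dart SDK | >=2.2.0 <3.0.0 | | |
| csv | ^4.0.0 | 4.0.3 | |
| ml_linalg | ^9.0.0 | 9.0.0 | |
| tuple | ^1.0.2 | 1.0.2 | |
| xrange | ^0.0.4 | 0.0.5 | |
| **Transitive dependencies** | | | |
| matcher | | 0.12.5 | |
| meta | | 1.1.7 | |
| path | | 1.6.2 | |
| quiver | | 2.0.3 | |
| stack_trace | | 1.9.3 | |
| **Dev dependencies** | | | |
| benchmark_harness | >=1.0.0 <2.0.0 | | |
| build_runner | ^1.1.2 | | |
| build_test | ^0.10.2 | | |
| mockito | ^3.0.0 | | |
| test | ^1.2.0 | | |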