Data Science In Your Hands

viernes, 8 de septiembre de 2017

Quantitative finance

Quantitative finance is about applying mathematics and statistics to finance. Quantitative finance helps you to price contracts such as options, manage the risk of investment portfolios and improve trade management. You can use it to work out the price of financial contracts. You can use it to manage the risk of trading and investing in these contracts. It helps you develop the skill to protect yourself against the turbulence of financial markets.

Quantitative finance is used by many professionals working in the financial industry. Investment banks use it to price and trade options and swaps. Their customers, such as the officers of retail banks and insurance companies, use it to manage their portfolios of these instruments. Brokers using electronic-trading algorithms use quantitative finance to develop their algorithms. Investment managers use ideas from modern portfolio theory to try to boost the returns of their portfolios and reduce the risks. Hedge fund managers use quantitative finance to develop new trading strategies but also to structure new products for their clients.

Quantitative finance is for banks, businesses and investors who want better control over their finances despite the random movement of the assets they trade or manage. It involves understanding the statistics of asset price movements and working out what the consequences of these fluctuations are.

Who needs quantitative finance? The answer includes banks, hedge funds, insurance companies, property investors and investment managers. Any organisation that uses financial derivatives, such as options, or manages portfolios of equities or bonds uses quantitative finance.

I tried to follow the advice of the physicist Albert Einstein that ‘Everything should be made as simple as possible, but not simpler.’

Speculators are sometimes criticised for destabilising markets, but more likely they do the opposite. To be consistently profitable, a speculator has to buy when prices are low and sell when prices are high. This practice tends to increase prices when they’re low and reduce them when they’re high. So speculation should stabilise prices (not everyone agrees with this reasoning, though).

Speculators also provide liquidity to markets. Liquidity is the extent to which a financial asset can be bought or sold without the price being affected significantly. Because speculators are prepared to buy (or sell) when others are selling (or buying), they increase market liquidity.

In contrast to speculators, hedgers like to play safe. They use financial instruments such as options and futures to protect a financial or physical investment against an adverse movement in price. A hedger protects against price rises if she intends to buy a commodity in the future and protects against price falls if she intends to sell in the future. A natural hedger is, for example, a utility company that knows it will want to purchase natural gas throughout the winter so as to generate electricity. Utility companies typically have a high level of debt (power stations are expensive!) and fixed output prices because of regulation, so they often manage their risk using option and futures contracts.

Random Behaviour of Assets

Random Walk

The random walk, a path made up from a sequence of random steps, is an idea that comes up time and again in quantitative finance. In fact, the random walk is probably the most important idea in quantitative finance.

Figure 1-1 shows the imagined path of a bug walking over a piece of paper and choosing a direction completely at random at each step. The bug doesn’t get far even after taking 20 steps.

In finance, you’re interested in the steps taken by the stock market or any other financial market. You can simulate the track taken by the stock market just like the simulated track taken by a bug. Doing so is a fun metaphor but a serious one, too. Even if this activity doesn’t tell you where the price ends up, it tells you a range within which you can expect to find the price, which can prove to be useful.

Random walks come in different forms. In Figure 1-1, the steps are all the same length. In finance, though random walks are often used with very small step sizes, in which case you get a Brownian motion. In a slightly more complex form of Brownian motion, you get the geometric Brownian motion, or GBM, which is the most common model for the motion of stock markets.

The orthodox view is that financial markets are efficient, meaning that prices reflect known information and follow a random walk pattern. It’s therefore impossible to beat the market and not worth paying anyone to manage an investment portfolio. This is the efficient market hypothesis, or EMH for short. This view is quite widely accepted and is the reason for the success of tracker funds, investments that seek to follow or track a stock index such as the Dow Jones Industrial Average. Because tracking an index takes little skill, investment managers can offer a diversified portfolio at low cost.

Anomalies are systematically found in historical stock prices that violate even weak efficiency. For example, you find momentum in most stock prices: If the Price has risen in the past few months, it will tend to rise further in the next few months. Likewise, if the price has fallen in the past few months, it will tend to continue falling in the next few months. This anomaly is quite persistent and is the basis for the trend following strategy of many hedge funds.

Indeed, if markets were informationally efficient, there would be no incentive to seek out information. The cost wouldn’t justify it. On the other hand, if everyone else is uninformed, it would be rewarding to become informed as you can trade successfully with those who know less than you.

The point that in an efficient market there’s no incentive to seek out information and so therefore no mechanism for it to become efficient is the Grossman-Stiglitz paradox, named after the American economists Sanford Grossman and Joseph Stiglitz. The implication is that markets will be efficient but certainly not perfectly efficient.

lunes, 10 de abril de 2017

Preparing to model the data: Overfitting and Underfitting

Preparing to model the data: Overfitting and Underfitting

Usually, the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting on the training set.

There is an eternal tension in model building between model complexity (resulting in high accuracy on the training set) and generalizability to the test and validation sets. Increasing the complexity of the model in order to increase the accuracy on the training set eventually and inevitably leads to a degradation in the generalizability of the provisional model to the test set.

The provisional model begins to grow in complexity from the null model (with little or no complexity), the error rates on both the training set and the test set fall. As the model complexity increases, the error rate on the training set continues to fall in a monotone manner. However, as the model complexity increases, the test set error rate soon begins to flatten out and increase because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data.

The point where the minimal error rate on the test set is encountered is the optimal level of model complexity. Complexity greater than this is considered to be overfitting; complexity less than this is considered to be underfitting.

Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset. If we train for too long, the performance on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time the error for the test set starts to rise again as the model’s ability to generalize decreases.

Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data and it describes random error or noise instead of the underlying relationship.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would have poor predictive performance.

Overfitting in Regression Analysis

In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading.

Overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for all of the terms in the equation.

Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.

You must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.

A Good Fit in Machine Learning

To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

I have not encountered underfitting very often. Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.

How To Limit

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:

Use a resampling technique to estimate model accuracy.
Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k-times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data.

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.

viernes, 7 de abril de 2017

Genetic Algorithm with Java and Python

Genetic Algorithms (GAs) attempt to computationally mimic the processes by which natural selection operates, and apply them to solve business and research problems. Developed by John Holland in the 1960s abd 1970s, GAs provide a framework for studying the effects of such biologically inspired factors as mate selection, reproduction, mutation and crossover of genetic information. In the natural world, the constraints and stresses of a particular environment force the different species (and different individuals within species) to compete to produce the fittest offspring. In the world of GAs, the fittest potential solutions evolve to produce even more optimal solutions.

Not surprisingly, the field of GAs has borrowed heavily from genomic terminology. Each cell in our body contains the same set of chromosomes, string of DNA that function as a blueprint for making one of us. Then, each chromosome can be partitioned into genes, which are block of DNA designed to encode a particular trait such as en eye color. Mutation, the altering of a single gene in a chromosome of the offspring. The offspring fitness is then evaluated, either in terms of viability (living long enough to reproduce).

Now, in the field of GAs, a chromosome refers to one of the candidate solutions to the problem, a gene is a single bit or digit of the candidate solution. Most GAs function by iteratively updating a collection of potential solutions, called a population. Each member of the population is evaluated for fitness on each cycle. A new population then replaces the old population with the fittest members.

The fittest function f(x) is a real-valued function operating on the chromosome (potential solution), not the gene, so that the x in f(x) refers to the numeric value taken by the chromosome at the time of fitness evaluation.

Goal oriented problem solving

Genetic algorithms are one of the tools we can use to apply machine learning to finding good, sometimes even optimal, solutions to problems that have billions of potential solutions. They use biological processes in software to find answers to problems that have really large search spaces by continuously generating candidate solutions, evaluating how well the solutions fit the desire outcome, and refining the best solutions.

When solving a problem with a genetic algorithm, instead of asking for a specific solution, you provide characteristics that the solution must have or rules its solution must pass to be accepted. The more constraints you add the more potential solutions are blocked. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection.

Imagine you are given 10 chances to guess a number between 1 and 1000 and the only feedback you get is whether your guess is right or wrong. Could you reliably guess the number? With only right or wrong as feedback, you have no way to improve your guesses so you have at best a 1 in 100 chance of guessing the number. A fundamental aspect of solving problems using genetic algorithms is that they must provide feedback that helps the engine select the better of two guesses. That feedback is called the fitness, for how closely guess fits the desired result. More importantly it implies a general progression.

You might be able to do better than random guessing if you have problem-specific knowledge that helps you eliminate certain number combinations. Using problem-specific knowledge to guide the genetic algorithm's creation and modification of potential solutions can help them find a solution orders of magnitude faster.

Genetic algorithms and genetic programming are very good at finding solutions to very large problems. They do it by taking millions of samples from the search space, making small changes, possibly recombining parts of the best solutions, comparing the resultant fitness against that of the current best solution, and keeping the better of the two. This process repeats until a stop condition like one of the following occurs: the known solution is found, a solution meeting all requirements is found, a certain number of generations has passed, a specific amount of time has passed, etc.

Guess the password

Imagine you are asked to guess a password; what kind of feedback would you want? These decisions are some of the ones you have to make when planning to implement a genetic algorithm to find a solution to your problem. Genetic algorithms are good at finding good solutions to problems with large search spaces because they can quickly find the parts of the guesses that improve fitness values or lead to better solutions.

This is an intuitive sample that you understand and can fall back upon when learning to use other machine learning tools and techniques, or applying genetic algorithms in your own field of expertise.

Genetic Algorithms use random exploration of the problem space combined with evolutionary processes like mutation and crossover (exchange of genetic information) to improve guesses. But also, because they have no experience in the problem domain, they try things a human would never think to try. Thus, a person using a genetic algorithm may learn more about the problem space and potential solutions. This gives them the ability to make improvements to the algorithm, in a virtous cycle.

The genetic algorithm should make informed guesses.

Genes

To begin with, genetic algorithm needs a gene set to use for building guesses. For this sample will be a generic set of letters. It also needs a target password to guess:

Java:

/**

Gene Set, we set a generic set of letters and target/desired result

private static final String geneSet = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,";

private static final String target = "This is a test for genetics aldorithms";

Python:

geneset = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,"

Generate a guess

Next the algorithm needs a way to generate a random string from the gene set.

Java:

private static final Random random = new Random();

/**

* We need a way to generate random string (a guest) from the geneset.

* @param length the number of random chars to be generated from a geneSet.

private static String generate_parent(int length) {

StringBuilder genes = new StringBuilder();

for (int i = 0; i < length; i++) {

genes.append(geneSet.charAt(random.nextInt(geneSet.length())));

}

return genes.toString();

}

Python:

import random

def _generate_parent(length, geneSet, get_fitness):

genes = []

while len(genes) < length:

sampleSize = min(length - len(genes), len(geneSet))

genes.extend(random.sample(geneSet, sampleSize))

return ''.join(genes)

Fitness

The fitness value the genetic algorithm provides is the only feedback the engine gets to guide it toward a solution. In this simple the fitness value is the total number of letters in the guess that can match the letter in the same position of the password.

Java:

/**

* We need to define a fitness

* The fitness is the only feedback the engine get t o guide it towards a solution.

* Fitness is the total number in the guess that match the letter in the same position of the password.

private static int get_fitness(String guess){

char[] guess_array = guess.toCharArray();

char[] target_array = target.toCharArray();

int contador = 0;

for(int i = 0; i < target_array.length; i++) {

if (guess_array[i] == target_array[i]) {

contador++;

}

return contador;

}

Python:

def get_fitness(guess, target):

return sum(1 for expected, actual in zip(target, guess)

if expected == actual)

Mutate

Next , the engine needs a way to produce a new guess by mutating the current one. The following implementation converts the parent string to an array, then replaces 1 letter in the array with a randomly selected one from geneSet, and fnally recombines the result into a new string.

Java:

/**

* We need a new way to produce a new guess mutating the current one. Covnerts the parent string to an array

* and replaces 1 letter in the array with a randomly selected from the geneSet.

* @return

private static String mutate(String parent){

char[] guess_array = parent.toCharArray();

char[] target_array = target.toCharArray();

for(int i = 0; i < target_array.length; i++) {

if (guess_array[i] != target_array[i]) {

char gene = geneSet.charAt(random.nextInt(geneSet.length()));

//is the new gene the same it was at that position?

//replace the selected gene if it is the same as the one it is supposed to replace , which can prevent a significant number of wasted guesses.

while (gene != guess_array[i]){

guess_array[i] = gene;

}

return String.valueOf(guess_array);

}

Python:

def _mutate(parent, geneSet, get_fitness):

index = random.randrange(0, len(parent.Genes))

childGenes = list(parent.Genes)

newGene, alternate = random.sample(geneSet, 2)

childGenes[index] = alternate \

if newGene == childGenes[index] \

else newGene

return ''.join(childGenes)

Display

Next, it is imporant to monitor what is happening so that the engine can be stopped if it gets stuck. Having a visual representation of the gene sequence, which may not be the literal gene sequence, is often critical to identifying what works and what does not so that the algorithm can be improved.

Normally the display function also outputs the fitness value and how much time has elapsed.

Java:

/**

* It is important to monitor what is happening so that the engine can be stopped if * gets stucked. Having a visual representatiion on the gene sequence , which may not be * the literal sequence, is often critical to identifying what works and what does not so * that the algorithms can be improved.

* Display function also show output the fitness value and how much time has elapsed.

private static void display(String guess, Date startTime) {

Date time = new Date();

System.out.println(String.format("Guess: " + guess+ ", Fitness: " + get_fitness(guess) + ", Time: " + (time.getTime() - startTime.getTime()) + " milisegundos" ));

}

Python:

def display(candidate, startTime):

timeDiff = datetime.datetime.now() - startTime

print("{0}\t{1}\t{2}".format(

candidate.Genes, candidate.Fitness, str(timeDiff)))

Main

The main program begins by initializing bestParent to a random sequence of letters and calling the display function.

The final piece is the heart of the genetic engine. It is a loop that:

· generate a guess
· requests the fitness for that guess, then
· compares the fitness so that of the previous best guess, and
· keeps the guess with the better fitness.

Java:

public static void main(String[] args) {

GuessTheNumber guessTheNumber = new GuessTheNumber();

//Generamos

random.setSeed(12345);

Date startTime = new Date();

String best_parent = guessTheNumber.generate_parent(target.length());

int best_fitness = guessTheNumber.get_fitness(best_parent);

guessTheNumber.display(best_parent, startTime);

while (true) {

String child = guessTheNumber.mutate(best_parent);

int child_fitness = guessTheNumber.get_fitness(child);

if (best_fitness >= child_fitness) {

display(child, startTime);

}

if (child_fitness >= best_parent.length()){

display(child, startTime);

break;

}

best_fitness = child_fitness;

best_parent = child;

}

This cycle repeats until a stop condition occurs, in this case when all the letters in the guess match thise in the target.

Run the code and you'll see the output simlir to the following. Success!

Guess: spytwdwrmCCvnDBxIoBMGoVJlCiUMEaKmWEA w, Fitness: 0, Time: 0 milisegundos

Guess: IhX is a GlKt fnV txnewiesvaNdorithms, Fitness: 24, Time: 5 milisegundos

Guess: Vhi is a cAZt fu gTnesiXsrasdorithms, Fitness: 26, Time: 5milisegundos

Guess: PhiI is a QKGt fhj gVneyi,sfafdorithms, Fitness: 26, Time: 5milisegundos

Guess: ghi. is a fjGt fD. gOneYissVaNdorithms, Fitness: 26, Time: 5 milisegundos

Guess: LhiV is a ZWEt fgK gUnexinsHaOdorithms, Fitness: 26, Time: 5 milisegundos

Guess: Thiv is a .yst fo! gtneIicspaldorithms, Fitness: 31, Time: 10 milisegundos

Guess: ThiR is a Dyst foa geneeicsDaldorithms, Fitness: 32, Time: 10 milisegundos

Guess: Thiu is a yJst fow geneeicsTaldorithms, Fitness: 32, Time: 10 milisegundos

Guess: ThiA is a oCst foY genexicsJaldorithms, Fitness: 32, Time: 10 milisegundos

Guess: Thil is a xEst fod geneWicsYaldorithms, Fitness: 32, Time: 10 milisegundos

Guess: This is a Zest for genetics aldorithms, Fitness: 37, Time: 15 milisegundos

Guess: This is a Vest for genetics aldorithms, Fitness: 37, Time: 15 milisegundos

Guess: This is a test for genetics aldorithms, Fitness: 38, Time: 15 milisegundos

Guess: This is a test for genetics algorithms, Fitness: 38, Time: 15 milisegundos

Chromosome Object

Next, we must introduce a Chromosome object that has Genes and Fitness attributes. This will make the genetic engine more flexible by making it possible to pass those values around as a unit.

public class Chromosome {

private int fitness;

private String genes;

public Chromosome(int fitness, String genes) {

this.fitness = fitness;

this.genes = genes;

}

public int getFitness() {

return fitness;

}

public void setFitness(int fitness) {

this.fitness = fitness;

}

public String getGenes() {

return genes;

}

public void setGenes(String genes) {

this.genes = genes;

}

We have a working engine, but is currently a tightly coupled class, so the next task is up to you,

- Extract the genetic engine code from that specific to guessing the password so it can be reused for other projects.
- Create a package called genetic and include the Chromosome class there.
- Try to use unit tests.
- Try the genetic engine with a longer password.
- Add support for benchmarking to genetic because it is useful to know how long engine takes to find a solution on average and the standard deviation.

Write me to email miguelurbin@gmail.com in order to send you the project in Python and Java to make your own tests or improvements.