- Normalise input data to the 0-1 range so that sigmoid activations behave sensibly and large values don't saturate the hidden layer
- Sigmoid/tanh/ReLU for the hidden layer to introduce non-linearity
- Linear activation for the output layer so outputs aren't restricted to [0, 1] or [-1, 1]
- Feature engineering to prepare the data
- Sum of squared errors as the cost function may not work well; cross-entropy is usually preferred for classification [see this] (unless that caveat only applies to logistic regression?)
- Use a masking layer for variable input lengths?
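
The 0-1 normalisation in the first point can be sketched with NumPy; `minmax_normalise` is a hypothetical helper name, and the sample matrix is made up for illustration:

```python
import numpy as np

def minmax_normalise(X):
    """Scale each feature (column) of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid divide-by-zero on constant features
    return (X - lo) / span

# toy data: one small-scale feature, one large-scale feature
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm = minmax_normalise(X)
```

Note the per-column min/max: each feature is scaled independently, so the large-scale second column no longer dominates.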
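
A minimal forward pass illustrating the second and third points (non-linear hidden layer, linear output); the layer sizes and random weights here are arbitrary, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output
W1 = rng.normal(size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1))
b2 = np.zeros(1)

def forward(X):
    h = sigmoid(X @ W1 + b1)  # hidden layer squashed to (0, 1) -> non-linearity
    return h @ W2 + b2        # linear output: unbounded, not squashed to [0, 1]

y = forward(np.array([[0.2, 0.8]]))
```

With a sigmoid on the output instead, `y` could never leave (0, 1), which is why regression targets usually get a linear output layer.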
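
On the cost-function point: a quick numerical comparison (my own sketch, not from the linked article) of sum-of-squares versus cross-entropy on a single confidently-wrong prediction:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# true label 1, model predicts 0.01: confident and wrong
y_true = np.array([1.0])
y_pred = np.array([0.01])
squared_loss = mse(y_true, y_pred)          # ~0.98, bounded by 1
ce_loss = binary_cross_entropy(y_true, y_pred)  # ~4.6, grows without bound
```

Cross-entropy punishes confident wrong answers much more harshly than squared error, which is one reason it tends to train classifiers faster.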
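
On the variable-length question: frameworks offer masking layers, but the underlying idea can be sketched by hand with padding plus a boolean mask (`pad_and_mask` is a hypothetical helper, not a library function):

```python
import numpy as np

def pad_and_mask(sequences, pad_value=0.0):
    """Pad variable-length sequences to equal length and return a boolean
    mask marking the real (non-padded) timesteps."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), pad_value)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

batch, mask = pad_and_mask([[1.0, 2.0], [3.0, 4.0, 5.0]])
# masked mean: the padding contributes nothing to each sequence's average
means = (batch * mask).sum(axis=1) / mask.sum(axis=1)
```

Downstream layers use the mask the same way: multiply padded positions out of any sum or attention score so they can't influence the result.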

## Links

- Hacker's guide to Neural Networks - great writing that emphasises intuition over maths
- Machine Learning is Fun! [8 parts] - pleasant introductory reading
- How much training data do you need? - the rule of thumb for linear models is 10x the number of parameters in the model (probably more for NNs)