Analysis and prediction in sparse and high dimensional text data: The case of Dow Jones stock market

Sert O. C., Şahin S. D., Özyer T., Alhajj R.

Physica A: Statistical Mechanics and its Applications, vol.545, 2020 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 545
  • Publication Date: 2020
  • Doi Number: 10.1016/j.physa.2019.123752
  • Journal Name: Physica A: Statistical Mechanics and its Applications
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Arctic & Antarctic Regions, Compendex, INSPEC, Public Affairs Index, zbMATH, Civil Engineering Abstracts
  • Keywords: Named entity recognition, Topic modelling, Sentiment analysis, Social network analysis, Stock market movement prediction, Msaenet
  • Istanbul Medipol University Affiliated: Yes


In this research, we propose a text analysis system that predicts stock market movements from news and social media data. The system is designed to scale to sparse, high-dimensional feature sets. Using it, we collected 12,560 articles from The New York Times covering a one-year period and 2,854,333 tweets from Twitter covering a four-month period. We analysed the collected data using entity extraction, sentiment analysis and topic modelling, built feature sets from the results of these analyses, and trained different prediction models on them with an elastic net regression based training method. Using the trained models, we predicted movements of the Dow Jones Index and found that the proposed method makes promising predictions: across different sets of experiments it reached up to 70.90% accuracy, and the predicted values correlated with the real Dow Jones Index values with a correlation coefficient of up to 0.2315. We also report performance comparisons for prediction models trained with different sets of features, in order to analyse the importance of the time interval and of the feature space size. Our test results show that reasonable stock movement predictions are possible when news and related social media data are integrated, analysed with named entity extraction, sentiment analysis and topic modelling together, and used to create the features on which the prediction models are trained.
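
The abstract names elastic net regression as the training method but gives no implementation details. As a rough illustration only, the sketch below fits a plain elastic net by coordinate descent and scores the direction of its predictions. It is not the authors' pipeline: the keyword "Msaenet" suggests they used a multi-step adaptive variant, which this sketch does not implement. The function names, the penalty parameterisation (`alpha`, `l1_ratio`) and the four-row toy data set are all assumptions made up for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(rows, targets, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
                        + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    No intercept is fitted: inputs are assumed to be centred."""
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p
    residual = [float(t) for t in targets]  # y - Xw with w = 0
    col_sq = [sum(row[j] ** 2 for row in rows) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature never occurs (common in sparse text data)
            # Correlation of feature j with the residual that excludes j.
            rho = sum(row[j] * (res + row[j] * weights[j])
                      for row, res in zip(rows, residual)) / n
            new_w = (soft_threshold(rho, alpha * l1_ratio)
                     / (col_sq[j] + alpha * (1.0 - l1_ratio)))
            delta = new_w - weights[j]
            if delta != 0.0:
                for i, row in enumerate(rows):
                    residual[i] -= row[j] * delta
                weights[j] = new_w
    return weights


def predict(rows, weights):
    """Linear prediction for each row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def direction_accuracy(predicted, actual):
    """Fraction of cases where predicted and actual movement share a sign."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical toy data: feature 0 drives the target, feature 1 is
# uninformative, feature 2 never occurs.
toy_rows = [[1, 1, 0], [-1, 1, 0], [1, -1, 0], [-1, -1, 0]]
toy_targets = [2, -2, 2, -2]

toy_weights = fit_elastic_net(toy_rows, toy_targets)
toy_accuracy = direction_accuracy(predict(toy_rows, toy_weights), toy_targets)
```

On this toy data the L1 part of the penalty sets the weights of the uninformative and the absent feature to exactly zero, which is the property that makes elastic net attractive for sparse, high-dimensional feature sets; the L2 part shrinks the remaining weight slightly below its least-squares value of 2.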