In this project, I built a real-time data ingestion pipeline with Apache Kafka and Spark Streaming that collects and processes financial data from Yahoo Finance and Finnhub, analyzes it in Jupyter Notebook, and generates financial reports with Power BI.
- Data Sources: This project uses two main data sources: the Yahoo Finance API and the Finnhub Stock API.
  - Yahoo Finance API: data is collected from Yahoo Finance's API using the yfinance library, in real time with a 1-minute interval between data points; collected fields include indicators such as Open, Close, Volume, Datetime, etc. A collection sketch follows below.
  - Finnhub Stock API: data is collected from Finnhub's API in real time; collected fields include transaction indicators such as v (volume), p (last price), t (time), etc.
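A minimal sketch of the Yahoo Finance collection step, assuming the yfinance library and a simple polling loop; the ticker list and print-based output are illustrative, not the project's actual code.

```python
import time
import yfinance as yf

TICKERS = ["AAPL", "MSFT"]  # hypothetical watchlist

while True:
    for symbol in TICKERS:
        # Pull today's bars at a 1-minute interval, matching the
        # collection granularity described above.
        bars = yf.Ticker(symbol).history(period="1d", interval="1m")
        latest = bars.tail(1)  # newest bar: Open, High, Low, Close, Volume
        print(symbol, latest.index[-1], float(latest["Close"].iloc[-1]))
    time.sleep(60)  # poll once per minute to match the data interval
```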
- Extract Data: After being collected, the data is written to Kafka (Kafka Producer), with a separate topic for each data source, as sketched below.
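A minimal sketch of the Extract step, assuming the kafka-python client and a local broker; the topic names and record shapes are assumptions for illustration.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One topic per data source, as described above (topic names are assumed).
producer.send("yfinance_quotes",
              {"symbol": "AAPL", "open": 189.1, "close": 189.5,
               "volume": 120000, "datetime": "2024-01-02T15:30:00"})
producer.send("finnhub_trades",
              {"s": "AAPL", "p": 189.52, "v": 100, "t": 1704209400000})
producer.flush()  # block until buffered records are sent
```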
- Transform Data: Once data arrives in a Kafka topic, it is read with Spark Streaming (Kafka Consumer) and processed in real time. Spark is set up with 3 worker nodes, leveraging Spark's distributed nature for large-scale data processing.
- Load Data: As data is processed, it is loaded directly into the Cassandra database using Spark; a combined consumer/loader sketch follows below.
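A minimal sketch of the Transform and Load steps with Spark Structured Streaming, assuming the spark-sql-kafka and spark-cassandra-connector packages are on the classpath; the schema, keyspace, and table names are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType

spark = (SparkSession.builder
         .appName("finance-stream")
         .config("spark.cassandra.connection.host", "cassandra")
         .getOrCreate())

# Assumed record schema for the yfinance topic.
schema = (StructType()
          .add("symbol", StringType())
          .add("close", DoubleType()))

# Read the producer's topic as a stream (Kafka Consumer side).
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "yfinance_quotes")
       .load())

# Parse the JSON payload into typed columns.
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("d"))
          .select("d.*"))

def write_to_cassandra(batch_df, batch_id):
    # Load each micro-batch directly into Cassandra.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="finance", table="quotes")
     .save())

query = parsed.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```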
- Serving: Provide detailed insights, create financial reports with Power BI, and analyze investment performance to guide strategic decision-making and optimize portfolio management.
- Packaging and Orchestration: Components are packaged using Docker and orchestrated using Apache Airflow; a DAG sketch follows below.
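A minimal sketch of the Airflow orchestration, assuming the pipeline stages are invocable as shell commands inside the containers; the script paths, DAG id, and schedule are illustrative, not the project's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="finance_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    produce = BashOperator(
        task_id="run_producers",
        bash_command="python /opt/pipeline/producers.py",  # assumed path
    )
    stream = BashOperator(
        task_id="run_spark_stream",
        bash_command="spark-submit /opt/pipeline/stream_job.py",  # assumed path
    )
    produce >> stream  # producers start before the streaming job
```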
Tech stack: Yahoo Finance API, Finnhub Stock API, Apache Kafka, Apache Spark, Cassandra, Power BI, Jupyter Notebook, Apache Airflow, Docker.

